Reprinting is welcome; please credit the source:
http://blog.csdn.net/neighborhoodguo/article/details/47449257
This lecture is again given by a guest speaker, and it is about parallel computing. As the saying goes, "three cobblers with their wits combined equal one Zhuge Liang." Haha.
Because the processing power of a single computer or processor is limited, parallel computing can greatly increase computation speed and save time during debugging. Since our neural networks are so complex, and sometimes very large, parallel computing is necessary.
This talk is divided into five parts:
1. Efficient formulations
2. CPUs and GPUs
3. Parallelism
4. Asynchronous SGD
5. Easy implementations and current directions
Efficient formulations
Structured vs. unstructured computation
A structured graph is one where the connections between units are very regular, as in a CNN.
The advantage of this representation is that cache access is contiguous, it is easy to load, and it uses less memory. The disadvantage is that it is less flexible.
The other kind is an unstructured graph.
Its advantage is greater expressive power, but cache access is not contiguous, it is harder to load, and memory usage is high (the opposite of the above).
Our goal is to make the formulation more structured without compromising performance.
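As a minimal NumPy sketch of the contrast (the names here are illustrative, not from the lecture): the same layer computed edge by edge from an unstructured connection list, versus as one structured dense matrix-vector product.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 4, 3
W = rng.standard_normal((n_out, n_in))
x = rng.standard_normal(n_in)

# Unstructured view: the layer as an explicit list of (target, source, weight)
# connections, accumulated one edge at a time -- flexible, but memory access
# jumps around and nothing is vectorized.
edges = [(i, j, W[i, j]) for i in range(n_out) for j in range(n_in)]
h_unstructured = np.zeros(n_out)
for i, j, w in edges:
    h_unstructured[i] += w * x[j]

# Structured view: the same connections stored as one contiguous weight
# matrix, so the whole layer is a single dense matrix-vector product.
h_structured = W @ x

assert np.allclose(h_unstructured, h_structured)
```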
Block Operations and BLAS
One of the simplest examples of block operations is matrix multiplication and addition: similar operations are packed together into one large block and then computed in a single batch.
BLAS (Basic Linear Algebra Subprograms) is a highly optimized parallel computing library; other excellent parallel computing tools were also recommended in class.
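A sketch of why block operations help, assuming NumPy is linked against a BLAS implementation such as MKL or ATLAS: many separate vector-matrix products are packed into one matrix-matrix product, which BLAS executes as a single optimized call.

```python
import numpy as np

rng = np.random.default_rng(1)
n_examples, n_in, n_out = 128, 300, 100
W = rng.standard_normal((n_in, n_out))
xs = rng.standard_normal((n_examples, n_in))

# Many separate calls: one small vector-matrix product per example.
outs_loop = np.stack([x @ W for x in xs])

# Block operation: pack the examples into one matrix and make a single
# matrix-matrix (GEMM) call; BLAS handles blocking, caching and threading.
outs_block = xs @ W

assert np.allclose(outs_loop, outs_block)
```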
Batching
Mini-batch gradient descent was covered earlier, so it will not be repeated here.
CPUs and GPUs
The lecturer said that CPU and GPU performance has peaked.
Memory size is very limited, and the slow communication between CPU and GPU is a bottleneck.
CPUs have fewer cores, but each core is faster.
GPUs have many more cores, but each individual core is slower.
Still, the GPU has several advantages: taken as a whole, GPU computation is faster than CPU computation.
At first glance it looks as if one should just use the GPU for everything.
But because of the communication bottleneck, the CPU actually has the advantage when the amount of computation is small, while the GPU has a clear advantage when the amount of computation is large.
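A rough timing sketch of this tradeoff, assuming PyTorch and a CUDA-capable GPU are available (the sizes 64 and 4096 are arbitrary illustrations): the small product is dominated by launch and transfer overhead, the large one by raw throughput.

```python
import time
import torch

def avg_matmul_time(n, device, repeats=10):
    """Average time for an n x n matrix product on the given device."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device.type == "cuda":
        torch.cuda.synchronize()          # finish any pending GPU work first
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    if device.type == "cuda":
        torch.cuda.synchronize()          # wait for the queued products
    return (time.perf_counter() - start) / repeats

if torch.cuda.is_available():
    for n in (64, 4096):                  # small vs. large workload
        cpu_t = avg_matmul_time(n, torch.device("cpu"))
        gpu_t = avg_matmul_time(n, torch.device("cuda"))
        print(f"n={n}: cpu {cpu_t:.6f}s  gpu {gpu_t:.6f}s")
```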
Data parallelism
This is used to speed up the mini-batch gradient descent described earlier:
1. First designate one master core and several worker cores; the master assigns a share of the computation to each worker.
2. Each worker core then does its computation independently.
3. When the computation is done, the results are gathered at the master, which computes the final result.
The parallelism here is synchronous.
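A minimal synchronous data-parallel sketch using Python's multiprocessing, with an illustrative linear-regression gradient standing in for the network's gradient (all names are made up for the example): the master splits each mini-batch, the workers compute gradients on their shards, and the master averages them before updating.

```python
import numpy as np
from multiprocessing import Pool

def worker_grad(args):
    """One worker computes the gradient on its shard of the mini-batch."""
    w, X_shard, y_shard = args
    err = X_shard @ w - y_shard
    return X_shard.T @ err / len(y_shard)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X, y = rng.standard_normal((512, 10)), rng.standard_normal(512)
    w, lr, n_workers = np.zeros(10), 0.1, 4

    with Pool(n_workers) as pool:
        for step in range(20):
            # The master splits the mini-batch and hands each worker a shard.
            shards = [(w, Xs, ys)
                      for Xs, ys in zip(np.array_split(X, n_workers),
                                        np.array_split(y, n_workers))]
            # Synchronous: wait for every worker, then combine at the master.
            grads = pool.map(worker_grad, shards)
            w = w - lr * np.mean(grads, axis=0)
```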
Model parallelism
Here the model itself is split into blocks; each block is assigned to a different core for computation, and the results are then combined.
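A toy sketch of the idea: the weight matrix of one layer is split into column blocks, each block's share of the output is computed separately (as it would be on separate cores), and the pieces are concatenated.

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, n_out, n_cores = 8, 6, 3
W = rng.standard_normal((n_in, n_out))
x = rng.standard_normal(n_in)

# Model parallelism: split the layer's weight matrix into column blocks,
# one block per core; each core computes its slice of the output.
blocks = np.array_split(W, n_cores, axis=1)
partial_outputs = [x @ block for block in blocks]   # would run on separate cores

# The partial results are then gathered and concatenated into the full output.
h = np.concatenate(partial_outputs)
assert np.allclose(h, x @ W)
```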
The computing power of a single computer is limited; can several computers be used to help with the computation at the same time? They can, but Ethernet communication between machines is too slow, so faster interconnects between computers need to be developed.
Asynchronous SGD
The synchronous scheme described above has to wait for every worker core to finish before the results can be combined, so part of the time is spent waiting.
To address this, asynchronous SGD was proposed.
Tasks are assigned just as before, but whichever worker finishes first uploads its result to the master, and as soon as the master has folded it in, the updated parameters are immediately sent back to that worker core.
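A simplified, Hogwild-style sketch of the idea using threads that read and update a shared parameter vector without waiting for one another; this stands in for the master/worker exchange and is not the exact scheme from the lecture.

```python
import numpy as np
import threading

rng = np.random.default_rng(4)
X, y = rng.standard_normal((400, 10)), rng.standard_normal(400)
w = np.zeros(10)     # shared parameters, playing the role of the master copy
lr = 0.01

def async_worker(X_shard, y_shard):
    """Push an update as soon as one example's gradient is ready,
    reading whatever parameter values are currently shared -- no waiting."""
    global w
    for xi, yi in zip(X_shard, y_shard):
        grad = (xi @ w - yi) * xi        # gradient on a single example
        w = w - lr * grad                # update without waiting for other workers

threads = [threading.Thread(target=async_worker, args=(Xs, ys))
           for Xs, ys in zip(np.array_split(X, 4), np.array_split(y, 4))]
for t in threads:
    t.start()
for t in threads:
    t.join()
```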
Directions
There are three areas that can be improved:
1. Modify the model to minimize the unstructured parts and enlarge the structured parts; increase the width of the model as much as possible and reduce its depth.
2. Try to keep the neurons from saturating, so that activations stay in the near-linear region as much as possible (see the sketch after this list).
3. Find better optimization methods.
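A tiny sketch of point 2: the gradient of tanh is close to 1 in its near-linear region and close to 0 once the unit saturates, which is why keeping pre-activations small helps learning.

```python
import numpy as np

def tanh_grad(z):
    """Derivative of tanh: near 1 in the linear region, near 0 when saturated."""
    return 1.0 - np.tanh(z) ** 2

for z in (0.1, 1.0, 5.0):
    print(f"pre-activation {z:>4}: tanh' = {tanh_grad(z):.4f}")
# Small pre-activations keep the unit near its linear region, where gradients
# flow; large ones saturate it (tanh' ~ 0) and learning stalls.
```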
Several open-source parallelism packages
1. BLAS
2. For CPUs: Intel MKL, ATLAS, GotoBLAS; for GPUs: CUDA, OpenACC, clBLAS
3. Theano, Torch
Copyright notice: this is the blogger's original article; please do not reproduce it without the blogger's permission.
CS224D Lecture 15 Notes