I've been focusing on CNN implementations for a while, reading the code of Caffe and cuda-convnet2. At the moment I'm most interested in single-machine multi-GPU training, so I've paid special attention to cuda-convnet2's multi-GPU support.
The cuda-convnet2 project is published on Google Code: cuda-convnet2.
The key paper on its multi-GPU scheme is: One weird trick for parallelizing convolutional neural Networks.
This post gives an analysis of that paper.
1. Introduction
The paper presents a way to parallelize SGD training of convolutional neural networks. It considers two variants of parallelism:
- model parallelism: different workers train different parts of the model. This works best for layers where the computation per neuron activity is high, because it is the neuron activities that must be communicated.
- data parallelism: different workers train on different data examples. This works best for layers where the computation per weight is high, because it is the weights that must be communicated.
2. Observation
A modern convolutional neural network consists of two kinds of layers with very different properties:
- convolutional layers account for 90% to 95% of the computation and only about 5% of the parameters, yet provide most of the representational power.
- fully connected layers account for 5% to 10% of the computation and about 95% of the parameters, yet contribute relatively little representational power.
In summary: convolutional layers are computation-heavy but have few weights, while fully connected layers are computation-light but have many weights. Data parallelism therefore suits the convolutional layers, and model parallelism suits the fully connected layers.
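As a rough sanity check of these percentages, here is my own back-of-the-envelope count with approximate AlexNet-like layer shapes (biases and filter grouping ignored; the shapes are assumptions, not numbers taken from the paper):

```python
def conv_cost(c_in, c_out, k, out_h, out_w):
    # parameters of one conv layer and multiply-adds for one image
    params = c_in * c_out * k * k
    macs = params * out_h * out_w
    return params, macs

def fc_cost(n_in, n_out):
    params = n_in * n_out
    return params, params  # one multiply-add per weight per image

conv_layers = [conv_cost(3, 96, 11, 55, 55),
               conv_cost(96, 256, 5, 27, 27),
               conv_cost(256, 384, 3, 13, 13),
               conv_cost(384, 384, 3, 13, 13),
               conv_cost(384, 256, 3, 13, 13)]
fc_layers = [fc_cost(256 * 6 * 6, 4096),
             fc_cost(4096, 4096),
             fc_cost(4096, 1000)]

conv_params, conv_macs = map(sum, zip(*conv_layers))
fc_params, fc_macs = map(sum, zip(*fc_layers))

print(f"conv: {conv_params / 1e6:.1f}M params, {conv_macs / 1e9:.2f}G MACs per image")
print(f"fc:   {fc_params / 1e6:.1f}M params, {fc_macs / 1e9:.2f}G MACs per image")
```

With these shapes the convolutional layers hold roughly 6% of the weights but do roughly 95% of the multiply-adds, which matches the observation above.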
3. Proposed Algorithm
Forward propagation
- Each of the K workers is given a different data batch of 128 examples, so no two workers see the same examples.
- Each worker computes all of the convolutional layers on its own batch (the convolutional layers run sequentially on each worker).
- The computation of the fully connected layers can then be organized in one of three ways:
- (a) Each worker sends its last-stage convolutional layer activities to all of the other workers. The workers assemble the resulting 128K examples into one big batch and then compute the fully connected layers on that batch.
- (b) The first worker sends its last-stage convolutional layer activities to all of the workers, which compute the fully connected layers on this batch of 128 examples and begin the backward pass. In parallel with this computation, the next worker sends its convolutional layer activities to all workers, so activity transfer is pipelined with computation.
- (c) All of the workers send 128/K of their last-stage convolutional layer activities to all of the other workers, and computation then proceeds as in (b).
The three fully connected layer schemes (a)-(c) can be analyzed as follows:
(a) All useful work must pause while each worker assembles the big batch of 128K images. Such a big batch also consumes a lot of GPU memory, which is undesirable on devices with limited memory. On the other hand, big batches are good for GPU utilization.
(b) The workers take turns broadcasting their activities to all of the other workers. The most important consequence of this round-robin execution is that most of the communication can be hidden, because each transfer overlaps with the computation on the previously received batch; K-1 of the K transfers can be hidden this way. This pipelining of communication behind computation is very significant and gives good parallel efficiency (see the sketch after this list).
(c) Similar to scheme (b). One advantage is that its ratio of communication to computation is constant in K, whereas for (a) and (b) the ratio is proportional to K. This is because schemes (a) and (b) are bottlenecked by the outbound bandwidth of the single worker that is sending at any given step, while scheme (c) can use all of the workers for the transfer, which is a big advantage for large K.
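A minimal sketch of scheme (b) from a single worker's point of view (my own illustration, run serially; the shapes and names are made up, and the comments mark where the transfer would overlap with computation on real hardware):

```python
import numpy as np

# Toy serial walk-through of scheme (b). Each of K workers broadcasts its
# 128-example conv activities in turn; on real hardware the transfer of
# batch i+1 would overlap with the fully connected computation on batch i,
# so K-1 of the K transfers are hidden behind computation.

K, batch, feat, hidden = 4, 128, 1024, 512
rng = np.random.default_rng(0)
conv_out = [rng.standard_normal((batch, feat)) for _ in range(K)]  # per-worker conv activities
fc_w = rng.standard_normal((feat, hidden)) * 0.01                  # this worker's slice of the model-parallel FC weights

for i in range(K):
    acts = conv_out[i]                       # "broadcast": worker i's 128-example batch arrives everywhere
    # <- on real hardware, the transfer of conv_out[i + 1] is started here, asynchronously
    fc_acts = np.maximum(acts @ fc_w, 0.0)   # FC forward on the 128-example batch
    # ... the FC backward pass for batch i starts right away (see the backward section below)
```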
Backward propagation
- The workers compute the gradients of the fully connected layers in the usual way.
- Depending on which scheme was used in the forward pass, there are three cases:
- (a) Each worker has computed activity gradients for the whole batch of 128K examples, so each worker must send the gradient of every example back to the worker that generated that example in the forward pass. After that, backpropagation through the convolutional layers proceeds in the usual way.
- (b) Each worker has computed activity gradients for one batch of 128 examples; it then sends those gradients to the worker that owns that batch. (In parallel with this transfer, the next batch of 128 can already be processed.) After K such forward/backward passes through the fully connected layers, all gradients have been propagated back to the convolutional layers.
- (c) Similar to (b). Each worker computes the gradients for a 128-example batch, but that batch was assembled from 128/K examples contributed by each worker, so each gradient must be routed back to the worker that owns the corresponding examples (sketched below).
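A minimal sketch of this routing for scheme (c), assuming the batch was assembled in worker order (my own illustration with stand-in shapes, not code from cuda-convnet2):

```python
import numpy as np

# The 128-example FC batch in scheme (c) was assembled from 128/K examples
# per worker, so the activity gradients are split back into K chunks and
# returned to the worker that contributed each chunk.

K, batch, feat = 4, 128, 1024
rng = np.random.default_rng(1)
grad_acts = rng.standard_normal((batch, feat))  # d(loss)/d(conv activities) for the assembled batch

chunk = batch // K
to_owner = {w: grad_acts[w * chunk:(w + 1) * chunk] for w in range(K)}  # chunk w goes back to worker w
assert all(g.shape == (chunk, feat) for g in to_owner.values())
```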
Weight synchronization
Once backpropagation is complete, the workers can update the weights. For the convolutional layers, the workers must also synchronize the weights (or weight gradients) with one another. The simplest way to do this is:
- Each worker is assigned 1/K of the gradient matrix to synchronize.
- Each worker accumulates its 1/K of the gradient from all of the other workers.
- Each worker broadcasts its accumulated 1/K of the gradient to all of the other workers.
For the convolutional layers this is cheap, because they have few weights; the three steps amount to a reduce-scatter followed by an all-gather, as sketched below.
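A minimal numpy simulation of the three steps above (my own illustration; in cuda-convnet2 this happens between GPUs, whereas here plain arrays stand in for the per-GPU gradient buffers):

```python
import numpy as np

# Simulate the three-step gradient synchronization with K workers.
K = 4
rng = np.random.default_rng(2)
local_grads = [rng.standard_normal(8) for _ in range(K)]   # each worker's conv-layer gradient

# 1) assign 1/K of the gradient vector to each worker
chunks = [np.array_split(g, K) for g in local_grads]

# 2) each worker accumulates its assigned 1/K from every other worker (reduce-scatter)
reduced = [sum(chunks[w][i] for w in range(K)) for i in range(K)]

# 3) each worker broadcasts its reduced 1/K; everyone ends with the full sum (all-gather)
synced = np.concatenate(reduced)
assert np.allclose(synced, sum(local_grads))
```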
Variable batch size
For schemes (b) and (c) there is a small change to the operation of synchronous SGD on 128K examples relative to scheme (a), which is just the standard forward-backward propagation. Schemes (b) and (c) perform K forward and backward passes through the fully connected layers, each on a different batch of 128 examples. This means that, if we wish, we can update the fully connected layers' weights after each of these partial backward passes at no extra computational cost. We can therefore use an effective batch size of 128 in the fully connected layers and 128K in the convolutional layers. With this variable batch size the algorithm is no longer a pure parallelization of SGD, because the gradients reaching the convolutional layers are no longer computed with respect to a single consistent model, but it turns out that in practice this does not matter. In practice, with effective batch sizes of 128K up to several thousand examples, using the smaller batch size in the fully connected layers converges faster and to a better minimum.
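A minimal sketch of this variable batch size schedule, assuming plain SGD and stand-in gradient vectors (my own illustration, not code from the paper or cuda-convnet2):

```python
import numpy as np

# FC weights are updated after each of the K partial backward passes
# (effective batch 128), while conv gradients accumulate and the conv
# weights are updated once per full pass (effective batch 128*K).

K, lr = 4, 0.01
rng = np.random.default_rng(3)
w_fc = rng.standard_normal(16)          # stand-in for the FC weights
w_conv = rng.standard_normal(16)        # stand-in for the conv weights
conv_grad_accum = np.zeros_like(w_conv)

for i in range(K):                      # K forward/backward passes through the FC layers
    g_fc = rng.standard_normal(16)      # FC gradient from the i-th 128-example batch
    w_fc -= lr * g_fc                   # immediate FC update: batch size 128
    conv_grad_accum += rng.standard_normal(16)  # conv gradients keep accumulating

w_conv -= lr * conv_grad_accum / K      # one conv update per 128*K examples
```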
Example diagram
The figure shows the forward and backward propagation of scheme (b) with K = 2, i.e. two workers.
The three convolutional layers use two-way data parallelism, and the two fully connected layers use two-way model parallelism. The data-parallel and model-parallel passes of the two workers break down into six steps (a toy end-to-end sketch follows the list):
- Step 1: The convolutional layers run the forward pass with data parallelism (the 3 convolutional layers execute sequentially, while the 2 workers process their own batches in parallel).
- Step 2: The shaded portion of the convolutional output (worker 1's batch) is sent to both halves of the fully connected layers, which run the forward pass with model parallelism (the 2 fully connected layers execute sequentially; FC11 and FC12 run in parallel, then FC21 and FC22 run in parallel).
- Step 3: The fully connected layers run the backward pass and send the resulting gradients back to worker 1's convolutional layers.
- Step 4: As in Step 2, worker 2's convolutional output is sent to the fully connected layers for the forward pass.
- Step 5: As in Step 3, the fully connected layers run the backward pass and the gradients are returned to worker 2's convolutional layers.
- Step 6: The convolutional layers complete their backward pass.
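To tie the six steps together, here is a toy single-process numpy sketch of one training step for the K = 2 example (my own illustration, not cuda-convnet2 code): the "conv" and "FC" layers are plain matrix multiplies with ReLU, and the upstream FC gradient is random.

```python
import numpy as np

# Toy end-to-end pass for scheme (b) with K = 2: data-parallel "conv" layer,
# model-parallel "FC" slices, following Steps 1-6. All shapes are made up.

K, batch, feat, hidden, lr = 2, 128, 512, 256, 0.01
rng = np.random.default_rng(4)
conv_w = rng.standard_normal((feat, feat)) * 0.01                      # replicated conv weights (data parallel)
fc_w = [rng.standard_normal((feat, hidden)) * 0.01 for _ in range(K)]  # each worker owns one FC slice
x = [rng.standard_normal((batch, feat)) for _ in range(K)]             # one 128-example batch per worker

# Step 1: conv forward, data parallel (each worker on its own batch)
conv_out = [np.maximum(xi @ conv_w, 0.0) for xi in x]

conv_grads = []
for i in range(K):                       # i = 0 covers Steps 2-3, i = 1 covers Steps 4-5
    # Step 2/4: worker i's conv output is sent to both FC slices (model-parallel forward)
    fc_out = [np.maximum(conv_out[i] @ fc_w[s], 0.0) for s in range(K)]
    # Step 3/5: FC backward with a made-up upstream gradient; the conv-activity
    # gradient is returned to worker i, and each FC slice updates its own weights
    g_up = [rng.standard_normal(o.shape) * (o > 0) for o in fc_out]
    conv_grads.append(sum(g_up[s] @ fc_w[s].T for s in range(K)))
    for s in range(K):
        fc_w[s] -= lr * conv_out[i].T @ g_up[s] / batch                # FC update per 128-example batch

# Step 6: conv backward; the conv gradient is averaged over both workers (synchronized update)
conv_w_grad = sum(x[i].T @ (conv_grads[i] * (conv_out[i] > 0)) for i in range(K)) / (K * batch)
conv_w -= lr * conv_w_grad
```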