"Editor's note" Deep convolution neural network has a wide range of application scenarios, in this paper, the deep convolution neural network deep CNNs multi-GPU model parallel and data parallel framework for the detailed sharing, through a number of worker group to achieve data parallelism, the same worker Multiple worker implementation models in a group are parallel. In the framework, the three-stage parallel pipelined I/O and CPU processing time are implemented, the model parallel engine is designed and implemented, the execution efficiency of the model parallel computation is improved, and the data storage access efficiency is solved by transmits layer. This framework significantly improves the depth convolution neural network training speed and solves the problem of training large model under the current hardware condition.
The following is the original text:
Applying deep convolutional neural networks (CNNs) to image recognition is attracting more and more attention in the research community. Because the structure of a convolutional neural network is well suited to model-parallel training, accelerating deep CNN training with a combination of model parallelism and data parallelism can be expected to yield substantial gains. The multi-GPU model-parallel and data-parallel framework for deep CNNs is part of Tencent's deep learning platform. The platform's technology team implemented model-parallel and data-parallel techniques to accelerate deep CNN training, confirmed that model splitting effectively reduces the memory footprint on a single GPU, obtained significant gains on the speedup metric, and made it possible to train larger deep convolutional neural networks at a faster rate to improve model accuracy.
1. Introduction to CNN Model Parallelism
1.1. Typical application analysis: Image recognition is a typical application in which deep convolutional neural networks have been successful. Figure 1 shows a deep convolutional neural network with 5 convolutional layers and 3 fully connected layers that can be applied to image classification.
Training deep convolutional neural networks on GPUs achieves good results [1][2]. Since a deep CNN model achieved a breakthrough in the ImageNet image classification challenge in 2012, the best classification results in 2013 were also obtained by deep CNN models. Based on this, the Tencent deep learning platform technology team expects to introduce deep CNNs to solve or optimize image classification and image feature extraction problems, and thereby improve results in the corresponding use cases.
1.2. Problems with the existing system: In applying CNNs to image-related algorithms and in operating the CNN training platform, we are limited by the amount of video memory on a single GPU (for example, the Tesla K20c cards in our servers have about 4.8 GB of usable memory, while the network in the ImageNet 2012 paper [1] consumes about 3.9 GB). In experiments that adjust parameters and network scale, it is difficult to fit a larger deep convolutional neural network model, so networks with more parameters cannot be trained on a single GPU. This has to be solved with multi-GPU model parallelism, splitting the model across multiple GPUs for storage and training.
As the training dataset grows and the model becomes more complex, there is a serious performance shortfall even with GPU acceleration: experiments usually take more than 10 days to reach model convergence, which cannot meet the demand of training large-scale networks and running more experiments.
Considering the above problems, the deep CNN multi-GPU parallel training framework in Tencent's deep learning platform designs the model splitting method, the model-parallel execution engine, and the Transfer Layer for optimizing memory access performance, absorbs design experience from data parallelism, and implements multi-GPU-accelerated model-parallel and data-parallel versions.
This article describes the model-parallel and data-parallel implementation of a multi-GPU-accelerated deep convolutional neural network training system and its performance optimizations. Relying on the powerful cooperative parallel computing capability of multiple GPUs, combined with the parallel characteristics of the target deep CNN model during training, the system achieves fast and efficient deep convolutional neural network training.
1.3. Framework design goals: Multi-GPU model parallelism plus data parallelism is expected to achieve the following goals: make full use of the parallel characteristics of the deep CNN model and combine them with the data parallelism of SGD (stochastic gradient descent) to speed up model training; break the memory size limit so that models larger than a single GPU's video memory can be trained; and obtain better model quality by training more complex networks.
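Below is a minimal sketch, not the platform's actual code, of the data-parallel side of mini-batch SGD referred to above: each worker group holds a full model replica, computes gradients on its own shard of the mini-batch, the gradients are averaged across groups, and every replica applies the same update so all copies stay in sync. The names and the placeholder gradient routine are illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

struct Replica {
    std::vector<float> params;  // one full copy of the model per worker group
    std::vector<float> grads;   // gradients from this group's data shard
};

// Placeholder for the forward/backward pass on one data shard.
void compute_gradients(Replica& r, const std::vector<float>& shard) {
    for (size_t i = 0; i < r.grads.size(); ++i)
        r.grads[i] = 0.0f;      // real code would fill real gradients here
    (void)shard;
}

void data_parallel_sgd_step(std::vector<Replica>& groups,
                            const std::vector<std::vector<float>>& shards,
                            float lr) {
    // 1. Each worker group works on its own part of the mini-batch.
    for (size_t g = 0; g < groups.size(); ++g)
        compute_gradients(groups[g], shards[g]);

    // 2. Average the gradients over all worker groups.
    std::vector<float> avg(groups[0].grads.size(), 0.0f);
    for (const auto& grp : groups)
        for (size_t i = 0; i < avg.size(); ++i)
            avg[i] += grp.grads[i] / groups.size();

    // 3. The identical SGD update on every replica keeps the copies consistent.
    for (auto& grp : groups)
        for (size_t i = 0; i < avg.size(); ++i)
            grp.params[i] -= lr * avg[i];
}
```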
Once these goals are achieved, the system can train the target deep CNN model in Figure 1 more quickly. The model can be split across different GPUs to reduce the memory footprint on each single GPU, which makes the framework suitable for training deeper convolutional neural networks with more parameters.
1.4. Challenges: In image recognition applications, the convolutional layers of a deep convolutional neural network model are computation-heavy, while the fully connected layers are parameter-heavy. How to divide the computing resources and accelerate training at the two data/computation organization levels of model parallelism and data parallelism is therefore the first problem to be solved in the framework design.
Images as input data are large in volume and require a preprocessing step, so disk I/O and data preprocessing also consume a certain amount of time during batch training. The classical method of hiding I/O time behind computation time is to introduce pipelining. How to design an effective pipelining method that hides both I/O time and CPU processing time, so that the total elapsed time depends only on the actual GPU training time, is therefore an important problem.
Model parallelism is a parallel method of executing the computation of a complete deep CNN network. Scheduling the parallelizable parts of the model onto the available parallel resources in a reasonable way, so as to obtain the model-parallel speedup, is the key step in realizing model parallelism.
Through UVA (Unified Virtual Addressing), a multi-GPU system allows one GPU to access another GPU's device memory (that is, its video memory) during kernel computation, but because remote device memory access is much slower than local access, the actual performance is poor. Therefore, when data is accessed across GPUs, attention must be paid to using device-to-device data copies efficiently so that all data needed for a computation is localized.
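Below is a minimal sketch, assuming CUDA peer-to-peer (P2P) support, of localizing data before a kernel runs on it: instead of letting the kernel read a remote GPU's memory through UVA, the data is copied to the local GPU first. The function and buffer names are illustrative.

```cpp
#include <cuda_runtime.h>

void localize_and_compute(int local_dev, int remote_dev,
                          float* local_buf, const float* remote_buf,
                          size_t bytes, cudaStream_t stream) {
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, local_dev, remote_dev);

    cudaSetDevice(local_dev);
    if (can_access) {
        // Enable P2P once; subsequent copies go directly between devices.
        cudaDeviceEnablePeerAccess(remote_dev, 0);
    }
    // Copy the remote data into local device memory (the runtime falls back
    // to a staged copy through host memory when P2P is unavailable).
    cudaMemcpyPeerAsync(local_buf, local_dev, remote_buf, remote_dev,
                        bytes, stream);
    cudaStreamSynchronize(stream);

    // ... launch the kernel on local_buf here; all of its inputs are now local.
}
```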
2. System overview: how is the model parallelized?
Model parallelism means splitting the model appropriately across different computing units and exploiting task parallelism so that the model as a whole is executed in parallel.
As shown in Figure 2, the difference between single-GPU training and multi-GPU model-parallel training is that in single-GPU training the model is not split and is stored entirely in that GPU's memory, whereas in the model-parallel scenario the model is split and stored across multiple GPUs. During training, each GPU therefore actually trains only part of the model, and the execution engine schedules the workers in a worker group to complete the training of the whole model.
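Below is a minimal sketch, assuming CUDA, of what "splitting the model across GPUs" means in memory terms: each GPU in a worker group allocates only its own partition of the parameters, so the per-GPU footprint shrinks as workers are added. The model size and equal-chunk partition policy are illustrative assumptions, not taken from the article.

```cpp
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

int main() {
    const size_t total_params = 60 * 1000 * 1000;       // illustrative model size
    const int gpus_per_group  = 2;                       // workers in one group
    const size_t part = total_params / gpus_per_group;   // parameters per GPU

    std::vector<float*> shards(gpus_per_group, nullptr);
    for (int g = 0; g < gpus_per_group; ++g) {
        cudaSetDevice(g);                                // switch to GPU g
        cudaMalloc(&shards[g], part * sizeof(float));    // only this GPU's share
        std::printf("GPU %d stores %zu of %zu parameters\n", g, part, total_params);
    }
    for (int g = 0; g < gpus_per_group; ++g) {
        cudaSetDevice(g);
        cudaFree(shards[g]);
    }
    return 0;
}
```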
Functionally, the multi-GPU parallel system is divided into the training data dispatcher, which reads and distributes data, and the GPU workers, which carry out model-parallel training, as shown in Figure 3. Training data is read from disk files into CPU main memory and then copied to GPU memory, so the system is designed so that while each worker computes on one batch of data, the training data dispatcher reads and distributes the next batch from file, achieving the design goal of hiding I/O time behind computation time.
3. Parallel acceleration of training data processing: Training is based on mini-batches. In the existing scheme, one batch of data is read and processed from the data file at a time; while the GPU computes on one batch, the CPU preprocesses the next batch. However, as the number of pixels per training image increases, reading and preprocessing time grows, and because multi-GPU technology shortens the computation time of a single batch, data processing becomes a performance bottleneck. Data processing time therefore needs to be reduced so that the final speedup is determined by the computation alone.
As shown in Figure 4, the deep convolutional neural network training process as a whole runs at all times as a three-stage parallel pipeline: compute the current batch, preprocess the next batch, and read the batch after that.
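Below is a minimal sketch, assuming a std::thread-based design rather than the platform's actual implementation, of such a three-stage pipeline: a reader thread loads batches from disk, a preprocessing thread prepares them on the CPU, and the main (GPU) stage trains on the batch that is already prepared. Single-slot queues keep the stages in lock step, so the elapsed time per batch is dominated by the slowest stage, ideally the GPU computation.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <thread>
#include <vector>
#include <cstdio>

using Batch = std::vector<float>;

// A one-slot blocking queue connecting two pipeline stages.
class Slot {
    std::optional<Batch> item_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void put(Batch b) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !item_.has_value(); });
        item_ = std::move(b);
        cv_.notify_all();
    }
    Batch take() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return item_.has_value(); });
        Batch b = std::move(*item_);
        item_.reset();
        cv_.notify_all();
        return b;
    }
};

int main() {
    const int num_batches = 8;
    Slot read_to_prep, prep_to_train;

    // Stage 1: read raw batches from disk (simulated here).
    std::thread reader([&] {
        for (int i = 0; i < num_batches; ++i)
            read_to_prep.put(Batch(256, float(i)));    // pretend disk read
    });
    // Stage 2: CPU preprocessing (cropping, mean subtraction, ...).
    std::thread preproc([&] {
        for (int i = 0; i < num_batches; ++i) {
            Batch b = read_to_prep.take();
            for (float& x : b) x -= 0.5f;              // pretend preprocessing
            prep_to_train.put(std::move(b));
        }
    });
    // Stage 3: GPU training on the current batch (simulated).
    for (int i = 0; i < num_batches; ++i) {
        Batch b = prep_to_train.take();
        std::printf("training on batch %d (%zu samples)\n", i, b.size());
    }
    reader.join();
    preproc.join();
    return 0;
}
```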
4. GPU workers: the carriers of model parallelism. Data parallelism takes the worker group as its basic unit of organization, while model parallelism takes the workers within a worker group as its basic unit. The scheduling resources for parallel training come from CPU threads, and the computing resources come from GPU cards. Since GPU cards are generally regarded as accelerator or coprocessor cards that must be invoked in the context of a CPU-based host, binding one CPU thread to one GPU card makes it possible to exploit the parallel performance of multiple GPUs participating in the computation.
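Below is a minimal sketch, assuming CUDA and std::thread, of the "one CPU thread binds one GPU card" rule: each worker thread calls cudaSetDevice() once and then issues all of its work to that device, so all GPUs compute in parallel under the control of their own host threads.

```cpp
#include <cuda_runtime.h>
#include <thread>
#include <vector>
#include <cstdio>

void worker(int gpu_id) {
    cudaSetDevice(gpu_id);                 // bind this CPU thread to one GPU
    float* buf = nullptr;
    cudaMalloc(&buf, 1 << 20);             // allocations and kernels now target gpu_id
    std::printf("worker thread bound to GPU %d\n", gpu_id);
    // ... this worker's share of the training work would be issued here ...
    cudaFree(buf);
}

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    std::vector<std::thread> workers;
    for (int g = 0; g < n; ++g)
        workers.emplace_back(worker, g);   // one CPU thread per GPU card
    for (auto& t : workers) t.join();
    return 0;
}
```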
In a real-world production environment, the hardware architecture of a server with multiple GPUs installed is shown in Figure 5. The example shows the hardware configuration of a server node with 8 GPUs: every two GPU slots are attached to one GPU-dedicated PCI slot, and through PCIe switches GPU slots 0, 1, 2, 3 are connected to one CPU and GPU slots 4, 5, 6, 7 to the other CPU, with the two CPUs connected through the IOH (Input Output Hub).
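Below is a minimal sketch, assuming CUDA, of discovering this topology from software by printing which GPU pairs can perform direct peer-to-peer access; typically pairs under the same CPU/PCIe root complex can, while pairs separated by the IOH cannot.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    std::printf("P2P access matrix for %d GPUs:\n", n);
    for (int a = 0; a < n; ++a) {
        for (int b = 0; b < n; ++b) {
            int ok = 0;
            if (a != b) cudaDeviceCanAccessPeer(&ok, a, b);
            std::printf(" %d", ok);   // 1 = GPU a can access GPU b's memory directly
        }
        std::printf("\n");
    }
    return 0;
}
```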