Cross-platform Caffe: I/O Model and Parallel Scheme (Part 4)

4. Caffe Multi-GPU Parallel Scheme

4.1 Multi-GPU Parallelism Overview

Thanks to the explosive growth of training data and the tremendous increase in computational performance, deep learning algorithms can learn the distribution of data and hierarchical feature representations, and thus better solve pattern analysis and classification tasks. Faced with huge data scale and complex deep learning models, the current mainstream single-GPU training mode cannot meet the requirements for computing performance and storage space, so multi-GPU model training has become the trend. This section introduces the multi-GPU parallel modes and training methods of deep learning systems and summarizes the multi-GPU data parallel scheme in the Caffe source code.

Parallelism in deep learning falls into two categories, data parallelism and model parallelism, both introduced by Jeff Dean, Andrew Ng, et al. in the paper on DistBelief, a CPU-cluster framework for deep learning [1]. Data parallelism divides the training data into multiple shards; each shard is trained by its own model instance, and the gradients from the model instances are then merged to update the model. Model parallelism divides the model into multiple shards, each held on a single server, and all shards cooperate to train on one piece of training data. Caffe implements single-machine multi-GPU data parallelism: the I/O modules pre-buffer batch data for each GPU, and training proceeds with a synchronous stochastic gradient descent algorithm. In data parallel training, each GPU card holds a full copy of the model, computes the gradients on its own batch, and then exchanges parameters with the other GPUs.

The main task of multi-GPU data parallel training is to maintain the consistency of the global model parameters while updating them through the optimization algorithm, so that the model finally converges with high resource utilization and low latency. In this process, the bottleneck that limits parallel training speed is parameter exchange. Logically, the parameter exchange process first accumulates all gradient values in a gradient collection phase, then applies them to the current model to obtain updated parameters, and finally sends the new model to all GPU cards in a model distribution phase. The key problem with data parallelism is that parameter exchange introduces extra time overhead and slows down parallel performance, which makes it hard to improve the speedup ratio relative to single-GPU training. The overall performance of the system can be improved by carefully designing the topology of multi-GPU parameter exchange. Caffe uses a tree-shaped topology for accumulating gradients and distributing model parameters. To guarantee consistency of the model parameters, all GPUs train one batch of data at the same time, wait at a synchronization barrier after finishing, and then exchange parameters together.
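Written out, this synchronous update amounts to the following. This is a generic formulation for illustration, not an excerpt from the Caffe source: it assumes N GPUs, with GPU i computing gradient g_t^(i) on its mini-batch at iteration t and eta denoting the learning rate; whether the merged gradient is summed or averaged depends on how the solver normalizes over the combined batch.

```latex
g_t = \sum_{i=1}^{N} g_t^{(i)}, \qquad w_{t+1} = w_t - \eta \, g_t
```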

From the above, a multi-GPU parallel training system is divided into two parts: data reading and distribution, and data parallel training. This article analyzes the design and implementation of the Caffe parallel training scheme in detail in sections 4.2 and 4.3, and discusses the advantages and disadvantages of the current Caffe multi-GPU parallel scheme in section 4.4.

4.2 Data Reading and Distribution

Data parallelism means slicing the training data and using multiple model instances to train on the shards in parallel. The training data for the shards is read from disk files into CPU main memory and then copied to GPU memory, so the Caffe I/O module is designed to read and distribute the next batch while each GPU is computing on its current batch, masking I/O time with compute time. This section first describes how Caffe organizes and manages multiple GPUs, and then, building on section 3.3, explains how the Caffe I/O modules support multi-GPU data parallel training.

In Caffe's multi-GPU parallel training scheme, the master process opens a thread for each GPU, and each thread completes the Solver->Net->Layer model initialization so that it holds its own copy of the model. The Solver of the main process is called the root solver (root_solver); it is the root node of the GPU tree topology and is responsible for merging gradients and distributing new parameters during parameter exchange, preserving the consistency of the model parameters. The Solver in Caffe is responsible for optimizing the model: it first initializes the network Net and each Layer according to the model parameters, and after initialization it iteratively optimizes the model by calling forward -> backward -> update weights. To make it easier to manage multiple GPUs, Caffe provides a thread-local global manager that stores information such as the GPU device ID and the number of model copies (solver_count). The global manager holds the underlying configuration of the system; because it is thread-local, each thread sees its own values when accessing it.
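A minimal sketch of such a thread-local manager is shown below. The class and member names here are invented for illustration; Caffe's own singleton exposes similar accessors, but this is a simplified model of the idea rather than the actual Caffe API.

```cpp
#include <iostream>
#include <thread>
#include <vector>

// Simplified stand-in for per-thread global state: each training thread
// records which GPU it drives and how many model replicas (solvers) exist.
class ParallelContext {
 public:
  static ParallelContext& Get() {
    // thread_local gives every thread its own instance, so the recorded
    // device ID can differ per training thread.
    static thread_local ParallelContext ctx;
    return ctx;
  }
  void set_device(int id) { device_id_ = id; }
  int device() const { return device_id_; }
  void set_solver_count(int n) { solver_count_ = n; }
  int solver_count() const { return solver_count_; }

 private:
  int device_id_ = 0;
  int solver_count_ = 1;
};

int main() {
  const int kNumGpus = 4;
  std::vector<std::thread> workers;
  for (int gpu = 0; gpu < kNumGpus; ++gpu) {
    workers.emplace_back([gpu] {
      ParallelContext::Get().set_device(gpu);
      ParallelContext::Get().set_solver_count(kNumGpus);
      // Each worker would now build its own Solver/Net copy on this GPU.
      std::cout << "worker bound to GPU " << ParallelContext::Get().device() << "\n";
    });
  }
  for (auto& t : workers) t.join();
  return 0;
}
```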

Caffe's Layer class has a ShareInParallel member that declares whether the layer may be shared by multiple model copies during data parallel training. Because Caffe uses a data parallel scheme in which the model training on each GPU is independent, the layers of the training model cannot be shared by default. The data input layer DataLayer is the exception: it can be set to shared, which guarantees that the solvers on the GPUs access the data sequentially. When the data input layer DataLayer is set to unshared, each solver owns a private DataLayer, and the DataLayer class holds a DataReader object reader_. When instantiated, reader_ creates a buffer of the double blocking queue (QueuePair) type whose capacity is the product of the number of pre-buffered batches and the batch size. The DataReader class contains the Body class, which is in fact the data pre-reading thread of the first-level I/O; each Body object controls the reading of one data source. When multiple DataLayers read the same database, a corresponding number of buffers is instantiated but only one Body object, which prevents multiple training tasks from reading one database at the same time. This is a single-producer, multi-buffer producer-consumer model, shown schematically in Figure 4-1. In the first-level production process, the data pre-reading thread reads the database sequentially and deals the data out to the buffers of the multiple DataLayers in turn.

Figure 4-1 Single producer multi-buffer I/O
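The sketch below illustrates the single-producer, multi-buffer pattern of Figure 4-1 with a plain C++ bounded blocking queue. It is a simplified model of the idea, not Caffe's DataReader/QueuePair implementation (the real QueuePair pairs a free queue with a full queue), and all names in it are invented for illustration.

```cpp
#include <condition_variable>
#include <iostream>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal bounded blocking queue; the producer blocks when a buffer is full
// and a consumer blocks when its buffer is empty.
template <typename T>
class BlockingQueue {
 public:
  explicit BlockingQueue(size_t cap) : cap_(cap) {}
  void Push(T v) {
    std::unique_lock<std::mutex> lk(mu_);
    not_full_.wait(lk, [&] { return q_.size() < cap_; });
    q_.push(std::move(v));
    not_empty_.notify_one();
  }
  T Pop() {
    std::unique_lock<std::mutex> lk(mu_);
    not_empty_.wait(lk, [&] { return !q_.empty(); });
    T v = std::move(q_.front());
    q_.pop();
    not_full_.notify_one();
    return v;
  }

 private:
  std::mutex mu_;
  std::condition_variable not_full_, not_empty_;
  std::queue<T> q_;
  size_t cap_;
};

int main() {
  const int kConsumers = 3;  // e.g. one buffer per GPU's private DataLayer
  const int kRecords = 9;    // records in a fake database, read sequentially
  std::vector<std::unique_ptr<BlockingQueue<int>>> buffers;
  for (int i = 0; i < kConsumers; ++i)
    buffers.emplace_back(new BlockingQueue<int>(4));

  // Single producer: one reader thread deals records round-robin into the
  // per-consumer buffers, so only one thread ever touches the database.
  std::thread producer([&] {
    for (int rec = 0; rec < kRecords; ++rec)
      buffers[rec % kConsumers]->Push(rec);
  });

  // Consumers: each drains only its own buffer, like a DataLayer prefetching.
  std::vector<std::thread> consumers;
  for (int c = 0; c < kConsumers; ++c) {
    consumers.emplace_back([&, c] {
      for (int n = 0; n < kRecords / kConsumers; ++n)
        std::cout << "buffer " << c << " -> record " << buffers[c]->Pop() << "\n";
    });
  }
  producer.join();
  for (auto& t : consumers) t.join();
  return 0;
}
```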

When the data layer DataLayer is set to shared, the solvers on all threads share the same data input layer, and there is only one first-level buffer; this is the single-producer, single-buffer model shown in Figure 4-2. With a single producer and a single buffer, the second-level data prefetching threads of multiple solvers block on the same buffer in parallel, so the number of pre-buffered batches has to be increased accordingly.

Figure 4-2 Single producer single buffer

When multiple GPUs in Caffe are trained on the same dataset, only one thread reads the database for the data input layer DataLayer, regardless of whether the layer is allowed to be shared. By comparison, giving each model its own data input layer yields a higher degree of parallelism and better load balancing.

4.3 Data Parallel Training

In data parallel training, each GPU holds a copy of the model, multiple GPUs train on multiple mini-batches at the same time, and at the end of each round of mini-batch training the parameters must be exchanged synchronously. Parameter exchange in data parallelism requires that the gradients obtained by each model copy in this round of mini-batch training be merged to update the model, and that the latest model then be pushed to each data parallel unit for the next round of computation. How to relieve the parameter exchange bottleneck and improve performance is the most critical problem in the design of a parallel scheme.

An ideal parameter exchange scheme should have the following characteristics: minimal total traffic, the fewest possible exchange cycles, full use of the PCIe bus bandwidth in each cycle, no waiting during data transfers between GPUs, and scalability to different numbers of GPUs. Caffe uses a tree topology to connect multiple GPUs: by default, the GPU supervised by the master process is the root of the tree, and the other CPU threads manage the GPUs below the root. Taking 4-GPU data parallelism as an example (Figure 4-3), the parameter exchange proceeds as follows: GPU pairs 0:1 and 2:3 exchange gradients, then at the next level the 0:2 pair exchanges gradients, GPU 0 merges the gradients and updates the model parameters w, and the updated parameters are then distributed back down in the opposite direction. Currently Caffe's multi-GPU parallelism does not support GPUs of different architectures, and the maximum model size is limited by the GPU with the smallest memory.

Figure 4-3 Multi-GPU tree topology
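To make the reduce-then-broadcast order concrete, the sketch below simulates the 4-GPU exchange of Figure 4-3 on plain CPU arrays. It only models the tree schedule, not the actual GPU-to-GPU transfers, and every name in it is invented for illustration.

```cpp
#include <array>
#include <iostream>
#include <vector>

// Simulated per-GPU state: one model replica and the gradient it computed.
struct Replica {
  std::vector<float> weights;
  std::vector<float> grad;
};

int main() {
  const int kGpus = 4;   // assumes a power of two, as the pairwise tree does
  const int kParams = 3;
  const float lr = 0.1f;

  std::array<Replica, kGpus> gpu;
  for (int i = 0; i < kGpus; ++i) {
    gpu[i].weights.assign(kParams, 1.0f);
    gpu[i].grad.assign(kParams, float(i + 1));  // fake gradients: 1, 2, 3, 4
  }

  // Reduce phase: pairs (0,1) and (2,3) first, then (0,2).
  // After this loop GPU 0 holds the sum of all gradients.
  for (int stride = 1; stride < kGpus; stride *= 2) {
    for (int dst = 0; dst < kGpus; dst += 2 * stride) {
      int src = dst + stride;
      for (int p = 0; p < kParams; ++p) gpu[dst].grad[p] += gpu[src].grad[p];
    }
  }

  // Root (GPU 0) updates the model with the merged gradient.
  for (int p = 0; p < kParams; ++p) gpu[0].weights[p] -= lr * gpu[0].grad[p];

  // Broadcast phase: push the new weights back down the same tree.
  for (int stride = kGpus / 2; stride >= 1; stride /= 2) {
    for (int dst = 0; dst < kGpus; dst += 2 * stride) {
      gpu[dst + stride].weights = gpu[dst].weights;
    }
  }

  std::cout << "all replicas now hold w[0] = " << gpu[3].weights[0] << "\n";
  return 0;
}
```

Note how, after the first reduce step, GPUs 1 and 3 have nothing left to contribute; this is exactly the idleness criticized in section 4.4.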

To transfer data between GPUs more quickly, the tree structure should be built with the hardware affinity of the GPUs in mind, for example whether two GPUs can access each other peer-to-peer. If there is no peer-to-peer DMA access and the data must instead pass through the PCIe root complex, the effective exchange bandwidth drops sharply. Figure 4-4 shows a classic PCIe bus architecture, in which the GPUs are divided into two regions controlled by different MCHs (Memory Controller Hubs) and connected to different CPU data interfaces. The CPU and GPUs within one region are connected through PCIe lanes; because the number of PCIe lanes a CPU provides is limited, engineering practice adds lanes through PCIe switch chips (e.g. PLX), allowing several GPUs to connect to one CPU. CPUs in different regions are connected by QPI (QuickPath Interconnect), so data in different regions can still be exchanged, but for GPUs in different regions the throughput and latency of transfers between them are poor. A multi-GPU topology should therefore follow these principles: keep communicating CPUs and GPUs within the same region to reduce cross-region copies, and if cross-region GPU transfers are unavoidable, minimize their number.


Figure 4-4 PCI-E Bus architecture diagram
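A quick way to check this affinity at runtime is the CUDA runtime's peer-access query. The short sketch below is a generic illustration of that check using the standard cudaDeviceCanAccessPeer / cudaDeviceEnablePeerAccess calls; it is not code taken from Caffe.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Probe which GPU pairs can talk over peer-to-peer DMA and enable it where
// possible. Pairs without P2P access have to go through the PCIe root
// complex (or QPI across sockets), which is much slower.
int main() {
  int n = 0;
  cudaGetDeviceCount(&n);
  for (int a = 0; a < n; ++a) {
    for (int b = 0; b < n; ++b) {
      if (a == b) continue;
      int can = 0;
      cudaDeviceCanAccessPeer(&can, a, b);
      std::printf("GPU %d -> GPU %d : %s\n", a, b, can ? "P2P" : "no P2P");
      if (can) {
        cudaSetDevice(a);
        cudaDeviceEnablePeerAccess(b, 0);  // flags argument must be 0
      }
    }
  }
  return 0;
}
```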

Once the multi-GPU tree topology is built and the data has been pre-buffered into GPU memory, multi-GPU parallel training begins. Caffe's Solver provides two callback hooks for multi-GPU training: on_start() and on_gradient_ready(). As shown in Figure 4-5, on_start() is used to distribute the parameters to each GPU, and on_gradient_ready() is used to merge the gradient values produced by back-propagation.

Figure 4-5 Parameter exchange process
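A condensed sketch of how such callbacks slot into a solver's iteration loop is shown below. The class shape mirrors the idea of Caffe's solver callbacks, but the code is a simplified illustration rather than the actual Caffe interface, and all identifiers are invented.

```cpp
#include <iostream>
#include <vector>

// Hook interface: a multi-GPU synchronizer implements these two methods.
class SolverCallback {
 public:
  virtual ~SolverCallback() = default;
  virtual void on_start() = 0;           // e.g. receive the latest parameters
  virtual void on_gradient_ready() = 0;  // e.g. merge gradients up the tree
};

// Toy solver loop that fires the hooks at the corresponding points.
class ToySolver {
 public:
  void add_callback(SolverCallback* cb) { callbacks_.push_back(cb); }
  void Step(int iters) {
    for (int it = 0; it < iters; ++it) {
      for (auto* cb : callbacks_) cb->on_start();
      // ... forward pass and backward pass would run here ...
      for (auto* cb : callbacks_) cb->on_gradient_ready();
      // ... apply the (merged) gradient to the weights here ...
    }
  }

 private:
  std::vector<SolverCallback*> callbacks_;
};

// Example synchronizer that just logs when the hooks fire.
class LoggingSync : public SolverCallback {
 public:
  void on_start() override { std::cout << "on_start: fetch parameters\n"; }
  void on_gradient_ready() override { std::cout << "on_gradient_ready: reduce gradients\n"; }
};

int main() {
  ToySolver solver;
  LoggingSync sync;
  solver.add_callback(&sync);
  solver.Step(2);
  return 0;
}
```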

4.4 Evaluation of the Parallel Scheme

The parallel scheme in the Caffe source is relatively simple and primitive; this section evaluates it from two aspects, functionality and design/implementation. Caffe only supports single-machine multi-card data-level parallelism, which suits moderate data sizes and model complexity. When a deep network model is so complex that a single GPU's memory cannot hold the complete model, model-level parallelism is needed to partition the network. And because Caffe does not support multi-machine multi-card distributed parallelism, it has difficulty handling the PB-scale training data found in real production environments.

From the design and implementation aspect, exchanging parameters synchronously over a tree topology, as Caffe does, is relatively primitive and inefficient. Caffe maintains consistency of the global parameters through synchronous updates: every batch of training waits for all GPUs to finish computing before the gradients are merged, so the parallel speed is limited by the slowest GPU and the synchronous waiting time is long. Second, after each merge step of the tree topology, half of the GPUs no longer participate in the subsequent merging, idling their compute power and the communication bandwidth of the nodes they sit on. The scalability of the tree topology is also poor: when the number of GPUs is odd, building the tree structure causes Caffe to raise an error. In addition, because the multithreading in the C++ source does not take effect under Python, the Caffe Python interface cannot use multi-GPU parallel training. A better solution would be to remove the parallel scheme inside Caffe and implement the multithreading outside it.

In summary, Caffe's parallel training scheme is relatively primitive and inefficient; one can remove the parallel scheme from the source entirely and instead study a multi-machine multi-card distributed Caffe. Current mainstream large-scale distributed machine learning platforms use the parameter server scheme to maintain global parameters, with asynchronous communication and flexible data consistency models, offering low data synchronization latency, strong scalability, and fault tolerance. Chapter 5 will introduce the principles and implementation of the parameter server.

References

[1] Jeffrey Dean, Greg S. Corrado, Rajat Monga, et al., and Andrew Y. Ng. Large Scale Distributed Deep Networks. In Advances in Neural Information Processing Systems (NIPS), 2012.

[2] Building a Caffe Clone from Scratch: The I/O System (Part 2), http://www.cnblogs.com/neopenx/p/5259197.html
