Introduction: This paper introduces Baidu's Spark-based heterogeneous distributed deep learning system. By combining Spark with the deep learning platform Paddle, it solves the data access problem between Paddle and the surrounding business logic; on top of that, GPU and FPGA heterogeneous computing is used to raise the data processing capability of each machine, and YARN is used to allocate the heterogeneous resources, supporting multi-tenancy and making resource usage more efficient.
Deep neural network technology has made great breakthroughs in recent years, particularly in speech and image recognition, and has proven useful in many kinds of business. How to run deep learning jobs at large scale and in a distributed fashion, so as to better support different business lines, has therefore become a priority. Over the past two years, Baidu's deep learning laboratory, under the leadership of Xu Wei, developed the distributed deep learning platform Paddle (Parallel Asynchronous Distributed Deep Learning), which meets the needs of many businesses well. However, because Paddle is a standalone deep learning platform that cannot be integrated tightly with other business logic, the data path between Paddle and that business logic became a performance bottleneck. To let more businesses adopt deep learning, we developed the Spark on Paddle platform, turning Paddle into a functional module of Baidu's Spark ecosystem. After the first version was finished, we found that CPU computing power could no longer keep up with Baidu's enormous data volumes, so on top of Spark on Paddle we added support for heterogeneous hardware, making full use of GPU and FPGA resources to accelerate Paddle jobs.

The design of the Paddle deep learning system
Paddle is a mature distributed deep learning platform, widely used at Baidu in image recognition, natural language understanding, speech, autonomous driving, and other fields. Its main strengths are highly optimized training algorithms, support for multi-GPU/multi-CPU training, high training efficiency, and unique optimizations for sparse features.
Existing deep learning platforms usually train on a single machine; the open-source Caffe platform, for example, trains on a single machine with a single card. Once the data or the model grows, distributed training is needed to keep training efficient, and there are two approaches: data parallelism and model parallelism.
Data parallelism is the most widely used parallelization method in distributed deep learning. Because the training data is very large, it is partitioned and the model is replicated on n machines for training. Since the goal is still a single model and each machine only sees a subset of the data, training synchronization and convergence must be guaranteed. The classic approach is the parameter server method described in "Parameter Server for Distributed Machine Learning": parameter servers are used to synchronize parameter updates, with each parameter server responsible for only a portion of the shared parameters. For example, model M is replicated on n machines and each machine takes part of the data. Assume the training parameter set is W and every machine starts from the same initial parameters. Each machine trains locally: it computes the gradient of the cost function on its own data, layer by layer via backpropagation just as on a single machine, and obtains a correction ΔW, so that locally the parameters would become W' = W − η·ΔW. Because there are multiple machines, each node's correction is different; every node pushes its own correction to the parameter server, which combines them and decides the correction for the next training cycle, so that all machines continue training from a consistent model.
Figure 1 Data parallelism
Figure 1 shows a deployment architecture for data-parallel deep learning. Training generally proceeds in the following steps: the training data is preprocessed and sharded; every machine receives the same model definition and the parameters are initialized uniformly; in each training cycle, every machine computes its own gradient and pushes the correction to the parameter server; the parameter server aggregates the corrections and pushes the parameters for the next iteration back to the local trainers; this loop repeats until the model converges.
Parameter-server updates can be synchronous or asynchronous. A strictly synchronous scheme makes the local trainers synchronize their parameters in every training iteration, so a single slow node slows down the whole training job. The idea of asynchronous updates is to lengthen the synchronization interval, letting a local trainer run several iterations before synchronizing. This has both advantages and disadvantages: slow nodes have less impact on the overall training, but some training cycles may be wasted, because after synchronization the aggregated correction may differ greatly from the local trainer's own. How to choose the synchronization frequency and guarantee convergence under asynchrony remains an active research direction.
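To make the synchronous update concrete, here is a minimal, self-contained Scala sketch; it only illustrates the idea (the object and function names are made up, not Paddle's API). Two simulated machines each compute a gradient of a linear least-squares model on their own data shard, and a single simulated parameter server averages the corrections and applies W' = W − η·ΔW before the next cycle.

```scala
// Toy sketch of synchronous data-parallel SGD with one parameter server.
// Each "machine" computes a gradient on its own shard; the server averages
// the corrections, applies W' = W - lr * avg(dW), and broadcasts W'.
object DataParallelSketch {
  type Vec = Array[Double]

  // Gradient of mean squared error for a linear model y = w . x on one shard.
  def localGradient(w: Vec, shard: Seq[(Vec, Double)]): Vec = {
    val g = new Array[Double](w.length)
    for ((x, y) <- shard) {
      val err = w.indices.map(i => w(i) * x(i)).sum - y
      for (i <- w.indices) g(i) += err * x(i)
    }
    g.map(_ / shard.size)
  }

  def main(args: Array[String]): Unit = {
    val shards = Seq(                                      // two data shards, one per "machine"
      Seq((Array(1.0, 2.0), 5.0), (Array(2.0, 1.0), 4.0)),
      Seq((Array(0.5, 1.5), 3.5), (Array(1.5, 0.5), 2.5)))
    var w: Vec = Array(0.0, 0.0)                           // parameters held by the parameter server
    val lr = 0.1
    for (_ <- 1 to 200) {
      val grads = shards.map(localGradient(w, _))          // would run in parallel on the trainers
      val avg = w.indices.map(i => grads.map(_(i)).sum / grads.size)
      w = w.indices.map(i => w(i) - lr * avg(i)).toArray   // unified update for the next cycle
    }
    println("trained parameters: " + w.mkString(", "))
  }
}
```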
Another option, shown in Figure 2, is to split the model itself across machines, which is model parallelism. However, the communication and synchronization overhead of model parallelism exceeds that of data parallelism, so in scenarios where data parallelism applies, model parallelism may not be as efficient.
Figure 2 Model parallelism
Paddle mainly adopts model parallelism within a single machine and data parallelism across machines, which enables distributed training on large-scale data with billion-parameter models.

The pain point of combining Paddle with business logic
Paddle is an independent deep learning platform and does not support reading data directly from other platforms. Researchers usually have to wait for the preceding stage of the pipeline to produce Paddle's input data, write it to HDFS first, and then load it into the local memory and disks of the Paddle cluster before training can start. After the model has been trained, it is written back to HDFS for the next piece of business logic to read. This process is not only time-consuming and the bottleneck of the whole computation, it is also repetitive and tedious work, which hindered the adoption of the Paddle platform and kept many teams that needed deep learning from using it.
To solve this problem, we designed the Spark on Paddle architecture, coupling Spark and Paddle so that Paddle becomes a module of Spark. As shown in Figure 3, model training can then be integrated with upstream functions such as feature extraction through RDD data transfer, without detouring through HDFS. The data path between Paddle and the business logic is therefore no longer a performance bottleneck.
Figure 3 general business logic based on Baidu Spark
Spark on Paddle Architecture version 1.0
Spark is a big-data processing platform that has risen rapidly in recent years, not only because its computational model is far more efficient than traditional Hadoop MapReduce, but also because of the very strong ecosystem it brings. Higher-level applications built on the Spark engine, such as Spark SQL, Spark Streaming, and Spark MLlib, are excellent applications, several times better than their traditional counterparts and more stable. At the same time, integration with YARN/Mesos makes Spark more flexible in managing and allocating computing resources.
Spark is already widely used inside Baidu, mainly for data processing and data analysis, and a traditional data processing pipeline is typically followed by model training. CTR prediction in the advertising system is one example: users generate large volumes of click and browse logs, which Spark can process and clean. For training large-scale models, however, Spark MLlib's support is limited, especially for deep learning, so the problem of supporting Paddle inside Spark had to be solved.
From the user application's point of view, Spark runs a driver node (Driver), which can be regarded as the master node of the user's distributed program, responsible for scheduling and program flow control. The actual operations of a Spark program are distributed across executors running on the worker nodes. Spark also has a very important concept called the RDD, an abstraction of a distributed, partitioned data set. All of Spark's input and output data are RDDs; an RDD not only records the dependencies between data sets but also segments the data logically, and an operation on an RDD is usually performed partition by partition.
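As a minimal illustration of the RDD abstraction, the standalone Spark example below (names such as `RddPartitionDemo` are ours, not part of the platform) parallelizes a data set into four partitions and runs one operation per partition with `mapPartitions`, which is the natural granularity for handing data to a trainer on a worker node.

```scala
import org.apache.spark.sql.SparkSession

object RddPartitionDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-demo").master("local[4]").getOrCreate()
    val sc = spark.sparkContext

    // Logically split the data into 4 partitions distributed across executors.
    val samples = sc.parallelize(1 to 1000, numSlices = 4)

    // mapPartitions processes each partition as a whole rather than record by record.
    val partitionSums = samples.mapPartitions(it => Iterator(it.sum))
    println(partitionSums.collect().mkString(", "))

    spark.stop()
  }
}
```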
Fig. 4 Spark DNN Training operation framework
Spark DNN training running framework as shown in Figure 4, training is generally divided into the following 5 steps: DNN data preprocessing and training feature preparation
In general, this is the strength of spark, whether it is streaming data or have dropped the data are processed through spark, including data cleaning, feature preparation, and then the resulting training data with RDD output. Resource Request
The Spark training task is submitted from yarn to get the node resources for the DNN training task, for example, a training task requires 4 nodes with 4 GPU machines. Yarn will do container management of resources, regardless of CPU or GPU for yarn is a virtual resource. The following article will do a specific introduction. Training initialization
Driver distributes the model configuration according to the resources allocated by yarn. Model training resource pool, and starts the trainer and parameter servers, initializing the initial parameters of the model. Model Training
The training data is entered into the trainer interface in a Rdd way, and the training is conducted in a data parallel manner, and the training opportunity is communicated with the parameter server, the gradient exchange and the parameter synchronization are completed, and the training terminates when the training maximum iteration is achieved or the model converges. Model prediction
Models can be passed into a cluster of servers or loaded and predicted in a spark streaming manner.
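To make steps 3 and 4 more concrete, here is a hedged, driver-side sketch in Scala. The `ModelConfig` and `TrainerHandle` types and their methods are hypothetical placeholders standing in for Paddle's actual trainer interface, which this article does not show; only the Spark calls (`RDD.mapPartitions`, `collect`) are real API.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical model configuration and trainer handle; illustrative placeholders
// only, not Paddle's real interface.
case class ModelConfig(layers: Seq[Int], learningRate: Double, maxIterations: Int)

trait TrainerHandle extends Serializable {
  // Trains on one partition of data, exchanging gradients with the parameter
  // servers internally, and reports whether the model has converged.
  def trainPartition(samples: Iterator[Array[Float]]): Boolean
}

object TrainingDriverSketch {
  // Driver-side loop: distribute the configuration, feed RDD partitions to the
  // trainers, and stop at convergence or at the iteration limit.
  def run(trainData: RDD[Array[Float]], config: ModelConfig,
          newTrainer: ModelConfig => TrainerHandle): Unit = {
    var iter = 0
    var converged = false
    while (iter < config.maxIterations && !converged) {
      // Each partition is handled by a trainer running on the executor that owns it.
      val flags = trainData.mapPartitions { part =>
        val trainer = newTrainer(config)
        Iterator(trainer.trainPartition(part))
      }.collect()
      converged = flags.forall(identity)
      iter += 1
    }
  }
}
```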
During the development of Spark on Paddle 1.0 we verified that Spark can indeed integrate ETL, training-data preprocessing, and deep learning training, and we also found that there is a great deal of demand for deep learning inside Baidu. On top of the 1.0 platform we therefore needed to add multi-tenant resource management, training monitoring, training fault tolerance, and so on.

Spark on Paddle Architecture version 2.0
Turning Spark on Paddle into a platform is the main goal of version 2.0. It introduces more functionality, including a monitoring mechanism and a fault-tolerance mechanism for the training process, and an ML decision module for hyperparameter selection. The design of Spark on Paddle 2.0 is analyzed below.
As shown in Figures 5 and 6, a client starts DNN training by communicating directly with the Spark DNN Driver, which launches a training instance and passes it the training data, the network configuration, and other information. A training instance contains all the services needed for training, including a group of trainers and the corresponding parameter servers, and a training master manages the entire training process. The training master also manages the lifetime of the trainers and parameter servers and restarts them on failure; the parameter servers and trainers periodically send heartbeats to the training master to confirm that they are working properly.
Figure 5 Spark on Paddle 2.0 overall architecture
Figure 6 Spark on Paddle 2.0 Training instance Architecture
Monitoring mechanism in the course of training
Once training starts, users want to monitor data from the training process, including the loss value of each iteration, the error rate, the time spent, and the logs of the trainers and parameter servers. In our implementation, the workers report the training metrics to the driver side using message passing (Akka). Performance data for the Spark job as a whole relies on the monitoring provided by Spark itself, and all of the information is fed back to the monitoring page (Web UI).
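As a sketch of this Akka-based reporting path (the message type and actor names below are illustrative, not the platform's actual classes), a collector actor on the driver side receives per-iteration metrics sent by the trainers:

```scala
import akka.actor.{Actor, ActorSystem, Props}

// Illustrative metrics message; the real platform's message schema is not public.
case class IterationMetrics(workerId: Int, iteration: Int, loss: Double, errorRate: Double)

// Driver-side collector: in the real system the data would be surfaced on the
// monitoring page (Web UI) instead of printed.
class MetricsCollector extends Actor {
  def receive: Receive = {
    case m: IterationMetrics =>
      println(f"worker=${m.workerId} iter=${m.iteration} loss=${m.loss}%.4f err=${m.errorRate}%.4f")
  }
}

object MonitoringSketch {
  def main(args: Array[String]): Unit = {
    val system = ActorSystem("dnn-monitoring")
    val collector = system.actorOf(Props[MetricsCollector](), "metrics-collector")

    // A trainer on a worker would send a message like this after each iteration.
    collector ! IterationMetrics(workerId = 0, iteration = 1, loss = 0.93, errorRate = 0.41)

    Thread.sleep(500)   // give the actor time to process before shutting down
    system.terminate()
  }
}
```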
Fault-tolerant mechanism in training process
During DNN training, both the trainers and the parameter servers can fail. The simplest fault-tolerance approach is to back up the model parameters and the training state regularly, and to restart training from the last backup point when it fails; the training master collects this information and reports it to the Spark DNN Driver. For the parameter servers, fault tolerance can be achieved through redundancy: if a parameter server goes down, the training master restarts the corresponding service, and in the meantime a standby parameter server takes over the parameter updates of the failed one.
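The backup-and-restart idea can be sketched as follows; the file-based checkpoint format and path are purely illustrative (the real system would write to HDFS and include much more training state):

```scala
import java.nio.file.{Files, Paths, StandardOpenOption}

// Minimal sketch of "periodically back up parameters, restart from the last
// backup on failure"; format and location are illustrative only.
object CheckpointSketch {
  def saveCheckpoint(path: String, iteration: Int, params: Array[Double]): Unit = {
    val line = s"$iteration," + params.mkString(",")
    Files.write(Paths.get(path), line.getBytes("UTF-8"),
      StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)
  }

  def loadCheckpoint(path: String): Option[(Int, Array[Double])] = {
    val p = Paths.get(path)
    if (!Files.exists(p)) None
    else {
      val parts = new String(Files.readAllBytes(p), "UTF-8").split(",")
      Some((parts.head.toInt, parts.tail.map(_.toDouble)))
    }
  }

  // On (re)start, resume from the latest backup point if one exists.
  def resume(path: String, freshParams: Array[Double]): (Int, Array[Double]) =
    loadCheckpoint(path).getOrElse((0, freshParams))
}
```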
Hyperparameter selection
Fig. 7 Hyperparameter selection and training
Hyperparameters are the settings on which model training is built. Spark MLlib already includes a model selection module, whose basic approach is to train models under a hyperparameter search algorithm and use the best resulting hyperparameters for the final model training. Hyperparameter selection is very meaningful for deep learning: the network topology, the parameter decay rate, and the choice of activation function are all hyperparameters that affect a deep learning model. Figure 7 shows a rough hyperparameter selection flow: model feature choices and regularization parameters are paired, a model is trained for each pair, and an evaluation module picks the final hyperparameters. In the Spark setting, the DNN Driver communicates with the evaluation side over RPC to decide which hyperparameters to try next; the evaluation-side logic is an ML application service on which the Spark DNN Driver depends. If a user asks for hyperparameter selection of a DNN model, the Spark DNN Driver starts multiple training instances with different hyperparameters and then decides, based on the training results, whether further search is needed.
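Spark MLlib's public model-selection API gives a flavor of this process. The minimal sketch below grid-searches the regularization parameter of a logistic regression with cross-validation on a tiny stand-in data set; in Spark on Paddle the estimator would be a DNN training instance instead, and the data would of course be far larger.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

object HyperparamSelectionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hyperparam-demo").master("local[*]").getOrCreate()

    // Tiny stand-in training set with "label" and "features" columns.
    val training = spark.createDataFrame(Seq(
      (0.0, Vectors.dense(0.0, 1.1)), (1.0, Vectors.dense(2.0, 1.0)),
      (0.0, Vectors.dense(0.2, 1.3)), (1.0, Vectors.dense(1.9, 0.8)),
      (0.0, Vectors.dense(0.1, 1.2)), (1.0, Vectors.dense(2.1, 0.9)),
      (0.0, Vectors.dense(0.3, 1.4)), (1.0, Vectors.dense(1.8, 1.1)))).toDF("label", "features")

    val lr = new LogisticRegression()

    // Candidate regularization parameters; the evaluator picks the best one.
    val grid = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
      .build()

    val cv = new CrossValidator()
      .setEstimator(lr)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(grid)
      .setNumFolds(2)

    val model = cv.fit(training)
    println("selected hyperparameters: " + model.bestModel.extractParamMap())

    spark.stop()
  }
}
```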
Spark Heterogeneous Distributed Computing Platform Architecture

As noted above, Spark on Paddle enables traditional deep learning to run on larger-scale distributed systems. However, a very real problem Baidu faces is the enormous volume of data: internally, the amount of data processed every day far exceeds the capacity of traditional platforms and involves huge numbers of model parameters, features, and training samples. Such data volumes place higher demands on the performance and scalability of the distributed system. On the one hand, we want a deep learning computing cluster on the scale of a traditional MapReduce cluster, able to run a large number of deep learning tasks in parallel; on the other hand, a deep learning model cannot be split into arbitrarily small pieces, so the processing capability of each individual node is also critical.
At present, CPU-based compute nodes are limited by their own computing power and cannot meet these demands, so we need more powerful heterogeneous computing to accelerate the platform. Our project currently involves two kinds of computing resources: GPUs and FPGAs. GPUs provide powerful computing power for compute-dense workloads; FPGAs are low-power and highly customizable, and are suitable for accelerating many specific tasks (the FPGA hardware acceleration used in this project is provided by the computing team at Baidu's U.S. R&D center).
Our project builds on Spark on Paddle and explores how to integrate heterogeneous resources effectively into today's large-scale distributed systems, delivering high application performance and ease of use. Beyond the requirements above, the system needs to manage GPU/FPGA resources dynamically and schedule them seamlessly, just like CPU and memory. This is achieved by integrating resource scheduling into the open-source YARN system, while resource isolation is based on the industry's popular container technology.
At the same time, we need to provide a simple, easy-to-use programming interface so that existing applications can be migrated to our system quickly. Because all of Spark's data is organized as RDDs, we created a new kind of RDD through which a program can directly use the underlying GPU/FPGA to accelerate the corresponding computation. The functions that actually run on the GPU/FPGA also need kernels, for which we use the industry-standard OpenCL interface so that programs can be ported across different GPUs and FPGAs. A specific piece of functionality therefore consists of three parts: a Scala driver, a C++ worker, and an OpenCL kernel (running on the GPU/FPGA). If the functionality is already integrated in MLlib, the user only needs to write the Scala driver and call the library functions supported by the new RDD to enjoy GPU/FPGA acceleration transparently.
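The "new kind of RDD" can be sketched with Spark's public RDD extension points. In the sketch below, `AcceleratedMapRDD` is a hypothetical name; `compute` is where the real platform would hand the partition to the C++ worker and its OpenCL kernel, while here a plain Scala function stands in for that step.

```scala
import scala.reflect.ClassTag

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical accelerated RDD: same partitioning as its parent, but computation
// of each partition would be delegated to a GPU/FPGA-backed C++ worker.
class AcceleratedMapRDD[T: ClassTag, U: ClassTag](
    prev: RDD[T],
    f: Iterator[T] => Iterator[U]) extends RDD[U](prev) {

  override def compute(split: Partition, context: TaskContext): Iterator[U] = {
    // Real platform: ship the partition to the C++ OpenCL worker and launch the
    // kernel on the GPU/FPGA. Here, a CPU-side function is used as a stand-in.
    f(prev.iterator(split, context))
  }

  override protected def getPartitions: Array[Partition] = prev.partitions
}

// Usage sketch: callers see an ordinary RDD.
// val accelerated = new AcceleratedMapRDD(inputRdd, (it: Iterator[Float]) => it.map(_ * 2.0f))
// accelerated.collect()
```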
Fig. 8 Spark heterogeneous Computing platform architecture
The heterogeneous system architecture is shown in Figure 8 and runs as follows. First the user application (Scala driver) is started by the App Master, and the application asks YARN for the resources it needs; the GPU and FPGA are simply additional resource classes, requested exactly like CPU resources. Once the application has obtained all its resources, the App Master starts a Scala worker in a container on the corresponding App Slave to run the user program. Then, depending on what the Scala worker requires: if the new RDD is used, the corresponding C++ OpenCL program is launched; if the functionality is built into MLlib, this part is completely transparent to the user. After the OpenCL program starts, it transfers the assigned data to the GPU or FPGA and dynamically launches the appropriate OpenCL kernels there to process it. When the kernels finish, the data is automatically pulled back into main memory and the OpenCL program returns the results to the Scala worker. Finally, all the Scala workers return their results to the user program (the Scala driver) running on the App Master.
As can be seen, the whole process supports adding new GPU/FPGA computing resources, and beyond using the new RDD, no other changes to the user program are required.

Performance evaluation of the Spark heterogeneous platform
After the heterogeneous platform was built, we first compared CPU and GPU performance on basic machine-learning matrix operations. The results show that the GPU accelerates well, running about 30 times faster than the CPU on the same computation. The computing team at Baidu's U.S. R&D center also accelerated the K-means algorithm with FPGAs, achieving a 15 to 20 times speedup while consuming only 20% of the CPU's energy. In a second experiment, we compared the GPU-to-CPU speedup of Spark on Paddle when training ImageNet and found that the GPU delivers a 30x speedup; in other words, since 1/30 is roughly 3%, with the heterogeneous platform we can do the same computation with only about 3% of the machine resources.
With the speedup ratios of the heterogeneous platform understood, we also studied its scalability. The test results in Figure 9 show that computation time decreases roughly linearly as GPU resources are added, demonstrating strong scalability and the ability to handle large volumes of data and computation.
Fig. 9 Performance data of spark heterogeneous computing platform
Summary
This paper introduced Baidu's Spark-based heterogeneous distributed deep learning system. Combining Spark with the deep learning platform Paddle solves the data path problem between Paddle and the business logic, so that businesses can adopt deep learning easily. On this basis, we use GPU and FPGA heterogeneous platforms to greatly increase the data processing capability of each machine, and on the heterogeneous platform we use YARN to allocate heterogeneous resources, supporting multi-tenancy and making resource usage more efficient. Next, we plan to extend the platform to Baidu's different business lines, such as speech, the Baidu personal assistant, Baidu Maps search, and Baidu's autonomous vehicles, so that it can be hardened by different workloads. Once the platform is more mature, we intend to open-source Spark on Paddle and the heterogeneous computing platform and give back to the community.