Cross-Platform Caffe, I/O Model, and Parallel Scheme (Part 5)

5. Parameter Server

5.1 Background Information

In the field of machine learning and deep learning, a single machine can no longer cope with the rapid growth of data and model sizes, so distributed optimization has become a prerequisite. In practice, training data sets can range from 1 TB to 1 PB, and the models trained on them can contain 10^9 to 10^12 parameters. These parameters usually need to be accessed frequently by all worker nodes, which raises several problems and challenges:

Access to this huge number of parameters requires a large amount of network bandwidth. Many machine learning algorithms are sequential: the next iteration cannot start until every worker has finished the current one, so the slowest machine determines overall progress (the barrel theory, or straggler problem) and a large performance gap between machines causes a great loss of efficiency. Finally, a distributed machine learning system needs fault tolerance: jobs are deployed in cloud environments where machines are unreliable and jobs may be preempted.

Traditional distributed learning frameworks have difficulty solving these problems. With MPI, gradient aggregation is slow for batch solvers, and MPI itself cannot handle very large data sets. MapReduce solves MPI's data-scale problem, but it does not improve the training performance of batch solvers either, and it introduces new problems, including inefficient iterative computation and inefficient communication between nodes. GraphLab, introduced by CMU, is a graph-based abstraction that can express many machine learning problems, but some workloads, such as the multilayer structures of deep learning, still cannot be handled efficiently. The parameter server is a newer data-parallel abstraction designed specifically for machine learning algorithms. It stores parameters in a distributed key-value model and provides an effective mechanism for synchronizing model parameters among the worker nodes of a distributed system; each worker only needs to keep the fraction of the parameters that its computation depends on. This avoids the excessive communication overhead caused by frequent fine-grained parameter exchange.

5.2 Parameter Server System Architecture

The concept of the parameter server was first derived from the parallel LDA framework proposed by Alex Smola in 2010 [5]. It uses distributed memcached as the parameter storage, which provides an effective mechanism for synchronizing model parameters between the worker nodes of a distributed system; each worker only needs to keep the fraction of the parameters it relies on for its computation. Of course, the parameter storage here differs from the key-value abstraction of OLTP applications, because frequent key-value interactions on individual parameters would lead to excessive communication overhead. Parameter servers therefore usually synchronize parameters using mathematical objects such as vectors, tensors, or rows and columns of matrices.
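To make this coarse-grained interface concrete, the following C++ sketch shows a parameter store whose push and pull operations work on whole parameter blocks (vector segments) instead of individual key-value pairs. The class and method names (ParamStore, Push, Pull) are illustrative assumptions, not the API of any particular parameter server.

```cpp
// Sketch of a coarse-grained parameter store: workers push and pull whole
// vector blocks keyed by block id, instead of exchanging single key-value pairs.
#include <cstdio>
#include <mutex>
#include <unordered_map>
#include <vector>

class ParamStore {
 public:
  // Push a gradient block: the store applies it to the corresponding weights.
  void Push(int block_id, const std::vector<float>& grad, float lr) {
    std::lock_guard<std::mutex> lock(mu_);
    auto& w = blocks_[block_id];
    if (w.size() < grad.size()) w.resize(grad.size(), 0.0f);
    for (size_t i = 0; i < grad.size(); ++i) w[i] -= lr * grad[i];
  }
  // Pull the current value of a whole block in one request.
  std::vector<float> Pull(int block_id) {
    std::lock_guard<std::mutex> lock(mu_);
    return blocks_[block_id];
  }

 private:
  std::mutex mu_;
  std::unordered_map<int, std::vector<float>> blocks_;
};

int main() {
  ParamStore store;
  store.Push(/*block_id=*/0, {0.1f, 0.2f, 0.3f}, /*lr=*/0.01f);
  std::vector<float> w = store.Pull(0);
  std::printf("w[0]=%f w[1]=%f w[2]=%f\n", w[0], w[1], w[2]);
  return 0;
}
```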

Figure 5-1 Framework of parallel LDA

The sampler in the figure above is a component of parallel LDA and corresponds to a compute unit in a general parameter server framework. The model Smola proposed is the earliest parameter server abstraction, and a number of improvements followed. In 2012, Google's Jeff Dean and colleagues proposed the first-generation Google Brain solution, DistBelief [4], aimed mainly at training large-scale deep learning networks. DistBelief stores the huge deep learning model on a global parameter server, and the compute nodes exchange information through the parameter server, which handles the distributed training of the SGD and L-BFGS algorithms well.

Figure 5-2 Distributed SGD framework flowchart

The figure above shows the flow of the distributed asynchronous SGD architecture. The training data is divided into multiple subsets, and a copy of the model runs on each subset; the model replicas communicate through a centralized parameter server, which holds all of the model's parameters and state. Asynchrony shows up in two ways: the model replicas run independently of one another, and the shards of the parameter server also run independently of one another. DistBelief uses the BSP parallel model to maintain the consistency of the global parameters. BSP (Bulk Synchronous Parallel), proposed in the 1980s, requires the algorithm to synchronize at the end of every iteration, so the whole system is slowed down to the pace of the slowest machine.
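The following is a minimal single-process sketch of this data-parallel scheme, under the assumption that a mutex-protected array stands in for the centralized parameter server: each model replica runs in its own thread, pulls the shared parameters, computes a gradient on its own data subset, and pushes the update back without any barrier between replicas. The names (SharedParams, replica) and the toy objective are illustrative, not part of DistBelief.

```cpp
// Single-process sketch of asynchronous data-parallel SGD: each replica thread
// pulls the shared parameters, computes a local gradient, and pushes the update
// back without waiting for other replicas.
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

struct SharedParams {  // stands in for the centralized parameter server
  std::mutex mu;
  std::vector<double> w;
};

// Toy objective: minimize (w[0] - x)^2 over the data points of this subset.
void replica(SharedParams* ps, const std::vector<double>& subset, int steps) {
  const double lr = 0.05;
  for (int t = 0; t < steps; ++t) {
    for (double x : subset) {
      double w0;
      {  // "pull" the current parameters
        std::lock_guard<std::mutex> lock(ps->mu);
        w0 = ps->w[0];
      }
      double grad = 2.0 * (w0 - x);  // local gradient on this subset
      {  // "push" the update asynchronously
        std::lock_guard<std::mutex> lock(ps->mu);
        ps->w[0] -= lr * grad;
      }
    }
  }
}

int main() {
  SharedParams ps;
  ps.w = {0.0};
  // Two data subsets whose overall mean is 2.0, so w[0] should approach 2.0.
  std::vector<double> subset_a = {1.0, 1.5, 2.0};
  std::vector<double> subset_b = {2.0, 2.5, 3.0};
  std::thread r1(replica, &ps, subset_a, 200);
  std::thread r2(replica, &ps, subset_b, 200);
  r1.join();
  r2.join();
  std::printf("w[0] = %f\n", ps.w[0]);
  return 0;
}
```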

To address the cost and latency of parameter synchronization, Eric Xing's Petuum project team at CMU proposed a delayed synchronization model called SSP (Stale Synchronous Parallel) [6]. SSP aims to maximize iteration throughput by exploiting the error tolerance of machine learning algorithms, while the theory still guarantees correctness. The basic idea is to allow the machines to update the model at different paces, but to add a limit so that the gap between the fastest machine and the slowest machine does not become too large. The advantage is that a slow node no longer drags down the whole system, yet the final convergence of the model is still guaranteed. In this way, the delayed asynchronous consistency model avoids the waiting time of traditional BSP-based distributed computation.
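A sketch of the bounded-staleness rule is shown below (an illustration of the SSP idea, not the Petuum implementation): each worker keeps a logical clock and may run ahead only as long as it is no more than a fixed staleness bound ahead of the slowest worker; otherwise it blocks and waits. The class name SspClock and the condition-variable scheme are assumptions made for this sketch.

```cpp
// Sketch of the SSP consistency rule: a worker may advance to iteration c only
// when the slowest worker has reached at least c - staleness.
#include <algorithm>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

class SspClock {
 public:
  SspClock(int num_workers, int staleness)
      : clocks_(num_workers, 0), staleness_(staleness) {}

  // Worker `id` finished an iteration: bump its clock and block while it is
  // more than `staleness_` iterations ahead of the slowest worker.
  void Advance(int id) {
    std::unique_lock<std::mutex> lock(mu_);
    ++clocks_[id];
    cv_.notify_all();
    cv_.wait(lock, [&] {
      int slowest = *std::min_element(clocks_.begin(), clocks_.end());
      return clocks_[id] - slowest <= staleness_;
    });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::vector<int> clocks_;
  int staleness_;
};

int main() {
  const int kWorkers = 3, kIters = 10, kStaleness = 2;
  SspClock clock(kWorkers, kStaleness);
  std::vector<std::thread> workers;
  for (int id = 0; id < kWorkers; ++id) {
    workers.emplace_back([&clock, id] {
      for (int it = 0; it < kIters; ++it) {
        // ... compute gradients and update parameters here ...
        clock.Advance(id);  // fast workers wait if they get too far ahead
      }
    });
  }
  for (auto& w : workers) w.join();
  std::printf("all workers finished within staleness bound %d\n", kStaleness);
  return 0;
}
```

Note that the slowest worker never blocks (its gap to itself is zero), which is what guarantees overall progress under this rule.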

Figure 5-3 ps-lite parameter server framework diagram

As shown in Figure 5-3, ps-lite [7], the parameter server architecture implemented by Mu Li and the DMLC team, is a third-generation parameter server and provides a more general architecture than DistBelief. The design contains one server group and several worker groups. The server group acts as the parameter server: each server node holds a shard of the parameters, and a server manager manages the whole server group, maintaining a consistent view of the server group's metadata and of the parameter sharding. Each worker group runs one application; a worker node communicates only with server nodes to update parameters, and there is no interaction between worker nodes. Each worker group has a scheduler that assigns tasks to the worker nodes and monitors them; if a worker node fails or a new one is added, the scheduler reschedules the unfinished tasks. For network bandwidth optimization, ps-lite mainly provides an aggregate-then-replicate scheme for parameter replication between servers:

Figure 5-4 Parameter Transfer

Replication between servers is mainly a fault-tolerance mechanism, so saving bandwidth on the data transferred between workers and servers still relies on the parameter server's own asynchronous mechanisms. When applied to deep learning, this is done mainly through the so-called delayed block proximal gradient method: each iteration updates only one block of parameters, and when a worker node computes the gradient it also computes a coordinate-specific learning rate for that block, namely the diagonal of the block's second derivative. For data transfer, ps-lite also introduces user-defined filters to avoid transferring parameters that are insignificant to the model, such as the random skip filter or the KKT filter, which can reduce the number of transferred model parameters by more than a factor of ten. For a proof of convergence of the delayed block proximal gradient method, see Mu Li's paper [8].
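The filtering idea can be sketched as follows. This is a simplified significance filter standing in for the filters mentioned above, not ps-lite's actual filter code: before a push, the worker drops gradient coordinates whose magnitude is below a threshold and transfers only the surviving (index, value) pairs.

```cpp
// Simplified significance filter: only gradient entries with |g| >= threshold
// are sent to the server, as sparse (index, value) pairs.
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

std::vector<std::pair<int, float>> FilterGradient(
    const std::vector<float>& grad, float threshold) {
  std::vector<std::pair<int, float>> significant;
  for (int i = 0; i < static_cast<int>(grad.size()); ++i) {
    if (std::fabs(grad[i]) >= threshold) {
      significant.emplace_back(i, grad[i]);
    }
  }
  return significant;
}

int main() {
  std::vector<float> grad = {0.001f, -0.8f, 0.0002f, 0.3f, -0.004f, 0.0f};
  auto to_send = FilterGradient(grad, /*threshold=*/0.01f);
  std::printf("sending %zu of %zu entries\n", to_send.size(), grad.size());
  for (auto& kv : to_send)
    std::printf("  index %d -> %f\n", kv.first, kv.second);
  return 0;
}
```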

5.3 Parameter Server Disaster Tolerance Strategy

The service nodes and compute nodes of the parameter server adopt different disaster tolerance strategies. For compute nodes, one can restart the task, discard the failed node, or use other algorithm-specific policies. The service nodes, however, maintain the global parameters: if their data is lost or they go offline, the whole application is seriously affected, so they require stronger data consistency and faster recovery.

For the service nodes of the parameter server, disaster tolerance uses consistent hashing together with chain backup. A service node stores model parameters by maintaining one or several parameter segments according to a consistent hashing protocol. This protocol guarantees that when a service node changes, only the service nodes maintaining adjacent parameter segments are affected. The parameters maintained by each service node are also backed up on several other service nodes. When a service node receives data from a compute node, it first replicates the data to its backup nodes and only then notifies the compute node that the operation is complete. Any failure in between causes this send to fail, but it does not lead to inconsistent data.
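A minimal sketch of these two mechanisms follows, under the assumption of a simple hash ring: each server owns the keys that hash to its arc of the ring, and the next k servers clockwise hold the backup copies, so a push is acknowledged only after the owner and its backups have been written. The HashRing class and server names are illustrative.

```cpp
// Sketch of consistent hashing with chain backup: servers sit on a hash ring,
// a key is owned by the first server clockwise from its hash, and the next k
// servers on the ring keep backup copies.
#include <cstdint>
#include <cstdio>
#include <functional>
#include <map>
#include <string>
#include <vector>

class HashRing {
 public:
  void AddServer(const std::string& name) {
    ring_[std::hash<std::string>{}(name)] = name;
  }

  // Returns the owner followed by `num_backups` successor servers on the ring.
  std::vector<std::string> Locate(uint64_t key, int num_backups) const {
    std::vector<std::string> nodes;
    auto it = ring_.lower_bound(std::hash<uint64_t>{}(key));
    for (int i = 0; i <= num_backups && i < static_cast<int>(ring_.size()); ++i) {
      if (it == ring_.end()) it = ring_.begin();  // wrap around the ring
      nodes.push_back(it->second);
      ++it;
    }
    return nodes;
  }

 private:
  std::map<size_t, std::string> ring_;  // position on ring -> server name
};

int main() {
  HashRing ring;
  for (const char* s : {"server0", "server1", "server2", "server3"})
    ring.AddServer(s);

  uint64_t param_key = 42;
  std::vector<std::string> chain = ring.Locate(param_key, /*num_backups=*/2);
  std::printf("key %llu: owner=%s", (unsigned long long)param_key,
              chain[0].c_str());
  for (size_t i = 1; i < chain.size(); ++i)
    std::printf(", backup%zu=%s", i, chain[i].c_str());
  std::printf("\n");
  // A push to this key is acknowledged only after owner and backups are written.
  return 0;
}
```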

Chain backup is applicable to any machine learning algorithm, but it can multiply network traffic, which may become a performance bottleneck. For some algorithms, communication can be reduced by aggregating before backing up. For example, in a gradient descent algorithm each service node aggregates the gradients from all compute nodes before updating the model parameters, so only the aggregated gradient needs to be backed up instead of the gradient from each compute node. Aggregation effectively reduces the amount of traffic needed for backup, but it increases communication latency; this latency, however, can be hidden by the asynchronous execution described earlier. When implementing aggregated chain backup, vector clocks can be used to record which nodes' data has been received. The vector clock lets us pinpoint exactly which data is missing and minimizes the impact of node changes. Because the communication interface of the parameter server is range based, all keys within a range can share the same vector clock, which compresses its storage overhead.
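The role of such a per-range vector clock can be sketched as follows (an illustrative assumption, not the parameter server's real bookkeeping): for one parameter key range, the server records the latest iteration received from each worker, so it can tell when an aggregate is complete and which pushes are still missing after a failure.

```cpp
// Sketch of a per-range vector clock: the server records, for one parameter
// key range, which iteration it has received from each worker. The aggregate
// for iteration t can be backed up once every worker's entry has reached t.
#include <algorithm>
#include <cstdio>
#include <vector>

class RangeVectorClock {
 public:
  explicit RangeVectorClock(int num_workers) : clock_(num_workers, 0) {}

  // Record that `worker` has pushed its gradient for iteration `t`.
  void Receive(int worker, int t) {
    clock_[worker] = std::max(clock_[worker], t);
  }

  // The aggregate for iteration `t` is complete when every worker reached `t`.
  bool Complete(int t) const {
    return *std::min_element(clock_.begin(), clock_.end()) >= t;
  }

  // Workers whose push for iteration `t` is still missing (e.g. after a failure).
  std::vector<int> Missing(int t) const {
    std::vector<int> missing;
    for (int w = 0; w < static_cast<int>(clock_.size()); ++w)
      if (clock_[w] < t) missing.push_back(w);
    return missing;
  }

 private:
  std::vector<int> clock_;  // clock_[w] = last iteration received from worker w
};

int main() {
  RangeVectorClock vc(/*num_workers=*/3);
  vc.Receive(0, 1);
  vc.Receive(2, 1);
  std::printf("iteration 1 complete? %s\n", vc.Complete(1) ? "yes" : "no");
  for (int w : vc.Missing(1)) std::printf("still waiting for worker %d\n", w);
  vc.Receive(1, 1);
  std::printf("iteration 1 complete? %s\n", vc.Complete(1) ? "yes" : "no");
  return 0;
}
```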

6. Summary and Discussion

Given the complexity of Caffe's dependencies and its poor portability, this paper introduces a cross-platform Caffe project based on a third-party library scheme. The design ideas behind Caffe's two kinds of I/O are analyzed, and the concrete implementation of Caffe's multithreaded I/O and multiple pre-buffering is explained. On this basis, the design and implementation of Caffe's multi-GPU data parallelism are analyzed. In view of the shortcomings of Caffe's native parallel scheme, this paper finds that the parameter server framework is well suited to large-scale distributed deep learning systems. The analysis of Caffe in this paper provides strong support for future work on multi-platform porting and multi-machine, multi-GPU distributed Caffe. With the rapid development of deep learning, research on multi-machine, multi-GPU distributed deep learning systems will go deeper, and future systems will combine data parallelism and model parallelism to obtain the advantages of both and achieve better performance.

References

[1] Evaluation of mainstream deep learning frameworks, https://github.com/zer0n/deepframeworks

[2] Dragon: A Light Deep Learning Framework, https://github.com/neopenx/dragon

[3] Caffe-windows, https://github.com/niuzhiheng/caffe#windows-installation

[4] Jeffrey Dean, Greg S. Corrado, Rajat Monga, Andrew Y. Ng, et al. Large Scale Distributed Deep Networks. In Advances in Neural Information Processing Systems (NIPS 2012), MIT Press, Cambridge, MA.

[5] Alex Smola et al. An Architecture for Parallel Topic Models. VLDB, 2010.

[6] James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Gregory R. Ganger, Garth Gibson, Kimberly Keeton, Eric Xing. Solving the Straggler Problem with Bounded Staleness. In Proceedings of the 14th Workshop on Hot Topics in Operating Systems (HotOS), 2013.

[7] ps-lite parameter server, https://github.com/dmlc/ps-lite

[8] Mu Li, David G. Andersen, Alex J. Smola, Kai Yu. Communication Efficient Distributed Machine Learning with the Parameter Server. NIPS, 2014.

