MXNet's ps-lite and parameter server principles


ps-lite is the parameter server communication framework implemented by the DMLC group, and it sits at the core of DMLC's other projects; for example, distributed training in its deep learning framework MXNet relies on ps-lite.

Parameter Server principle

In machine learning and deep learning, distributed optimization has become a prerequisite: a single machine can no longer keep up with the rapid growth of data and model parameters. In practice, training data can range from 1 TB to 1 PB, while the number of parameters learned during training can reach \(10^9\) to \(10^{12}\). These model parameters must be accessed frequently by all worker nodes, which raises several problems and challenges:

    • Accessing this massive number of parameters requires a large amount of network bandwidth;
    • Many machine learning algorithms are sequential: the next iteration can begin only after every worker has finished the current one. Combined with the large performance gaps between machines (the bucket principle: the slowest node sets the pace), this causes a great loss of performance;
    • In a distributed setting, fault tolerance is very important. Algorithms are often deployed to the cloud, where machines can be unreliable and jobs may be preempted.
Synchronous and asynchronous mechanisms in distributed systems

Figure 1: Under a synchronous mechanism, the system's running time is determined by the slowest worker node plus the communication time

Figure 2: Under an asynchronous mechanism, each worker starts its next iteration without waiting for the other workers. This improves efficiency, but, measured in number of iterations, it slows down the convergence rate.

Parameter Server architecture

In a parameter server, each server is responsible for only part of the parameters (the servers jointly maintain the globally shared parameters), and each worker handles only part of the data and the processing tasks.

Figure 3: Schematic diagram of the parameter server. Server nodes can communicate with one another; each server is responsible for the parameters assigned to it, and the server group jointly maintains the updates of all parameters. The server manager node maintains the consistency of metadata, such as the state of each node and the assignment of parameters. Worker nodes do not communicate with one another; they communicate only with their corresponding servers. Each worker group has a task scheduler that assigns tasks to the workers and monitors their progress. When a worker joins or leaves, the task scheduler reassigns its tasks.

The PS architecture consists of two parts: computational resources and the machine learning algorithm. The computational resources are divided into parameter server nodes and worker nodes:

    • Parameter server nodes store the parameters;
    • Worker nodes carry out the training.

The machine learning algorithm is likewise divided into two parts, parameters and training:

    • The parameter part is the model itself, and it comes with consistency requirements. The parameter server can itself be a cluster: for large-scale algorithms such as DNNs and CNNs, with parameters numbering in the billions, a cluster is naturally needed to store them all, so the parameter server also requires scheduling.
    • The training part is parallel in nature; otherwise the advantages of distributed machine learning could not be realized. Because of the parameter server, each compute node, after getting a new batch of data, pulls the latest parameters from the parameter server, computes its gradients, and then pushes the gradients back to the parameter server.

This design has two benefits:

    • By modularizing what machine learning systems have in common, algorithm implementations become more concise.
    • As a system-level shared platform, the PS structure can support many algorithms with shared optimizations.

As a result, the PS architecture has five features:

    • Efficient communication: asynchronous communication does not slow down computation;
    • Flexible consistency: the consistency requirements on the model are relaxed, allowing a trade-off between the algorithm's convergence speed and system performance;
    • Strong scalability: nodes can be added without restarting the framework;
    • Fault tolerance: recovery from machine failure is fast, and vector clocks allow the system to tolerate network errors;
    • Ease of use: globally shared parameters are represented as vectors and matrices, which can be optimized with high-performance multithreaded linear algebra libraries.
Push and Pull

In a parameter server, the parameters can be represented as a collection of (key, value) pairs. For example, in minimizing a loss function, the key is the feature ID and the value is its weight. For sparse parameters, nonexistent keys can be treated as having value 0.

Representing the parameters as key-value pairs is natural, easy to understand, and easy to program against. Workers and servers communicate through push and pull: a worker pushes its computed gradients to the servers, and then pulls the updated parameters back from the servers. To improve computational performance and bandwidth efficiency, the parameter server allows range push and range pull, which operate on a contiguous range of keys at once.
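
To make this concrete, below is a minimal sketch of a worker's push/pull cycle written against ps-lite's KVWorker. The exact signatures of Start(), Finalize(), and the KVWorker constructor vary across ps-lite versions, so treat this as illustrative, in the style of ps-lite's bundled test programs, rather than authoritative:

    #include <vector>
    #include "ps/ps.h"

    int main(int argc, char* argv[]) {
      ps::Start();  // role and connection info come from the DMLC_* env vars
      if (ps::IsWorker()) {
        ps::KVWorker<float> kv(0);                      // app id 0
        std::vector<ps::Key> keys = {1, 3, 5};          // keys must be sorted
        std::vector<float> grads = {0.1f, 0.2f, 0.3f};
        // Push() is asynchronous and returns a timestamp for the request.
        int ts = kv.Push(keys, grads);
        kv.Wait(ts);                                    // block until acknowledged
        // Pull the updated weights back for the same keys.
        std::vector<float> weights;
        kv.Wait(kv.Pull(keys, &weights));
      }
      ps::Finalize();
      return 0;
    }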

Tasks: synchronous and asynchronous

Tasks are likewise divided into synchronous and asynchronous; the difference is shown below:

Figure 4: If iteration 1 can start only after iteration 0's computation, push, and pull have all completed, the tasks are synchronous; otherwise they are asynchronous. Asynchrony improves system efficiency (a lot of waiting is saved), but it may reduce the algorithm's convergence rate.

So there is a trade-off between system performance and the algorithm's convergence rate, and both sides must be considered:

    • How sensitive the algorithm is to inconsistency in the parameters;
    • How correlated the features of the training data are;
    • The storage capacity of the hard disk.

The parameter server provides users with several task-dependency models to suit different scenarios:

Figure 5: Three different dependency models
    • Sequential: this is the synchronous case. Tasks are ordered, and each task can start only after the previous task has completed.
    • Eventual: the opposite of sequential. Tasks are unordered, and each task completes independently.
    • Bounded delay: a balance between sequential and eventual. A maximum delay \(\tau\) is set: a new task can start only after all tasks issued more than \(\tau\) steps earlier have completed (a small sketch follows this list). The extreme cases:
      • \(\tau = 0\) reduces to sequential;
      • \(\tau = \infty\) reduces to eventual.
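
A small sketch of the bounded-delay rule (the function and parameter names here are assumptions for illustration, not part of ps-lite's API):

    // Task t may start only if every task issued more than tau tasks
    // earlier has already completed.
    bool CanStartTask(int t, int tau, int oldest_incomplete) {
      // tau == 0 reduces to sequential; an unbounded tau reduces to eventual.
      return oldest_incomplete >= t - tau;
    }
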
The algorithm under PS

Algorithm 1 is the straightforward version, without optimizations; its flow is as follows:

Figure 6: Algorithm 1

Figure 7: Flow of algorithm 1

Figure 8: Algorithm 3, the optimized version of algorithm 1

The KKT filter in algorithm 3 is an example of a user-defined filter:
for machine learning optimization problems such as gradient descent, not every computed gradient is valuable to the final optimization. Users can filter out unnecessary transfers through custom rules and thus further reduce bandwidth consumption; see the sketch after this list:

    1. Sending very small gradient values is inefficient:
      the filter can therefore be configured to send a gradient only when its value is large enough;
    2. Updating values that are already close to optimal is inefficient:
      a value can therefore be sent only when it is non-optimal, which can be judged via the KKT conditions.
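
Below is a minimal sketch of the first rule, a "significant gradient" filter. The struct and its names are assumptions for illustration; ps-lite's actual filters differ in detail:

    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Keep only the (key, gradient) pairs whose magnitude exceeds a
    // threshold; the rest are not worth the bandwidth to send.
    struct SignificanceFilter {
      float threshold;
      void Filter(const std::vector<uint64_t>& keys,
                  const std::vector<float>& grads,
                  std::vector<uint64_t>* out_keys,
                  std::vector<float>* out_grads) const {
        for (size_t i = 0; i < keys.size(); ++i) {
          if (std::fabs(grads[i]) > threshold) {
            out_keys->push_back(keys[i]);
            out_grads->push_back(grads[i]);
          }
        }
      }
    };
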
ps-lite implementation

That covers the principles of the parameter server; now let's see how they are implemented. ps-lite is DMLC's implementation of the parameter server and one of the core components of MXNet.

ps-lite roles

ps-lite has three roles: worker, server, and scheduler. Their relationship is shown below:

Figure 9: Diagram of the three roles

A worker node performs the computation on the parameters: it pushes gradients to the servers and pulls the updated parameters back from the servers.
A server node manages the parameters pushed by the worker nodes and "merges" them for use by every worker.
The scheduler node manages the state of the worker and server nodes; workers and servers establish their connections through the scheduler.
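
In ps-lite, the same binary typically plays all three roles, selected at runtime by the DMLC_ROLE environment variable (set by the run script shown later). A minimal sketch, following the style of ps-lite's test programs (exact Start/Finalize signatures vary across versions):

    #include "ps/ps.h"

    int main(int argc, char* argv[]) {
      ps::Start();  // reads DMLC_ROLE etc. and connects to the scheduler
      if (ps::IsScheduler()) {
        // The scheduler only coordinates nodes; no user code is needed.
      } else if (ps::IsServer()) {
        // Register a handler here that merges pushed values (omitted).
      } else if (ps::IsWorker()) {
        // Push gradients and pull parameters here (omitted).
      }
      ps::Finalize();  // barrier, then clean shutdown
      return 0;
    }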

Important classes

Figure 10: Diagram of the important classes
  • Postoffice is a global management class created as a singleton. It holds the configuration of the current node, such as the node's type (server, worker, or scheduler) and its node id, and handles the rank-to-node-id conversion for workers and servers.

  • Van is the class responsible for communication and is a member of Postoffice. Van keeps a std::unordered_map, senders_, that maps node ids to connections. Van only defines the interface; the implementation, ZMQVan, relies on ZMQ. The Van class establishes the connections between nodes (such as the connection between a worker and the scheduler) and starts a local receiving thread that listens for incoming messages.

  • Customer is used for communication, tracking requests and responses. Each connection corresponds to one Customer instance, and the connection's id is the same as the Customer instance's id.

  • SimpleApp is a base class that provides sending of simple messages (an int head plus a string body) and registration of message handler functions. It has two derived classes.

  • KVServer is a derived class of SimpleApp used to hold key-value data. Its Process() function is registered with the Customer object; when the Customer's receiving thread receives a message, Process() is called to handle the data.

  • KVWorker is a derived class of SimpleApp that mainly provides Push() and Pull(). Both ultimately call the Send() function, and Send() slices the KVPairs, because each server holds only a subset of the parameters, so the resulting SlicedKVPairs must be sent to different servers. The slicing function can be overridden by the user; the default is DefaultSlicer. Each SlicedKVPairs is wrapped into a Message object and sent with Van::Send().

  • KVPairs encapsulates the key-value structure and also contains a lengths field.

  • SArray is a shared array: it shares its data like a smart pointer while exposing an interface like a vector.

  • Node encapsulates information about a node, such as its role, IP, port, and whether it is a recovered node.

  • Control encapsulates control information, such as the command type, the destination nodes, the id of the barrier_group, and the signature.

  • Meta encapsulates the metadata: sender, recipient, timestamp, and whether the message is a request or a response.

  • Message is what is actually sent: besides the metadata, it carries the data being transferred. The sketch below summarizes these structures.
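
An abridged, illustrative rendering of these structures; the field lists are simplified, and the authoritative definitions are in ps-lite's ps/internal/message.h:

    #include <cstdint>
    #include <string>
    #include <vector>

    struct Node {        // identity of a node
      enum Role { SERVER, WORKER, SCHEDULER };
      Role role;
      int id;            // node id
      std::string hostname;
      int port;
      bool is_recovery;  // is this a recovered node?
    };

    struct Control {     // control-plane information
      enum Command { EMPTY, TERMINATE, ADD_NODE, BARRIER, ACK };
      Command cmd;             // command type
      std::vector<Node> node;  // nodes the command concerns
      int barrier_group;       // which group a BARRIER applies to
      uint64_t msg_sig;        // message signature
    };

    struct Meta {        // metadata carried by every message
      int head;          // app-defined message type
      int timestamp;     // matches a request with its response
      int sender;        // node id of the sender
      int recver;        // node id of the receiver
      bool request;      // request or response?
      Control control;   // non-empty => this is a control message
    };

    struct Message {     // the unit that Van actually sends
      Meta meta;
      std::vector<std::string> data;  // payload: keys, values, lengths
    };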

Run the script

To see how ps-lite works in practice, let's look at the script used to run it locally:

    #!/bin/bash
    # set -x
    if [ $# -lt 3 ]; then
        echo "usage: $0 num_servers num_workers bin [args..]"
        exit -1;
    fi

    export DMLC_NUM_SERVER=$1
    shift
    export DMLC_NUM_WORKER=$1
    shift
    bin=$1
    shift
    arg="$@"

    # start the scheduler
    export DMLC_PS_ROOT_URI='127.0.0.1'
    export DMLC_PS_ROOT_PORT=8000
    export DMLC_ROLE='scheduler'
    ${bin} ${arg} &

    # start servers
    export DMLC_ROLE='server'
    for ((i=0; i<${DMLC_NUM_SERVER}; ++i)); do
        export HEAPPROFILE=./S${i}
        ${bin} ${arg} &
    done

    # start workers
    export DMLC_ROLE='worker'
    for ((i=0; i<${DMLC_NUM_WORKER}; ++i)); do
        export HEAPPROFILE=./W${i}
        ${bin} ${arg} &
    done

    wait

This script does two main things: it sets the environment variables for each role, and it launches several processes locally, one per role. So a ps-lite job consists of a number of different processes (programs) working together, and ps-lite uses environment variables to configure each role.
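
For example, assuming the script is saved as local.sh and ps-lite's test binaries have been built (a hypothetical invocation for illustration), the following starts 2 servers and 2 workers running test_simple_app:

    ./local.sh 2 2 ./test_simple_app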

test_simple_app flow

test_simple_app.cc is a very simple app, and the more complex programs follow the same principles, so let's walk through how this program runs. First, let's look at how the worker (W), server (S), and scheduler (H) connect during startup, before the Customer handles any ordinary messages. W\S\H below denote the processing flow inside each role's process after the script above launches it.

    • W\S\H: initialize SimpleApp → new Customer (binding the process function) → the Customer starts a receiving thread
    • W\S\H: initialize the static Postoffice (the same Postoffice is used globally) → create the Van used for communication → read the configuration from the environment variables and determine the node's role
    • W\S\H: Start() → Van::Start(), initializing my_node_ and scheduler_
    • W\S: bind a port and connect to the same scheduler
    • W\S: send a message to the specified id
    • W\S\H: the receiving thread in Van runs
    • H: receives the information and sends a reply
    • W\S: the message is received
    • W\S\H: Finalize()

The Customer processes ordinary messages as follows (see the sketch after the list):

    • H: app->Request() → put the request into tracker_ → Send(msg) → app->Wait() [wait for the message to be answered]
    • W/S: receive the message and put it into recv_queue_
    • W/S: the Customer receiving thread takes the message → calls recv_handle_ → Process(recv) [handle the message] → request_handle_(recv) → ReqHandle() → Response() [send back the reply]
    • H: receives the reply → it enters recv_queue_ and is handled by the Customer receiving thread
    • H: when tracker_.first == tracker_.second, app->Wait() is released
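
An abridged sketch of this flow, based on ps-lite's test_simple_app.cc (the SimpleApp constructor and the Start()/Finalize() signatures vary across ps-lite versions):

    #include "ps/ps.h"
    using namespace ps;

    // Registered handler: the Customer's receiving thread calls this for
    // every incoming request, and it immediately sends back a response.
    void ReqHandle(const SimpleData& req, SimpleApp* app) {
      app->Response(req);
    }

    int main(int argc, char* argv[]) {
      Start();
      SimpleApp app(0);                   // app id 0
      app.set_request_handle(ReqHandle);
      if (IsScheduler()) {
        // Broadcast a request (head = 1, body = "test") to all nodes,
        // then block until every response has come back.
        int recver = kScheduler + kServerGroup + kWorkerGroup;
        int ts = app.Request(1, "test", recver);
        app.Wait(ts);
      }
      Finalize();
      return 0;
    }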


"Formatting problems that prevent the crawler from being reproduced-links":
Http://www.cnblogs.com/heguanyou/p/7868596.html

Mxnet's Ps-lite and parameter server principles

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.