Adam: A large-scale distributed machine learning framework

Introduction

When reprinting, please credit the source: http://blog.csdn.net/stdcoutzyx/article/details/46676515

It has been a long time since I wrote a blog post. I remember reading an interview with Andrew Ng in which he said that if you read three papers a week, after a few years you will inevitably become someone very familiar with that field.

Unfortunately, busy as I am, I cannot even manage that. Still, my intention now is to do what I can, even if I only read one paper a week. Hu Shi once said: "Why fear that truth is infinite? Every inch of progress brings its own inch of joy." The difference is that I have not reached the point of seeking truth; I just want to see where this technology is heading.

I suspect that many people like me, who never formally studied machine learning, often feel that the theoretical derivations are out of reach. I have frequently wanted to sit down and work through the math properly, but it never amounted to much. I have to admit that some things really do need to be learned step by step under guidance; studying blindly, like a headless fly, quickly drains one's energy and enthusiasm. I suppose that is what doing a PhD is for.

For someone like me, though, who just wants to be a quiet programmer, there is another way to look at it: being a good programmer does not actually require that much theory, and understanding how some algorithms are implemented may be more useful. That is why I find this paper quite practical: it does not make a big theoretical improvement to boost accuracy, but instead presents the implementation of a distributed machine learning system.

Adam

For a report on Adam, see [3].

This post is my reading notes on the paper; the figures are taken from the paper itself, which is listed as reference [1].

Adam is Microsoft Research's deep learning project. It still uses convolutional neural networks for image classification, and the results improve considerably, but judging from the paper, Adam leans more toward the implementation of a distributed framework than toward theoretical innovation. Since Alex Krizhevsky and Hinton published [2] in 2012, the core of convolutional neural networks has not actually changed much; what Adam does is push convolutional neural networks from academia into industry, and I think that contribution is no small thing.

So why is a framework like Adam needed?

    1. There are many machine learning methods, but only convolutional neural networks can really handle images, because it is so hard to hand-craft features for image and speech data.
    2. Convolutional neural networks appeared long ago; the reason they are only now showing their power is mainly the improvement in computing capability.
    3. To get good results you need big data plus a big model, yet computing power has not kept up. To resolve the contradiction between the growing computational demands of large models and today's machines, Adam came along: if the conditions are not there, create them, by hooking dozens or hundreds of machines together effectively to form a powerful computing resource.

All right, enough preamble; let's get to the technology. If there are mistakes, corrections are welcome.

Architecture

The Adam framework is still based on the Multi-Spert architecture, which broadly means dividing the cluster into the following parts:

    1. Data serving machines: store and back up the data, and provide data to the compute nodes.
    2. Model training machines: train the model and update its parameters.
    3. Parameter servers: maintain the shared model; when a compute node finishes a computation, it sends parameter-update requests to the parameter servers.
Data Serving

Dedicated servers are used to provide the data. While serving the data, they also apply some transformations to the images in advance (flipping, skewing, and so on).

Data serving is accelerated by preloading images into memory: a background process uses asynchronous I/O to prefetch the images that are about to be accessed, so the images that will be needed next are already resident in memory.
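
The paper does not show code for this, but the prefetching pattern is easy to sketch. Everything below (the load_image and transform helpers, the queue depth, the single background thread) is my own illustration of the idea, not Adam's actual implementation:

    import queue
    import threading

    def prefetching_loader(paths, load_image, transform, depth=64):
        """Yield transformed images while a background thread keeps loading the
        next ones, so the consumer always finds its data already in memory."""
        buf = queue.Queue(maxsize=depth)   # bounded queue limits the memory footprint
        done = object()                    # sentinel marking the end of the stream

        def worker():
            for p in paths:
                img = load_image(p)        # I/O happens off the training thread
                buf.put(transform(img))    # flips, skews, etc. are applied ahead of time
            buf.put(done)

        threading.Thread(target=worker, daemon=True).start()

        while True:                        # consumer side: the training loop
            item = buf.get()
            if item is done:
                break
            yield item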

Model Training

Adam still trains the model Alex Krizhevsky proposed: five convolutional layers followed by three fully connected layers. Within Adam, this model is partitioned vertically across the machines of a model replica.
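
As a rough illustration of what a vertical (model-parallel) split means, here is a minimal NumPy sketch for one fully connected layer; the layer sizes and the two-worker split are made up:

    import numpy as np

    rng = np.random.default_rng(0)

    # One fully connected layer whose weights are split "vertically" (by output
    # columns) across two hypothetical worker machines of the same model replica.
    W = rng.standard_normal((2048, 1024))          # full layer: 2048 inputs -> 1024 outputs
    W_left, W_right = np.split(W, 2, axis=1)       # each worker holds a 2048 x 512 slice

    x = rng.standard_normal(2048)                  # input activation vector

    # Each worker computes only its slice of the output; concatenating the slices
    # (a network transfer in the real system) reproduces the unsplit layer.
    y = np.concatenate([x @ W_left, x @ W_right])

    assert np.allclose(y, x @ W)

The concatenation step is where cross-machine communication comes in, which is what the later section on reducing memory copies is about.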

Multi-threaded Training

On a single machine, the model is trained by multiple threads, and these threads share one set of model parameters.

During a run, each thread is assigned different images to train on.

However, each thread's computation context (for both the forward and backward passes) is kept separate, and these contexts are pre-allocated to avoid contention on heap locks.

Each thread's context and its buffers for intermediate results are allocated in a NUMA-aware way, to reduce traffic across the memory bus.
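
Python cannot express NUMA placement, so the sketch below only shows the pre-allocation part: each thread gets its own activation and gradient buffers allocated once, before training starts, so nothing is allocated inside the hot loop. The layer sizes and thread count are invented, and real NUMA-aware placement would need OS-level support (for example libnuma), which is outside this sketch:

    import numpy as np

    LAYER_SIZES = [4096, 4096, 1000]   # hypothetical layer widths
    NUM_THREADS = 8

    class ThreadContext:
        """Per-thread scratch space, allocated once before training starts."""
        def __init__(self, sizes):
            self.activations = [np.empty(n) for n in sizes]   # forward-pass buffers
            self.gradients = [np.empty(n) for n in sizes]     # backward-pass buffers

    # One context per training thread. In Adam these buffers would additionally be
    # placed on the NUMA node of the core running the thread.
    contexts = [ThreadContext(LAYER_SIZES) for _ in range(NUM_THREADS)]

    def train_step(ctx, model, image):
        # The forward and backward passes write into ctx.activations and
        # ctx.gradients in place, so nothing is allocated inside the hot loop.
        ...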

Fast Weight Updates

To speed up training, local updates to the shared model parameters are performed without locks: each thread computes its weight updates and applies them to the model directly.
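
This is essentially lock-free, Hogwild-style updating of a shared parameter block. The sketch below is conceptual (Python's GIL serializes the threads in practice) and uses a toy squared-error gradient in place of real backpropagation; the point is only that the threads write to shared_W without taking any lock:

    import threading
    import numpy as np

    rng = np.random.default_rng(0)
    shared_W = rng.standard_normal((256, 10))   # one shared block of model parameters
    lr = 0.01

    def compute_gradient(W, x, y):
        """Toy squared-error gradient, standing in for real backpropagation."""
        err = x @ W - y
        return np.outer(x, err)

    def worker(images, labels):
        for x, y in zip(images, labels):
            grad = compute_gradient(shared_W, x, y)
            # No lock here: concurrent writes can interleave, so some updates land
            # on stale weights or get partially overwritten, yet training converges.
            shared_W[...] -= lr * grad

    # Hypothetical data shards, one per thread.
    shards = [(rng.standard_normal((100, 256)), rng.standard_normal((100, 10)))
              for _ in range(4)]
    threads = [threading.Thread(target=worker, args=shard) for shard in shards]
    for t in threads:
        t.start()
    for t in threads:
        t.join()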

This of course introduces inconsistencies, but experiments show that training still converges with such updates.

A likely reason is that neural networks are resilient enough to absorb the noise caused by these inconsistencies. The paper later suggests that this noise may even improve the model's ability to generalize.

Reducing Memory Copies

Because activations have to be passed from layer to layer, and the model is partitioned across machines, much of this data transfer is non-local.

For local transfers, pass pointers instead of values.

For non-local transfers, a custom network library built on the Windows socket API is used to accelerate the data transfer; the paper gives no further details about it.

Static model partitioning is used to optimize the amount of data that has to be passed between model partitions. (What exactly is static model partitioning?)

Reference counting is used to make asynchronous network I/O safe.
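
The paper does not describe the network library's API, so the sketch below only illustrates the two ideas mentioned here: local transfers hand over a reference (a NumPy view, no copy), and buffers used by asynchronous sends carry a reference count so the memory is not recycled until every outstanding send completes. All names are invented:

    import numpy as np

    def pass_locally(activations):
        """Local transfer: the next layer receives a view into the same memory, not a copy."""
        return activations[:]            # slicing a NumPy array returns a view

    class RefCountedBuffer:
        """Buffer handed to asynchronous sends; recycled only when the count reaches zero."""
        def __init__(self, data):
            self.data = data
            self.refs = 0

        def acquire(self):               # called when an asynchronous send is issued
            self.refs += 1
            return self.data

        def release(self):               # called from the send-completion callback
            self.refs -= 1
            if self.refs == 0:
                self.data = None         # now safe to recycle or overwrite

    buf = RefCountedBuffer(np.zeros(1024))
    payload = buf.acquire()              # a hypothetical async send starts here
    # ... some time later, the completion callback fires ...
    buf.release()                        # only now may the memory be reused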

Memory System Optimization

The model is partitioned until the working set of a single partition fits into the L3 cache. Unlike main memory, the L3 cache is not constrained by memory bandwidth, so the floating-point units can be used more effectively.

Both the forward and backward passes have good computational locality, so it pays to pack the data and compute with matrix operations, which exploit that locality and keep the floating-point units busy.
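
A small NumPy illustration of the packing idea: instead of propagating image vectors one at a time (a series of matrix-vector products), several images are packed into one matrix so a single matrix-matrix multiply does the work. The sizes are arbitrary:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((1024, 1024))      # weights of one fully connected layer
    images = rng.standard_normal((32, 1024))   # a pack of 32 input vectors

    # Unpacked: 32 separate matrix-vector products.
    out_one_by_one = np.stack([x @ W for x in images])

    # Packed: one matrix-matrix product over the whole pack, which reuses W
    # while it is hot in cache and keeps the floating-point units busy.
    out_packed = images @ W

    assert np.allclose(out_one_by_one, out_packed)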

Mitigating the Impact of Slow machines

Even when the machines have identical configurations, some will still be faster than others during model computation.

To prevent threads on fast machines from being dragged down by threads on slow machines, each thread is allowed to process multiple images in parallel, using a dataflow framework that triggers processing of an image as soon as its data arrives.

At the end of each epoch it is normally necessary to wait for all images to be processed, compute the error rate on the validation set, and then decide whether another epoch is needed. The trouble is that this means waiting for the slowest machine. A strategy was therefore devised: once 75% of the images have been processed, the model is evaluated and the decision about the next epoch is made. Randomization is used so that it is not the same 25% of images that gets skipped each time. This strategy speeds training up by more than 20%.
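
A minimal sketch of that end-of-epoch policy, assuming a simple sequential driver (in Adam the image processing itself is asynchronous and dataflow-triggered); the process_image and evaluate callables are hypothetical:

    import random

    def run_epoch(images, process_image, evaluate, threshold=0.75):
        """Process a shuffled epoch, stop waiting once `threshold` of the images
        are done, then evaluate on the validation set to decide what to do next."""
        order = list(images)
        random.shuffle(order)               # a different 25% gets skipped each epoch
        target = int(threshold * len(order))

        done = 0
        for img in order:
            process_image(img)              # in Adam this is asynchronous and dataflow-driven
            done += 1
            if done >= target:              # do not wait for the slowest stragglers
                break

        return evaluate()                   # validation error decides whether to continue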

Parameter Server Communication

Two methods of updating the model are designed:

    • Compute the parameter updates locally; after every k images, send an update request to the parameter server, which then applies the updates directly. This method suits the convolutional layers, because their parameters are shared and relatively few. For the fully connected layers there are far too many weights to update, so this method is too expensive and the following one is used instead.
    • Instead of sending the weight updates, send the activation vectors and the error-gradient vectors, and let the parameter server perform the matrix computation that produces the weight update. This shrinks the data transferred from M×N to K×(M+N), and it has the second benefit of moving some computation onto the parameter server, which improves the balance of the system (see the sketch after this list).
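
To see why the second method saves traffic: for a fully connected layer with M inputs and N outputs, a batch of K images yields a K×M activation matrix and a K×N error-gradient matrix, and their product is exactly the M×N weight-update matrix. Sending the two factors instead of the product costs K×(M+N) values instead of M×N, and the parameter server recovers the update with one matrix multiply. A NumPy sketch with made-up sizes:

    import numpy as np

    M, N, K = 2048, 2048, 32                    # layer input/output widths, batch size
    rng = np.random.default_rng(0)

    activations = rng.standard_normal((K, M))   # layer inputs, one row per image
    error_grads = rng.standard_normal((K, N))   # error gradients, one row per image

    # Method 1: compute the weight update locally and ship the whole M x N matrix.
    update_local = activations.T @ error_grads

    # Method 2: ship only the two factors and let the parameter server do the
    # matrix multiplication itself.
    sent_a, sent_g = activations, error_grads   # the "network transfer"
    update_on_server = sent_a.T @ sent_g

    assert np.allclose(update_local, update_on_server)
    print("values sent, full update:", M * N)        # 4,194,304
    print("values sent, two factors:", K * (M + N))  # 131,072
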
Global Parameter Server

The parameter server is essentially a standard distributed key-value store, but for convolutional neural network training the parameters are updated so frequently that it needs further optimization.

Throughput optimizations

The model parameters are divided into shards of about 1 MB, which are hashed into buckets; the buckets are then spread evenly across the parameter servers. This improves spatial locality and makes load balancing easy during updates.
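
A sketch of the sharding scheme as I read it; the shard size, the hash function, the bucket count, and the server count are all assumptions used only to show the shard → bucket → server mapping:

    import hashlib

    SHARD_BYTES = 1 << 20                       # roughly 1 MB per shard
    FLOAT_BYTES = 4
    SHARD_LEN = SHARD_BYTES // FLOAT_BYTES      # parameters per shard

    def assign_shards(num_params, num_servers, num_buckets=1024):
        """Map shard id -> bucket -> parameter server."""
        num_shards = (num_params + SHARD_LEN - 1) // SHARD_LEN
        placement = {}
        for shard_id in range(num_shards):
            digest = hashlib.md5(str(shard_id).encode()).digest()
            bucket = int.from_bytes(digest[:4], "little") % num_buckets
            placement[shard_id] = bucket % num_servers   # buckets spread over the servers
        return placement

    # Example: about 2 billion parameters spread over 20 parameter servers.
    placement = assign_shards(num_params=2_000_000_000, num_servers=20)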

Parameter updates are applied in batches, which benefits locality and relieves pressure on the L3 cache.

The parameter server uses SSE/AVX instructions, and all processing is NUMA-aware. (I am not entirely sure about this part; it is hardware-specific.)

Lock-free queues and lock-free hash tables are used to speed up network transfers, update processing, and disk I/O.

Lock-free memory allocation is implemented with memory pools.

Delayed Persistence

To increase throughput, persistence is decoupled from the update path: the parameter store is modeled as a write-back cache, and dirty blocks are flushed to storage asynchronously in the background.

Losing part of the data is tolerable because DNN models are resilient, and lost updates can easily be recovered by further training.

Delayed persistence also allows compressed write-backs: since parameter updates are additive, a cached block can accumulate many rounds of updates before a single write-back.
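
A minimal sketch of the write-back idea, with invented names: updates are applied in memory and the shard is merely marked dirty, and a background flusher persists dirty shards later, so many rounds of additive updates collapse into one write:

    import threading
    import time

    class WriteBackParameterStore:
        """In-memory parameter shards with delayed (asynchronous) persistence."""

        def __init__(self, shards, persist, flush_interval=5.0):
            self.shards = shards            # shard_id -> parameter block
            self.dirty = set()              # shards updated since the last flush
            self.persist = persist          # writes one shard to durable storage
            self.lock = threading.Lock()
            threading.Thread(target=self._flusher, args=(flush_interval,),
                             daemon=True).start()

        def apply_update(self, shard_id, delta):
            self.shards[shard_id] += delta  # the update itself is served from memory
            with self.lock:
                self.dirty.add(shard_id)    # persistence is decoupled from this path

        def _flusher(self, interval):
            while True:
                time.sleep(interval)
                with self.lock:
                    to_flush, self.dirty = self.dirty, set()
                for sid in to_flush:        # many additive updates become one write-back
                    self.persist(sid, self.shards[sid])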

Fault tolerant operation

Each parameter shard is kept in three copies. The mapping of shards to parameter servers is stored on a parameter server controller machine, and the controller and the parameter servers stay in sync through heartbeats.

When parameters are updated, the backup servers stay in sync with the primary server through heartbeats.

When a machine's heartbeat is lost, the controller elects a new primary server.
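
A sketch of that failover logic, with hypothetical names and a made-up timeout: the controller tracks heartbeats for the replicas of each shard and, when the primary's heartbeat times out, promotes a surviving backup to primary:

    import time

    HEARTBEAT_TIMEOUT = 10.0   # seconds of silence before a server is presumed dead

    class ParameterServerController:
        def __init__(self, replicas):
            # shard_id -> list of servers holding it; the first entry is the primary.
            self.replicas = replicas
            self.last_beat = {s: time.time()
                              for servers in replicas.values() for s in servers}

        def heartbeat(self, server):
            self.last_beat[server] = time.time()

        def check_failures(self):
            now = time.time()
            for shard_id, servers in self.replicas.items():
                primary = servers[0]
                if now - self.last_beat[primary] > HEARTBEAT_TIMEOUT:
                    # Primary lost: promote the first backup that is still alive.
                    alive = [s for s in servers[1:]
                             if now - self.last_beat[s] <= HEARTBEAT_TIMEOUT]
                    if alive:
                        servers.remove(alive[0])
                        servers.insert(0, alive[0])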

Evaluation

Experiments were run on all 22,000 classes of ImageNet, and top-1 accuracy was used for evaluation.

Model Training

Without the parameter server, the model training system is benchmarked on its own to measure training throughput. The observed speed-up is super-linear: more machines means more total memory, so more of the data stays in memory and access is naturally faster.

Parameter Server

With the parameter server added, the speed-up is measured again. The speed-up is smaller than with purely local computation, but this setup overcomes the bottleneck that the local weight-update method runs into at 8 machines.

Scaling with more workers

Next, the number of machines used for training is increased while the number of parameters per machine is held constant, so the model grows along with the number of machines. The figure shows that the extra network traffic does not hurt the speed-up.

Scaling with more replicas

Then the model size is held constant while the number of model replicas is increased, that is, the degree of data parallelism grows, and the resulting acceleration is measured.

Performance

The results are as follows: the improvement is substantial, and the larger the model, the better the accuracy.

Summary

The main contributions of the paper:

    1. Optimizing and balancing computation and communication through whole-system co-design, minimizing memory bandwidth usage and inter-machine communication in the distributed model.
    2. Exploiting machine learning's tolerance of inconsistency to improve both accuracy and cluster scalability, through multi-threaded model training, lock-free updates, asynchronous batched updates, and so on. In addition, asynchronous training appears to help the accuracy of the algorithm.
    3. Demonstrating that system performance, scalability, and asynchronous training all help improve model accuracy. With fewer than 30 machines, a model with more than 2 billion connections was trained, roughly doubling the previous accuracy on the 22,000-class ImageNet task; and with enough data, the larger the model, the better the results.
Resources

[1]. Chilimbi T, Suzue Y, Apacible J, et al. Project Adam: Building an Efficient and Scalable Deep Learning Training System // 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 2014: 571-582.

[2]. Krizhevsky A, Sutskever I, Hinton G E. ImageNet Classification with Deep Convolutional Neural Networks // Advances in Neural Information Processing Systems. 2012: 1097-1105.

[3]. http://www.tuicool.com/articles/IbAZFb
