How to choose the right distributed machine learning platform


Introduction: Machine learning and deep learning have been hot topics in recent years, and choosing among the many machine learning platforms is a vexing problem. This article surveys the design approaches used in distributed machine learning (ML) platforms and proposes directions for future research.


This article compares the design approaches of distributed machine learning platforms and offers guidance on choosing among them. It is joint work with Kuo Zhang and Salem Alqahtani. We wrote it in the fall of 2016 and submitted it to ICCCN '17 (Vancouver).

Machine learning, and especially deep learning (DL), has recently seen success in speech recognition, image recognition, natural language processing, and recommendation/search engines. These technologies are widely used in self-driving cars, digital health systems, CRM, advertising, and the Internet of Things. As capital flows into these areas to drive further advances, many distributed machine learning platforms have emerged.

Because of the huge datasets and model sizes involved in training, these platforms are typically distributed ML platforms that run on the order of 10 to 100 workers in parallel to train a single model. It is estimated that the vast majority of data-center tasks will be machine learning tasks in the near future.

We therefore decided to study these ML platforms from a distributed-systems perspective and analyze their communication and control bottlenecks. We also examined their fault tolerance and ease of programming.

We classify distributed ML platforms into three basic design approaches:


    1. Basic dataflow

    2. Parameter-server model

    3. Advanced dataflow


We briefly describe each approach, using Apache Spark as the example of the basic dataflow approach, PMLS (Petuum) as the example of the parameter-server model, and TensorFlow and MXNet as examples of the advanced dataflow model. We provide a few performance evaluation results here; for more, please refer to the paper. Unfortunately, as a small team from academia, we could not perform evaluations at scale.

At the end of this article, I present a summary and recommendations for future work on distributed ML platforms. If you already have experience with these distributed ML platforms, feel free to skip to that summary.


Spark


In Spark, a computation is modeled as a directed acyclic graph (DAG), where each vertex represents a Resilient Distributed Dataset (RDD) and each edge represents an operation on an RDD. An RDD is a collection of objects divided into logical partitions that can be held in memory or swapped to disk.

In the DAG, an edge E from vertex A to vertex B means that RDD B is the result of applying operation E to RDD A. There are two kinds of operations: transformations and actions. Transformations (for example, map, filter, join) apply an operation to an RDD and produce a new RDD; actions (for example, count, collect) instead return a result to the driver.


A Spark user thus models the computation as a DAG that transforms RDDs and runs actions on them. The DAG is compiled into stages, and each stage is executed as a series of tasks that run in parallel (one task per partition). Narrow dependencies allow efficient execution, while wide dependencies introduce bottlenecks because they break the execution pipeline and require communication-heavy shuffle operations.
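As a concrete illustration, here is a minimal PySpark sketch (not from the paper; the data and operations are made up) showing narrow transformations, a wide shuffle-inducing transformation, and the action that finally triggers execution:

    from pyspark import SparkContext

    sc = SparkContext(appName="dag-example")

    words = sc.parallelize(["spark", "builds", "a", "dag", "of", "rdds"])

    # Narrow dependencies: each output partition depends on one input partition.
    pairs = words.map(lambda w: (len(w), 1)).filter(lambda kv: kv[0] > 1)

    # Wide dependency: reduceByKey shuffles data across partitions,
    # so the DAG scheduler inserts a stage boundary here.
    counts = pairs.reduceByKey(lambda a, b: a + b)

    # Nothing has executed yet; this action triggers the whole DAG.
    print(counts.collect())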



Spark achieves distributed execution by dividing the DAG into stages. The figure shows the master architecture: the driver contains two scheduler components, the DAG scheduler and the task scheduler, and it assigns tasks to and coordinates the workers.


Spark was designed for general data processing, not specifically for machine learning; however, using MLlib, you can do ML on Spark. In the basic setup, Spark stores the model parameters in the driver node, and the workers communicate with the driver to update the parameters after each iteration. For large-scale deployments, the model parameters may not fit in the driver and would instead be maintained as an RDD. This introduces a lot of overhead, because a new RDD must be created in each iteration to hold the updated parameters. Updating the model involves shuffling data across machines, which limits Spark's scalability. This is where the basic dataflow model (the DAG) in Spark falls short: Spark does not support the iteration that ML training requires well.
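The pattern described above can be sketched as follows. This is a toy logistic-regression loop (not the paper's benchmark code) that keeps the weights at the driver and broadcasts them each iteration, which is exactly the per-iteration communication that limits scalability:

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="lr-driver-params")

    # Toy dataset of (features, label) pairs, partitioned across workers.
    data = sc.parallelize(
        [(np.random.rand(10), np.random.randint(2)) for _ in range(10000)]
    ).cache()
    n = data.count()

    w = np.zeros(10)  # model parameters live at the driver

    for _ in range(100):
        w_b = sc.broadcast(w)  # driver -> workers, every iteration
        grad = data.map(
            lambda p: (1.0 / (1.0 + np.exp(-np.dot(w_b.value, p[0]))) - p[1]) * p[0]
        ).reduce(lambda a, b: a + b)  # workers -> driver, every iteration
        w -= 0.1 * grad / n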


PMLS


PMLS was designed specifically for ML. It introduces the parameter-server (PS) abstraction to serve the iteration-intensive ML training process.


The PS (shown as the green boxes in the figure) is maintained as a distributed in-memory key-value store holding the model parameters. It is replicated and sharded: each node serves as the primary for one shard of the model (the parameter space) and as a replica for other shards. The PS therefore scales well with the number of nodes.

The PS nodes store and update model parameters and respond to worker requests. Each worker requests the latest model parameters from its local PS copy and performs computation over the dataset partitions assigned to it.
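To make the sharding concrete, here is a toy sketch (an illustration under assumptions, not PMLS internals) of how a sharded key-value parameter server can map parameter keys to owning nodes and apply additive updates:

    class ShardedPS:
        def __init__(self, num_shards):
            # Each dict stands in for the key-value store on one PS node.
            self.shards = [{} for _ in range(num_shards)]

        def _shard(self, key):
            # Hash-partition the parameter space across PS nodes.
            return hash(key) % len(self.shards)

        def get(self, key):
            return self.shards[self._shard(key)].get(key, 0.0)

        def update(self, key, delta):
            # Additive update, as sent by a worker after computing gradients.
            s = self.shards[self._shard(key)]
            s[key] = s.get(key, 0.0) + delta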

PMLS also employs the Stale Synchronous Parallel (SSP) model, which relaxes the Bulk Synchronous Parallel (BSP) requirement that workers synchronize at the end of every iteration. SSP reduces how often workers must wait on one another, while ensuring that the fastest worker cannot get more than a bounded number of iterations ahead of the slowest worker. Because the ML training process tolerates noise, this relaxed consistency model is still sound for training. I described SSP in an April 2016 blog post [1].
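The following toy sketch (my illustration, not PMLS code) captures the SSP rule: a worker may start its next iteration only while it is at most s iterations ahead of the slowest worker, and otherwise it blocks:

    import threading

    class SSPClock:
        def __init__(self, num_workers, staleness):
            self.clocks = [0] * num_workers  # per-worker iteration counters
            self.s = staleness               # allowed staleness bound
            self.cv = threading.Condition()

        def tick(self, worker_id):
            # Worker finished an iteration; wake anyone waiting on the slowest.
            with self.cv:
                self.clocks[worker_id] += 1
                self.cv.notify_all()

        def wait_until_allowed(self, worker_id):
            # Block while this worker is more than s iterations ahead.
            with self.cv:
                while self.clocks[worker_id] - min(self.clocks) > self.s:
                    self.cv.wait()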

TensorFlow

Google already had a distributed ML platform based on the parameter-server model, called DistBelief. (Here is my review of the DistBelief paper [2].) The main complaint about DistBelief was that writing ML applications required messing with low-level code. Google wanted any of its employees to be able to write ML code without having to be proficient in distributed execution, for the same reason it wrote the MapReduce framework for big data processing.


TensorFlow was designed to achieve that goal. It adopts a dataflow paradigm, but an advanced version in which the computation graph does not need to be a DAG: it can include cycles and supports mutable state. I think the Naiad [3] design may have had some influence on TensorFlow's design.


TensorFlow represents a computation with a directed graph of nodes and edges. A node represents a computation, possibly with mutable state. An edge represents a multidimensional data array (a tensor) passed between nodes. TensorFlow requires the user to statically declare this symbolic computation graph, and it uses partitioning and rewriting of the graph to achieve distributed execution. (MXNet, and in particular DyNet, allows dynamic declaration of the symbolic graph, which improves the ease and flexibility of programming.)
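Here is a minimal sketch of static graph declaration, assuming the TensorFlow 1.x-era API that this 2017 article discusses (the shapes and names are illustrative):

    import numpy as np
    import tensorflow as tf  # the 1.x API discussed in this article

    x = tf.placeholder(tf.float32, shape=[None, 784])  # tensors flow along edges
    W = tf.Variable(tf.zeros([784, 10]))               # a node with mutable state
    b = tf.Variable(tf.zeros([10]))
    y = tf.nn.softmax(tf.matmul(x, W) + b)             # declared, not yet executed

    batch = np.random.rand(32, 784).astype("float32")  # stand-in input data

    with tf.Session() as sess:                         # the graph runs only here
        sess.run(tf.global_variables_initializer())
        probs = sess.run(y, feed_dict={x: batch})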



The figure shows the use of the parameter-server approach in TensorFlow distributed ML training. With the PS abstraction in TensorFlow, you get parameter servers together with data parallelism. TensorFlow lets you do more complicated things, but that requires writing custom code and venturing into uncharted territory.
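A hedged sketch of that parameter-server setup in TensorFlow 1.x follows; the hostnames, ports, and task indices are placeholder assumptions:

    import tensorflow as tf

    cluster = tf.train.ClusterSpec({
        "ps": ["ps0:2222"],                          # parameter-server task
        "worker": ["worker0:2222", "worker1:2222"],  # data-parallel workers
    })
    server = tf.train.Server(cluster, job_name="worker", task_index=0)

    # replica_device_setter places variables on the PS tasks and ops here.
    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        W = tf.Variable(tf.zeros([784, 10]))         # stored on the PS
        # ... build the rest of the model, then train through a session
        #     connected to server.target.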


Some evaluation results


For our evaluation, we used Amazon EC2 m4.xlarge instances. Each has 4 vCPUs (Intel Xeon E5-2676 v3) and 16 GiB of RAM, with 750 Mbps of EBS bandwidth. We evaluated two commonly used machine learning tasks: two-class logistic regression and image classification using multilayer neural networks. I am only providing a couple of comparison graphs here; you can check our paper for more experiments. Our experiments had several limitations: we used a small number of machines and could not test at scale, and we were limited to CPU computing and did not test with GPUs.



This figure shows the per-platform speed of logistic regression. Spark trails only PMLS and MXNet.


This figure shows the per-platform speed for DNNs. Compared with single-layer logistic regression, Spark suffers a greater performance loss on the two-layer neural network, because more iterative computation is needed. Here we kept the parameters at the driver in Spark; the situation would be worse if we kept the parameters in an RDD and updated them after each iteration.


This figure shows the CPU utilization of the platforms. Spark applications show noticeably higher CPU utilization, mainly due to serialization overhead.


Conclusions and future directions


The parallelism in ML/DL applications is not very interesting from a concurrent-algorithms perspective. It is fair to say that the parameter server has become the dominant approach for training on distributed ML platforms.

As far as bottlenecks go, the network remains a bottleneck for distributed ML applications. Rather than working on more advanced general-purpose dataflow platforms, it would be more useful to provide better staging of data and models, treating data and models as first-class citizens.

However, there can be surprises and subtleties. In Spark, CPU overhead was becoming a bottleneck before the network [4]. The programming language Spark uses (Scala on the JVM) significantly affects its performance. Better tools are therefore needed for monitoring and performance prediction on distributed ML platforms. Some tools addressing these problems for Spark data-processing applications have been proposed recently, such as Ernest [5] and CherryPick [6].

There are many open issues in distributed-systems support for the ML runtime, such as resource scheduling and runtime performance improvement. Using runtime monitoring and profiling of the application, the next generation of distributed ML platforms should provide elastic provisioning and scheduling of compute, memory, and other runtime resources for the tasks being run.

Finally, there are open problems in programming and software-engineering support. What are the right [distributed] programming abstractions for ML applications? More research is also needed on the verification and validation of ML applications (in particular, testing DNNs with problematic inputs).


    1. https://muratbuffalo.blogspot.com/2016/04/petuum-new-platform-for-distributed.html

    2. https://muratbuffalo.blogspot.com/2017/01/google-distbelief-paper-large-scale.html

    3. http://muratbuffalo.blogspot.com/2014/03/naiad-timely-dataflow-system.html

    4. http://muratbuffalo.blogspot.com/2017/05/paper-summary-making-sense-of.html

    5. https://spark-summit.org/east-2017/events/ernest-efficient-performance-prediction-for-advanced-analytics-on-apache-spark/

    6. https://blog.acolyer.org/2017/05/04/cherrypick-adaptively-unearthing-the-best-cloud-configurations-for-big-data-analytics/


Original English post: http://muratbuffalo.blogspot.com/2017/07/a-comparison-of-distributed-machine.html


Written by Murat; translated by Jesse. Please credit the source when republishing.

