How to evaluate the Petuum distributed machine learning system?

Source: Internet
Author: User
Tags: spark, mllib, key-value store

Compared with other algorithms in computing, machine learning algorithms have some unique characteristics of their own:

(1) Iterative: the model is not updated in one shot; it must be refined over many iterations;

(2) Error tolerance: small errors in individual iterations do not affect the model's final convergence;

(3) Non-uniform parameter convergence: some parameters stop changing after only a few iterations, while others take much longer to converge.

These characteristics mean that the design of a distributed machine learning system differs greatly from that of general-purpose distributed computing systems such as Spark.

Petuum is a distributed platform built specifically for machine learning algorithms, whereas the widely used distributed computing system Spark is built around dataflow applications, so the two target different workloads. Spark does ship a machine learning library, MLlib, but it is built on top of dataflow operations and is not designed around the characteristics of machine learning algorithms listed above.

Petuum's design is driven by these characteristics. It currently contains two main modules, a key-value store and a scheduler, which support the two main parallelization strategies: (1) data parallelism and (2) model parallelism.

In data parallelism, the training data is partitioned across machines; each machine computes a model update on its own partition, and the updates are then aggregated and applied to the shared model.
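
As a minimal sketch of this idea (the names here are illustrative, not Petuum's actual API), one synchronous data-parallel step might look like this in Python:

    import numpy as np

    def data_parallel_step(model, shards, compute_grad, lr=0.1):
        # Each "worker" computes a gradient on its own data shard; in a real
        # system these run on separate machines, simulated here by a loop.
        grads = [compute_grad(model, shard) for shard in shards]
        # Aggregate the per-worker gradients, then apply a single update.
        return model - lr * np.mean(grads, axis=0)

With compute_grad returning, say, the gradient of a squared loss on its shard, this reproduces plain synchronous minibatch SGD.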

In model parallelism, the model parameters are partitioned across machines, and each machine updates only its own portion of the parameters.
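
Again as an illustrative sketch (hypothetical names, not Petuum's API), a model-parallel step updates disjoint slices of the parameter vector:

    def model_parallel_step(params, partition, shard_update):
        # `partition` maps each worker to the parameter indices it owns
        # (params can be, e.g., a NumPy array); each worker updates only
        # its own slice of the model, so no two workers touch the same entry.
        for worker_id, idx in partition.items():
            params[idx] = shard_update(worker_id, params[idx])
        return params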

The key-value store module supports data parallelism. Its architecture is a parameter server, and its consistency protocol is Stale Synchronous Parallel (SSP). The basic idea of SSP is to let machines update the model at different paces, but with a bound so that the fastest machine never runs too far ahead of the slowest one. This way the slow machines do not drag down the whole system, while the final convergence of the model is still guaranteed. By adjusting SSP's staleness parameter, it can be reduced to the Bulk Synchronous Parallel (BSP) protocol common in dataflow systems, or to the Asynchronous Parallel (ASP) protocol used by earlier machine learning systems such as Yahoo! LDA.
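
The staleness bound itself is simple; a toy illustration (not Petuum's implementation) in Python:

    def ssp_can_proceed(my_clock, all_clocks, staleness):
        # A worker may advance to its next iteration only if it is at most
        # `staleness` iterations ahead of the slowest worker.
        # staleness = 0 reduces SSP to BSP; staleness -> infinity gives ASP.
        return my_clock - min(all_clocks) <= staleness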

The other module, the scheduler, supports model parallelism. Its programming interface consists of three main operations: (1) Schedule: the scheduling node selects a subset of parameters to update, based on the dependencies between model parameters and on how far each parameter has converged; (2) Push: the compute nodes compute updates for the selected parameters in parallel; and (3) Pull: the scheduling node collects the updates from the compute nodes and applies them to the parameters.
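
Putting the three operations together, the training loop might look roughly like this (a hypothetical sketch; the object names are illustrative, not Petuum's actual API):

    def model_parallel_train(scheduler, workers, params, num_iters):
        for _ in range(num_iters):
            subset = scheduler.schedule(params)                  # (1) Schedule a parameter subset
            updates = [w.push(params, subset) for w in workers]  # (2) Push: runs in parallel in reality
            params = scheduler.pull(params, subset, updates)     # (3) Pull: collect and apply updates
        return params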

The main features of Petuum include: (1) high performance; (2) programmability: an easy-to-use programming interface that lets users implement their own machine learning algorithms on Petuum; and (3) a rich machine learning library: 14 important machine learning algorithms have been implemented on Petuum's unified programming interface.

Compared with Mu Li's Parameter Server system, Petuum is positioned as a machine learning framework: its shared model parameters are stored in hash tables and updated under a stale-consistency protocol. This design means the cluster sizes and parameter counts Petuum can support are one to two orders of magnitude smaller than those of the Parameter Server. On the other hand, compared with Spark MLlib's list-based data storage and BSP synchronization, Petuum can train models two to three orders of magnitude larger on a cluster of the same size. In addition, Spark's data-processing side has little real connection to Petuum; looking ahead, a practical division of labor is to use Spark for data processing and a parameter server for machine learning.

References:

1. Eric P. Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, Yaoliang Yu: Petuum: A New Platform for Distributed Machine Learning on Big Data. KDD 2015: 1335-1344

2. Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, Bor-Yiing Su: Scaling Distributed Machine Learning with the Parameter Server. OSDI 2014: 583-598
