What is "large-scale machine learning"?

Source: Internet
Author: User
Tags: spark, rdd, spark mllib

In massive-data scenarios, large-scale machine learning methods that fully exploit the value of the data have been widely adopted in many companies. There is already plenty of material on the subject, but it is still not very systematic.


With the authors' permission, I am reprinting the answers of two excellent practitioners in the field of large-scale machine learning, @Baigang and @Yang Jun, in the hope of making the knowledge system in this direction clearer and more systematic and of providing some guidance for practitioners in this field.


@Baigang


Machine learning refers to algorithms that automatically induce logic or rules from data and then make predictions on new data based on what has been induced. This direction is all about data. In the past, data collection was inefficient and expensive, so there were few problems that were truly large scale. Only once the Internet developed could web servers record and collect large volumes of user access and interaction data; such data is very cheap to collect and carries real value. As a result, "large-scale machine learning" became feasible and increasingly important.


In machine learning applications we run into a wide variety of problems, algorithms, and techniques. They look complicated, but they can mostly be viewed from two aspects:


Determining the model's representation scheme, based on the problem being modeled and the data available in the application

The optimization method for finding the best model among the infinitely many possible representations


"Large-scale machine learning" involves solving both theoretical and engineering problems in the application of large-scale data from both the representation and optimization of the model.


Why "large-scale" is beneficial, than the "small-scale" advantage.


Technical work should be grounded in practical usefulness, not in the pursuit of fanciness.

-- somebody or other


The goal of model training is for the model to perform as well as possible on new data, that is, to minimize the generalization error. For supervised learning, this generalization error theoretically has two sources: bias and variance.


High bias can be seen as the model having insufficient capacity: the error on the training set is large and tends to be just as bad, or worse, on new data, i.e. under-fitting.

High variance can be seen as the model fitting the training set too closely, so that it performs poorly on new data, i.e. over-fitting.
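
For squared loss, the textbook decomposition of expected prediction error makes the two sources explicit (written here in standard notation for reference):

$$
\mathbb{E}\big[(y - \hat{f}_D(x))^2\big]
= \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_D\big[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\big]}_{\text{variance}}
+ \sigma^2
$$

where $\hat{f}_D$ is the model trained on sample set $D$, $f$ is the true function, and $\sigma^2$ is irreducible noise.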


So for model quality, besides tricks such as feature engineering, "tuning a good set of parameters", that is, handling the bias/variance trade-off well, is part of an algorithm engineer's core competence. "Large-scale machine learning" can mitigate both of these problems to some extent:


Fixing high variance / over-fitting: enlarge the sample set. Variance can be understood as the sample set not being comprehensive, so the distribution of the training samples is inconsistent with the distribution of the actual data. By enlarging the sample set, even to the full data, the training set can be made consistent with the data the model will actually be applied to, reducing the impact of variance.

Fixing high bias / under-fitting: increase the complexity of the model. By introducing more features and more complex structures, the model can describe the probability distribution / decision boundary / rule logic more completely and therefore perform better.


So when the boss, the technical committee, the critical little expert from the team next door, or uninformed onlookers complain that tens or hundreds of compute nodes chewing through several terabytes of data is wasteful, not green, and just an attempt to make big news, use the solid theoretical basis and the real business benefit to convince everyone: this is about practical usefulness, not the pursuit of fanciness.


Representation and optimization of models


Defining the problem that machine learning solves a bit more formally, from the two aspects of model representation and optimization, the learning process can roughly be described as:

$$\theta^{*} = \arg\min_{\theta} \; \Sigma(\theta; D) + \Omega(\theta)$$

where $\theta$ is the representation, i.e. the model parameters; $D$ is the data; and $\Sigma(\cdot)$ and $\Omega(\cdot)$ are, respectively, the loss/risk/objective and the prior/regularization/structural knowledge.
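
As one concrete instance (an illustration of the template, not tied to any particular system): for L2-regularized logistic regression on samples $(x_i, y_i) \in D$ with $y_i \in \{-1, +1\}$, the two terms are

$$\Sigma(\theta; D) = \sum_{(x_i, y_i) \in D} \log\!\left(1 + e^{-y_i\,\theta^{\top} x_i}\right), \qquad \Omega(\theta) = \frac{\lambda}{2}\,\lVert\theta\rVert_2^2 .$$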


The optimization process mentioned above is usually iterative. Some problems can be cast as numerical optimization and solved with batch methods such as L-BFGS or with online methods such as SGD/FTRL for parameter estimation. Others cannot be reduced to a tractable optimization problem; they are instead cast as estimating a probability distribution and solved with probabilistic inference, for example training a latent Dirichlet allocation model with Gibbs sampling.
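
As a minimal sketch of what one such iterative update looks like in code (my illustration; the function name `sgd_step` and the toy data are assumptions, not any library's API), here is a mini-batch SGD step for the regularized logistic-regression objective above:

```python
import numpy as np

def sgd_step(theta, X, y, lr=0.1, lam=1e-4):
    """One mini-batch SGD step for L2-regularized logistic regression.
    theta: (d,) parameters; X: (n, d) features; y: (n,) labels in {-1, +1}."""
    margins = y * (X @ theta)                    # m_i = y_i * theta^T x_i
    coeff = -y / (1.0 + np.exp(margins))         # per-sample gradient coefficient -y_i / (1 + e^{m_i})
    grad = X.T @ coeff / len(y) + lam * theta    # average loss gradient + L2 term
    return theta - lr * grad                     # corrected parameters

# toy usage: recover a random linear separator
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5))
w_true = rng.normal(size=5)
y = np.sign(X @ w_true)
theta = np.zeros(5)
for _ in range(200):
    theta = sgd_step(theta, X, y)
```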


Whether it is numerical optimization or sampling, it is an iterative refinement process:

$$\theta^{(t+1)} = \theta^{(t)} + \Delta\!\left(\theta^{(t)}, D\right)$$

Each iteration does two things. The first is evaluating the current model on the data set $D$ and producing a quantity $\Delta$ that measures "how far the current model deviates from a good one"; the second is correcting the model with that deviation.


Large-scale machine learning deals with the theoretical and engineering problems introduced when the scale of $D$ and $\theta$ is very large. The "knowledge system" can be organized along two lines: on one side the macroscopic architecture, on the other countless microscopic tricks. The two steps in the iteration determine what the architecture should look like at the macro level, while the computation inside the two steps and the data exchange between them introduce the techniques and tricks that solve the various concrete problems.


First, $D$ and $\theta$ need to be distributed across multiple compute nodes.


Assume the data is distributed across n data nodes and the model across m model nodes. The data/sample set is only involved in the first step, so that step should be computed on the nodes where the data resides; the second step updates the model, so it should be computed on the nodes where the model resides. In other words, this architecture has two roles, each with its own computational logic, and in a distributed system each role also needs a replica mechanism for fault tolerance.
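
The split between the two roles can be sketched in a few lines. This is a single-process illustration under my own naming (`DataNode`, `ModelNode`); real systems add sharding, networking, and replicas on top of exactly this division of labor:

```python
import numpy as np

class DataNode:
    """Owns a shard of samples; performs step one of each iteration."""
    def __init__(self, X, y):
        self.X, self.y = X, y

    def compute_update(self, theta, lr=0.1):
        # evaluate the current model on the local shard and return the
        # "deviation amount" to be sent to the model node(s)
        margins = self.y * (self.X @ theta)
        coeff = -self.y / (1.0 + np.exp(margins))
        return -lr * (self.X.T @ coeff) / len(self.y)

class ModelNode:
    """Owns (a shard of) the parameters; performs step two of each iteration."""
    def __init__(self, dim):
        self.theta = np.zeros(dim)

    def apply_update(self, delta):
        self.theta += delta

# one synchronous iteration over two data shards
rng = np.random.default_rng(1)
shards = [DataNode(rng.normal(size=(64, 4)), rng.choice([-1.0, 1.0], size=64))
          for _ in range(2)]
model = ModelNode(4)
model.apply_update(sum(d.compute_update(model.theta) for d in shards) / len(shards))
```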


Almost all platforms and systems for large-scale machine learning can be seen as compositions of these two roles. In Spark MLlib, the driver program is the model node and the executors are the data nodes; in Vowpal Wabbit, the mapper with ID 0 plays the model node while all mappers play the data-node role; in Parameter Server, the servers are the model nodes and the workers are the data nodes. The model node in both MLlib and VW is a single node, whereas PS extends it to multiple nodes.


Second, data must be transferred between nodes: a data node needs the current model before it can compute, and the model node needs the update amounts in order to update.


MLlib does this through the RDD treeAggregate interface and VW through AllReduce; the two are essentially the same approach. The nodes are arranged in a tree: each node computes its own share of the update, sums in the results of its child nodes, and passes the partial sum up to its parent, so that the root (the model node) finally holds the total update. Rabit in DMLC provides a cleaner interface and a more robust implementation of this kind of aggregation. We used Rabit to implement a multi-task learning setup for training ad click-through models for different ad positions.
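
The shape of this aggregation can be sketched as follows (my illustration of the idea behind treeAggregate/AllReduce, not either library's actual interface):

```python
import numpy as np

def tree_aggregate(node, children, local_update):
    """Sum this node's local update with the aggregated updates of its children
    and return the partial sum that would be passed up to the parent."""
    total = local_update[node].copy()
    for child in children.get(node, []):
        total += tree_aggregate(child, children, local_update)
    return total

# toy usage: seven nodes in a binary tree; node 0 (the root / model node)
# ends up holding the total update contributed by all nodes
children = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
local_update = {i: np.full(3, float(i)) for i in range(7)}
print(tree_aggregate(0, children, local_update))   # [21. 21. 21.]
```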


In PS, each worker (data node) pulls only the part of the model it needs, computes its update amount, and pushes it to the servers (model nodes), which then apply the corresponding model updates. This looks simpler than the aggregation above, but when the data nodes drift too far apart in progress it introduces a new problem.
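
The cycle can be sketched like this (a single-process illustration under assumed names `Server`/`Worker`, not the actual Parameter Server API):

```python
import numpy as np

class Server:
    """Model node: holds the parameters and answers pull/push requests."""
    def __init__(self, dim):
        self.theta = np.zeros(dim)

    def pull(self, keys):
        return self.theta[keys]            # only the slice this worker needs

    def push(self, keys, delta):
        self.theta[keys] += delta          # apply the worker's update amount

class Worker:
    """Data node: owns a shard of samples and the parameter keys it touches."""
    def __init__(self, X, y, keys):
        self.X, self.y, self.keys = X, y, np.asarray(keys)

    def step(self, server, lr=0.1):
        theta = server.pull(self.keys)                     # pull
        margins = self.y * (self.X @ theta)
        coeff = -self.y / (1.0 + np.exp(margins))
        delta = -lr * (self.X.T @ coeff) / len(self.y)     # compute
        server.push(self.keys, delta)                      # push
```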


So the third issue concerns parallelism and consistency. In a distributed system the nodes are in different states and progress at different speeds. When a data node computes an update amount against a model that is already several iterations stale, that update does not necessarily help the optimization and may even hurt convergence.


The iterative computation in VW and MLlib can be seen as synchronous: the model node gathers the updates from all data nodes, applies them, and broadcasts the new model back to all data nodes. In every iteration the model node receives the full update amount and every data node holds a consistent model. This amounts to setting a barrier: everyone waits until the whole round is finished before anyone starts the next one.


The trade-off here is: with no barrier at all, i.e. fully asynchronous processing, the system advances as fast as possible, but the progress gap between nodes can grow so large that convergence becomes unstable; with a barrier every iteration, i.e. bulk synchronous processing, consistency is guaranteed, but every round waits for the slowest node, which is inefficient.


A compromise is to still set barriers, just not in every round: the delay is limited to a bounded window. The fastest node waits until the stragglers' outstanding updates within that window have been applied before it starts its next pull-compute-push round. This mechanism is called bounded-delay asynchronous or stale synchronous processing.
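
The rule can be sketched as a small gate (my illustration; the class name `StalenessGate` is made up): a worker may start a new iteration only if it is at most `bound` iterations ahead of the slowest worker. Setting `bound = 0` recovers bulk synchronous processing, while a very large bound approaches fully asynchronous processing.

```python
import threading

class StalenessGate:
    """Bounded-delay barrier: a worker blocks when it gets more than
    `bound` iterations ahead of the slowest worker."""
    def __init__(self, num_workers, bound):
        self.clock = [0] * num_workers       # last finished iteration of each worker
        self.bound = bound
        self.cond = threading.Condition()

    def finish_iteration(self, worker_id):
        with self.cond:
            self.clock[worker_id] += 1
            self.cond.notify_all()

    def wait_to_start(self, worker_id):
        with self.cond:
            # block while this worker is more than `bound` iterations ahead
            while self.clock[worker_id] - min(self.clock) > self.bound:
                self.cond.wait()

# each worker's loop: gate.wait_to_start(i); pull, compute, push; gate.finish_iteration(i)
```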


In general, solving the problems above and building a large-scale machine learning system is not rocket science, but it does require solid engineering skills and plenty of ingenious tricks. All of this comes from a deep understanding of the problem and from ample practice: the former helps define a reasonable abstract architecture and its interfaces, while the latter helps remove obstacles in the implementation and leads to a more efficient and stable system.


I once tried to implement some large-scale machine learning algorithms on Apache Spark: https://github.com/BaiGang/spark_multiboost. I gradually found that Spark RDD's data-flow abstraction can support some of the needs of machine learning algorithms, but its overly high-level interface encapsulation limits more targeted optimizations at the lower levels.


Later I took part in developing a distributed computing framework and machine learning algorithms on top of it: https://github.com/taskgraph/taskgraph. Go has good interface support for parallel computing and networking and is well suited as a systems programming language. But for dense, purely CPU-intensive numerical computation, the machine
