Notes on Spark's Random Forest Implementation

Source: Internet
Author: User
Tags: spark, mllib

Objective

I recently read through the random forest implementation in Spark MLlib (Spark 1.3) and found that implementing an iterative algorithm on a distributed data structure differs in several places from the single-machine setting. Operations that are intuitive on one machine (such as recursion) must be reworked for distributed data, otherwise I/O (network, disk) consumes an enormous amount of time. This article covers the techniques used in Spark's random forest implementation, as notes for later review.

Overview of the Random Forest Algorithm

For the detailed algorithm and its analysis, see Breiman's 2001 paper (Resource 1 below). Here is a brief sketch of the general idea, enough to make the code understandable.

A random forest is an ensemble model whose base learners are decision trees. The basic idea is to grow many decision trees (forming the forest) and let them vote: the final prediction is decided by the majority of the trees. During tree growing, randomness is injected along two directions, rows and columns. In the row direction, before building each tree, the training data is drawn by sampling with replacement (bootstrapping). In the column direction, each time a split point is selected, the features are randomly sampled to obtain a subset, and only the data for that subset is used to compute the optimal split at the current node. This is also where the name "random forest" comes from. Compared to a single decision tree, random forests have the following advantages:

    1. The results are relatively stable and less prone to overfitting.
    2. The out-of-bag (OOB) error can evaluate the model without cross-validation.
    3. Feature importance can be computed.

Of course, these advantages come at a computational cost. In the single-machine era, training random forests (in R or scikit-learn) was often expensive, but Spark now makes large-scale, distributed iterative computation feasible, so using random forests on Spark is a natural consequence of how the technology has evolved.
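To make the row/column randomness concrete, here is a minimal usage sketch against the MLlib 1.x tree API; the data path and parameter values are illustrative choices of mine, not from the original article. numTrees controls the size of the forest (each tree gets its own bootstrap sample), featureSubsetStrategy controls the column-wise feature sampling, and maxBins will reappear in the binning optimization discussed below.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.mllib.util.MLUtils

    object RandomForestExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("rf-example"))

        // Illustrative path: LIBSVM-formatted training data.
        val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

        val model = RandomForest.trainClassifier(
          data,
          numClasses = 2,
          categoricalFeaturesInfo = Map[Int, Int](), // treat all features as continuous
          numTrees = 100,                 // row direction: one bootstrap sample per tree
          featureSubsetStrategy = "sqrt", // column direction: sample sqrt(#features) per split
          impurity = "gini",
          maxDepth = 5,
          maxBins = 32)                   // bin count used by the optimizations below

        println(model.toDebugString)
        sc.stop()
      }
    }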

Optimizations in Spark

When implementing the random forest, Spark uses the following optimization strategies:

    1. Split point sampling
    2. Feature binning
    3. Per-partition statistics
    4. Level-wise training

These strategies are needed because RDD data is distributed across different servers; to avoid excessive I/O, the original algorithm has to be adapted, otherwise the execution time may be unacceptable. The four strategies are discussed in detail below.

Split Point Sampling

This optimization mainly targets continuous variables. First, recall how an ordinary decision tree selects split points for a continuous variable. In general, the feature is sorted first, and then the midpoints between adjacent values are taken as candidate split points. Performing this operation directly on an RDD would inevitably trigger a shuffle, generating a great deal of network traffic. Moreover, RDDs typically hold large data sets: not mere millions of rows, but hundreds of millions to billions or more. Sorting at that scale is a non-starter. To avoid the sort, MLlib instead sorts a sample of the data and derives the candidate split points from that sample. According to the Spark team, this strategy sacrifices some precision, but in practice it has not caused much impact, and the resulting models have been acceptable.
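As an illustration of the idea, here is a simplified sketch of my own (not MLlib's actual split-finding code; the function name and the 10,000-row sample target are arbitrary choices): sample the feature, sort only the small sample on the driver, and take evenly spaced quantiles of the sample as candidate split points.

    import org.apache.spark.rdd.RDD

    // Simplified sketch: derive candidate split points for one continuous
    // feature from a small sample, avoiding a full distributed sort.
    def candidateSplits(feature: RDD[Double], numBins: Int, seed: Long = 42L): Array[Double] = {
      // Sample a small fraction so the sort happens on the driver.
      val fraction = math.min(1.0, 10000.0 / feature.count())
      val sample = feature.sample(withReplacement = false, fraction, seed).collect().sorted

      // Take numBins - 1 evenly spaced sample quantiles as split points.
      val step = sample.length.toDouble / numBins
      (1 until numBins)
        .map(i => sample(math.min((i * step).toInt, sample.length - 1)))
        .toArray
        .distinct
    }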

Feature Binning

Once the sampled split points are in hand, the next step is feature binning: each bin is bounded by adjacent sampled split points. The number of bins is kept very small, typically around 30 (MLlib's default maxBins is 32). By counting the class frequencies within each bin, the optimal split point can be computed quickly.
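For example (a sketch of my own, not MLlib's code), once the sorted split points exist, assigning a raw value to a bin is just a binary search:

    // Sketch: map a raw continuous value to its bin index, given the sorted
    // array of candidate split points (length = number of bins - 1).
    def binIndex(value: Double, splits: Array[Double]): Int = {
      val pos = java.util.Arrays.binarySearch(splits, value)
      // A negative result encodes the insertion point, which is the bin index;
      // an exact hit on split i is assigned to the left-hand bin by convention.
      if (pos >= 0) pos else -pos - 1
    }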

The original article illustrated this with a figure (not reproduced here) whose first row gave the class statistics at each candidate split point; from those statistics three candidate splits could be formed, marked by brown, red, and green lines. To evaluate, say, the brown split, one only needs to accumulate the bin statistics from that first row, which makes the computation fast.
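In the same spirit, here is a self-contained sketch (again mine, not MLlib's actual code) of how the best split for one feature can be chosen by accumulating per-bin class counts left to right and computing the Gini gain of each candidate split from the running totals:

    // Sketch: given per-bin class counts for one feature at one node, evaluate
    // every candidate split by accumulating bins left to right.
    // binStats(b)(c) = count of class c in bin b.
    def bestSplitGini(binStats: Array[Array[Long]]): (Int, Double) = {
      def gini(counts: Array[Long]): Double = {
        val total = counts.sum.toDouble
        if (total == 0) 0.0 else 1.0 - counts.map(c => (c / total) * (c / total)).sum
      }
      val numClasses = binStats.head.length
      val totals = Array.fill(numClasses)(0L)
      binStats.foreach(b => (0 until numClasses).foreach(c => totals(c) += b(c)))
      val parentImpurity = gini(totals)
      val all = totals.sum.toDouble

      val left = Array.fill(numClasses)(0L)
      var best = (-1, 0.0) // (index of the last left-hand bin, impurity gain)
      for (b <- 0 until binStats.length - 1) {
        (0 until numClasses).foreach(c => left(c) += binStats(b)(c))
        val right = Array.tabulate(numClasses)(c => totals(c) - left(c))
        val nLeft = left.sum.toDouble
        val gain = parentImpurity -
          (nLeft / all) * gini(left) - ((all - nLeft) / all) * gini(right)
        if (gain > best._2) best = (b, gain)
      }
      best
    }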

Partition statistics

The binned data is first counted within each RDD partition (implemented with mapPartitions), and the per-partition results are then merged with reduce to obtain the overall bin statistics. It is precisely because the bin statistics are mergeable that the approach adapts so well to the distributed data environment: only small statistics arrays need to be merged at the end, which incurs no significant network communication overhead.
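A minimal sketch of this pattern, assuming the rows have already been reduced to (bin index, class label) pairs (the pair representation and the function name are mine, not MLlib's):

    import org.apache.spark.rdd.RDD

    // Count (bin, class) occurrences inside each partition, then merge the
    // small per-partition count arrays with reduce. Only the statistics
    // travel over the network, never the raw rows.
    def aggregateBinStats(
        binned: RDD[(Int, Int)], // (binIndex, classLabel) per row
        numBins: Int,
        numClasses: Int): Array[Array[Long]] = {
      binned.mapPartitions { iter =>
        val local = Array.fill(numBins, numClasses)(0L)
        iter.foreach { case (bin, cls) => local(bin)(cls) += 1L }
        Iterator.single(local)
      }.reduce { (a, b) =>
        for (i <- a.indices; j <- a(i).indices) a(i)(j) += b(i)(j)
        a
      }
    }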

Level-Wise Training

The single-machine version builds a decision tree by recursive calls (essentially depth-first), and during construction it must also move data around so that rows belonging to the same child node sit together. This method cannot be executed efficiently on a distributed data structure; in fact it cannot really be executed at all, because the data is too large to be colocated and is therefore stored distributed. The strategy MLlib adopts is to build the tree nodes level by level (essentially breadth-first), so that the number of full passes over the data equals the maximum depth across all trees. On each pass, only the bin statistics of every feature for each frontier node need to be computed; after the pass, each node decides whether and how to split based on its bin statistics.
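The control flow can be sketched as follows (a sketch of mine, not MLlib's code; the type parameter and the two function arguments are hypothetical stand-ins for MLlib's internal machinery):

    // Level-wise (breadth-first) construction: one distributed pass per tree
    // level instead of one pass per node, as recursion would require.
    def trainLevelWise[N](
        root: N,
        maxDepth: Int,
        aggregateStatsForLevel: Seq[N] => Map[N, Array[Array[Long]]], // hypothetical
        chooseSplit: (N, Array[Array[Long]]) => Seq[N]): Unit = {     // hypothetical
      var frontier = Seq(root)
      var level = 0
      while (frontier.nonEmpty && level < maxDepth) {
        // One pass over the full data set computes the bin statistics for
        // ALL nodes in the current level at once.
        val stats = aggregateStatsForLevel(frontier)
        // Splits are decided on the driver from the compact statistics;
        // chooseSplit returns the children, or Nil when the node becomes a leaf.
        frontier = frontier.flatMap(n => chooseSplit(n, stats(n)))
        level += 1
      }
    }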

These are the key techniques in Spark MLlib's random forest implementation. There are of course many implementation details not covered here, but understanding these techniques should help when reading the MLlib random forest source code. I hope this is useful to the reader.

Gaps in Spark's Random Forest Implementation

As of Spark 1.3, MLlib's random forest still supports neither OOB error nor variable importance. Some users have raised this issue in the Spark community but have not yet received an official response. Hopefully Spark will support these features in a later release.

Resources

    1. Breiman, L. (2001), "Random Forests" (the original paper)
    2. Spark source code
    3. Spark Summit talk on the distributed decision tree implementation
