Random forests and gradient-boosted trees in MLlib


Both random forests and GBTs are ensemble learning algorithms that build a strong classifier by combining multiple decision trees.

Ensemble learning is a machine learning approach that builds on other learning algorithms and combines them effectively. The combined model is more powerful and accurate than any of its individual component models.

The main difference between random forests and gradient-boosted trees (GBTs) is the order in which the individual trees are trained.

Random forests train each tree independently on a random sample of the data. This randomness makes the model more robust than a single decision tree and makes it less likely to overfit the training set.

GBTs train one tree at a time, and each new tree corrects the errors made by the trees trained before it. As trees are added, the model becomes more expressive.

Finally, both methods produce a weighted collection of decision trees. The ensemble model makes predictions by combining the results of the individual trees. The example below uses an ensemble of three decision trees.


In the regression setting of this example, each tree predicts a real value. These predictions are combined to produce the final ensemble prediction. Here the final result is obtained by taking the mean (different prediction tasks require different combination rules; for classification, a majority vote is typically used).
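As a minimal sketch of this combination step (assuming a collection of already-trained DecisionTreeModel instances; the helper name is chosen here purely for illustration):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.model.DecisionTreeModel

// For regression, the ensemble prediction is simply the mean of the individual tree predictions.
def ensemblePredict(trees: Seq[DecisionTreeModel], features: Vector): Double =
  trees.map(_.predict(features)).sum / trees.size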


In MLlib, the data for random forests and GBTs is partitioned by instances (rows). Both implementations build on the original decision tree code, and each individual decision tree is trained with distributed learning.

Random forests: Each tree in a random forest is trained independently, so multiple trees can be trained in parallel (in addition, the training of each individual tree is itself parallelized). MLlib does exactly this: it dynamically adjusts the number of sub-trees trained in parallel based on the memory constraints of the current iteration.

GBTs: Because GBTs can train only one tree at a time, parallelism is limited to the level of a single tree.

MLlib uses two key optimizations:

    • Memory: A random forest trains each tree on a different data sub-sample. Instead of replicating the data for every sub-sample, memory is saved by using the TreePoint data structure to store each sub-sample.
    • Communication: While a single decision tree usually considers all features at every split node, random forests often restrict each node to a random subset of the features. MLlib's implementation exploits this feature sub-sampling to reduce communication: for example, if each node uses only 1/3 of the features, communication is reduced to 1/3 of what a full feature scan would need (a sketch of the corresponding API parameters follows this list).
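A rough sketch of how these two knobs surface in the MLlib API (the parameter values below are illustrative, not recommendations, and `trainingData` is assumed to be an RDD[LabeledPoint] as in the full example that follows):

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.configuration.Strategy

val strategy = Strategy.defaultStrategy("Classification")
strategy.maxMemoryInMB = 256            // memory budget that bounds how many sub-trees are trained in parallel
val featureSubsetStrategy = "onethird"  // each node considers roughly 1/3 of the features, reducing communication

val model = RandomForest.trainClassifier(trainingData, strategy, 10, featureSubsetStrategy, seed = 12345)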
Random Forest:

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.configuration.Strategy
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Split data into training/test sets.
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a RandomForest model.
val treeStrategy = Strategy.defaultStrategy("Classification")
val numTrees = 3 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val model = RandomForest.trainClassifier(trainingData, treeStrategy, numTrees, featureSubsetStrategy, seed = 12345)

// Evaluate model on test instances and compute test error.
val testErr = testData.map { point =>
  val prediction = model.predict(point.features)
  if (point.label == prediction) 0.0 else 1.0
}.mean()
println("Test Error = " + testErr)
println("Learned Random Forest:\n" + model.toDebugString)

GBTs:

import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Split data into training/test sets.
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a GradientBoostedTrees model.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 3 // Note: use more in practice.
val model = GradientBoostedTrees.train(trainingData, boostingStrategy)

// Evaluate model on test instances and compute test error.
val testErr = testData.map { point =>
  val prediction = model.predict(point.features)
  if (point.label == prediction) 0.0 else 1.0
}.mean()
println("Test Error = " + testErr)
println("Learned GBT model:\n" + model.toDebugString)

Scalability:

We demonstrate MLlib's scalability with empirical results. Each chart below compares GBTs and random forests, with the ensembles using trees of different maximum depths.

The tests use a regression task: predicting a song's release year from audio features (the YearPredictionMSD dataset from the UCI ML repository). We used EC2 r3.2xlarge machines, and algorithm parameters were left at their default values unless otherwise stated.

Scaling of model size: training time and test error

The following two charts show the effect of increasing the number of trees in the ensemble. For both GBTs and random forests, adding trees increases training time (first chart) but also improves predictive accuracy (measured by mean squared error on the test set, second chart).
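As an illustration only (not the exact benchmark code), a timing loop of the following shape could reproduce the "training time vs. number of trees" curve; `trainingData` and the specific tree counts and depth are assumptions:

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.configuration.Strategy

// Train random forests of increasing size and report wall-clock training time.
for (numTrees <- Seq(10, 50, 100)) {
  val strategy = Strategy.defaultStrategy("Regression")
  strategy.maxDepth = 5
  val start = System.nanoTime()
  RandomForest.trainRegressor(trainingData, strategy, numTrees, "onethird", seed = 42)
  val seconds = (System.nanoTime() - start) / 1e9
  println(s"numTrees = $numTrees, training time = $seconds s")
}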

In comparison, random forests train faster, but they require deeper trees to reach the same predictive accuracy as GBTs. GBTs can reduce the error significantly with each iteration, but after too many iterations they start to overfit (the test error rises). Random forests are less prone to overfitting, and their test error tends to plateau.
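One way to guard against GBT overfitting in MLlib is to hold out a validation set and let boosting stop once the validation error stops improving. A minimal sketch, assuming `trainingData` and `validationData` are RDD[LabeledPoint] splits of the data and that the iteration count and tolerance below are illustrative:

import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

// Boosting stops early when the improvement on the validation set falls below validationTol.
val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 100  // an upper bound; boosting may stop earlier
boostingStrategy.validationTol = 0.001
val model = new GradientBoostedTrees(boostingStrategy).runWithValidation(trainingData, validationData)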


Below are the mean squared error curves for individual tree depths of 2, 5, and 10.


Setup: 463,715 training examples, 16 worker nodes.

Training Set Scaling: Training time and test error

The following two charts show the effect of the training-set size on the results. With a larger dataset, both methods take longer to train but achieve better test error.



The ensemble approach is not limited to decision trees; it can combine almost any classification or regression algorithm.




