Key parameters
Most importantly, there are two parameters that usually need to be tuned to improve the algorithm's effectiveness: numTrees and maxDepth.
- numTrees (number of decision trees): Increasing the number of trees reduces the variance of the predictions, which generally yields higher accuracy at test time. Training time grows roughly linearly with numTrees.
- maxDepth: The maximum possible depth of each decision tree in the forest, as described in the decision tree guide. A deeper tree makes the model more expressive, but it also takes longer to train and is more prone to overfitting. Note, however, that this parameter behaves differently for a random forest than for a single decision tree. Because a random forest votes on or averages the predictions of many trees, which reduces the variance of the result, it is less likely to overfit than a single tree. A random forest can therefore use a larger maxDepth than a single decision tree model.
Some references even suggest that every decision tree in a random forest can be grown fully without pruning. Still, it is advisable to experiment with maxDepth to see whether prediction quality improves, as in the sketch below.
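As a minimal tuning sketch, one can sweep both parameters on a grid and compare test errors. The dataset path matches the full example below; the grid values are illustrative assumptions, not recommendations:

from __future__ import print_function
from pyspark import SparkContext
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils

sc = SparkContext(appName="RandomForestTuningSketch")
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
trainingData, testData = data.randomSplit([0.7, 0.3])

def testError(model):
    # Fraction of test points the model labels incorrectly.
    predictions = model.predict(testData.map(lambda p: p.features))
    labelsAndPredictions = testData.map(lambda p: p.label).zip(predictions)
    return (labelsAndPredictions.filter(lambda lp: lp[0] != lp[1]).count()
            / float(testData.count()))

for numTrees in [10, 50, 100]:    # more trees: lower variance, roughly linear cost
    for maxDepth in [4, 8, 12]:   # deeper trees: more expressive, slower, riskier
        model = RandomForest.trainClassifier(
            trainingData, numClasses=2, categoricalFeaturesInfo={},
            numTrees=numTrees, featureSubsetStrategy="auto",
            impurity='gini', maxDepth=maxDepth, maxBins=32)
        print(numTrees, maxDepth, testError(model))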
Two further parameters, subsamplingRate and featureSubsetStrategy, generally do not need tuning, but they can be adjusted to speed up training. Note, however, that doing so may affect the model's predictive quality (if you need to tune them, read the English guidelines below carefully).
We include a few guidelines for using random forests by discussing the various parameters. We omit some decision tree parameters since those are covered in the decision tree guide.
The first two parameters we mention are the most important, and tuning them can often improve performance:
(1) numTrees: number of trees in the forest.
Increasing the number of trees will decrease the variance in predictions, improving the model's test-time accuracy.
Training time increases roughly linearly in the number of trees.
(2) maxDepth: maximum depth of each tree in the forest.
Increasing the depth makes the model more expressive and powerful. However, deep trees take longer to train and are also more prone to overfitting.
In general, it is acceptable to train deeper trees when using random forests than when using a single decision tree. One tree is more likely to overfit than a random forest (because of the variance reduction from averaging multiple trees in the forest).
The next parameters generally do not require tuning. However, they can be tuned to speed up training.
(3) subsamplingRate: this parameter specifies the size of the dataset used for training each tree in the forest, as a fraction of the size of the original dataset. The default (1.0) is recommended, but decreasing this fraction can speed up training.
(4) featureSubsetStrategy: number of features to use as candidates for splitting at each tree node. The number is specified as a fraction or function of the total number of features. Decreasing this number will speed up training, but can sometimes impact performance if too low (a sketch showing both of these knobs follows below).
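The RDD-based trainClassifier used later in this post does not accept a subsamplingRate argument, but the DataFrame-based spark.ml API exposes both knobs. A minimal sketch follows, with illustrative values (0.5 and "sqrt" are assumptions, not recommendations):

from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("RandomForestSpeedKnobs").getOrCreate()
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
train, test = data.randomSplit([0.7, 0.3])

rf = RandomForestClassifier(
    numTrees=100,
    maxDepth=8,
    subsamplingRate=0.5,           # each tree trains on half the data: faster
    featureSubsetStrategy="sqrt")  # consider sqrt(#features) candidates per split
model = rf.fit(train)
model.transform(test).select("label", "prediction").show(5)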
"""Random Forest classification Example.""" from __future__ Importprint_function fromPysparkImportSparkcontext#$example on$ fromPyspark.mllib.treeImportRandomforest, Randomforestmodel fromPyspark.mllib.utilImportmlutils#$example off$if __name__=="__main__": SC= Sparkcontext (appname="Pythonrandomforestclassificationexample") #$example on$ #Load and parse the data file into an RDD of labeledpoint.data = Mlutils.loadlibsvmfile (SC,'Data/mllib/sample_libsvm_data.txt') #Split the data into training and test sets (30% held out for testing)(Trainingdata, TestData) = Data.randomsplit ([0.7, 0.3]) #Train a randomforest model. #Empty Categoricalfeaturesinfo indicates all features is continuous. #note:use larger numtrees in practice. #Setting featuresubsetstrategy= "Auto" lets the algorithm choose.Model = Randomforest.trainclassifier (Trainingdata, numclasses=2, categoricalfeaturesinfo={}, Numtrees=3, featuresubsetstrategy="Auto", impurity='Gini', maxdepth=4, maxbins=32) #Evaluate model on test instances and compute test errorpredictions = Model.predict (Testdata.map (Lambdax:x.features)) Labelsandpredictions= Testdata.map (LambdaLp:lp.label). zip (predictions) Testerr= Labelsandpredictions.filter (Lambda(V, p): V! = p). Count ()/Float (testdata.count ())Print('Test Error ='+str (testerr))Print('learned classification forest model:') Print(Model.todebugstring ())#Save and load ModelModel.save (SC,"Target/tmp/myrandomforestclassificationmodel") Samemodel= Randomforestmodel.load (SC,"Target/tmp/myrandomforestclassificationmodel") #$example off$
Model looks like:
TreeEnsembleModel classifier with 3 trees

  Tree 0:
    If (feature 511 <= 0.0)
     If (feature 434 <= 0.0)
      Predict: 0.0
     Else (feature 434 > 0.0)
      Predict: 1.0
    Else (feature 511 > 0.0)
     Predict: 0.0
  Tree 1:
    If (feature 490 <= 31.0)
     Predict: 0.0
    Else (feature 490 > 31.0)
     Predict: 1.0
  Tree 2:
    If (feature 302 <= 0.0)
     If (feature 461 <= 0.0)
      If (feature 208 <= 107.0)
       Predict: 1.0
      Else (feature 208 > 107.0)
       Predict: 0.0
     Else (feature 461 > 0.0)
      Predict: 1.0
    Else (feature 302 > 0.0)
     Predict: 0.0
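As a short follow-up usage sketch, the saved model can be reloaded in a separate job and used to score a single point. The path matches the save call in the example above; the all-zeros vector and the 692-feature width of the sample dataset are only for illustration:

from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.tree import RandomForestModel

sc = SparkContext(appName="LoadForestSketch")
model = RandomForestModel.load(sc, "target/tmp/myRandomForestClassificationModel")
# Score one dense feature vector (the sample libsvm data has 692 features).
print(model.predict(Vectors.dense([0.0] * 692)))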