Principles of the Random Forest Algorithm and Spark MLlib Invocation Examples (Scala/Java/Python)

Tags: pyspark, spark, mllib

Random forest classifier:

Introduction to the algorithm:

Random forest is an ensemble algorithm built from decision trees. A random forest combines multiple decision trees to reduce the risk of overfitting. Like individual trees, random forests are easy to interpret, handle categorical features, extend naturally to multiclass classification, and do not require feature scaling.

A random forest trains its decision trees separately, so the training process can run in parallel. Injecting randomness into training makes each decision tree slightly different; merging the predictions of the individual trees then reduces the variance of the overall prediction and improves performance on the test set.
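
As a quick intuition for the variance claim, here is a minimal plain-Python simulation (not from the original article): averaging several noisy, weakly correlated estimates yields a lower-variance estimate than any single one.

import random

random.seed(42)
TRUE_VALUE = 10.0

def tree_prediction():
    # One tree's prediction: the true value plus independent noise.
    return TRUE_VALUE + random.gauss(0, 2.0)

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Variance of a single tree vs. the average of a 10-tree "forest".
single = [tree_prediction() for _ in range(1000)]
forest = [sum(tree_prediction() for _ in range(10)) / 10 for _ in range(1000)]

print("single-tree variance: %.2f" % variance(single))      # ~4.0
print("10-tree average variance: %.2f" % variance(forest))  # ~0.4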

The randomness is introduced in two ways:
1. On each iteration, the original data is bootstrap-sampled (sampled with replacement) to obtain a different training set for each tree.

2. At each tree node, a different random subset of the features is considered when choosing the split.

Apart from these two points, training proceeds exactly as for an individual decision tree (see the sketch below).
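
A minimal plain-Python sketch of these two sources of randomness (the helper names are hypothetical; in Spark MLlib they correspond to the subsamplingRate and featureSubsetStrategy parameters):

import random

def bootstrap_sample(rows, rate=1.0):
    # Per-tree training set: sample with replacement from the original data.
    n = int(len(rows) * rate)
    return [random.choice(rows) for _ in range(n)]

def candidate_features(num_features, strategy="sqrt"):
    # Random subset of feature indices considered at one tree node.
    if strategy == "sqrt":      # a common default for classification
        k = max(1, int(num_features ** 0.5))
    elif strategy == "all":
        k = num_features
    else:
        raise ValueError("unknown strategy: " + strategy)
    return random.sample(range(num_features), k)

rows = [([i, i * 2], i % 2) for i in range(100)]  # toy (features, label) pairs
print(len(bootstrap_sample(rows, rate=0.8)))      # 80 rows, drawn with replacement
print(candidate_features(num_features=16))        # e.g. 4 of 16 feature indices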

When predicting on a new example, the random forest must combine the predictions of its decision trees, and classification and regression aggregate slightly differently. Classification uses majority voting: each decision tree votes for one class, and the class with the most votes is the final result. In regression, each tree predicts a real number, and the final prediction is the average of the per-tree predictions.
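
A minimal plain-Python sketch of the two aggregation rules (illustrative only, not Spark code):

from collections import Counter

def forest_classify(tree_votes):
    # Majority vote: the class predicted by the most trees wins.
    return Counter(tree_votes).most_common(1)[0][0]

def forest_regress(tree_values):
    # Average of the per-tree real-valued predictions.
    return sum(tree_values) / len(tree_values)

print(forest_classify(["cat", "dog", "cat"]))  # -> cat
print(forest_regress([2.0, 3.0, 4.0]))         # -> 3.0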

spark.ml supports random forests for binary classification, multiclass classification, and regression, handling both continuous and categorical features.
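
Both variants are available as estimators. A minimal pyspark sketch (the column names shown are the library defaults):

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.regression import RandomForestRegressor

# Classification: majority vote over trees.
clf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=20)
# Regression: average over trees.
reg = RandomForestRegressor(labelCol="label", featuresCol="features", numTrees=20)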

Parameters:

checkpointInterval:

Type: integer.

Meaning: Checkpoint interval (>= 1), or -1 to disable checkpointing.

featureSubsetStrategy:

Type: string.

Meaning: The number of candidate features to consider at each split (e.g. "auto", "all", "sqrt", "log2", "onethird").

featuresCol:

Type: string.

Meaning: Features column name.

impurity:

Type: string.

Meaning: Criterion used to compute information gain (case-insensitive); for classification, "gini" or "entropy".

labelCol:

Type: string.

Meaning: Label column name.

maxBins:

Type: integer.

Meaning: Maximum number of bins used to discretize continuous features, which also bounds the number of candidate splits considered per feature at each node.

maxDepth:

Type: integer.

Meaning: Maximum depth of each tree (>= 0).

minInfoGain:

Type: double.

Meaning: Minimum information gain required for a split at a tree node.

minInstancesPerNode:

Type: integer.

Meaning: Minimum number of training instances each child must have after a split.

numTrees:

Type: integer.

Meaning: Number of trees to train.

predictionCol:

Type: string.

Meaning: Prediction column name.

probabilityCol:

Type: string.

Meaning: Column name for predicted class conditional probabilities.

rawPredictionCol:

Type: string.

Meaning: Raw prediction (confidence) column name.

seed:

Type: long.

Meaning: Random seed.

subsamplingRate:

Type: double.

Meaning: Fraction of the training data used to learn each decision tree, in the range (0, 1].

thresholds:

Type: double array.

Meaning: Thresholds in multiclass classification used to adjust the probability of predicting each class.
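
For concreteness, here is a minimal sketch of setting several of these parameters when constructing a classifier (pyspark; the values shown are illustrative, not recommendations):

from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(
    labelCol="indexedLabel",
    featuresCol="indexedFeatures",
    numTrees=20,                   # number of trees to train
    maxDepth=5,                    # maximum depth of each tree
    maxBins=32,                    # bins for discretizing continuous features
    minInstancesPerNode=1,         # minimum instances per child after a split
    minInfoGain=0.0,               # minimum gain required to split
    subsamplingRate=1.0,           # fraction of the data used per tree
    featureSubsetStrategy="auto",  # candidate features per split
    impurity="gini",               # information gain criterion
    seed=42,                       # random seed
)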

Example:

The following examples load data in LIBSVM format and split it into training and test sets: 70% of the data is used for training and the remaining 30% for evaluation. Before training, two preprocessing transformers are applied: a StringIndexer indexes the labels and a VectorIndexer indexes the categorical features, adding metadata to the DataFrame that the tree-based algorithms use.

Scala:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

// Load and parse the data file, converting it to a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)
// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a RandomForest model.
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)

// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Chain indexers and forest in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))

// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)

// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println("Test Error = " + (1.0 - accuracy))

val rfModel = model.stages(2).asInstanceOf[RandomForestClassificationModel]
println("Learned classification forest model:\n" + rfModel.toDebugString)
Java:

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.RandomForestClassificationModel;
import org.apache.spark.ml.classification.RandomForestClassifier;
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;
import org.apache.spark.ml.feature.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
// Load and parse the data file, converting it to a DataFrame.
Dataset<Row> data = spark.read().format("libsvm").load("data/mllib/sample_libsvm_data.txt");

// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
StringIndexerModel labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data);
// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
VectorIndexerModel featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data);

// Split the data into training and test sets (30% held out for testing).
Dataset<Row>[] splits = data.randomSplit(new double[] {0.7, 0.3});
Dataset<Row> trainingData = splits[0];
Dataset<Row> testData = splits[1];

// Train a RandomForest model.
RandomForestClassifier rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures");

// Convert indexed labels back to original labels.
IndexToString labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels());

// Chain indexers and forest in a Pipeline.
Pipeline pipeline = new Pipeline()
  .setStages(new PipelineStage[] {labelIndexer, featureIndexer, rf, labelConverter});

// Train model. This also runs the indexers.
PipelineModel model = pipeline.fit(trainingData);

// Make predictions.
Dataset<Row> predictions = model.transform(testData);

// Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5);

// Select (prediction, true label) and compute test error.
MulticlassClassificationEvaluator evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy");
double accuracy = evaluator.evaluate(predictions);
System.out.println("Test Error = " + (1.0 - accuracy));

RandomForestClassificationModel rfModel = (RandomForestClassificationModel) (model.stages()[2]);
System.out.println("Learned classification forest model:\n" + rfModel.toDebugString());
Python:

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer = \
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing).
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=10)

# Chain indexers and forest in a Pipeline.
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf])

# Train model. This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error.
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))

rfModel = model.stages[2]
print(rfModel)  # summary only

