Gradient-Boosted Tree Regression
Introduction to the algorithm:
A gradient-boosted tree (GBT) is an ensemble algorithm built from decision trees. It minimizes a loss function by iteratively training decision trees. Like single decision trees, GBTs handle categorical features and do not require feature scaling. spark.ml implements GBTs on top of its existing decision tree implementation.
A GBT trains a sequence of decision trees, one per iteration. On each iteration, the algorithm uses the current ensemble to predict the label of every training instance and compares the predictions with the true labels. The training set is then relabeled so that instances with poor predictions receive more emphasis; as a result, the decision tree trained on the next iteration corrects the mistakes of the previous ones.
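The iterative mechanism described above can be sketched in a few lines of plain Python. This is an illustration only, not Spark's implementation: it uses squared-error loss (where the negative gradient is simply the residual) and one-split "stumps" as the weak learners, and all function names here are made up for the sketch.

```python
# Minimal gradient boosting for regression (illustration, not Spark's code).
# Weak learner: a one-split decision stump fit to the current residuals.

def fit_stump(xs, residuals):
    """Find the threshold split on xs that best fits residuals (least squares)."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gbt_fit(xs, ys, n_iter=20, step_size=0.5):
    pred = [0.0] * len(xs)
    trees = []
    for _ in range(n_iter):
        # For squared error, the negative gradient is just the residual,
        # so each new tree is fit to what the ensemble still gets wrong.
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        trees.append(stump)
        pred = [p + step_size * stump(x) for p, x in zip(pred, xs)]
    return lambda x: sum(step_size * t(x) for t in trees)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.1, 0.9, 3.8, 4.1, 4.0]
model = gbt_fit(xs, ys)
```

After enough iterations the ensemble's predictions on the training points approach the true labels, which is exactly the "correct the previous error" behavior described above.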
The precise mechanism for relabeling instances is defined by the loss function. On each iteration, the GBT further reduces the value of this loss function on the training data. spark.ml provides one loss function for classification problems (log loss) and two loss functions for regression problems (squared error and absolute error).
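For reference, the three per-instance losses named above can be written out directly. The forms below are the standard ones (with labels in {-1, +1} and the raw ensemble margin as the prediction for log loss); treat them as a sketch rather than a transcription of Spark's internals.

```python
import math

def squared_error(y, pred):
    # Regression loss: (label - prediction)^2
    return (y - pred) ** 2

def absolute_error(y, pred):
    # Regression loss: |label - prediction|
    return abs(y - pred)

def log_loss(y, pred):
    # Classification loss with y in {-1, +1} and pred the raw margin F(x):
    # 2 * log(1 + exp(-2 * y * F(x)))
    return 2.0 * math.log(1.0 + math.exp(-2.0 * y * pred))
```

Each loss shrinks as the prediction approaches the label, and its gradient with respect to the prediction is what the next tree in the ensemble is trained to reduce.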
spark.ml supports GBTs for binary classification and for regression, handling both continuous and categorical features.
* Note: GBTs do not currently support multiclass classification.
Parameters:
checkpointInterval:
Type: integer.
Meaning: Checkpoint interval (>= 1), or -1 to disable checkpointing.
featuresCol:
Type: string.
Meaning: Features column name.
impurity:
Type: string.
Meaning: Criterion used for information gain calculation (case-insensitive).
labelCol:
Type: string.
Meaning: Label column name.
lossType:
Type: string.
Meaning: Loss function type (case-insensitive).
maxBins:
Type: integer.
Meaning: Maximum number of bins used for discretizing continuous features and for choosing candidate splits at each node.
maxDepth:
Type: integer.
Meaning: Maximum depth of the tree (>= 0).
maxIter:
Type: integer.
Meaning: Maximum number of iterations (>= 0).
minInfoGain:
Type: double.
Meaning: Minimum information gain for a split to be considered at a tree node.
minInstancesPerNode:
Type: integer.
Meaning: Minimum number of instances each child must have after a split.
predictionCol:
Type: string.
Meaning: Prediction result column name.
seed:
Type: long.
Meaning: Random seed.
subsamplingRate:
Type: double.
Meaning: Fraction of the training data used for learning each decision tree, in range (0, 1].
stepSize:
Type: double.
Meaning: Step size (learning rate) used at each iteration of optimization, in range (0, 1].
Example:
In the following example, GBTRegressor actually needs only a single iteration for this dataset, but that will not be true in general.
Scala:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.regression.{GBTRegressionModel, GBTRegressor}

// Load and parse the data file, converting it to a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a GBT model.
val gbt = new GBTRegressor()
  .setLabelCol("label")
  .setFeaturesCol("indexedFeatures")
  .setMaxIter(10)

// Chain indexer and GBT in a Pipeline.
val pipeline = new Pipeline().setStages(Array(featureIndexer, gbt))

// Train model. This also runs the indexer.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

// Select (prediction, true label) and compute test error.
val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)
println("Root Mean Squared Error (RMSE) on test data = " + rmse)

val gbtModel = model.stages(1).asInstanceOf[GBTRegressionModel]
println("Learned regression GBT model:\n" + gbtModel.toDebugString)
Java:
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.evaluation.RegressionEvaluator;
import org.apache.spark.ml.feature.VectorIndexer;
import org.apache.spark.ml.feature.VectorIndexerModel;
import org.apache.spark.ml.regression.GBTRegressionModel;
import org.apache.spark.ml.regression.GBTRegressor;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Load and parse the data file, converting it to a DataFrame.
Dataset<Row> data = spark.read().format("libsvm").load("data/mllib/sample_libsvm_data.txt");

// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
VectorIndexerModel featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data);

// Split the data into training and test sets (30% held out for testing).
Dataset<Row>[] splits = data.randomSplit(new double[] {0.7, 0.3});
Dataset<Row> trainingData = splits[0];
Dataset<Row> testData = splits[1];

// Train a GBT model.
GBTRegressor gbt = new GBTRegressor()
  .setLabelCol("label")
  .setFeaturesCol("indexedFeatures")
  .setMaxIter(10);

// Chain indexer and GBT in a Pipeline.
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] {featureIndexer, gbt});

// Train model. This also runs the indexer.
PipelineModel model = pipeline.fit(trainingData);

// Make predictions.
Dataset<Row> predictions = model.transform(testData);

// Select example rows to display.
predictions.select("prediction", "label", "features").show(5);

// Select (prediction, true label) and compute test error.
RegressionEvaluator evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse");
double rmse = evaluator.evaluate(predictions);
System.out.println("Root Mean Squared Error (RMSE) on test data = " + rmse);

GBTRegressionModel gbtModel = (GBTRegressionModel) (model.stages()[1]);
System.out.println("Learned regression GBT model:\n" + gbtModel.toDebugString());
Python:
from pyspark.ml import Pipeline
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator

# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer = \
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing).
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a GBT model.
gbt = GBTRegressor(featuresCol="indexedFeatures", maxIter=10)

# Chain indexer and GBT in a Pipeline.
pipeline = Pipeline(stages=[featureIndexer, gbt])

# Train model. This also runs the indexer.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

# Select (prediction, true label) and compute test error.
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

gbtModel = model.stages[1]
print(gbtModel)  # summary only