Gradient-Boosted Tree Regression (GBDT): Algorithm Principles and Spark MLlib Invocation Examples (Scala/Java/Python)


Gradient-Boosted Tree Regression

Introduction to the algorithm:

Gradient-boosted trees (GBTs) are an ensemble algorithm built from decision trees. The algorithm minimizes a loss function by iteratively training decision trees. Like decision trees, gradient-boosted trees handle categorical features, extend to the multiclass classification setting, and do not require feature scaling. The spark.ml implementation builds on the existing decision tree implementation.

Gradient-boosted trees train a sequence of decision trees iteratively. On each iteration, the algorithm uses the current ensemble to predict the label of each training instance and compares the prediction with the true label. The dataset is then re-labeled to put more emphasis on instances with poor predictions, so the decision tree trained in the next iteration corrects the previous errors.
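To make the boosting loop concrete, here is a minimal conceptual sketch in Scala of boosting with squared error. It is not the spark.ml implementation, and `fitRegressionTree` is a hypothetical stand-in for a base tree learner:

// Conceptual sketch only; `fitRegressionTree` is a hypothetical helper
// standing in for a base tree learner, not a spark.ml API.
case class Instance(features: Array[Double], label: Double)

def boost(
    data: Seq[Instance],
    maxIter: Int,
    stepSize: Double,
    fitRegressionTree: Seq[Instance] => (Array[Double] => Double)
): Array[Double] => Double = {
  var ensemble: Array[Double] => Double = _ => 0.0   // start from the zero model
  for (_ <- 1 to maxIter) {
    // For squared error, the negative gradient is just the residual,
    // so "re-labeling" means fitting the next tree to the residuals.
    val residuals = data.map(x => Instance(x.features, x.label - ensemble(x.features)))
    val tree = fitRegressionTree(residuals)
    val prev = ensemble
    ensemble = f => prev(f) + stepSize * tree(f)      // shrink each tree's contribution
  }
  ensemble
}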

The re-labeling mechanism is defined by a loss function. On each iteration, the gradient-boosted tree further reduces the value of the loss function on the training data. spark.ml provides one loss function for classification (log loss) and two loss functions for regression (squared error and absolute error).
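As a brief illustration, the regression losses are selected on GBTRegressor via setLossType, which accepts "squared" (the default) or "absolute":

import org.apache.spark.ml.regression.GBTRegressor

// Squared error (L2) is the default for regression;
// absolute error (L1) is less sensitive to outliers.
val gbtSquared = new GBTRegressor().setLossType("squared")
val gbtAbsolute = new GBTRegressor().setLossType("absolute")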

spark.ml supports gradient-boosted trees for binary classification and for regression, using both continuous and categorical features.

* Note: Gradient-boosted trees do not currently support multiclass classification.

Parameters:

checkpointInterval:

Type: Integer type.

Meaning: Checkpointing interval (>= 1), or -1 to disable checkpointing.

featuresCol:

Type: String type.

Meaning: Features column name.

impurity:

Type: String type.

Meaning: Criterion used to compute information gain (case-insensitive).

labelCol:

Type: String type.

Meaning: Label column name.

lossType:

Type: String type.

Meaning: Loss function type.

maxBins:

Type: Integer type.

Meaning: Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node.

maxDepth:

Type: Integer type.

Meaning: Maximum depth of the tree (>= 0).

maxIter:

Type: Integer type.

Meaning: Maximum number of iterations (>= 0).

minInfoGain:

Type: Double type.

Meaning: Minimum information gain required to split a node.

minInstancesPerNode:

Type: Integer type.

Meaning: Minimum number of instances each child must have after a split.

predictionCol:

Type: String type.

Meaning: Prediction column name.

seed:

Type: Long type.

Meaning: Random seed.

subsamplingRate:

Type: Double type.

Meaning: Fraction of the training data used for learning each decision tree, in the range (0, 1].

stepSize:

Type: Double type.

Meaning: Step size (i.e., learning rate) in the range (0, 1] for shrinking the contribution of each estimator.
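As a quick sketch of how these parameters map onto the spark.ml API (the values below are illustrative only, not tuned recommendations):

import org.apache.spark.ml.regression.GBTRegressor

// Illustrative settings; good values depend on the dataset.
val gbt = new GBTRegressor()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setLossType("squared")      // "squared" or "absolute"
  .setMaxIter(20)              // number of boosting iterations
  .setMaxDepth(5)              // maximum depth of each tree
  .setMaxBins(32)              // bins for discretizing continuous features
  .setMinInstancesPerNode(1)   // minimum instances per child after a split
  .setMinInfoGain(0.0)         // minimum gain required to split a node
  .setStepSize(0.1)            // learning rate in (0, 1]
  .setSubsamplingRate(0.8)     // fraction of training data per tree
  .setSeed(42L)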

Example:

In the following example, GBTRegressor runs for only a small number of iterations, which would not be realistic in practice.

Scala:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.regression.{GBTRegressionModel, GBTRegressor}

// Load and parse the data file, converting it to a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a GBT model.
val gbt = new GBTRegressor()
  .setLabelCol("label")
  .setFeaturesCol("indexedFeatures")
  .setMaxIter(10)

// Chain indexer and GBT in a Pipeline.
val pipeline = new Pipeline().setStages(Array(featureIndexer, gbt))

// Train model. This also runs the indexer.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

// Select (prediction, true label) and compute test error.
val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)
println("Root Mean Squared Error (RMSE) on test data = " + rmse)

val gbtModel = model.stages(1).asInstanceOf[GBTRegressionModel]
println("Learned regression GBT model:\n" + gbtModel.toDebugString)

Java:

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.evaluation.RegressionEvaluator;
import org.apache.spark.ml.feature.VectorIndexer;
import org.apache.spark.ml.feature.VectorIndexerModel;
import org.apache.spark.ml.regression.GBTRegressionModel;
import org.apache.spark.ml.regression.GBTRegressor;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Load and parse the data file, converting it to a DataFrame.
Dataset<Row> data = spark.read().format("libsvm").load("data/mllib/sample_libsvm_data.txt");

// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
VectorIndexerModel featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data);

// Split the data into training and test sets (30% held out for testing).
Dataset<Row>[] splits = data.randomSplit(new double[] {0.7, 0.3});
Dataset<Row> trainingData = splits[0];
Dataset<Row> testData = splits[1];

// Train a GBT model.
GBTRegressor gbt = new GBTRegressor()
  .setLabelCol("label")
  .setFeaturesCol("indexedFeatures")
  .setMaxIter(10);

// Chain indexer and GBT in a Pipeline.
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] {featureIndexer, gbt});

// Train model. This also runs the indexer.
PipelineModel model = pipeline.fit(trainingData);

// Make predictions.
Dataset<Row> predictions = model.transform(testData);

// Select example rows to display.
predictions.select("prediction", "label", "features").show(5);

// Select (prediction, true label) and compute test error.
RegressionEvaluator evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse");
double rmse = evaluator.evaluate(predictions);
System.out.println("Root Mean Squared Error (RMSE) on test data = " + rmse);

GBTRegressionModel gbtModel = (GBTRegressionModel) (model.stages()[1]);
System.out.println("Learned regression GBT model:\n" + gbtModel.toDebugString());

Python:

from pyspark.ml import Pipeline
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator

# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer = \
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing).
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a GBT model.
gbt = GBTRegressor(featuresCol="indexedFeatures", maxIter=10)

# Chain indexer and GBT in a Pipeline.
pipeline = Pipeline(stages=[featureIndexer, gbt])

# Train model. This also runs the indexer.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

# Select (prediction, true label) and compute test error.
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

gbtModel = model.stages[1]
print(gbtModel)  # summary only

