Gradient-Boosted Tree Regression (GBDT): Algorithm Principles and Spark MLlib Invocation Examples (Scala/Java/Python)


Gradient-Boosted Tree Regression

Introduction to the algorithm:

Gradient-boosted trees (GBTs) are an ensemble algorithm built from decision trees. The algorithm minimizes a loss function by iteratively training decision trees. Like decision trees, gradient-boosted trees handle categorical features, extend to the multiclass classification setting, and do not require feature scaling. The spark.ml implementation builds on the existing decision tree implementation.

Gradient-boosted trees train a sequence of decision trees iteratively. On each iteration, the algorithm uses the current ensemble to predict the label of each training instance and compares the prediction with the true label. The dataset is then re-labeled to put more emphasis on instances with poor predictions, so the decision tree trained in the next iteration corrects the previous errors.
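To make the boosting loop concrete, here is a minimal conceptual sketch in Scala of boosting with squared error. It is not the spark.ml implementation, and `fitRegressionTree` is a hypothetical stand-in for a base tree learner:

// Conceptual sketch only; `fitRegressionTree` is a hypothetical helper
// standing in for a base tree learner, not a spark.ml API.
case class Instance(features: Array[Double], label: Double)

def boost(
    data: Seq[Instance],
    maxIter: Int,
    stepSize: Double,
    fitRegressionTree: Seq[Instance] => (Array[Double] => Double)
): Array[Double] => Double = {
  var ensemble: Array[Double] => Double = _ => 0.0   // start from the zero model
  for (_ <- 1 to maxIter) {
    // For squared error, the negative gradient is just the residual,
    // so "re-labeling" means fitting the next tree to the residuals.
    val residuals = data.map(x => Instance(x.features, x.label - ensemble(x.features)))
    val tree = fitRegressionTree(residuals)
    val prev = ensemble
    ensemble = f => prev(f) + stepSize * tree(f)      // shrink each tree's contribution
  }
  ensemble
}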

The re-labeling mechanism is defined by a loss function. On each iteration, the gradient-boosted tree further reduces the value of the loss function on the training data. spark.ml provides one loss function for classification (log loss) and two loss functions for regression (squared error and absolute error).
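As a brief illustration, the regression losses are selected on GBTRegressor via setLossType, which accepts "squared" (the default) or "absolute":

import org.apache.spark.ml.regression.GBTRegressor

// Squared error (L2) is the default for regression;
// absolute error (L1) is less sensitive to outliers.
val gbtSquared = new GBTRegressor().setLossType("squared")
val gbtAbsolute = new GBTRegressor().setLossType("absolute")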

spark.ml supports gradient-boosted trees for binary classification and for regression, using both continuous and categorical features.

* Note: Gradient-boosted trees do not currently support multiclass classification.

Parameters:

checkpointInterval:

Type: Integer type.

Meaning: Checkpointing interval (>= 1), or -1 to disable checkpointing.

featuresCol:

Type: String type.

Meaning: Features column name.

impurity:

Type: String type.

Meaning: Criterion used to compute information gain (case-insensitive).

labelCol:

Type: String type.

Meaning: Label column name.

lossType:

Type: String type.

Meaning: Loss function type.

maxBins:

Type: Integer type.

Meaning: Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node.

maxDepth:

Type: Integer type.

Meaning: Maximum depth of the tree (>= 0).

maxIter:

Type: Integer type.

Meaning: Maximum number of iterations (>= 0).

minInfoGain:

Type: Double type.

Meaning: Minimum information gain required to split a node.

minInstancesPerNode:

Type: Integer type.

Meaning: Minimum number of instances each child must have after a split.

predictionCol:

Type: String type.

Meaning: Prediction column name.

seed:

Type: Long type.

Meaning: Random seed.

subsamplingRate:

Type: Double type.

Meaning: Fraction of the training data used for learning each decision tree, in the range (0, 1].

stepSize:

Type: Double type.

Meaning: Step size (i.e., learning rate) in the range (0, 1] for shrinking the contribution of each estimator.
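As a quick sketch of how these parameters map onto the spark.ml API (the values below are illustrative only, not tuned recommendations):

import org.apache.spark.ml.regression.GBTRegressor

// Illustrative settings; good values depend on the dataset.
val gbt = new GBTRegressor()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setLossType("squared")      // "squared" or "absolute"
  .setMaxIter(20)              // number of boosting iterations
  .setMaxDepth(5)              // maximum depth of each tree
  .setMaxBins(32)              // bins for discretizing continuous features
  .setMinInstancesPerNode(1)   // minimum instances per child after a split
  .setMinInfoGain(0.0)         // minimum gain required to split a node
  .setStepSize(0.1)            // learning rate in (0, 1]
  .setSubsamplingRate(0.8)     // fraction of training data per tree
  .setSeed(42L)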

Example:

In the following example, GBTRegressor runs for only a small number of iterations, which would not be realistic in practice.

Scala:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.regression.{GBTRegressionModel, GBTRegressor}

// Load and parse the data file, converting it to a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a GBT model.
val gbt = new GBTRegressor()
  .setLabelCol("label")
  .setFeaturesCol("indexedFeatures")
  .setMaxIter(10)

// Chain indexer and GBT in a Pipeline.
val pipeline = new Pipeline().setStages(Array(featureIndexer, gbt))

// Train model. This also runs the indexer.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

// Select (prediction, true label) and compute test error.
val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)
println("Root Mean Squared Error (RMSE) on test data = " + rmse)

val gbtModel = model.stages(1).asInstanceOf[GBTRegressionModel]
println("Learned regression GBT model:\n" + gbtModel.toDebugString)

Java:

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.evaluation.RegressionEvaluator;
import org.apache.spark.ml.feature.VectorIndexer;
import org.apache.spark.ml.feature.VectorIndexerModel;
import org.apache.spark.ml.regression.GBTRegressionModel;
import org.apache.spark.ml.regression.GBTRegressor;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Load and parse the data file, converting it to a DataFrame.
Dataset<Row> data = spark.read().format("libsvm").load("data/mllib/sample_libsvm_data.txt");

// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
VectorIndexerModel featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data);

// Split the data into training and test sets (30% held out for testing).
Dataset<Row>[] splits = data.randomSplit(new double[] {0.7, 0.3});
Dataset<Row> trainingData = splits[0];
Dataset<Row> testData = splits[1];

// Train a GBT model.
GBTRegressor gbt = new GBTRegressor()
  .setLabelCol("label")
  .setFeaturesCol("indexedFeatures")
  .setMaxIter(10);

// Chain indexer and GBT in a Pipeline.
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] {featureIndexer, gbt});

// Train model. This also runs the indexer.
PipelineModel model = pipeline.fit(trainingData);

// Make predictions.
Dataset<Row> predictions = model.transform(testData);

// Select example rows to display.
predictions.select("prediction", "label", "features").show(5);

// Select (prediction, true label) and compute test error.
RegressionEvaluator evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse");
double rmse = evaluator.evaluate(predictions);
System.out.println("Root Mean Squared Error (RMSE) on test data = " + rmse);

GBTRegressionModel gbtModel = (GBTRegressionModel) (model.stages()[1]);
System.out.println("Learned regression GBT model:\n" + gbtModel.toDebugString());

Python:

from pyspark.ml import Pipeline
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator

# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer = \
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing).
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a GBT model.
gbt = GBTRegressor(featuresCol="indexedFeatures", maxIter=10)

# Chain indexer and GBT in a Pipeline.
pipeline = Pipeline(stages=[featureIndexer, gbt])

# Train model. This also runs the indexer.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

# Select (prediction, true label) and compute test error.
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

gbtModel = model.stages[1]
print(gbtModel)  # summary only

