Gradient-boosted trees (GBDT): algorithm principles and Spark MLlib usage examples (Scala/Java/Python)

Source: Internet
Author: User
Tags: pyspark, spark, mllib

Gradient-Boosted Trees

Introduction to the algorithm:

Gradient-boosted trees are ensembles of decision trees. The algorithm trains decision trees iteratively in order to minimize a loss function. Like decision trees, gradient-boosted trees handle categorical features, extend in principle to multiclass classification, and do not require feature scaling. spark.ml implements them on top of its existing decision tree implementation.

Gradient boosting trains a sequence of decision trees, one after another. In each iteration, the algorithm uses the current ensemble to predict the label of each training instance and compares the prediction with the true label. The dataset is then re-labeled to put more emphasis on instances that were predicted poorly, so that the decision tree trained in the next iteration helps correct the previous mistakes.
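The iteration described above can be sketched in plain Python. This is a minimal illustration and not Spark's implementation: it assumes a single numeric feature, squared-error loss, and one-split trees (stumps), and every function name in it (fit_stump, predict_stump, boost) is made up for the example.

```python
# Minimal gradient boosting sketch for squared-error regression on one feature.
# All names are illustrative; none of this is Spark API.

def fit_stump(xs, targets):
    """Return (threshold, left_value, right_value) with the lowest squared error."""
    best = None
    # Try every distinct value except the largest as a split point.
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, num_iterations=10, step_size=0.1):
    """Train an ensemble; return (initial_prediction, stumps, loss_history)."""
    initial = sum(ys) / len(ys)
    predictions = [initial] * len(xs)
    stumps, history = [], []
    for _ in range(num_iterations):
        # Re-label: the new target of each instance is its current residual,
        # which is the negative gradient of squared error up to a constant.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Add the new tree to the ensemble, shrunk by the step size.
        predictions = [p + step_size * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
        mse = sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)
        history.append(mse)
    return initial, stumps, history


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]
initial, stumps, history = boost(xs, ys, num_iterations=20, step_size=0.5)
print("training MSE after first and last iteration:", history[0], history[-1])
```

Instances with large residuals dominate the fit of the next stump, which is the sense in which poorly predicted instances receive more emphasis.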

The mechanism for re-labeling instances is defined by the loss function. With each iteration, gradient-boosted trees further reduce the value of this loss function on the training data. spark.ml provides one loss function for classification (log loss) and two for regression (squared error and absolute error).
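To make the three losses concrete, here is a small plain-Python sketch of each loss for a single instance, together with its gradient with respect to the prediction. The formulas follow the loss definitions given in the Spark MLlib ensembles guide (log loss uses labels in {-1, +1}); the function names are invented for this example, and the choice of 0 as the absolute-error gradient at the kink is an assumption of the sketch, not something taken from Spark.

```python
import math


def log_loss(label, prediction):
    """Log loss for a label in {-1, +1} and a real-valued raw prediction."""
    return 2.0 * math.log1p(math.exp(-2.0 * label * prediction))


def log_loss_gradient(label, prediction):
    return -4.0 * label / (1.0 + math.exp(2.0 * label * prediction))


def squared_error(label, prediction):
    return (label - prediction) ** 2


def squared_error_gradient(label, prediction):
    return -2.0 * (label - prediction)


def absolute_error(label, prediction):
    return abs(label - prediction)


def absolute_error_gradient(label, prediction):
    # Subgradient of |label - prediction|; 0 at the kink is this sketch's choice.
    if prediction > label:
        return 1.0
    if prediction < label:
        return -1.0
    return 0.0


# In each boosting iteration the next tree is fitted to the negative gradient.
print(-log_loss_gradient(1.0, 0.0), -squared_error_gradient(2.0, 0.5))
```

For squared error the negative gradient is proportional to the residual, so fitting the next tree to it amounts to fitting the remaining error.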

spark.ml supports gradient-boosted trees for binary classification and for regression, with both continuous and categorical features.

* Note: Gradient-boosted trees do not currently support multiclass classification.

Parameters:

checkpointInterval:

Type: Integer.

Meaning: Checkpoint interval (>= 1), or -1 to disable checkpointing.

featuresCol:

Type: String.

Meaning: Name of the features column.

impurity:

Type: String.

Meaning: Criterion used to compute information gain (case-insensitive).

labelCol:

Type: String.

Meaning: Name of the label column.

lossType:

Type: String.

Meaning: The loss function that the algorithm minimizes.

maxBins:

Type: Integer.

Meaning: Maximum number of bins used to discretize continuous features and to choose how to split on features at each node.

maxDepth:

Type: Integer.

Meaning: Maximum depth of each tree (>= 0).

maxIter:

Type: Integer.

Meaning: Maximum number of iterations (>= 0).

minInfoGain:

Type: Double.

Meaning: Minimum information gain required to split a node.

minInstancesPerNode:

Type: Integer.

Meaning: Minimum number of instances each child node must contain after a split.

predictionCol:

Type: String.

Meaning: Name of the prediction column.

rawPredictionCol:

Type: String.

Meaning: Name of the raw prediction column.

seed:

Type: Long.

Meaning: Random seed.

subsamplingRate:

Type: Double.

Meaning: Fraction of the training data used to learn each decision tree, in the range (0, 1].

stepSize:

Type: Double.

Meaning: Step size (learning rate) applied in each iteration of the optimization.
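As an illustration of what the maxBins parameter controls, the following plain-Python sketch discretizes one continuous feature into at most max_bins bins by picking split thresholds at approximate quantiles. It is only meant to make the idea concrete: Spark's actual split-finding differs in detail, and the helper names candidate_thresholds and bin_index are hypothetical.

```python
def candidate_thresholds(values, max_bins):
    """Pick at most max_bins - 1 split thresholds for a continuous feature."""
    distinct = sorted(set(values))
    if len(distinct) <= max_bins:
        # Few distinct values: use every midpoint between neighbours.
        return [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]
    ordered = sorted(values)
    thresholds = []
    for i in range(1, max_bins):
        # Value at the i-th of max_bins equal-frequency cut points.
        value = ordered[i * len(ordered) // max_bins]
        if not thresholds or value > thresholds[-1]:
            thresholds.append(value)
    return thresholds


def bin_index(value, thresholds):
    """Index of the bin a value falls into (values <= threshold go left)."""
    return sum(1 for t in thresholds if value > t)


feature = [float(v) for v in range(100)]
thresholds = candidate_thresholds(feature, max_bins=4)
print(thresholds, bin_index(10.0, thresholds), bin_index(99.0, thresholds))
```

A larger number of bins lets the trees consider more split candidates and make finer-grained splits, at the price of more computation.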

Example:

The following examples load a dataset in LIBSVM format and split it into a training set and a test set. The model is trained on the first part and evaluated on the held-out part. Before training, two feature transformers prepare the data: they index the label and the categorical features, adding metadata to the DataFrame that the tree-based algorithm can use.

Scala:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

// `spark` is an existing SparkSession (for example, the one provided by spark-shell).
// Load and parse the data file, converting it to a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)

// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a GBT model (10 iterations, matching the Java and Python examples).
val gbt = new GBTClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setMaxIter(10)

// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Chain indexers and GBT in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, gbt, labelConverter))

// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)

// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println("Test Error = " + (1.0 - accuracy))

// Stage 2 of the pipeline is the trained GBT model.
val gbtModel = model.stages(2).asInstanceOf[GBTClassificationModel]
println("Learned classification GBT model:\n" + gbtModel.toDebugString)

Java:

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.GBTClassificationModel;
import org.apache.spark.ml.classification.GBTClassifier;
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;
import org.apache.spark.ml.feature.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// `spark` is an existing SparkSession.
// Load and parse the data file, converting it to a DataFrame.
Dataset<Row> data = spark
  .read()
  .format("libsvm")
  .load("data/mllib/sample_libsvm_data.txt");

// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
StringIndexerModel labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data);

// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
VectorIndexerModel featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data);

// Split the data into training and test sets (30% held out for testing).
Dataset<Row>[] splits = data.randomSplit(new double[] {0.7, 0.3});
Dataset<Row> trainingData = splits[0];
Dataset<Row> testData = splits[1];

// Train a GBT model.
GBTClassifier gbt = new GBTClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setMaxIter(10);

// Convert indexed labels back to original labels.
IndexToString labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels());

// Chain indexers and GBT in a Pipeline.
Pipeline pipeline = new Pipeline()
  .setStages(new PipelineStage[] {labelIndexer, featureIndexer, gbt, labelConverter});

// Train model. This also runs the indexers.
PipelineModel model = pipeline.fit(trainingData);

// Make predictions.
Dataset<Row> predictions = model.transform(testData);

// Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5);

// Select (prediction, true label) and compute test error.
MulticlassClassificationEvaluator evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy");
double accuracy = evaluator.evaluate(predictions);
System.out.println("Test Error = " + (1.0 - accuracy));

// Stage 2 of the pipeline is the trained GBT model.
GBTClassificationModel gbtModel = (GBTClassificationModel) (model.stages()[2]);
System.out.println("Learned classification GBT model:\n" + gbtModel.toDebugString());

Python:

from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# `spark` is an existing SparkSession (for example, the one provided by the pyspark shell).
# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing).
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a GBT model.
gbt = GBTClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxIter=10)

# Chain indexers and GBT in a Pipeline.
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, gbt])

# Train model. This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error.
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))

# Stage 2 of the pipeline is the trained GBT model.
gbtModel = model.stages[2]
print(gbtModel)  # summary only

