Cross-Validation Principle and Spark MLlib Usage Example (Scala/Java/Python)

Source: Internet
Author: User
Tags: pyspark, hadoop, mapreduce, spark, mllib

Cross-validation

Method idea:

CrossValidator splits the dataset into several subsets, which are used in turn for training and testing. With k = 3, CrossValidator produces 3 (training, test) pairs: each model is trained on 2/3 of the data and tested on the remaining 1/3. For each candidate parameter map, CrossValidator averages the evaluation metric of the models trained on the 3 different (training, test) pairs. Once the best parameter map has been determined, CrossValidator finally refits the Estimator on the entire dataset using that best parameter map.
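To make the procedure concrete, here is a minimal hand-rolled sketch of 3-fold cross-validation for a single parameter setting, in Scala. It illustrates the idea only and is not CrossValidator's actual implementation; it assumes a hypothetical DataFrame `data` with "features" and "label" columns and uses randomSplit to make three roughly equal folds.

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Split the (assumed) DataFrame `data` into three roughly equal folds.
val folds = data.randomSplit(Array(1.0, 1.0, 1.0), seed = 42)
val evaluator = new BinaryClassificationEvaluator()  // default metric: areaUnderROC

// For each fold i: train on the other two folds, evaluate on fold i.
val metrics = folds.indices.map { i =>
  val test = folds(i)
  val train = folds.indices.filter(_ != i).map(folds(_)).reduce(_ union _)
  val model = new LogisticRegression().setRegParam(0.1).fit(train)
  evaluator.evaluate(model.transform(test))
}

// The score for this parameter setting is the average over the 3 folds.
val avgMetric = metrics.sum / metrics.size

// CrossValidator repeats this loop for every parameter map in the grid,
// picks the map with the best average score, and finally refits on all of `data`.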

Example:

Note that cross-validating over a parameter grid is expensive. In the example below, the grid has 3 values for hashingTF.numFeatures and 2 values for lr.regParam, and CrossValidator uses 2 folds, so (3 * 2) * 2 = 12 different models must be trained. In realistic settings there are usually more parameters to tune and more folds are used (k = 3 and k = 10 are common), so the cost of CrossValidator can be very high. Nevertheless, compared with heuristic hand-tuning, cross-validation remains a statistically sound and widely used method for choosing parameters.
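Spelling out that arithmetic (a minimal Scala sketch; `cv` refers to the CrossValidator built in the examples below, and the setParallelism call is only available on Spark 2.3 and later):

// Models trained = (grid size) x (number of folds):
//   3 numFeatures values x 2 regParam values = 6 parameter settings
//   6 settings x 2 folds = 12 model fits (plus 1 final refit on all the data)
val numFits = 3 * 2 * 2  // = 12

// On Spark 2.3+, independent model fits can be evaluated in parallel:
// cv.setParallelism(2)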

Scala:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.Row

// Prepare training data from a list of (id, text, label) tuples.
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0), (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0), (3L, "hadoop mapreduce", 0.0),
  (4L, "b spark who", 1.0), (5L, "g d a y", 0.0),
  (6L, "spark fly", 1.0), (7L, "was mapreduce", 0.0),
  (8L, "e spark program", 1.0), (9L, "a e c l", 0.0),
  (10L, "spark compile", 1.0), (11L, "hadoop software", 0.0)
)).toDF("id", "text", "label")

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// We use a ParamGridBuilder to construct a grid of parameters to search over.
// With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
// this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

// We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
// This will allow us to jointly choose parameters for all Pipeline stages.
// A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
// Note that the evaluator here is a BinaryClassificationEvaluator and its default
// metric is areaUnderROC.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2)  // Use 3+ in practice

// Run cross-validation, and choose the best set of parameters.
val cvModel = cv.fit(training)

// Prepare test documents, which are unlabeled (id, text) tuples.
val test = spark.createDataFrame(Seq(
  (4L, "spark i j k"), (5L, "l m n"),
  (6L, "mapreduce spark"), (7L, "apache hadoop")
)).toDF("id", "text")

// Make predictions on test documents. cvModel uses the best model found (lrModel).
cvModel.transform(test)
  .select("id", "text", "probability", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
    println(s"($id, $text) --> prob=$prob, prediction=$prediction")
  }
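Once cross-validation has run, it is often worth inspecting which parameter setting won. A short follow-on sketch in Scala, continuing from the example above and reusing its `cvModel`:

import org.apache.spark.ml.PipelineModel

// Average metric (areaUnderROC) for each parameter map in the grid.
cvModel.getEstimatorParamMaps
  .zip(cvModel.avgMetrics)
  .foreach { case (params, metric) => println(s"$params -> $metric") }

// bestModel is the PipelineModel refit on all of the training data;
// its last stage is the fitted LogisticRegressionModel.
val bestPipeline = cvModel.bestModel.asInstanceOf[PipelineModel]
println(bestPipeline.stages.last.extractParamMap())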
Java:

import java.util.Arrays;

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.ml.param.ParamMap;
import org.apache.spark.ml.tuning.CrossValidator;
import org.apache.spark.ml.tuning.CrossValidatorModel;
import org.apache.spark.ml.tuning.ParamGridBuilder;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// JavaLabeledDocument (id, text, label) and JavaDocument (id, text) are the simple
// bean classes that accompany this example in the Spark codebase.

// Prepare training documents, which are labeled.
Dataset<Row> training = spark.createDataFrame(Arrays.asList(
  new JavaLabeledDocument(0L, "a b c d e spark", 1.0),
  new JavaLabeledDocument(1L, "b d", 0.0),
  new JavaLabeledDocument(2L, "spark f g h", 1.0),
  new JavaLabeledDocument(3L, "hadoop mapreduce", 0.0),
  new JavaLabeledDocument(4L, "b spark who", 1.0),
  new JavaLabeledDocument(5L, "g d a y", 0.0),
  new JavaLabeledDocument(6L, "spark fly", 1.0),
  new JavaLabeledDocument(7L, "was mapreduce", 0.0),
  new JavaLabeledDocument(8L, "e spark program", 1.0),
  new JavaLabeledDocument(9L, "a e c l", 0.0),
  new JavaLabeledDocument(10L, "spark compile", 1.0),
  new JavaLabeledDocument(11L, "hadoop software", 0.0)
), JavaLabeledDocument.class);

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
Tokenizer tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words");
HashingTF hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol())
  .setOutputCol("features");
LogisticRegression lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01);
Pipeline pipeline = new Pipeline()
  .setStages(new PipelineStage[] {tokenizer, hashingTF, lr});

// We use a ParamGridBuilder to construct a grid of parameters to search over.
// With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
// this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
ParamMap[] paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures(), new int[] {10, 100, 1000})
  .addGrid(lr.regParam(), new double[] {0.1, 0.01})
  .build();

// We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
// This will allow us to jointly choose parameters for all Pipeline stages.
// A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
// Note that the evaluator here is a BinaryClassificationEvaluator and its default
// metric is areaUnderROC.
CrossValidator cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2);  // Use 3+ in practice

// Run cross-validation, and choose the best set of parameters.
CrossValidatorModel cvModel = cv.fit(training);

// Prepare test documents, which are unlabeled.
Dataset<Row> test = spark.createDataFrame(Arrays.asList(
  new JavaDocument(4L, "spark i j k"),
  new JavaDocument(5L, "l m n"),
  new JavaDocument(6L, "mapreduce spark"),
  new JavaDocument(7L, "apache hadoop")
), JavaDocument.class);

// Make predictions on test documents. cvModel uses the best model found (lrModel).
Dataset<Row> predictions = cvModel.transform(test);
for (Row r : predictions.select("id", "text", "probability", "prediction").collectAsList()) {
  System.out.println("(" + r.get(0) + ", " + r.get(1) + ") --> prob=" + r.get(2)
    + ", prediction=" + r.get(3));
}
Python:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Prepare training documents, which are labeled.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
    (8, "e spark program", 1.0),
    (9, "a e c l", 0.0),
    (10, "spark compile", 1.0),
    (11, "hadoop software", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
# This will allow us to jointly choose parameters for all Pipeline stages.
# A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
# this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)

# Prepare test documents, which are unlabeled.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "mapreduce spark"),
    (7, "apache hadoop")
], ["id", "text"])

# Make predictions on test documents. cvModel uses the best model found (lrModel).
prediction = cvModel.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    print(row)