Introduction and Application of Spark MLlib 02: Pipeline

Source: Internet
Author: User

Contents: key concepts in Pipeline; Pipeline components (Transformers, Estimators); Parameters; saving and loading a Pipeline; Pipeline applications (Example 1 and Example 2).

A typical machine learning process usually includes source data ETL, data preprocessing, feature extraction, model training and cross-validation, prediction on new data, and so on. This is clearly pipelined work with multiple steps: the data starts at collection and passes through several stages before producing the output we need.

When the target dataset needs to go through several processing steps, or when predictions on new data must combine multiple individually trained models (the idea behind ensemble learning), building the program directly on the lower-level MLlib API makes its structure complex and difficult to understand and implement.

After the release of Spark 1.2, ML Pipeline, a new library for building complex machine learning workflow applications, was introduced. It has evolved over several versions and has by now become stable and easy to use.

Key concepts in Pipeline

Spark ML Pipeline was inspired by the scikit-learn project and draws lessons from the shortcomings of MLlib in dealing with complex machine learning problems. It is designed to give users a higher-level, DataFrame-based API that makes it easier to build complex machine learning workflow applications.

Structurally, a Pipeline is composed of one or more PipelineStages, and each PipelineStage performs one task, such as dataset transformation, model training, parameter setting, or prediction. Spark ML provides corresponding PipelineStage definitions and implementations for these different kinds of processing problems.

In fact, similar pipeline concepts appear in other popular open source projects: deep learning frameworks such as TensorFlow, Keras, and Theano chain different network layers together, and Oozie chains different types of jobs together.

DataFrame: The ML API uses this concept from Spark SQL as its ML dataset; a DataFrame can hold multiple data types. For example, different columns can store text, feature vectors, true labels, and prediction results.
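
As a minimal sketch of this (assuming an existing SparkSession named spark), one DataFrame can hold a text column, a feature-vector column, and a label column side by side:

import org.apache.spark.ml.linalg.Vectors

// A minimal sketch, assuming a SparkSession named `spark`.
val df = spark.createDataFrame(Seq(
  ("spark is great", Vectors.dense(1.0, 0.1), 1.0),
  ("hadoop mapreduce", Vectors.dense(0.0, 2.3), 0.0)
)).toDF("text", "features", "label")
df.printSchema()  // string, vector, and double columns in one dataset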

Transformer: A converter and a kind of PipelineStage (its implementations inherit from the PipelineStage class), used mainly to convert one DataFrame into another DataFrame. For example, a trained model is a Transformer, because it can turn a test DataFrame that does not contain a prediction label into another DataFrame that does, and such a result set can then be used to visualize the analysis results.

Estimator: An evaluator or fitter, typically used in a Pipeline to operate on DataFrame data and produce a Transformer. For example, the random forest algorithm is an Estimator, because it can be trained on feature data to obtain a random forest model. Estimator implementations also inherit from the PipelineStage class.

Parameter: Used to set parameters for a Transformer or an Estimator.

Pipeline: A Pipeline chains multiple Transformers and Estimators together to form an ML workflow.
To build a Pipeline, we first define its individual PipelineStages, such as feature extraction and transformation, and model training. With these Transformers and Estimators that handle specific problems, we can arrange the PipelineStages in the order of the processing logic and create a Pipeline, for example val pipeline = new Pipeline().setStages(Array(stage1, stage2, stage3, ...)). We can then take the training dataset as input and call the fit method of the Pipeline instance to start streaming the source training data through the stages. This call returns a PipelineModel instance, which is a Transformer and is used to predict labels for test data.
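
A minimal sketch of this flow follows (assuming a SparkSession named spark); Example 2 below expands on it with saving, loading, and prediction:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// A minimal sketch, assuming a SparkSession named `spark`.
val training = spark.createDataFrame(Seq(
  (0L, "a b c spark", 1.0),
  (1L, "d e f", 0.0)
)).toDF("id", "text", "label")

// Define the individual stages.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

// Arrange the stages in processing order and create the Pipeline.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
// fit() streams the training data through the stages and returns a PipelineModel (a Transformer).
val model = pipeline.fit(training)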

Pipeline components

Transformers

Transformer is an abstraction that covers both feature transformers and learned models. Either way, it must implement the method transform(), which converts one DataFrame into another, usually by appending one or more columns.

For example:

A feature transformer reads a column (for example, text), maps it to a new column (for example, feature vectors), and outputs a new DataFrame that is the original DataFrame with the newly mapped column appended.

A learned model reads a DataFrame (containing feature vectors), predicts the label for each feature vector, and outputs a new DataFrame that carries the prediction results.
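
A minimal sketch of the feature-transformer case (assuming a SparkSession named spark), using Tokenizer, which appends a words column to the input DataFrame:

import org.apache.spark.ml.feature.Tokenizer

// A minimal sketch, assuming a SparkSession named `spark`.
val df = spark.createDataFrame(Seq(
  (0L, "spark ml pipeline"),
  (1L, "hadoop mapreduce")
)).toDF("id", "text")

// Tokenizer is a feature Transformer: transform() reads "text" and appends a new "words" column.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val tokenized = tokenizer.transform(df)
tokenized.show(false)   // original columns plus the new "words" column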

Estimators

This is an abstraction of a learning algorithm, or of any algorithm that fits or trains on data. An Estimator must implement the method fit(), which accepts a DataFrame and produces a Model, which is a Transformer. For example, LogisticRegression is an Estimator; calling fit() trains a LogisticRegressionModel, and this output model is a Transformer.
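
A minimal sketch of the LogisticRegression case (assuming a SparkSession named spark; Example 1 below walks through this in full):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

// A minimal sketch, assuming a SparkSession named `spark`.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1)),
  (0.0, Vectors.dense(2.0, 1.0))
)).toDF("label", "features")

val lr = new LogisticRegression().setMaxIter(10)   // LogisticRegression is an Estimator
val lrModel = lr.fit(training)                     // fit() returns a LogisticRegressionModel (a Transformer)
val scored = lrModel.transform(training)           // the model appends prediction columns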

MLlib's Estimators and Transformers use a unified API for managing parameters.

Parameters

A ParamMap is a set of (parameter, value) pairs.

There are two ways to pass parameters to an algorithm:

1. Set parameters on the instance. For example, if lr is an instance of LogisticRegression, then lr.setMaxIter(10) tells lr to use at most 10 iterations when fit() is invoked.

2. Pass a ParamMap to fit() or transform(). Any parameter supplied in the ParamMap overrides the value previously set through the setter methods.

For example, if lr1 and lr2 are two LogisticRegression instances, we can build a single ParamMap that assigns values to both of their parameters: ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20). See the code examples below.
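
A minimal sketch of this, assuming two independent LogisticRegression instances:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

// A minimal sketch: one ParamMap can hold parameters for several instances at once.
val lr1 = new LogisticRegression()
val lr2 = new LogisticRegression()
val paramMap = ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)
// When paramMap is passed to lr1.fit(training, paramMap), only the entries that
// belong to lr1 (here lr1.maxIter -> 10) take effect for that fit.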

Saving and loading a Pipeline

There are many situations in which you need to save a model or a pipeline to disk for later use. The vast majority of basic Transformers and basic ML models support saving and loading, but you should consult the API documentation of the specific algorithm to confirm whether it is supported.
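
A minimal sketch, assuming model is a fitted PipelineModel and the path (a placeholder here) is writable:

import org.apache.spark.ml.PipelineModel

// A minimal sketch, assuming `model` is a fitted PipelineModel.
model.write.overwrite().save("/tmp/my-pipeline-model")        // persist to disk
val reloaded = PipelineModel.load("/tmp/my-pipeline-model")   // load back, e.g. in production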

Pipeline applications

Example 1

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Row

// Prepare training data from a list of (label, features) tuples.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

// Create a LogisticRegression instance. This instance is an Estimator.
val lr = new LogisticRegression()
// Print out the parameters, documentation, and any default values.
println(s"LogisticRegression parameters:\n${lr.explainParams()}\n")

// We may set parameters using setter methods.
lr.setMaxIter(10)
  .setRegParam(0.01)

// Learn a LogisticRegression model. This uses the parameters stored in lr.
val model1 = lr.fit(training)
// Since model1 is a Model (i.e., a Transformer produced by an Estimator),
// we can view the parameters it used during fit(). This prints the
// (name: value) pairs, where the names are unique IDs for this LogisticRegression instance.
println(s"Model 1 was fit using parameters: ${model1.parent.extractParamMap}")

// We may alternatively specify parameters using a ParamMap,
// which supports several methods for specifying parameters.
val paramMap = ParamMap(lr.maxIter -> 20)
  .put(lr.maxIter, 30)                            // Specify 1 Param. This overwrites the original maxIter.
  .put(lr.regParam -> 0.1, lr.threshold -> 0.55)  // Specify multiple Params.

// One can also combine ParamMaps.
val paramMap2 = ParamMap(lr.probabilityCol -> "myProbability")  // Change the output column name.
val paramMapCombined = paramMap ++ paramMap2

// Now learn a new model using the paramMapCombined parameters.
// paramMapCombined overrides all parameters set earlier via lr.set* methods.
val model2 = lr.fit(training, paramMapCombined)
println(s"Model 2 was fit using parameters: ${model2.parent.extractParamMap}")

// Prepare test data.
val test = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(-1.0, 1.5, 1.3)),
  (0.0, Vectors.dense(3.0, 2.0, -0.1)),
  (1.0, Vectors.dense(0.0, 2.2, -1.5))
)).toDF("label", "features")

// Make predictions on test data using the Transformer.transform() method.
// LogisticRegression.transform will only use the 'features' column.
// Note that model2.transform() outputs a 'myProbability' column instead of the usual
// 'probability' column because we renamed the lr.probabilityCol parameter previously.
model2.transform(test)
  .select("features", "label", "myProbability", "prediction")
  .collect()
  .foreach { case Row(features: Vector, label: Double, prob: Vector, prediction: Double) =>
    println(s"($features, $label) -> prob=$prob, prediction=$prediction")
  }

Example 2

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Prepare training documents from a list of (id, text, label) tuples.
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

// Fit the pipeline to the training documents.
val model = pipeline.fit(training)

// Now we can optionally save the fitted pipeline to disk.
model.write.overwrite().save("/tmp/spark-logistic-regression-model")

// We can also save this unfit pipeline to disk.
pipeline.write.overwrite().save("/tmp/unfit-lr-model")

// And load it back in during production.
val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")

// Prepare test documents, which are unlabeled (id, text) tuples.
val test = spark.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "spark hadoop spark"),
  (7L, "apache hadoop")
)).toDF("id", "text")

// Make predictions on the test documents.
model.transform(test)
  .select("id", "text", "probability", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
    println(s"($id, $text) --> prob=$prob, prediction=$prediction")
  }
