The upcoming Apache Spark 2.0 release will provide machine learning model persistence, that is, the ability to save and load ML models. Model persistence makes the following three kinds of machine learning scenarios easier:
Data scientists develop an ML model and hand it over to an engineering team for release in the production environment;
A data engineer integrates a model-training workflow developed in Python into a model-serving workflow developed in Java;
A data scientist creates many jobs that train ML models, which need to be saved and evaluated later.
Spark MLlib will provide a DataFrame-based API to support ML persistence. The following three sections give an overview, code examples, and the finer details of the MLlib persistence API.
Overview
Key features of ML persistence:
Supports all of Spark's development languages: Scala, Java, Python, and R;
The DataFrame-based API supports almost all ML algorithms;
Supports both single ML models and multi-stage ML Pipelines;
Stores models in a distributed, exchangeable format.
API
In Apache Spark 2.0, MLlib provides a DataFrame-based API for saving and loading models, with functions that work much like the Spark data source APIs seen in previous articles.
We use a classic machine learning example, handwritten digit recognition with the MNIST database (MNIST contains labeled images of the handwritten digits 0 through 9), to demonstrate the ML model save and load functions. The trained model is then used to identify other handwritten digits; for the complete example code, see the notebook: loading data, training the model, and saving and loading the model.
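For orientation, here is a minimal sketch of the data-loading step; the file path is an assumption, and MNIST is commonly distributed in the LIBSVM text format, which Spark can read directly:
# Sketch: load MNIST (LIBSVM format) as a DataFrame with "label" and "features" columns
training = sqlContext.read.format("libsvm").load("/data/mnist/mnist_train.libsvm")
training.show(3)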
Save and load a single model
We first show how to save and load the same single model using different programming languages. We use Python to train and save a random forest classifier, and then use Scala to load the same model back.
training = sqlContext.read...  # data: features, label
rf = RandomForestClassifier(numTrees=20)
model = rf.fit(training)
You can simply invoke the save method to save the trained ML model and then load it back with the load method.
model.save("myModelPath")
sameModel = RandomForestClassificationModel.load("myModelPath")
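Once loaded, the model behaves exactly like the original and can be used for prediction right away; a quick sketch, where test is an assumed DataFrame with the same schema as training:
# Sketch: the reloaded model predicts just like the original
predictions = sameModel.transform(test)  # adds a "prediction" column
predictions.select("label", "prediction").show(5)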
You can also load that same ML model (saved using Python) into a Scala or Java application.
// Load the model in Scala
val sameModel = RandomForestClassificationModel.load("myModelPath")
This works both for small, local models (e.g., the common clustering model K-means) and for large, distributed models trained on massive data (e.g., the common recommendation model ALS). The loaded model contains the same parameter settings and model data, so it returns the same predictions even if it is loaded in a different Spark deployment.
Saving and loading a multi-stage pipeline
So far we have only saved and loaded a single ML model. In practice, an ML workflow includes many stages, from feature extraction and transformation to model fitting and tuning. MLlib provides Pipelines to help users build these workflows.
MLlib also lets users save and load an entire Pipeline. Let's look at how this is done with an example pipeline that includes:
Feature extraction: binarizing the image data to 0 and 1 (black and white);
Model fitting: a random forest classifier reads the image data and predicts the digits 0 to 9;
Tuning: cross-validation to tune the depth of the trees. Here is the code:
// Construct the Pipeline: Binarizer + Random Forest
val pipeline = new Pipeline().setStages(Array(binarizer, rf))
// Wrap the Pipeline in a CrossValidator to do model tuning.
val cv = new CrossValidator().setEstimator(pipeline)...
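For readers following along in Python, here is a hedged sketch of the same construction; the threshold, column names, depth grid, and evaluator are illustrative assumptions rather than the notebook's exact settings:
from pyspark.ml import Pipeline
from pyspark.ml.feature import Binarizer
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Binarize the pixel features to black/white, then classify with a random forest
binarizer = Binarizer(threshold=0.5, inputCol="features", outputCol="binFeatures")
rf = RandomForestClassifier(featuresCol="binFeatures", labelCol="label")
pipeline = Pipeline(stages=[binarizer, rf])

# Wrap the Pipeline in a CrossValidator that searches over tree depth
grid = ParamGridBuilder().addGrid(rf.maxDepth, [4, 8, 12]).build()
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=MulticlassClassificationEvaluator())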
Before fitting this pipeline, we show that we can save the entire ML workflow. The workflow can later be loaded onto another Spark cluster.
cv.save("myCVPath")
val sameCV = CrossValidator.load("myCVPath")
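Note that this snippet is Scala; from Python, the unfitted Pipeline itself can be saved and loaded the same way (a sketch, with CrossValidator being the Python exception discussed below):
# Sketch (Python): persist the unfitted Pipeline
# (saving the CrossValidator itself from Python is not supported in 2.0; see below)
pipeline.save("myPipelinePath")
samePipeline = Pipeline.load("myPipelinePath")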
Finally, we fit the pipeline, save the resulting model, and load it back later. What is saved below includes the feature extraction step, the random forest model tuned by cross-validation, and the statistics collected during model tuning.
val cvModel = cv.fit(training)
cvModel.save("myCVModelPath")
val sameCVModel = CrossValidatorModel.load("myCVModelPath")
A detail worth knowing: Python tuning
The one piece missing in Spark 2.0 is Python tuning: Python does not support saving and loading CrossValidator and TrainValidationSplit, which are used for model hyperparameter tuning; this will be implemented in Spark 2.1 (SPARK-13786). However, Python can still save the results of CrossValidator and TrainValidationSplit. For example, we can use cross-validation to tune a random forest model and then save the tuned model.
# Define the workflow
rf = RandomForestClassifier()
cv = CrossValidator(estimator=rf, ...)
# Fit the model, running cross-validation
cvModel = cv.fit(trainingData)
# Extract the results, i.e., the best random forest model
bestModel = cvModel.bestModel
# Save the RandomForest model
bestModel.save("rfModelPath")
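The saved model can be reloaded later; a short sketch using the (hypothetical) path from the example above:
# Sketch: reload the tuned random forest saved above
from pyspark.ml.classification import RandomForestClassificationModel
sameBestModel = RandomForestClassificationModel.load("rfModelPath")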
Exchangeable storage format
Internally, we store the model metadata and parameters as JSON and the data as Parquet. These storage formats are exchangeable and can be read by other development libraries. Parquet allows us to store both small models (for example, a Naive Bayes classifier) and large, distributed models (for example, ALS). The storage path can be any URI supported by Dataset/DataFrame save and load, such as S3, local storage, and so on.
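Because these formats are standard, you can inspect a saved model directly; a hedged sketch, assuming the layout used by MLlib's built-in writers (a metadata subdirectory holding JSON and a data subdirectory holding Parquet):
# Sketch: peek inside a saved model directory (layout as assumed above)
metadata = sqlContext.read.json("myModelPath/metadata")
metadata.show(truncate=False)  # class name, uid, parameter map, Spark version, ...
modelData = sqlContext.read.parquet("myModelPath/data")
modelData.printSchema()  # the learned model data, e.g., tree nodes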
Cross-language compatibility
Machine learning models can be saved and loaded freely among Scala, Java, and Python, with two limitations for R. First, not all MLlib models support R, so not every model trained in another language can be loaded in R. Second, models saved from R use an R-specific format, so they are not easily used from other languages.
Conclusion
With the upcoming release of Spark 2.0, the DataFrame-based MLlib API will provide nearly complete persistence for models and ML pipelines. Persistence of machine learning models matters for teamwork, for ML workflows that span multiple programming languages, and for moving models into production. The DataFrame-based MLlib API will eventually become Spark's primary machine learning API.