Spark ML Model Pipelines on Distributed Deep Neural Nets
This notebook describes how to build machine learning pipelines with Spark ML for distributed versions of Keras deep learning models. As data set we use the Otto Product Classification challenge from Kaggle. The reason we chose this data set is that it is small and very structured. This way, we can focus more on the technical components rather than on preprocessing. Also, users with slow hardware or without a full-blown Spark cluster should be able to run this example locally, and still learn a lot about the distributed mode.
Often, the need to distribute computation is not imposed by model training, but rather by building the data pipeline, i.e. ingestion, transformation etc. In training, deep neural networks tend to do fairly well on one or more GPUs on a single machine. Most of the time, using gradient descent methods, you will process one batch after another anyway. Even so, it may still be beneficial to use frameworks like Spark to integrate your models with your surrounding infrastructure. On top of that, the convenience provided by Spark ML pipelines can be very valuable (being syntactically very close to what you might know from scikit-learn).
TL;DR: We will show how to tackle a classification problem using distributed deep neural nets and Spark ML pipelines in an example that is essentially a distributed version of the one found here.
Using this notebook
As we are going to use Elephas, you will need access to a running Spark context to run this notebook. If you don't have one already, install Spark locally by following the instructions provided here. Make sure to also export SPARK_HOME to your path and start your IPython/Jupyter notebook as follows:
IPYTHON_OPTS="notebook" ${SPARK_HOME}/bin/pyspark --driver-memory 4G elephas/examples/Spark_ML_Pipeline.ipynb
To test your environment, try to print the Spark context (provided as sc), i.e. execute the following cell.
from __future__ import print_function
print(sc)
<pyspark.context.SparkContext object at 0x1132d61d0>
Otto Product Classification Data
Training and test data is available here. Go ahead and download the data. Inspecting it, you will see that the provided CSV files consist of an id column and 93 integer feature columns. train.csv has an additional column for labels, which test.csv is missing. The challenge is to accurately predict test labels. For the rest of this notebook, we will assume the data is stored at data_path, which you should modify below as needed.
data_path = "./"  # <-- Make sure to adapt this to where your CSV files are.
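If you want to convince yourself of the file layout before involving Spark, a quick peek at the header with plain Python is enough. This check is an optional addition; the column names mentioned in the comment are those of the standard Kaggle download.
# Optional: inspect the raw CSV header (id, feat_1 ... feat_93, target in the Kaggle files).
with open(data_path + "train.csv") as f:
    header = f.readline().strip().split(",")
print("number of columns:", len(header))
print(header[:3], "...", header[-2:])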
Loading data is relatively simple, but we have to take care of a few things. First, while you can shuffle the rows of an RDD, it is generally not very efficient. But since the data in train.csv is sorted by category, we'll have to shuffle it in order to make the model perform well. This is what the function shuffle_csv below is for. Next, we read in plain text in load_data_frame, split the lines by comma and convert the features to a float vector type. Also, note that the last column in train.csv represents the category, which has a Class_ prefix.
Defining Data Frames
Spark has a few core data structures, among them the data frame, which is a distributed version of the named columnar data structure many will know from either R or Pandas. We need a so-called SQLContext and an optional column-to-names mapping to create a data frame from scratch.
from pyspark.sql import SQLContext
from pyspark.mllib.linalg import Vectors
import numpy as np
import random

sql_context = SQLContext(sc)

def shuffle_csv(csv_file):
    lines = open(csv_file).readlines()
    random.shuffle(lines)
    open(csv_file, 'w').writelines(lines)

def load_data_frame(csv_file, shuffle=True, train=True):
    if shuffle:
        shuffle_csv(csv_file)
    data = sc.textFile(data_path + csv_file)  # This is an RDD, which will later be transformed into a data frame
    data = data.filter(lambda x: x.split(',')[0] != 'id').map(lambda line: line.split(','))
    if train:
        data = data.map(lambda line: (Vectors.dense(np.asarray(line[1:-1]).astype(np.float32)),
                                      str(line[-1])))
    else:
        # Test data gets dummy labels. We need the same structure as in train data.
        data = data.map(lambda line: (Vectors.dense(np.asarray(line[1:]).astype(np.float32)),
                                      "Class_1"))
    return sql_context.createDataFrame(data, ['features', 'category'])
Let's load both train and test data and print a few rows of data using the convenient show method.
train_df = load_data_frame("train.csv")
test_df = load_data_frame("test.csv", shuffle=False, train=False)  # No need to shuffle test data

print("Train data frame:")
train_df.show(10)

print("Test data frame (note the dummy category):")
test_df.show(10)
Train data frame:
+--------------------+--------+
|            features|category|
+--------------------+--------+
|[0.0,0.0,0.0,0.0,...| Class_8|
|[0.0,0.0,0.0,0.0,...| Class_8|
|[0.0,0.0,0.0,0.0,...| Class_2|
|[0.0,1.0,0.0,1.0,...| Class_6|
|[0.0,0.0,0.0,0.0,...| Class_9|
|[0.0,0.0,0.0,0.0,...| Class_2|
|[0.0,0.0,0.0,0.0,...| Class_2|
|[0.0,0.0,0.0,0.0,...| Class_3|
|[0.0,0.0,4.0,0.0,...| Class_8|
|[0.0,0.0,0.0,0.0,...| Class_7|
+--------------------+--------+
only showing top 10 rows

Test data frame (note the dummy category):
+--------------------+--------+
|            features|category|
+--------------------+--------+
|[1.0,0.0,0.0,1.0,...| Class_1|
|[0.0,1.0,13.0,1.0...| Class_1|
|[0.0,0.0,1.0,1.0,...| Class_1|
|[0.0,0.0,0.0,0.0,...| Class_1|
|[2.0,0.0,5.0,1.0,...| Class_1|
|[0.0,0.0,0.0,0.0,...| Class_1|
|[0.0,0.0,0.0,0.0,...| Class_1|
|[0.0,0.0,0.0,1.0,...| Class_1|
|[0.0,0.0,0.0,0.0,...| Class_1|
|[0.0,0.0,0.0,0.0,...| Class_1|
+--------------------+--------+
only showing top 10 rows
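By the way, since train.csv ships sorted by category (the reason for the shuffle above), it can be worth taking a quick look at how the classes are distributed. This check is an optional addition; groupBy and count are standard data frame operations.
# Optional: class distribution of the training data (exact counts depend on your download).
train_df.groupBy("category").count().show()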
Preprocessing: Defining Transformers
Up until now, we have basically just read in raw data. Luckily, Spark ML has quite a few preprocessing features available, so the only thing we will ever have to do is define transformations of data frames.
To proceed, we will first transform category strings to double values. This is done by a so-called StringIndexer. Note that we carry out the actual transformation here already, but that is just for demonstration purposes. All we really need is to define string_indexer to put it into a pipeline later on.
from pyspark.ml.feature import StringIndexer

string_indexer = StringIndexer(inputCol="category", outputCol="index_category")
fitted_indexer = string_indexer.fit(train_df)
indexed_df = fitted_indexer.transform(train_df)
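It can be handy to see which double value the indexer assigned to which category string, since the model's predictions will later refer to these indices. A small optional check on the transformed frame from above:
# Optional: show the learned category-to-index mapping.
indexed_df.select("category", "index_category").distinct().orderBy("index_category").show()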
Next, it's good practice to normalize the features, which is done with a StandardScaler.
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withStd=True, withMean=True)
fitted_scaler = scaler.fit(indexed_df)
scaled_df = fitted_scaler.transform(indexed_df)
Print ("The result of indexing and scaling.") Each transformation adds new columns to the data frame: ")
Scaled_df.show (10)
The result of indexing and scaling. Each transformation adds new columns to the data frame:
+--------------------+--------+--------------+--------------------+
|            features|category|index_category|     scaled_features|
+--------------------+--------+--------------+--------------------+
|[0.0,0.0,0.0,0.0,...| Class_8|           2.0|[-0.2535060296260...|
|[0.0,0.0,0.0,0.0,...| Class_8|           2.0|[-0.2535060296260...|
|[0.0,0.0,0.0,0.0,...| Class_2|           0.0|[-0.2535060296260...|
|[0.0,1.0,0.0,1.0,...| Class_6|           1.0|[-0.2535060296260...|
|[0.0,0.0,0.0,0.0,...| Class_9|           4.0|[-0.2535060296260...|
|[0.0,0.0,0.0,0.0,...| Class_2|           0.0|[-0.2535060296260...|
|[0.0,0.0,0.0,0.0,...| Class_2|           0.0|[-0.2535060296260...|
|[0.0,0.0,0.0,0.0,...| Class_3|           3.0|[-0.2535060296260...|
|[0.0,0.0,4.0,0.0,...| Class_8|           2.0|[-0.2535060296260...|
|[0.0,0.0,0.0,0.0,...| Class_7|           5.0|[-0.2535060296260...|
+--------------------+--------+--------------+--------------------+
only showing top 10 rows
Keras Deep Learning model
Now that we have a data frame with processed features and labels, let's define a deep neural net that we can use to address the classification problem. Chances are you came here because you know a thing or two about deep learning. If so, the model below will look very straightforward to you. We build a Keras model by choosing a set of three consecutive Dense layers with dropout and ReLU activations. There are certainly much better architectures for this problem out there, but we really just want to demonstrate the general flow here.
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.utils import np_utils, generic_utils

nb_classes = train_df.select("category").distinct().count()
input_dim = len(train_df.select("features").first()[0])

model = Sequential()
model.add(Dense(512, input_shape=(input_dim,)))  # 512 hidden units chosen here; adjust as needed
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')
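Before handing anything to Elephas, it does not hurt to check locally that the compiled model accepts the scaled features and one-hot encoded labels we intend to feed it. The snippet below is an optional addition; the sample size and the single train_on_batch call are purely illustrative.
# Optional local sanity check (not part of the distributed training).
# Pull a small sample to the driver, one-hot encode the labels and run a single batch.
sample = scaled_df.select("scaled_features", "index_category").take(256)
x_sample = np.asarray([row.scaled_features.toArray() for row in sample])
y_sample = np_utils.to_categorical([int(row.index_category) for row in sample], nb_classes)
print(model.train_on_batch(x_sample, y_sample))  # prints the loss on this small batch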
Distributed Elephas model
To lift the above Keras model to Spark, we define an Estimator on top of it. An Estimator is Spark's incarnation of a model that still has to be trained. It essentially comes with only a single (required) method, namely fit. Once we call fit on a data frame, we get back a Model, which is a trained model with a transform method to predict labels.
We do this by initializing an ElephasEstimator and setting a few properties. As by now our input data frame will have many columns, we have to tell the model where to find features and labels by column name. Then we provide serialized versions of the Keras model and the Elephas optimizer. We can not plug Keras models into the Estimator directly, as Spark will have to serialize them anyway for communication with workers, so it's better to provide the serialization ourselves. In fact, while pyspark knows how to serialize models, it is extremely inefficient and can break if models become too large. Spark ML is especially picky (and rightly so) about parameters and more or less prohibits you from providing non-atomic types and arrays of the latter. Most of the remaining parameters are optional and rather self-explanatory. Plus, many of them you will know if you have ever run a Keras model before. We just include them here to show the full set of training configuration options.
from elephas.ml_model import ElephasEstimator
from elephas import optimizers as elephas_optimizers

# Define the Elephas optimizer (which tells the model how to aggregate updates on the Spark master)
adadelta = elephas_optimizers.Adadelta()

# Initialize a SparkML Estimator and set all relevant properties
estimator = ElephasEstimator()
estimator.setFeaturesCol("scaled_features")            # These two come directly from pyspark,
estimator.setLabelCol("index_category")                # hence the camel case. Sorry :)
estimator.set_keras_model_config(model.to_yaml())      # Provide serialized Keras model
estimator.set_optimizer_config(adadelta.get_config())  # Provide serialized Elephas optimizer
estimator.set_categorical_labels(True)
estimator.set_nb_classes(nb_classes)
estimator.set_num_workers(1)   # We just use one worker here. Feel free to adapt it.
estimator.set_nb_epoch(20)     # Number of epochs; adjust as needed
estimator.set_batch_size(128)
estimator.set_verbosity(1)
estimator.set_validation_split(0.15)
ElephasEstimator_415398ab22cb1699f794
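Note that the estimator only ever receives the YAML string produced by model.to_yaml(), not the compiled model object itself. If you want to convince yourself that this serialization round-trips, Keras (of the vintage used here) provides model_from_yaml; the quick check below is an optional addition.
# Optional: verify that the Keras architecture survives YAML (de)serialization,
# which is exactly the representation Elephas ships to the workers.
from keras.models import model_from_yaml
restored = model_from_yaml(model.to_yaml())
print(len(restored.layers), "layers restored from YAML")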
SparkML Pipelines
Now for the easy part: defining pipelines is really as easy as listing pipeline stages. We can provide any configuration of Transformers and Estimators really, but here we simply take the three components defined earlier. Note that string_indexer and scaler are interchangeable, while estimator, somewhat obviously, has to come last in the pipeline.
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[string_indexer, scaler, estimator])
Fitting and evaluating the pipeline
The last step now is to fit the pipeline on training data and evaluate it. We evaluate, i.e. transform, on training data, since only in that case do we have labels to check the accuracy of the model. If you like, you could transform the test_df as well.
from pyspark.mllib.evaluation import MulticlassMetrics

fitted_pipeline = pipeline.fit(train_df)  # Fit model to data

prediction = fitted_pipeline.transform(train_df)  # Evaluate on train data.
# prediction = fitted_pipeline.transform(test_df)  # <-- The same code evaluates test data.
pnl = prediction.select("index_category", "prediction")
pnl.show(100)

prediction_and_label = pnl.map(lambda row: (row.index_category, row.prediction))
metrics = MulticlassMetrics(prediction_and_label)
print(metrics.precision())
61878/61878 [==============================] - 0s

+--------------+----------+
|index_category|prediction|
+--------------+----------+
|           2.0|       2.0|
|           2.0|       2.0|
|           0.0|       0.0|
|           1.0|       1.0|
|           4.0|       4.0|
|           0.0|       0.0|
|           0.0|       0.0|
|           3.0|       3.0|
|           2.0|       2.0|
|           5.0|       0.0|
|           0.0|       0.0|
|           4.0|       4.0|
|           0.0|       0.0|
|           4.0|       1.0|
|           2.0|       2.0|
|           1.0|       1.0|
|           0.0|       0.0|
|           6.0|       0.0|
|           2.0|       2.0|
|           1.0|       1.0|
|           2.0|       2.0|
|           8.0|       8.0|
|           1.0|       1.0|
|           5.0|       0.0|
|           0.0|       0.0|
|           0.0|       3.0|
|           0.0|       0.0|
|           1.0|       1.0|
|           4.0|       4.0|
|           2.0|       2.0|
|           0.0|       3.0|
|           3.0|       3.0|
|           0.0|       0.0|
|           3.0|       0.0|
|           1.0|       5.0|
|           3.0|       3.0|
|           2.0|       2.0|
|           1.0|       1.0|
|           0.0|       0.0|
|           2.0|       2.0|
|           2.0|       2.0|
|           1.0|       1.0|
|           6.0|       6.0|
|           1.0|       1.0|
|           0.0|       3.0|
|           7.0|       0.0|
|           0.0|       0.0|
|           0.0|       0.0|
|           1.0|       1.0|
|           1.0|       1.0|
|           6.0|       6.0|
|           0.0|       0.0|
|           0.0|       3.0|
|           2.0|       2.0|
|           0.0|       0.0|
|           2.0|       2.0|
|           0.0|       0.0|
|           4.0|       4.0|
|           0.0|       0.0|
|           6.0|       6.0|
|           2.0|       5.0|
|           0.0|       3.0|
|           3.0|       0.0|
|           0.0|       0.0|
|           3.0|       3.0|
|           4.0|       4.0|
|           0.0|       3.0|
|           0.0|       0.0|
|           0.0|       0.0|
|           4.0|       4.0|
|           3.0|       0.0|
|           2.0|       2.0|
|           1.0|       1.0|
|           7.0|       7.0|
|           0.0|       0.0|
|           0.0|       0.0|
|           0.0|       3.0|
|           1.0|       1.0|
|           1.0|       1.0|
|           5.0|       4.0|
|           1.0|       1.0|
|           1.0|       1.0|
|           4.0|       4.0|
|           3.0|       3.0|
|           0.0|       0.0|
|           2.0|       2.0|
|           4.0|       4.0|
|           7.0|       7.0|
|           2.0|       2.0|
|           0.0|       0.0|
|           1.0|       1.0|
|           0.0|       0.0|
|           4.0|       4.0|
|           1.0|       1.0|
|           0.0|       0.0|
|           0.0|       0.0|
|           0.0|       0.0|
|           0.0|       3.0|
|           0.0|       3.0|
|           0.0|       0.0|
+--------------+----------+
only showing top 100 rows

0.764132648114
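A single precision number hides per-class behaviour; you can already spot some confusion between classes 0 and 3 in the rows above. MulticlassMetrics also exposes per-label precision and recall as well as a confusion matrix, so a small optional extension of the evaluation looks as follows.
# Optional: per-class metrics and confusion matrix from the same prediction-and-label RDD.
labels = prediction_and_label.map(lambda pair: pair[0]).distinct().collect()
for label in sorted(labels):
    print("class %s: precision %.3f, recall %.3f" % (label, metrics.precision(label), metrics.recall(label)))
print(metrics.confusionMatrix().toArray())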
Conclusion
It may certainly take some time to master the principles and syntax of both Keras and Spark, depending on where you come from, of course. However, we also hope that you come to the conclusion that once you get beyond the stage of struggling with defining your models and preprocessing your data, the business of building and using SparkML pipelines is quite an elegant and useful one.
If you like what you see, consider helping to further improve Elephas or contributing to Keras or Spark. Do you have any constructive remarks on this notebook? Is there something you want me to clarify? In any case, feel free to contact me.