How to do deep learning with Spark: from MLlib to Keras and Elephas

Spark ML model pipelines on distributed deep neural nets

This notebook describes how to build machine learning pipelines with Spark ML for distributed versions of Keras deep learning models. As data set we use the Otto Product Classification challenge from Kaggle. The reason we chose this data is that it is small and very structured, which means we can focus on the technical components rather than preprocessing intricacies. Also, users with slow hardware or without a full-blown Spark cluster should be able to run this example locally, and still learn a lot about the distributed mode.

Often, the need to distribute computation isn't imposed by model training, but rather by building the data pipeline, i.e. ingestion, transformation etc. In training, deep neural networks tend to do fairly well on one or more GPUs on one machine. Most of the time, using gradient descent methods, you'll process one batch after another anyway. Even so, it may still be beneficial to use frameworks like Spark to integrate your models with your surrounding infrastructure. On top of that, the convenience provided by Spark ML pipelines can be very valuable (being syntactically very close to what you might know from scikit-learn).

TL;DR: We'll show how to tackle a classification problem using distributed deep neural nets and Spark ML pipelines in an example that's essentially a distributed version of the one found here.

Using this notebook

As we are going to use Elephas, you'll need access to a running Spark context to run this notebook. If you don't have one already, install Spark locally by following the instructions provided here. Make sure to also export SPARK_HOME to your path and start your IPython/Jupyter notebook as follows:

IPYTHON_OPTS="notebook" ${SPARK_HOME}/bin/pyspark --driver-memory 4G elephas/examples/Spark_ML_Pipeline.ipynb

To test your environment, try to print the Spark context (provided as sc), i.e. execute the following cell.

from __future__ import print_function
print(sc)

<pyspark.context.SparkContext object at 0x1132d61d0>
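If you started a plain IPython/Jupyter kernel instead of going through the pyspark launcher, there will be no sc variable waiting for you. A minimal sketch for creating a local context by hand, assuming pyspark is importable from your kernel (for instance after initializing it with the findspark package) and using an arbitrary app name, could look like this:

from pyspark import SparkConf, SparkContext

# Build a local Spark context by hand; "local[*]" uses all available cores.
# The app name "otto_keras_elephas" is just an illustrative choice.
conf = SparkConf().setAppName("otto_keras_elephas").setMaster("local[*]")
sc = SparkContext(conf=conf)
print(sc)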
Otto Product Classification Data

Training and test data are available here. Go ahead and download the data. Inspecting it, you'll see that the provided CSV files consist of an id column and the feature columns. train.csv has an additional column for labels, which test.csv is missing. The challenge is to accurately predict test labels. For the rest of this notebook, we'll assume data is stored at data_path, which you should modify below as needed.

data_path = "./"  # <-- Make sure to adapt this to where your CSV files are.
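For orientation, each row of the raw CSV files consists of an id, the 93 numeric feature columns, and, in train.csv only, a target label with a Class_ prefix. If you want to convince yourself of the layout before parsing it, a quick peek at the first few lines (reusing the data_path defined above) could look like this:

# Print the header and the first two data rows of the training file.
with open(data_path + "train.csv") as f:
    for _ in range(3):
        print(f.readline().strip())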

Loading data is relatively simple, but we have to take care of a few things. First, while you can shuffle the rows of an RDD, it's generally not very efficient. But since data in train.csv is sorted by category, we'll have to shuffle in order to make the model perform well. This is what the function shuffle_csv below is for. Next, we read in plain text in load_data_frame, split lines by comma and convert the features to a float vector type. Also, note that the last column in train.csv represents the category, which has a Class_ prefix.

Defining Data Frames

Spark has a few core data structures, among them the data frame, which is a distributed version of the named columnar data structure many will know from either R or pandas. We need a so-called SQLContext and an optional column-to-names mapping to create a data frame from scratch.

from pyspark.sql import SQLContext
from pyspark.mllib.linalg import Vectors
import numpy as np
import random

sql_context = SQLContext(sc)

def shuffle_csv(csv_file):
    lines = open(csv_file).readlines()
    random.shuffle(lines)
    open(csv_file, 'w').writelines(lines)

def load_data_frame(csv_file, shuffle=True, train=True):
    if shuffle:
        shuffle_csv(csv_file)
    data = sc.textFile(data_path + csv_file)  # This is an RDD, which will later be transformed to a data frame
    data = data.filter(lambda x: x.split(',')[0] != 'id').map(lambda line: line.split(','))
    if train:
        data = data.map(lambda line: (Vectors.dense(np.asarray(line[1:-1]).astype(np.float32)), str(line[-1])))
    else:
        # Test data gets dummy labels. We need the same structure as in train data
        data = data.map(lambda line: (Vectors.dense(np.asarray(line[1:]).astype(np.float32)), "Class_1"))
    return sql_context.createDataFrame(data, ['features', 'category'])
 

Let's load both train and test data and print a few rows of data using the convenient show method.

train_df = load_data_frame("train.csv")
test_df = load_data_frame("test.csv", shuffle=False, train=False)  # No need to shuffle test data

print("Train data frame:")
train_df.show(10)

print("Test data frame (note the dummy category):")
test_df.show(10)
Train data frame:
+--------------------+--------+
|            features|category|
+--------------------+--------+
|[0.0,0.0,0.0,0.0,...| Class_8|
|[0.0,0.0,0.0,0.0,...| Class_8|
|[0.0,0.0,0.0,0.0,...| Class_2|
|[0.0,1.0,0.0,1.0,...| Class_6|
|[0.0,0.0,0.0,0.0,...| Class_9|
|[0.0,0.0,0.0,0.0,...| Class_2|
|[0.0,0.0,0.0,0.0,...| Class_2|
|[0.0,0.0,0.0,0.0,...| Class_3|
|[0.0,0.0,4.0,0.0,...| Class_8|
|[0.0,0.0,0.0,0.0,...| Class_7|
+--------------------+--------+
only showing top 10 rows

Test data frame (note the dummy category):
+--------------------+--------+
|            features|category|
+--------------------+--------+
|[1.0,0.0,0.0,1.0,...| Class_1|
|[0.0,1.0,13.0,1.0...| Class_1|
|[0.0,0.0,1.0,1.0,...| Class_1|
|[0.0,0.0,0.0,0.0,...| Class_1|
|[2.0,0.0,5.0,1.0,...| Class_1|
|[0.0,0.0,0.0,0.0,...| Class_1|
|[0.0,0.0,0.0,0.0,...| Class_1|
|[0.0,0.0,0.0,1.0,...| Class_1|
|[0.0,0.0,0.0,0.0,...| Class_1|
|[0.0,0.0,0.0,0.0,...| Class_1|
+--------------------+--------+
only showing top 10 rows
 
Preprocessing: Defining Transformers

Up until now, we basically just read in raw data. Luckily, Spark ML has quite a few preprocessing features available, so the only thing we'll ever have to do is define transformations of data frames.

To proceed, we'll first transform category strings to double values. This is done by a so-called StringIndexer. Note that we carry out the actual transformation here already, but that's just for demonstration purposes. All we really need is to define string_indexer to put it into a pipeline later on.

from pyspark.ml.feature import StringIndexer

string_indexer = StringIndexer(inputCol="category", outputCol="index_category")
fitted_indexer = string_indexer.fit(train_df)
indexed_df = fitted_indexer.transform(train_df)
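Since StringIndexer assigns indices by label frequency (the most frequent category gets index 0.0), the category-to-index mapping is not alphabetical. To see which class ended up with which index, which is handy for reading the prediction table further down, you can inspect the distinct pairs in the transformed frame:

# Show the learned category -> index mapping.
indexed_df.select("category", "index_category").distinct().show()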

Next, it's good practice to normalize the features, which is done with a StandardScaler.

from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withStd=True, withMean=True)
fitted_scaler = scaler.fit(indexed_df)
scaled_df = fitted_scaler.transform(indexed_df)

print("The result of indexing and scaling. Each transformation adds new columns to the data frame:")
scaled_df.show(10)
The result of indexing and scaling. Each transformation adds new columns to the data frame:
+--------------------+--------+--------------+---------------------+
|            features|category|index_category|      scaled_features|
+--------------------+--------+--------------+---------------------+
|[0.0,0.0,0.0,0.0,...| Class_8|           2.0|[-0.2535060296260...|
|[0.0,0.0,0.0,0.0,...| Class_8|           2.0|[-0.2535060296260...|
|[0.0,0.0,0.0,0.0,...| Class_2|           0.0|[-0.2535060296260...|
|[0.0,1.0,0.0,1.0,...| Class_6|           1.0|[-0.2535060296260...|
|[0.0,0.0,0.0,0.0,...| Class_9|           4.0|[-0.2535060296260...|
|[0.0,0.0,0.0,0.0,...| Class_2|           0.0|[-0.2535060296260...|
|[0.0,0.0,0.0,0.0,...| Class_2|           0.0|[-0.2535060296260...|
|[0.0,0.0,0.0,0.0,...| Class_3|           3.0|[-0.2535060296260...|
|[0.0,0.0,4.0,0.0,...| Class_8|           2.0|[-0.2535060296260...|
|[0.0,0.0,0.0,0.0,...| Class_7|           5.0|[-0.2535060296260...|
+--------------------+--------+--------------+---------------------+
only showing top 10 rows
 
Keras Deep Learning model

Now that we have a data frame with processed features and labels, let's define a deep neural net that we can use to address the classification problem. Chances are you came here because you know a thing or two about deep learning. If so, the model below will look very straightforward. We build a Keras model by choosing a set of three consecutive Dense layers with dropout and ReLU activations. There are certainly much better architectures for the problem out there, but we really just want to demonstrate the general flow here.

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.utils import np_utils, generic_utils

nb_classes = train_df.select("category").distinct().count()
input_dim = len(train_df.select("features").first()[0])

model = Sequential()
model.add(Dense(512, input_shape=(input_dim,)))  # hidden layer width: 512 units assumed; adjust as you like
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')
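Before handing the model over to Spark, it can be worth a quick local sanity check that the architecture is what you intended. Depending on your Keras version, printing an overview of layers and parameter counts is a one-liner:

# Optional: inspect layer shapes and parameter counts locally before distributing.
model.summary()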
Distributed Elephas model

To lift the above Keras model to Spark, we define an Estimator on top of it. An Estimator is Spark's incarnation of a model that still has to be trained. It essentially only comes with a single (required) method, namely fit. Once we call fit on a data frame, we get back a Model, which is a trained model with a transform method to predict labels.

We do this by initializing an ElephasEstimator and setting a few properties. As by now our input data frame will have many columns, we have to tell the model where to find features and labels by column name. Then we provide serialized versions of the Keras model and an Elephas optimizer. We cannot plug Keras models into the estimator directly, as Spark would have to serialize them anyway for communication with workers, so it's better to provide the serialization ourselves. In fact, while pyspark knows how to serialize the model, it's extremely inefficient and can break if models become too large. Spark ML is also especially picky (and rightly so) about parameters and more or less prohibits you from providing non-atomic types and arrays of the latter. Most of the remaining parameters are optional and rather self-explanatory. Plus, many of them you will know if you have ever run a Keras model before. We just include them here to show the full set of training configuration options.

from elephas.ml_model import ElephasEstimator
from elephas import optimizers as elephas_optimizers

# Define elephas optimizer (which tells the model how to aggregate updates on the Spark master)
adadelta = elephas_optimizers.Adadelta()

# Initialize SparkML Estimator and set all relevant properties
estimator = ElephasEstimator()
estimator.setFeaturesCol("scaled_features")            # These two come directly from pyspark,
estimator.setLabelCol("index_category")                # hence the camel case. Sorry :)
estimator.set_keras_model_config(model.to_yaml())      # Provide serialized Keras model
estimator.set_optimizer_config(adadelta.get_config())  # Provide serialized Elephas optimizer
estimator.set_categorical_labels(True)
estimator.set_nb_classes(nb_classes)
estimator.set_num_workers(1)    # We just use one worker here. Feel free to adapt it.
estimator.set_nb_epoch(20)      # number of epochs (exact value assumed, source was garbled)
estimator.set_batch_size(128)   # batch size (exact value assumed, source was garbled)
estimator.set_verbosity(1)
estimator.set_validation_split(0.15)

ElephasEstimator_415398ab22cb1699f794
Spark ML Pipelines

Now for the easy part: defining pipelines is really as easy as listing pipeline stages. We can provide any configuration of Transformers and Estimators really, but here we simply take the three components defined earlier. Note that string_indexer and scaler are interchangeable, while estimator somewhat obviously has to come last in the pipeline.

from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[string_indexer, scaler, estimator])
Fitting and evaluating the pipeline

The last step is to fit the pipeline on training data and evaluate it. We evaluate, i.e. transform, on training data, since only in this case do we have labels to check the accuracy of the model. If you like, you could transform the test_df as well.

from pyspark.mllib.evaluation import MulticlassMetrics

fitted_pipeline = pipeline.fit(train_df)            # Fit model to data
prediction = fitted_pipeline.transform(train_df)    # Evaluate on train data.
# prediction = fitted_pipeline.transform(test_df)   # <-- The same code evaluates test data.
pnl = prediction.select("index_category", "prediction")
pnl.show(100)

prediction_and_label = pnl.map(lambda row: (row.index_category, row.prediction))
metrics = MulticlassMetrics(prediction_and_label)
print(metrics.precision())
61878/61878 [==============================] - 0s
+--------------+----------+
|index_category|prediction|
+--------------+----------+
|           2.0|       2.0|
|           2.0|       2.0|
|           0.0|       0.0|
|           1.0|       1.0|
|           4.0|       4.0|
|           0.0|       0.0|
|           0.0|       0.0|
|           3.0|       3.0|
|           2.0|       2.0|
|           5.0|       0.0|
|           0.0|       0.0|
|           4.0|       4.0|
|           0.0|       0.0|
|           4.0|       1.0|
|           2.0|       2.0|
|           1.0|       1.0|
|           0.0|       0.0|
|           6.0|       0.0|
|           2.0|       2.0|
|           1.0|       1.0|
|           2.0|       2.0|
|           8.0|       8.0|
|           1.0|       1.0|
|           5.0|       0.0|
|           0.0|       0.0|
|           0.0|       3.0|
|           0.0|       0.0|
|           1.0|       1.0|
|           4.0|       4.0|
|           2.0|       2.0|
|           0.0|       3.0|
|           3.0|       3.0|
|           0.0|       0.0|
|           3.0|       0.0|
|           1.0|       5.0|
|           3.0|       3.0|
|           2.0|       2.0|
|           1.0|       1.0|
|           0.0|       0.0|
|           2.0|       2.0|
|           2.0|       2.0|
|           1.0|       1.0|
|           6.0|       6.0|
|           1.0|       1.0|
|           0.0|       3.0|
|           7.0|       0.0|
|           0.0|       0.0|
|           0.0|       0.0|
|           1.0|       1.0|
|           1.0|       1.0|
|           6.0|       6.0|
|           0.0|       0.0|
|           0.0|       3.0|
|           2.0|       2.0|
|           0.0|       0.0|
|           2.0|       2.0|
|           0.0|       0.0|
|           4.0|       4.0|
|           0.0|       0.0|
|           6.0|       6.0|
|           2.0|       5.0|
|           0.0|       3.0|
|           3.0|       0.0|
|           0.0|       0.0|
|           3.0|       3.0|
|           4.0|       4.0|
|           0.0|       3.0|
|           0.0|       0.0|
|           0.0|       0.0|
|           4.0|       4.0|
|           3.0|       0.0|
|           2.0|       2.0|
|           1.0|       1.0|
|           7.0|       7.0|
|           0.0|       0.0|
|           0.0|       0.0|
|           0.0|       3.0|
|           1.0|       1.0|
|           1.0|       1.0|
|           5.0|       4.0|
|           1.0|       1.0|
|           1.0|       1.0|
|           4.0|       4.0|
|           3.0|       3.0|
|           0.0|       0.0|
|           2.0|       2.0|
|           4.0|       4.0|
|           7.0|       7.0|
|           2.0|       2.0|
|           0.0|       0.0|
|           1.0|       1.0|
|           0.0|       0.0|
|           4.0|       4.0|
|           1.0|       1.0|
|           0.0|       0.0|
|           0.0|       0.0|
|           0.0|       0.0|
|           0.0|       3.0|
|           0.0|       3.0|
|           0.0|       0.0|
+--------------+----------+
only showing top 100 rows

0.764132648114
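Overall precision is a fairly coarse summary for a nine-class problem. If you want a per-class view, MulticlassMetrics also exposes a confusion matrix and weighted averages; reusing the metrics object from above, a small follow-up might look like this:

# Rows of the confusion matrix are true label indices, columns are predicted indices
# (the same indices produced by the StringIndexer above).
print(metrics.confusionMatrix().toArray())
print(metrics.weightedPrecision)
print(metrics.weightedRecall)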
 
Conclusion

It may certainly take some time to master the principles and syntax of both Keras and Spark, depending on where you come from, of course. However, we also hope you come to the conclusion that once you get beyond the stage of struggling with defining your models and preprocessing your data, the business of building and using Spark ML pipelines is quite an elegant and useful one.

If you like what you see, consider helping to further improve Elephas or contributing to Keras or Spark. Do you have constructive remarks on this notebook? Is there something you want me to clarify? In any case, feel free to contact me.
