Deploy a Spark cluster with Docker to train a CNN (with a Python example)


This post is mainly a record of my own notes, so there are bound to be mistakes in the details.

I hope readers will forgive them; criticism and corrections are welcome.

Modest as this post is, it is still the fruit of my labor.

If you want to reprint it, please attach a link to this article, many thanks!
http://blog.csdn.net/cyh_24/article/details/49683221


The laboratory has 4 beefy servers, each with 8 Tesla GPUs, but our usual experiments only use one GPU at a time, which is a real waste!
So I wanted to use Spark to put all of these GPUs to work. Having heard that Docker is the go-to tool for deploying environments, I decided to use Docker to install and deploy a Spark cluster for training CNNs. Configuring the environment sounds simple and is pure manual labor, but as anyone who has done it knows, there are far too many pitfalls.


This article is the blogger's tearful summary; I hope it gives you enough warning to avoid these pits.

Docker

What is Docker


Docker is an open-source project born in early 2013; it started as an internal side project at dotCloud. Intuitively, Docker is a lightweight virtual machine. The difference between Docker and traditional virtualization is that:


Docker virtualizes at the operating-system level, directly reusing the local host's operating system, whereas the traditional approach virtualizes at the hardware level.


The following figure illustrates the difference between the two more intuitively:





Why use Docker


As an emerging virtualization approach, Docker has a number of advantages over traditional alternatives.


    • A Docker container starts in seconds, much faster than a traditional virtual machine.
    • Docker uses system resources very efficiently; a single host can run thousands of Docker containers at the same time.
    • Apart from the application it runs, a container consumes almost no additional system resources, so application performance is high and system overhead is minimal. (To run 10 different applications, traditional virtualization needs 10 virtual machines, while Docker only needs to start 10 isolated applications.)
    • Once created or configured, a container can run anywhere.
    • Docker containers run on virtually any platform, including physical machines, virtual machines, public clouds, private clouds, personal computers, servers, and more. This compatibility lets users migrate an application directly from one platform to another.


Briefly summarized:


Feature                    Docker                                   Virtual machine
Startup time               Seconds                                  Minutes
Disk usage                 Generally MB                             Generally GB
Performance                Close to native                          Weaker than native
Supported per machine      Thousands of containers on a single host Generally dozens
Spark


Spark is an open-source general-purpose parallel framework from UC Berkeley's AMP Lab, in the same family as Hadoop MapReduce.
Spark has the advantages of Hadoop MapReduce;
but unlike MapReduce, intermediate job output can be kept in memory, so there is no need to read from and write to HDFS.


As a result, Spark is better suited to algorithms that require iteration, such as data mining and machine learning.


I won't say much here about Spark's principles or how to write Spark applications; I'll cover that in a separate post some other day. For now you only need to know that it can distribute your program and run it in parallel.
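To make this concrete, here is a minimal, self-contained PySpark sketch, purely illustrative (the app name and the local[4] master are arbitrary example values): it spreads a list of numbers across workers, squares them in parallel, and sums the results.

# Minimal PySpark sketch: distribute a computation across workers.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('square_sum_demo').setMaster('local[4]')  # example: 4 local workers
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(1000000))                           # split the data across partitions
total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)    # square in parallel, then combine
print(total)

sc.stop()

The same script runs unchanged on a real cluster; only the master URL passed to setMaster (or to spark-submit) changes.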


Elephas (a deep learning library with Spark support)


Let's start with Keras: it is a deep learning library built on top of Theano. Anyone who has used Theano knows that writing Theano programs directly is not particularly pleasant. Keras is a high-level wrapper around Theano that makes the code much easier to write. Below is the code for a Keras CNN model:

model = Sequential()

model.add(Convolution2D(nb_filters, nb_conv, nb_conv,
                        border_mode='full',
                        input_shape=(1, img_rows, img_cols)))
model.add(Activation('relu'))
model.add(Convolution2D(nb_filters, nb_conv, nb_conv))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(nb_pool, nb_pool)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adadelta')

model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch,
          show_accuracy=True, verbose=1, validation_data=(X_test, Y_test))
Isn't that simpler than Caffe's configuration files?

Elephas enables Keras programs to run on Spark, so with essentially no changes to the Keras code you can run the program on Spark.
Here is the Elephas code (the model is the one defined above):

# Create Spark context
conf = SparkConf().setAppName('Mnist_Spark_MLP').setMaster('local[8]')
sc = SparkContext(conf=conf)

# Build RDD from numpy features and labels
rdd = to_simple_rdd(sc, X_train, Y_train)

# Initialize SparkModel from Keras model and Spark context
spark_model = SparkModel(sc, model)

# Train Spark model
spark_model.train(rdd, nb_epoch=nb_epoch, batch_size=batch_size, verbose=0,
                  validation_split=0.1, num_workers=8)
To run it on Spark, just execute the following command:

spark-submit --driver-memory 1G ./your_script.py

That is it for the introductions. Now let me show you how to use Docker to install and deploy a Spark GPU cluster for distributed CNN training.

Spark on Docker installation

Install Docker online
Ubuntu 14.04 already ships with a Docker package, which can be installed directly.

$ sudo apt-get update
$ sudo apt-get install -y docker.io
$ sudo ln -sf /usr/bin/docker.io /usr/local/bin/docker
$ sudo sed -i '$acomplete -F _docker docker' /etc/bash_completion.d/docker.io
If you are running an older version of Ubuntu, you need to update the kernel first.

$ sudo apt-get update
$ sudo apt-get install linux-image-generic-lts-raring linux-headers-generic-lts-raring
$ sudo reboot
Then repeat the above steps.
After installation, start the Docker service.

$ sudo service docker start
Install Docker offline
If your machine cannot connect to the Internet (like my servers), you can also install Docker from an offline package.
You can download the offline package from here: https://get.daocloud.io/docker/builds/Linux/x86_64/docker-latest

chmod +x docker-latest
sudo mv docker-latest /usr/local/bin/docker
# Then start docker in daemon mode:
sudo docker daemon &
Install Spark on Docker
Sequenceiq provides a Docker image with Spark already installed; you only need to pull it from Docker Hub.

docker pull sequenceiq/spark:1.5.1

Start it with the following command:

sudo docker run -it sequenceiq/spark:1.5.1 bash

Test that Spark works:
First use ifconfig to get the container's IP address; my IP is 172.17.0.109. Then:

bash-4.1# cd /usr/local/spark
bash-4.1# cp conf/spark-env.sh.template conf/spark-env.sh
bash-4.1# vi conf/spark-env.sh

Add these two lines:

export SPARK_LOCAL_IP=172.17.0.109
export SPARK_MASTER_IP=172.17.0.109
Then start the master and slave:

bash-4.1# ./sbin/start-master.sh
bash-4.1# ./sbin/start-slave.sh spark://172.17.0.109:7077

Open a browser at http://your_ip:8080 and you can see the status of each Spark node, as shown below.

Use spark-submit to submit an application:

bash-4.1# ./bin/spark-submit examples/src/main/python/pi.py

You get the following result:

15/11/05 02:11:23 INFO scheduler.DAGScheduler: Job 0 finished: reduce at /usr/local/spark-1.5.1-bin-hadoop2.6/examples/src/main/python/pi.py:39, took 1.095643 s

Pi is roughly 3.148900

Congratulations, you have just run a Spark application!

Does it feel like everything has gone smoothly so far? Spoiler: the difficulty has only just begun. Fortunately I have already stepped into the pits, so although there is still a little trouble ahead, at least you will bypass some of the deepest ones...

Installing the various libraries
Elephas requires Python 2.7, but the Python that ships in the image we just pulled is version 2.6, so first we upgrade the Python version.

Upgrading the Python version on CentOS
A friendly reminder: you must install openssl and openssl-devel before compiling Python; don't ask me how I know.

yum install -y zlib-devel bzip2-devel openssl openssl-devel xz-libs wget
Installation details:

wget http://www.python.org/ftp/python/2.7.8/Python-2.7.8.tar.xz
xz -d Python-2.7.8.tar.xz
tar -xvf Python-2.7.8.tar

# Enter the directory:
cd Python-2.7.8
# Run configure (be sure to add -fPIC; don't ask me how I know):
./configure --prefix=/usr/local CFLAGS=-fPIC
# Compile and install:
make
make altinstall
Set PATH

mv /usr/bin/python /usr/bin/python2.6
export PATH="/usr/local/bin:$PATH"
# or
ln -s /usr/local/bin/python2.7 /usr/bin/python
# Check the Python version:
python -V
Install setuptools

# Get the package:
wget --no-check-certificate https://pypi.python.org/packages/source/s/setuptools/setuptools-1.4.2.tar.gz
# Unpack:
tar -xvf setuptools-1.4.2.tar.gz
cd setuptools-1.4.2
# Install setuptools with Python 2.7.8:
python setup.py install
Install pip

curl https://raw.githubusercontent.com/pypa/pip/master/contrib/get-pip.py | python
Repair yum tool

vi / usr / bin / yum

#Modify python in yum
Put the first line #! / Usr / bin / python
Change to #! / Usr / bin / python2.6
At this time, yum is ok
Install Theano, Keras, and Elephas
pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git

pip install keras

pip install elephas
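As a quick sanity check, a minimal sketch like the following (run with the new Python 2.7 interpreter) confirms that the upgraded interpreter and the freshly installed libraries import cleanly:

from __future__ import print_function
import sys

print(sys.version)   # should now report 2.7.x

# These imports fail loudly if anything in the stack is broken:
import theano
import keras
import elephas
print('Theano version:', theano.__version__)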
What we have achieved so far
Let us briefly summarize the work done:

Installed Docker
Loaded the Spark-on-Docker image
Upgraded Python inside the Spark-on-Docker image
Installed Theano, Keras, and Elephas

Now, here is what we can do:
√ If your machine has multiple CPUs (say 24):

You can open a single Docker container and simply use Spark together with Elephas to train the CNN in parallel (on all 24 CPUs).

√ If your machine has multiple GPUs (say 4):

You can start 4 Docker containers and modify ~/.theanorc in each one to select a specific GPU, so the 4 GPUs compute in parallel (CUDA needs to be installed); see the configuration sketch below.
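For illustration, here is a minimal sketch of what ~/.theanorc might look like inside one of the containers, assuming CUDA and a GPU-enabled Theano of that era are installed; the other three containers would use gpu1, gpu2, and gpu3 instead of gpu0:

[global]
device = gpu0        # each container selects a different GPU: gpu0 .. gpu3
floatX = float32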

Example: parallel CNN training on a single-machine multi-CPU cluster
Let's run a simple network to train MNIST handwritten-digit recognition. Below is a script you can run directly (mnist.pkl.gz must be downloaded in advance):

from __future__ import absolute_import
from __future__ import print_function
import numpy as np

from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.optimizers import SGD, Adam, RMSprop
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.utils import np_utils

from elephas.spark_model import SparkModel
from elephas.utils.rdd_utils import to_simple_rdd

from pyspark import SparkContext, SparkConf

import gzip
import cPickle

APP_NAME = "mnist"
MASTER_IP = 'local[24]'

# Define basic parameters
batch_size = 128
nb_classes = 10
nb_epoch = 5

# input image dimensions
img_rows, img_cols = 28, 28
# number of convolutional filters to use
nb_filters = 32
# size of pooling area for max pooling
nb_pool = 2
# convolution kernel size
nb_conv = 3

# Load data
f = gzip.open("./mnist.pkl.gz", "rb")
dd = cPickle.load(f)
(X_train, y_train), (X_test, y_test) = dd

X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols)
X_test = X_test.reshape(X_test.shape[0], 1, img_rows, img_cols)

X_train = X_train.astype("float32")
X_test = X_test.astype("float32")
X_train /= 255
X_test /= 255


print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# Convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)

model = Sequential()
model.add(Convolution2D(nb_filters, nb_conv, nb_conv,
                        border_mode='full',
                        input_shape=(1, img_rows, img_cols)))
model.add(Activation('relu'))
model.add(Convolution2D(nb_filters, nb_conv, nb_conv))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(nb_pool, nb_pool)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adadelta')

# Spark
conf = SparkConf().setAppName(APP_NAME).setMaster(MASTER_IP)
sc = SparkContext(conf=conf)

# Build RDD from numpy features and labels
rdd = to_simple_rdd(sc, X_train, Y_train)

# Initialize SparkModel from Keras model and Spark context
spark_model = SparkModel(sc, model)

# Train Spark model
spark_model.train(rdd, nb_epoch=nb_epoch, batch_size=batch_size, verbose=0,
                  validation_split=0.1, num_workers=24)

# Evaluate Spark model by evaluating the underlying model
score = spark_model.get_network().evaluate(X_test, Y_test, show_accuracy=True, verbose=2)
print('Test accuracy:', score[1])
Run it with the following command:

/usr/local/spark/bin/spark-submit mnist_cnn_spark.py

Using 24 workers and training for 5 epochs in parallel, the accuracy and running time are as follows:

Test accuracy: 95.68%
took: 1135s

Without Spark, a rough measurement showed about 1800 s per epoch, so 5 epochs would take roughly 9000 s; the Spark version is therefore still about 7 to 8 times faster.

Example: parallel CNN training on a multi-GPU cluster
The blogger has stepped into far too many pits these past few days and is worn out!
As for configuring single-machine multi-GPU and multi-machine multi-GPU clusters, please give the blogger a few days to recover, after which the pit-stepping will continue without hesitation...

For the Red Flame Army, I will be back!

Copyright statement: If you need to reprint, please attach a link to this article, thank you very much!

