Study Notes TF065: TensorFlowOnSpark

Source: Internet
Author: User
Tags: pyspark, hadoop ecosystem, hadoop fs


In the Hadoop ecosystem, a big data system is split into YARN (scheduling), HDFS (storage), and the MapReduce computing framework. Distributed TensorFlow plays the role of the MapReduce computing framework, and Kubernetes plays the role of the YARN scheduler. TensorFlowOnSpark fills in the storage and scheduling pieces by reusing the existing Spark/Hadoop cluster, so deep learning and big data processing can be combined. TensorFlowOnSpark (TFoS) is an open-source Yahoo project: https://github.com/yahoo/TensorFlowOnSpark. It supports distributed TensorFlow training and inference on Apache Spark clusters. TensorFlowOnSpark acts as a bridge: each Spark executor starts a TensorFlow process, and the processes communicate with each other via remote procedure calls (gRPC) or Remote Direct Memory Access (RDMA).

TensorFlowOnSpark architecture. TensorFlow training programs run inside a Spark cluster, and Spark manages them in four steps:
1. Reservation: reserve a port for the TensorFlow process on each executor and start a data-message listener.
2. Startup: start the TensorFlow main function on each executor.
3. Data acquisition: with Readers and QueueRunners, TensorFlow reads data files directly from HDFS and Spark never touches the data; with Feeding, Spark RDD data is sent to the TensorFlow nodes and passed into the TensorFlow computation graph through feed_dict.
4. Shutdown: shut down the TensorFlow worker nodes and parameter server nodes on the executors.
The overall path is Spark Driver -> Spark Executor -> parameter server -> TensorFlow Core -> gRPC/RDMA -> HDFS dataset. See http://yahoohadoop.tumblr.com/post/157196317141/open-sourcing-tensorflowonspark-distributed-deep.
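For orientation, these four steps map onto the TFCluster API roughly as follows. This is a minimal sketch, not runnable on its own: sc, args, map_fun, cluster_size, num_ps, dataRDD, and num_epochs stand in for the SparkContext, the parsed arguments, the TensorFlow main function, the cluster sizing, and the input RDD used in the full example later in these notes.

from tensorflowonspark import TFCluster

# Steps 1-2 (Reservation + Startup): launch one TensorFlow process per Spark executor,
# num_ps of them acting as parameter servers, using Spark feeding mode.
cluster = TFCluster.run(sc, map_fun, args, cluster_size, num_ps, False, TFCluster.InputMode.SPARK)
# Step 3 (Data acquisition, Feeding mode): Spark pushes RDD partitions to the workers,
# which pass the records into the TensorFlow graph through feed_dict.
cluster.train(dataRDD, num_epochs)
# Step 4 (Shutdown): stop the TensorFlow worker and parameter server nodes.
cluster.shutdown()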

TensorFlowOnSpark MNIST example: https://github.com/yahoo/TensorFlowOnSpark/wiki/GetStarted_standalone. This walkthrough uses a standalone-mode Spark cluster on a single machine. Install Spark and Hadoop: deploy a Java 1.8.0 JDK, then download Spark 2.1.0 (http://spark.apache.org/downloads.html) and Hadoop 2.7.3 (http://hadoop.apache.org/#Download+Hadoop). TensorFlow 0.12.1 or later is recommended for better support.
Modify the configuration files, set the environment variables, and start Hadoop with $HADOOP_HOME/sbin/start-all.sh. Then check out the TensorFlowOnSpark source code:

git clone --recurse-submodules https://github.com/yahoo/TensorFlowOnSpark.git
cd TensorFlowOnSpark
git submodule init
git submodule update --force
git submodule foreach --recursive git clean -dfx

Package the source code (the zip will be shipped with Spark jobs later):

cd TensorFlowOnSpark/src
zip -r ../tfspark.zip *

Set the TensorFlowOnSpark root directory environment variable:

cd TensorFlowOnSpark
export TFoS_HOME=$(pwd)

Start the Spark master node (master):

${SPARK_HOME}/sbin/start-master.sh

Configure two worker instances and the Spark master URL so the workers connect to the master node:

export MASTER=spark://$(hostname):7077
export SPARK_WORKER_INSTANCES=2
export CORES_PER_WORKER=1
export TOTAL_CORES=$((CORES_PER_WORKER*SPARK_WORKER_INSTANCES))
${SPARK_HOME}/sbin/start-slave.sh -c $CORES_PER_WORKER -m 3G ${MASTER}

Submit a job to convert the MNIST zip file into an RDD dataset on HDFS:

${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} --conf spark.ui.port=4048 --verbose \
${TFoS_HOME}/examples/mnist/mnist_data_setup.py \
--output examples/mnist/csv \
--format csv

View the processed dataset:

hadoop fs -ls hdfs://localhost:9000/user/libinggen/examples/mnist/csv

View the saved image and label vectors:

hadoop fs -ls hdfs://localhost:9000/user/libinggen/examples/mnist/csv/train/labels

The training set and test set are saved as separate RDDs; see https://github.com/yahoo/TensorFlowOnSpark/blob/master/examples/mnist/mnist_data_setup.py:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy
import tensorflow as tf
from array import array
from tensorflow.contrib.learn.python.learn.datasets import mnist

def toTFExample(image, label):
  """Serializes an image/label as a TFExample byte string"""
  example = tf.train.Example(
    features=tf.train.Features(
      feature={
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=label.astype("int64"))),
        'image': tf.train.Feature(int64_list=tf.train.Int64List(value=image.astype("int64")))
      }
    )
  )
  return example.SerializeToString()

def fromTFExample(bytestr):
  """Deserializes a TFExample from a byte string"""
  example = tf.train.Example()
  example.ParseFromString(bytestr)
  return example

def toCSV(vec):
  """Converts a vector/array into a CSV string"""
  return ','.join([str(i) for i in vec])

def fromCSV(s):
  """Converts a CSV string to a vector/array"""
  return [float(x) for x in s.split(',') if len(s) > 0]

def writeMNIST(sc, input_images, input_labels, output, format, num_partitions):
  """Writes MNIST image/label vectors into parallelized files on HDFS"""
  # Load the gzipped MNIST files into memory
  with open(input_images, 'rb') as f:
    images = numpy.array(mnist.extract_images(f))
  with open(input_labels, 'rb') as f:
    if format == "csv2":
      labels = numpy.array(mnist.extract_labels(f, one_hot=False))
    else:
      labels = numpy.array(mnist.extract_labels(f, one_hot=True))
  shape = images.shape
  print("images.shape: {0}".format(shape))          # 60000 x 28 x 28
  print("labels.shape: {0}".format(labels.shape))   # 60000 x 10
  # Create RDDs of vectors
  imageRDD = sc.parallelize(images.reshape(shape[0], shape[1] * shape[2]), num_partitions)
  labelRDD = sc.parallelize(labels, num_partitions)
  output_images = output + "/images"
  output_labels = output + "/labels"
  # Save the RDDs in the requested format
  if format == "pickle":
    imageRDD.saveAsPickleFile(output_images)
    labelRDD.saveAsPickleFile(output_labels)
  elif format == "csv":
    imageRDD.map(toCSV).saveAsTextFile(output_images)
    labelRDD.map(toCSV).saveAsTextFile(output_labels)
  elif format == "csv2":
    imageRDD.map(toCSV).zip(labelRDD).map(lambda x: str(x[1]) + "|" + x[0]).saveAsTextFile(output)
  else:  # format == "tfr"
    tfRDD = imageRDD.zip(labelRDD).map(lambda x: (bytearray(toTFExample(x[0], x[1])), None))
    # Requires: --jars tensorflow-hadoop-1.0-SNAPSHOT.jar
    tfRDD.saveAsNewAPIHadoopFile(output, "org.tensorflow.hadoop.io.TFRecordFileOutputFormat",
                                 keyClass="org.apache.hadoop.io.BytesWritable",
                                 valueClass="org.apache.hadoop.io.NullWritable")
  # Note: the alternative below creates TFRecord files w/o requiring a custom Input/Output format
  # else: # format == "tfr":
  #   def writeTFRecords(index, iter):
  #     output_path = "{0}/part-{1:05d}".format(output, index)
  #     writer = tf.python_io.TFRecordWriter(output_path)
  #     for example in iter:
  #       writer.write(example)
  #     return [output_path]
  #   tfRDD = imageRDD.zip(labelRDD).map(lambda x: toTFExample(x[0], x[1]))
  #   tfRDD.mapPartitionsWithIndex(writeTFRecords).collect()

def readMNIST(sc, output, format):
  """Reads/verifies previously created output"""
  output_images = output + "/images"
  output_labels = output + "/labels"
  imageRDD = None
  labelRDD = None
  if format == "pickle":
    imageRDD = sc.pickleFile(output_images)
    labelRDD = sc.pickleFile(output_labels)
  elif format == "csv":
    imageRDD = sc.textFile(output_images).map(fromCSV)
    labelRDD = sc.textFile(output_labels).map(fromCSV)
  else:  # format.startswith("tf")
    # Requires: --jars tensorflow-hadoop-1.0-SNAPSHOT.jar
    tfRDD = sc.newAPIHadoopFile(output, "org.tensorflow.hadoop.io.TFRecordFileInputFormat",
                                keyClass="org.apache.hadoop.io.BytesWritable",
                                valueClass="org.apache.hadoop.io.NullWritable")
    imageRDD = tfRDD.map(lambda x: fromTFExample(str(x[0])))
  num_images = imageRDD.count()
  num_labels = labelRDD.count() if labelRDD is not None else num_images
  samples = imageRDD.take(10)
  print("num_images:", num_images)
  print("num_labels:", num_labels)
  print("samples:", samples)

if __name__ == "__main__":
  import argparse
  from pyspark.context import SparkContext
  from pyspark.conf import SparkConf

  parser = argparse.ArgumentParser()
  parser.add_argument("-f", "--format", help="output format", choices=["csv", "csv2", "pickle", "tf", "tfr"], default="csv")
  parser.add_argument("-n", "--num-partitions", help="Number of output partitions", type=int, default=10)
  parser.add_argument("-o", "--output", help="HDFS directory to save examples in parallelized format", default="mnist_data")
  parser.add_argument("-r", "--read", help="read previously saved examples", action="store_true")
  parser.add_argument("-v", "--verify", help="verify saved examples after writing", action="store_true")
  args = parser.parse_args()
  print("args:", args)

  sc = SparkContext(conf=SparkConf().setAppName("mnist_parallelize"))

  if not args.read:
    # Note: these files are inside the mnist.zip file
    writeMNIST(sc, "mnist/train-images-idx3-ubyte.gz", "mnist/train-labels-idx1-ubyte.gz", args.output + "/train", args.format, args.num_partitions)
    writeMNIST(sc, "mnist/t10k-images-idx3-ubyte.gz", "mnist/t10k-labels-idx1-ubyte.gz", args.output + "/test", args.format, args.num_partitions)

  if args.read or args.verify:
    readMNIST(sc, args.output + "/train", args.format)

Submit the training task to start training; the trained mnist_model is written to HDFS. Run the following command:

${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
--py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size ${SPARK_WORKER_INSTANCES} \
--images examples/mnist/csv/train/images \
--labels examples/mnist/csv/train/labels \
--format csv \
--mode train \
--model mnist_model

mnist_dist.py constructs the distributed TensorFlow job: it defines the main function map_fun that each TensorFlow node runs, and uses Feeding as the data acquisition method. Inside map_fun, the TensorFlow cluster and server instances are obtained with:

cluster, server = TFNode.start_cluster_server(ctx, 1, args.rdma)

TFNode here refers to the TFNode.py module packaged in tfspark.zip.
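These notes do not reproduce mnist_dist.py itself, so the sketch below shows, in heavily simplified form, what its map_fun does under InputMode.SPARK: branch on the node's role, build a model on the workers, and pull batches from the Spark feed into feed_dict. The tiny softmax model and the DataFeed details here are illustrative assumptions; see examples/mnist/spark/mnist_dist.py in the repository for the real implementation.

import tensorflow as tf
from tensorflowonspark import TFNode

def map_fun(args, ctx):
  # One TensorFlow process runs inside each Spark executor; ctx describes its role.
  cluster, server = TFNode.start_cluster_server(ctx, 1, args.rdma)

  if ctx.job_name == "ps":
    # Parameter servers only host variables and wait for workers.
    server.join()
  elif ctx.job_name == "worker":
    # Place variables on the parameter servers and ops on this worker.
    with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % ctx.task_index, cluster=cluster)):
      x = tf.placeholder(tf.float32, [None, 784])
      y_ = tf.placeholder(tf.float32, [None, 10])
      W = tf.Variable(tf.zeros([784, 10]))
      b = tf.Variable(tf.zeros([10]))
      y = tf.nn.softmax(tf.matmul(x, W) + b)
      loss = -tf.reduce_sum(y_ * tf.log(tf.clip_by_value(y, 1e-10, 1.0)))
      train_op = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    # InputMode.SPARK: Spark pushes RDD partitions into this queue-backed feed.
    tf_feed = TFNode.DataFeed(ctx.mgr, args.mode == "train")
    with tf.Session(server.target) as sess:
      sess.run(tf.global_variables_initializer())
      while not tf_feed.should_stop():
        batch = tf_feed.next_batch(args.batch_size)   # list of (image, label) records
        if len(batch) == 0:
          break
        imgs, lbls = zip(*batch)
        # Feeding: the RDD data enters the computation graph through feed_dict.
        sess.run(train_op, feed_dict={x: imgs, y_: lbls})
      tf_feed.terminate()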

The mnist_spark.py file is the driver program for training; it walks through the TensorFlowOnSpark deployment steps:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from pyspark.context import SparkContext
from pyspark.conf import SparkConf

import argparse
import os
import numpy
import sys
import tensorflow as tf
import threading
import time
from datetime import datetime

from tensorflowonspark import TFCluster
import mnist_dist

sc = SparkContext(conf=SparkConf().setAppName("mnist_spark"))
executors = sc._conf.get("spark.executor.instances")
num_executors = int(executors) if executors is not None else 1
num_ps = 1

parser = argparse.ArgumentParser()
parser.add_argument("-b", "--batch_size", help="number of records per batch", type=int, default=100)
parser.add_argument("-e", "--epochs", help="number of epochs", type=int, default=1)
parser.add_argument("-f", "--format", help="example format: (csv|pickle|tfr)", choices=["csv", "pickle", "tfr"], default="csv")
parser.add_argument("-i", "--images", help="HDFS path to MNIST images in parallelized format")
parser.add_argument("-l", "--labels", help="HDFS path to MNIST labels in parallelized format")
parser.add_argument("-m", "--model", help="HDFS path to save/load model during train/inference", default="mnist_model")
parser.add_argument("-n", "--cluster_size", help="number of nodes in the cluster", type=int, default=num_executors)
parser.add_argument("-o", "--output", help="HDFS path to save test/inference output", default="predictions")
parser.add_argument("-r", "--readers", help="number of reader/enqueue threads", type=int, default=1)
parser.add_argument("-s", "--steps", help="maximum number of steps", type=int, default=1000)
parser.add_argument("-tb", "--tensorboard", help="launch tensorboard process", action="store_true")
parser.add_argument("-X", "--mode", help="train|inference", default="train")
parser.add_argument("-c", "--rdma", help="use rdma connection", default=False)
args = parser.parse_args()
print("args:", args)

print("{0} ===== Start".format(datetime.now().isoformat()))

if args.format == "tfr":
  images = sc.newAPIHadoopFile(args.images, "org.tensorflow.hadoop.io.TFRecordFileInputFormat",
                               keyClass="org.apache.hadoop.io.BytesWritable",
                               valueClass="org.apache.hadoop.io.NullWritable")
  def toNumpy(bytestr):
    example = tf.train.Example()
    example.ParseFromString(bytestr)
    features = example.features.feature
    image = numpy.array(features['image'].int64_list.value)
    label = numpy.array(features['label'].int64_list.value)
    return (image, label)
  dataRDD = images.map(lambda x: toNumpy(str(x[0])))
else:
  if args.format == "csv":
    images = sc.textFile(args.images).map(lambda ln: [int(x) for x in ln.split(',')])
    labels = sc.textFile(args.labels).map(lambda ln: [float(x) for x in ln.split(',')])
  else:  # args.format == "pickle"
    images = sc.pickleFile(args.images)
    labels = sc.pickleFile(args.labels)
  print("zipping images and labels")
  dataRDD = images.zip(labels)

# 1. Reserve a port for the TensorFlow process on each executor
cluster = TFCluster.run(sc, mnist_dist.map_fun, args, args.cluster_size, num_ps, args.tensorboard, TFCluster.InputMode.SPARK)
# 2. Start the TensorFlow main function
cluster.start(mnist_dist.map_fun, args)
if args.mode == "train":
  # 3. Training
  cluster.train(dataRDD, args.epochs)
else:
  # 3. Inference
  labelRDD = cluster.inference(dataRDD)
  labelRDD.saveAsTextFile(args.output)
# 4. Shut down the executors' TensorFlow worker and parameter server nodes
cluster.shutdown()

print("{0} ===== Stop".format(datetime.now().isoformat()))

Prediction command:

${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
--py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size ${SPARK_WORKER_INSTANCES} \
--images examples/mnist/csv/test/images \
--labels examples/mnist/csv/test/labels \
--mode inference \
--format csv \
--model mnist_model \
--output predictions
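
The predictions are written as plain text files under the --output path. Below is a small, hypothetical PySpark snippet for spot-checking them; the "predictions" path matches the --output value above, and each record is whatever cluster.inference() emitted for one input sample.

from pyspark.context import SparkContext
from pyspark.conf import SparkConf

# Read back the text files produced by labelRDD.saveAsTextFile(args.output).
sc = SparkContext(conf=SparkConf().setAppName("inspect_predictions"))
preds = sc.textFile("predictions")
print("number of prediction records:", preds.count())
for line in preds.take(5):   # print a few sample records
  print(line)
sc.stop()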

TensorFlowOnSpark can also run on Amazon EC2 and on Hadoop clusters in YARN mode.

References:
Analysis and Practice of TensorFlow Technology

Recommendations for machine learning job opportunities in Shanghai are welcome; you can reach me at qingxingfengzi.
