TensorFlow Learning Notes 4: Distributed TensorFlow

Brief Introduction

The TensorFlow API provides cluster (tf.train.ClusterSpec), server (tf.train.Server), and supervisor (tf.train.Supervisor) to support distributed training of models.

For an introduction to distributed training in TensorFlow, refer to Distributed TensorFlow. A brief overview: a TensorFlow distributed cluster consists of multiple tasks, each corresponding to a tf.train.Server instance that acts as a separate node of the cluster. Multiple tasks with the same function can be grouped into a job; for example, a ps job only stores the parameters of the TensorFlow model, acting as a parameter server, while a worker job only performs the compute-intensive graph computations, acting as a compute node. Tasks in the cluster communicate with each other for state synchronization, parameter updates, and so on.
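
For example, here is a minimal sketch of how a cluster with a two-task ps job and a two-task worker job is declared (the localhost addresses are illustrative, chosen to match the topology used in the verification steps later in this article):

import tensorflow as tf

# Each hostname:port pair is one task; the addresses below are only examples.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2221", "localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

# Task 0 of the "worker" job starts its own tf.train.Server as a cluster node.
server = tf.train.Server(cluster, job_name="worker", task_index=0)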

All nodes of a TensorFlow distributed cluster execute the same code. Distributed task code follows a fixed pattern:

# Step 1: Parse the command-line arguments to get the cluster information
# (ps_hosts and worker_hosts) and the current node's role (job_name and task_index).

# Step 2: Create the server for the current task node.
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)

# Step 3: If the current node is a ps, call server.join() to wait indefinitely;
# if it is a worker, proceed to step 4.
if FLAGS.job_name == "ps":
  server.join()

# Step 4: Build the model to be trained.
# (Build the TensorFlow graph model here.)

# Step 5: Create a tf.train.Supervisor to manage the model's training process.
# Create a "supervisor", which oversees the training process.
sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0), logdir="/tmp/train_logs")
# The supervisor takes care of session initialization and restoring from a checkpoint.
sess = sv.prepare_or_wait_for_session(server.target)
# Loop until the supervisor shuts down.
while not sv.should_stop():
  # Train the model.

TensorFlow Distributed Training Code Framework

Following the fixed pattern described above, the framework of a distributed TensorFlow program is shown below.

import tensorflow as tf

# Flags for defining the tf.train.ClusterSpec
tf.app.flags.DEFINE_string("ps_hosts", "",
                           "Comma-separated list of hostname:port pairs")
tf.app.flags.DEFINE_string("worker_hosts", "",
                           "Comma-separated list of hostname:port pairs")

# Flags for defining the tf.train.Server
tf.app.flags.DEFINE_string("job_name", "", "One of 'ps', 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")

FLAGS = tf.app.flags.FLAGS


def main(_):
  ps_hosts = FLAGS.ps_hosts.split(",")
  worker_hosts = FLAGS.worker_hosts.split(",")

  # Create a cluster from the parameter server and worker hosts.
  cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

  # Create and start a server for the local task.
  server = tf.train.Server(cluster,
                           job_name=FLAGS.job_name,
                           task_index=FLAGS.task_index)

  if FLAGS.job_name == "ps":
    server.join()
  elif FLAGS.job_name == "worker":

    # Assigns ops to the local worker by default.
    with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster)):

      # Build model...
      loss = ...
      global_step = tf.Variable(0)

      train_op = tf.train.AdagradOptimizer(0.01).minimize(
          loss, global_step=global_step)

      saver = tf.train.Saver()
      summary_op = tf.merge_all_summaries()
      init_op = tf.initialize_all_variables()

    # Create a "supervisor", which oversees the training process.
    sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                             logdir="/tmp/train_logs",
                             init_op=init_op,
                             summary_op=summary_op,
                             saver=saver,
                             global_step=global_step,
                             save_model_secs=600)

    # The supervisor takes care of session initialization and restoring from
    # a checkpoint.
    sess = sv.prepare_or_wait_for_session(server.target)

    # Start queue runners for the input pipelines (if any).
    sv.start_queue_runners(sess)

    # Loop until the supervisor shuts down (or 1000000 steps have completed).
    step = 0
    while not sv.should_stop() and step < 1000000:
      # Run a training step asynchronously.
      # See tf.train.SyncReplicasOptimizer for additional details on how to
      # perform *synchronous* training.
      _, step = sess.run([train_op, global_step])


if __name__ == "__main__":
  tf.app.run()
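
The key placement mechanism in this framework is tf.train.replica_device_setter. As a minimal sketch of its behavior (the variable names here are assumptions for illustration, not part of the framework above), variables created under its device scope are assigned round-robin to the ps tasks, while other ops stay on the local worker:

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2221", "localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

with tf.device(tf.train.replica_device_setter(
    worker_device="/job:worker/task:0", cluster=cluster)):
  # Variables are placed on ps tasks in round-robin order...
  w = tf.Variable(tf.zeros([784, 10]))  # -> /job:ps/task:0
  b = tf.Variable(tf.zeros([10]))       # -> /job:ps/task:1
  # ...while other ops default to the local worker device.
  x = tf.placeholder(tf.float32, [None, 784])
  y = tf.matmul(x, w) + b               # -> /job:worker/task:0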

For any distributed TensorFlow program, only two parts need to be written for the specific application: the code that builds the TensorFlow graph model, and the code that executes a single training step.

Distributed MNIST Task

We construct a distributed MNIST sample for verification by modifying the mnist_softmax.py provided in tensorflow/tensorflow. Refer to mnist_dist.py for the modified code.
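
mnist_dist.py itself is not reproduced here; as a rough, hypothetical sketch (based on the mnist_softmax.py model, with assumed variable names), the graph-building part that fills in the framework above might look like this:

# Hypothetical sketch of the model-building step, not the actual mnist_dist.py.
x = tf.placeholder(tf.float32, [None, 784])
y_ = tf.placeholder(tf.float32, [None, 10])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)
loss = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))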

We also use the TensorFlow Docker image to start a container for verification.

$ docker run -d -v /path/to/your/code:/tensorflow/mnist --name tensorflow tensorflow/tensorflow

After the container starts, open four terminals, enter the TensorFlow container with the commands below, and switch to the /tensorflow/mnist directory:

$ docker exec -ti tensorflow /bin/bash
$ cd /tensorflow/mnist

Then execute one of the following commands in each of the four terminals to start a task node of the TensorFlow cluster:

# Start ps 0
python mnist_dist.py --ps_hosts=localhost:2221,localhost:2222 --worker_hosts=localhost:2223,localhost:2224 --job_name=ps --task_index=0

# Start ps 1
python mnist_dist.py --ps_hosts=localhost:2221,localhost:2222 --worker_hosts=localhost:2223,localhost:2224 --job_name=ps --task_index=1

# Start worker 0
python mnist_dist.py --ps_hosts=localhost:2221,localhost:2222 --worker_hosts=localhost:2223,localhost:2224 --job_name=worker --task_index=0

# Start worker 1
python mnist_dist.py --ps_hosts=localhost:2221,localhost:2222 --worker_hosts=localhost:2223,localhost:2224 --job_name=worker --task_index=1

You can verify the actual effect yourself.
