TensorFlow Learning Notes 4: Distributed TensorFlow

Brief Introduction

The TensorFlow API provides cluster (tf.train.ClusterSpec), server (tf.train.Server), and supervisor (tf.train.Supervisor) to support distributed training of models.

For an introduction to distributed training in TensorFlow, refer to Distributed TensorFlow. A brief overview: a TensorFlow distributed cluster consists of multiple tasks, each corresponding to a tf.train.Server instance that acts as a separate node of the cluster. Multiple tasks with the same function can be grouped into a job; for example, a ps job only stores the parameters of the TensorFlow model, acting as a parameter server, while a worker job only performs the compute-intensive graph computations, acting as a compute node. Tasks in the cluster communicate with each other for state synchronization, parameter updates, and so on.
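
For example, here is a minimal sketch of how a cluster with a two-task ps job and a two-task worker job is declared (the localhost addresses are illustrative, chosen to match the topology used in the verification steps later in this article):

import tensorflow as tf

# Each hostname:port pair is one task; the addresses below are only examples.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2221", "localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

# Task 0 of the "worker" job starts its own tf.train.Server as a cluster node.
server = tf.train.Server(cluster, job_name="worker", task_index=0)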

All nodes of a TensorFlow distributed cluster execute the same code. Distributed task code follows a fixed pattern:

# Step 1: Parse the command-line arguments to get the cluster information
# (ps_hosts and worker_hosts) and the current node's role (job_name and task_index).

# Step 2: Create the server for the current task node.
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)

# Step 3: If the current node is a ps, call server.join() to wait indefinitely;
# if it is a worker, proceed to step 4.
if FLAGS.job_name == "ps":
  server.join()

# Step 4: Build the model to be trained.
# (Build the TensorFlow graph model here.)

# Step 5: Create a tf.train.Supervisor to manage the model's training process.
# Create a "supervisor", which oversees the training process.
sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0), logdir="/tmp/train_logs")
# The supervisor takes care of session initialization and restoring from a checkpoint.
sess = sv.prepare_or_wait_for_session(server.target)
# Loop until the supervisor shuts down.
while not sv.should_stop():
  # Train the model.

TensorFlow Distributed Training Code Framework

Following the fixed pattern described above, the framework of a distributed TensorFlow program is shown below.

import tensorflow as tf

# Flags for defining the tf.train.ClusterSpec
tf.app.flags.DEFINE_string("ps_hosts", "",
                           "Comma-separated list of hostname:port pairs")
tf.app.flags.DEFINE_string("worker_hosts", "",
                           "Comma-separated list of hostname:port pairs")

# Flags for defining the tf.train.Server
tf.app.flags.DEFINE_string("job_name", "", "One of 'ps', 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")

FLAGS = tf.app.flags.FLAGS


def main(_):
  ps_hosts = FLAGS.ps_hosts.split(",")
  worker_hosts = FLAGS.worker_hosts.split(",")

  # Create a cluster from the parameter server and worker hosts.
  cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

  # Create and start a server for the local task.
  server = tf.train.Server(cluster,
                           job_name=FLAGS.job_name,
                           task_index=FLAGS.task_index)

  if FLAGS.job_name == "ps":
    server.join()
  elif FLAGS.job_name == "worker":

    # Assigns ops to the local worker by default.
    with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster)):

      # Build model...
      loss = ...
      global_step = tf.Variable(0)

      train_op = tf.train.AdagradOptimizer(0.01).minimize(
          loss, global_step=global_step)

      saver = tf.train.Saver()
      summary_op = tf.merge_all_summaries()
      init_op = tf.initialize_all_variables()

    # Create a "supervisor", which oversees the training process.
    sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                             logdir="/tmp/train_logs",
                             init_op=init_op,
                             summary_op=summary_op,
                             saver=saver,
                             global_step=global_step,
                             save_model_secs=600)

    # The supervisor takes care of session initialization and restoring from
    # a checkpoint.
    sess = sv.prepare_or_wait_for_session(server.target)

    # Start queue runners for the input pipelines (if any).
    sv.start_queue_runners(sess)

    # Loop until the supervisor shuts down (or 1000000 steps have completed).
    step = 0
    while not sv.should_stop() and step < 1000000:
      # Run a training step asynchronously.
      # See tf.train.SyncReplicasOptimizer for additional details on how to
      # perform *synchronous* training.
      _, step = sess.run([train_op, global_step])


if __name__ == "__main__":
  tf.app.run()
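
The key placement mechanism in this framework is tf.train.replica_device_setter. As a minimal sketch of its behavior (the variable names here are assumptions for illustration, not part of the framework above), variables created under its device scope are assigned round-robin to the ps tasks, while other ops stay on the local worker:

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2221", "localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

with tf.device(tf.train.replica_device_setter(
    worker_device="/job:worker/task:0", cluster=cluster)):
  # Variables are placed on ps tasks in round-robin order...
  w = tf.Variable(tf.zeros([784, 10]))  # -> /job:ps/task:0
  b = tf.Variable(tf.zeros([10]))       # -> /job:ps/task:1
  # ...while other ops default to the local worker device.
  x = tf.placeholder(tf.float32, [None, 784])
  y = tf.matmul(x, w) + b               # -> /job:worker/task:0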

For any distributed TensorFlow program, only two parts need to be written for the specific application: the code that builds the TensorFlow graph model, and the code that executes a single training step.

Distributed MNIST Task

We construct a distributed MNIST sample for verification by modifying the mnist_softmax.py provided in tensorflow/tensorflow. Refer to mnist_dist.py for the modified code.
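
mnist_dist.py itself is not reproduced here; as a rough, hypothetical sketch (based on the mnist_softmax.py model, with assumed variable names), the graph-building part that fills in the framework above might look like this:

# Hypothetical sketch of the model-building step, not the actual mnist_dist.py.
x = tf.placeholder(tf.float32, [None, 784])
y_ = tf.placeholder(tf.float32, [None, 10])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)
loss = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))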

We also use the TensorFlow Docker image to start a container for verification.

$ docker run -d -v /path/to/your/code:/tensorflow/mnist --name tensorflow tensorflow/tensorflow

After the container starts, open four terminals, enter the TensorFlow container with the commands below, and switch to the /tensorflow/mnist directory:

$ docker exec -ti tensorflow /bin/bash
$ cd /tensorflow/mnist

Then execute one of the following commands in each of the four terminals to start a task node of the TensorFlow cluster:

# Start ps 0
python mnist_dist.py --ps_hosts=localhost:2221,localhost:2222 --worker_hosts=localhost:2223,localhost:2224 --job_name=ps --task_index=0

# Start ps 1
python mnist_dist.py --ps_hosts=localhost:2221,localhost:2222 --worker_hosts=localhost:2223,localhost:2224 --job_name=ps --task_index=1

# Start worker 0
python mnist_dist.py --ps_hosts=localhost:2221,localhost:2222 --worker_hosts=localhost:2223,localhost:2224 --job_name=worker --task_index=0

# Start worker 1
python mnist_dist.py --ps_hosts=localhost:2221,localhost:2222 --worker_hosts=localhost:2223,localhost:2224 --job_name=worker --task_index=1

You can verify the actual effect yourself.
