TensorFlow Learning Notes 4: Distributed TensorFlow
Brief Introduction
The TensorFlow API provides cluster, server, and supervisor to support distributed training of models.
For an introduction to distributed training in TensorFlow, refer to Distributed TensorFlow. A brief overview: a TensorFlow distributed cluster consists of multiple tasks, and each task corresponds to a tf.train.Server instance that acts as a separate node of the cluster. Multiple tasks with the same function can be grouped into a job; for example, a ps job only stores the parameters of the TensorFlow model and acts as a parameter server, while a worker job performs the compute-intensive graph operations and acts as a compute node. The tasks in a cluster communicate with each other for state synchronization, parameter updates, and so on.
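As a concrete illustration (the host:port pairs below are placeholders, not taken from the original article), a cluster with two ps tasks and two worker tasks can be described by a tf.train.ClusterSpec, and each process then starts a tf.train.Server for its own task:

import tensorflow as tf

# Illustrative host:port pairs; in practice they come from command-line flags.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2221", "localhost:2222"],      # parameter-server tasks
    "worker": ["localhost:2223", "localhost:2224"],  # compute (worker) tasks
})

# Every process creates exactly one server for its own task;
# this example would be worker task 0.
server = tf.train.Server(cluster, job_name="worker", task_index=0)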
The code executed by all nodes of the TensorFlow distributed cluster is the same. The distributed task code has a fixed pattern:
# Step 1: Parse command-line arguments to obtain the cluster information
# ps_hosts and worker_hosts, and the current node's role, job_name and task_index.

# Step 2: Create the server for the current task node.
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)

# Step 3: If the current node is a ps, call server.join() and wait indefinitely;
# if it is a worker, continue with step 4.
if FLAGS.job_name == "ps":
    server.join()

# Step 4: Build the model to be trained.
# Build the TensorFlow graph model here.

# Step 5: Create a tf.train.Supervisor to manage the model's training process.
# The supervisor takes care of session initialization and restoring from a checkpoint.
sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0), logdir="/tmp/train_logs")
sess = sv.prepare_or_wait_for_session(server.target)

# Loop until the supervisor shuts down.
while not sv.should_stop():
    # Train the model.
    ...
TensorFlow Distributed Training Code Framework
Following the fixed pattern described above, a distributed TensorFlow program can be written with the framework shown below.
import tensorflow as tf

# Flags for defining the tf.train.ClusterSpec
tf.app.flags.DEFINE_string("ps_hosts", "",
                           "Comma-separated list of hostname:port pairs")
tf.app.flags.DEFINE_string("worker_hosts", "",
                           "Comma-separated list of hostname:port pairs")

# Flags for defining the tf.train.Server
tf.app.flags.DEFINE_string("job_name", "", "One of 'ps', 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")

FLAGS = tf.app.flags.FLAGS


def main(_):
    ps_hosts = FLAGS.ps_hosts.split(",")
    worker_hosts = FLAGS.worker_hosts.split(",")

    # Create a cluster from the parameter server and worker hosts.
    cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

    # Create and start a server for the local task.
    server = tf.train.Server(cluster,
                             job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index)

    if FLAGS.job_name == "ps":
        server.join()
    elif FLAGS.job_name == "worker":
        # Assigns ops to the local worker by default.
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % FLAGS.task_index,
                cluster=cluster)):

            # Build model...
            loss = ...
            global_step = tf.Variable(0)

            train_op = tf.train.AdagradOptimizer(0.01).minimize(
                loss, global_step=global_step)

            saver = tf.train.Saver()
            summary_op = tf.merge_all_summaries()
            init_op = tf.initialize_all_variables()

        # Create a "supervisor", which oversees the training process.
        sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                                 logdir="/tmp/train_logs",
                                 init_op=init_op,
                                 summary_op=summary_op,
                                 saver=saver,
                                 global_step=global_step,
                                 save_model_secs=600)

        # The supervisor takes care of session initialization and restoring from
        # a checkpoint.
        sess = sv.prepare_or_wait_for_session(server.target)

        # Start queue runners for the input pipelines (if any).
        sv.start_queue_runners(sess)

        # Loop until the supervisor shuts down (or 1000000 steps have completed).
        step = 0
        while not sv.should_stop() and step < 1000000:
            # Run a training step asynchronously.
            # See `tf.train.SyncReplicasOptimizer` for additional details on how to
            # perform *synchronous* training.
            _, step = sess.run([train_op, global_step])


if __name__ == "__main__":
    tf.app.run()
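The loop above runs training steps asynchronously. The comment in the loop points at tf.train.SyncReplicasOptimizer for synchronous training; the following is only a minimal sketch of how the optimizer line of the framework could be wrapped (the argument names follow the TensorFlow 1.x API, and the replica counts are assumptions for a two-worker cluster, not part of the original article):

# A minimal sketch, assuming two worker replicas (TensorFlow 1.x API);
# this would replace the AdagradOptimizer line in the framework above.
opt = tf.train.AdagradOptimizer(0.01)
sync_opt = tf.train.SyncReplicasOptimizer(
    opt,
    replicas_to_aggregate=2,   # gradients to accumulate before applying one update
    total_num_replicas=2)      # total number of worker tasks
train_op = sync_opt.minimize(loss, global_step=global_step)
# Synchronous training also needs extra initialization ops run on the chief worker;
# see the tf.train.SyncReplicasOptimizer documentation for the full setup.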
For any TensorFlow distributed program, only two parts need to be written by the user: the code that builds the TensorFlow graph model, and the code that executes a single training step.
Distributed MNIST Task
We construct a distributed MNIST example for verification by modifying the mnist_softmax.py provided in tensorflow/tensorflow. Please refer to mnist_dist.py for the modified code.
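mnist_dist.py is not reproduced here; roughly speaking, its graph-building part fills the loss = ... placeholder of the framework above with the softmax regression from mnist_softmax.py. A sketch of what that part might look like (the variable names, and everything beyond the standard softmax model, are assumptions rather than the actual file):

import tensorflow as tf

# Hypothetical sketch of the model-building section of a distributed mnist_dist.py.
x = tf.placeholder(tf.float32, [None, 784])   # flattened 28x28 MNIST images
y_ = tf.placeholder(tf.float32, [None, 10])   # one-hot labels

W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.matmul(x, W) + b

# Cross-entropy loss, as in mnist_softmax.py (TensorFlow 1.x keyword arguments).
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))

global_step = tf.Variable(0)
train_op = tf.train.AdagradOptimizer(0.01).minimize(loss, global_step=global_step)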
We also use the TensorFlow Docker image to start a container for verification.
$ docker run -d -v /path/to/your/code:/tensorflow/mnist --name tensorflow tensorflow/tensorflow
After the TensorFlow container has started, open 4 terminals, then enter the container and switch to the /tensorflow/mnist directory with the commands below:
$ docker exec -ti tensorflow /bin/bash
$ cd /tensorflow/mnist
Then execute one of the following commands in each of the four terminals to start a task node of the TensorFlow cluster:
# Start ps 0
python mnist_dist.py --ps_hosts=localhost:2221,localhost:2222 --worker_hosts=localhost:2223,localhost:2224 --job_name=ps --task_index=0
# Start ps 1
python mnist_dist.py --ps_hosts=localhost:2221,localhost:2222 --worker_hosts=localhost:2223,localhost:2224 --job_name=ps --task_index=1
# Start worker 0
python mnist_dist.py --ps_hosts=localhost:2221,localhost:2222 --worker_hosts=localhost:2223,localhost:2224 --job_name=worker --task_index=0
# Start worker 1
python mnist_dist.py --ps_hosts=localhost:2221,localhost:2222 --worker_hosts=localhost:2223,localhost:2224 --job_name=worker --task_index=1
You can verify the result for yourself.