Introduction to TensorFlow distributed deployment

A major feature of TensorFlow 0.8 is that it can be deployed on distributed clusters. This article is translated from TensorFlow's distributed deployment guide.

Distributed TensorFlow

This article describes how to build a cluster of TensorFlow servers and deploy a computation graph across that distributed cluster. It assumes that you are familiar with the basics of working with TensorFlow.

Writing a distributed Hello World example

The following is an example of a simple TensorFlow distributed program.

# Start a TensorFlow server as a single-process "cluster".
$ python
>>> import tensorflow as tf
>>> c = tf.constant("Hello, distributed TensorFlow!")
>>> server = tf.train.Server.create_local_server()
>>> sess = tf.Session(server.target)  # Create a session on the server.
>>> sess.run(c)
'Hello, distributed TensorFlow!'

tf.train.Server.create_local_server() creates a single-process cluster on the local machine, with an in-process server that is started automatically.

Creating a cluster

A cluster in TensorFlow is a set of tasks that participate in the distributed execution of a TensorFlow graph. Each task is associated with a TensorFlow server, which contains a master that can be used to create sessions and a worker that executes operations in the graph. A cluster can also be divided into one or more jobs, where each job contains one or more tasks. (This is the author's understanding of the relationships within a cluster.)

To create a cluster, you start one TensorFlow server per task. The tasks typically run on different machines, but you can also start multiple tasks on the same machine (for example, to control different local GPU devices). In each task, do the following two steps:

  1. Create a tf.train.ClusterSpec that describes all of the tasks in the cluster. This description should be identical for every task.
  2. Create a tf.train.Server, passing the tf.train.ClusterSpec to its constructor, and identify the local task with a job name and task index.
Creating a tf.train.ClusterSpec to describe the cluster

The constructor argument of tf.train.ClusterSpec is a dictionary that maps job names to lists of tasks, where each task is specified by a host address (or IP) and port number. The mapping is illustrated by the following examples:

tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})

defines the available tasks:

/job:local/task:0
/job:local/task:1

tf.train.ClusterSpec({
    "worker": [
        "worker0.example.com:2222",
        "worker1.example.com:2222",
        "worker2.example.com:2222"
    ],
    "ps": [
        "ps0.example.com:2222",
        "ps1.example.com:2222"
    ]
})

defines the available tasks:

/job:worker/task:0
/job:worker/task:1
/job:worker/task:2
/job:ps/task:0
/job:ps/task:1
Creating a tf.train.Server instance in each task

Each tf.train.Server object contains a set of local devices, a set of connections to the other tasks in its cluster, and a "session target" that can use these resources to perform distributed computation. Each server is a member of a specific named job and has its own task index within that job, and it can communicate with any other server in the cluster.
The following two code snippets show how to configure two different tasks on the local ports 2222 and 2223.

# In task 0:
cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster, job_name="local", task_index=0)

# In task 1:
cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster, job_name="local", task_index=1)
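
Once both servers are running, a client can create a session against either server's target and place operations on specific tasks in the cluster. The snippet below is a minimal, illustrative sketch of such a client, assuming the two-task "local" cluster above; the constant and the choice of task 1 are only for demonstration and are not part of the original guide.

# Client sketch (illustrative): connect through task 0's gRPC target and
# run an op that is pinned to task 1 of the "local" job.
import tensorflow as tf

with tf.device("/job:local/task:1"):
  c = tf.constant("Hello from task 1")

with tf.Session("grpc://localhost:2222") as sess:
  print(sess.run(c))  # Executed by the server listening on localhost:2223.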

Note: specifying cluster configurations manually for each task can become tedious, especially for large clusters. The TensorFlow team is developing tools for launching tasks programmatically, for example on top of a cluster manager such as Kubernetes. If there is a particular cluster manager you would like TensorFlow to support, you can raise the request in a GitHub issue.

Specifying distributed devices in your model

To run operations on a particular task (or on a particular CPU or GPU within it), you can use the tf.device() function, just as you would to pin code to a local CPU or GPU. For example:

with tf.device("/job:ps/task:0"):  weights_1 = tf.Variable(...)  biases_1 = tf.Variable(...)with tf.device("/job:ps/task:1"):  weights_2 = tf.Variable(...)  biases_2 = tf.Variable(...)with tf.device("/job:worker/task:7"):  input, labels = ...  layer_1 = tf.nn.relu(tf.matmul(input, weights_1) + biases_1)  logits = tf.nn.relu(tf.matmul(layer_1, weights_2) + biases_2)  # ...  train_op = ...with tf.Session("grpc://worker7.example.com:2222") as sess:  for _ in range(10000):    sess.run(train_op)

In the above example, the variables are created on two tasks in the ps job, and the compute-intensive part of the model is created in the worker job. TensorFlow inserts the necessary data transfers between the jobs internally (from ps to worker for the forward pass, and from worker to ps for applying gradients).

Replicated training

A common training configuration, known as "data parallelism", uses multiple tasks in a worker job to train the same model on different slices of the data, together with one or more tasks hosted on different machines that continuously update the shared parameters (making up the ps job). All of these tasks can run on different machines. There are many ways to implement this logic; the TensorFlow team is currently building libraries that simplify the work of specifying a replicated model. Possible approaches include:

  • In-graph replication. In this approach, the client builds a single tf.Graph in which the parameters (tf.Variable nodes) are declared on the ps job (/job:ps), while multiple copies of the compute-intensive part of the model are pinned to different tasks in the worker job (/job:worker); a minimal sketch follows this list.
  • Between-graph replication. In this approach, there is a separate client for each worker task (/job:worker). Each client builds a similar graph structure in which the parameters are declared on the ps job (/job:ps) and mapped to the ps tasks using tf.train.replica_device_setter(), while a single copy of the compute-intensive part of the model is pinned to the local task in /job:worker.
  • Asynchronous training. In this approach, each replica of the graph runs an independent training loop without coordination. It can be combined with either of the two forms of replication above.
  • Synchronous training. In this approach, all replicas read the same values of the current parameters, compute gradients in parallel, and then merge the results before applying them. It can be used with in-graph replication (for example, averaging gradients across GPUs as in the CIFAR-10 multi-GPU trainer) or with between-graph replication (for example, using tf.train.SyncReplicasOptimizer).
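
To make the in-graph approach concrete, here is a minimal sketch of in-graph replication, assuming the worker/ps cluster described earlier: a single client builds one graph in which the parameters live on a ps task while two copies of the compute part are pinned to two worker tasks, and, for simplicity, the replica losses (rather than their gradients) are averaged within the same graph. The layer sizes, loss, and optimizer are illustrative assumptions, not taken from the original text.

# In-graph replication sketch (illustrative sizes, loss, and optimizer).
import tensorflow as tf

# Parameters are declared once, on a parameter-server task.
with tf.device("/job:ps/task:0"):
  weights = tf.Variable(tf.zeros([100, 10]))
  biases = tf.Variable(tf.zeros([10]))

# One copy of the compute-intensive part per worker task, all in one graph.
losses = []
for i in range(2):
  with tf.device("/job:worker/task:%d" % i):
    x = tf.placeholder(tf.float32, [None, 100])
    y = tf.placeholder(tf.float32, [None, 10])
    logits = tf.matmul(x, weights) + biases
    losses.append(tf.reduce_mean(tf.square(logits - y)))

# The replicas are combined inside the same graph and trained together.
total_loss = tf.add_n(losses) / 2.0
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(total_loss)
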
Examples of distributed training programs

The following code shows the general framework of a distributed training program, implementing between-graph replication and asynchronous training. It includes the code for both the parameter server and the worker tasks.

import tensorflow as tf

# Flags for defining the tf.train.ClusterSpec
tf.app.flags.DEFINE_string("ps_hosts", "",
                           "Comma-separated list of hostname:port pairs")
tf.app.flags.DEFINE_string("worker_hosts", "",
                           "Comma-separated list of hostname:port pairs")

# Flags for defining the tf.train.Server
tf.app.flags.DEFINE_string("job_name", "", "One of 'ps', 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")

FLAGS = tf.app.flags.FLAGS


def main(_):
  ps_hosts = FLAGS.ps_hosts.split(",")
  worker_hosts = FLAGS.worker_hosts.split(",")

  # Create a cluster from the parameter server and worker hosts.
  cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

  # Create and start a server for the local task.
  # task_index identifies this task within its job.
  server = tf.train.Server(cluster,
                           job_name=FLAGS.job_name,
                           task_index=FLAGS.task_index)

  if FLAGS.job_name == "ps":
    server.join()
  elif FLAGS.job_name == "worker":

    # Assigns ops to the local worker by default.
    with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster)):

      # Build model...
      loss = ...
      global_step = tf.Variable(0)

      train_op = tf.train.AdagradOptimizer(0.01).minimize(
          loss, global_step=global_step)

      saver = tf.train.Saver()
      summary_op = tf.merge_all_summaries()
      init_op = tf.initialize_all_variables()

    # Create a "supervisor", which oversees the training process.
    sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                             logdir="/tmp/train_logs",
                             init_op=init_op,
                             summary_op=summary_op,
                             saver=saver,
                             global_step=global_step,
                             save_model_secs=600)

    # The supervisor takes care of session initialization, restoring from
    # a checkpoint, and closing when done or an error occurs.
    with sv.managed_session(server.target) as sess:
      # Loop until the supervisor shuts down or 1000000 steps have completed.
      step = 0
      while not sv.should_stop() and step < 1000000:
        # Run a training step asynchronously.
        # See `tf.train.SyncReplicasOptimizer` for additional details on how to
        # perform *synchronous* training.
        _, step = sess.run([train_op, global_step])

    # Ask for all the services to stop.
    sv.stop()


if __name__ == "__main__":
  tf.app.run()

Run the following commands to start two parameter servers and two worker tasks (assuming the Python script above is saved as trainer.py):

# On ps0.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=ps --task_index=0

# On ps1.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=ps --task_index=1

# On worker0.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=worker --task_index=0

# On worker1.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=worker --task_index=1
Glossary

Client

  • A client is a program that builds a TensorFlow computation graph and constructs a tensorflow::Session to interact with a cluster. Clients are typically written in Python or C++. A single client process can interact with multiple TensorFlow servers at the same time (see the "Replicated training" section above), and a single server can also serve multiple clients.

Cluster
- A TensorFlow cluster comprises one or more jobs, each of which is divided into lists of one or more tasks. A cluster is typically dedicated to a particular high-level objective, such as training a neural network, using many machines in parallel. A cluster is defined by a tf.train.ClusterSpec object.
Job
- A job comprises a list of tasks that typically serve a common purpose. For example, the tasks in a job named ps (for "parameter server") typically host and update variables, while the tasks in a job named worker typically perform the stateless, compute-intensive work. The tasks in a job can run on different machines, and the roles are flexible; for example, a job named "worker" may also keep some state.
Master service
- An RPC service that provides remote access to a set of distributed devices and acts as a session target. The master service implements the tensorflow::Session interface and is responsible for coordinating work across one or more worker services. All TensorFlow servers implement the master service.
Task
- A task corresponds to a specific TensorFlow server and typically to a single process. A task belongs to a particular job and is identified by its index within that job.
TensorFlow server
- A process running a tf.train.Server instance, which is a member of a cluster and exports a master service and a worker service.
Worker service
- An RPC service that executes parts of a TensorFlow graph using its local devices. A worker service implements the worker_service.proto interface. All TensorFlow servers include the worker service.
