TensorFlow Parallel Computing: Multicore, Multithreading, and Graph Partitioning


GitHub: download the complete code

https://github.com/rockingdingo/tensorflow-tutorial/tree/master/mnist


Brief introduction

Training a deep neural network model with TensorFlow can take a long time, so parallel computing is an important way to improve running speed. TensorFlow provides several ways to run a program in parallel, and the questions to consider when using them are whether the computing device is a CPU or GPU, how many CPU cores to run in parallel, how to allocate resources to the computation graph, and so on. Below are some common ways to speed up a TensorFlow program in a Linux multi-core CPU environment.

1. Multicore parallelism: CPU core usage and resource allocation

In TensorFlow programs we often see the statement with tf.device("/cpu:0"):. Used alone, without any other restrictions, a TensorFlow program will by default consume all available memory and CPU cores: if your Linux server has an 8-core CPU, the program will quickly occupy every CPU it can use at nearly 100% utilization, ultimately affecting the other programs on the server. So we need to think about limiting the number of CPU cores and the resources the program uses.
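As a quick sanity check before capping cores, you can ask Python how many logical cores the machine exposes; this uses only the standard library, and the 4-core limit used later in this article is an example value, not a recommendation:

```python
import os

# Number of logical CPU cores visible to this process; useful for
# deciding how many cores to hand to TensorFlow via device_count.
n_cores = os.cpu_count()
print(n_cores)
```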

When constructing the tf.Session() object, you can pass in a tf.ConfigProto() argument to change the number of CPU cores, the number of threads, and other settings used by that TensorFlow session.

Code 1

import tensorflow as tf

config = tf.ConfigProto(device_count={"CPU": 4},  # limit to num_cpu_core CPU usage
                        inter_op_parallelism_threads=1,
                        intra_op_parallelism_threads=1,
                        log_device_placement=True)
with tf.Session(config=config) as sess:
    # to do


In the code above we construct a ConfigProto() with the device_count={"CPU": 4} argument and pass it into tf.Session() to allocate the appropriate resources to the session; here we allocate 4 CPU cores to the TensorFlow program.

Example

We made appropriate modifications to the MNIST CNN model convolutional.py from the TensorFlow tutorial, limiting it to 4 CPU cores, giving convolutional_multicore.py. In the run results, each step computes one batch in about 611 ms on average:




Figure 1: 4 CPU cores, single thread


2. Multithreading: setting the number of threads

When initializing tf.ConfigProto(), we can also control how many threads each operator (op) uses to compute in parallel by setting the intra_op_parallelism_threads and inter_op_parallelism_threads parameters. The difference is:

intra_op_parallelism_threads controls parallelism inside a single operator. When an op can be parallelized internally, such as matrix multiplication or reduce_sum, you can set intra_op_parallelism_threads to parallelize it ("intra" means internal).

inter_op_parallelism_threads controls parallel computation across multiple operators. When there are several ops that are relatively independent, with no direct path between them in the graph, TensorFlow will try to compute them in parallel, using a thread pool whose size is controlled by inter_op_parallelism_threads.
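The inter-op thread pool can be pictured with a standard-library analogy. This is plain Python, not TensorFlow: two independent "operators" with no edge between them are handed to a pool whose size plays the role of inter_op_parallelism_threads:

```python
from concurrent.futures import ThreadPoolExecutor

def op_a():
    # stands in for one independent operator in the graph
    return sum(range(1000))

def op_b():
    # a second operator with no data dependency on op_a
    return max(range(1000))

# With no edge between op_a and op_b, the scheduler is free to run them
# concurrently; max_workers mimics inter_op_parallelism_threads.
with ThreadPoolExecutor(max_workers=2) as pool:
    result_a = pool.submit(op_a).result()
    result_b = pool.submit(op_b).result()

print(result_a, result_b)  # → 499500 999
```

With max_workers=1 the two ops would simply run one after the other, which is exactly what inter_op_parallelism_threads=1 does to independent ops in a TensorFlow graph.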

Code 2

We again modify the MNIST convolution example convolutional_multicore.py from above, with the following parameters:

import tensorflow as tf

config = tf.ConfigProto(device_count={"CPU": 4},  # limit to num_cpu_core CPU usage
                        inter_op_parallelism_threads=1,
                        intra_op_parallelism_threads=4,
                        log_device_placement=True)
with tf.Session(config=config) as sess:
    # to do


Example

We compare thread counts of 2 and 4 by the average elapsed time per batch. With intra_op_parallelism_threads = 2, the average time per step drops from 610 ms to 380 ms.


With intra_op_parallelism_threads = 4, the average time per step drops from 610 ms to 230 ms.



The conclusion is that, under a fixed CPU core budget, the speed of a TensorFlow program can be greatly improved by setting the number of threads appropriately.

Reference: the implementation in tensorflow/core/protobuf/config.proto

https://github.com/tensorflow/tensorflow/blob/26b4dfa65d360f2793ad75083c797d57f8661b93/tensorflow/core/protobuf/config.proto#L165


3. Graph partitioning: assigning TensorFlow graph operations to different computing devices

Sometimes the deep network we build is very complex, and this happens: with multiple CPU cores running at the same time, some cores are idle while others sit at 100% utilization. We want to avoid this kind of unbalanced load across operators. If we split the graph into several parts and assign each part (for example, each layer of the network) to a different CPU core, we can optimize how the computation is distributed and improve the running speed.

A very intuitive design is to partition by layer: give each computationally heavy layer its own CPU core, and merge the lighter layers onto a shared core.

The following is a test that again rewrites the official convolutional.py example, assigning different layers to different CPU devices to optimize computing resources and speed up the program; see convolutional_graph_partitioned.py.

We declare a global variable device_id that records the ID of the CPU already in use, and call the next_device() function to return the next available CPU device ID. If one is available, it is allocated and the global device_id is incremented; the resulting device ID never exceeds the total core count defined in FLAGS.num_cpu_core. While building the model, the statement with tf.device(next_device()): assigns the conv, pool, and other operators to separate CPUs. The final result averages 229 ms per batch.

Code 3

import tensorflow as tf

device_id = -1  # global variable counting the device ids used so far

def next_device(use_cpu=True):
    ''' Return the next available device;
        Args: use_cpu, global device_id
        Return: new device id '''
    global device_id
    if use_cpu:
        # FLAGS.num_cpu_core is defined elsewhere in the script
        if (device_id + 1) < FLAGS.num_cpu_core:
            device_id += 1
        device = '/cpu:%d' % device_id
    else:
        if (device_id + 1) < FLAGS.num_gpu_core:
            device_id += 1
        device = '/gpu:%d' % device_id
    return device

with tf.device(next_device()):
    # to do: insert your code
    conv = ...
    pool = ...
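To sanity-check the round-robin allocation logic without TensorFlow installed, the counter can be exercised standalone; NUM_CPU_CORE = 4 here is a hypothetical stand-in for FLAGS.num_cpu_core:

```python
NUM_CPU_CORE = 4  # hypothetical stand-in for FLAGS.num_cpu_core
device_id = -1    # same global counter as in Code 3

def next_device():
    # Advance the counter until the core budget is exhausted, then keep
    # returning the last core (ids never exceed NUM_CPU_CORE - 1).
    global device_id
    if (device_id + 1) < NUM_CPU_CORE:
        device_id += 1
    return '/cpu:%d' % device_id

devices = [next_device() for _ in range(6)]
print(devices)  # → ['/cpu:0', '/cpu:1', '/cpu:2', '/cpu:3', '/cpu:3', '/cpu:3']
```

Note that once the budget is spent, every further op lands on the last core; a true round-robin (wrapping back to /cpu:0) would be a different design choice.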



Extended reading

DeepNLP artificial intelligence technology blog: http://www.deepnlp.org/blog/tensorflow-parallelism/
PyPI deepnlp, a deep learning NLP pipeline implemented on TensorFlow: https://pypi.python.org/pypi/deepnlp
TensorFlow official MNIST convolutional network example convolutional.py; multithreaded version convolutional_multithread.py; graph split across devices convolutional_graph_partitioned.py
