Learning notes TF040: Multi-GPU parallelism


TensorFlow supports two parallel training modes: model parallelism and data parallelism. In model parallelism, different computing nodes of the model are placed on different hardware devices and run concurrently, and the parallel scheme has to be designed for each specific model. Data parallelism is the more common and easier way to scale up training: multiple hardware devices each compute gradients on a different batch of data, and the gradients are aggregated to update a global set of parameters.
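For contrast with data parallelism, here is a minimal sketch of model parallelism, where different layers of one network are pinned to different devices with tf.device. The two-layer network and the layer sizes are hypothetical and unrelated to the CIFAR-10 model below.

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])

# Model parallelism: one network, different layers pinned to different devices.
with tf.device('/gpu:0'):
    w1 = tf.get_variable('w1', [784, 256])
    b1 = tf.get_variable('b1', [256], initializer=tf.zeros_initializer())
    hidden = tf.nn.relu(tf.matmul(x, w1) + b1)   # first layer runs on GPU 0
with tf.device('/gpu:1'):
    w2 = tf.get_variable('w2', [256, 10])
    b2 = tf.get_variable('b2', [10], initializer=tf.zeros_initializer())
    logits = tf.matmul(hidden, w2) + b2          # second layer runs on GPU 1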


Data parallelism: multiple GPUs train on different batches of data at the same time, and each GPU runs a replica of the same neural network. The network structure is identical on every GPU and the model parameters are shared.


Synchronous data parallelism: after all GPUs have computed the gradients for their batches, the gradients are combined and used to update the shared model parameters, which is similar to training with one large batch. It is most efficient when the GPUs have the same model and speed.
Asynchronous data parallelism: no GPU waits for the others; as soon as a GPU finishes computing its gradients, it immediately applies them to the shared model parameters.
Synchronous data parallelism usually converges faster than asynchronous parallelism and reaches higher model accuracy.
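The difference between the two schemes can be sketched in a few lines of plain NumPy; compute_grad and the parameter vector here are hypothetical stand-ins, not part of the CIFAR-10 code below.

import numpy as np

params = np.zeros(10)   # shared model parameters (hypothetical)
lr = 0.1
num_workers = 4

def compute_grad(params):
    # Hypothetical per-worker gradient for its own batch of data.
    return np.random.randn(*params.shape)

# Synchronous: wait for every worker, average the gradients, update once.
grads = [compute_grad(params) for _ in range(num_workers)]
params -= lr * np.mean(grads, axis=0)

# Asynchronous: each worker applies its gradient as soon as it is ready,
# so later updates may be computed against stale parameters.
for _ in range(num_workers):
    params -= lr * compute_grad(params)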


This example implements synchronous data parallelism on the CIFAR-10 dataset. First load the dependency libraries, including the cifar10 module from TensorFlow Models, which downloads CIFAR-10 and handles data preprocessing.


Set the batch size to 128, the maximum number of steps to 1,000,000 (training can be stopped at any time because the model is saved regularly), and the number of GPUs to 4.


Define the loss function tower_loss. cifar10.distorted_inputs produces augmented images and labels. Call cifar10.inference to build the convolutional network; each GPU builds its own copy of the network with the same structure, and the model parameters are shared. Call cifar10.loss to compute the loss from the network output and the labels (the losses are stored in a collection named 'losses'). tf.get_collection('losses', scope) retrieves the losses of the current GPU (restricted by scope), and tf.add_n sums them into total_loss, which is returned as the function result.


Define the average_gradients function, which combines the gradients from the different GPUs. The input tower_grads is a two-level list: the outer list has one entry per GPU, and each inner list holds that GPU's gradients for the different Variables. The innermost elements are (gradient, variable) tuples, so the overall form is [[(grad0_gpu0, var0_gpu0), (grad1_gpu0, var1_gpu0), ...], [(grad0_gpu1, var0_gpu1), (grad1_gpu1, var1_gpu1), ...], ...]. Create the list average_grads to hold the gradients averaged over the GPUs. zip(*tower_grads) transposes the two-level list into the form [[(grad0_gpu0, var0_gpu0), (grad0_gpu1, var0_gpu1), ...], [(grad1_gpu0, var1_gpu0), (grad1_gpu1, var1_gpu1), ...], ...], so that each element grad_and_vars groups the copies of the same Variable's gradient computed on the different GPUs. For each such group, compute the mean gradient: tf.expand_dims adds a leading dimension 0 (the tower dimension) to each gradient, the expanded gradients are collected in the list grads, tf.concat merges them along dimension 0, and tf.reduce_mean averages over dimension 0 while leaving all other dimensions intact. The averaged gradient is then paired with the Variable again to restore the original (gradient, variable) tuple format and appended to average_grads. After all Variables have been processed, return average_grads. A standalone example of the transpose and averaging step follows below.
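To make the transpose and averaging concrete, here is a small standalone sketch (TensorFlow 1.x). The gradient values are made-up constants and plain strings stand in for the Variable objects; it is not part of the training script.

import tensorflow as tf

# Two towers (GPUs) and two variables; the gradient values are made up.
tower_grads = [
    [(tf.constant([1.0, 2.0]), 'var0'), (tf.constant([3.0]), 'var1')],  # GPU 0
    [(tf.constant([3.0, 4.0]), 'var0'), (tf.constant([5.0]), 'var1')],  # GPU 1
]

# zip(*tower_grads) regroups the tuples so that each element collects the
# copies of one variable's gradient from every tower:
#   ((grad_var0_gpu0, 'var0'), (grad_var0_gpu1, 'var0')), ...
mean_grads = []
for grad_and_vars in zip(*tower_grads):
    grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]   # add tower dim 0
    mean_grads.append(tf.reduce_mean(tf.concat(grads, 0), 0))  # average over dim 0

with tf.Session() as sess:
    # Prints [array([2., 3.], dtype=float32), array([4.], dtype=float32)]
    print(sess.run(mean_grads))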


Define the training function train. Set the default computing device to the CPU. global_step records the global number of training steps; from it, compute the number of batches per epoch and decay_steps, the number of steps after which the learning rate decays. tf.train.exponential_decay creates a learning rate that decays with the training step: the first argument is the initial learning rate, the second is the global step, the third is the number of steps per decay, the fourth is the decay rate, and staircase is set to True. Use GradientDescent as the optimization algorithm and pass in the decaying learning rate.
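With staircase=True, the learning rate produced by tf.train.exponential_decay follows the documented formula sketched below; the numbers in the comment are illustrative values, not necessarily the constants defined in cifar10.

def decayed_learning_rate(initial_lr, decay_rate, global_step, decay_steps):
    # staircase=True floors the exponent, so the rate drops in discrete steps.
    return initial_lr * decay_rate ** (global_step // decay_steps)

# Illustrative values: initial_lr=0.1, decay_rate=0.1, decay_steps=100000.
# Steps 0..99999 use 0.1, steps 100000..199999 use 0.01, and so on.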


Define the list tower_grads to store the gradients computed by each GPU. Loop over the number of GPUs; inside the loop, tf.device specifies which GPU to use and tf.name_scope sets the namespace for that tower.


On each GPU, tower_loss computes the loss, and tf.get_variable_scope().reuse_variables() makes the variables reusable so that all GPUs share one set of model parameters. opt.compute_gradients(loss) computes the gradients for that single GPU, and the result is appended to tower_grads. After the loop, average_gradients computes the mean gradients and opt.apply_gradients updates the model parameters.


Create a Saver for the model and set the Session's allow_soft_placement parameter to True, because some operations can only run on the CPU and would fail without soft placement. Initialize all parameters. tf.train.start_queue_runners() starts the queues that prepare a large number of augmented training samples, so that training is not blocked waiting for sample generation.


The training loop runs for at most max_steps iterations. Each step runs the gradient-update operation apply_gradient_op (one training step) together with the loss operation loss, and time.time() records the duration. Every 10 steps, the current batch loss, the number of samples trained per second, and the time spent per batch are printed. Every 1000 steps, the Saver saves the entire model to a checkpoint file.


cifar10.maybe_download_and_extract() downloads the complete CIFAR-10 data, and train() starts the training.


Over the course of training, the loss drops to about 0.07. The average time per batch is about 0.021 s, and the training throughput is about 6000 samples per second, roughly 4 times that of a single GPU.




import os.path
import re
import time

import numpy as np
import tensorflow as tf
import cifar10

batch_size = 128
# train_dir = '/tmp/cifar10_train'
max_steps = 1000000
num_gpus = 4
# log_device_placement = False
def tower_loss(scope):
  """Calculate the total loss on a single tower running the CIFAR model.

  Args:
    scope: unique prefix string identifying the CIFAR tower, e.g. 'tower_0'

  Returns:
    Tensor of shape [] containing the total loss for a batch of data
  """
  # Get images and labels for CIFAR-10.
  images, labels = cifar10.distorted_inputs()
  # Build inference Graph.
  logits = cifar10.inference(images)
  # Build the portion of the Graph calculating the losses. Note that we will
  # assemble the total_loss using a custom function below.
  _ = cifar10.loss(logits, labels)
  # Assemble all of the losses for the current tower only.
  losses = tf.get_collection('losses', scope)
  # Calculate the total loss for the current tower.
  total_loss = tf.add_n(losses, name='total_loss')
  # Compute the moving average of all individual losses and the total loss.
  # loss_averages = tf.train.ExponentialMovingAverage(0.9, name='avg')
  # loss_averages_op = loss_averages.apply(losses + [total_loss])
  # Attach a scalar summary to all individual losses and the total loss; do
  # the same for the averaged version of the losses.
  # for l in losses + [total_loss]:
  #   # Remove 'tower_[0-9]/' from the name in case this is a multi-GPU
  #   # training session. This helps the clarity of presentation on TensorBoard.
  #   loss_name = re.sub('%s_[0-9]*/' % cifar10.TOWER_NAME, '', l.op.name)
  #   # Name each loss as '(raw)' and name the moving average version of the
  #   # loss as the original loss name.
  #   tf.scalar_summary(loss_name + ' (raw)', l)
  #   tf.scalar_summary(loss_name, loss_averages.average(l))
  # with tf.control_dependencies([loss_averages_op]):
  #   total_loss = tf.identity(total_loss)
  return total_loss
def average_gradients(tower_grads):
  """Calculate the average gradient for each shared variable across all towers.

  Note that this function provides a synchronization point across all towers.

  Args:
    tower_grads: List of lists of (gradient, variable) tuples. The outer list
      is over the individual towers (GPUs). The inner list is over the
      gradients computed for each variable on that tower.

  Returns:
    List of pairs of (gradient, variable) where the gradient has been averaged
    across all towers.
  """
  average_grads = []
  for grad_and_vars in zip(*tower_grads):
    # Note that each grad_and_vars looks like the following:
    #   ((grad0_gpu0, var0_gpu0), ..., (grad0_gpuN, var0_gpuN))
    grads = []
    for g, _ in grad_and_vars:
      # Add a 0 dimension to the gradients to represent the tower.
      expanded_g = tf.expand_dims(g, 0)
      # Append on a 'tower' dimension which we will average over below.
      grads.append(expanded_g)
    # Average over the 'tower' dimension.
    grad = tf.concat(grads, 0)
    grad = tf.reduce_mean(grad, 0)
    # Keep in mind that the Variables are redundant because they are shared
    # across towers. So we will just return the first tower's pointer to
    # the Variable.
    v = grad_and_vars[0][1]
    grad_and_var = (grad, v)
    average_grads.append(grad_and_var)
  return average_grads
def train():
  """Train CIFAR-10 for a number of steps."""
  with tf.Graph().as_default(), tf.device('/cpu:0'):
    # Create a variable to count the number of training steps. This equals the
    # number of batches processed * num_gpus.
    global_step = tf.get_variable(
        'global_step', [],
        initializer=tf.constant_initializer(0), trainable=False)
    # Calculate the learning rate schedule.
    num_batches_per_epoch = (cifar10.NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN /
                             batch_size)
    decay_steps = int(num_batches_per_epoch * cifar10.NUM_EPOCHS_PER_DECAY)
    # Decay the learning rate exponentially based on the number of steps.
    lr = tf.train.exponential_decay(cifar10.INITIAL_LEARNING_RATE,
                                    global_step,
                                    decay_steps,
                                    cifar10.LEARNING_RATE_DECAY_FACTOR,
                                    staircase=True)
    # Create an optimizer that performs gradient descent.
    opt = tf.train.GradientDescentOptimizer(lr)
    # Calculate the gradients for each model tower.
    tower_grads = []
    for i in range(num_gpus):
      with tf.device('/gpu:%d' % i):
        with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
          # Calculate the loss for one tower of the CIFAR model. This function
          # constructs the entire CIFAR model but shares the variables across
          # all towers.
          loss = tower_loss(scope)
          # Reuse variables for the next tower.
          tf.get_variable_scope().reuse_variables()
          # Retain the summaries from the final tower.
          # summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)
          # Calculate the gradients for the batch of data on this CIFAR tower.
          grads = opt.compute_gradients(loss)
          # Keep track of the gradients across all towers.
          tower_grads.append(grads)
    # We must calculate the mean of each gradient. Note that this is the
    # synchronization point across all towers.
    grads = average_gradients(tower_grads)
    # Add a summary to track the learning rate.
    # summaries.append(tf.scalar_summary('learning_rate', lr))
    # Add histograms for gradients.
    # for grad, var in grads:
    #   if grad is not None:
    #     summaries.append(
    #         tf.histogram_summary(var.op.name + '/gradients', grad))
    # Apply the gradients to adjust the shared variables.
    apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)
    # Add histograms for trainable variables.
    # for var in tf.trainable_variables():
    #   summaries.append(tf.histogram_summary(var.op.name, var))
    # Track the moving averages of all trainable variables.
    # variable_averages = tf.train.ExponentialMovingAverage(
    #     cifar10.MOVING_AVERAGE_DECAY, global_step)
    # variables_averages_op = variable_averages.apply(tf.trainable_variables())
    # Group all updates into a single train op.
    # train_op = tf.group(apply_gradient_op, variables_averages_op)
    # Create a saver.
    saver = tf.train.Saver(tf.all_variables())
    # Build the summary operation from the last tower summaries.
    # summary_op = tf.merge_summary(summaries)
    # Build an initialization operation to run below.
    init = tf.global_variables_initializer()
    # Start running operations on the Graph. allow_soft_placement must be set
    # to True to build towers on GPU, as some of the ops do not have GPU
    # implementations.
    sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
    sess.run(init)
    # Start the queue runners.
    tf.train.start_queue_runners(sess=sess)
    # summary_writer = tf.train.SummaryWriter(train_dir, sess.graph)
    for step in range(max_steps):
      start_time = time.time()
      _, loss_value = sess.run([apply_gradient_op, loss])
      duration = time.time() - start_time
      assert not np.isnan(loss_value), 'Model diverged with loss = NaN'
      if step % 10 == 0:
        num_examples_per_step = batch_size * num_gpus
        examples_per_sec = num_examples_per_step / duration
        sec_per_batch = duration / num_gpus
        format_str = ('step %d, loss = %.2f (%.1f examples/sec; %.3f '
                      'sec/batch)')
        print(format_str % (step, loss_value,
                            examples_per_sec, sec_per_batch))
      # if step % 100 == 0:
      #   summary_str = sess.run(summary_op)
      #   summary_writer.add_summary(summary_str, step)
      # Save the model checkpoint periodically.
      if step % 1000 == 0 or (step + 1) == max_steps:
        # checkpoint_path = os.path.join(train_dir, 'model.ckpt')
        saver.save(sess, '/tmp/cifar10_train/model.ckpt', global_step=step)
cifar10.maybe_download_and_extract()
# if tf.gfile.Exists(train_dir):
#   tf.gfile.DeleteRecursively(train_dir)
# tf.gfile.MakeDirs(train_dir)
train()


References:
TensorFlow practice

