Implementing Batch Normalization in TensorFlow

I. BN (Batch Normalization) Algorithm

1. The Importance of Data Normalization

The essence of neural network training is learning the distribution of the data. If the training data and the test data follow different distributions, the generalization ability of the model drops sharply. Likewise, if the distribution of each mini-batch differs during training, the network has to adapt to a new distribution at every iteration, which makes learning fluctuate, convergence harder, and training slower. In a deep network, small changes in the first few layers are accumulated and amplified by the later layers, so any shift in the distribution of the training data propagates through the network and slows training further.

2. Advantages of the BN Algorithm

1) To accelerate training with gradient descent, we normally rely on tricks such as an exponentially decaying learning rate: learn quickly at the beginning, then settle slowly into the region of the optimum. With BN, you can instead choose a relatively large initial learning rate together with a fast learning-rate decay, which greatly speeds up training. Even with a small learning rate, a network with BN converges faster than the same network without it. In short, BN converges quickly.

2) BN improves the generalization ability of the network. With BN, you can often remove the dropout and L2 regularization terms that were added to fight over-fitting, or use a smaller L2 regularization coefficient.

3) BN is itself a normalization layer, so a separate Local Response Normalization (LRN) layer, as used in AlexNet, is no longer needed.

3. BN Algorithm Overview

The key point of the BN algorithm is the transform-and-reconstruct ("scale and shift") step, which introduces two trainable parameters, γ and β:

After these two parameters are introduced, the network can learn to restore the feature distribution that the original, un-normalized network would have learned. The forward pass of a BN layer is as follows:
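For reference, the standard transform from the Batch Normalization paper, which the text around this point describes, can be written as:

\begin{aligned}
\mu_{\mathcal{B}} &= \frac{1}{m}\sum_{i=1}^{m} x_i && \text{(mini-batch mean)} \\
\sigma_{\mathcal{B}}^{2} &= \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^{2} && \text{(mini-batch variance)} \\
\hat{x}_i &= \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2} + \epsilon}} && \text{(normalize)} \\
y_i &= \gamma \hat{x}_i + \beta && \text{(scale and shift)}
\end{aligned}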

Here m denotes the batch size. Every operation in Batch Normalization is smooth and differentiable, so back-propagation runs as usual and the parameters γ and β can be learned. Note that Batch Normalization behaves differently during training and testing: during training, μ_B and σ_B are computed from the current batch; at test time, they should be replaced by the moving averages (or similarly accumulated statistics) recorded during training, not by the statistics of the current batch.

II. Related TensorFlow Functions

1. tf.nn.moments(x, axes, shift=None, name=None, keep_dims=False)

x is the input tensor, and axes specifies the dimensions over which to compute the statistics, i.e. the dimensions to normalize over. [0] means the batch dimension; for image data you would usually pass [0, 1, 2], which computes the mean and variance over [batch, height, width] but not over the channel dimension. The function returns two tensors, mean and variance.
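A minimal usage sketch (TensorFlow 1.x; the tensor values and shapes are chosen purely for illustration):

import tensorflow as tf

# a batch of 4 feature vectors with 3 features each (illustrative data)
x = tf.constant([[1.0, 2.0, 3.0],
                 [2.0, 4.0, 6.0],
                 [3.0, 6.0, 9.0],
                 [4.0, 8.0, 12.0]])

# statistics over the batch dimension only: one mean/variance per feature
mean, variance = tf.nn.moments(x, axes=[0])

# for an image batch of shape [batch, height, width, channels], per-channel
# statistics would be tf.nn.moments(images, axes=[0, 1, 2])

with tf.Session() as sess:
    print(sess.run([mean, variance]))   # per-feature mean and variance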

2. tf.identity(input, name=None)

Returns a tensor with the same shape and contents as the input tensor. In the BN code below, it is used together with tf.control_dependencies to force the moving-average update to run before the mean and variance are read.
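A small sketch of that pattern, with an illustrative counter variable standing in for the moving-average update used later in this article: wrapping tf.identity in tf.control_dependencies makes the update run every time the value is read.

import tensorflow as tf

counter = tf.Variable(0.0)
increment = tf.assign_add(counter, 1.0)   # some update op

with tf.control_dependencies([increment]):
    # tf.identity creates a new tensor that depends on the update op,
    # so evaluating `value` always triggers `increment` first
    value = tf.identity(counter)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(value))   # 1.0
    print(sess.run(value))   # 2.0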

3. tf.nn.batch_normalization(x, mean, variance, offset, scale, variance_epsilon, name=None)

It computes scale * (x - mean) / sqrt(variance + variance_epsilon) + offset.

The mean and variance arguments can be obtained from tf.nn.moments. offset and scale are trainable parameters: offset is usually initialized to 0 and scale to 1, and both have the same shape as mean. variance_epsilon is a very small constant, such as 0.001, that prevents division by zero.
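Putting the pieces together, a minimal BN step for a fully connected output might look like the following sketch (TensorFlow 1.x; out_size and fc_output are illustrative names for whatever your layer produces):

import tensorflow as tf

out_size = 30                                               # illustrative layer width
fc_output = tf.placeholder(tf.float32, [None, out_size])    # pre-activation values

# per-feature statistics over the batch dimension
mean, variance = tf.nn.moments(fc_output, axes=[0])

# trainable scale (gamma) and offset (beta)
scale = tf.Variable(tf.ones([out_size]))
offset = tf.Variable(tf.zeros([out_size]))
epsilon = 0.001

# scale * (x - mean) / sqrt(variance + epsilon) + offset
bn_output = tf.nn.batch_normalization(fc_output, mean, variance,
                                      offset, scale, epsilon)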

III. TensorFlow Code Implementation

1. Complete Code

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

ACTIVATION = tf.nn.relu
N_LAYERS = 7           # 7 hidden layers in total
N_HIDDEN_UNITS = 30    # 30 neurons per hidden layer


def fix_seed(seed=1):
    # set the random number seeds
    np.random.seed(seed)
    tf.set_random_seed(seed)


def plot_his(inputs, inputs_norm):
    # plot a histogram of the inputs of every layer
    for j, all_inputs in enumerate([inputs, inputs_norm]):
        for i, input in enumerate(all_inputs):
            plt.subplot(2, len(all_inputs), j * len(all_inputs) + (i + 1))
            plt.cla()
            if i == 0:
                the_range = (-7, 10)
            else:
                the_range = (-1, 1)
            plt.hist(input.ravel(), bins=15, range=the_range, color='#FF5733')
            plt.yticks(())
            if j == 1:
                plt.xticks(the_range)
            else:
                plt.xticks(())
            ax = plt.gca()
            ax.spines['right'].set_color('none')
            ax.spines['top'].set_color('none')
            plt.title("%s normalizing" % ("Without" if j == 0 else "With"))
    plt.draw()
    plt.pause(0.01)


def built_net(xs, ys, norm):
    # build the network

    def add_layer(inputs, in_size, out_size, activation_function=None, norm=False):
        # add one fully connected layer
        Weights = tf.Variable(tf.random_normal([in_size, out_size], mean=0.0, stddev=1.0))
        biases = tf.Variable(tf.zeros([1, out_size]) + 0.1)
        Wx_plus_b = tf.matmul(inputs, Weights) + biases

        if norm:  # whether this is a Batch Normalization layer
            # compute the mean and variance; axes=[0] is the batch dimension
            fc_mean, fc_var = tf.nn.moments(Wx_plus_b, axes=[0])
            scale = tf.Variable(tf.ones([out_size]))
            shift = tf.Variable(tf.zeros([out_size]))
            epsilon = 0.001

            # moving averages of the mean and variance
            ema = tf.train.ExponentialMovingAverage(decay=0.5)

            def mean_var_with_update():
                ema_apply_op = ema.apply([fc_mean, fc_var])
                with tf.control_dependencies([ema_apply_op]):
                    return tf.identity(fc_mean), tf.identity(fc_var)

            mean, var = mean_var_with_update()
            Wx_plus_b = tf.nn.batch_normalization(Wx_plus_b, mean, var, shift, scale, epsilon)

        if activation_function is None:
            outputs = Wx_plus_b
        else:
            outputs = activation_function(Wx_plus_b)
        return outputs

    fix_seed(1)

    if norm:  # apply BN to the input layer as well
        fc_mean, fc_var = tf.nn.moments(xs, axes=[0])
        scale = tf.Variable(tf.ones([1]))
        shift = tf.Variable(tf.zeros([1]))
        epsilon = 0.001
        ema = tf.train.ExponentialMovingAverage(decay=0.5)

        def mean_var_with_update():
            ema_apply_op = ema.apply([fc_mean, fc_var])
            with tf.control_dependencies([ema_apply_op]):
                return tf.identity(fc_mean), tf.identity(fc_var)

        mean, var = mean_var_with_update()
        xs = tf.nn.batch_normalization(xs, mean, var, shift, scale, epsilon)

    layers_inputs = [xs]  # record the input of each layer

    for l_n in range(N_LAYERS):  # add the 7 hidden layers one by one
        layer_input = layers_inputs[l_n]
        in_size = layers_inputs[l_n].get_shape()[1].value
        output = add_layer(layer_input, in_size, N_HIDDEN_UNITS, ACTIVATION, norm)
        layers_inputs.append(output)

    prediction = add_layer(layers_inputs[-1], 30, 1, activation_function=None)

    cost = tf.reduce_mean(tf.reduce_sum(tf.square(ys - prediction), reduction_indices=[1]))
    train_op = tf.train.GradientDescentOptimizer(0.001).minimize(cost)
    return [train_op, cost, layers_inputs]


fix_seed(1)
x_data = np.linspace(-7, 10, 2500)[:, np.newaxis]
np.random.shuffle(x_data)
noise = np.random.normal(0, 8, x_data.shape)
y_data = np.square(x_data) - 5 + noise

plt.scatter(x_data, y_data)
plt.show()

xs = tf.placeholder(tf.float32, [None, 1])
ys = tf.placeholder(tf.float32, [None, 1])

train_op, cost, layers_inputs = built_net(xs, ys, norm=False)
train_op_norm, cost_norm, layers_inputs_norm = built_net(xs, ys, norm=True)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    cost_his = []
    cost_his_norm = []
    record_step = 5

    plt.ion()
    plt.figure(figsize=(7, 3))
    for i in range(250):
        if i % 50 == 0:
            # plot the distribution of every layer's inputs
            all_inputs, all_inputs_norm = sess.run(
                [layers_inputs, layers_inputs_norm],
                feed_dict={xs: x_data, ys: y_data})
            plot_his(all_inputs, all_inputs_norm)

        sess.run([train_op, train_op_norm],
                 feed_dict={xs: x_data[i * 10: i * 10 + 10],
                            ys: y_data[i * 10: i * 10 + 10]})

        if i % record_step == 0:
            cost_his.append(sess.run(cost, feed_dict={xs: x_data, ys: y_data}))
            cost_his_norm.append(sess.run(cost_norm, feed_dict={xs: x_data, ys: y_data}))

    plt.ioff()
    plt.figure()
    plt.plot(np.arange(len(cost_his)) * record_step,
             np.array(cost_his), label='Without BN')      # no norm
    plt.plot(np.arange(len(cost_his)) * record_step,
             np.array(cost_his_norm), label='With BN')    # norm
    plt.legend()
    plt.show()

2. Experimental Results

Input data distribution:

Comparison of results with and without Batch Normalization:

That is all for this article. I hope it is helpful for your study.
