Batch Normalization Guide


Transferred from: http://blog.csdn.net/malefactor/article/details/51476961

Author: Zhang Junlin


As one of the most important DL results of the past year, Batch Normalization has been widely shown to be effective and important. It has almost become standard equipment in DL, and any students, friends, ladies and gentlemen interested in learning DL should study BN properly. BN read backwards is NB, because this technique really is NB (awesome). Although some of its details still lack a clear theoretical explanation, practice has proven that it works, and what works is what matters. Don't forget that DL, ever since Hinton pre-trained deep networks, has been a discipline in which experience runs ahead of theoretical analysis.


How should one understand BatchNorm? Please refer to the paper: "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". Since friends whose fundamentals are not so strong may have some trouble reading the paper, this article is meant to be a guide that makes BN easier to understand. My own level is also quite limited, so if the guide turns out to be wrong... well, that's your bad luck. After all, it is a free guided tour, so don't set your expectations too high. "Anyone who holds high expectations of people or things is the unhappiest person in the world" — that is one of my non-famous quotes, and therefore "lowering expectations is the road to happiness" is my famous one.


There is an important assumption in machine learning, the IID (independent and identically distributed) assumption: the training data and the test data are assumed to follow the same distribution. This is the basic guarantee that a model trained on the training data will perform well on the test set. And what does BatchNorm do? BatchNorm keeps the input distribution of every layer of a deep neural network the same during training. OK, BN explained. Goodbye.


Well, that step was a little too big. Let's slow down, lower the learning rate, and approach the optimal understanding of BN step by step.


Why do deep neural networks become harder to train and slower to converge as they get deeper? This is a good question that comes very close to the essence of the DL field. Many papers address this problem, for example the ReLU activation function and residual networks; BN is essentially yet another explanation of the problem and another angle from which to solve it.


| The "Internal Covariate Shift" Problem


As can be seen from the title of the paper, BN is meant to solve the "Internal Covariate Shift" problem, so we first need to understand what "Internal Covariate Shift" is.


The paper first explains two advantages of mini-batch SGD over one-example SGD: the gradient update direction is more accurate, and parallel computation is fast. (This author: why say this? Because BatchNorm is built on top of mini-batch SGD, so the paper first praises mini-batch SGD, which is, of course, the truth.)


It then complains about the downside of SGD training: the hyperparameters are troublesome to tune. (This author: the implication is that with my great BN, many of SGD's shortcomings can be fixed — with BN, Mom no longer needs to worry about my ability to tune hyperparameters.)


It then introduces the concept of covariate shift: if the distribution of the input values X in the ML system's instance collection <X, Y> keeps changing, that violates the IID assumption — how is the model supposed to learn a stable rule? You would almost need to bring in transfer learning to handle it, because our ML system also has to learn how to cater to this changing distribution.


For deep learning, where the network contains many hidden layers, every layer's parameters change during training, so every hidden layer faces the covariate shift problem: during training, the input distribution of each hidden layer keeps changing. This is called "Internal Covariate Shift". "Internal" refers to the hidden layers of the deep network — it is something that happens inside the network, not a covariate shift problem that occurs only at the input layer.


This leads to the basic idea of BatchNorm: can we fix the distribution of the activation inputs of each hidden-layer node? That would avoid the "Internal Covariate Shift" problem.


BN is not an idea that the authors just pulled out of thin air; it has a source of inspiration. Earlier studies showed that if the input images in image processing are whitened (the so-called whitening operation: transforming the input data to a distribution with zero mean and unit variance), then the neural network converges faster. From this the BN authors reasoned: the image is the input layer of the deep network, and whitening it speeds up convergence; but for a deep network, the neurons of each hidden layer are the input of the next layer, which means that every hidden layer of a deep network is in effect the input layer of the layer above it. So can we apply whitening to every hidden layer as well? This is the original idea that inspired BN, and it is exactly what BN does: it can be understood as a simplified whitening of the activation values of every hidden-layer neuron in a deep neural network.
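To make this simplified whitening concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper) that standardizes each input feature of a dataset to zero mean and unit variance; the array shapes, names, and the small eps term are assumptions made for the example:

```python
import numpy as np

def standardize_inputs(images, eps=1e-5):
    """Standardize each input feature to zero mean and unit variance
    across the dataset (the simplified 'whitening' described above)."""
    # images: array of shape (num_examples, num_features), e.g. flattened pixels
    mean = images.mean(axis=0)                    # per-feature mean over all examples
    var = images.var(axis=0)                      # per-feature variance over all examples
    return (images - mean) / np.sqrt(var + eps)   # eps avoids division by zero

# Example: 1000 fake "images" with 784 pixel features each
images = np.random.rand(1000, 784) * 255.0
whitened = standardize_inputs(images)
print(whitened.mean(), whitened.var())            # close to 0 and 1
```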


| The Essential Idea of BatchNorm


BN's basic idea is actually quite straightforward: in a deep neural network, the distribution of the activation inputs before the nonlinear transform (that is, x = WU + B, where U is the layer's input) gradually drifts as the network gets deeper or as training proceeds. Typically the whole distribution drifts toward the upper and lower ends of the value range of the nonlinear function (for the sigmoid, this means the activation input WU + B becomes a large negative or positive value), which makes the gradients of the lower layers vanish during backpropagation; this is the essential reason that deep networks converge more and more slowly as training goes on. BN uses a normalization step to forcibly pull the distribution of each neuron's activation input, in every layer, back to a standard normal distribution with mean 0 and variance 1, dragging a distribution that has drifted further and further off back to a well-behaved one. The activation inputs then fall in the region where the nonlinear function is sensitive to its input, so a small change in the input produces a large change in the loss function. In other words, the gradient becomes larger, the vanishing-gradient problem is avoided, and a larger gradient means faster learning and convergence, which greatly speeds up training.


That's it. In essence: for each hidden-layer neuron, BN forcibly pulls the input distribution, which is gradually drifting toward the saturated ends of the nonlinear mapping's value range, back to a standard normal distribution with mean 0 and variance 1, so that the input to the nonlinear transform falls in the region that is sensitive to the input, thereby avoiding the vanishing-gradient problem. Because the gradient stays relatively large, the network's parameters are adjusted more efficiently: the changes are bigger, each step moves further toward the optimum of the loss function, and convergence is faster. In the end that is all BN is — the method is very simple, but the reasoning behind it is profound.
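As a rough illustration of the mechanism just described, here is a minimal NumPy sketch of only the normalization step, applied to the pre-activations x = WU + B of one hidden layer over a mini-batch. The function name, shapes, and the small eps for numerical stability are my own choices; the paper's full method additionally learns a scale and shift, which are not shown here:

```python
import numpy as np

def batchnorm_forward(x, eps=1e-5):
    """Normalize each neuron's pre-activation to mean 0, variance 1
    over the mini-batch (only the normalization step described above)."""
    # x: mini-batch of pre-activations, shape (batch_size, num_neurons)
    mu = x.mean(axis=0)                    # per-neuron mean over the mini-batch
    var = x.var(axis=0)                    # per-neuron variance over the mini-batch
    return (x - mu) / np.sqrt(var + eps)   # eps keeps the division numerically stable

# Example: a mini-batch of 64 examples entering a layer with 128 neurons
u = np.random.randn(64, 100)               # layer input U
W = np.random.randn(100, 128)              # weights
b = np.random.randn(128)                   # bias
x = u @ W + b                              # pre-activations x = WU + B
x_hat = batchnorm_forward(x)
print(x_hat.mean(axis=0)[:3], x_hat.var(axis=0)[:3])  # each roughly 0 and 1
```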


The above may still seem abstract, so the following illustrates more vividly what this adjustment actually means.

Figure 1. Several normal distributions


Suppose the original activation input x of some hidden-layer neuron follows a normal distribution with mean -2 and variance 0.5, which corresponds to the leftmost light-blue curve in the figure above. BN converts it to a normal distribution with mean 0 and variance 1 (the dark-blue curve in the figure): the values of x are shifted right by 2 as a whole (the change of mean) and the curve becomes flatter (the change of variance). What this figure shows is that BN takes each hidden-layer neuron's activation-input distribution, which has drifted away from mean 0 and variance 1, and adjusts it back to a mean-0, variance-1 normal distribution by translating the mean and compressing or stretching the sharpness of the curve.
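As a quick numeric check of the scenario in the figure (my own toy example, not from the article): draw samples from a normal distribution with mean -2 and variance 0.5, apply the same normalization, and the sample mean and variance move to roughly 0 and 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# One neuron's activation inputs: mean -2, variance 0.5 (std = sqrt(0.5))
x = rng.normal(loc=-2.0, scale=np.sqrt(0.5), size=10_000)
print(x.mean(), x.var())                          # roughly -2 and 0.5

# Pull the distribution back to mean 0, variance 1, as BN does
x_hat = (x - x.mean()) / np.sqrt(x.var() + 1e-5)
print(x_hat.mean(), x_hat.var())                  # roughly 0 and 1
```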


So what is the use of adjusting the activation input x to this normal distribution?
