A deep understanding of Batch Normalization (BatchNorm)


Batch Normalization has been widely shown to be effective and important; it is one of the notable results of deep learning in recent years. Although the paper explains some of the processing details and offers theoretical reasons, the method is above all proven by practice. Don't forget that deep learning, ever since Hinton used pre-training to make deep networks trainable, has been a discipline in which experience leads and theoretical analysis follows. This article is a guide to the paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift".

There is an important assumption in machine learning, the IID (independent and identically distributed) assumption: the training data and the test data are drawn from the same distribution. This is a basic guarantee that a model trained on the training data will perform well on the test set. So what does BatchNorm do? During the training of a deep neural network, BatchNorm keeps the input of each layer of the network at the same distribution.

The next step is to understand, step by step, what BN is.

Why do deep neural networks become harder to train and slower to converge as the network gets deeper? This is a good question that comes close to the essence of the deep learning field. Many papers address this problem; the ReLU activation function and residual networks are examples. BN essentially explains and solves the problem from yet another point of view.

I. The "Internal Covariate Shift" problem

From the title of the paper we can see that BN is meant to solve the "Internal Covariate Shift" problem, so let us first understand what "Internal Covariate Shift" is.

The paper first points out two advantages of mini-batch SGD over one-example SGD: the gradient update direction is more accurate, and parallel computation is fast. (Why say this? Because BatchNorm is built on top of mini-batch SGD, the authors first praise mini-batch SGD, which is also the truth.) Then it complains about a drawback of SGD training: hyperparameter tuning is cumbersome. (The authors imply that BN can remedy many of SGD's drawbacks.)

Then the paper introduces the concept of covariate shift: if the distribution of the input value X in the instance set <X,Y> of an ML system keeps changing, this violates the IID assumption and the network model has a hard time learning a stable mapping; this is the kind of problem transfer learning is meant to handle, since our ML system would have to learn how to keep up with the changing distribution. For deep learning, whose network structure contains many hidden layers, the parameters of every layer keep changing during training, so every hidden layer faces a covariate shift problem. That is, during training, the input distribution of each hidden layer keeps changing; this is what is called "Internal Covariate Shift". "Internal" refers to the hidden layers of the deep network: the shift happens inside the network, as opposed to the classical covariate shift problem, which occurs only at the input layer.

Then the basic idea of BatchNorm is put forward: can we fix the distribution of the activation input of each hidden node? That would avoid the "Internal Covariate Shift" problem.

BN is not an idea that someone simply shot out of thin air; it has a source of inspiration. Previous work showed that if the input images are whitened in preprocessing (whitening transforms the input data to a distribution with zero mean and unit variance), then the neural network converges faster. The BN authors started to reason from there: the image is the input layer of a deep network, and whitening it accelerates convergence; but in a deep network each hidden layer's neurons are the input of the next layer, which means every hidden layer of the deep network is itself an input layer, just relative to the layer that follows it. So can we whiten every hidden layer? This is the original inspiration behind BN, and it is essentially what BN does: it can be understood as a simplified version of the whitening operation, applied to the activation values of every hidden neuron in the deep network.

II. The essence of the BatchNorm idea

The basic idea of BN is quite intuitive. As the network gets deeper or as training proceeds, the distribution of the activation input values before the nonlinear transformation (that is, x = WU + B, where U is the layer's input) gradually shifts or changes, and this is why training converges slowly. In general, the overall distribution gradually drifts toward the two ends of the nonlinearity's input range (for the sigmoid function this means that the activation input WU + B takes large negative or large positive values), which makes the gradients of the lower layers vanish during backpropagation. This is the essential reason why the training of deep networks converges more and more slowly. BN, by a standardization step, forcibly pulls the distribution of the input of every neuron in every layer back to a standard normal distribution with mean 0 and variance 1. In effect, the increasingly skewed distribution is pulled back to a standard one, so that the activation input values fall in the region where the nonlinear function is sensitive to its input. Then small changes of the input lead to large changes of the loss function, i.e. the gradients become larger, the vanishing-gradient problem is avoided, and larger gradients mean faster learning and convergence, which greatly accelerates training.

That's it. In one sentence: for each hidden neuron, the input distribution that is gradually drifting toward the saturation region of the nonlinear mapping is forcibly pulled back to a standard normal distribution with mean 0 and variance 1, so that the input of the nonlinear transformation falls in the region that is sensitive to its input, and the vanishing-gradient problem is avoided. Because the gradients stay relatively large, the network's parameters are adjusted efficiently, i.e. each step toward the optimum of the loss function is large, i.e. convergence is fast. In the end BN is just such a mechanism: the method is very simple, the reasoning behind it is quite deep.
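As a minimal sketch of this idea (not the paper's full algorithm; the mini-batch size and the drifted mean below are made-up numbers), the following NumPy snippet standardizes the pre-activations of a single hidden unit over one mini-batch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-activations x = WU + B of one hidden unit over a
# mini-batch of 64 examples; the distribution has drifted to mean -6.
x = rng.normal(loc=-6.0, scale=1.0, size=64)

# Pull the distribution back to (approximately) mean 0, variance 1.
eps = 1e-5                        # small constant for numerical safety, as in the BN paper
x_hat = (x - x.mean()) / np.sqrt(x.var() + eps)

print(x.mean(), x.var())          # roughly -6 and 1
print(x_hat.mean(), x_hat.var())  # roughly  0 and 1
```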

The above is still abstract; the following shows more vividly what this adjustment means.

Figure 1. Several normal distributions

Assume that the original activation input x of some hidden neuron follows a normal distribution with mean -2 and variance 0.5, corresponding to the light blue curve at the far left of the figure. BN converts it into a normal distribution with mean 0 and variance 1 (the dark blue curve in the figure). What does that mean? It means that the distribution of x is shifted to the right by 2 as a whole (the change of the mean) and the curve becomes flatter (the change of the variance). The point of this figure is that BN takes the activation input distribution of each hidden neuron, which has drifted away from mean 0 and variance 1, shifts its mean back to 0, and compresses or stretches the sharpness of the curve so that the variance becomes 1.

So what is the use of adjusting the activation input x to this standard normal distribution? First, let's look at what a standard normal distribution with mean 0 and variance 1 means:

Figure 2. Standard normal distribution with mean 0 and variance 1

It means that within one standard deviation, i.e. with probability about 68%, x falls in the range [-1, 1]; within two standard deviations, i.e. with probability about 95%, x falls in [-2, 2]. So what does that imply? We know that the activation value x = WU + B, where U is the actual input and x is the pre-activation of some neuron. Assume that the nonlinear function is the sigmoid; then look at the graph of sigmoid(x):
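These probabilities are easy to verify numerically; a quick sketch using only the Python standard library's error function:

```python
import math

# P(|X| < k) for a standard normal variable X, via the error function.
def within_k_sigma(k: float) -> float:
    return math.erf(k / math.sqrt(2.0))

print(within_k_sigma(1))  # ~ 0.6827  (about 68%)
print(within_k_sigma(2))  # ~ 0.9545  (about 95%)
```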

Figure 3. sigmoid(x)

The derivative of sigmoid(x) is g'(x) = f(x) * (1 - f(x)). Since f(x) = sigmoid(x) lies between 0 and 1, g' lies between 0 and 0.25, and its graph is as follows:

Figure 4. Derivative of sigmoid(x)

Suppose that before the BN adjustment x has a normal distribution with mean -6 and variance 1. Then 95% of its values fall in [-8, -4], and the corresponding values of sigmoid(x) are clearly close to 0. This is a typical gradient saturation region, where the gradient changes very slowly. Why is it a saturation region? Look at sigmoid(x): when its value is close to 0 or close to 1, the corresponding derivative is close to 0, which means the gradient is tiny or vanishes altogether. Suppose that after BN the mean is 0 and the variance is 1, so that 95% of the x values fall in [-2, 2]. This interval is clearly the region where sigmoid(x) is close to a linear transformation, which means that a small change of x leads to a large change in the value of the nonlinear function; in the corresponding derivative graph, this is the region where the derivative is clearly greater than 0, i.e. the gradient-unsaturated region.
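The difference between the two regimes can be made concrete with a small numerical sketch (the sample size and random seed are arbitrary; the printed values are approximate):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # g' = f(x) * (1 - f(x)), at most 0.25

rng = np.random.default_rng(0)

x_saturated = rng.normal(-6.0, 1.0, size=10000)  # before BN: mean -6, variance 1
x_centered  = rng.normal( 0.0, 1.0, size=10000)  # after BN:  mean  0, variance 1

print(sigmoid_grad(x_saturated).mean())  # ~ 0.004: saturation region, tiny gradients
print(sigmoid_grad(x_centered).mean())   # ~ 0.21:  near-linear region, healthy gradients
```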

From the figures above it should be clear what BN is doing: it takes the activation inputs x = WU + B of the hidden neurons, whose distributions have drifted, and pulls them back through the BN operation to a normal distribution with mean 0 and variance 1, i.e. the center of the original distribution is shifted to 0 and its shape is stretched or compressed so that the variance becomes 1. What is the point? The point is that after BN most activation values fall into the near-linear region of the nonlinear function, so the corresponding derivative stays far away from the saturation region, and the convergence of training is accelerated.

But clearly, any reader with even a little understanding of neural networks will ask a question: if everything passes through BN, isn't the effect the same as replacing the nonlinear function with a linear one? What would that mean? We know that a stack of linear transformations makes depth meaningless, because a multi-layer linear network is equivalent to a single-layer linear network; the expressive power of the network drops, and the point of depth is lost. So, in order to preserve the nonlinear gain, BN applies a scale-and-shift operation (y = scale * x + shift) to the x that has been normalized to mean 0 and variance 1. Each neuron adds two parameters, scale and shift, which are learned during training. Their job is to move the standardized value to the left or right and to make the curve fatter or thinner, with each neuron shifted by a different amount, so that the value fed into the nonlinear function moves from the linear region around the center somewhat toward the nonlinear region. The core idea is presumably to find a good equilibrium between linearity and nonlinearity: enjoy the advantage of the strong expressive power of the nonlinearity while avoiding the slow convergence caused by pushing values toward its two saturated ends. Of course, this is my own understanding; the authors of the paper do not say so explicitly. But the scale and shift operations here are clearly debatable, because in the extreme case the learned scale and shift could adjust the transformed x right back to its un-normalized state. Doesn't that just go around in a circle, back to the original "Internal Covariate Shift" situation? The authors do not manage to explain the theoretical reasons for the scale and shift operations clearly. (A sketch of this step, including the "back to square one" extreme case, follows below.)
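A sketch of the scale-and-shift step (gamma and beta correspond to the paper's γ and β; the input distribution and parameter values here are made up for the example). The last two lines show the extreme case in which the learned parameters exactly undo the normalization:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(-2.0, 0.7, size=64)              # drifted pre-activations (made-up numbers)

eps = 1e-5
x_hat = (x - x.mean()) / np.sqrt(x.var() + eps) # normalized: mean ~0, variance ~1

# Learnable per-neuron parameters (example values only).
gamma, beta = 1.5, 0.3
y = gamma * x_hat + beta                        # y = scale * x_hat + shift
print(y.mean(), y.var())                        # ~ 0.3 and ~ 2.25

# Extreme case: gamma = std(x), beta = mean(x) recovers the original x.
y_identity = np.sqrt(x.var() + eps) * x_hat + x.mean()
print(np.allclose(y_identity, x, atol=1e-4))    # True
```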

III. How to do BatchNorm in the training phase

The above is an abstract analysis and explanation of BN; now let's look at how BN is done concretely under mini-batch SGD. Actually this part of the paper is very clear and easy to understand. For the completeness of this article, here is a brief description.

Assume that two layers of a deep neural network have the following structure:

Figure 5. Two layers of a DNN

To apply BN to the activation value of each hidden neuron, you can imagine that every hidden layer gains an additional BN layer, placed after the pre-activation x = WU + B is computed and before the nonlinear transformation, as illustrated below:

Figure 6. The BN operation

For mini-batch SGD, one training step uses m training examples. The concrete BN operation applies the following transformation to the activation value of each neuron in the hidden layer:

x̂(k) = (x(k) − E[x(k)]) / sqrt(Var[x(k)])

Note that the x(k) of a neuron in layer t is not the raw input, i.e. not the output of the neurons in layer t−1, but the linear pre-activation WU + B of that neuron in layer t, where U is the output of the neurons in layer t−1. The transformation means that the original pre-activation x of a neuron is standardized by subtracting the mean E(x) of the m activations obtained from the m instances in the mini-batch, and dividing by the square root of their variance Var(x).

As mentioned above, after this transformation the activation x of a neuron forms a normal distribution with mean 0 and variance 1. The aim is to pull the values into the near-linear region of the subsequent nonlinear transformation, increase the derivative values, improve the flow of information in backpropagation, and speed up convergence. But this alone would reduce the expressive power of the network; to prevent that, each neuron adds two adjustment parameters (scale and shift), learned during training, which apply a partial inverse of the standardization so that the network's expressive power is restored. That is, the following scale-and-shift operation is applied to the transformed activation, which is in effect the inverse of the standardization:

y(k) = γ(k) · x̂(k) + β(k)

The specific operating procedure of BN, as described in the paper, is the following:
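The paper's algorithm figure is not reproduced here; as a substitute, here is a sketch in NumPy of the training-time transform for a fully connected layer (a sketch of the procedure, not the paper's exact pseudocode; variable names are mine, and gamma/beta correspond to the paper's γ and β):

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    """Training-time BN for one mini-batch.

    x:     (m, d) pre-activations x = WU + B for m examples and d hidden units
    gamma: (d,)   learned scale, one per hidden unit
    beta:  (d,)   learned shift, one per hidden unit
    """
    mu = x.mean(axis=0)                    # per-unit mini-batch mean
    var = x.var(axis=0)                    # per-unit mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to mean 0, variance 1
    y = gamma * x_hat + beta               # scale and shift
    return y, mu, var                      # mu/var are also kept for the global statistics

# Example with made-up sizes: a mini-batch of 32 examples, 8 hidden units.
rng = np.random.default_rng(0)
x = rng.normal(3.0, 2.0, size=(32, 8))
gamma, beta = np.ones(8), np.zeros(8)
y, mu, var = batchnorm_train(x, gamma, beta)
print(y.mean(axis=0).round(3), y.var(axis=0).round(3))  # ~0 and ~1 per unit
```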

The process is very clear: it is just the flow of the formulas described above, so it is not explained further here; it should be readable directly.

IV. The BatchNorm inference process

BN can adjust the activation values according to the several training instances in a mini-batch, but during inference there is obviously only one input instance, with no other mini-batch instances around. A single instance clearly cannot be used to compute the mean and variance of a set of instances. So what do we do?

Since no statistics can be obtained from mini-batch data at inference time, we need another way to obtain these statistics, namely the mean and the variance. We can use statistics computed over all training instances to replace the mean and variance obtained from the m training instances of a mini-batch: we wanted global statistics all along, and only resorted to mini-batches because the global computation is too expensive, the mini-batch being a simplification. So at inference time we can simply use the global statistics directly.

Having decided the range of data from which the statistics are obtained, the next question is how to get the mean and the variance. Very simple: every time we train on a mini-batch, we already have the mean and variance of its m training examples. To get the global statistics, we just remember the mean and variance of each mini-batch and then take the mathematical expectation over these per-mini-batch statistics, namely:

E[x] ← E_B[μ_B],   Var[x] ← (m / (m − 1)) · E_B[σ_B²]
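A sketch of how the global statistics could be assembled from the remembered per-mini-batch statistics; the m/(m−1) factor is the unbiased-variance correction used in the paper, and the list contents below are invented placeholders:

```python
import numpy as np

def global_statistics(batch_means, batch_vars, m):
    """Aggregate per-mini-batch statistics into inference-time statistics.

    batch_means, batch_vars: lists of (d,) arrays, one pair per mini-batch
    m: the mini-batch size used during training
    """
    e_x = np.mean(batch_means, axis=0)                 # E[x]   = E_B[mu_B]
    var_x = m / (m - 1) * np.mean(batch_vars, axis=0)  # Var[x] = m/(m-1) * E_B[sigma_B^2]
    return e_x, var_x

# Placeholder usage: pretend statistics from 100 mini-batches of size 32, 8 hidden units.
rng = np.random.default_rng(0)
mus = [rng.normal(size=8) for _ in range(100)]
vars_ = [rng.uniform(0.5, 2.0, size=8) for _ in range(100)]
e_x, var_x = global_statistics(mus, vars_, m=32)
print(e_x.round(3), var_x.round(3))
```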

With the mean and variance in hand, and since each hidden neuron also has its trained scale and shift parameters, the BN transformation applied to the activation of each neuron at inference time can be written down. During inference, BN takes the following form:

y = (γ / sqrt(Var[x] + ε)) · x + (β − γ · E[x] / sqrt(Var[x] + ε))
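As a sketch (the statistics and parameters below are invented placeholders), the inference-time form can be folded into one fixed scale and one fixed bias per neuron, which is exactly the computational point discussed in the next paragraphs:

```python
import numpy as np

def fold_bn_params(gamma, beta, e_x, var_x, eps=1e-5):
    """Precompute the fixed per-neuron coefficients used at inference time."""
    scale = gamma / np.sqrt(var_x + eps)   # gamma / sqrt(Var[x] + eps)
    bias = beta - scale * e_x              # beta - gamma * E[x] / sqrt(Var[x] + eps)
    return scale, bias

rng = np.random.default_rng(0)
gamma, beta = rng.normal(size=8), rng.normal(size=8)
e_x, var_x = rng.normal(size=8), rng.uniform(0.5, 2.0, size=8)
x = rng.normal(size=(4, 8))                # a few inference inputs

scale, bias = fold_bn_params(gamma, beta, e_x, var_x)
y_folded = scale * x + bias                # y = scale * x + bias, using precomputed values
y_train_form = gamma * (x - e_x) / np.sqrt(var_x + 1e-5) + beta
print(np.allclose(y_folded, y_train_form)) # True: the two forms are equivalent
```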

This formula is in fact equivalent to the one used during training,

y = γ · (x − E[x]) / sqrt(Var[x] + ε) + β,

which a simple calculation and rearrangement confirms. So why write it in this transformed form? I guess the author's point is that in actual operation this variant reduces the amount of computation. Why? Because for each hidden-layer node, the two quantities γ / sqrt(Var[x] + ε) and β − γ · E[x] / sqrt(Var[x] + ε)

are fixed values, so they can be computed and stored in advance and then used directly during inference. Compared with the original formula, each evaluation saves the division operation. At first glance that does not look like much of a saving, but if the number of hidden nodes is large, the accumulated savings become significant.

V. Benefits of BatchNorm

Why is BatchNorm so celebrated? The key is simply that it works well. ① It not only greatly improves training speed, it also makes convergence much faster. ② It can also improve classification performance; one explanation is that it acts as a form of regularization similar to dropout, preventing overfitting, so comparable results can be achieved even without dropout. ③ Tuning becomes much simpler: the requirements on initialization are not as strict, a large learning rate can be used, and so on. All in all, such a simple transformation brings so many benefits, which is why BN became popular so quickly.

Source: https://www.cnblogs.com/guoyaohua/p/8724433.html

