[Repost] In-Depth Understanding of Batch Normalization


Reposted from: https://www.cnblogs.com/guoyaohua/p/8724433.html

  

"Deep learning" in-depth understanding of batch normalization batch standardization

In recent interviews I have often been asked about the principle of the BN layer. Although I could produce an answer, I never felt it was very good, so today I studied the principle of Batch Normalization carefully. What follows is a summary based on several articles found online.

As one of the major achievements of DL in the last few years, Batch Normalization has been widely proven to be effective and important. Although the theoretical reasons behind some of its details are still not clearly explained, practice shows that it simply works well; do not forget that DL, ever since Hinton used pre-training for deep networks, has been a field where experience leads and theoretical analysis follows. This article is a reading guide to the paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift".

There is an important assumption in machine learning, the IID (independent and identically distributed) hypothesis: the training data and the test data are assumed to follow the same distribution. This is a basic guarantee that a model trained on the training data will perform well on the test set. So what is the role of BatchNorm? BatchNorm keeps the input distribution of each layer of a deep neural network the same during training.

Next, let us understand step by step what BN is.

Why do deep neural networks become harder to train and converge more and more slowly as the network gets deeper? This is a good question, one that goes to the essence of the DL field. Many papers address this problem, for example the ReLU activation function and residual networks; BN is essentially another explanation of the problem and another angle from which to solve it.

First, "Internal covariate Shift" problem

As can be seen from the paper's title, BN is meant to solve the "Internal Covariate Shift" problem. So first, what is "Internal Covariate Shift"?

The paper first explains two advantages of mini-batch SGD over single-example SGD: the gradient update direction is more accurate, and parallel computation is fast. (Why mention this? Because BatchNorm is built on top of mini-batch SGD, the authors first praise mini-batch SGD, which is of course true.) It then complains about the downside of SGD training: the hyper-parameters are cumbersome to tune. (The authors imply that using BN can fix many of SGD's shortcomings.)

The concept of covariate shift is then introduced: if the distribution of the input values X in an ML system's instance collection <X,Y> keeps changing, it no longer conforms to the IID hypothesis and it becomes hard for the network model to learn a stable mapping; wouldn't we have to bring in transfer learning to handle it? Our ML system would have to learn how to keep adapting to this changing distribution. For deep learning, which involves network structures with many hidden layers, the parameters of every layer keep changing during training, so every hidden layer faces a covariate shift problem. In other words, during training the input distribution of each hidden layer keeps changing; this is called "Internal Covariate Shift", where "internal" refers to the hidden layers of the deep network: it is something that happens inside the network, not a covariate shift problem that occurs only at the input layer.

This leads to the basic idea of BatchNorm: can we fix the distribution of the activation inputs of each hidden layer node? That would avoid the "Internal Covariate Shift" problem.

BN is not a clever idea pulled out of thin air; it has a source of inspiration. Previous studies have shown that if the input images are whitened in image processing (the so-called whitening transforms the input data into a distribution with zero mean and unit variance), the neural network converges faster. The BN authors then reasoned: the image is the input layer of a deep neural network, and whitening it accelerates convergence; in fact, for a deep network, each hidden layer's output is the input to the next layer, which means every hidden layer of a deep network is itself an "input layer" relative to the layer after it. So can we whiten every hidden layer as well? This is the original idea that inspired BN, and it is exactly what BN does: it can be understood as a simplified whitening of the activation values of every hidden layer neuron in a deep neural network.

II. The Essential Idea of BatchNorm

BN's basic idea is actually quite straightforward: as the network gets deeper, or as training proceeds, the distribution of the activation input value before the non-linear transformation (that is, x = Wu + b, where u is the input) gradually shifts or changes. This is the reason training converges slowly: typically the overall distribution drifts toward the upper and lower ends of the value range of the non-linear function (for the sigmoid, this means the activation input Wu + b becomes a large negative or positive value), which causes the gradients of the lower layers to vanish during back-propagation. This is the essential reason why training a deep neural network converges more and more slowly. BN uses a normalization step to force the distribution of each neuron's input value back to a standard normal distribution with mean 0 and variance 1; in effect, it pulls an increasingly skewed distribution back toward a standard distribution, so that the activation inputs fall into the region where the non-linear function is sensitive to its input. A small change of the input then leads to a large change of the loss function, meaning the gradients become larger, the vanishing-gradient problem is avoided, and larger gradients mean faster learning and convergence, which greatly accelerates training.

That's it. In one sentence: for each hidden layer neuron, the input distribution, which gradually drifts toward the saturated ends of the non-linear mapping, is forcibly pulled back to a standard normal distribution with mean 0 and variance 1, so that the inputs to the non-linear transformation fall into the region that is sensitive to the input and the vanishing-gradient problem is avoided. Because the gradient stays relatively large, the neural network's parameters are adjusted more efficiently: larger changes, larger steps toward the optimum of the loss function, faster convergence. BN is, in the end, just such a mechanism; the method is very simple, but the reasoning behind it is profound.

The above is still abstract; the following expresses more vividly what this adjustment actually means.

Figure 1. Several normal distributions

Suppose the original activation input x of some hidden layer neuron follows a normal distribution with mean -2 and variance 0.5, corresponding to the light blue curve on the left in the figure. After the BN transform it becomes a normal distribution with mean 0 and variance 1 (corresponding to the dark blue curve); that means the whole normal distribution of the input x is shifted right by 2 (the change of mean) and the curve becomes flatter (the variance increases). What this figure shows is that BN, by translating the mean and compressing or expanding the sharpness of the curve, adjusts the activation input distribution of each hidden layer neuron from whatever skewed distribution it has drifted to back into a normal distribution with mean 0 and variance 1.
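To make this concrete, here is a minimal NumPy sketch of the standardization just described (the sample size and random seed are arbitrary, chosen only for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    # pre-activations drawn from the distribution described above: mean -2, variance 0.5
    x = rng.normal(loc=-2.0, scale=np.sqrt(0.5), size=100_000)

    x_hat = (x - x.mean()) / x.std()   # the BN-style standardization

    print(x.mean(), x.var())           # roughly -2.0 and 0.5
    print(x_hat.mean(), x_hat.var())   # roughly  0.0 and 1.0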

So what is the use of adjusting the activation input x to this normal distribution? First let us look at what a standard normal distribution with mean 0 and variance 1 means:

Figure 2. The standard normal distribution with mean 0 and variance 1

This means that within one standard deviation, x falls in the range [-1, 1] with about 68% probability, and within two standard deviations, x falls in the range [-2, 2] with about 95% probability. So what does that imply? We know that the activation value is x = Wu + b, where u is the real input and x is the pre-activation of a neuron. Assuming the non-linear function is the sigmoid, look at the graph of sigmoid(x):

Figure 3. sigmoid(x)

The derivative of sigmoid(x) is g' = f(x) * (1 - f(x)). Since f(x) = sigmoid(x) lies between 0 and 1, the product f(x)(1 - f(x)) is at most 1/4 (reached when f(x) = 1/2), so g' lies between 0 and 0.25. The corresponding graph is as follows:

Figure 4. The derivative of sigmoid(x)

Suppose that before BN the original distribution of x is normal with mean -6 and variance 1. That means 95% of the values fall in [-8, -4], where the corresponding sigmoid(x) values are clearly close to 0. This is a typical gradient-saturation region, where the gradient changes very slowly. Why is it a gradient-saturation region? Look at sigmoid(x): when its value is close to 0 or close to 1, the corresponding derivative is close to 0, meaning the gradient change is tiny or even vanishes. Now suppose that after BN the mean is 0 and the variance is 1; then 95% of the x values fall in the interval [-2, 2], which is clearly the region where sigmoid(x) is close to a linear transformation. A small change in x then leads to a large change in the value of the non-linear function, i.e. the gradient is larger; in the derivative graph this corresponds to the region where the derivative is clearly greater than 0, the gradient-unsaturated zone.
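This saturation argument is easy to check numerically. Below is a small sketch (the sigmoid and sigmoid_grad helpers are written here purely for illustration) comparing gradients for pre-activations around the hypothetical mean of -6 with gradients near 0, where BN places them:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        s = sigmoid(x)
        return s * (1.0 - s)   # f(x) * (1 - f(x)); at most 0.25, reached at x = 0

    # saturated region (around -6) vs. the region BN pulls values into (around 0)
    for v in (-8.0, -6.0, -4.0, -2.0, 0.0, 2.0):
        print(f"x = {v:5.1f}   sigmoid(x) = {sigmoid(v):.4f}   gradient = {sigmoid_grad(v):.4f}")

At x = -6 the gradient is about 0.0025, while at x = 0 it is 0.25, roughly a hundred times larger.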

From the figures above it should be clear what BN is doing: it takes the hidden neuron's pre-activation x = Wu + b, whose distribution has drifted, and through the BN operation pulls it back to a normal distribution with mean 0 and variance 1; that is, the center of the original normal distribution is shifted left or right to a mean of 0, and the shape is stretched or compressed to give a variance of 1. What does this mean? It means that after BN most of the pre-activation values fall into the (near-)linear region of the non-linear function, where the corresponding derivative is far from the saturation region, and this accelerates the training convergence process.

But clearly, readers who understand neural networks a little will ask the obvious question: if every layer goes through BN, isn't that the same as replacing the non-linear function with a linear one? What would that mean? We know that a stack of linear transformations makes depth meaningless, because a multi-layer linear network is equivalent to a single-layer linear network; the network's expressive power drops, and the point of depth is lost. So, to preserve non-linearity, BN applies a scale-and-shift operation (y = scale * x + shift) to the transformed x that has mean 0 and variance 1. Each neuron adds two parameters, scale and shift, which are learned during training; they move the standardized value a bit to the left or right and stretch or shrink it a bit, with each neuron adjusted by a different amount. This is equivalent to moving the value fed to the non-linear function from the linear region around the center somewhat toward the non-linear region. The core idea is to find a good balance between linearity and non-linearity: to enjoy the benefit of strong non-linear expressiveness while avoiding the slow convergence caused by pushing values toward the two saturated ends of the non-linearity. Of course, this is my own understanding; the authors of the paper do not say it explicitly. Clearly, though, the scale and shift operations here are debatable: in the ideal extreme described in the paper, the transformed x could be adjusted by scale and shift right back to its untransformed state, which would just go around in a circle back to the original "Internal Covariate Shift" problem. I feel the authors have not clearly explained the theoretical justification for the scale and shift operations.

III. How BatchNorm Is Done in the Training Phase

The above is an abstract analysis and interpretation of BN. Specifically, how is BN done in mini-batch SGD? This part of the paper is actually written clearly and is easy to understand. To keep this article self-contained, here is a brief explanation.

Suppose that two layers of a deep neural network are structured as follows:

Figure 5. Two layers of a DNN

To apply BN to the activation value of each hidden layer neuron, you can imagine that each hidden layer gains an extra BN operation layer, located after the activation value x = Wu + b is computed and before the non-linear transformation. Its diagram is as follows:

Figure 6. The BN operation

For mini-batch SGD, one training step contains m training instances. The specific BN operation applies the following transformation to the pre-activation value of each neuron in a hidden layer:
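The transformation referred to here is, in the paper's notation:

    \hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}

where the expectation and variance are computed over the m instances of the mini-batch (the paper also adds a small constant \epsilon under the square root for numerical stability).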

Note that the x(k) of a neuron in layer t does not refer to the original input, i.e. it is not the output of the neurons in layer t-1, but the linear pre-activation x = Wu + b of that neuron in layer t, where u is the output of the layer t-1 neurons. The transformation means that the original pre-activation x of a neuron is converted by subtracting the mean E[x], computed from the m pre-activations obtained from the m instances in the mini-batch, and dividing by the square root of the mini-batch variance Var[x].

As mentioned above, after this transformation the pre-activation x of a neuron follows a normal distribution with mean 0 and variance 1. The aim is to pull the values toward the (near-)linear region of the subsequent non-linear transformation, increase the derivative values, strengthen the flow of information in back-propagation, and accelerate training convergence. However, this reduces the expressive power of the network. To prevent that, each neuron adds two adjustable parameters, scale and shift, which are learned during training and can (partially) undo the normalization, so that the network's expressive power is restored. The following scale-and-shift operation is applied to the transformed activation, which is in effect the inverse of the normalization:
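The scale-and-shift step, again in the paper's notation (\gamma is the scale and \beta the shift):

    y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}

In particular, if \gamma^{(k)} = \sqrt{\mathrm{Var}[x^{(k)}]} and \beta^{(k)} = \mathrm{E}[x^{(k)}], the original pre-activation is recovered exactly, which is the sense in which this step can invert the normalization.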

The specific operating procedure of BN is given in the paper (as Algorithm 1, the Batch Normalizing Transform). The process is very clear: it is simply the flow described by the formulas above, compute the mini-batch mean and variance, normalize each pre-activation, then apply the scale and shift, so it should be readable directly without further explanation.
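As a concrete illustration, here is a minimal NumPy sketch of the training-time transform (the names batch_norm_train, gamma, beta and eps are illustrative, not from the paper; eps is the small constant added for numerical stability):

    import numpy as np

    def batch_norm_train(x, gamma, beta, eps=1e-5):
        # x: mini-batch of pre-activations, shape (m, num_hidden)
        mu = x.mean(axis=0)                    # per-neuron mini-batch mean
        var = x.var(axis=0)                    # per-neuron mini-batch variance
        x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to roughly mean 0, variance 1
        y = gamma * x_hat + beta               # learned scale and shift
        return y, mu, var                      # mu and var also feed the global statistics

    # example: 32 instances, 4 hidden neurons, pre-activations drifted to roughly N(-2, 0.25)
    x = np.random.randn(32, 4) * 0.5 - 2.0
    gamma, beta = np.ones(4), np.zeros(4)
    y, mu, var = batch_norm_train(x, gamma, beta)
    print(y.mean(axis=0), y.var(axis=0))       # close to 0 and 1 for each neuron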

IV. The BatchNorm Inference Process

During training, BN can adjust the activation values using the multiple training instances in a mini-batch, but during inference there is obviously only one input instance and no other mini-batch instances to look at. So how do we apply BN to the input? A single instance clearly cannot be used to compute the mean and variance of an instance set. What can be done?

Since no statistics can be computed from mini-batch data at inference time, the statistics, namely the mean and variance, have to be obtained another way: use the statistics computed over all training instances in place of the mean and variance computed from the m training instances of a mini-batch. The intention was always to use global statistics; the mini-batch version is only a simplification adopted because computing the global statistics at every step would be too expensive. At inference time, the global statistics are used directly.

Having decided which data the statistics are computed over, the next question is how to obtain the mean and variance. It is very simple: every mini-batch training step already produces the mean and variance of its m training instances. To get global statistics, just remember the mean and variance of each mini-batch, and then take the mathematical expectation of these means and variances, namely:
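In the paper these global statistics are written as expectations over the mini-batches, with an m/(m-1) factor to turn the biased mini-batch variance into an unbiased estimate:

    \mathrm{E}[x] \leftarrow \mathrm{E}_{\mathcal{B}}[\mu_{\mathcal{B}}], \qquad \mathrm{Var}[x] \leftarrow \frac{m}{m-1}\,\mathrm{E}_{\mathcal{B}}[\sigma^{2}_{\mathcal{B}}]

In practice, many implementations track a running (moving) average of the mini-batch statistics during training instead of averaging them all at the end, which serves the same purpose.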

With the global mean and variance, and with the scale and shift parameters already trained for each hidden layer neuron, the BN transform of each neuron's activation can be carried out at inference time in the following way:
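The inference-time transform given in the paper is:

    y = \frac{\gamma}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot x + \left( \beta - \frac{\gamma\,\mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \right)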

This formula is equivalent to the transform y = \gamma \hat{x} + \beta used during training; the equivalence can be verified with a simple substitution. So why write it in this variant form? I guess the authors' intention is that, in actual operation, this form reduces the amount of computation. Why? Because for each hidden layer node the two quantities

    \frac{\gamma}{\sqrt{\mathrm{Var}[x] + \epsilon}} \qquad \text{and} \qquad \beta - \frac{\gamma\,\mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}}

are fixed values, so they can be computed and saved in advance and used directly during inference. Each step then skips the division of the original formula; at first glance this does not look like much of a saving, but when the number of hidden layer nodes is large, the saved computation adds up.
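A minimal sketch of that precomputation (again with illustrative names; global_mean and global_var stand for the global statistics discussed above):

    import numpy as np

    def batch_norm_inference(x, gamma, beta, global_mean, global_var, eps=1e-5):
        # both coefficients are fixed per hidden neuron, so they can be computed
        # once ahead of time and reused for every inference input
        scale = gamma / np.sqrt(global_var + eps)
        bias = beta - scale * global_mean
        return scale * x + bias   # a single multiply-add per activation, no per-step division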

V. The Benefits of BatchNorm

Why is BatchNorm so powerful? The key is simply that it works well. ① It greatly improves training speed and accelerates convergence. ② It can also improve classification performance; one explanation is that it acts as a regularizer similar to Dropout and helps prevent overfitting, so a comparable effect can be achieved without Dropout. ③ Tuning becomes much simpler: the initialization requirements are not as strict, a large learning rate can be used, and so on. All in all, such a simple transformation brings many benefits, which is why BN is so popular now.

