1. Motivation
The authors' motivation is this: as the parameters change during training, the distribution of each layer's inputs changes too, so every layer must continually adapt to the shifting output of the layer before it. This forces us to use small learning rates and to initialize parameters carefully. The authors call this change of distribution internal covariate shift.
A deep network contains many hidden layers. Because every layer's parameters keep changing during training, every hidden layer faces this covariate-shift problem: its input distribution changes throughout training. This is what "Internal Covariate Shift" means. "Internal" emphasizes that the shift happens inside the network, at the hidden layers, and not only at the input layer where the classical covariate-shift problem occurs.
As you may know, we usually subtract the mean from the inputs when training a network, and some people even whiten the inputs to speed up training. Why do mean subtraction and whitening accelerate training? Here is a simple explanation. First, image data is highly correlated; suppose its distribution looks like Figure A (simplified to 2-D). Since parameters are usually initialized around zero mean, the initial fit y = wx + b basically passes through the origin, like the red dashed line in Figure B. The network therefore needs many updates to gradually reach a fit like the purple solid line, i.e. convergence is slow. If we first subtract the mean from the input data, as in Figure C, learning clearly speeds up. Going further, we can decorrelate the data so that it becomes easier to separate, speeding up training even more, as in Figure D.
There are several ways to whiten data; PCA whitening is common: apply PCA to the data, then normalize the variance along each principal axis. The result approximately satisfies zero mean, unit variance, and weak correlation. The authors first considered whitening the inputs of every layer, but their analysis shows it is impractical: whitening requires computing the covariance matrix, inverting it, and so on, which is computationally very expensive; moreover, the whitening operation is not necessarily differentiable, which is a problem for back-propagation.
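As a concrete illustration of PCA whitening (a minimal sketch with toy 2-D data, not the paper's procedure; the variable names and the correlated toy distribution are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 2-D data with strong correlation, mimicking correlated image features.
X = rng.multivariate_normal([3.0, -1.0], [[2.0, 1.8], [1.8, 2.0]], size=1000)

# PCA whitening: center, rotate onto the principal axes, then rescale each
# axis to unit variance. eps guards against dividing by a near-zero eigenvalue.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(cov)
eps = 1e-5
X_white = Xc @ eigvecs / np.sqrt(eigvals + eps)

# After whitening the data is ~zero-mean, ~unit-variance, and decorrelated.
print(np.round(X_white.mean(axis=0), 3))
print(np.round(np.cov(X_white.T, bias=True), 3))
```

The covariance computation and the eigendecomposition (with its inverse square root) are exactly the expensive steps the authors object to doing at every layer on every step.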
2. The essential idea of BatchNorm
BN's basic idea is actually quite straightforward. In a deep network, the input to each non-linearity (that is, x = Wu + b, where u is the layer input) gradually shifts as the network gets deeper and as training proceeds, and this is why convergence is slow: in general the whole distribution drifts toward the saturating ends of the non-linear function's range (for the sigmoid, that means the activation input Wu + b takes large negative or large positive values). This causes the gradients of the lower layers to vanish during back-propagation, which is the essential reason deep networks converge more and more slowly. BN uses a normalization step to force the input distribution of every neuron in every layer back to a standard normal distribution with mean 0 and variance 1; in effect, it pulls the increasingly skewed distribution back to a standard one. This places the activation inputs in the region where the non-linearity is sensitive to its input, so that small changes in the input lead to large changes in the loss function; in other words, the gradients become larger, the vanishing-gradient problem is avoided, and larger gradients mean faster convergence, which greatly accelerates training.
That's it. In short: for each hidden-layer neuron, BN forcibly pulls the input distribution, which is gradually drifting toward the saturated limits of the non-linearity's range, back to a standard normal distribution with mean 0 and variance 1, so that the inputs to the non-linear function fall in the region sensitive to input, thereby avoiding the vanishing-gradient problem. Because the gradients stay relatively large, parameter updates are clearly more effective: each step moves the loss further toward its optimum, so convergence is faster. In the final analysis BN is just such a mechanism; the method is very simple, but the insight behind it is profound.
The above is still abstract; the following expresses more vividly what this adjustment actually means.
Figure 1: Several normal distributions
Suppose the original activation input x of some hidden-layer neuron follows a normal distribution with mean -2 and variance 0.5, corresponding to the left-most light-blue curve in the figure above. BN converts it to a normal distribution with mean 0 and variance 1 (the dark-blue curve in the figure). This means the whole distribution of x is shifted right by 2 (the change of mean) and the curve becomes flatter (the change of variance). The point of this figure is that BN takes each hidden neuron's activation-input distribution, which has drifted away from mean 0 and variance 1, and, by translating the mean and compressing or expanding the sharpness of the curve, adjusts it back to a normal distribution with mean 0 and variance 1.
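The adjustment described above is just the standardization step, which can be sketched in a few lines (the sample size and seed are arbitrary; this is an illustration, not BN's full transform, since the learned scale and shift come later):

```python
import numpy as np

rng = np.random.default_rng(1)
# Activations like the left-most curve in Figure 1: mean -2, variance 0.5.
x = rng.normal(loc=-2.0, scale=np.sqrt(0.5), size=100_000)

# The BN normalization step: subtract the batch mean, divide by the batch std.
x_hat = (x - x.mean()) / x.std()

# x_hat is now effectively mean 0, variance 1.
print(round(x_hat.mean(), 4), round(x_hat.var(), 4))
```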
So what is the use of adjusting the activation input x to a standard normal distribution?
First, let's see what a standard normal distribution with mean 0 and variance 1 actually means:
Figure 2: Standard normal distribution with mean 0 and variance 1
It means that with probability about 68% the value of x falls within one standard deviation, i.e. in the range [-1, 1], and with probability about 95% it falls within two standard deviations, i.e. in [-2, 2]. Why does this matter? Recall that the activation input is x = Wu + b, where u is the actual layer input and x is the value fed into the neuron's non-linearity. Assume the non-linearity is the sigmoid, and look at the graph of sigmoid(x):
Figure 3. Sigmoid (x)
The derivative of sigmoid(x) is sigmoid'(x) = f(x) * (1 - f(x)). Since f(x) = sigmoid(x) lies between 0 and 1, the derivative lies between 0 and 0.25, as the following diagram shows:
Figure 4. Derivative of sigmoid(x)
Suppose the original distribution of x is normal with mean -6 and variance 1, so that 95% of the values fall in [-8, -4]. The corresponding sigmoid(x) values are then very close to 0. This is a typical gradient-saturation region, where the gradient changes very slowly. Why is it saturated? Look at sigmoid(x): when its value is close to 0 or close to 1, the corresponding derivative is close to 0, which means the gradients are tiny or vanish altogether. If instead, after BN, the mean is 0 and the variance is 1, then 95% of the x values fall in the interval [-2, 2]. There, sigmoid(x) is close to a linear transformation, which means small changes in x produce large changes in the output: the gradients are larger, and the corresponding region of the derivative curve is well above 0. That is the unsaturated region.
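The contrast between the two regions can be checked numerically (a small sketch; the sample points -8/-6/-4 and -2/0/2 are chosen to match the two distributions discussed above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # f(x) * (1 - f(x)); maximum 0.25 at x = 0

# Saturated inputs (distribution centered at -6) vs. normalized inputs
# (centered at 0): the saturated gradients are orders of magnitude smaller.
saturated = sigmoid_grad(np.array([-8.0, -6.0, -4.0]))
centered = sigmoid_grad(np.array([-2.0, 0.0, 2.0]))
print(saturated)  # all below 0.02
print(centered)   # between ~0.10 and 0.25
```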
From the figures above it should be clear what BN is doing: it takes the hidden neuron's activation input x = Wu + b, whose distribution has drifted, and pulls it back through the BN operation to a normal distribution with mean 0 and variance 1. That is, the center of the original distribution is shifted left or right to mean 0, and the curve is stretched or shrunk so the variance becomes 1. What does this mean? It means that after BN, most activation values fall into the near-linear region of the non-linearity, with derivatives far from the saturation region, which accelerates training convergence.
But readers with even a little understanding of neural networks will naturally ask a question here: if BN pushes every input into the near-linear region, isn't the effect the same as replacing the non-linearity with a linear function? And what would that mean? We know that a stack of linear transformations is equivalent to a single linear layer, so the depth would become meaningless: the network's expressive power would shrink and the point of being deep would be lost. To preserve the non-linearity it has gained, BN therefore applies a scale-and-shift operation to the normalized x with mean 0 and variance 1: y = scale * x + shift. Each neuron gets two extra parameters, scale and shift, both learned during training. Their purpose is to move the standardized distribution a little to the left or right (shift) and make it a little fatter or thinner (scale), differently for each neuron; this is equivalent to moving the input of the non-linearity from the purely linear region around the center slightly toward the non-linear region. The core idea is presumably to find a good balance point between linearity and non-linearity: enjoy the benefits of a strong non-linearity while avoiding the two saturated ends that make the network converge too slowly. Of course, this is my own interpretation; the paper's authors do not say so explicitly. It is also clear that the scale-and-shift operation is somewhat controversial: in the extreme case, training could use scale and shift to adjust the transformed x exactly back to its untransformed state, going around in a circle and back to the original "Internal Covariate Shift" problem. I feel the paper does not explain the theoretical justification for the scale-and-shift operation clearly enough.
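The "going around in a circle" worry can be demonstrated directly: if training drove the learned parameters to scale = std(x) and shift = mean(x), the scale-and-shift step would exactly undo the normalization (a toy sketch; the distribution parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=3.0, size=10_000)

mu, var = x.mean(), x.var()
x_hat = (x - mu) / np.sqrt(var)

# The degenerate case: gamma -> sqrt(var), beta -> mu recovers the
# untransformed activations, cancelling the normalization entirely.
gamma, beta = np.sqrt(var), mu
y = gamma * x_hat + beta

print(np.allclose(y, x))  # True
```

In practice training does not appear to settle on this degenerate solution, but nothing in the transform itself forbids it, which is exactly the controversy noted above.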
3. How to do BatchNorm in the training stage
The above is the abstract analysis and interpretation of BN. How, concretely, is BN done in mini-batch SGD? This part of the paper is actually written clearly and is easy to understand; for the completeness of this article, here is a brief description.
Suppose we have a deep neural network whose two-layer structure is as shown below.
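The per-mini-batch transform the paper describes can be sketched in NumPy (a minimal illustration, not the paper's code; `batchnorm_forward`, the (batch, features) layout, and the toy data are all my own assumptions):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """BN over one mini-batch: x has shape (batch, features)."""
    mu = x.mean(axis=0)                  # per-feature mini-batch mean
    var = x.var(axis=0)                  # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    y = gamma * x_hat + beta             # learned scale and shift
    return y, mu, var

rng = np.random.default_rng(3)
x = rng.normal(-2.0, 4.0, size=(64, 8))    # one mini-batch of activations
gamma, beta = np.ones(8), np.zeros(8)      # typical initialization
y, mu, var = batchnorm_forward(x, gamma, beta)
print(np.round(y.mean(axis=0), 3))         # ~0 per feature
print(np.round(y.var(axis=0), 3))          # ~1 per feature
```

With gamma initialized to 1 and beta to 0, the output of each feature is standardized over the mini-batch; during training, gamma and beta are updated by back-propagation like any other parameters, and mu and var are also what gets remembered for inference.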
4. The BatchNorm inference process
During training, BN can adjust the activations using the several training examples in the mini-batch. But in the inference process, the input is obviously a single example, with no mini-batch of other instances to look at, so how do we apply BN to the input? A single example clearly cannot provide the mean and variance of an instance set. What to do?
Since no statistics can be obtained from a mini-batch at inference, we get the statistics, namely the mean and variance, another way: compute them from all training instances, and use these to replace the per-mini-batch statistics obtained from the m training examples in a mini-batch. After all, we wanted global statistics all along; the mini-batch statistics were only a simplification adopted because computing global statistics during training is too expensive, so at inference time we can use the global statistics directly.
Having decided which data the statistics come from, the next question is how to obtain the global mean and variance. Very simple: every mini-batch processed during training already produced a mean and a variance over its m examples. We just remember the mean and variance statistics of each mini-batch, and then take the mathematical expectation of these means and variances over all mini-batches to obtain the global statistics, namely: E[x] is the average of the mini-batch means, and Var[x] is m/(m-1) times the average of the mini-batch variances (the factor m/(m-1) turns the biased per-batch estimates into an unbiased one).
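The aggregation above can be sketched as follows (a toy illustration with one scalar feature; the batch size, seed, and distribution are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n_batches = 32, 100

# During training, record each mini-batch's mean and (biased) variance.
batch_means, batch_vars = [], []
for _ in range(n_batches):
    xb = rng.normal(1.5, 2.0, size=m)
    batch_means.append(xb.mean())
    batch_vars.append(xb.var())

# Global statistics for inference, as described above:
#   E[x]   = average of the mini-batch means
#   Var[x] = m/(m-1) * average of the mini-batch variances (unbiased)
E_x = np.mean(batch_means)
Var_x = m / (m - 1) * np.mean(batch_vars)

# At inference, a single example is normalized with these fixed statistics.
gamma, beta, eps = 1.0, 0.0, 1e-5
x_single = 3.0
y = gamma * (x_single - E_x) / np.sqrt(Var_x + eps) + beta
print(round(E_x, 2), round(Var_x, 2), round(y, 2))
```

Since E[x] and Var[x] are fixed constants at inference, the whole BN step collapses into a single linear transform of x, which is how frameworks typically fold it away for deployment.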
5. The benefits of BatchNorm
Why is BatchNorm so impressive? The key is simply that it works well. It not only greatly improves training speed, with a much faster convergence process, but also improves classification accuracy. One explanation is that it acts as a regularizer, similar to Dropout's way of preventing overfitting, so comparable results can be achieved even without Dropout. In addition, the tuning process becomes much simpler: the requirements on initialization are not as strict, and larger learning rates can be used. All in all, after such a simple transformation the benefits are many, which is why BN has become so popular.