Starting from the source
One, two basic concepts
1. Independent and identically distributed (i.i.d.)
Independent and identically distributed data simplifies the training of conventional machine learning models and improves their predictive power.
2. Whitening
Whitening does two things:
Removing the correlations between features (independence);
Making all features have the same mean and variance (identical distribution).
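As a rough sketch of what whitening does, here is a minimal PCA-whitening routine in NumPy (my own illustration, not part of the original text; the function name and the toy data are assumptions):

```python
import numpy as np

def pca_whiten(X, eps=1e-5):
    """Whiten X of shape (n_samples, n_features): zero mean, unit variance, no correlation."""
    X_centered = X - X.mean(axis=0)              # zero-mean each feature
    cov = np.cov(X_centered, rowvar=False)       # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigendecomposition (symmetric matrix)
    # Rotate onto the eigenbasis and rescale each direction to unit variance.
    return X_centered @ eigvecs / np.sqrt(eigvals + eps)

# Correlated toy data; after whitening its covariance is approximately the identity.
X = np.random.randn(1000, 3) @ np.array([[2.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.5, 0.3, 0.2]])
Xw = pca_whiten(X)
print(np.round(np.cov(Xw, rowvar=False), 2))     # ~ identity matrix
```

The near-identity covariance of the whitened data is exactly the "uncorrelated, same mean and variance" property described above.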
Two, the problem
1. High-level layers are hard to train
A deep neural network stacks many layers, and every parameter update in a layer changes the distribution of the inputs seen by the layers above it. These changes compound layer by layer, so the input distribution of the high-level (highly abstract) layers shifts violently, forcing the high-level layers to keep re-adapting to the updates of the lower layers.
Google summarized this phenomenon as Internal Covariate Shift (ICS):
A classic assumption in statistical machine learning is that the data distributions of the source domain and the target domain are consistent. If they are not, new machine learning problems arise, such as transfer learning / domain adaptation.
Covariate shift is a sub-problem under this assumption: the conditional probabilities of the source and target domains are the same, but their marginal distributions differ. That is, for all $x \in \mathcal{X}$,
$$P_s(Y \mid X = x) = P_t(Y \mid X = x),$$
but
$$P_s(X) \neq P_t(X).$$
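As a toy illustration of this definition (my own example, not from the original), the two domains below share the same conditional relation between $X$ and $Y$ but have clearly different marginals over $X$:

```python
import numpy as np

rng = np.random.default_rng(0)

def label(x):
    """Shared conditional relation P(Y | X): identical in both domains."""
    return np.sin(x) + 0.1 * rng.standard_normal(x.shape)

x_source = rng.normal(loc=0.0, scale=1.0, size=10_000)   # P_s(X): centred at 0
x_target = rng.normal(loc=3.0, scale=0.5, size=10_000)   # P_t(X): shifted and narrower

y_source, y_target = label(x_source), label(x_target)

# The marginals clearly differ, even though Y is generated the same way from X.
print("source X mean/std:", x_source.mean().round(2), x_source.std().round(2))
print("target X mean/std:", x_target.mean().round(2), x_target.std().round(2))
```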
Looking at the outputs of the layers of a neural network, we find exactly this situation: because of the layer-by-layer operations, the input signal distribution of each layer is clearly different, and the difference grows with the depth of the network, yet the sample labels they "indicate" remain unchanged. This fits the definition of covariate shift. Because the analysis concerns the signals between layers, the shift is "internal", hence the name.
In short, the input data of each neuron is no longer "independent and identically distributed". This causes three problems:
First, the upper layers must keep adapting to the new input distribution, which slows down learning.
Second, the lower-layer inputs may keep drifting toward larger or smaller values, driving the upper layers into the saturation region of their activation functions, so that learning stops prematurely.
Third, the update of every layer affects all the other layers, so the parameter-update strategy of each layer needs to be as cautious as possible.
2. The challenge
Take an ordinary neuron in a neural network as an example. The neuron receives an input vector $\mathbf{x} = (x_1, x_2, \dots, x_d)$ and, after some operation, outputs a scalar value:
$$y = f(\mathbf{x})$$
Because of the ICS problem, the distribution of the inputs $\mathbf{x}$ reaching a particular layer may vary widely from mini-batch to mini-batch.
The "theoretically correct" way to restore independent and identical distributions would be to whiten the data of every layer. However, the standard whitening operation is expensive, and in particular we also need it to be differentiable, so that gradients can be back-propagated through it during training.
Three, the solution
1. The general framework
Before $\mathbf{x}$ is sent to the neuron, it is first shifted and scaled so that its distribution is normalized into a standard distribution within a fixed interval range.
The general transformation framework is as follows:
$$h = f\left(g \cdot \frac{\mathbf{x} - \mu}{\sigma} + b\right)$$
(1) $\mu$ is the shift parameter and $\sigma$ is the scale parameter. Using these two parameters, the data are shifted and scaled:
$$\hat{\mathbf{x}} = \frac{\mathbf{x} - \mu}{\sigma}$$
The resulting $\hat{\mathbf{x}}$ follows a standard distribution with mean 0 and variance 1.
(2) $b$ is the re-shift parameter and $g$ is the re-scale parameter. The result of the previous step is transformed further:
$$\mathbf{y} = g \cdot \hat{\mathbf{x}} + b$$
The resulting data follows a distribution with mean $b$ and variance $g^2$.
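A minimal NumPy sketch of this two-step framework (my own illustration; in a real network $g$ and $b$ would be learned by gradient descent rather than fixed as below):

```python
import numpy as np

def normalize_transform(x, g, b, eps=1e-5):
    """General normalization framework: standardize, then re-scale and re-shift.

    x : (batch, features) pre-activations of one layer
    g : (features,) learnable re-scale parameter
    b : (features,) learnable re-shift parameter
    """
    mu = x.mean(axis=0)                  # shift parameter, estimated from the data
    sigma = x.std(axis=0)                # scale parameter, estimated from the data
    x_hat = (x - mu) / (sigma + eps)     # step 1: mean 0, variance 1
    return g * x_hat + b                 # step 2: mean b, variance g**2

x = np.random.randn(256, 4) * 5.0 + 3.0
y = normalize_transform(x, g=np.full(4, 2.0), b=np.full(4, 0.5))
print(y.mean(axis=0).round(2), y.var(axis=0).round(2))   # ~0.5 and ~4.0 per feature
```

The printed statistics confirm the claim above: the output has mean $b$ and variance $g^2$ in each dimension.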
2. The purpose of the second transformation
Purpose one
The first transformation produces a standard distribution with mean 0 and variance 1, which limits expressive power. The lower-layer neurons may be working very hard to learn, yet no matter how their output changes, it is brutally re-normalized into this fixed range before being handed to the upper-layer neurons. To make better use of what the lower layers have learned, we re-shift and re-scale the normalized data so that each neuron's input falls in a range (mean $b$, variance $g^2$) tailored to that neuron. The re-scale and re-shift parameters are both learnable, which lets the normalization layer learn how to adapt to the outputs of the lower layers.
Purpose two
Besides making full use of what the lower layers have learned, the second transformation also preserves the network's non-linear expressive power.
Activation functions such as the sigmoid play an important role in neural networks: by distinguishing between saturated and unsaturated regions, they give the network its non-linear computing power. The normalization in the first step maps almost all of the data into the unsaturated (approximately linear) region of the activation function, so only its linear capacity is used and the expressive power of the network is reduced. Re-shifting and re-scaling move the data from the linear region back into the non-linear region, restoring the expressive power of the model.
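A small numeric check of this point (my own illustration; the re-scale value 4 and re-shift value 2 are hypothetical): standardized inputs stay in the near-linear part of the sigmoid, while a re-scale/re-shift can push part of the data into the saturated, non-linear parts:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.random.randn(100_000)          # standardized pre-activations: mean 0, variance 1
z_rescaled = 4.0 * z + 2.0            # after a hypothetical learned re-scale/re-shift

# The sigmoid derivative s*(1-s) is ~0.25 in the linear region and ~0 when saturated.
for name, v in [("standardized", z), ("re-scaled", z_rescaled)]:
    s = sigmoid(v)
    print(name, "mean derivative:", (s * (1 - s)).mean().round(3))
```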
Advantage
Without the normalization layer, the mean of the input to a neuron depends on the complex interactions inside the lower layers of the network; with the layer added, the mean of the normalized output is determined by $b$ alone, removing the tight coupling with the lower-layer computation. The new parameters are easy to learn by gradient descent, which simplifies the training of the neural network.
Problem
The goal of standard whitening is "independent and identically distributed" data. Independence is a separate matter and is not considered here. Even "identically distributed" is not achieved strictly: the transformation maps the data to a distribution with mean $b$ and variance $g^2$, that is, to a fixed interval range rather than to one strictly identical distribution, so this problem still leaves room for research.
Four, a survey of mainstream Normalization methods
Batch Normalization
Proposed by Google in 2015. For the detailed procedure, see the Batch normalization layer section in the "Tutorials".
BN normalizes each mini-batch independently, but the normalization parameters are the first-order and second-order statistics (mean and variance) of the mini-batch. This requires the statistics of each mini-batch to be approximate estimates of the statistics of the whole data set; in other words, every mini-batch should have approximately the same distribution as the full data. Mini-batches whose distributions differ only slightly can be seen as injecting noise into the normalization and the training, which can increase the robustness of the model. But if the original distributions of the mini-batches differ greatly, the data will undergo a different transformation in each mini-batch, which increases the difficulty of training.
BN is therefore best suited to scenarios where each mini-batch is large and the data distributions of the mini-batches are close to each other. The data should be shuffled well before training; otherwise the results will be much worse.
In addition, because BN has to compute the first- and second-order statistics of each mini-batch at run time, it is not well suited to dynamic network structures or to RNNs. Some researchers have proposed BN variants specifically for RNNs, which are not covered here.
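A minimal training-mode batch-norm sketch in NumPy (my own illustration; a real implementation also keeps running statistics for use at inference time, which is omitted here):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization over the batch axis.

    x     : (batch, features) activations of one layer for one mini-batch
    gamma : (features,) learnable re-scale parameter
    beta  : (features,) learnable re-shift parameter
    """
    mu = x.mean(axis=0)                  # first-order statistic of the mini-batch
    var = x.var(axis=0)                  # second-order statistic of the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(64, 8) * 3.0 + 1.0
out = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=0).round(2), out.var(axis=0).round(2))   # ~0 and ~1 per feature
```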
Layer Normalization
Unlike BN, LN normalizes horizontally: for each individual sample it considers the inputs of all the dimensions of a layer together, computes that layer's mean input value and input variance, and then transforms the input of every dimension with the same normalization operation.
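For comparison, a matching layer-norm sketch in NumPy (again my own illustration): the statistics are computed per sample across the layer's dimensions, so they do not depend on the mini-batch at all:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization over the feature axis, one sample at a time.

    x     : (batch, features) activations of one layer
    gamma : (features,) learnable re-scale parameter
    beta  : (features,) learnable re-shift parameter
    """
    mu = x.mean(axis=-1, keepdims=True)   # mean over the layer's dimensions, per sample
    var = x.var(axis=-1, keepdims=True)   # variance over the layer's dimensions, per sample
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(4, 8) * 2.0 + 5.0
out = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=-1).round(2), out.var(axis=-1).round(2))   # ~0 and ~1 per sample
```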
"Computer vision" normalization layer (to be continued)