This brief introduction to the MSRA initialization method is derived from Kaiming He's paper "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification".
- Motivation
- MSRA initialization
- Derivation proof
- Additional Information
Motivation
Network initialization matters a great deal. The traditional approach of initializing weights from a Gaussian distribution with a fixed variance makes the model difficult to converge once the network becomes deep. The VGG team worked around the initialization problem differently: they first trained an 8-layer network and then used it to initialize deeper networks.
"Xavier" is a relatively good initialization method, which I have described in my other blog post, "Deep learning--xavier initialization method". However, when the Xavier derivation assumes that the activation function is linear , it is clear that the relu and prelu that we commonly use today do not satisfy this condition.
MSRA Initialization
When only the number of inputs is considered, MSRA initialization draws the weights from a Gaussian distribution with mean 0 and variance 2/n:

$$W \sim \mathcal{N}\!\left(0,\ \frac{2}{n}\right)$$

where n is the number of input elements of the layer (for a convolutional layer, n = k^2 c with kernel size k and c input channels).
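As a concrete illustration (my own sketch, not code from the paper or any particular framework), the rule can be implemented in a few lines of NumPy: compute the fan-in of the weight tensor and sample from a Gaussian with standard deviation sqrt(2/n).

```python
import numpy as np

def msra_normal(shape, rng=None):
    """Sample a weight tensor of the given shape from N(0, 2/fan_in).

    shape is assumed to be (out_channels, in_channels, k, k) for a conv
    layer, or (out_features, in_features) for a fully connected layer.
    """
    if rng is None:
        rng = np.random.default_rng()
    fan_in = int(np.prod(shape[1:]))        # n = k*k*c for a conv layer
    std = np.sqrt(2.0 / fan_in)             # Var[w] = 2/n  =>  std = sqrt(2/n)
    return rng.normal(0.0, std, size=shape)

W = msra_normal((64, 3, 3, 3))              # a 3x3 conv mapping 3 -> 64 channels
print(W.std(), np.sqrt(2.0 / (3 * 3 * 3)))  # empirical std vs. theoretical std
```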
Derivation Proof
The derivation process is similar to that of Xavier initialization.
First, the response of the l-th convolutional layer is written as:

$$y_l = W_l x_l + b_l$$
Assuming that the elements of x_l and W_l are mutually independent and identically distributed (in the formula below, n_l denotes the number of input elements, while x_l and w_l each denote a single element), the variance is:

$$\mathrm{Var}[y_l] = n_l\,\mathrm{Var}[w_l x_l]$$
When the weights w_l have zero mean, the above variance can be further written as:

$$\mathrm{Var}[y_l] = n_l\,\mathrm{Var}[w_l]\,E[x_l^2]$$
For the ReLU activation function we have x_l = f(y_{l-1}) = max(0, y_{l-1}) (where f is the activation function). Since y_{l-1} has zero mean and a symmetric distribution, ReLU zeroes out half of it, so:

$$E[x_l^2] = \frac{1}{2}\,\mathrm{Var}[y_{l-1}]$$
Substituting this back into the variance formula above gives:

$$\mathrm{Var}[y_l] = \frac{1}{2}\, n_l\,\mathrm{Var}[w_l]\,\mathrm{Var}[y_{l-1}]$$
To keep the variance of the data consistent across layers, the weights should satisfy:

$$\frac{1}{2}\, n_l\,\mathrm{Var}[w_l] = 1, \quad \text{i.e.}\quad \mathrm{Var}[w_l] = \frac{2}{n_l}$$
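To see the effect of this condition numerically, the following sketch (my own illustration, not from the paper) pushes random data through a stack of fully connected ReLU layers and prints the standard deviation of the final activations: with Var[w] = 2/n the scale stays roughly constant across layers, while with Var[w] = 1/n it collapses toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 20
x0 = rng.normal(size=(1000, n))

for var in (2.0 / n, 1.0 / n):           # MSRA variance vs. 1/n for comparison
    x = x0
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(var), size=(n, n))
        x = np.maximum(0.0, x @ W)       # linear layer followed by ReLU
    print(f"Var[w] = {var:.4f}: std after {depth} layers = {x.std():.4f}")
```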
Additional Information
(1) For the first layer, the input has not passed through a ReLU, so in theory its initialization variance should be 1/n. However, since this affects only a single layer, the extra factor is almost insignificant, and for simplicity the 2/n variance is used for it as well;
(2) As with Xavier, back-propagation must also be considered. The same derivation can be carried out for the backward pass; the conclusion is again that the variance should be 2/n, except that n is now the number of outputs rather than the number of inputs. The paper states that either choice is sufficient to make the model converge. A small sketch of the two choices follows below.
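These two choices correspond to the "fan-in" and "fan-out" modes exposed by common frameworks (for example, the mode argument of torch.nn.init.kaiming_normal_ in PyTorch). Below is a minimal sketch of computing both for a convolutional weight tensor; the (out_channels, in_channels, k, k) layout and variable names are my own assumptions, not taken from the paper.

```python
import numpy as np

# Assumed weight layout: (out_channels, in_channels, k, k)
shape = (64, 32, 3, 3)
out_c, in_c, k, _ = shape

fan_in = in_c * k * k        # forward-pass derivation: n = k*k*c_in
fan_out = out_c * k * k      # backward-pass derivation: n = k*k*c_out

print("std (fan_in): ", np.sqrt(2.0 / fan_in))
print("std (fan_out):", np.sqrt(2.0 / fan_out))
```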
(3) For the PReLU activation function, f(y) = max(0, y) + a min(0, y), the condition becomes:

$$\frac{1}{2}(1 + a^2)\, n_l\,\mathrm{Var}[w_l] = 1, \quad \text{i.e.}\quad \mathrm{Var}[w_l] = \frac{2}{(1 + a^2)\, n_l}$$
So the initialization depends on the PReLU parameter a, although the current Caffe code does not support manually specifying the value of a for MSRA initialization.
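Where a framework does not expose a, the PReLU-adjusted standard deviation is easy to compute by hand. Here is a minimal sketch; the function name msra_std and the example fan-in values are my own, not from the paper or from Caffe.

```python
import numpy as np

def msra_std(fan_in, a=0.0):
    """Standard deviation for MSRA init; a is the PReLU negative slope.

    a = 0 recovers the plain ReLU case, std = sqrt(2 / fan_in).
    """
    return np.sqrt(2.0 / ((1.0 + a * a) * fan_in))

print(msra_std(fan_in=576))          # ReLU: e.g. a 3x3 conv with 64 input channels
print(msra_std(fan_in=576, a=0.25))  # PReLU with negative slope 0.25
```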
(4) The paper also reports comparative experiments showing that MSRA initialization is significantly better than Xavier initialization as the network gets deeper; the difference is especially pronounced when the network is increased to 33 layers.