Deep Learning -- MSRA Initialization


This brief introduction to the MSRA initialization method is derived from Kaiming He's paper "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification".

      • Motivation
      • MSRA Initialization
      • Derivation Proof
      • Additional Information

Motivation

Network initialization is very important. Initializing weights from a Gaussian distribution with a fixed variance makes the model difficult to converge once the network becomes deep. The VGG team worked around the initialization problem by first training an 8-layer network and then using it to initialize deeper networks.

"Xavier" is a relatively good initialization method, which I have described in my other blog post, "Deep learning--xavier initialization method". However, when the Xavier derivation assumes that the activation function is linear , it is clear that the relu and prelu that we commonly use today do not satisfy this condition.

MSRA Initialization

When only the number of inputs is considered, MSRA initialization draws each weight from a Gaussian distribution with mean 0 and variance 2/n:
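Written out, with n denoting the number of input connections of the layer (following the notation of the He et al. paper):

    W ~ N(0, 2/n),   i.e.  std(W) = sqrt(2/n)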

Derivation Proof

The derivation process is similar to that of Xavier.

First, the l-th convolutional layer is represented by the following formula:
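    y_l = W_l x_l + b_l

Here x_l is the input to the layer, W_l its weight matrix, and b_l its bias.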

Its variance is (assuming that x and w are independent and that each element is identically distributed; n_l in the following formula denotes the number of input elements, and x_l and w_l each denote a single element):
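    Var[y_l] = n_l Var[w_l x_l]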

When the weights w have zero mean, the above variance can be rewritten as:
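    Var[y_l] = n_l Var[w_l] E[x_l^2]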

For the ReLU activation function, x_l = f(y_{l-1}) = max(0, y_{l-1}) (where f is the activation function), and we have:
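    E[x_l^2] = (1/2) Var[y_{l-1}]

This step uses the fact that y_{l-1} has zero mean and a symmetric distribution when the previous layer's weights are zero-mean and symmetrically distributed, so the ReLU keeps only half of the second moment.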

Substituting this back into the variance formula above gives:
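    Var[y_l] = (1/2) n_l Var[w_l] Var[y_{l-1}]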

To keep the variance of the data consistent from layer to layer, the weights should satisfy:
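    (1/2) n_l Var[w_l] = 1,   i.e.  Var[w_l] = 2/n_l

which is exactly the zero-mean Gaussian with variance 2/n given above.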

Additional Information

(1) For the first layer, the data has not passed through a ReLU yet, so in theory this layer's initialization variance should be 1/n. However, since this concerns only one layer, the factor is almost insignificant, so the 2/n variance is used anyway to simplify the whole procedure;

(2) Back-propagation needs to be considered, exactly as with "Xavier". The same derivation can be carried out for the backward pass, and the final conclusion is still that the variance should be 2/n, except that n here is no longer the number of inputs but the number of outputs. The paper says that either of these two choices is sufficient to help the model converge (a minimal NumPy sketch of both variants is given at the end of this section).

(3) For the PReLU activation function, the condition becomes:
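    (1/2) (1 + a^2) n_l Var[w_l] = 1,   i.e.  Var[w_l] = 2 / ((1 + a^2) n_l)

where a is the negative slope of PReLU (setting a = 0 recovers the ReLU condition above).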

So the initialization depends on the PReLU parameter a; however, the current Caffe code does not support manually specifying the value of a for MSRA initialization.

(4) The paper presents some comparative experiments, which show that MSRA initialization is significantly better than Xavier initialization as the network becomes deeper.

The contrast is especially obvious when the network is deepened to 33 layers.
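As a rough illustration of the fan-in and fan-out variants from point (2), together with the PReLU adjustment from point (3), here is a minimal NumPy sketch; it is not the paper's code, and the msra_init name and its arguments are my own:

    import numpy as np

    def msra_init(fan_in, fan_out, mode="fan_in", a=0.0, rng=None):
        # MSRA/He initialization for a (fan_out, fan_in) weight matrix.
        # mode="fan_in"  keeps the forward-pass variance constant  (std = sqrt(2/fan_in));
        # mode="fan_out" keeps the backward-pass variance constant (std = sqrt(2/fan_out)).
        # a is the PReLU negative slope; a = 0 gives the plain ReLU case.
        rng = np.random.default_rng() if rng is None else rng
        n = fan_in if mode == "fan_in" else fan_out
        std = np.sqrt(2.0 / ((1.0 + a * a) * n))
        return rng.normal(0.0, std, size=(fan_out, fan_in))

    # Example: a layer with 256 inputs and 512 outputs, ReLU activation.
    W = msra_init(256, 512)
    print(W.std())  # close to sqrt(2/256) ≈ 0.088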
