Deep Residual Network Interpretation (MSRA 152-layer network)

Original URL:

http://caffecn.cn/?/article/4


2015_arxiv_Deep Residual Learning for Image Recognition

First of all, thanks to @Sing for the invitation. Having just finished studying the paper "Deep Residual Learning for Image Recognition", I would like to share my reading notes here. They are still rough, and I hope they lead to a broad discussion.
Before going into the specific ideas of the paper, let us first look at the impressive record of deep residual learning: with this weapon, MSRA took first place in the ImageNet classification, detection, and localization tasks, as well as in the COCO detection and segmentation tasks. This is mainly because the idea of residual learning makes it possible to train much deeper networks, which in turn learn better representations.
So what exactly is deep residual learning?
It has long been known that deeper networks can produce better representations of the data, but how to train a very deep network has always been a problem, mainly because of vanishing or exploding gradients and poorly scaled initialization. A series of methods such as ReLU, Xavier initialization, PReLU, batch normalization, and Path-SGD have been proposed around this issue (see Dr. Liu Xin's talk at the community's second offline event). However, the author of this paper, Kaiming He, found that even with these methods, training deep neural networks still exhibits a degradation phenomenon: as the network depth increases, the performance of the network gets worse, and this drop in performance is not caused by the problems mentioned above. See Figure 1: the 56-layer network has higher training error and higher test error than the 20-layer network.


Figure 1 [figure taken from the original paper; copyright belongs to the original authors]

This phenomenon is unreasonable. If there is a shallower network A that works well, then for a deeper network B we could simply make the first part of B identical to A and have the remaining layers implement an identity mapping; B should then perform at least as well as A, never worse. This is where the idea of deep residual learning comes from: since the latter part of B can get by with an identity mapping, why not add this prior to the network explicitly? The network is therefore built with shortcut connections, so that the output of each block is not the plain mapping of its input as in a traditional neural network, but that mapping superimposed on the input itself, as shown in Figure 2.


Figure 2 [figure taken from the original paper; copyright belongs to the original authors]
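
To make the shortcut connection concrete, below is a minimal sketch of a residual block. It is my own illustration in plain numpy with fully connected layers and made-up sizes, not the paper's actual convolutional blocks; the only point is the y = F(x) + x structure.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, b1, W2, b2):
    # Residual branch F(x): two affine layers with a ReLU in between.
    out = relu(x @ W1 + b1)
    out = out @ W2 + b2
    # Shortcut connection: add the input back, then apply the nonlinearity.
    return relu(out + x)

# Toy usage with illustrative sizes; the paper stacks convolutional blocks instead.
d = 8
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
W1, b1 = 0.1 * rng.standard_normal((d, d)), np.zeros(d)
W2, b2 = 0.1 * rng.standard_normal((d, d)), np.zeros(d)
y = residual_block(x, W1, b1, W2, b2)

Because the block only has to model the residual F(x) = y - x, an identity mapping is obtained simply by driving the weights of F toward zero, which is much easier for the optimizer than fitting the identity with a stack of nonlinear layers.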

This is the core idea of deep residual learning. There are of course some implementation details, such as how to handle the case where the input and output dimensions differ; the paper also shows that the resulting network is deeper yet has fewer parameters than VGG, and discusses the network design principles. For these details please read Sec. 3.3 of the paper; discussion is welcome.
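
On the dimension-mismatch detail: when F(x) has a different size from x (for example when the number of channels doubles), the paper matches them either by zero-padding the shortcut or by a learned linear projection W_s on the shortcut (a 1x1 convolution in the convolutional case). Here is a sketch of the projection option, again in numpy with illustrative sizes of my own choosing:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block_projection(x, W1, b1, W2, b2, Ws):
    # F(x) maps from d_in to d_out, so x cannot be added back directly.
    out = relu(x @ W1 + b1)
    out = out @ W2 + b2
    # Project the shortcut to the new size: y = F(x) + Ws x.
    return relu(out + x @ Ws)

d_in, d_out = 8, 16              # illustrative sizes only
rng = np.random.default_rng(1)
x = rng.standard_normal(d_in)
W1, b1 = 0.1 * rng.standard_normal((d_in, d_out)), np.zeros(d_out)
W2, b2 = 0.1 * rng.standard_normal((d_out, d_out)), np.zeros(d_out)
Ws = 0.1 * rng.standard_normal((d_in, d_out))
y = residual_block_projection(x, W1, b1, W2, b2, Ws)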
Finally, a few of my own takeaways from this paper:
1. During network training it is very important to add prior information as guidance; a reasonable prior often yields very good results. The identity mapping in this paper is one example. Another example is Meina Kan et al.'s CVPR 2014 paper "Stacked Progressive Auto-Encoders (SPAE) for Face Recognition Across Poses": when using a deep neural network for cross-pose face recognition, adding the prior that the change of face pose is a gradual process significantly improves the performance of the network.
2. If you have read the "Highway Networks" paper, you will find that deep residual learning can be viewed as a special case of highway networks, but this does not diminish the contribution of this paper: taking a good idea all the way to an implementation that achieves strong performance is itself very difficult. In addition, this paper to some extent also gives an intuitive explanation of highway networks (a small sketch contrasting the two is given after this list).
3. Research on neural networks can be divided into two parts: network structure and training mechanism. Dr. Liu Xin likens them to computer hardware and software, and just as the line between hardware and software is becoming increasingly blurred, the same holds for the deep residual learning proposed here. Understood from the network structure, it is equivalent to a traditional CNN plus shortcut connections; understood from the training mechanism, adding the identity-mapping prior during training amounts to a new training mechanism.
4. The experiments in Sec. 4.2 of the paper push the limit by designing a very deep (1202-layer) network. With deep residual learning the network still converges, but its performance is worse than that of the 110-layer network, mainly because the amount of data is relatively small. So in practical applications, the size of the network and the amount of data have to be weighed together.
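
As mentioned in point 2, a residual block can be read as a highway block with the gates removed. Below is a rough numpy sketch of a gated highway layer for comparison with the residual block above; the shapes and the negative gate bias are illustrative choices of mine, not values from either paper.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_block(x, WH, bH, WT, bT):
    # Transformed signal H(x) and a learned transform gate T(x) in (0, 1).
    H = np.tanh(x @ WH + bH)
    T = sigmoid(x @ WT + bT)
    # Gated mixture: y = H(x) * T(x) + x * (1 - T(x)).
    return H * T + x * (1.0 - T)

d = 8                            # illustrative size only
rng = np.random.default_rng(2)
x = rng.standard_normal(d)
WH, bH = 0.1 * rng.standard_normal((d, d)), np.zeros(d)
WT, bT = 0.1 * rng.standard_normal((d, d)), np.full(d, -1.0)  # start biased toward the carry path
y = highway_block(x, WH, bH, WT, bT)

In the highway layer the network can learn to close the identity path through T(x); in the residual formulation the identity path is always open, which is exactly the prior this paper argues for.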
Finally, I would like to thank the Caffe community for providing a platform for communication and learning, from which I have benefited a lot.
