On understanding the residual network ResNet

Source: Internet
Author: User

Deep Residual Learning for Image Recognition is a famous paper.

After reading everyone's views at http://www.jianshu.com/p/e58437f39f65, I also want to talk about my own understanding after reading the paper.



Network depth is a major factor in the performance of deep convolutional neural networks, but researchers found that as networks get deeper, training results actually get worse. This is not overfitting, because overfitting would mean good results on the training set and poor results on the test set, whereas the degradation of deep networks shows up on the training set itself. And the phenomenon gets worse with depth. This is counterintuitive, because a deeper network could in principle be obtained from a well-trained shallow network by stacking identity transformations on top of it, so it should be able to do at least as well. Evidently the deep network fails to learn these identity transformations on its own. Therefore, ResNet was proposed.
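To make that construction argument concrete, here is a minimal PyTorch sketch (my own illustration, not from the paper): stacking identity layers on top of a shallow network yields a deeper network that computes exactly the same function, so in principle the deeper network should never do worse on the training set.

```python
# A minimal sketch (my own illustration, not code from the paper): a deeper network
# built by stacking identity layers on a shallow one computes the same function.
import torch
import torch.nn as nn

shallow = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))

# "Deepen" the network with extra layers that are exact identity mappings.
deep = nn.Sequential(shallow, nn.Identity(), nn.Identity())

x = torch.randn(4, 16)
assert torch.allclose(shallow(x), deep(x))  # identical outputs, despite more layers
```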

The network is composed of many such blocks; each block has the structure shown in the figure below. Adding a shortcut connection is, functionally, adding an identity transformation.
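As a concrete version of that block, here is a minimal PyTorch sketch (my own simplified illustration, assuming two 3×3 convolutions and an identity shortcut, similar to the paper's basic block; the real blocks also use batch normalization and, when shapes change, a projection shortcut). The output is F(x) + x.

```python
# A simplified residual block sketch: the residual branch F is two 3x3 convolutions,
# and the shortcut adds the input x back to its output, so the block computes F(x) + x.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        residual = self.conv2(F.relu(self.conv1(x)))  # this is F(x)
        return F.relu(residual + x)                   # H(x) = F(x) + x

block = BasicResidualBlock(channels=8)
y = block(torch.randn(1, 8, 32, 32))  # the shortcut requires matching shapes
```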



From the forward-propagation point of view, introducing the identity transformation makes adjustments to the network parameters have a larger effect. Here I quote a particularly good answer (http://www.jianshu.com/p/e58437f39f65):

"F is the network map before summation, H is the network mapping from input to summation." For example, to map 5 to 5.1, then the introduction of residuals is F ' (5) = 5.1, after the introduction of residuals is H (5) =5.1, H (5) =f (5) +5, F (5) = 0.1. Here the F ' and F both represent network parameter mappings, and the mapping of residuals is more sensitive to the change of output . For example, s output from 5.1 to 5.2, mapping f ' output increased 1/51=2%, and for residual structure output from 5.1 to 5.2, map f is from 0.1 to 0.2, increased by 100%. Obviously the latter output change to the weight adjustment effect is bigger, therefore the effect is better. The idea of residuals is to remove the same main part, so as to highlight small changes, see residual network My first reaction is the differential amplifier.

I think this friend's answer is very vivid.
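A tiny numeric check of the percentages in that quote (just the arithmetic written out):

```python
# The arithmetic from the quoted example: the target output moves from 5.1 to 5.2.
x = 5.0
plain_old, plain_new = 5.1, 5.2          # plain mapping F'(x) must produce these
res_old, res_new = 5.1 - x, 5.2 - x      # residual mapping F(x) = H(x) - x

print((plain_new - plain_old) / plain_old)  # ~0.02  -> about a 2% relative change
print((res_new - res_old) / res_old)        # ~1.0   -> about a 100% relative change
```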

As for why it is more sensitive, I think the backward-propagation view of the same phenomenon is that the vanishing-gradient problem is alleviated. The gradient is used to update the weight parameters so that the network fits better, and it involves the error term; the error term is essentially the sensitivity of the network's loss value (as I understand it). So with the added shortcut connection, looking at backward propagation, the error term is propagated directly to the earlier layers and added in, which alleviates the shrinking of the gradient and thereby the vanishing-gradient problem.
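A minimal autograd sketch of that point (my own illustration, not code from the paper): for H(x) = F(x) + x, the gradient with respect to x is dF/dx + 1, so even when the gradient through the residual branch is tiny, the shortcut passes the error signal straight through to the earlier layer.

```python
# The shortcut contributes a direct "+1" term to the gradient, so the error signal
# reaches earlier layers even when the gradient through the residual branch is small.
import torch

x = torch.tensor(5.0, requires_grad=True)
w = torch.tensor(0.01)        # a toy residual branch F(x) = w * x with a tiny slope

F_out = w * x                 # dF/dx = 0.01
H_plain = F_out               # no shortcut
H_res = F_out + x             # with shortcut: H(x) = F(x) + x

grad_plain = torch.autograd.grad(H_plain, x, retain_graph=True)[0]
grad_res = torch.autograd.grad(H_res, x)[0]

print(grad_plain.item())      # 0.01 -> nearly vanished
print(grad_res.item())        # 1.01 -> the identity path keeps the gradient alive
```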

