Design purpose of the residual network
As network depth increases, a degradation problem appears: when the network becomes deeper and deeper, training accuracy saturates and the training error then grows. This is clearly not over-fitting, because over-fitting means the training error keeps shrinking while the test error grows. To address this degradation, ResNet was proposed. Instead of having the stacked layers fit the desired underlying mapping directly, we let them explicitly fit a residual mapping. If the desired mapping is H(x), the stacked nonlinear layers fit the mapping F(x) = H(x) - x. The hypothesis is that the residual mapping is easier to optimize than the original one, i.e. fitting F(x) = H(x) - x is easier than fitting H(x) directly. In the extreme case where the desired mapping is an identity mapping, the residual network only has to drive F(x) to zero, while a plain network has to fit H(x) = x; the former is obviously much easier to optimize.
Residual block
A residual block is defined as y = F(x, {Wi}) + x, where x and y are the input and output vectors of the block and F(x, {Wi}) is the residual mapping to be learned. With two layers, F = W2 σ(W1 x), where σ is the ReLU activation function; biases are omitted from this expression for convenience. The shortcut connection here is an identity mapping, chosen because it introduces no additional parameters and no additional computational complexity. The form of the residual function F is flexible, so a residual block can also have three layers; if a block had only one layer, however, y = W1 x + x would reduce to a plain linear layer. A 3-layer residual block is shown below.
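As a minimal sketch of the formula above (plain NumPy, fully-connected rather than convolutional layers, biases omitted as in the text), the 2-layer block y = F(x) + x with F = W2 σ(W1 x) can be written as:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """2-layer residual block: y = F(x) + x, with F = W2 * relu(W1 * x)."""
    f = W2 @ relu(W1 @ x)   # the residual mapping F(x, {Wi}); biases omitted
    return f + x            # identity shortcut: no extra parameters or compute

# If the optimal mapping is the identity, the block only needs F(x) = 0,
# which zero weights achieve trivially:
x = np.array([1.0, -2.0, 3.0])
y = residual_block(x, np.zeros((3, 3)), np.zeros((3, 3)))  # y == x
```

With the weights at zero the shortcut alone carries the signal, illustrating why fitting an identity is easy for a residual block.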
In general, this 3-layer residual block is called a 'bottleneck block'. The 1x1 convolutions reduce and then restore the dimensionality while introducing additional nonlinear transformations; this significantly increases the depth of the residual blocks and improves the representational power of the residual network.
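A quick way to see the benefit of the bottleneck design is to compare weight counts at one stage. The sketch below assumes 256 input/output channels and a 64-channel bottleneck (the widths used in the deeper ResNet stages) and ignores biases and batch normalization:

```python
# bottleneck block: 1x1 reduce (256->64), 3x3 (64->64), 1x1 restore (64->256)
bottleneck_params = 256 * 64 * 1 * 1 + 64 * 64 * 3 * 3 + 64 * 256 * 1 * 1

# 2-layer "basic" block at full width: two 3x3 convolutions at 256 channels
basic_params = 2 * (256 * 256 * 3 * 3)

# the bottleneck is roughly 17x cheaper despite having one more layer
ratio = basic_params / bottleneck_params
```

So the 1x1 layers let the expensive 3x3 convolution operate on a much thinner representation, which is what makes 50+ layer networks affordable.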
Advantages of the residual network
The difference between a residual network and an ordinary network is the introduction of skip connections, which let the output of one residual block flow directly into the next. This improves the flow of information and also mitigates the vanishing-gradient and degradation problems caused by excessive network depth.
Suppose there is a large neural network (call it Big NN) whose input is x and whose output activation is a^l. If we want to increase the depth of this network, we can append two additional layers whose final output is a^{l+2}. These two layers can be viewed as a residual block with a shortcut connection; assume the activation function used throughout the network is ReLU.
a^{l+2} = g(z^{l+2} + a^l), where z^{l+2} = W^{l+2} a^{l+1} + b^{l+2}. If W^{l+2} = 0 and b^{l+2} = 0, then a^{l+2} = g(a^l); since g is ReLU, a^{l+2} = a^l whenever a^l >= 0.
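This identity argument can be checked numerically. The sketch below assumes a small fully-connected layer with g as ReLU and drives both appended layers' weights and biases to zero, so the shortcut carries a^l through unchanged:

```python
import numpy as np

def g(z):                                 # ReLU activation
    return np.maximum(z, 0.0)

a_l = np.array([0.5, 2.0, 0.0])           # a^l >= 0, as output by a ReLU layer
W, b = np.zeros((3, 3)), np.zeros(3)      # W^{l+2} = 0, b^{l+2} = 0

a_l1 = g(W @ a_l + b)                     # intermediate activation a^{l+1}
z_l2 = W @ a_l1 + b                       # z^{l+2} = 0
a_l2 = g(z_l2 + a_l)                      # a^{l+2} = g(a^l) = a^l
```

The appended layers do no harm even before training, which is why adding depth via residual blocks does not degrade performance the way plain stacking can.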
Structure of the residual network
There are five residual network architectures in total, with depths of 18, 34, 50, 101, and 152 layers. Each starts with a 7x7 convolutional layer followed by max pooling, then a stack of residual blocks; the 50-, 101-, and 152-layer networks use bottleneck blocks. From shallowest to deepest, the networks contain 8, 16, 16, 33, and 50 residual blocks respectively. At the end of the network, a global average pooling layer is usually attached. Global average pooling has no parameters to optimize, which helps prevent overfitting; it is more robust to spatial transformations of the input; and it strengthens the correspondence between feature maps and categories.
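The block counts above can be sanity-checked against the depths, and global average pooling is a one-line reduction (a sketch: each basic block contributes 2 layers, each bottleneck block 3, plus the initial 7x7 convolution and the final fully-connected layer):

```python
import numpy as np

# depth -> number of residual blocks (bottleneck blocks for 50/101/152)
blocks_per_depth = {18: 8, 34: 16, 50: 16, 101: 33, 152: 50}

def global_average_pool(features):
    """(channels, height, width) -> (channels,): one mean per feature map,
    with no learnable parameters."""
    return features.mean(axis=(1, 2))

fmap = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
pooled = global_average_pool(fmap)   # one scalar per channel
```

Because the pooling collapses each feature map to a single value regardless of spatial size, the network tolerates varied input resolutions and each channel can be read as a category-level evidence score.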
The nature of the residual network
A residual network effectively behaves as an ensemble of many shallower networks. It does not fundamentally solve the vanishing-gradient problem but rather sidesteps it: because the network decomposes into many shallow paths, and shallow networks do not suffer from vanishing gradients during training, convergence is accelerated.
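This "many shallow networks" view can be made concrete by unrolling the blocks: since each block computes x + F(x), expanding n stacked blocks yields 2^n terms, each corresponding to a path through some subset of the residual branches (a sketch, following the unraveled-view interpretation):

```python
from itertools import product

def enumerate_paths(n_blocks):
    # each block is traversed either via the identity shortcut ('I')
    # or via its residual branch ('F'); unrolling gives all 2**n combinations
    return list(product("IF", repeat=n_blocks))

paths = enumerate_paths(3)
# path length = number of residual branches taken; lengths are binomially
# distributed, so most paths through a deep ResNet are in fact shallow
lengths = [p.count("F") for p in paths]
```

Gradients flowing along the short paths remain healthy even when the longest path is very deep, which is the intuition behind the ensemble reading above.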