In-depth interpretation of ResNet

Design purpose of the residual network

As network depth increases, a degradation problem appears: as the network gets deeper, training accuracy saturates and then degrades, and the training error itself grows. This is clearly not over-fitting, because over-fitting means the training error keeps shrinking while the test error grows. ResNet was proposed to address this degradation. Instead of asking a stack of layers to fit the desired mapping directly, we ask it to fit a residual mapping explicitly. If the desired mapping is H(x), the stacked nonlinear layers are made to fit F(x) = H(x) - x instead. The hypothesis is that optimizing the residual mapping is easier than optimizing the original one, that is, fitting F(x) = H(x) - x is easier than fitting H(x) directly. In the extreme case where the desired mapping is the identity, the residual network only has to drive F(x) to zero, while a plain network has to fit H(x) = x; the former is clearly much easier to optimize.
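
To make the extreme case concrete, here is a toy sketch (assuming PyTorch; the layer sizes and module names are illustrative, not from the original paper): a residual branch whose weights are all zero already realizes the identity mapping, while a plain stack of layers would have to learn it.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 8)

# A plain two-layer stack has to *learn* the identity mapping.
plain = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))

# A residual branch initialized to zero already realizes it: F(x) = 0 => output = x.
residual_branch = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
for p in residual_branch.parameters():
    nn.init.zeros_(p)

print(torch.allclose(residual_branch(x) + x, x))  # True: identity for free
print(torch.allclose(plain(x), x))                # almost surely False
```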

Residual block

A residual block is defined as y = F(x, Wi) + x, where x and y are the input and output vectors of the block, and F(x, Wi) is the residual mapping to be learned. With two layers, F = W2 σ(W1 x), where σ is the ReLU activation function; biases are omitted here for convenience. The shortcut connection is an identity mapping, chosen because it introduces no extra parameters and no extra computational complexity. The residual function F is flexible, and a residual block can also have three layers; but if the block had only one layer, then y = W1 x + x, which is merely a linear layer plus the input and brings little benefit. The three-layer residual block is shown below.
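
Before turning to the three-layer variant, here is a minimal sketch of the two-layer block, assuming PyTorch and fully connected layers so that it matches the formula F = W2 σ(W1 x) exactly (the convolutional version used in practice is analogous):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two-layer residual block y = F(x, Wi) + x with F = W2 * relu(W1 * x).

    Biases are omitted, as in the formula above; input and output must have
    the same dimension so the identity shortcut can be added directly.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, dim, bias=False)
        self.w2 = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        f = self.w2(torch.relu(self.w1(x)))  # residual mapping F(x, Wi)
        return f + x                          # add the identity shortcut

block = ResidualBlock(64)
y = block(torch.randn(32, 64))
print(y.shape)  # torch.Size([32, 64])
```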

In general, we call this three-layer residual block a 'bottleneck block'. The 1x1 convolutions reduce and then restore the channel dimension while introducing additional nonlinear transformations, which makes it practical to stack many more residual blocks and improves the representational power of the residual network.
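
A sketch of such a bottleneck block, again assuming PyTorch; the batch normalization that the actual ResNet places after each convolution is omitted for brevity, and the channel counts (256 -> 64 -> 256) are only the typical example.

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """Three-layer 'bottleneck' residual block: 1x1 -> 3x3 -> 1x1 convolutions."""
    def __init__(self, channels: int, reduced: int):
        super().__init__()
        self.reduce = nn.Conv2d(channels, reduced, kernel_size=1, bias=False)   # shrink channels
        self.conv3 = nn.Conv2d(reduced, reduced, kernel_size=3, padding=1, bias=False)
        self.expand = nn.Conv2d(reduced, channels, kernel_size=1, bias=False)   # restore channels
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.reduce(x))
        out = self.relu(self.conv3(out))
        out = self.expand(out)        # residual mapping F(x)
        return self.relu(out + x)     # identity shortcut

block = BottleneckBlock(256, 64)
print(block(torch.randn(1, 256, 56, 56)).shape)  # torch.Size([1, 256, 56, 56])
```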

Advantages of the residual network

What distinguishes a residual network from an ordinary network is the introduction of skip connections. They let the information from one residual block flow directly into the next, improving the flow of information through the network, and they also mitigate the vanishing-gradient and degradation problems caused by making the network too deep.

Suppose there is a large neural network, Big NN, whose input is x and whose output activation is a^l. If we want to make this network deeper, we can append two more layers, so the final output becomes a^(l+2). These two layers can be viewed as a residual block with a shortcut connection; the activation function used throughout the network is ReLU.

a^(l+2) = g(z^(l+2) + a^l), where z^(l+2) = W^(l+2) a^(l+1) + b^(l+2). If W^(l+2) = 0 and b^(l+2) = 0, then a^(l+2) = g(a^l); since g is ReLU and a^l >= 0, this gives a^(l+2) = a^l.
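
A quick numerical check of this argument (a toy sketch, assuming PyTorch; the dimensions are arbitrary): with the weights and bias of layer l+2 set to zero, the added block reduces exactly to the identity, so adding it cannot hurt the network.

```python
import torch
import torch.nn as nn

a_l = torch.relu(torch.randn(5, 16))      # a^l >= 0, as the output of a ReLU layer

w_l1 = nn.Linear(16, 16)                  # layer l+1, arbitrary weights
w_l2 = nn.Linear(16, 16)                  # layer l+2
nn.init.zeros_(w_l2.weight)               # W^(l+2) = 0
nn.init.zeros_(w_l2.bias)                 # b^(l+2) = 0

a_l1 = torch.relu(w_l1(a_l))              # a^(l+1)
z_l2 = w_l2(a_l1)                         # z^(l+2) = 0
a_l2 = torch.relu(z_l2 + a_l)             # a^(l+2) = g(z^(l+2) + a^l)
print(torch.equal(a_l2, a_l))             # True: the added block is an identity
```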

Structure of the residual network

There are five standard residual network structures, with depths of 18, 34, 50, 101 and 152 layers. Each starts with a 7x7 convolution layer followed by max pooling, and then stacks residual blocks; the 50-, 101- and 152-layer networks use bottleneck residual blocks. From shallowest to deepest, the networks contain 8, 16, 16, 33 and 50 residual blocks respectively. A global average pooling layer is usually placed at the end of the network. Global average pooling has no parameters to optimize, which helps prevent over-fitting; it is more robust to spatial transformations of the input; and it strengthens the correspondence between feature maps and categories.
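
This layout can be summarized in code (a sketch, assuming PyTorch; the stage-wise block counts are the standard ones that add up to the totals quoted above, and the stem/head modules only show the overall shape, not a complete implementation):

```python
import torch.nn as nn

# Stage-wise block counts for the five variants ("basic" = 2-layer block,
# "bottleneck" = 3-layer block); the totals are 8, 16, 16, 33 and 50 blocks.
RESNET_STAGES = {
    18:  ("basic",      [2, 2, 2, 2]),
    34:  ("basic",      [3, 4, 6, 3]),
    50:  ("bottleneck", [3, 4, 6, 3]),
    101: ("bottleneck", [3, 4, 23, 3]),
    152: ("bottleneck", [3, 8, 36, 3]),
}

# Overall layout: 7x7 conv -> max pool -> stacked residual stages ->
# global average pooling -> classifier.
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),  # global average pooling: no parameters to learn
    nn.Flatten(),
    nn.Linear(2048, 1000),    # 2048 channels in the bottleneck variants, 1000 classes
)
```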

The nature of the residual network

A residual network can in effect be viewed as a collection of many shallower networks. It does not fundamentally solve the vanishing-gradient problem but rather sidesteps it: because the network decomposes into many shallow paths, and shallow networks do not suffer from vanishing gradients during training, the residual network converges faster.
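
One informal way to picture this (an illustrative sketch, not a derivation from the article): every residual block lets the signal pass either through the shortcut or through the residual branch, so n blocks induce on the order of 2^n input-to-output paths, most of which traverse only a few residual branches and therefore behave like shallow networks.

```python
# Unrolling two blocks by hand: y = x + f1(x), z = y + f2(y)
#   => the signal can skip or enter each block independently, giving 2**2 = 4 paths.
def num_paths(num_blocks: int) -> int:
    """Each residual block doubles the number of distinct input-to-output paths."""
    return 2 ** num_blocks

print(num_paths(2))   # 4
print(num_paths(16))  # 65536 paths through the 16 blocks of ResNet-34
```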
