Design purpose of the residual network
As network depth increases, a degradation problem appears: when the network becomes deeper and deeper, training accuracy saturates and the training error then grows. This is clearly not over-fitting, because over-fitting means the training error keeps shrinking while the test error grows. To address this degradation, ResNet was proposed. Instead of having the stacked layers fit the desired underlying mapping directly, we let them explicitly fit a residual mapping. If the desired mapping is H(x), the stacked nonlinear layers fit the mapping F(x) = H(x) - x. The hypothesis is that the residual mapping is easier to optimize than the original one, i.e. fitting F(x) = H(x) - x is easier than fitting H(x) directly. In the extreme case where the desired mapping is an identity mapping, the residual network only has to drive F(x) to zero, while a plain network has to fit H(x) = x; the former is obviously much easier to optimize.
Residual block
A residual block is defined as y = F(x, {Wi}) + x, where x and y are the input and output vectors of the block and F(x, {Wi}) is the residual mapping to be learned. With two layers, F = W2 σ(W1 x), where σ is the ReLU activation function; biases are omitted from this expression for convenience. The shortcut connection here is an identity mapping, chosen because it introduces no additional parameters and no additional computational complexity. The form of the residual function F is flexible, so a residual block can also have three layers; if a block had only one layer, however, y = W1 x + x would reduce to a plain linear layer. A 3-layer residual block is shown below.
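As a minimal sketch of the formula above (plain NumPy, fully-connected rather than convolutional layers, biases omitted as in the text), the 2-layer block y = F(x) + x with F = W2 σ(W1 x) can be written as:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """2-layer residual block: y = F(x) + x, with F = W2 * relu(W1 * x)."""
    f = W2 @ relu(W1 @ x)   # the residual mapping F(x, {Wi}); biases omitted
    return f + x            # identity shortcut: no extra parameters or compute

# If the optimal mapping is the identity, the block only needs F(x) = 0,
# which zero weights achieve trivially:
x = np.array([1.0, -2.0, 3.0])
y = residual_block(x, np.zeros((3, 3)), np.zeros((3, 3)))  # y == x
```

With the weights at zero the shortcut alone carries the signal, illustrating why fitting an identity is easy for a residual block.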
In general, this 3-layer residual block is called a 'bottleneck block'. The 1x1 convolutions reduce and then restore the dimensionality while introducing additional nonlinear transformations; this significantly increases the depth of the residual blocks and improves the representational power of the residual network.
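A quick way to see the benefit of the bottleneck design is to compare weight counts at one stage. The sketch below assumes 256 input/output channels and a 64-channel bottleneck (the widths used in the deeper ResNet stages) and ignores biases and batch normalization:

```python
# bottleneck block: 1x1 reduce (256->64), 3x3 (64->64), 1x1 restore (64->256)
bottleneck_params = 256 * 64 * 1 * 1 + 64 * 64 * 3 * 3 + 64 * 256 * 1 * 1

# 2-layer "basic" block at full width: two 3x3 convolutions at 256 channels
basic_params = 2 * (256 * 256 * 3 * 3)

# the bottleneck is roughly 17x cheaper despite having one more layer
ratio = basic_params / bottleneck_params
```

So the 1x1 layers let the expensive 3x3 convolution operate on a much thinner representation, which is what makes 50+ layer networks affordable.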
Advantages of the residual network
The difference between a residual network and an ordinary network is the introduction of skip connections, which let the output of one residual block flow directly into the next. This improves the flow of information and also mitigates the vanishing-gradient and degradation problems caused by excessive network depth.
Suppose there is a large neural network (call it Big NN) whose input is x and whose output activation is a^l. If we want to increase the depth of this network, we can append two additional layers whose final output is a^{l+2}. These two layers can be viewed as a residual block with a shortcut connection; assume the activation function used throughout the network is ReLU.
a^{l+2} = g(z^{l+2} + a^l), where z^{l+2} = W^{l+2} a^{l+1} + b^{l+2}. If W^{l+2} = 0 and b^{l+2} = 0, then a^{l+2} = g(a^l); since g is ReLU, a^{l+2} = a^l whenever a^l >= 0.
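This identity argument can be checked numerically. The sketch below assumes a small fully-connected layer with g as ReLU and drives both appended layers' weights and biases to zero, so the shortcut carries a^l through unchanged:

```python
import numpy as np

def g(z):                                 # ReLU activation
    return np.maximum(z, 0.0)

a_l = np.array([0.5, 2.0, 0.0])           # a^l >= 0, as output by a ReLU layer
W, b = np.zeros((3, 3)), np.zeros(3)      # W^{l+2} = 0, b^{l+2} = 0

a_l1 = g(W @ a_l + b)                     # intermediate activation a^{l+1}
z_l2 = W @ a_l1 + b                       # z^{l+2} = 0
a_l2 = g(z_l2 + a_l)                      # a^{l+2} = g(a^l) = a^l
```

The appended layers do no harm even before training, which is why adding depth via residual blocks does not degrade performance the way plain stacking can.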
Structure of the residual network
There are five residual network architectures in total, with depths of 18, 34, 50, 101, and 152 layers. Each starts with a 7x7 convolutional layer followed by max pooling, then a stack of residual blocks; the 50-, 101-, and 152-layer networks use bottleneck blocks. From shallowest to deepest, the networks contain 8, 16, 16, 33, and 50 residual blocks respectively. At the end of the network, a global average pooling layer is usually attached. Global average pooling has no parameters to optimize, which helps prevent overfitting; it is more robust to spatial transformations of the input; and it strengthens the correspondence between feature maps and categories.
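The block counts above can be sanity-checked against the depths, and global average pooling is a one-line reduction (a sketch: each basic block contributes 2 layers, each bottleneck block 3, plus the initial 7x7 convolution and the final fully-connected layer):

```python
import numpy as np

# depth -> number of residual blocks (bottleneck blocks for 50/101/152)
blocks_per_depth = {18: 8, 34: 16, 50: 16, 101: 33, 152: 50}

def global_average_pool(features):
    """(channels, height, width) -> (channels,): one mean per feature map,
    with no learnable parameters."""
    return features.mean(axis=(1, 2))

fmap = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
pooled = global_average_pool(fmap)   # one scalar per channel
```

Because the pooling collapses each feature map to a single value regardless of spatial size, the network tolerates varied input resolutions and each channel can be read as a category-level evidence score.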
The nature of the residual network
A residual network effectively behaves as an ensemble of many shallower networks. It does not fundamentally solve the vanishing-gradient problem but rather sidesteps it: because the network decomposes into many shallow paths, and shallow networks do not suffer from vanishing gradients during training, convergence is accelerated.
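This "many shallow networks" view can be made concrete by unrolling the blocks: since each block computes x + F(x), expanding n stacked blocks yields 2^n terms, each corresponding to a path through some subset of the residual branches (a sketch, following the unraveled-view interpretation):

```python
from itertools import product

def enumerate_paths(n_blocks):
    # each block is traversed either via the identity shortcut ('I')
    # or via its residual branch ('F'); unrolling gives all 2**n combinations
    return list(product("IF", repeat=n_blocks))

paths = enumerate_paths(3)
# path length = number of residual branches taken; lengths are binomially
# distributed, so most paths through a deep ResNet are in fact shallow
lengths = [p.count("F") for p in paths]
```

Gradients flowing along the short paths remain healthy even when the longest path is very deep, which is the intuition behind the ensemble reading above.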