ResNet was introduced in 2015 and went on to shape the development of deep learning in both academia and industry through 2016. Here is the ResNet network structure, to give a first look.
ResNet gives each layer a reference to its input and learns a residual function relative to that input, instead of learning an unreferenced function. Residual functions are easier to optimize, which makes it possible to greatly increase the number of layers.
We know that in computer vision the "level" of the features rises as the network gets deeper, and studies show that network depth is an important factor in achieving good results. However, vanishing/exploding gradients are an obstacle to training deep networks and can prevent them from converging.
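To make the vanishing/exploding gradient problem concrete, here is a toy sketch. It assumes every layer scales the backpropagated gradient by the same scalar factor, which is a deliberate simplification and not how a real network behaves; the function name `backprop_gain` is hypothetical.

```python
def backprop_gain(per_layer_gain: float, depth: int) -> float:
    """Product of local derivatives along a chain of `depth` layers.

    Toy assumption: each layer multiplies the gradient by the same factor.
    """
    gain = 1.0
    for _ in range(depth):
        gain *= per_layer_gain
    return gain


if __name__ == "__main__":
    # A factor slightly below 1 shrinks the gradient toward zero as depth
    # grows; a factor slightly above 1 makes it blow up.
    for depth in (10, 50, 100):
        print(depth, backprop_gain(0.9, depth), backprop_gain(1.1, depth))
```

Even factors as mild as 0.9 or 1.1 change the gradient by several orders of magnitude over 100 layers, which is why depth alone makes training fragile.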
There are some remedies, such as normalized initialization and normalizing the input of each layer, which allow networks about 10 times deeper than before to converge. However, once these deeper networks converge, they begin to degrade: adding more layers leads to larger errors, as shown in the figure. Such a deep plain network also has a very low convergence rate.
In principle, stacking y = x layers (called identity mappings) on top of a shallow network increases the depth without any degradation. The fact that degradation appears anyway suggests that stacks of nonlinear layers have difficulty approximating identity mappings.
However, avoiding degradation is not the goal in itself; we want a network that performs better. ResNet learns the residual function F(x) = H(x) - x; if F(x) = 0, the block is exactly the identity mapping mentioned above. ResNet is in fact a special case of "shortcut connections" in which the shortcut is an identity mapping, so it introduces no additional parameters or computational complexity. If the mapping to be learned is closer to an identity mapping than to a zero mapping, then learning a perturbation of the identity is easier than learning the whole mapping from scratch. As the figure shows, the residual functions generally have small responses, which indicates that the identity mapping provides reasonable preconditioning.
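The relationship H(x) = F(x) + x can be sketched in a few lines. This is a minimal illustration on plain Python lists, not the network's actual implementation; `with_shortcut`, `zero_residual` and `small_residual` are hypothetical names chosen for the example.

```python
from typing import Callable, List

Vector = List[float]


def with_shortcut(residual_fn: Callable[[Vector], Vector]) -> Callable[[Vector], Vector]:
    """Wrap a residual function F so that the result computes H(x) = F(x) + x."""
    def h(x: Vector) -> Vector:
        return [f_i + x_i for f_i, x_i in zip(residual_fn(x), x)]
    return h


def zero_residual(x: Vector) -> Vector:
    """F(x) = 0: the wrapped mapping becomes the identity."""
    return [0.0 for _ in x]


def small_residual(x: Vector) -> Vector:
    """A small perturbation of the identity (the 0.01 scale is arbitrary)."""
    return [0.01 * v for v in x]


if __name__ == "__main__":
    x = [1.0, -2.0, 3.0]
    print(with_shortcut(zero_residual)(x))   # same values as x
    print(with_shortcut(small_residual)(x))  # values close to x
```

When the desired mapping is close to the identity, the residual branch only has to output something near zero, which is the intuition behind the easier optimization.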
The structure of the residual block is as follows.
It has two layers and computes F = W2 σ(W1 x), where σ denotes the ReLU nonlinearity.
The shortcut then adds x to give y = F(x, {Wi}) + x, and a second ReLU follows the addition.
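As a rough sketch of the two-layer block described here, the following uses fully connected layers on plain Python lists, with biases and batch normalization left out for brevity. Real ResNet blocks use convolutions; the helper names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    return [max(0.0, a) for a in v]


def matvec(w: Matrix, v: Vector) -> Vector:
    """Matrix-vector product; each row of w yields one output element."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]


def residual_block(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """Two-layer residual block: relu(W2 relu(W1 x) + x), biases omitted."""
    f = matvec(w2, relu(matvec(w1, x)))                  # residual branch F(x)
    return relu([f_i + x_i for f_i, x_i in zip(f, x)])   # shortcut, then 2nd ReLU


if __name__ == "__main__":
    zeros = [[0.0, 0.0], [0.0, 0.0]]
    # With all-zero weights the block passes a non-negative input through.
    print(residual_block([2.0, 1.0], zeros, zeros))
```

Note that because of the final ReLU, the zero-weight block is an exact identity only for non-negative inputs.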
When the input and output dimensions need to change (for example, when the number of channels changes), a linear projection Ws can be applied to x on the shortcut: y = F(x, {Wi}) + Ws x. Experiments show, however, that the identity shortcut x is sufficient, so no projection is needed unless the output must have a different dimension. The dashed shortcuts in the ResNet structure diagram at the beginning of the article mark where the number of channels doubles.
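A projection shortcut can be sketched the same way. The example below is an assumption-laden illustration with fully connected layers and no biases, where `ws` plays the role of Ws; in a convolutional network the projection would typically be a 1x1 convolution. All function names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    return [max(0.0, a) for a in v]


def matvec(w: Matrix, v: Vector) -> Vector:
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]


def projected_residual_block(x: Vector, w1: Matrix, w2: Matrix, ws: Matrix) -> Vector:
    """relu(W2 relu(W1 x) + Ws x), where Ws maps x to the output dimension."""
    f = matvec(w2, relu(matvec(w1, x)))   # residual branch, output-sized
    shortcut = matvec(ws, x)              # projected input, output-sized
    if len(f) != len(shortcut):
        raise ValueError("residual branch and shortcut must have the same size")
    return relu([f_i + s_i for f_i, s_i in zip(f, shortcut)])


if __name__ == "__main__":
    x = [1.0, 2.0]                                   # 2 input features
    w1 = [[0.0, 0.0] for _ in range(4)]              # 2 -> 4
    w2 = [[0.0] * 4 for _ in range(4)]               # 4 -> 4
    ws = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]  # 2 -> 4
    print(projected_residual_block(x, w1, w2, ws))
```

Unlike the identity shortcut, `ws` adds parameters, which is one reason to use it only where the dimensions actually change.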
Experiments show that the residual function needs two or more layers; a single-layer residual block (y = W1 x + x) brings no improvement.
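One way to see why a single-layer block adds little: y = W1 x + x equals (W1 + I) x, so it collapses into an ordinary linear layer. The sketch below checks this numerically on a small example; the helper names and the sample values are made up for illustration.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, v: Vector) -> Vector:
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]


def single_layer_residual(x: Vector, w1: Matrix) -> Vector:
    """y = W1 x + x, the single-layer residual form."""
    return [a + b for a, b in zip(matvec(w1, x), x)]


def add_identity(w: Matrix) -> Matrix:
    """Return W + I for a square matrix W."""
    return [[w_ij + (1.0 if i == j else 0.0) for j, w_ij in enumerate(row)]
            for i, row in enumerate(w)]


if __name__ == "__main__":
    w1 = [[0.5, -1.0], [2.0, 0.25]]
    x = [4.0, 2.0]
    print(single_layer_residual(x, w1))   # residual form
    print(matvec(add_identity(w1), x))    # plain linear layer with W1 + I
```

Both calls compute the same vector, so the shortcut gives the single-layer block no extra expressive power over a plain linear layer.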
The residual network does solve the degradation problem: on both the training set and the validation set, deeper networks achieve lower error rates, as shown in the figure.
In practice, to limit the computational cost, the residual block is redesigned: the two 3x3 convolution layers are replaced with 1x1 + 3x3 + 1x1, as shown in the figure. In the new structure, a 1x1 convolution first reduces the dimension, the middle 3x3 convolution operates on the reduced dimension, and another 1x1 convolution restores it, which maintains accuracy while reducing computation.
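The saving can be estimated by counting convolution weights. The sketch below ignores biases and batch normalization, and the channel counts (256 channels reduced to 64) are assumed for illustration rather than taken from the text above; the function names are hypothetical.

```python
def conv_weights(kernel: int, in_channels: int, out_channels: int) -> int:
    """Number of weights in a kernel x kernel convolution (no bias)."""
    return kernel * kernel * in_channels * out_channels


def basic_block_weights(channels: int) -> int:
    """Two stacked 3x3 convolutions at full width."""
    return 2 * conv_weights(3, channels, channels)


def bottleneck_block_weights(channels: int, reduced: int) -> int:
    """1x1 reduce, 3x3 at reduced width, 1x1 restore."""
    return (conv_weights(1, channels, reduced)
            + conv_weights(3, reduced, reduced)
            + conv_weights(1, reduced, channels))


if __name__ == "__main__":
    print(basic_block_weights(256))           # 1179648
    print(bottleneck_block_weights(256, 64))  # 69632
```

Under these assumed widths the bottleneck uses roughly 17 times fewer weights than two full-width 3x3 layers, because the expensive 3x3 convolution only sees the reduced number of channels.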
Here are ResNet's results, which won first place in ImageNet 2015.