Kaiming He's paper [1], the CVPR 2016 Best Paper, targets the difficulty of SGD optimization caused by gradient vanishing in deep networks. It proposes the residual structure, which alleviates the degradation problem and achieves very good results on networks of 50, 101, 152, and even 1202 layers.
The error rate of ResNet is significantly lower than that of other mainstream deep networks (Figure 1).
Figure 1. The ResNet model, champion network of ImageNet 2015
An obvious fact is that the deeper a network is, the stronger its representational power. However, as depth increases, gradient vanishing becomes more pronounced, SGD fails to converge, and the final accuracy drops (Figure 2).
Figure 2. A plain 56-layer network has higher training and test error than a 20-layer network
To solve this problem, the residual structure is proposed; it maintains a good training effect even for networks of more than 1000 layers (although signs of overfitting appear at such extreme depths).
Figure 3. The residual structure adds the input x directly to the output, which is equivalent to introducing an identity mapping
As shown in Figure 3, assuming the original network was going to learn a function H(x), the author decomposes it into H(x) = F(x) + x.
After the decomposition, the main network (the vertical, downward data flow in Figure 3) fits F(x), while the bypass branch (the curved shortcut connection in Figure 3) carries the identity mapping x.
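To make the decomposition concrete, here is a minimal PyTorch-style sketch of a residual block (my own illustration, not the authors' original Caffe release; the layer sizes are placeholder assumptions). The two stacked convolutions play the role of F(x), and the shortcut simply adds the input x back before the final activation.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Minimal residual block: output = ReLU(F(x) + x).

    F(x) is two 3x3 conv layers with batch norm; the shortcut is the
    identity, so input and output must have the same shape.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.relu(self.bn1(self.conv1(x)))  # first half of F(x)
        residual = self.bn2(self.conv2(residual))      # second half of F(x)
        return self.relu(residual + x)                 # H(x) = F(x) + x


# Quick shape check: a 64-channel feature map passes through with its shape unchanged.
block = BasicResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))
print(y.shape)  # torch.Size([1, 64, 56, 56])
```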
Figure 4. Network structures of VGG-19, a plain 34-layer network, and the 34-layer network with residual connections added
As Figure 4 shows, compared with the plain network, ResNet only needs to add shortcut connections (a dotted shortcut means the number of channels is doubled there).
The authors' experiments show that a residual block needs at least two layers to be effective, and that the linear projection W_s in the following equation only serves to match the input and output dimensions; it does not improve the training results.
y = F(x, {W_i}) + W_s x
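The projection W_s in this equation corresponds to the dotted shortcuts in Figure 4 and is typically realized as a strided 1×1 convolution that only matches the channel count and spatial size of the two branches. A hedged PyTorch-style sketch (the specific layer sizes are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class ProjectionResidualBlock(nn.Module):
    """Residual block whose shortcut is a linear projection W_s.

    Used only when F(x) changes the number of channels or the spatial
    resolution; W_s (a strided 1x1 conv) merely matches dimensions.
    """
    def __init__(self, in_channels: int, out_channels: int, stride: int = 2):
        super().__init__()
        self.f = nn.Sequential(  # F(x, {W_i})
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.w_s = nn.Sequential(  # W_s x: dimension matching only
            nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.f(x) + self.w_s(x))  # y = F(x, {W_i}) + W_s x


# A dotted shortcut in Figure 4: 64 -> 128 channels, spatial size halved.
block = ProjectionResidualBlock(64, 128)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 128, 28, 28])
```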
Figure 5 shows the authors' experimental results with ResNet.
Figure 5. Training error of the plain network (left) versus ResNet (right)
Why ResNet Works
Two questions need to be answered: first, why decompose H(x) into H(x) = F(x) + x (so that the actual optimization target becomes F(x) = 0); second, why this decomposition solves the gradient vanishing problem.
For the first question, the author does not explain the underlying principle but shows experimentally that this choice works best. As for why the shortcut carries x rather than 0.5x or something else, the paper's answer is roughly that, in practice, the target function H(x) that machine learning has to fit is often very close to the identity mapping, though I still find this explanation hard to fully accept.
For the second question, there are three explanations:
1. Optimizing F(x) = 0 has a natural advantage, because network weights are usually initialized near zero.
An analogy: suppose the function to be fitted is (close to) a straight line, here H(x). Stacking the straight line carried by the shortcut (x) with some tiny polyline corrections (F(x)) is certainly easier to optimize than fitting the target from scratch with polylines or curves alone.
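In symbols, point 1 says the following (a small worked restatement, under the near-zero-initialization assumption above and the near-identity-target claim from the previous section):

```latex
% At initialization the weights are near zero, so each residual block starts
% out as (almost) the identity mapping -- which is already close to a
% near-identity target H(x); the optimizer only has to learn a small
% perturbation F(x) around zero.
\[
  W_i \approx 0 \;\Longrightarrow\; F(x,\{W_i\}) \approx 0
  \;\Longrightarrow\; H(x) = F(x) + x \approx x .
\]
```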
2. Consider, for example, mapping an input 5 to an output 5.1.
With a plain network, the layer must learn F'(5) = 5.1.
After introducing the residual, H(5) = 5.1 and H(5) = F(5) + 5, so F(5) = 0.1.
Now suppose the target output changes slightly, from 5.1 to 5.2. The plain mapping F' changes by only about 2% (5.1 → 5.2), whereas the residual mapping F changes by 100% (0.1 → 0.2). The residual thus amplifies small changes in the output, making the weights far more sensitive to adjustment during training.
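Written out, the relative changes of the two mappings when the target output moves from 5.1 to 5.2 are:

```latex
% Plain network: the mapping itself must move from 5.1 to 5.2.
% Residual network: only F must move from 0.1 to 0.2.
\[
  \frac{|5.2 - 5.1|}{5.1} \approx 2\% ,
  \qquad
  \frac{|0.2 - 0.1|}{0.1} = 100\% .
\]
```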
3. Another article understands residual networks from a different perspective: as a voting system.
Figure 6. A residual network can be unrolled into a combination of many paths of different lengths
As shown in Figure 6, a residual network is in fact a combination of many parallel sub-networks. So although ResNet appears very deep on the surface, most of the paths in this combination are actually of medium length.
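As a small worked example of this unrolling (my own illustration, in operator notation), three stacked residual blocks give rise to 2^3 = 8 parallel paths:

```latex
% Three stacked residual blocks:
%   y_1 = y_0 + F_1(y_0),  y_2 = y_1 + F_2(y_1),  y_3 = y_2 + F_3(y_2).
% Unrolling the recursion, every block can either be traversed (F_i) or
% skipped (identity), so the signal reaches y_3 along 2^3 = 8 distinct
% paths whose lengths range from 0 to 3 blocks.
\[
  y_3 = \bigl(\mathrm{Id} + F_3\bigr)\bigl(\mathrm{Id} + F_2\bigr)\bigl(\mathrm{Id} + F_1\bigr)\,y_0
\]
```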
Figure 7 multiplies the number of paths of each length by the gradient magnitude carried by paths of that length, thereby identifying the paths that actually contribute to ResNet's training.
Figure 7. The paths that actually contribute are mostly shorter than 20 layers
Therefore, "resnet only looks very deep on the surface, in fact the network is very shallow." "ResNet does not really solve the problem of the gradient of the depth network, its essence is a multiplayer voting system." Code Implementation
The authors have released the Caffe network models on GitHub, together with links to third-party implementations on other platforms.
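As one example of the kind of third-party port mentioned above (an illustration, not part of the original Caffe release), the torchvision library ships the ResNet family:

```python
import torch
from torchvision import models

# Build a 50-layer ResNet with the architecture from the paper; weights are
# randomly initialized here (pretrained ImageNet weights can also be
# downloaded through torchvision if desired).
resnet50 = models.resnet50()

# A single 224x224 RGB image -> 1000 ImageNet class scores.
dummy = torch.randn(1, 3, 224, 224)
print(resnet50(dummy).shape)  # torch.Size([1, 1000])
```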
"1" he K, Zhang X, Ren S, et al. Deep residual learning for image recognition[c]//proceedings of the IEEE Conference on Computer vision and Pattern recognition. 2016:770-778.