Res-family: from ResNet to SE-ResNeXt



Liaowei
http://www.cnblogs.com/Matrix_Yao/

  • Res-family: from ResNet to SE-ResNeXt
    • ResNet (Dec)
      • Paper
      • Network Visualization
      • Problem Statement
      • Why
      • Conclusion
      • How to Solve it
      • Breakdown
        • Residual Module
        • Identity Shortcut and Projection Shortcut
        • Tricks
      • Mind Experiment
      • Another perspective
    • ResNet-v2 (Jul)
      • Paper
      • Network Visualization
      • Motivation
      • The Ways
        • The importance of the Identity Shortcut
        • Proposed Architecture
      • Mind Experiment
    • ResNeXt (Apr)
      • Paper
      • Network Visualization
      • Motivation
      • The Ways
        • Beauty comes from simple repetition
        • Determination of C and D
        • One more thing
        • Effect
      • Mind Experiment
    • SE-ResNet, SE-ResNeXt (2018 Apr)
      • Paper
      • Network Visualization
      • Motivation
      • How it works
      • Implementation Notes
      • Mind Experiment
ResNet (Dec)

Paper

Deep Residual Learning for Image Recognition

Network Visualization

dgschwend.github.io/netscope/#/preset/resnet-50

Problem Statement

A paradox between neural network depth and its representation capability.

    • Intuition
      • The deeper the network, the stronger its representation capability
    • Observation
      • Network performance degrades as the network gets deeper

Why
    • Overfitting? Ruled out, since the training error also degrades.
    • Vanishing gradients? The back-propagated gradients were checked, and BN also guards against this, so this is ruled out.
    • Conjecture: plain nets may have an exponentially low convergence rate, which hampers the reduction of the training error.
Conclusion

The current plain network design prevents us from pursuing greater representation capability by simply making the network deeper.

How to Solve it

By a constructive argument, it is possible to build a deeper model whose performance is at least equal to that of the corresponding shallower model: when the output of the add-on block is 0, the deeper net behaves exactly like the shallower net.


From the point of view of function approximation, suppose the block needs to approximate a function H(x). The shortcut already provides x, so the add-on branch only needs to approximate the residual F(x) = H(x) - x. This is essentially residual learning, or the idea behind boosting, and it is the basic idea of ResNet.

Breakdown

Residual Module


The block on the right is called the bottleneck architecture.

Identity Shortcut and Projection Shortcut

In the topology diagram above, solid lines represent identity shortcuts and dashed lines represent projection shortcuts. A projection shortcut is used when the operations inside the module change the dimensions of the feature map (height, width, or channel number), so a projection is needed to match the input to the module's output channel number.
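
To make the two shortcut types concrete, here is a minimal PyTorch-style sketch of a bottleneck residual block (the post names no framework; the class and parameter names are mine, not from the paper):

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """Bottleneck residual block: 1x1 reduce -> 3x3 -> 1x1 expand."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_ch)
        self.conv2 = nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_ch)
        self.conv3 = nn.Conv2d(mid_ch, out_ch, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut (1x1 conv) only when the dimensions change;
        # otherwise the shortcut is a plain identity.
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + self.shortcut(x))
```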

Tricks

The relationship between H/W and C is: every time the spatial resolution is downsampled (halved), C is multiplied by 2, so that the computational cost per block stays roughly constant.
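
For intuition, the standard ResNet-50 stage layout follows exactly this rule (a small illustrative listing; the stage names follow the paper's conv2_x..conv5_x convention):

```python
# (stage, output channels, total downsampling factor) in ResNet-50:
# each time H and W are halved, the channel count C doubles.
stages = [
    ("conv2_x", 256,  4),   # 56x56 feature maps for a 224x224 input
    ("conv3_x", 512,  8),   # 28x28
    ("conv4_x", 1024, 16),  # 14x14
    ("conv5_x", 2048, 32),  # 7x7
]
```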

Mind Experiment
      • Operator fusion
        • Vertical fusion
          • BN folding (see the folding sketch after this list)
            • Conv + BN + Scale/Shift → Conv
          • Conv + ReLU + Pooling
          • Conv + EltSum + ReLU
        • Horizontal fusion
          • Multi-branch fusion
      • Advanced techniques
        • Lossless topology compression
      • kernel = 1 pooling optimization
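
The BN-folding item above can be made concrete with a small sketch. At inference time BN is just a per-channel affine transform, so it can be folded into the preceding convolution's weights and bias. This is an illustrative numpy version (not the kernel code of any particular framework):

```python
import numpy as np

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold an inference-time BatchNorm into the preceding conv.

    w: conv weights, shape (out_ch, in_ch, kh, kw)
    b: conv bias, shape (out_ch,) (zeros if the conv had no bias)
    gamma, beta, mean, var: BN parameters/statistics, shape (out_ch,)
    Returns (w_folded, b_folded) such that
    conv(x, w_folded) + b_folded == bn(conv(x, w) + b).
    """
    scale = gamma / np.sqrt(var + eps)         # per-channel scale
    w_folded = w * scale[:, None, None, None]  # scale each output filter
    b_folded = (b - mean) * scale + beta       # shift the bias accordingly
    return w_folded, b_folded
```
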
Another perspective

We can treat ResNet as a stacked boosting ensemble.

ResNet-v2 (Jul)

Paper

Identity Mappings in Deep Residual Networks

Network Visualization

http://dgschwend.github.io/netscope/#/gist/6a771cf2bf466c5343338820d5102e66

Motivation

When we express ResNet as a general formula:

x_{l+1} = f(y_l), where y_l = h(x_l) + F(x_l, W_l)

In ResNet, h(x_l) = x_l is the so-called shortcut. Another term is "highway", because information on this path can be transmitted without attenuation, just like a freeway. But in ResNet, f is a ReLU, which loses part of the information, so the non-attenuated transmission is confined to a single block and cannot span the whole network. The conjecture is: if we also turn f into an identity mapping, will this highway be smoother and the results better?

The Ways

The importance of the Identity Shortcut

The authors try many alternatives and contrast them with the identity shortcut, finding that none of them works as well as the identity shortcut.

Proposed Architecture


The structure above is referred to as the pre-activation residual module: the earlier residual block uses a CBR (Conv, BN, ReLU) ordering, while this one uses a BRC (BN, ReLU, Conv) ordering. The activation comes before the conv, hence "pre-activation".
Therefore, ResNet-v2 can be expressed as:

x_{l+1} = x_l + F(x_l, W_l)

When the recursion is expanded, this becomes:

x_L = x_l + Σ_{i=l}^{L-1} F(x_i, W_i)

This is a pure additive model, even more like stacked boosting than the original ResNet. The resulting improvement is shown in the paper's experiments.
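
A minimal PyTorch-style sketch of the pre-activation (BRC) block described above (class and parameter names are mine, not from the paper):

```python
import torch.nn as nn

class PreActBlock(nn.Module):
    """Pre-activation residual block: BN -> ReLU -> Conv, twice,
    with a pure identity shortcut, i.e. x_{l+1} = x_l + F(x_l)."""
    def __init__(self, ch):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(ch)
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(ch)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv1(self.relu(self.bn1(x)))
        out = self.conv2(self.relu(self.bn2(out)))
        # No ReLU after the addition: the shortcut path stays a pure identity.
        return x + out
```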

Mind Experiment
  1. What is different about inference optimization for ResNet-v2 compared with ResNet?
    • New vertical fusions
      • BN + ReLU
      • Conv + EltSum
  2. Some readers may think that, since the two branches in diagram (b) are both scaled by 0.5, the block differs from the original residual block only by an overall factor of 0.5, which is mathematically equivalent, so why is the effect so much worse? That is only half the story, and it holds only for a single module in isolation.
    Once the modules are chained into a network, things change qualitatively. Without loss of generality, we formalize the constant-scaling module as follows (to simplify the derivation, the ReLU is replaced by an identity, which does not affect the final conclusion):
    x_{l+1} = 0.5 · x_l + 0.5 · F(x_l, W_l)
    When the recursion is expanded, it becomes:
    x_L = 0.5^{L-l} · x_l + Σ_{i=l}^{L-1} 0.5^{L-i} · F(x_i, W_i)
    Compared with the original expansion, constant scaling attenuates the input features exponentially, so the two are not equivalent at all (a small numeric check follows below).
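
As a quick numeric illustration of that exponential attenuation (assuming a scale of 0.5 per block and looking only at the coefficient carried by the input x_l):

```python
# Coefficient of the input x_l after k constant-scaling blocks
# (x_{l+1} = 0.5 * x_l + 0.5 * F(x_l)) versus k identity-shortcut blocks.
for k in (1, 5, 10, 50):
    print(f"after {k:2d} blocks: identity keeps 1.0, constant scaling keeps {0.5 ** k:.2e}")
```
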
ResNeXt (Apr)

Paper

Aggregated Residual Transformations for Deep Neural Networks

Network Visualization

chakkritte.github.io/netscope/#/preset/resnext-50

Motivation

With the rise of deep learning, research on visual recognition has shifted from feature engineering to network engineering, i.e., designing topologies. As depth grows, topology parameters (such as convolution kernel size, stride, channel number, etc.) become more and more difficult to determine. The success of VGG and ResNet shows that stacking blocks of the same shape can not only significantly reduce the number of hyper-parameters but also achieve SOTA results, which is a very promising direction (the feeling is similar to the idea of fractals). This is direction one.
The practice represented by GoogLeNet / GoogLeNet-v2 / Inception also shows that careful network design using the split-transform-merge strategy can achieve very good results. This is direction two.


ResNeXt's idea is to combine these two good ideas and see whether the results get even better. As the paper puts it: "We present a simple architecture which adopts VGG/ResNets' strategy of repeating layers, while exploiting the split-transform-merge strategy in an easy, extensible way."

The Ways

Beauty comes from simple repetition

If you do split-transform-merge in the GoogLeNet style, each branch needs its own convolution-kernel parameters and its own depth, so the hyper-parameters expand rapidly. We therefore need a way to do split-transform-merge without much growth in hyper-parameters, and this is achieved through simple repetition. Let's look at the design of the ResNeXt block:


You can see that this structure is one unit repeated 32 times, with the results summed. Each unit is a bottleneck: the input feature map is embedded into a 4-channel feature map via a 1x1 convolution, a 3x3 convolution is applied, and then it is expanded back. We call this a 32x4d structure. The 32 is a new degree of freedom introduced by ResNeXt, called cardinality; ResNeXt's name also stems from this, with the X referring to the neXt dimension. Finally, don't forget the residual shortcut.
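
A minimal PyTorch-style sketch of a 32x4d ResNeXt bottleneck in its grouped-convolution form (form C in the "One more thing" section below); the class and parameter names are mine, not from the paper:

```python
import torch.nn as nn

class ResNeXtBottleneck(nn.Module):
    """32x4d ResNeXt block: 1x1 reduce to C*d channels, grouped 3x3 conv
    with C groups (the 32 parallel paths), 1x1 expand, plus the shortcut."""
    def __init__(self, in_ch=256, cardinality=32, d=4, out_ch=256):
        super().__init__()
        mid = cardinality * d                 # 32 * 4 = 128 channels
        self.conv1 = nn.Conv2d(in_ch, mid, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid)
        self.conv2 = nn.Conv2d(mid, mid, 3, padding=1,
                               groups=cardinality, bias=False)
        self.bn2 = nn.BatchNorm2d(mid)
        self.conv3 = nn.Conv2d(mid, out_ch, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + x)             # identity shortcut
```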

Determination of C and D

To obtain a ResNeXt with a parameter count equivalent to the corresponding ResNet, we can determine the C and d of the ResNeXt block from the relation C · (256 · d + 3 · 3 · d · d + d · 256) ≈ 70k, the parameter count of the corresponding ResNet bottleneck block (using the 256-d template stage as the reference).
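
A small sketch of that bookkeeping (the helper name and the 256-d template width are assumptions for illustration):

```python
def block_params(c, d, width=256):
    """Approximate parameter count of one aggregated bottleneck block:
    C parallel paths of (1x1 width->d, 3x3 d->d, 1x1 d->width)."""
    return c * (width * d + 3 * 3 * d * d + d * width)

print(block_params(1, 64))   # ResNet bottleneck (C=1, d=64): ~70k parameters
print(block_params(32, 4))   # ResNeXt 32x4d:                 ~70k parameters
```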

One more thing

Furthermore, through equivalent transformations, ResNeXt can be regarded as a model that subsumes GoogLeNet-style multi-branch topologies, grouped convolution, and ResNet. This elevates the ResNeXt paper beyond a mere optimization of ResNet.


A is the original form of the ResNeXt module, B is the equivalent GoogLeNet-like form, and C is the equivalent grouped-convolution form. From a performance (speed) point of view, C is the best.

Effect

Finally, the authors report the top-1 error, from which a significant improvement can be seen.

Mind Experiment
    1. In its final realization, ResNeXt is basically the same as ResNet; the only difference is that the 3x3 convolution in the residual module is replaced by a grouped convolution.
SE-ResNet, SE-ResNeXt (2018 Apr)

Paper

Squeeze-and-Excitation Networks

Network Visualization

http://dgschwend.github.io/netscope/#/gist/4a41df70a3c7c41f97d775f991f036e3

Motivation

Beginning in 2017, when Google declared "Attention Is All You Need", the world's heroes raced to respond, and much of the work in 2018 builds on the attention mechanism. This paper can be regarded as one of them, because we can think of SENet as channel-wise attention: SENet adds a small extra branch after a regular operation to compute a channel-wise scale, and then multiplies the resulting value onto the corresponding channel.

How it works

SENet computes the channel-wise attention in two steps (a code sketch follows after the list):

    • Squeeze (the red box in the figure): the spatial dimensions of each input feature map are squeezed from H x W down to 1. This is done through global average pooling.

    • Excitation (the green box in the figure): a bottleneck structure is adopted to capture the inter-channel dependency, thereby learning a scale factor (or attention factor) for each channel.
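
A minimal PyTorch-style sketch of the SE module (squeeze by global average pooling, excitation by a two-FC bottleneck, then channel-wise rescaling); the class and parameter names are mine, not from the paper:

```python
import torch.nn as nn

class SEModule(nn.Module):
    def __init__(self, ch, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)   # H x W -> 1 x 1 per channel
        self.excite = nn.Sequential(             # bottleneck FC: ch -> ch/r -> ch
            nn.Linear(ch, ch // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(ch // reduction, ch),
            nn.Sigmoid())                         # per-channel scale in (0, 1)

    def forward(self, x):
        n, c, _, _ = x.shape
        s = self.squeeze(x).view(n, c)            # squeeze:    (N, C)
        s = self.excite(s).view(n, c, 1, 1)       # excitation: (N, C, 1, 1)
        return x * s                              # channel-wise rescaling
```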


The figure above shows SE-ResNet; you can see that the SE module is applied to the residual branch.


The figure above shows the network configuration; the numbers inside the square brackets after fc are the output dimensions of the two FC layers in the SE module.


Finally, from the results it can be seen that SENet improves both convergence speed and accuracy.

Implementation Notes
    1. The FC layers in SENet are equivalent to 1x1 convolutions, so they can be replaced by conv1x1.
    2. The channel-wise scale + element-wise sum can be fused into a channel-wise axpy (see the sketch below).
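
A plain sketch of that channel-wise axpy fusion (assumed shapes: x and residual are N x C x H x W, scale is N x C; the function name is mine):

```python
import torch

def se_axpy(scale, x, residual):
    """Fused channel-wise scale + element-wise sum:
    out[n, c, h, w] = scale[n, c] * x[n, c, h, w] + residual[n, c, h, w]."""
    return scale[:, :, None, None] * x + residual
```
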
Mind Experiment
    1. Q: The SE module is ultimately implemented as a channel-wise scale, which is the same mathematical operation as the scale in BN. Why can't the SE module be replaced by BN?
      A: BN only models the spatial dependency, while the SE module models not only the spatial dependency but also the inter-channel dependency.
