Paper study: Deep Residual Learning for Image Recognition

Directory

    • I. Overview
    • II. Degradation
    • III. Solution & Deep Residual Learning
    • IV. Implementation & Shortcut Connections

Home page:
https://github.com/KaimingHe/deep-residual-networks

TensorFlow implementation:
https://github.com/tensorpack/tensorpack/tree/master/examples/ResNet

In fact, TensorFlow has built-in ResNet:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/slim/python/slim/nets/resnet_v1.py

The paper won the CVPR 2016 Best Paper Award and, as of 2018, had been cited over 12,900 times.

Problem solved: make deep networks easier to train.

To ease the training of networks that are substantially deeper than those used previously.

I. Overview

First, stacking more layers does enrich the features a network can extract.

Deep networks naturally integrate low/mid/high-level features and classifiers in an end-to-end multi-layer fashion, and the "levels" of features can be enriched by the number of stacked layers (depth).

But the main difficulty with very deep networks is that gradients vanish or explode:

An obstacle to answering this question is the notorious problem of vanishing/exploding gradients [1, 8], which hamper convergence from the beginning.

Earlier work tackled this mainly with normalization layers and careful initialization:

This problem, however, has been largely addressed by normalized initialization and intermediate normalization layers, which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22].

For the specifics of why normalization layers speed up training, refer to the related blog posts and papers.
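As a concrete illustration (a minimal sketch, not code from the paper; the function name, depth, filter counts, and CIFAR-like input shape are all assumptions), this is roughly what a plain "tens of layers" stack with normalized (He) initialization and intermediate normalization layers looks like in tf.keras:

```python
import tensorflow as tf

def plain_stack(depth=20, filters=64, num_classes=10):
    """A plain (non-residual) conv stack; each conv is followed by a
    normalization layer so SGD can start converging despite the depth."""
    inputs = tf.keras.Input(shape=(32, 32, 3))  # assumed CIFAR-like input
    x = inputs
    for _ in range(depth):
        x = tf.keras.layers.Conv2D(
            filters, 3, padding="same", use_bias=False,
            kernel_initializer="he_normal")(x)       # normalized initialization
        x = tf.keras.layers.BatchNormalization()(x)  # intermediate normalization layer
        x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(num_classes)(x)
    return tf.keras.Model(inputs, outputs)
```

With the normalization layers in place, a stack of this depth can be trained with plain SGD and backpropagation, which is exactly the point made in the quote above.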

When the network gets even deeper, a new problem arises. The paper calls it degradation:

As accuracy saturates, the training error of a deeper network becomes higher than that of a shallower network.
Experiments show that this degradation becomes more and more severe as the network deepens.

Is this caused by overfitting?
If it were overfitting, the training error should not rise as the network deepens (the training error should stay very low), so degradation is not an overfitting problem.

We continue to study the problem.

II. Degradation

We first train a shallower architecture, which can output the desired results.
Then we copy the shallower architecture and add one or more layers on top of it to get a deeper model.

We then train the deeper model.
Ideally, the added layers only need to implement the identity mapping, so the training error of the deeper model should be no higher than that of the shallower one.

However, experiments show that the deeper model either takes an unreasonably long time to train or ends up worse than expected.
This is an experimental demonstration of the deep-network degradation problem.
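As a toy illustration of the argument (a hypothetical sketch, not an experiment from the paper; the layer sizes are arbitrary), appending a layer whose weights are exactly the identity leaves the shallower model's outputs unchanged, so in principle the deeper model could always match the shallower one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "shallower architecture": a single linear layer + ReLU.
W_shallow = rng.normal(size=(16, 16))
def shallow(x):
    return np.maximum(W_shallow @ x, 0.0)

# "Deeper model": copy the shallow net and add one more layer on top.
# If the added layer's weights were exactly the identity, the deeper model
# would reproduce the shallow one, so its training error could not be worse;
# yet in practice solvers fail to find a comparably good solution.
W_added = np.eye(16)
def deeper(x):
    return np.maximum(W_added @ shallow(x), 0.0)

x = rng.normal(size=16)
print(np.allclose(shallow(x), deeper(x)))  # True: identical by construction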

III. Solution & Deep Residual Learning

To solve the degradation problem, the paper introduces deep residual learning. The fundamental idea is:

Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping.

For example, assuming the desired underlying mapping is \(\mathscr{H}(\mathrm{x})\), the mapping we actually let the stacked nonlinear layers learn is:
\[\mathscr{F}(\mathrm{x}) := \mathscr{H}(\mathrm{x}) - \mathrm{x}\]

Go back to the example in the previous section.
We want the added layers to learn the identity mapping, which is still very hard to train because they are stacks of nonlinear layers.
However, if we instead learn the residual mapping, the target becomes an all-zero residual, which is obviously much easier.
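To make the step explicit: if the mapping the added layers should realize is the identity, the residual they have to learn is exactly zero,

\[\mathscr{H}(\mathrm{x}) = \mathrm{x} \quad\Longrightarrow\quad \mathscr{F}(\mathrm{x}) = \mathscr{H}(\mathrm{x}) - \mathrm{x} = 0,\]

and pushing the weights of a few stacked layers toward zero is a far easier optimization target than reproducing the identity through them.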

The idea is similar in spirit to SVM, yet you just wouldn't think of it!!!

IV. Implementation & Shortcut Connections

Now that we have the idea, how do we actually implement it?
One has to say it: Kaiming He (He Dashen) is just amazing!!!!

Back to the example above. Assume:

    1. The target mapping of the added layers is \(\mathscr{H}\);
    2. The output of the original shallower architecture, \(\mathrm{x}\), is the input to \(\mathscr{H}\).

To force the stacked nonlinear layers to learn the residual, we treat their output as the residual \(\mathscr{F}(\mathrm{x})\).
We then add the original input \(\mathrm{x}\) to that output through a shortcut connection, so the block as a whole produces \(\mathscr{F}(\mathrm{x}) + \mathrm{x} = \mathscr{H}(\mathrm{x})\) before the loss is computed.
The resulting building block is the residual block with a shortcut connection shown in Figure 2 of the paper.

Such a shortcut connection is easy to add (it introduces no extra parameters), and the backpropagation algorithm can still be applied to the whole network as usual.
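Below is a minimal sketch of such a residual building block in tf.keras, written in the spirit of the paper's basic two-layer block; it is an assumed re-implementation, not the authors' code, and the helper name `residual_block` and the projection-shortcut choice for mismatched shapes are my own assumptions.

```python
import tensorflow as tf

def residual_block(x, filters, stride=1):
    """Two 3x3 conv layers learning F(x), plus a shortcut carrying x;
    the block outputs F(x) + x followed by a ReLU."""
    shortcut = x
    y = tf.keras.layers.Conv2D(filters, 3, strides=stride, padding="same",
                               use_bias=False)(x)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.ReLU()(y)
    y = tf.keras.layers.Conv2D(filters, 3, strides=1, padding="same",
                               use_bias=False)(y)
    y = tf.keras.layers.BatchNormalization()(y)

    # When the shapes of F(x) and x differ, use a 1x1 projection shortcut;
    # otherwise the shortcut is a parameter-free identity.
    if stride != 1 or shortcut.shape[-1] != filters:
        shortcut = tf.keras.layers.Conv2D(filters, 1, strides=stride,
                                          use_bias=False)(shortcut)
        shortcut = tf.keras.layers.BatchNormalization()(shortcut)

    y = tf.keras.layers.Add()([y, shortcut])  # F(x) + x
    return tf.keras.layers.ReLU()(y)
```

Stacking such blocks gives a ResNet-style network; because the identity shortcut adds no parameters, gradients also flow straight through it during backpropagation.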

Of course, there is no detailed theoretical analysis of why learning an all-zero residual is simpler; the claim rests mainly on extensive experimental evidence.

The training curves in the paper show that the degradation visible for the plain networks (left graph) is effectively eliminated for the residual networks (right graph).
