Deep Residual Networks and Highway Networks


The two network structures discussed today are recent architectures proposed for image-processing problems; both mainly address the optimization difficulties encountered when training very deep networks. To tell the truth, neither model is mathematically complex in itself, but both work remarkably well in practice (the deep residual network helped Microsoft's team win the ILSVRC 2015 image-recognition competition by a wide margin), which illustrates that deep learning is a practice-driven discipline: in this field, practice is the only criterion for testing truth. (Many new structures arise simply because they work well in practice; experts then wrap them in lofty concepts and hand them down to us in an impressive posture, and we admire them.)

First, the deep residual network. The architecture diagram below is taken from the paper "Deep Residual Learning for Image Recognition".

The network is said to be named "residual" because the function the network is supposed to learn is H(x); since the identity x in the figure skips across 2 layers, what the stacked layers actually fit is F(x) = H(x) − x. That is the origin of the term "residual", and it is the argument made in the paper. Personally, I feel that when the authors proposed this structure, its real contribution was breaking the traditional rule that layer n−1's output can only serve as layer n's input: now a layer's output can skip several layers and feed directly into a later layer. At first glance the structure looks unremarkable, but it is not.

The figure above shows the conceptual origin of the deep residual network: a 56-layer network compared with a 20-layer network. In theory, the solution space of the 56-layer network contains that of the 20-layer network, so the 56-layer network should perform at least as well. Yet over the course of training, the 56-layer network shows larger error than the 20-layer network on both the training set and the test set (this also rules out overfitting, since the 56-layer network's training error itself fails to come down). The reason is that although the 56-layer solution space contains the 20-layer one, we train with stochastic gradient descent, which usually finds a local rather than a global optimum; the 56-layer solution space is far more complex, so SGD is even less likely to find a good solution in it.
In fact, we can approach the construction from a different angle: if a 20-layer network already achieves good results, then when building a 56-layer network we could copy the first 20 layers from the trained 20-layer network and have the remaining 36 layers perform only the identity mapping; such a network would be at least no worse than the 20-layer one. This is the idea behind the deep residual network. Plainly put, it breaks the rule that each layer's input can only come from the previous layer's output: some layers' outputs skip several layers and feed directly into later layers. Such networks do indeed work very well. In real training, there are a couple of tricks to note:
1. Use batch normalization between layers; otherwise the depth of the network leads to vanishing gradients and training fails to converge.
2. As the paper notes, to preserve the per-layer complexity, the number of filters is doubled whenever the spatial size of the feature maps is halved.
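As a minimal sketch of the idea (plain NumPy with fully connected layers; the function and weight names are ours, not the paper's), a two-layer residual block simply adds the identity shortcut to the branch output, so a block with zeroed branch weights reduces to the identity mapping described above:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, b1, W2, b2):
    """Sketch of a 2-layer residual block: y = relu(F(x) + x),
    where the branch is F(x) = W2 @ relu(W1 @ x + b1) + b2."""
    f = W2 @ relu(W1 @ x + b1) + b2   # residual branch F(x)
    return relu(f + x)                # identity shortcut added before the final ReLU

# tiny usage example with random weights
rng = np.random.default_rng(0)
d = 4
x = rng.standard_normal(d)
W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
b1, b2 = np.zeros(d), np.zeros(d)
y = residual_block(x, W1, b1, W2, b2)
```

Note the key property: if the branch weights are all zero, the block outputs relu(x), i.e. it is (up to the ReLU) the identity, which is exactly the "extra layers do nothing harmful" argument.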

Having covered the deep residual network, let's talk about the highway network. This network originates from the paper "Highway Networks".
In a highway network, part of a layer's input goes through a non-linear transformation, while the other part crosses the layer directly without any conversion, as if travelling on a highway. How much of the data is transformed and how much passes straight through is determined by gates computed from a weight matrix and the input data. The construction formula for a highway layer is:
y = H(x, W_H) ⊙ T(x, W_T) + x ⊙ C(x, W_C)
The output y is the sum of two terms. T is called the transform gate and C the carry gate; both use the sigmoid activation function.
T(x, W_T) produces a vector (a_1, a_2, ..., a_n) whose entries are floating-point numbers in (0, 1), each representing the proportion of that component of y contributed by the transformed input;
C(x, W_C) is likewise a vector with entries in (0, 1), each representing the proportion of y carried over from x itself.
(For simplicity, C is often set to 1 − T, where 1 denotes an all-ones vector of the same dimension as T.) Note that since the products are element-wise (Hadamard), when C(x, W_C) = 1 − T(x, W_T), the vectors x, y, H(x, W_H), and T(x, W_T) must all have the same dimension. If we want to change the dimension of x from a to b, one method is zero-padding or down-sampling; another is to introduce an a×b transformation matrix that x is multiplied by each time.
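A minimal NumPy sketch of a fully connected highway layer with the C = 1 − T simplification (the tanh choice for H and all names here are our assumptions, not prescribed by the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, WH, bH, WT, bT):
    """Sketch of a highway layer with carry gate C = 1 - T:
    y = H(x) * T(x) + x * (1 - T(x)), all products element-wise."""
    H = np.tanh(WH @ x + bH)      # non-linear transform of the input
    T = sigmoid(WT @ x + bT)      # transform gate, entries in (0, 1)
    return H * T + x * (1.0 - T)  # gated mix of transformed and raw input

# tiny usage example
rng = np.random.default_rng(1)
d = 4
x = rng.standard_normal(d)
WH, WT = rng.standard_normal((d, d)), rng.standard_normal((d, d))
bH = np.zeros(d)
y = highway_layer(x, WH, bH, WT, np.zeros(d))
```

With a strongly negative gate bias bT the gate saturates near 0 and the layer passes x through unchanged; with a strongly positive bias it behaves like a plain non-linear layer. (The Highway Networks paper in fact recommends a negative initial gate bias so that early in training the layers default to carrying.)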

The highway network mainly solves the convergence problem of training many-layered deep networks: even with very many layers, a simple method such as backpropagation with plain gradient descent can still converge within a reasonable number of iterations, whereas a traditional network of the same depth is hard to get to converge. As shown in the following illustration:

When the network is very deep, the network that uses highway layers converges much more easily.

The original paper puts it this way:
"A highway layer can smoothly vary its behavior between that of a plain layer and that of a layer which simply passes its inputs through."
In other words, a highway layer transforms part of its input and lets the other part pass directly through; the overall effect is a learned balance between the two.

In a broad sense, the highway mechanism is more of an idea than a specific network: it can be used not only in fully connected networks but also in convolutional neural networks. The original paper says: "Convolutional highway layers are constructed similar to fully connected layers. Weight-sharing and local receptive fields are utilized for both H and T transforms. We use zero-padding to ensure that the block state and transform gate feature maps are the same size as the input."

In fact, both the deep residual network and the highway network allow some of the data to skip transformation layers and flow directly to later layers. The difference is that the highway network uses a learned gate to control how much data passes straight through each time, whereas the deep residual network simply lets the identity path carry the data through unchanged. From extensive experiments, my impression is that these two structures only show their power in very deep networks; if the network is relatively shallow, forcing their use rarely yields good results.
