"Paper notes" Training Very deep Networks-highway Networks

Last Update:2018-01-12 Source: Internet

Author: User


Target: How to train a very deep neural network. Depth makes a network hard to train because of problems such as vanishing gradients. The authors borrow an idea from LSTM: they add gates that control the ratio between the untransformed input and the transformed output. The resulting architecture is called the Highway Network. As for why it works, it is probably for the same reason that LSTM works.

Method: Start with a plain neural network. Each layer applies a mapping H to its input x to produce the output y. H usually consists of an affine transformation followed by a nonlinear transformation:

$$ y = H(x, W_H) $$

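On this basis, the highway network adds two gates: (1) T, the transform gate, and (2) C, the carry gate. With the gates added, the layer output becomes:

$$ y = H(x, W_H) \cdot T(x, W_T) + x \cdot C(x, W_C) $$

T and C control the ratio between H and x. To simplify, set C = 1 - T:

$$ y = H(x, W_H) \cdot T(x, W_T) + x \cdot (1 - T(x, W_T)) $$

In the two extreme cases, the value of y is:

$$ y = \begin{cases} x, & \text{if } T(x, W_T) = 0 \\ H(x, W_H), & \text{if } T(x, W_T) = 1 \end{cases} $$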

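The corresponding derivative is:

$$ \frac{dy}{dx} = \begin{cases} I, & \text{if } T(x, W_T) = 0 \\ H'(x, W_H), & \text{if } T(x, W_T) = 1 \end{cases} $$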

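As a minimal sketch of the gated output above, assuming C = 1 - T and treating the gate values as given, the following Python shows the two extreme cases. The function name `highway_combine` and the sample vectors are illustrative only and do not come from the paper or its source code.

```python
def highway_combine(h, x, t):
    """Mix the transformed output h with the input x, element by element.

    Implements y = h * t + x * (1 - t), i.e. the carry gate is c = 1 - t.
    All three arguments are equal-length lists of floats.
    """
    if not (len(h) == len(x) == len(t)):
        raise ValueError("h, x and t must have the same length")
    return [hi * ti + xi * (1.0 - ti) for hi, xi, ti in zip(h, x, t)]


x = [1.0, -2.0, 0.5]   # layer input
h = [0.3, 0.0, 4.0]    # stands in for H(x); made-up numbers

closed = highway_combine(h, x, [0.0, 0.0, 0.0])  # T = 0: y equals x
opened = highway_combine(h, x, [1.0, 1.0, 1.0])  # T = 1: y equals H(x)
mixed = highway_combine(h, x, [0.5, 0.5, 0.5])   # halfway between the two
```

With the gate fully closed the layer is an identity mapping, which is why the derivative in that case is the identity matrix.
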
One question remains: what form does the transform gate take? The authors use an affine transformation followed by a sigmoid, similar to the gates in LSTM:

$$ T(x) = \sigma(W_T^{\top} x + b_T) $$

where the bias b_T is initialized to a negative value (-1 to -3), so the carry gate starts out large and the output y is initially biased toward x. In addition, since each layer requires the input x to have the same size as the output y, a mapping layer may be needed between layers to map the previous layer's output to the input size of the next layer.

Experiment: All experiments use SGD with momentum and an exponentially decaying learning rate. Each H consists of an affine transformation and a ReLU. The paper provides source code: http://people.idsia.ch/~rupesh/very_deep_learning/

The first experiment varies the network depth.

"Highway" refers to this method and "plain" refers to a normal neural network. For deep networks, the highway results are much better, which suggests that adding the transform gate is effective. The paper also reports experiments relating accuracy to the number of layers and the number of parameters.

Analysis: First, look at the biases and activations of the gates.

The first column of the figure shows the gate biases. On the CIFAR dataset, the bias increases steadily with depth, which means that the first few layers are influenced more by the original input, while later layers are influenced more by the transformation H. The second and third columns show the outputs of the transform gates. Most gates are closed, so the input is passed directly to the output, and only a few gates are active. The last column shows the block outputs: the output changes little from layer to layer, and values within the same block mostly keep the same sign.

From these results, the behavior resembles a skip connection. An input does not affect every layer; instead it is passed unchanged through n layers and then transformed at some later layer. This is a bit like a highway, where most cars stay in one lane and only occasionally change lanes. So is the place where the lane change happens fixed? The answer is no: the data still determines where the path changes, as shown in the paper.
Another question: since so few gates are really active, does that mean that many layers do not contribute? The answer depends on the problem. MNIST is relatively simple, so even after removing 60% of the layers the results are still satisfactory. For a more complex problem such as CIFAR, casually deleting layers leads to a significant drop in performance, which shows that depth matters more for complex problems (which is hardly surprising).

Summary: This paper is less about how to construct a deeper neural network than about how to route information through the network. Different information should be activated at different layers, not all at the same layer. Unlike an ordinary skip connection, this cross-layer relationship is not fixed but learned, so it should adapt better to the problem.

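The layer described in these notes can be sketched end to end as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the dictionary layout, the layer width of 4, the uniform weight range of ±0.1, and the `lesion` flag are all hypothetical choices made for the example. Removing a layer is modeled here by forcing its transform gate to 0, so that the layer simply copies its input.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, x):
    """Return W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_layer(size, gate_bias=-2.0, rng=random):
    """Create one highway layer; the gate bias starts negative (carry-biased)."""
    def rand_matrix():
        # Small uniform weights: an assumption for this sketch only.
        return [[rng.uniform(-0.1, 0.1) for _ in range(size)]
                for _ in range(size)]
    return {
        "W_H": rand_matrix(), "b_H": [0.0] * size,
        "W_T": rand_matrix(), "b_T": [gate_bias] * size,
    }


def highway_layer(layer, x, lesion=False):
    """Compute y = H(x) * T(x) + x * (1 - T(x)).

    H is an affine map followed by ReLU, T is an affine map followed by
    a sigmoid. lesion=True forces T to 0, so the layer copies its input.
    Returns the output y and the gate values t.
    """
    h = [max(0.0, v) for v in affine(layer["W_H"], layer["b_H"], x)]
    if lesion:
        t = [0.0] * len(x)
    else:
        t = [sigmoid(v) for v in affine(layer["W_T"], layer["b_T"], x)]
    y = [hi * ti + xi * (1.0 - ti) for hi, xi, ti in zip(h, x, t)]
    return y, t


rng = random.Random(0)
layers = [init_layer(4, gate_bias=-2.0, rng=rng) for _ in range(10)]
x0 = [1.0, -0.5, 0.25, 2.0]

# Forward pass through the untrained 10-layer stack, recording the gates.
y = x0
gates = []
for layer in layers:
    y, t = highway_layer(layer, y)
    gates.append(t)

# Same stack with every layer lesioned: the input passes straight through.
y_lesioned = x0
for layer in layers:
    y_lesioned, _ = highway_layer(layer, y_lesioned, lesion=True)
```

With the negative gate bias, every transform gate in the untrained stack starts below 0.5, so each layer mostly carries its input forward. The fully lesioned stack returns the input unchanged.
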
This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
