Target: How to train a very deep neural network. Deep networks suffer from problems such as vanishing/exploding gradients, which make them hard to train. The authors borrow an LSTM-like idea: add gates that control the ratio between the transformed output and the untransformed input. The resulting architecture is called a Highway Network. As to why it works... probably for the same reason LSTMs work.
Method: Start from a plain feed-forward network, where each layer applies a transformation H mapping the input x to the output y; H usually consists of an affine transformation followed by a nonlinearity:

y = H(x, W_H)
On top of this, the highway network adds two gates: 1) T, the transform gate, and 2) C, the carry gate. With the gates added, the layer output becomes:

y = H(x, W_H) · T(x, W_T) + x · C(x, W_C)
T and C control the ratio between the untransformed input x and the transformed output H. To simplify, the authors set C = 1 - T, giving

y = H(x, W_H) · T(x, W_T) + x · (1 - T(x, W_T))
In the two extreme cases, the value of y is:

y = x             if T(x, W_T) = 0
y = H(x, W_H)     if T(x, W_T) = 1

and the corresponding Jacobian is

dy/dx = I             if T(x, W_T) = 0
dy/dx = H'(x, W_H)    if T(x, W_T) = 1
That leaves one question: what form does the transform gate take? The authors use the same recipe as LSTM gates, an affine transformation followed by a sigmoid:

T(x) = σ(W_T^T · x + b_T)
Here the bias b_T is initialized to a negative value (around -1 to -3), so the carry gate starts out large and the output y is initially biased toward x. In addition, since each highway layer requires the input x to have the same size as the output y, a plain projection layer may be needed between blocks to map one layer's output to the input size of the next.
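For concreteness, here is a minimal sketch of one highway layer in PyTorch (the class and parameter names are my own, not the authors' released code):

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """One highway layer: y = H(x) * T(x) + x * (1 - T(x))."""
    def __init__(self, dim, gate_bias=-2.0):
        super().__init__()
        self.H = nn.Linear(dim, dim)   # affine part of the transform H
        self.T = nn.Linear(dim, dim)   # affine part of the transform gate
        # Negative bias so the carry gate (1 - T) dominates early in training.
        nn.init.constant_(self.T.bias, gate_bias)

    def forward(self, x):
        h = torch.relu(self.H(x))      # H = affine + nonlinearity (ReLU, as in the experiments)
        t = torch.sigmoid(self.T(x))   # T = affine + sigmoid
        return h * t + x * (1.0 - t)   # gated mix of transform and carry
```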
Experiment: All experiments use SGD with momentum and an exponentially decaying learning rate; each layer's H is an affine transformation followed by a ReLU. The paper provides source code: http://people.idsia.ch/~rupesh/very_deep_learning/ . The first experiment varies the number of layers.
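A rough sketch of that training setup, reusing the HighwayLayer above; the depth, width, and hyperparameter values here are placeholders, not the paper's exact configurations:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy data standing in for MNIST-style inputs (784 features, 10 classes).
data = TensorDataset(torch.randn(256, 784), torch.randint(0, 10, (256,)))
train_loader = DataLoader(data, batch_size=64, shuffle=True)

# Input projection -> stack of highway layers -> classifier.
model = nn.Sequential(
    nn.Linear(784, 50),
    *[HighwayLayer(50) for _ in range(10)],
    nn.Linear(50, 10),
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)      # SGD with momentum
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.98)  # exponential decay
criterion = nn.CrossEntropyLoss()

for epoch in range(5):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()   # decay the learning rate once per epoch
```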
Here "highway" refers to this method and "plain" refers to an ordinary network. As depth increases, the highway networks perform much better than plain ones, which indicates that adding the transform gate is effective. There are also experiments relating accuracy to the number of layers and the number of parameters:
Analysis: First, look at the gate biases and activations in each layer.
The first column shows the gate biases. On the CIFAR dataset the bias keeps increasing with depth, which means the first few layers are dominated by the original input, while the later layers are influenced more by the transform H. The second and third columns show the transform gate outputs: most gates are closed, so the input is passed straight through to the output, and only a few are active. The last column shows the block outputs; the input changes very little from layer to layer, and within a block the sign pattern stays roughly the same. These results make the architecture look like a skip connection: the input is not reshaped at every layer, but is carried forward unchanged for many layers and transformed only at certain layers. It is a bit like a highway, where most cars stay in one lane and only occasionally change lanes. Is this "lane change" point fixed? No: different inputs still choose different layers at which to switch, as the figure shows.
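As an aside, this kind of per-layer gate statistic is easy to compute with the HighwayLayer sketch above (the function name and model layout are my assumptions, not the paper's analysis code):

```python
import torch

@torch.no_grad()
def mean_gate_per_layer(model, x):
    """Mean transform-gate activation of each highway layer for a batch x."""
    means = []
    for layer in model:
        if isinstance(layer, HighwayLayer):
            t = torch.sigmoid(layer.T(x))   # transform-gate output for this layer's input
            means.append(t.mean().item())   # values near 0 mean the input is mostly carried through
        x = layer(x)                        # propagate to obtain the next layer's input
    return means
```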
Another question: since so few gates are really active, does that mean many layers contribute nothing? Answer: it depends on the problem. For MNIST, which is relatively simple, up to 60% of the layers can be removed and the results are still satisfactory. For a more complex problem such as CIFAR, removing layers at random causes a significant drop in performance, which also shows that depth matters more for complex problems (hardly a surprise).
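The lesioning idea can be sketched as follows, again assuming the Sequential highway model above; removing a layer means letting its input pass through unchanged, approximated here with nn.Identity:

```python
import copy
import torch.nn as nn

def lesion(model, layer_index):
    """Return a copy of the model with one highway layer replaced by the identity."""
    lesioned = copy.deepcopy(model)
    assert isinstance(lesioned[layer_index], HighwayLayer)
    lesioned[layer_index] = nn.Identity()   # the lesioned layer now just carries its input
    return lesioned
```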
Summary: This paper is not so much about how to build a deeper neural network as about how to route information through the network: different pieces of information should be transformed at different layers rather than all at the same layer. Unlike an ordinary skip connection, this cross-layer routing is not fixed but learned, so it should adapt better to the problem.
"Paper notes" Training Very deep Networks-highway Networks