Overview
This post discusses why deep neural networks are hard to train, how Highway Networks ease that difficulty, and how to implement Highway Networks in PyTorch.
I. The Relationship Between Highway Networks and Deep Networks
Deep neural networks perform better than shallow ones and have achieved strong results in many areas, especially image processing. However, as depth increases, so do the problems of training, the best known being the vanishing gradient problem, which makes deep networks difficult to train. Highway Networks, a network structure proposed by Rupesh Kumar Srivastava in 2015 and inspired by the LSTM gate mechanism, addresses this difficulty well: it lets information pass unimpeded across the layers of a deep network, which effectively alleviates the gradient problem, so that a deep network is no longer limited to the performance of a shallow one.
II. The Vanishing/Exploding Gradient Problem in Deep Networks
Let's first look at a simple deep neural network (with just a few hidden layers).
First write out the formula for each layer: with z_i = W_i * a_(i-1) and a_i = f(z_i), the chain rule gives the gradient of the loss with respect to W_1 as a product with one factor per layer:

dL/dW_1 = dL/da_4 * f'(z_4) W_4 * f'(z_3) W_3 * f'(z_2) W_2 * f'(z_1) * x

The weights are then updated with the gradient g(t):

W = W - lr * g(t)
The formula above covers only four hidden layers. When the depth reaches dozens or even hundreds of layers, the gradient is multiplied by one such factor for every layer it passes through during backpropagation. When those factors are < 1, the gradient is nearly zero by the time it reaches the layers near the input; for example, g(t) = 0.9^100 is already vanishingly small. Those early layers stop updating and act as little more than a fixed mapping of the input x, while only the layers near the output can learn normally. Conversely, when the factors are > 1, the gradient explodes: again only the layers near the output learn normally, while the gradients in the earlier hidden layers become enormous.
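A tiny numeric sketch of this multiplicative effect, in plain Python with hypothetical per-layer factors:

```python
# Vanishing/exploding gradients: the gradient reaching the first layer is a
# product of one factor per layer; here every factor is taken to be the same.
def gradient_scale(per_layer_factor, num_layers):
    scale = 1.0
    for _ in range(num_layers):
        scale *= per_layer_factor
    return scale

print(gradient_scale(0.9, 100))  # ~2.7e-05: the gradient vanishes
print(gradient_scale(1.1, 100))  # ~1.4e+04: the gradient explodes
```

With 100 layers, even a mild per-layer factor of 0.9 shrinks the gradient by four orders of magnitude, which is exactly the 0.9^100 example above.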
III. Highway Networks Formulas
Notation
The (.) operator denotes element-wise multiplication.
The sigmoid function: sigmoid(x) = 1 / (1 + e^(-x))
Highway Networks Formula
For an ordinary neural network, the input x is transformed to the output y by a nonlinear function H (equation 1 omits the bias):

y = H(x, W_H)    (1)

However, H is not limited to an activation function; it can also take other forms, such as a convolutional or recurrent layer.
A highway network adds two nonlinear gating layers: T (the transform gate) and C (the carry gate). In plain terms, T controls how much of the transformed input (e.g. the output of a convolutional or recurrent layer) passes through, and C controls how much of the original input x is carried over unchanged, where T = sigmoid(W_T x + b_T):

y = H(x, W_H) . T(x, W_T) + x . C(x, W_C)    (2)
For computational convenience, C = 1 - T is defined here, giving:

y = H(x, W_H) . T(x, W_T) + x . (1 - T(x, W_T))    (3)
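As a framework-free sketch of equation (3) in NumPy (the article's own code uses PyTorch): here H is assumed to be a tanh layer, and all the names and shapes are illustrative, not the article's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """One highway layer: y = H(x) * T(x) + x * (1 - T(x)), i.e. equation (3)."""
    H = np.tanh(x @ W_H + b_H)      # transformed input (H could be any layer)
    T = sigmoid(x @ W_T + b_T)      # transform gate
    return H * T + x * (1.0 - T)    # carry gate is C = 1 - T

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((4, d))
W_H, b_H = rng.standard_normal((d, d)), np.zeros(d)
W_T = rng.standard_normal((d, d))

# A very negative gate bias drives T -> 0, so y ~= x (pure carry).
y_carry = highway_layer(x, W_H, b_H, W_T, b_T=np.full(d, -50.0))
# A very positive gate bias drives T -> 1, so y ~= H(x) (ordinary layer).
y_plain = highway_layer(x, W_H, b_H, W_T, b_T=np.full(d, 50.0))
```

The two extreme gate biases reproduce the special cases the article discusses: T = 0 passes the input through untouched, and T = 1 reduces the layer to an ordinary network.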
Note that the dimensions of x, y, H, and T must all match. To keep them consistent, you can adopt a sub-sampling or zero-padding strategy, or use an ordinary linear layer to change the dimension.
Comparing the formulas, equation (3) is more flexible than equation (1). Consider the special cases: when T = 0, y = x, the original input is passed through entirely unchanged; when T = 1, y = H, the input is fully transformed and nothing of the original is preserved, which is exactly an ordinary neural network layer.
IV. Highway BiLSTM Networks
Highway BiLSTM Networks Structure Diagram
The figure below is the Highway BiLSTM Networks structure diagram:
Input: the word vectors of the input sentence
B: in this task, a bidirectional LSTM, corresponding to the H in formula (2)
T: the T in formula (2), i.e. the transform gate of the highway network
C: the C in formula (2), i.e. the carry gate of the highway network
Layer = n: the nth layer of the highway network
Highway: each box represents one layer of Highway Networks
In this diagram, the output of highway layer n-1 serves as the input of layer n.
Highway BiLSTM Networks Demo
To build a neural network in PyTorch, you generally inherit from the nn.Module class and implement the forward() function. To build Highway BiLSTM Networks, I wrote two classes and used nn.ModuleList to connect them:
```python
class HBiLSTM(nn.Module):
    def __init__(self, args):
        super(HBiLSTM, self).__init__()
        ...

    def forward(self, x):
        # implement the Highway BiLSTM Networks formula
        ...
```
```python
class HBiLSTM_model(nn.Module):
    def __init__(self, args):
        super(HBiLSTM_model, self).__init__()
        ...
        # args.layer_num_highway is the number of Highway BiLSTM layers
        self.highway = nn.ModuleList([HBiLSTM(args) for _ in range(args.layer_num_highway)])
        ...

    def forward(self, x):
        ...
        # call the forward() of each HBiLSTM layer in turn
        for current_layer in self.highway:
            x, self.hidden = current_layer(x, self.hidden)
```
We implement the Highway BiLSTM Networks formula in the forward() function of the HBiLSTM class.
First, let's compute H. As mentioned above, H can be a convolution or an LSTM; here normal_fc is the H we need:
```python
x, hidden = self.bilstm(x, hidden)
# torch.transpose swaps the two given dimensions of the tensor
normal_fc = torch.transpose(x, 0, 1)
```
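As a quick illustration of what this transpose does, here is a NumPy equivalent (shapes are illustrative, not taken from the article's model):

```python
import numpy as np

# An LSTM commonly returns (seq_len, batch, hidden); swapping the first two
# axes gives (batch, seq_len, hidden), like torch.transpose(x, 0, 1).
x = np.zeros((7, 2, 16))         # (seq_len, batch, hidden)
normal_fc = np.swapaxes(x, 0, 1)
print(normal_fc.shape)  # (2, 7, 16)
```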
As mentioned above, the dimensions of x, y, H, and T must be consistent. Of the strategies above, here we use an ordinary Linear layer to convert the dimension:
```python
source_x = source_x.contiguous()
information_source = source_x.view(source_x.size(0) * source_x.size(1), source_x.size(2))
information_source = self.gate_layer(information_source)
information_source = information_source.view(source_x.size(0), source_x.size(1), information_source.size(1))
```
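The reshape pattern here (flatten the sequence and batch dimensions, apply a linear map, reshape back) can be sketched framework-free with NumPy; the weight matrix W and all shapes are illustrative:

```python
import numpy as np

def apply_linear_per_timestep(x, W):
    """Apply a linear layer to every timestep: (seq, batch, in) -> (seq, batch, out)."""
    seq_len, batch, in_dim = x.shape
    flat = x.reshape(seq_len * batch, in_dim)   # merge seq and batch dims
    out = flat @ W                              # the linear (gate) layer
    return out.reshape(seq_len, batch, W.shape[1])

x = np.ones((5, 2, 4))
W = np.ones((4, 3))
y = apply_linear_per_timestep(x, W)
print(y.shape)  # (5, 2, 3)
```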
You can also choose the zero-padding strategy to keep the dimensions consistent:
```python
zeros = torch.zeros(source_x.size(0), source_x.size(1), carry_layer.size(2) - source_x.size(2))
source_x = Variable(torch.cat((zeros, source_x.data), 2))
```
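The zero-padding trick itself is framework-agnostic. A minimal NumPy sketch (the function name and shapes are illustrative):

```python
import numpy as np

def zero_pad_features(source_x, target_dim):
    """Pad the last (feature) dimension of source_x with zeros up to target_dim."""
    batch, seq_len, feat = source_x.shape
    zeros = np.zeros((batch, seq_len, target_dim - feat))
    # prepend zeros along the feature axis, like the torch.cat(..., 2) call above
    return np.concatenate((zeros, source_x), axis=2)

x = np.ones((2, 3, 4))
padded = zero_pad_features(x, 6)
print(padded.shape)  # (2, 3, 6)
```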
Once the dimensions are consistent, we can write the code according to the formula:
```python
# transformation gate layer, the T in the formula
transformation_layer = F.sigmoid(information_source)
# carry gate layer, the C in the formula
carry_layer = 1 - transformation_layer
# formula y = H * T + x * C
allow_transformation = torch.mul(normal_fc, transformation_layer)
allow_carry = torch.mul(information_source, carry_layer)
information_flow = torch.add(allow_transformation, allow_carry)
```
The final information_flow is our output, but its dimension may still need to be transformed to stay consistent.
For more information, please refer to GitHub: Highway Networks implement in Pytorch
V. Highway BiLSTM Networks Experimental Results
The task in this experiment is sentiment classification with Highway BiLSTM Networks (classifying the attitude of a sentence as positive or negative). The data comes from a Twitter sentiment classification data set; the following is the number of sentences per label in the data set:
The figure shows the test results of this task on the 2-class data set. In the figure, "1-300" means layer = 1 with a 300-dimensional BiLSTM in Highway BiLSTM Networks.
Experimental results: simply stacking bidirectional LSTM layers does not improve sentiment analysis performance; beyond 10 layers, the model does no better than random guessing. With highway networks, performance still declines gradually with depth, but the extent of the decline is significantly reduced.
References
Highway Networks (paper)
Training Very Deep Networks
Why deep neural networks are hard to train
Training Very Deep Networks - Highway Networks
Very deep learning with Highway Networks
Highway Networks Study Notes
Highway Networks Pytorch