Overview
This post discusses why deep neural networks are hard to train, how Highway Networks ease that difficulty, and how to implement Highway Networks in PyTorch.
I. The Relationship Between Highway Networks and Deep Networks
Deep neural networks perform better than shallow ones and have achieved strong results in many areas, especially image processing. However, as depth increases, so do the problems of training, the best known being the vanishing gradient problem, which makes deep networks difficult to train. Highway Networks, a network structure proposed by Rupesh Kumar Srivastava in 2015 and inspired by the LSTM gate mechanism, addresses this difficulty well: it lets information pass unimpeded across the layers of a deep network, which effectively alleviates the gradient problem, so that a deep network is no longer limited to the performance of a shallow one.
II. The Vanishing/Exploding Gradient Problem in Deep Networks
Let's first look at a simple deep neural network (with just a few hidden layers).
First write out the formula for each layer: with z_i = W_i * a_(i-1) and a_i = f(z_i), the chain rule gives the gradient of the loss with respect to W_1 as a product with one factor per layer:

dL/dW_1 = dL/da_4 * f'(z_4) W_4 * f'(z_3) W_3 * f'(z_2) W_2 * f'(z_1) * x

The weights are then updated with the gradient g(t):

W = W - lr * g(t)
The formula above covers only four hidden layers. When the depth reaches dozens or even hundreds of layers, the gradient is multiplied by one such factor for every layer it passes through during backpropagation. When those factors are < 1, the gradient is nearly zero by the time it reaches the layers near the input; for example, g(t) = 0.9^100 is already vanishingly small. Those early layers stop updating and act as little more than a fixed mapping of the input x, while only the layers near the output can learn normally. Conversely, when the factors are > 1, the gradient explodes: again only the layers near the output learn normally, while the gradients in the earlier hidden layers become enormous.
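A tiny numeric sketch of this multiplicative effect, in plain Python with hypothetical per-layer factors:

```python
# Vanishing/exploding gradients: the gradient reaching the first layer is a
# product of one factor per layer; here every factor is taken to be the same.
def gradient_scale(per_layer_factor, num_layers):
    scale = 1.0
    for _ in range(num_layers):
        scale *= per_layer_factor
    return scale

print(gradient_scale(0.9, 100))  # ~2.7e-05: the gradient vanishes
print(gradient_scale(1.1, 100))  # ~1.4e+04: the gradient explodes
```

With 100 layers, even a mild per-layer factor of 0.9 shrinks the gradient by four orders of magnitude, which is exactly the 0.9^100 example above.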
III. Highway Networks Formulas
Notation
The (.) operator denotes element-wise multiplication.
The sigmoid function: sigmoid(x) = 1 / (1 + e^(-x))
Highway Networks Formula
For an ordinary neural network, the input x is transformed to the output y by a nonlinear function H (equation 1 omits the bias):

y = H(x, W_H)    (1)

However, H is not limited to an activation function; it can also take other forms, such as a convolutional or recurrent layer.
A highway network adds two nonlinear gating layers: T (the transform gate) and C (the carry gate). In plain terms, T controls how much of the transformed input (e.g. the output of a convolutional or recurrent layer) passes through, and C controls how much of the original input x is carried over unchanged, where T = sigmoid(W_T x + b_T):

y = H(x, W_H) . T(x, W_T) + x . C(x, W_C)    (2)
For computational convenience, C = 1 - T is defined here, giving:

y = H(x, W_H) . T(x, W_T) + x . (1 - T(x, W_T))    (3)
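As a framework-free sketch of equation (3) in NumPy (the article's own code uses PyTorch): here H is assumed to be a tanh layer, and all the names and shapes are illustrative, not the article's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """One highway layer: y = H(x) * T(x) + x * (1 - T(x)), i.e. equation (3)."""
    H = np.tanh(x @ W_H + b_H)      # transformed input (H could be any layer)
    T = sigmoid(x @ W_T + b_T)      # transform gate
    return H * T + x * (1.0 - T)    # carry gate is C = 1 - T

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((4, d))
W_H, b_H = rng.standard_normal((d, d)), np.zeros(d)
W_T = rng.standard_normal((d, d))

# A very negative gate bias drives T -> 0, so y ~= x (pure carry).
y_carry = highway_layer(x, W_H, b_H, W_T, b_T=np.full(d, -50.0))
# A very positive gate bias drives T -> 1, so y ~= H(x) (ordinary layer).
y_plain = highway_layer(x, W_H, b_H, W_T, b_T=np.full(d, 50.0))
```

The two extreme gate biases reproduce the special cases the article discusses: T = 0 passes the input through untouched, and T = 1 reduces the layer to an ordinary network.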
Note that the dimensions of x, y, H, and T must all match. To keep them consistent, you can adopt a sub-sampling or zero-padding strategy, or use an ordinary linear layer to change the dimension.
Comparing the formulas, equation (3) is more flexible than equation (1). Consider the special cases: when T = 0, y = x, the original input is passed through entirely unchanged; when T = 1, y = H, the input is fully transformed and nothing of the original is preserved, which is exactly an ordinary neural network layer.
IV. Highway BiLSTM Networks
Highway BiLSTM Networks Structure Diagram
The figure below is the Highway BiLSTM Networks structure diagram:
Input: the word vectors of the input sentence
B: in this task, a bidirectional LSTM, corresponding to the H in formula (2)
T: the T in formula (2), i.e. the transform gate of the highway network
C: the C in formula (2), i.e. the carry gate of the highway network
Layer = n: the nth layer of the highway network
Highway: each box represents one layer of Highway Networks
In this diagram, the output of highway layer n-1 serves as the input of layer n.
Highway BiLSTM Networks Demo
To build a neural network in PyTorch, you generally inherit from the nn.Module class and implement the forward() function. To build Highway BiLSTM Networks, I wrote two classes and used nn.ModuleList to connect them:
```python
class HBiLSTM(nn.Module):
    def __init__(self, args):
        super(HBiLSTM, self).__init__()
        ...

    def forward(self, x):
        # implement the Highway BiLSTM Networks formula
        ...
```
```python
class HBiLSTM_model(nn.Module):
    def __init__(self, args):
        super(HBiLSTM_model, self).__init__()
        ...
        # args.layer_num_highway is the number of Highway BiLSTM layers
        self.highway = nn.ModuleList([HBiLSTM(args) for _ in range(args.layer_num_highway)])
        ...

    def forward(self, x):
        ...
        # call the forward() of each HBiLSTM layer in turn
        for current_layer in self.highway:
            x, self.hidden = current_layer(x, self.hidden)
```
We implement the Highway BiLSTM Networks formula in the forward() function of the HBiLSTM class.
First, let's compute H. As mentioned above, H can be a convolution or an LSTM; here normal_fc is the H we need:
```python
x, hidden = self.bilstm(x, hidden)
# torch.transpose swaps the two given dimensions of the tensor
normal_fc = torch.transpose(x, 0, 1)
```
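As a quick illustration of what this transpose does, here is a NumPy equivalent (shapes are illustrative, not taken from the article's model):

```python
import numpy as np

# An LSTM commonly returns (seq_len, batch, hidden); swapping the first two
# axes gives (batch, seq_len, hidden), like torch.transpose(x, 0, 1).
x = np.zeros((7, 2, 16))         # (seq_len, batch, hidden)
normal_fc = np.swapaxes(x, 0, 1)
print(normal_fc.shape)  # (2, 7, 16)
```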
As mentioned above, the dimensions of x, y, H, and T must be consistent. Of the strategies above, here we use an ordinary Linear layer to convert the dimension:
```python
source_x = source_x.contiguous()
information_source = source_x.view(source_x.size(0) * source_x.size(1), source_x.size(2))
information_source = self.gate_layer(information_source)
information_source = information_source.view(source_x.size(0), source_x.size(1), information_source.size(1))
```
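The reshape pattern here (flatten the sequence and batch dimensions, apply a linear map, reshape back) can be sketched framework-free with NumPy; the weight matrix W and all shapes are illustrative:

```python
import numpy as np

def apply_linear_per_timestep(x, W):
    """Apply a linear layer to every timestep: (seq, batch, in) -> (seq, batch, out)."""
    seq_len, batch, in_dim = x.shape
    flat = x.reshape(seq_len * batch, in_dim)   # merge seq and batch dims
    out = flat @ W                              # the linear (gate) layer
    return out.reshape(seq_len, batch, W.shape[1])

x = np.ones((5, 2, 4))
W = np.ones((4, 3))
y = apply_linear_per_timestep(x, W)
print(y.shape)  # (5, 2, 3)
```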
You can also choose the zero-padding strategy to keep the dimensions consistent:
```python
zeros = torch.zeros(source_x.size(0), source_x.size(1), carry_layer.size(2) - source_x.size(2))
source_x = Variable(torch.cat((zeros, source_x.data), 2))
```
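The zero-padding trick itself is framework-agnostic. A minimal NumPy sketch (the function name and shapes are illustrative):

```python
import numpy as np

def zero_pad_features(source_x, target_dim):
    """Pad the last (feature) dimension of source_x with zeros up to target_dim."""
    batch, seq_len, feat = source_x.shape
    zeros = np.zeros((batch, seq_len, target_dim - feat))
    # prepend zeros along the feature axis, like the torch.cat(..., 2) call above
    return np.concatenate((zeros, source_x), axis=2)

x = np.ones((2, 3, 4))
padded = zero_pad_features(x, 6)
print(padded.shape)  # (2, 3, 6)
```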
Once the dimensions are consistent, we can write the code according to the formula:
```python
# transformation gate layer, the T in the formula
transformation_layer = F.sigmoid(information_source)
# carry gate layer, the C in the formula
carry_layer = 1 - transformation_layer
# formula y = H * T + x * C
allow_transformation = torch.mul(normal_fc, transformation_layer)
allow_carry = torch.mul(information_source, carry_layer)
information_flow = torch.add(allow_transformation, allow_carry)
```
The final information_flow is our output, but its dimension may still need to be transformed to stay consistent.
For more information, please refer to GitHub: Highway Networks implement in Pytorch
V. Highway BiLSTM Networks Experimental Results
The task in this experiment is sentiment classification with Highway BiLSTM Networks (classifying the attitude of a sentence as positive or negative). The data comes from a Twitter sentiment classification data set; the following is the number of sentences per label in the data set:
The figure shows the test results of this task on the 2-class data set. In the figure, "1-300" means layer = 1 with a 300-dimensional BiLSTM in Highway BiLSTM Networks.
Experimental results: simply stacking bidirectional LSTM layers does not improve sentiment analysis performance; beyond 10 layers, the model does no better than random guessing. With highway networks, performance still declines gradually with depth, but the extent of the decline is significantly reduced.
References
Highway Networks (paper)
Training Very Deep Networks
Why deep neural networks are hard to train
Training Very Deep Networks - Highway Networks
Very deep learning with Highway Networks
Highway Networks Study Notes
Highway Networks Pytorch