A note on LSTM examples in mxnet


Preface

Sequence problems are an interesting topic in their own right. While looking for material on LSTMs I could not find a systematic treatment; the early paper by Sepp Hochreiter and the thesis of Felix Gers are not exactly relaxed reading. I ended up starting from a 2015 review instead. It did not read very smoothly at first either, but after going through its first two parts and then coming back to the formulation section, things became much clearer.
I originally intended to write a program of my own, and found a reference implementation that is very short, fewer than 200 lines of Python in total, but extracting the structure from it took some effort. That reminded me that mxnet ships a similar example (example/bi-lstm-sort), so I dug it out to read. It builds the basic LSTM unit with symbol and then optimizes training with the bucketing feature. That felt like a good fit, and as a bonus it also shows how bucketing is used.

Code with Comments

This program uses symbol to build a memory cell and then uses that cell to assemble the complete symbol. I had previously assumed the LSTM unit would be a built-in symbol, but it turns out that in mxnet v1.0 the built-in LSTM symbols are still in the dev phase, so it is worth seeing how the time dependency is wired up by hand.

######## from example/bi-lstm-sort/lstm.py #############
import mxnet as mx
from collections import namedtuple

# state and parameter containers used below (defined at the top of lstm.py)
LSTMState = namedtuple("LSTMState", ["c", "h"])
LSTMParam = namedtuple("LSTMParam", ["i2h_weight", "i2h_bias",
                                     "h2h_weight", "h2h_bias"])

def lstm(num_hidden, indata, prev_state, param, seqidx, layeridx, dropout=0.):
    # build a single memory unit
    """LSTM Cell symbol"""
    if dropout > 0.:
        indata = mx.sym.Dropout(data=indata, p=dropout)
    i2h = mx.sym.FullyConnected(data=indata,
                                weight=param.i2h_weight,
                                bias=param.i2h_bias,
                                num_hidden=num_hidden * 4,
                                name="t%d_l%d_i2h" % (seqidx, layeridx))
    h2h = mx.sym.FullyConnected(data=prev_state.h,
                                weight=param.h2h_weight,
                                bias=param.h2h_bias,
                                num_hidden=num_hidden * 4,
                                name="t%d_l%d_h2h" % (seqidx, layeridx))
    gates = i2h + h2h
    slice_gates = mx.sym.SliceChannel(gates, num_outputs=4,
                                      name="t%d_l%d_slice" % (seqidx, layeridx))
    in_gate = mx.sym.Activation(slice_gates[0], act_type="sigmoid")
    in_transform = mx.sym.Activation(slice_gates[1], act_type="tanh")
    forget_gate = mx.sym.Activation(slice_gates[2], act_type="sigmoid")
    out_gate = mx.sym.Activation(slice_gates[3], act_type="sigmoid")
    next_c = (forget_gate * prev_state.c) + (in_gate * in_transform)
    next_h = out_gate * mx.sym.Activation(next_c, act_type="tanh")
    return LSTMState(c=next_c, h=next_h)

def bi_lstm_unroll(seq_len, input_size, num_hidden, num_embed, num_label, dropout=0.):
    embed_weight = mx.sym.Variable("embed_weight")
    cls_weight = mx.sym.Variable("cls_weight")
    cls_bias = mx.sym.Variable("cls_bias")
    last_states = []
    last_states.append(LSTMState(c=mx.sym.Variable("l0_init_c"),
                                 h=mx.sym.Variable("l0_init_h")))
    last_states.append(LSTMState(c=mx.sym.Variable("l1_init_c"),
                                 h=mx.sym.Variable("l1_init_h")))
    forward_param = LSTMParam(i2h_weight=mx.sym.Variable("l0_i2h_weight"),
                              i2h_bias=mx.sym.Variable("l0_i2h_bias"),
                              h2h_weight=mx.sym.Variable("l0_h2h_weight"),
                              h2h_bias=mx.sym.Variable("l0_h2h_bias"))
    backward_param = LSTMParam(i2h_weight=mx.sym.Variable("l1_i2h_weight"),
                               i2h_bias=mx.sym.Variable("l1_i2h_bias"),
                               h2h_weight=mx.sym.Variable("l1_h2h_weight"),
                               h2h_bias=mx.sym.Variable("l1_h2h_bias"))

    # embedding layer
    data = mx.sym.Variable('data')
    label = mx.sym.Variable('softmax_label')
    embed = mx.sym.Embedding(data=data, input_dim=input_size,
                             weight=embed_weight, output_dim=num_embed, name='embed')
    wordvec = mx.sym.SliceChannel(data=embed, num_outputs=seq_len, squeeze_axis=1)

    forward_hidden = []
    for seqidx in range(seq_len):
        hidden = wordvec[seqidx]
        next_state = lstm(num_hidden, indata=hidden,
                          prev_state=last_states[0],
                          param=forward_param,
                          seqidx=seqidx, layeridx=0, dropout=dropout)
        hidden = next_state.h
        last_states[0] = next_state
        forward_hidden.append(hidden)

    backward_hidden = []
    # as the folder name suggests, this is a bidirectional symbol, so there is a backward part (just a guess :))
    for seqidx in range(seq_len):
        k = seq_len - seqidx - 1
        hidden = wordvec[k]
        next_state = lstm(num_hidden, indata=hidden,
                          prev_state=last_states[1],
                          param=backward_param,
                          seqidx=k, layeridx=1, dropout=dropout)
        hidden = next_state.h
        last_states[1] = next_state
        backward_hidden.insert(0, hidden)

    hidden_all = []
    for i in range(seq_len):
        hidden_all.append(mx.sym.Concat(*[forward_hidden[i], backward_hidden[i]], dim=1))

    hidden_concat = mx.sym.Concat(*hidden_all, dim=0)
    pred = mx.sym.FullyConnected(data=hidden_concat, num_hidden=num_label,
                                 weight=cls_weight, bias=cls_bias, name='pred')

    label = mx.sym.transpose(data=label)
    label = mx.sym.Reshape(data=label, target_shape=(0,))
    sm = mx.sym.SoftmaxOutput(data=pred, label=label, name='softmax')
    return sm
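
As a quick sanity check (not part of the example; the sizes below are arbitrary values I picked for illustration), the unrolled symbol can be instantiated and inspected. Every weight appears only once in the argument list no matter how many time steps were unrolled, because all cells of one direction reuse the same mx.sym.Variable objects.

# a small inspection sketch; seq_len=5, vocabulary 100, 7 hidden units etc. are made up
sym = bi_lstm_unroll(seq_len=5, input_size=100,
                     num_hidden=7, num_embed=8, num_label=100)

# shared weights show up once each: embed_weight, l0_i2h_weight, l0_h2h_weight, l1_..., cls_...
print(sym.list_arguments())

# with the input shapes fixed, every other shape can be inferred
arg_shapes, out_shapes, aux_shapes = sym.infer_shape(
    data=(32, 5), softmax_label=(32, 5),
    l0_init_c=(32, 7), l0_init_h=(32, 7),
    l1_init_c=(32, 7), l1_init_h=(32, 7))
print(out_shapes)   # [(160, 100)], i.e. (seq_len * batch, num_label)
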
Embedding Op

Before turning to the API I had already seen this symbol and thought I understood what it did (wrongly, as it turns out: I thought it merely implemented an index/one-hot style encoding), yet I could not say what the operator was actually good for. The API documentation does not explain the idea behind the implementation, so I had to search around; an answer I found finally cleared it up. To summarize: it encodes the word table in a non-one-hot way, removing the independence between dimensions so that each dimension takes continuous values.

Note
    1. One question this raises is how the layer is optimized. I will check the paper when I get to it; my guess is that this layer implements its own parameter update (i.e. the embedding weight is trained like any other weight).
    2. Another question: since the encoding is not one-hot, how is the model output decoded? Judging from the code, the output stage still uses a softmax whose output is treated as one-hot, which sidesteps the problem.
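
The operator itself is simple to try out in isolation. A minimal sketch (the vocabulary size and embedding dimension here are assumptions, not the example's settings): a batch of integer word ids of shape (batch, seq_len) comes out as a (batch, seq_len, output_dim) tensor of learned, continuous-valued vectors.

import mxnet as mx

data = mx.sym.Variable("data")
embed = mx.sym.Embedding(data=data,
                         input_dim=100,   # vocabulary size (assumed)
                         output_dim=8,    # embedding dimension (assumed)
                         name="embed")

# a (batch, seq_len) batch of ids becomes (batch, seq_len, output_dim)
_, out_shapes, _ = embed.infer_shape(data=(32, 5))
print(out_shapes)   # [(32, 5, 8)]
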
Time Dependency

In this example only one layer of memory cells is used (although, according to the paper, even a single cell already carries a fairly large body of parameters). The complete symbol is constructed inside bi_lstm_unroll, and its time dependency is set up as follows. A complete input sequence (a sequence of vectors) is separated by SliceChannel into individual vectors, and the full symbol is then built according to the number of vectors obtained. Since that number is known at construction time, memory cells can be stacked one after another until every vector has been assigned its own processing cell. All cells of one direction are given the same group of parameters (l0_i2h_weight, l0_i2h_bias, and so on), which produces the effect of a recurrent computation (see the small sketch below).
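
A tiny illustration of this parameter sharing (not from the example; the names here are made up): reusing one mx.sym.Variable as the weight of several layers turns them into a single parameter of the final graph.

import mxnet as mx

# the same weight Variable used at two "time steps" -> one shared parameter
w = mx.sym.Variable("shared_weight")
h0 = mx.sym.FullyConnected(data=mx.sym.Variable("x0"), weight=w,
                           num_hidden=4, no_bias=True, name="t0_fc")
h1 = mx.sym.FullyConnected(data=mx.sym.Variable("x1"), weight=w,
                           num_hidden=4, no_bias=True, name="t1_fc")

# 'shared_weight' appears only once in the argument list
print(mx.sym.Group([h0, h1]).list_arguments())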

Note
    1. A further question is how variable-length input sequences are handled. This should be related to the bucketing mechanism, which I will look at later. My guess at how bucketing solves it, based on a blog post I read: one model is bound for each of a set of preset lengths, and since these lengths are discrete there is probably also padding involved; if padding is used, its effect on the training update also has to be handled. A rough sketch of the bucketing idea follows below.
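
To make that guess concrete, here is a rough sketch of how bucketing could be combined with bi_lstm_unroll through the module API; the bucket lengths and sizes are assumptions for illustration, and the real example wires this up through its own sym_gen and the iterator in sort_io.py.

import mxnet as mx

buckets = [5, 10, 20]      # assumed bucket lengths

def sym_gen(seq_len):
    # one unrolled symbol per bucket length; the weights are shared across
    # buckets because every symbol reuses the same Variable names (l0_i2h_weight, ...)
    sym = bi_lstm_unroll(seq_len, input_size=100,
                         num_hidden=7, num_embed=8, num_label=100)
    return sym, ('data',), ('softmax_label',)

mod = mx.mod.BucketingModule(sym_gen, default_bucket_key=max(buckets))
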
LSTM implementation

Let's look at how the memory cell is built and then unrolled over a stretch of time. Perhaps this example does not demand tearing into the details the way rcnn does (well, at least it is not as hard).
The lstm function implements the formulas on page 20 of A Critical Review of Recurrent Neural Networks for Sequence Learning, not the form described by Figure 3.1 on page 17 of Felix Gers's thesis. The former has a remark on that page: "These equations give the full algorithm for a modern LSTM ...". I will gladly type them out once more:
\[\begin{eqnarray}
g^{(t)} &=& \phi(w^{gx}x^{(t)} + w^{gh}h^{(t-1)} + b_g) \nonumber\\
i^{(t)} &=& \phi(w^{ix}x^{(t)} + w^{ih}h^{(t-1)} + b_i) \nonumber\\
f^{(t)} &=& \phi(w^{fx}x^{(t)} + w^{fh}h^{(t-1)} + b_f) \nonumber\\
o^{(t)} &=& \phi(w^{ox}x^{(t)} + w^{oh}h^{(t-1)} + b_o) \nonumber\\
s^{(t)} &=& g^{(t)}\odot i^{(t)} + s^{(t-1)}\odot f^{(t)} \nonumber\\
h^{(t)} &=& \phi(s^{(t)})\odot o^{(t)} \nonumber
\end{eqnarray}\]
It can be observed that the nonlinear maps all take the same input variables (\(x^{(t)},~h^{(t-1)}\)). Correspondingly, inside the lstm function, i2h and h2h are simply added together, and the sum is then sliced into the pre-activations of the individual gates.
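
This is also why i2h and h2h each produce num_hidden * 4 outputs and SliceChannel splits the sum into four parts: stacking the four gates' weight matrices into one matrix yields exactly the same pre-activations as four separate products. A small numpy check of that identity (sizes made up):

import numpy as np

num_hidden, num_input = 7, 8
x = np.random.randn(num_input)

# four separate weight matrices, one per gate (g, i, f, o)
Ws = [np.random.randn(num_hidden, num_input) for _ in range(4)]
separate = [W @ x for W in Ws]

# fused version: one (4 * num_hidden, num_input) matrix, then slice the result,
# which is what the num_hidden * 4 FullyConnected plus SliceChannel amount to
W_fused = np.concatenate(Ws, axis=0)
fused = np.split(W_fused @ x, 4)

assert all(np.allclose(a, b) for a, b in zip(separate, fused))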

Graph

After all this, let's see what the final graph looks like (it is rather large; right-click to view it separately):



Figure 1. Graph of the LSTM for 5-length input

It can be observed that at the bottom, besides the data node, there are also cyan nodes. These are not initialized through the usual name-based initialization; sort_io.py is what provides the values for these nodes.
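
Presumably these cyan nodes are the initial-state inputs (l0_init_c, l0_init_h, l1_init_c, l1_init_h) together with the label. In the example their values come from the iterator in sort_io.py; as a rough sketch of what it has to supply (not the example's own code, shapes assumed), they are just zero arrays of shape (batch_size, num_hidden):

import mxnet as mx

batch_size, num_hidden = 32, 7   # assumed sizes

# one (c, h) pair of zero initial states per direction
init_states = {name: mx.nd.zeros((batch_size, num_hidden))
               for name in ("l0_init_c", "l0_init_h", "l1_init_c", "l1_init_h")}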

Note

Finally, one more note: although the program is named sort, judging from its content the training treats each number as a word. That means the numbers appearing in a test input sequence must already have appeared during training (not strictly verified, just a guess).
