Preface
Sequence problems are an interesting topic in their own right. When I went looking for material on LSTM, I did not find a systematic write-up; the early Sepp Hochreiter paper and the thesis of his student Felix Gers are not exactly relaxing reads. What I started with was a review from 2015, which also did not read very smoothly at first, but after finishing its first two parts and then coming back to the formulation part of the article, things became much clearer.
I originally intended to write a program of my own, and then found a reference here. The program is very short, fewer than 200 lines of Python in total, but working out the structure from the inside took some effort. That reminded me that mxnet ships a similar example (example/bi-lstm-sort), so I dug it out to read. It builds the basic LSTM unit with symbol and then optimizes with the bucket feature, which looks promising, and along the way I can also see how bucket is used.
Code Plus Comment
This program uses symbol to build a memory cell and then uses that cell to assemble the complete symbol. I had previously assumed there was a built-in symbol for this, but it turns out that in mxnet v1.0 the built-in LSTM unit symbols are still in the dev phase, so it is worth seeing how the time dependency is wired up by hand.
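The code below references LSTMState and LSTMParam. In the example file these are plain namedtuples; I reproduce them here from memory (so treat the exact definitions as approximate), together with the import the snippet needs:

# containers used by the example's lstm.py (reproduced from memory, approximate)
from collections import namedtuple
import mxnet as mx

LSTMState = namedtuple("LSTMState", ["c", "h"])
LSTMParam = namedtuple("LSTMParam", ["i2h_weight", "i2h_bias",
                                     "h2h_weight", "h2h_bias"])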
######## from example/bi-lstm-sort/lstm.py #############

def lstm(num_hidden, indata, prev_state, param, seqidx, layeridx, dropout=0.):
    """LSTM Cell symbol"""
    if dropout > 0.:
        indata = mx.sym.Dropout(data=indata, p=dropout)
    i2h = mx.sym.FullyConnected(data=indata,
                                weight=param.i2h_weight,
                                bias=param.i2h_bias,
                                num_hidden=num_hidden * 4,
                                name="t%d_l%d_i2h" % (seqidx, layeridx))
    h2h = mx.sym.FullyConnected(data=prev_state.h,
                                weight=param.h2h_weight,
                                bias=param.h2h_bias,
                                num_hidden=num_hidden * 4,
                                name="t%d_l%d_h2h" % (seqidx, layeridx))
    gates = i2h + h2h
    slice_gates = mx.sym.SliceChannel(gates, num_outputs=4,
                                      name="t%d_l%d_slice" % (seqidx, layeridx))
    in_gate = mx.sym.Activation(slice_gates[0], act_type="sigmoid")
    in_transform = mx.sym.Activation(slice_gates[1], act_type="tanh")
    forget_gate = mx.sym.Activation(slice_gates[2], act_type="sigmoid")
    out_gate = mx.sym.Activation(slice_gates[3], act_type="sigmoid")
    next_c = (forget_gate * prev_state.c) + (in_gate * in_transform)
    next_h = out_gate * mx.sym.Activation(next_c, act_type="tanh")
    return LSTMState(c=next_c, h=next_h)


def bi_lstm_unroll(seq_len, input_size, num_hidden, num_embed, num_label, dropout=0.):
    embed_weight = mx.sym.Variable("embed_weight")
    cls_weight = mx.sym.Variable("cls_weight")
    cls_bias = mx.sym.Variable("cls_bias")
    last_states = []
    last_states.append(LSTMState(c=mx.sym.Variable("l0_init_c"),
                                 h=mx.sym.Variable("l0_init_h")))
    last_states.append(LSTMState(c=mx.sym.Variable("l1_init_c"),
                                 h=mx.sym.Variable("l1_init_h")))
    forward_param = LSTMParam(i2h_weight=mx.sym.Variable("l0_i2h_weight"),
                              i2h_bias=mx.sym.Variable("l0_i2h_bias"),
                              h2h_weight=mx.sym.Variable("l0_h2h_weight"),
                              h2h_bias=mx.sym.Variable("l0_h2h_bias"))
    backward_param = LSTMParam(i2h_weight=mx.sym.Variable("l1_i2h_weight"),
                               i2h_bias=mx.sym.Variable("l1_i2h_bias"),
                               h2h_weight=mx.sym.Variable("l1_h2h_weight"),
                               h2h_bias=mx.sym.Variable("l1_h2h_bias"))

    # embedding layer
    data = mx.sym.Variable('data')
    label = mx.sym.Variable('softmax_label')
    embed = mx.sym.Embedding(data=data, input_dim=input_size,
                             weight=embed_weight, output_dim=num_embed, name='embed')
    wordvec = mx.sym.SliceChannel(data=embed, num_outputs=seq_len, squeeze_axis=1)

    forward_hidden = []
    for seqidx in range(seq_len):
        hidden = wordvec[seqidx]
        next_state = lstm(num_hidden, indata=hidden,
                          prev_state=last_states[0],
                          param=forward_param,
                          seqidx=seqidx, layeridx=0, dropout=dropout)
        hidden = next_state.h
        last_states[0] = next_state
        forward_hidden.append(hidden)

    backward_hidden = []
    # as the folder name suggests, this is a bidirectional symbol,
    # hence the backward pass (just a guess :))
    for seqidx in range(seq_len):
        k = seq_len - seqidx - 1
        hidden = wordvec[k]
        next_state = lstm(num_hidden, indata=hidden,
                          prev_state=last_states[1],
                          param=backward_param,
                          seqidx=k, layeridx=1, dropout=dropout)
        hidden = next_state.h
        last_states[1] = next_state
        backward_hidden.insert(0, hidden)

    hidden_all = []
    for i in range(seq_len):
        hidden_all.append(mx.sym.Concat(*[forward_hidden[i], backward_hidden[i]], dim=1))

    hidden_concat = mx.sym.Concat(*hidden_all, dim=0)
    pred = mx.sym.FullyConnected(data=hidden_concat, num_hidden=num_label,
                                 weight=cls_weight, bias=cls_bias, name='pred')

    label = mx.sym.transpose(data=label)
    label = mx.sym.Reshape(data=label, target_shape=(0,))
    sm = mx.sym.SoftmaxOutput(data=pred, label=label, name='softmax')
    return sm
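Not part of the example, but a quick way to poke at the result: build the unrolled symbol with some made-up sizes and list its arguments; the l0_/l1_ parameters and the init-state nodes discussed below all show up in that list.

# my own quick check, with made-up sample sizes
net = bi_lstm_unroll(seq_len=5, input_size=100, num_hidden=64,
                     num_embed=32, num_label=100)
print(net.list_arguments())  # data, embed_weight, l0_*/l1_* weights, init states, cls_*, softmax_label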
Embedding Op
I had seen this symbol before turning to the API. At the time I thought I understood what it could do (wrongly, as it turns out: I assumed it only implemented indexing/encoding), but I could not say what the op is actually for -_-|| Now that I am going through it anyway, I looked it up along the way. The API does not come with a document explaining what this op implements, so I had to search around; this answer should help. To summarize: it encodes the word table in a non-one-hot way (it removes the independence between dimensions, so the value of each dimension is continuous).
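To make the "non-one-hot" point concrete, here is a small numpy sketch of my own (not mxnet code): looking up a row of the embedding weight matrix is the same as multiplying a one-hot vector by that matrix, so each word id gets mapped to a dense vector whose dimensions take continuous values.

import numpy as np

vocab, dim = 4, 2
W = np.arange(vocab * dim, dtype=float).reshape(vocab, dim)  # embedding table
i = 3                                                        # a word id
one_hot = np.eye(vocab)[i]
print(W[i])         # row lookup, what the Embedding op does
print(one_hot @ W)  # identical result via an explicit one-hot multiply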
Note
- One question here is how this layer is optimized. I will check against the paper later (my guess is that the update is implemented inside this layer; a numpy sketch of that guess follows after this list).
- Another question: at the model output stage, how do we decode without one-hot encoding? Judging from the program, the output stage uses softmax, whose output can be read as a one-hot style code, which sidesteps the problem.
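The following is only my own numpy illustration of the guess in the first bullet: because the forward pass is a row lookup, its gradient is a scatter-add into the looked-up rows, so an update only needs to touch those rows of the table.

import numpy as np

vocab, dim = 5, 3
W = np.random.randn(vocab, dim)        # embedding table
idx = np.array([2, 4, 2])              # input word ids
out = W[idx]                           # forward: row lookup

g_out = np.ones_like(out)              # pretend upstream gradient
g_W = np.zeros_like(W)
np.add.at(g_W, idx, g_out)             # backward: scatter-add into the indexed rows
print(np.nonzero(g_W.any(axis=1))[0])  # -> [2 4]; all other rows stay zero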
Time Dependency
In this example only one layer of memory cells is used per direction (although, as the paper notes, even a single cell is already quite bulky). The complete symbol is constructed inside bi_lstm_unroll, and its time dependency is set up as follows: a complete input sequence (a sequence of vectors) is split by SliceChannel into individual vectors, and the complete symbol is then built according to the number of vectors obtained. Since that number is known at this point, memory cells can be stacked one after another until every vector has been assigned its own processing unit. The parameters used by every cell are drawn from the same group of variables (l0_i2h_weight, l0_i2h_bias, etc.), which produces the effect of a recurrent computation.
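A toy sketch of my own (not taken from the example) of that mechanism: because every time step reuses the same mx.sym.Variable objects as weights, the unrolled graph contains many cells but only one set of parameters.

import mxnet as mx

# shared parameters, reused at every time step
w_xh = mx.sym.Variable("w_xh")
w_hh = mx.sym.Variable("w_hh")
b = mx.sym.Variable("b")

h = mx.sym.Variable("h_init")
outs = []
for t in range(3):
    x_t = mx.sym.Variable("x%d" % t)
    i2h = mx.sym.FullyConnected(data=x_t, weight=w_xh, bias=b, num_hidden=8,
                                name="t%d_i2h" % t)
    h2h = mx.sym.FullyConnected(data=h, weight=w_hh, bias=b, num_hidden=8,
                                name="t%d_h2h" % t)
    h = mx.sym.Activation(i2h + h2h, act_type="tanh")
    outs.append(h)

net = mx.sym.Group(outs)
print(net.list_arguments())  # w_xh, w_hh, b each appear once: the steps share weights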
Note
- Another question that comes up here is how to handle variable-length input sequences. This should be related to the bucket mechanism, which I will look at later. One can already guess what problem bucketing solves: according to a blog post I read, the bucket mechanism binds one model for each of a set of preset lengths, and since those lengths are discrete it probably also pads sequences to fit; if padding is used, its effect on the training updates also has to be handled. A rough sketch of this guess follows below.
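A rough sketch of that guess (mine, not the example's actual bucketing code): pick a few allowed lengths, generate one unrolled symbol per length with the same parameter names so they share weights, and pad each batch up to the nearest bucket rather than to the global maximum. vocab_size here is a made-up value.

# my guess at the bucketing idea, not the example's implementation
buckets = [10, 20, 30]
vocab_size = 100

def sym_gen(bucket_len):
    return bi_lstm_unroll(seq_len=bucket_len, input_size=vocab_size,
                          num_hidden=64, num_embed=32, num_label=vocab_size)

# one symbol per bucket; identical variable names mean the parameters are shared
symbols = {b: sym_gen(b) for b in buckets}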
LSTM implementation
Let's look at how the memory cell is built, and how it is then assembled across the time steps. Perhaps this example can be picked apart in detail the way rcnn was (well, at least it is not as hard).
What the lstm function implements is the formulation on page 20 of A Critical Review of Recurrent Neural Networks for Sequence Learning, not the form described by Figure 3.1 on page 17 of Felix Gers's thesis. The former carries a remark on that page: "These equations give the full algorithm for a modern LSTM ...". I will simply copy the equations over here:
\[\begin{eqnarray}
g^{(t)} &=& \phi(W^{gx} x^{(t)} + W^{gh} h^{(t-1)} + b_g) \nonumber\\
i^{(t)} &=& \sigma(W^{ix} x^{(t)} + W^{ih} h^{(t-1)} + b_i) \nonumber\\
f^{(t)} &=& \sigma(W^{fx} x^{(t)} + W^{fh} h^{(t-1)} + b_f) \nonumber\\
o^{(t)} &=& \sigma(W^{ox} x^{(t)} + W^{oh} h^{(t-1)} + b_o) \nonumber\\
s^{(t)} &=& g^{(t)} \odot i^{(t)} + s^{(t-1)} \odot f^{(t)} \nonumber\\
h^{(t)} &=& \phi(s^{(t)}) \odot o^{(t)} \nonumber
\end{eqnarray}\]
It can be seen that the input variables of every nonlinear map are the same (\(x^{(t)},~h^{(t-1)}\)); here \(\sigma\) is the logistic sigmoid and \(\phi\) is tanh, matching the activations used in the code. Correspondingly, inside the lstm function, i2h and h2h are added directly, and the sum is then sliced into the pre-activations of the corresponding gates.
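A numpy sketch of my own to make that correspondence explicit: one fully connected layer with num_hidden * 4 outputs followed by a four-way slice produces exactly the same pre-activations as computing the four gate formulas separately (bias and the h2h half are omitted for brevity).

import numpy as np

H, D = 3, 5
Ws = [np.random.randn(H, D) for _ in range(4)]  # per-gate weights W^{gx}, W^{ix}, W^{fx}, W^{ox}
x = np.random.randn(D)

big_W = np.vstack(Ws)            # shape (4H, D): one fused multiply, like num_hidden * 4
fused = big_W @ x
slices = np.split(fused, 4)      # analogue of SliceChannel(num_outputs=4)

for Wk, sk in zip(Ws, slices):
    assert np.allclose(Wk @ x, sk)  # each slice equals its gate's separate W x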
Graph
After all this, let's see what the final graph looks like (it is rather large; right-click and open it separately to view):
Figure 1. Graph of the LSTM for 5-length input
It can be seen that at the bottom, in addition to the data node, there are also cyan nodes. These cannot be initialized through the usual name-based method; sort_io.py provides the values for these nodes.
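For reference, a hedged sketch of how such a graph can be produced (my own addition, not how the example generates its figure; graphviz must be installed and the sizes are made up):

sym = bi_lstm_unroll(seq_len=5, input_size=100, num_hidden=64,
                     num_embed=32, num_label=100)
graph = mx.viz.plot_network(sym)    # returns a graphviz Digraph
graph.render("bi_lstm_sort_graph")  # writes the rendered figure to disk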
Note
Finally, one more note: although the program is named sort, judging from the content, training treats each number as a word of the input, which means the numbers appearing in the test input sequences must also have appeared during training (not strictly verified, just a guess).
A note on lstm examples in mxnet