LSTM is a variant of the RNN and belongs to the family of feedback (recurrent) neural networks.
1. Problems of the traditional RNN model: vanishing and exploding gradients
When it comes to LSTM, we inevitably have to start with the simplest and most primitive RNN.
We often hear that LSTM is suited to sequential data and variable-length sequences, especially in natural language processing. So what gives it the ability to handle variable-length sequences? In fact, a careful look at the figure above should already give an intuitive answer.
Looking at the left side of the figure, the RNN has two inputs: one is the input x_t at the current time t, and the other is an input that seemingly comes from "itself".
If that is not very clear, look at the right side of the figure: the right side is simply the left structure unrolled along the time axis, where the output of the previous time step becomes the input of the current time step. It is worth noting that all the neurons on the right are in fact the same neuron as on the left, sharing the same weights, but receiving a different input at each time step and then passing their output on to the next time step as input. This is how information from the past is stored.
The RNN model used in this article can be drawn as the following graph, where net denotes, as usual, the weighted linear combination of a layer's inputs before the activation function is applied.
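As an explicit example of what net means here, a minimal sketch in the notation used below (w_{ji} is the weight on the connection from unit i to unit j, x is the external input, y the unit outputs, and f_j the activation function of unit j):

$$
\mathrm{net}_j(t) \;=\; \sum_i w_{ji}\, x_i(t) \;+\; \sum_h w_{jh}\, y_h(t-1), \qquad
y_j(t) \;=\; f_j\big(\mathrm{net}_j(t)\big)
$$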
Now take a look at the typical unfolded structure used by BPTT (backpropagation through time), shown in the following diagram, where only part of the network is drawn.
The error signal at time t is calculated as follows:
The derivation formula is as follows:
The formula above appears throughout BPTT and indeed in any BP network. The concrete derivation, written out as a demonstration, is as follows:
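Since the formulas themselves appear as images in the original post, here is the standard chain-rule recursion they refer to, written in the notation above (δ_j(t) denotes the error signal of unit j at time t):

$$
\delta_j(t) \;=\; \frac{\partial E}{\partial\,\mathrm{net}_j(t)}
\;=\; f_j'\big(\mathrm{net}_j(t)\big)\sum_i w_{ij}\,\delta_i(t+1)
$$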
Now, if we trace this error signal back into the past, suppose that two arbitrary nodes u and v are related as follows:
Then the relation between their error signals can be written as follows:
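A reconstruction of the formula shown as an image here, with n the number of units and the same weight convention as above (w_{uv} is the weight from v to u):

$$
\frac{\partial \delta_v(t-q)}{\partial \delta_u(t)} \;=\;
\begin{cases}
f_v'\big(\mathrm{net}_v(t-1)\big)\, w_{uv}, & q = 1,\\[6pt]
f_v'\big(\mathrm{net}_v(t-q)\big)\displaystyle\sum_{l=1}^{n}
\frac{\partial \delta_l(t-q+1)}{\partial \delta_u(t)}\; w_{lv}, & q > 1.
\end{cases}
$$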
Continuing with the formula above, n is the number of neurons in a layer. The formula is not hard to understand: to obtain the error passed from node u at time t back to node v at time t−q, we first gather the error that u has passed to every node at time t−q+1, and then propagate it back one more step between the two adjacent layers (that is exactly the previous formula).
This is obviously a recursive process; expanding the chain inside gives:
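Again reconstructing the image, the unrolled form is the following sum of products, with l_0 = u and l_q = v:

$$
\frac{\partial \delta_v(t-q)}{\partial \delta_u(t)} \;=\;
\sum_{l_1=1}^{n}\cdots\sum_{l_{q-1}=1}^{n}\;
\prod_{m=1}^{q} f_{l_m}'\big(\mathrm{net}_{l_m}(t-m)\big)\, w_{l_m l_{m-1}}
$$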
It looks complicated, but it can be read as follows:
The q−1 summations in front enumerate every possible path from u back to v through the intermediate layers during error back-propagation.
The product that follows is the error along one such path: there are q layers in total, and each layer contributes a factor equal to the derivative of an activation function multiplied by a weight.
Let's call each such term of the sum T:
Summing over all paths, there are n^(q−1) such terms T in total.
And each term T is the product of q of these factors:
If each factor is > 1, the error grows exponentially with q, and the updates to the network parameters will oscillate violently.
If each factor is < 1, the error vanishes and learning becomes ineffective. The activation function is usually the sigmoid, whose derivative has a maximum value of 0.25, so any weight whose absolute value is less than 4 guarantees that the factor is less than 1.
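Making the bound explicit (a standard estimate, with f the logistic sigmoid so that max f′ = 0.25):

$$
\big|\,f_{l_m}'(\mathrm{net}_{l_m})\, w_{l_m l_{m-1}}\big| \;\le\; 0.25\,\big|w_{l_m l_{m-1}}\big| \;<\; 1
\qquad\text{whenever } |w_{l_m l_{m-1}}| < 4 .
$$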
Exponential error growth is relatively rare; error vanishing is very common in BPTT.
At this point we should be able to see that, because every layer of error back-propagation introduces a factor involving the derivative of the activation function, after a number of steps the product of these factors causes a series of troubles, namely the so-called vanishing and exploding gradient problems.
In fact, a rigorous analysis of the equation above is very complicated, because we can influence the result by adjusting W or by increasing the number of nodes per layer. But these measures are not complete solutions, because W is a parameter shared by every layer: it affects the net values, and changes in the net values in turn affect the derivative of the activation function.
For example, even pushing W towards infinity makes net infinitely large, which drives the derivative of the activation function to 0.
The original paper also analyses these problems and verifies the two cases above, but a deeper analysis becomes very involved, touching on higher-order infinitesimals and the dynamics of adjusting W, and the author does not spell it out completely in the original text.
A more in-depth treatment of this issue can be found in the paper
On the difficulty of training recurrent neural networks, Razvan Pascanu, Tomas Mikolov, Yoshua Bengio,
which discusses in depth the nature of vanishing and exploding gradients (in terms of the eigenvalues of the W matrix) and some of the phenomena they cause.
Vanishing and exploding gradients are not the focus of this article, so I will stop here and simply summarise the common vanishing gradient problem:
In the traditional RNN model, gradient descent during training tends to update the weights in whatever direction is correct for the end of the sequence. In other words, the further back an input lies in the sequence, the smaller the "influence" it can exert on pushing the weights in the right direction, so training ends up biased towards recent information; that is, the model does not have a very long memory.
2. How LSTM solves the problem
To overcome the vanishing error problem, some restrictions have to be imposed. Assume first that only a single neuron is connected to itself, as shown in the following diagram:
Following the analysis above, the error signal at time t is calculated as follows:
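With a single self-connected unit j, the recursion above reduces to (a reconstruction of the formula shown as an image):

$$
\delta_j(t) \;=\; f_j'\big(\mathrm{net}_j(t)\big)\, w_{jj}\,\delta_j(t+1)
$$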
To keep the error from changing as it flows back, we can force this factor to be 1:
From this equation we get:
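Reconstructing these two steps explicitly: requiring

$$
f_j'\big(\mathrm{net}_j(t)\big)\, w_{jj} \;=\; 1
$$

and integrating gives $f_j\big(\mathrm{net}_j(t)\big) = \mathrm{net}_j(t)/w_{jj}$.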
This means that the activation function must be linear; one usually takes f_j(x) = x and w_jj = 1.0, thereby obtaining a constant error flow, also known as the CEC (constant error carousel).
3. The design of the LSTM model
Up to this point, the simplest structure proposed can be roughly depicted by the following diagram:
But this alone does not work, because conflicts arise between the values flowing into and out of the cell, so two control gates are added, an input gate and an output gate, to resolve this contradiction. The figure becomes:
The figure adds two control gates. "Control" here means that the input to the CEC is multiplied by the output of the input gate, and the output of the CEC is multiplied by the output of the output gate. The whole box is called a block, and the small circle in the middle is the CEC; the y = x line inside it indicates that this neuron's activation function is linear and that its self-connection weight is 1.0.
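To make the data flow of this block concrete, here is a minimal NumPy sketch of one time step of the gated block just described. It is only an illustration: tanh stands in for the squashing functions of the original paper, and all names and shapes are assumptions of this sketch, not the paper's notation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm97_step(x, y_prev, s_prev, W_in, W_out, W_cell):
    """One step of the gated block above: input gate, output gate,
    and a linear CEC with self-connection weight 1.0 (no forget gate yet)."""
    z = np.concatenate([x, y_prev])     # current input plus previous block output
    y_in = sigmoid(W_in @ z)            # input gate activation
    y_out = sigmoid(W_out @ z)          # output gate activation
    g = np.tanh(W_cell @ z)             # squashed cell input
    s = s_prev + y_in * g               # CEC: the state simply accumulates
    y = y_out * np.tanh(s)              # block output, gated by the output gate
    return y, s

# Toy usage with assumed sizes.
rng = np.random.RandomState(0)
n_in, n_cell = 4, 3
W_in, W_out, W_cell = (rng.randn(n_cell, n_in + n_cell) * 0.1 for _ in range(3))
y, s = np.zeros(n_cell), np.zeros(n_cell)
for t in range(5):
    y, s = lstm97_step(rng.randn(n_in), y, s, W_in, W_out, W_cell)
```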
4. The core idea and derivation of LSTM training
So far we have the general impression that the author rather "wilfully" constructed such a model: in theory there is a constant-error unit called the CEC, and around this core idea a few gates were added. But this brings another problem: once the gates are added, the structure no longer matches the simple core structure analysed before. During training, the error arriving inside a block will inevitably propagate and disperse across the different gates, so the gradient gets spread out, and the vanishing gradient problem may still appear. How is this solved?
The author's solution: design the training algorithm so that it preserves the integrity of the CEC's function. The idea used here is to truncate the gradient flow (a truncated backprop version of the LSTM algorithm).
The idea of this algorithm is that, in order to guarantee that the error inside the memory cell does not decay, none of the errors entering the block (through net_cj, net_inj or net_outj) are propagated back to the previous time step.
To describe the process graphically: when an error signal reaches the output of a block (having come through the hidden layer and the output layer), it is first scaled by the activation of the output gate and the derivative h′; it is then handed to the CEC state, through which it can be carried back to earlier time steps. When the error is to leave the cell at some time step through the input gate and g, it is scaled once more by the input gate activation and the derivative g′; at that point it is used to update the weights connected to the block, but (by the design of the algorithm) it is not propagated from there back to the previous time step. To see the specific implementation, we carry out the derivation.
(1) The approximation used by the truncated-gradient method
Simply put, the error returned to the input/output gates and to the block input (net_cj) is not propagated back to the previous time step; it is only used to update the weights W of those parts.
Following the derivation of the formula above, several other key formulae can be obtained:
The error inside the memory cell stays the same at every time step t.
Finally, the whole error back-propagation process: the error passes through the output layer, the classifier, the hidden layer and so on, and enters a block at some time step; there it is first passed to two places, the output gate and the memory cell.
The error reaching the output gate is used to update the output gate's weights W; after reaching the memory cell, the error travels along two paths:
1. it is carried through the cell back to one or more earlier time steps;
2. it is passed to the input gate and the block input to update the corresponding weights (note: the error is not passed from here to the previous time step).
To repeat the single most critical point: this back-propagation scheme passes the error to the previous time step only through the memory cell in the middle.
5. A disadvantage of the LSTM model so far is that the state value of the CEC may keep increasing. After the forget gate is added, the state of the CEC can be controlled; the structure is as follows:
Here the self-connection weight is effectively no longer 1.0 but a dynamic value, namely the output of the forget gate. It controls the state of the CEC: when necessary it can be set to 0 (the "forgetting" role), and when it equals 1 the behaviour is the same as the original structure.
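Written out (a standard form, with y_φ the forget-gate output, y_in the input-gate output and g the squashed cell input):

$$
s_c(t) \;=\; y_\varphi(t)\, s_c(t-1) \;+\; y_{\mathrm{in}}(t)\, g\big(\mathrm{net}_c(t)\big)
$$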
The next step brings us to the improved structure that has been hugely popular over the last couple of years, the LSTM we are all familiar with:
(Note that the subscript order here returns to the usual convention: from the earlier index to the later one.)
Compared with the 1997 LSTM prototype, the main changes are:
1. The introduction of the forget gate, which some also call the keep gate (a name I actually think is better).
2. The choice of activation functions has changed; sigmoid and tanh are now the usual choices.
3. The addition of peepholes, i.e. connections from the cell to each gate.
4. The training process no longer uses the truncated algorithm but full BPTT plus some tricks, which gives better results.
The specific forward and backward derivations are as follows. w_ij denotes the connection weight from neuron i to neuron j (note that this is the reverse of many papers). The input of a neuron is denoted by a and its output by b. The subscripts ι, φ and ω denote the input gate, forget gate and output gate respectively, and the subscript c denotes the cell. The peephole weights from the cell to the input, forget and output gates are written w_cι, w_cφ and w_cω, and s_c denotes the state of cell c. The gate activation function is written f, while g and h are the input and output activation functions of the cell. I is the number of neurons in the input layer, K the number in the output layer, and H the number of hidden cells.
Forward calculation:
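Since the original shows these formulas as images, here is a reconstruction of the forward pass in the notation just defined (following Graves' formulation; treat it as a sketch of the equations rather than a verbatim copy):

$$
\begin{aligned}
a_\iota^t &= \sum_{i=1}^{I} w_{i\iota} x_i^t + \sum_{h=1}^{H} w_{h\iota} b_h^{t-1} + \sum_{c} w_{c\iota} s_c^{t-1},
& b_\iota^t &= f(a_\iota^t) \\
a_\varphi^t &= \sum_{i=1}^{I} w_{i\varphi} x_i^t + \sum_{h=1}^{H} w_{h\varphi} b_h^{t-1} + \sum_{c} w_{c\varphi} s_c^{t-1},
& b_\varphi^t &= f(a_\varphi^t) \\
a_c^t &= \sum_{i=1}^{I} w_{ic} x_i^t + \sum_{h=1}^{H} w_{hc} b_h^{t-1},
& s_c^t &= b_\varphi^t s_c^{t-1} + b_\iota^t\, g(a_c^t) \\
a_\omega^t &= \sum_{i=1}^{I} w_{i\omega} x_i^t + \sum_{h=1}^{H} w_{h\omega} b_h^{t-1} + \sum_{c} w_{c\omega} s_c^{t},
& b_\omega^t &= f(a_\omega^t) \\
b_c^t &= b_\omega^t\, h(s_c^t)
\end{aligned}
$$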
Error back-propagation and weight updates:
(The formulas above are taken from Alex Graves' thesis; I was too lazy to take screenshots from the original text, so I borrowed the figures from the blog mentioned earlier.)
This derivation is much simpler than the one in the original paper; it is just repeated application of the chain rule. I think it is worth trying this backward derivation yourself on paper; it is really not difficult at all.
6. A study of the working characteristics of LSTM
In fact, up to now LSTM models have been applied with some success to language modelling, handwriting recognition, sequence generation, machine translation, speech and video analysis, and many other tasks, yet few people have studied what actually causes these results and which parts of the design are effective.
A series of experiments has been carried out on the states of the gates, using character-level language sequences and language models, with comparisons against RNN and GRU models.
One rather interesting detail: the experiments in that paper use two data sets, "War and Peace" and the Linux source code (...).
The authors were pleasantly surprised to find that some units in the LSTM memory cells have striking features: for example, they may mark the end of a paragraph, or quoted content, or a conditional statement in the Linux source code.
Of course, most of the hidden units remain ones we cannot interpret.
"The Forget gates are its most criticalcomponents"
In other words, the forget gate plays the most critical role. (And yet there was no forget gate in the original LSTM prototype...)
In addition, the paper also gives some statistics on the states of each gate.
Roughly speaking, the forget gate tends to be inactive (biased towards 0), that is, it frequently forgets things.
The input gate tends to stay open most of the time.
The output gate is open and closed in roughly equal measure.
7. A number of possible problems
1. In the Theano LSTM example, the gradient back-propagation relies on Theano's automatic differentiation, so what is actually used is full BPTT. Doesn't this bring back the problems described earlier?
A: I actually touched on some of Alex Graves' tricks before raising this question. Thinking it through carefully, the gradient problems brought by full BPTT are mitigated by the intact functioning of the memory cell; in other words, the main error is carried back through the memory cell and preserved there. That is why the LSTM models we use today still work well.
2. The functions of the LSTM input, output and forget gates actually overlap. Is there no simpler structure that improves on this?
A: Yes, for example the GRU (gated recurrent unit), which has already appeared.
8. Summary
Two key questions:
1. Why does it have a memory function?
This is already solved in the RNN: because of the recurrence, the hidden state of the previous moment participates in the computation at the current moment. Put explicitly, the current choice or decision takes the previous state into account.
2. Why can LSTM remember over long time spans?
Because the specially designed structure has the CEC property, the error suffers almost no attenuation when it is passed back to an earlier state. So when the weights are adjusted, the influence of states from long ago and the influence of the final state can take effect at the same time, and the trained model ends up with a memory that spans a long time range.
9. Keras implementation
Keras: how to use variable-length sequences when building an LSTM model
As we all know, one of the great advantages of LSTM is its ability to handle variable-length sequences. When building a model with Keras, if you use the LSTM layer directly as the first layer of the network, you need to specify the input size. If you need variable-length sequences, simply place a Masking layer (or an Embedding layer) in front of the LSTM layer.
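A minimal sketch of this approach with toy data (all sizes and hyperparameters here are assumptions for illustration): pad the sequences to a common length, and let a Masking layer tell the LSTM which timesteps are padding.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Masking, LSTM, Dense

n_features = 8
# Three toy sequences of different lengths, each timestep having n_features values.
seqs = [np.random.rand(5, n_features),
        np.random.rand(3, n_features),
        np.random.rand(7, n_features)]
labels = np.array([0., 1., 0.])

# Zero-pad all sequences to the length of the longest one.
maxlen = max(len(s) for s in seqs)
x = np.zeros((len(seqs), maxlen, n_features), dtype='float32')
for i, s in enumerate(seqs):
    x[i, :len(s)] = s

model = Sequential()
model.add(Masking(mask_value=0.0, input_shape=(None, n_features)))  # None = any length
model.add(LSTM(16))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(x, labels, epochs=1, verbose=0)
```

For integer token inputs (e.g. word indices), an Embedding layer with mask_zero=True achieves the same effect of masking the padded positions.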