LSTM Principle Analysis

A summary of LSTM theory and derivations

Catalogue

1. The problem with traditional RNNs: vanishing and exploding gradients

2. How LSTM solves the problem

3. Design of the LSTM model

4. Core ideas and derivation of LSTM training

5. Recent improvements to the LSTM model

6. Study on the working characteristics of LSTM

7. Some of the problems that may exist

8. Summary

9. References

1. The problem with traditional RNNs: vanishing and exploding gradients
The RNN model used in this article is shown in the figure, where net denotes the weighted sum of a layer's inputs before the activation function is applied.
Note: in the original LSTM paper (1997), and in many related papers, the order of the subscripts is the opposite of what we usually write today. For example, w_ij denotes the weight from unit j to unit i.


The following derivation mainly follows the LSTM author's paper "The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions".

The first half of the original LSTM paper covers the same material.


Take a look at the typical BPTT (backpropagation through time) unrolled structure, shown below (only part of the network is drawn).


The error signal at time t is calculated as follows:

It is backpropagated according to the following formula:



The formula above is ubiquitous in BPTT, and indeed in any backpropagation network. Its concrete derivation is a short demonstration:
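The formula images are missing here; as a reconstruction in (roughly) the notation of the original papers, the error signal of unit j at time t and its one-step backpropagation are usually written as

$$
\vartheta_j(t) = f_j'\big(net_j(t)\big)\,\frac{\partial E(t)}{\partial y^{j}(t)} \quad \text{(output unit)},
\qquad
\vartheta_j(t) = f_j'\big(net_j(t)\big)\,\sum_{i} w_{ij}\,\vartheta_i(t+1) \quad \text{(non-output unit)},
$$

where f_j is the activation function of unit j, net_j its weighted input, y^j its output, and w_ij the weight from unit j to unit i (the original subscript order noted above).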



Now suppose this error signal is propagated all the way back in time. Take any two nodes u and v whose connection is as shown below:



Then the relationship between their error signals can be written as follows:
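The formula image is missing here; reconstructed from the vanishing-gradient paper's analysis (up to subscript conventions), the scaling factor between the error at node u at time t and node v at time t-q is

$$
\frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} =
\begin{cases}
f_v'\big(net_v(t-1)\big)\, w_{uv}, & q = 1,\\[4pt]
f_v'\big(net_v(t-q)\big)\, \displaystyle\sum_{l=1}^{n} \frac{\partial \vartheta_l(t-q+1)}{\partial \vartheta_u(t)}\, w_{lv}, & q > 1.
\end{cases}
$$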



(A question a senior colleague once raised with me: the differential symbol at the very front is not strictly necessary; judging from the two-layer derivation above, the ratio alone would suffice, without writing it as a differential.

My take on this: because the derivation is done with total differentials, the relationships between the differentials are always linear combinations of one another no matter how the error propagates, so the result is exactly the same whether or not the differential symbol is written.)


Continuing with the formula above: n is the number of neurons in the layer. The formula is not hard to understand. To compute the error passed from node u at time t back to node v at time t-q, we first need the errors that u passes to every node at time t-q+1, and then propagate them back one more layer.

This is obviously a recursive process. Expanding the recursion fully gives:
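The expanded formula is missing here; written out (again up to subscript conventions, with l_0 = u and l_q = v), it is

$$
\frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} =
\sum_{l_1=1}^{n} \cdots \sum_{l_{q-1}=1}^{n}\;
\prod_{m=1}^{q} f_{l_m}'\big(net_{l_m}(t-m)\big)\, w_{l_{m-1}\,l_m}.
$$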



This looks more complicated, but it can be read as follows:

The leading q-1 summation signs enumerate every possible path from u back to v through the intermediate layers during error backpropagation.

The product that follows is the error contribution of one such path: it crosses q layers in total, and at each layer it is multiplied by an activation-function derivative and a weight.


Expanding further, call the product term at the end of the formula T:

The total number of such terms T in the sum is n^(q-1); that is, there are n^(q-1) terms T,

and each T is a product of q factors of the form f'(net)·w:

If |f'(net)·w| > 1, the error grows exponentially with q, and the parameter updates will oscillate violently.

If |f'(net)·w| < 1, the error vanishes and learning becomes ineffective. The usual activation function is the sigmoid, whose derivative is at most 0.25, so unless the absolute value of the weight exceeds 4 the factor cannot reach 1.

Exponential growth of the error is comparatively rare; vanishing error is very common in BPTT.
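As a quick numerical illustration (my own sketch, not from the original paper), the following Python snippet multiplies q identical factors of the form f'(net)*w for a sigmoid unit and shows how fast the product vanishes or explodes; the weight values tried are arbitrary:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def path_factor(w, q, net=0.0):
    # Product of q identical factors f'(net) * w along one backprop path.
    return (sigmoid_prime(net) * w) ** q

# f'(0) = 0.25 is the sigmoid's largest possible derivative.
for w in [1.0, 3.9, 4.1, 8.0]:
    print("w=%4.1f  q=10: %.3e   q=50: %.3e" % (w, path_factor(w, 10), path_factor(w, 50)))
# With |w| < 4 the factor |f'(net)*w| stays below 1 and the product vanishes
# exponentially in q; with |w| > 4 (and net near 0) it can explode instead.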


At this point the picture should be clear: each layer of error backpropagation introduces a multiplicative factor containing the activation-function derivative, so after many steps the repeated multiplication of this factor leads to the problems we call gradient vanishing and gradient explosion.


A full analysis of the result above is actually quite involved, because we could try to adjust w or increase the number of neurons n per layer to control the outcome. In practice neither fix works perfectly, because w is a parameter shared across the layers: it affects the value of net, and a change in net in turn changes the derivative of the activation function.

For example, even if w grows without bound, net also grows without bound, which drives the derivative of the activation function toward 0.

The original paper does analyze these issues and verifies the two problems mentioned above, but a deeper analysis quickly becomes very complex, involving higher-order infinitesimals, the dynamics of adjusting w, and so on, and the author does not fully resolve it in the original text.

For a deeper understanding of this issue, refer to this paper:

"On the Difficulty of Training Recurrent Neural Networks", Razvan Pascanu, Tomas Mikolov, Yoshua Bengio

It discusses the nature of vanishing and exploding gradients in depth (including subtler factors such as the eigenvalues of the weight matrix W) and some of the phenomena they cause.



Gradient vanishing and explosion are not the focus of this article, so I will stop here for now and summarize the common gradient vanishing problem:


During gradient-descent training, a traditional RNN tends to update its weights in the directions dictated by the end of the sequence. In other words, the farther back an input lies in the sequence, the smaller its "influence" on the correct weight updates, so the trained model is biased toward recent information; that is, it does not have a very long memory.

2. How LSTM solves the problem

To overcome the problem of vanishing error, some restrictions need to be imposed. Assume first that there is only a single neuron, connected to itself, as in the diagram below:


Following the derivation above, the error signal at time t is computed as follows:


To keep the error unchanged, we can force the following factor to equal 1:


From this condition we obtain:
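The three formula images above are missing; reconstructed from the surrounding text and the original argument, they are, in order: the error signal of the single self-connected unit j,

$$ \vartheta_j(t) = f_j'\big(net_j(t)\big)\, w_{jj}\, \vartheta_j(t+1), $$

the constant-error condition,

$$ f_j'\big(net_j(t)\big)\, w_{jj} = 1, $$

and, integrating this condition,

$$ f_j\big(net_j(t)\big) = \frac{net_j(t)}{w_{jj}}. $$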


This means that the activation function must be linear, typically f_j(x) = x with w_jj = 1.0, which yields a constant error flow, also called the CEC (constant error carousel).

3. Design of the LSTM model

Up to this point, the simplest structure can in fact be drawn like this:


But that alone does not work, because there is a conflict between the input and output weight updates (the original author's wording; some explanation follows at the end of this section). Two control gates, the input gate and the output gate, are therefore added to resolve this conflict, as shown below:


The figure adds the two control gates. "Control" here means that the input to the CEC is multiplied by the output of the input gate, and the output of the CEC is multiplied by the output of the output gate. The whole box is called a block; the small circle in the middle is the CEC, drawn with the line y = x to indicate that this neuron's activation function is linear and that its self-connection weight is 1.0.
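As a compact summary (a sketch in the spirit of the 1997 notation, not a verbatim copy), the forward pass of such a block with input gate in_j, output gate out_j and internal state s_{c_j} is

$$ s_{c_j}(t) = s_{c_j}(t-1) + y^{in_j}(t)\, g\big(net_{c_j}(t)\big), \qquad y^{c_j}(t) = y^{out_j}(t)\, h\big(s_{c_j}(t)\big), $$

where y^{in_j} and y^{out_j} are the gate activations, g squashes the cell input, h squashes the cell state, and the self-connection weight of the state update is fixed at 1.0, which is exactly the CEC.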

The conflict between the input and output weights is described in the author's original text as follows:



To tell the truth, I have read this particular paragraph dozens of times and discussed it with several senior colleagues many times, and we still have not reached a consensus on a good explanation.

From my own perspective, let me briefly explain the motivation for the gates:


(1) Adding gates is, in effect, a multi-stage way of selecting features.

To give a rough analogy: suppose I interview at Baidu. First I go through a technical interview, which checks whether I have the technical qualities needed to join the company.

After passing that, I go through an HR interview, which checks whether my other qualities are up to standard. In principle, a single interview could test both the technical and the personal qualities at once.

But that places much higher demands on the interviewer, who would have to understand both the technology and HR. Such an all-round interviewer is hard to find; with two interviews at different levels, the task becomes much easier.


The technical interview is analogous to net_cj in the figure, and the HR interview is analogous to the input gate.

We could certainly try to extract all the information we want with a single set of weights W before the signal enters the cell core.

But that is much harder: training such a W is more difficult than screening in two separate stages.


(2) The functions of the gates overlap to some extent; the author also mentions that the input gate and the output gate do not always need to appear at the same time, and sometimes only one of them is enough.

4. Core ideas and derivation of LSTM training

By this point we have the general impression that the author rather "willfully" built the model around a theoretical commitment: a constant error flow, the CEC, to which several gates were then attached. This brings another problem. Once the gates are attached, the structure no longer matches the simple core structure, so during training the error inside a block inevitably spreads out and is dispersed over the different gates. The gradient is thus dispersed and may still vanish. How is this resolved?

The solution proposed by the authors is to design a training algorithm adapted to the structure, so that the CEC keeps functioning as intended. The idea is to truncate the backpropagated gradients (the truncated-backprop version of the LSTM algorithm).

The idea of the algorithm is that, to keep the error inside the memory cell from decaying, all errors arriving at the block's inputs (net_cj, net_inj, net_outj) are not propagated back further in time.

To visualize the process: when an error signal reaches the output of a block (propagated back through the hidden and output layers), it is first scaled by the output gate and by the derivative h' of the output activation, and then becomes the error of the CEC state, which is carried backward in time. When this error later leaves the block through g and the input gate, it is scaled once more, by g' and the gate activation, and is used to update the weights attached to the block; according to the design of the algorithm, it does not travel from here into a block at the previous time step. The concrete implementation is derived below.

A small detail worth noting: the h and g used in the original paper are not the now-common tanh and sigmoid but scaled and shifted variants of the sigmoid, for example:


Personal experience:

For units with recurrent structure, activation functions whose output ranges over both positive and negative values tend to work better.

After that, the original paper's appendix gives a detailed derivation of LSTM's truncated backpropagation algorithm.


PS: In two talks I have walked through this derivation in full, following the appendix of the original LSTM paper. There are a great many intermediate formulas (around 40), so I will not repeat the derivation here and will only list the key ones:


(1) The approximation made by the truncated-gradient method


As a brief explanation: the error propagated back to the input gate, the output gate and the block input (net_cj) is not returned to the previous time step; it is only used to update the corresponding parts of the weights W.
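Schematically (this is a paraphrase rather than the appendix's exact notation), the error signals ϑ_{in_j}(t), ϑ_{out_j}(t) and ϑ_{c_j}(t) arriving at the two gates and at the block input are consumed by the local weight updates

$$ \Delta w_{lm}(t) = \alpha\, \vartheta_l(t)\, y^{m}(t-1), \qquad l \in \{in_j,\ out_j,\ c_j\}, $$

and are not propagated to time t-1; only the error stored in the internal state s_{c_j} is carried backward in time through the CEC.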


From the formula above, several other key formulas can be obtained:


The error inside the memory cell stays the same at every time step t.




Finally, let us trace the whole backward pass of the error. The error travels back through the output layer, the classifier, the hidden layer and so on into a block at some time step, where it is passed to two places: the output gate and the memory cell.

The error that reaches the output gate is used to update the output gate's weights; the error that reaches the memory cell then splits into two paths:

1. It is carried through the cell back to the previous time step, and further back beyond it.

2. It is passed to the input gate and the block input in order to update the corresponding weights (note: from here the error is not propagated back to the previous time step).

To repeat the single most critical point: this truncated backpropagation algorithm passes the error to the previous time step only through the memory cell in the middle.

5. Recent improvements to the LSTM model

Below is the improved structure of the LSTM that has been so popular over the last couple of years, the version we are now familiar with:


(Note that the subscript order here returns to the usual convention, from source to target.)

Compared with the 1997 LSTM prototype, the main changes are the following:

1. The forget gate was introduced, which some of the literature calls the keep gate (a name I actually find more fitting).

2. The choice of activation functions has changed; sigmoid and tanh are now the usual choices.

3. Peephole connections were added, i.e. links from the cell to each gate.

4. Training no longer uses the truncation algorithm but full BPTT plus a few tricks, which gives better results.

(The original LSTM algorithm used a custom designed approximate gradient calculation that allowed the weights to be updated after every timestep. However the full gradient can instead be calculated with backpropagation through time, the method used in this paper. One difficulty when training LSTM with the full gradient is that the derivatives sometimes become excessively large, leading to numerical problems. To prevent this, all the experiments in this paper clipped the derivative of the loss with respect to the network inputs to the LSTM layers (before the sigmoid and tanh functions are applied) to lie within a predefined range.)
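As a rough illustration of the clipping trick described in that passage (a minimal sketch, not Graves's actual code; the range value 10.0 is an arbitrary example):

import numpy as np

def clip_lstm_input_grads(grad, clip_range=10.0):
    # Clip the derivative of the loss w.r.t. the LSTM layer's net inputs
    # (i.e. before the sigmoid/tanh are applied) to a predefined range.
    return np.clip(grad, -clip_range, clip_range)

grad_net = np.array([0.3, -250.0, 42.0, 7.5])
print(clip_lstm_input_grads(grad_net))   # -> [  0.3 -10.   10.    7.5]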

Some of the new models that can be consulted include:

A. Alex Graves. Supervised Sequence Labelling with Recurrent Neural Networks. Textbook, Studies in Computational Intelligence, Springer, 2012.

B. Alex Graves. Generating Sequences with Recurrent Neural Networks.

(PS: Alex Graves is a student of Hinton; his main work in recent years has been research on and applications of RNNs and LSTM, and he can be considered one of the top experts on LSTM.)


The specific forward and backward derivations are as follows. Notation: w_ij denotes the connection weight from neuron i to neuron j (note that this is the reverse of the convention in many of the earlier papers); a denotes a neuron's input and b its output; the subscripts ι, φ and ω denote the input gate, the forget gate and the output gate respectively; the subscript c denotes the cell, and the peephole weights from cell c to the input, forget and output gates are w_cι, w_cφ and w_cω; s_c denotes the state of cell c; f is the activation function of the gates, and g and h are the cell's input and output activation functions; I is the number of neurons in the input layer, K the number in the output layer, and H the number of hidden cells.

Forward pass calculation:
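The original screenshot of the forward-pass equations is not reproduced here; in Graves's notation as summarized above (with C denoting the number of cells, a symbol added here for the peephole sums; minor formatting may differ from the original), they read:

$$ a_\iota^t = \sum_{i=1}^{I} w_{i\iota}\, x_i^t + \sum_{h=1}^{H} w_{h\iota}\, b_h^{t-1} + \sum_{c=1}^{C} w_{c\iota}\, s_c^{t-1}, \qquad b_\iota^t = f(a_\iota^t) $$

$$ a_\phi^t = \sum_{i=1}^{I} w_{i\phi}\, x_i^t + \sum_{h=1}^{H} w_{h\phi}\, b_h^{t-1} + \sum_{c=1}^{C} w_{c\phi}\, s_c^{t-1}, \qquad b_\phi^t = f(a_\phi^t) $$

$$ a_c^t = \sum_{i=1}^{I} w_{ic}\, x_i^t + \sum_{h=1}^{H} w_{hc}\, b_h^{t-1}, \qquad s_c^t = b_\phi^t\, s_c^{t-1} + b_\iota^t\, g(a_c^t) $$

$$ a_\omega^t = \sum_{i=1}^{I} w_{i\omega}\, x_i^t + \sum_{h=1}^{H} w_{h\omega}\, b_h^{t-1} + \sum_{c=1}^{C} w_{c\omega}\, s_c^{t}, \qquad b_\omega^t = f(a_\omega^t) $$

$$ b_c^t = b_\omega^t\, h(s_c^t) $$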


Error backpropagation and weight updates:



(The formulas above are taken from Alex Graves's paper; I was too lazy to screenshot from the original, so I borrowed the images from the blog mentioned earlier.)

The derivation of these formulas is simpler than in the original paper; it is essentially repeated application of the chain rule. If you want a deeper grasp of the principles, it is worth working through this backpropagation derivation yourself. It really is not difficult.

Refer to the previously mentioned blog for details.
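To make the forward pass concrete, here is a minimal NumPy sketch of a single LSTM step with forget gate and peephole connections, following the equations above (variable names are my own and this is an illustrative sketch, not code from any of the cited papers):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, b_prev, s_prev, W, p, bias):
    # x      : input vector at time t, shape (I,)
    # b_prev : cell outputs at t-1, shape (C,)
    # s_prev : cell states at t-1, shape (C,)
    # W      : dict of weight matrices, each mapping [x; b_prev] to the C units
    # p      : dict of peephole weight vectors, shape (C,)
    # bias   : dict of bias vectors, shape (C,)
    z = np.concatenate([x, b_prev])                   # input + recurrent signal

    a_i = W['i'] @ z + p['i'] * s_prev + bias['i']    # input gate pre-activation
    a_f = W['f'] @ z + p['f'] * s_prev + bias['f']    # forget gate pre-activation
    a_c = W['c'] @ z + bias['c']                      # cell input (no peephole)

    b_i = sigmoid(a_i)
    b_f = sigmoid(a_f)
    s = b_f * s_prev + b_i * np.tanh(a_c)             # CEC state update

    a_o = W['o'] @ z + p['o'] * s + bias['o']         # output gate sees the new state
    b_o = sigmoid(a_o)
    b = b_o * np.tanh(s)                              # cell output
    return b, s

# Tiny usage example with random weights (I = 3 inputs, C = 2 cells).
rng = np.random.default_rng(0)
I, C = 3, 2
W = {k: rng.normal(scale=0.1, size=(C, I + C)) for k in 'ifco'}
p = {k: rng.normal(scale=0.1, size=C) for k in 'ifo'}
bias = {k: np.zeros(C) for k in 'ifco'}
b, s = lstm_step(rng.normal(size=I), np.zeros(C), np.zeros(C), W, p, bias)
print(b, s)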

6. Study on the working characteristics of LSTM

So far, LSTM models have achieved successful applications in language modeling, handwriting recognition, sequence generation, machine translation, speech and video analysis, and other tasks, but few people have studied exactly what causes these successes or which parts of the design are actually effective.

"Visualizing and understanding Recurrent Networks" Andrej karpathy,justin johnson,li Fei-fei

This article is a piece of an article by Stanford's Ms Li teacher instructing her students, offering some ideas. (This article of all kinds of charts is really dazzling, after all, Ms Li teacher team's front-end of the skill).

It runs a series of experiments on the gate states, on character-level language sequences and language models, and compares with RNN and GRU models.

One of the more interesting points, shown below: the experiments use two datasets, "War and Peace" and the Linux kernel source code (...).

The authors were pleasantly surprised to find that some of the LSTM's memory units exhibit striking behavior, for example marking the end of a paragraph, tracking quoted or referenced content, or tracking the content of conditional statements in the Linux source code.

Of course, most of the hidden units remain ones we cannot interpret.


The paper also mentions the following work:

K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. LSTM: A Search Space Odyssey. CoRR, abs/1503.04069, 2015.

That paper also runs a set of experiments and puts forward the conclusion that:

"The forget gate is its most critical component"

That is, the forget gate plays the most important role (and yet the original LSTM prototype had no forget gate at all...).



In addition, this paper also reports some statistics on the states of the gates:



The overall picture can be summarized like this: the forget gate tends to be inactive (biased toward 0), i.e. it forgets frequently.

The input gate usually tends to be in the open state.

The output gate is roughly balanced between open and closed.

7. Some of the problems that may exist

1. In the Theano LSTM example, the gradient backpropagation relies on Theano's automatic differentiation, so the backward pass effectively uses full BPTT. Doesn't this bring back the problems mentioned earlier?

A: I have already described some of Alex Graves's tricks above. On reflection, the gradient problems brought by full BPTT are indeed mitigated by the memory cell continuing to function as intended; that is, the main flow of backpropagated error still passes through the memory cell and is preserved. That is why the LSTM models we use today still work well.

2. The functions of the input, output and forget gates overlap. Is there a simpler structure that improves on this?

A: Yes; for example, the GRU (Gated Recurrent Unit) has since appeared.

8. Summary

Two key questions:

1. Why does it have a memory function?

This is already addressed by the RNN itself: because of the recurrence, the hidden-layer state of the previous time step takes part in the computation at the current time step; put plainly, the current choice and decision refer back to the previous state.

2. Why can LSTM remember over a long time span?

Because the structure is designed around the CEC, the error passed back to earlier states suffers almost no attenuation, so when the weights are adjusted, the influence of states from long ago and the influence of the most recent states can take effect at the same time, and the trained model ends up with memory over a longer time range.

9. References

1. Visualizing and Understanding Recurrent Networks

2. On the Difficulty of Training Recurrent Neural Networks

3. Hochreiter & Schmidhuber, 1997. Long Short-Term Memory (hochreiter97_lstm)

4. Oxford, RNN & LSTM

5. The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions

6. Recurrent Neural Network Regularization

7. A. Graves. Supervised Sequence Labelling with Recurrent Neural Networks.

8. A Guide to Recurrent Neural Networks and Backpropagation

9. An Empirical Exploration of Recurrent Network Architectures

10. A Survey on the Application of Recurrent Neural Networks to Statistical Language Modeling

11. Generating Sequences with Recurrent Neural Networks, Alex Graves

12. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches, 1409.1259v2

13. Encoder for machine translation using CNN, 1503.01838v5
