Translated by: Huangyongye
Original link: Understanding LSTM Networks
Foreword: I had actually used LSTMs before, directly through the deep learning framework Keras, but until now I never really understood the LSTM's detailed network structure, and that gap kept bothering me. Today I read this blog post, which the TensorFlow documentation recommends, and after finishing it things finally clicked; understanding the structure of the LSTM is basically no longer a problem for me. The post is really well written. To help others understand it, and so that I can quickly refresh my own memory if I forget these details later, I decided to write a translation. First, since my level is limited, please point out anything that is translated badly or interpreted incorrectly. Also, this is not a word-for-word rendering of the original; I may make some adjustments and modifications to make it easier to follow.

1. Recurrent Neural Networks (RNNs)
People do not start their thinking from scratch every time. As you read this article, your understanding of each word depends on some of the words you saw before; you do not throw away everything you have already read, forget it, and then try to understand each new word in isolation. In other words, human thought has continuity.
Traditional neural networks do not have this kind of continuity (they cannot retain their understanding of earlier text), and that seems like a major shortcoming. For example, suppose you want to classify what kind of event is happening at each frame of a movie. It is not clear how a traditional network could use events from earlier in the movie to help it understand the frames that come later.
Recurrent neural networks, however, can do this. An RNN contains a loop that allows it to retain what it has learned before.
Fig1. RNN network structure
In the network structure shown above, the rectangular block A takes an input x_t (the feature vector at time t) and produces an output h_t (the state, or output, at time t). The loop in the network allows the state at one time step to be passed on to the next. (Translator's note: the state at the current step becomes part of the input at the next step.)
These loops can make RNNs seem a little hard to understand. But if you think about it for a moment, they are not so different from an ordinary neural network. An RNN can be viewed as the same network copied many times over, with each copy passing its output on to the next. If we unroll the RNN over its time steps, we get the picture below:
Fig2. Unrolled RNN network structure
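To make the unrolling concrete, here is a minimal NumPy sketch (not code from the original post; the function and parameter names are purely illustrative) of a vanilla RNN cell applied step by step over a sequence:

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h, h0):
    """Apply the same vanilla RNN cell to every element of the sequence xs."""
    h = h0
    hs = []
    for x_t in xs:
        # The same weights are reused at every time step; the new state h_t
        # depends on the current input x_t and the previous state h (i.e. h_{t-1}).
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hs.append(h)
    return hs
```

Each iteration of the loop corresponds to one copy of the block A in the unrolled diagram.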
This chain-like structure makes it clear that RNNs are intimately tied to sequences: the architecture seems almost purpose-built for sequence-related problems.
And they really are very useful. In recent years, RNNs have been applied to a wide variety of problems: speech recognition, language modeling, translation, image captioning, and so on. For the amazing successes RNNs have achieved in these areas, see Andrej Karpathy's blog post: The Unreasonable Effectiveness of Recurrent Neural Networks.
Much of this success comes from the use of LSTMs, a special kind of RNN that works far better than the ordinary RNN on many tasks. Essentially all of the recurrent neural networks in use today are LSTMs, and they are the networks explained in the rest of this article.

2. The problem of long-term dependencies
Part of the appeal of RNNs is the idea that they can connect previous information to the present task and thereby help solve it. For example, earlier video frames might help us understand the current frame. If RNNs could really do this they would be extremely useful, but whether they can, I'm afraid, depends on the situation.
Sometimes we only need recent information to handle the current task. For example, consider a language model trying to predict the next word from the words that came before. If we see "the clouds are in the", we can guess that the next word is "sky" without needing any further context. In cases like this, where the gap between the relevant information and the place where it is needed is small, RNNs can make use of the past information quite easily.
Fig3. Short-term dependencies
However, there are also cases that require more context. Suppose we want to predict the final word in "I grew up in France ... (10,000 words omitted here) ... I speak ___". The word to predict should be "French", but to make that prediction correctly we need the information "France" from much earlier in the text, and an ordinary RNN has great difficulty doing this.
As the gap between the point of prediction and the relevant information grows, it becomes very hard for an RNN to connect the two.
Fig4. Long-term dependencies
In theory, with carefully chosen parameters, RNNs are perfectly capable of handling such "long-term dependencies" and solving problems of this kind. Unfortunately, in practice plain RNNs fail to do so. Hochreiter (1991) [German] and Bengio, et al. (1994) studied the problem in depth and found that it is fundamentally hard for RNNs.
Luckily, LSTMs can help us solve this problem.

3. LSTM networks
Long Short-Term Memory networks, usually just called "LSTMs", are a special kind of RNN. They were introduced by Hochreiter & Schmidhuber (1997) and have since been refined and popularized by many people. LSTMs are widely used on all kinds of problems and work tremendously well.
LSTMs are explicitly designed to avoid the long-term dependency problem described above. Remembering information for long periods of time is essentially their default behavior, not something they have to struggle to learn.
All recurrent neural networks have the form of a chain of repeated, identical neural network modules. In an ordinary RNN, the repeating module has a very simple structure, such as a single tanh layer.
Fig5. Internal structure of an ordinary RNN
LSTMs have a similar chain structure (translator's note: the only difference is the repeating module in the middle). But instead of a single tanh layer, the repeating module contains four layers that interact with one another.
Fig6. Internal structure of an LSTM
Don't let this structure scare you; we are going to dissect it and walk through it step by step (with a little patience you will certainly be able to understand it). First, let's define the notation we will be using:
Fig7. Notation
In these diagrams, each line carries a vector, output from one node and fed into another. A pink circle denotes a pointwise operation, such as vector addition; a yellow rectangular box denotes a neural network layer (a set of learned units); two lines merging denotes concatenation of the vectors they carry (for example, a line carrying h_{t-1} and a line carrying x_t merge into a line carrying [h_{t-1}, x_t]); a line forking denotes that the vector is copied and sent to two different places.

3.1 The core idea behind LSTMs
The key to LSTMs is the cell state (the entire green box is one cell), the horizontal line running across the top of the diagram.
The cell state is passed along like a conveyor belt: the vector runs straight down the entire chain, undergoing only a few minor linear interactions. This structure makes it very easy for information to flow through the cell unchanged. (Translator's note: this is what makes long-term memory retention possible.)
Fig8. Conveyor-belt structure
The horizontal line by itself, however, cannot add or remove information. That is accomplished by structures called gates.
A gate selectively lets information through. It consists of a sigmoid neural network layer followed by a pointwise multiplication.
Fig9. Gate structure (sigmoid layer)
Each element of the sigmoid layer's output (a vector) is a real number between 0 and 1, describing the weight (or proportion) of the corresponding piece of information that is allowed through. A value of 0 means "let nothing through", and a value of 1 means "let everything through".
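As a rough illustration (a minimal NumPy sketch, not code from the original post; the names and shapes below are assumptions made for the example), a gate is just a sigmoid layer whose output scales some other vector elementwise:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(W, b, h_prev, x_t, value):
    """Sigmoid layer + pointwise multiplication: decide how much of `value` passes."""
    concat = np.concatenate([h_prev, x_t])   # merge the two incoming lines: [h_{t-1}, x_t]
    weights = sigmoid(W @ concat + b)        # every entry lies between 0 and 1
    return weights * value                   # 0 blocks a component, 1 passes it through fully
```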
An LSTM has three such gates to protect and control the cell state. (Translator's note: they are the "forget gate layer", the "input gate layer", and the "output gate layer".)

3.2 A step-by-step LSTM walkthrough
(At last, the exciting part.)

3.2.1 The forget gate
The first step is for the LSTM to decide which information to keep flowing through the cell, and this decision is made by a sigmoid layer called the "forget gate layer". Its inputs are h_{t-1} and x_t, and its output is a vector of values between 0 and 1 with the same length as the cell state C_{t-1}, giving the proportion of each component of C_{t-1} that is let through. A 0 means "let nothing through" and a 1 means "let everything through".
Returning to the language model mentioned above, where we predict the next word from all of the preceding text: in such a problem the cell state might hold the gender of the current subject (information to retain), so that the correct pronouns can be used later on. When we start describing a new subject, we want to forget the old subject's gender (information to forget).
Fig10. The forget gate
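Written out, the forget gate in the standard LSTM formulation (the equation shown in the figure of the original post) is:

f_t = \sigma( W_f \cdot [h_{t-1}, x_t] + b_f )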
3.2.2 The input gate
The next step is to decide how much new information to add to the cell state. This happens in two parts: first, a sigmoid layer called the "input gate layer" decides which values to update; then a tanh layer produces a vector of candidate values, \tilde{C}_t, that could be added to the state. In the following step we combine these two parts to update the cell state.
Fig11. The input gate
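In the standard formulation these two layers compute:

i_t = \sigma( W_i \cdot [h_{t-1}, x_t] + b_i )
\tilde{C}_t = \tanh( W_C \cdot [h_{t-1}, x_t] + b_C )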
In our language model example, this is where we would add the new subject's gender to the cell state, replacing the old information we are about to forget.
With the structures above we can now update the cell state, that is, turn the old state C_{t-1} into the new state C_t. As the diagram shows at a glance, we first multiply the old state C_{t-1} by f_t, dropping the parts we decided to forget, and then add i_t * \tilde{C}_t, the new candidate values scaled by how much we decided to update each component.
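In equation form (with * denoting pointwise multiplication), this update is:

C_t = f_t * C_{t-1} + i_t * \tilde{C}_t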