Recurrent Neural Network -- Implementing LSTM
Vanishing and exploding gradients
Network recap: as described in Recurrent Neural Network -- Introduction, a recurrent neural network processes the data at every time step in the same way.
* Animated diagram: (see the original post for the animation)
Mathematical formula: $h_t = \phi(W_{xh} \cdot x_t + W_{hh} \cdot h_{t-1} + b)$
Design goal: we want the recurrent neural network to pass state information from past time steps into the computation at the current time step.
Practical problem: however, the plain RNN structure has difficulty carrying information across time steps that are far apart.
* Consider: if we follow only the blue arrows, i.e. how the hidden state is passed along, and ignore the nonlinear part, we obtain the simplified formula (1): $h_t = W_{hh} \cdot h_{t-1}$
Passing the hidden-state information of the starting time step $h_0$ all the way to time step $t$ then gives equation (2):
(2) $h_t = (W_{hh})^t \cdot h_0$
$W_{hh}$ is multiplied by itself many times. If we eigendecompose the matrix $W_{hh}$,
(3) $W_{hh} = Q \cdot \Lambda \cdot Q^T$
then equation (2) becomes (4)
(4) $h_t = Q \cdot \Lambda^t \cdot Q^T \cdot h_0$
When an eigenvalue is smaller than 1, raising it to the power $t$ makes the corresponding direction decay toward 0;
when an eigenvalue is greater than 1, raising it to the power $t$ makes the corresponding direction blow up toward $\infty$.
Either way, the information in $h_0$ that we wanted to pass along is drowned out and cannot reach $h_t$.
Analogy: think of $y = a^t \cdot x$. If $a$ equals 0.1, imagine how small $x$ becomes after being multiplied by 0.1 one hundred times; if $a$ equals 5, $x$ becomes enormous after being multiplied by 5 one hundred times. If we want the information carried by $x$ to neither vanish nor explode, we need to keep the value of $a$ as close to 1 as possible. Note: see Chapter 10 of Deep Learning by Ian Goodfellow for more details.
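To make the decay and blow-up concrete, here is a minimal numerical sketch of the $y = a^t \cdot x$ analogy (my own illustration, not code from the original article):

```python
# Repeatedly multiplying by a factor below 1 vanishes the signal,
# while a factor above 1 explodes it; only a factor near 1 preserves it.
x = 1.0
for a in (0.1, 1.0, 5.0):
    y = x * a ** 100                       # multiply x by a one hundred times
    print(f"a = {a}: a**100 * x = {y:.3e}")
# a = 0.1 -> 1.000e-100 (vanishes), a = 1.0 -> 1.000e+00, a = 5.0 -> 7.889e+69 (explodes)
```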
Long Short-Term Memory (LSTM)

The phenomenon above does not necessarily mean the network cannot learn at all, but even when it can, learning becomes very, very slow. To use gradient descent effectively, we want the repeatedly multiplied factor in the gradient (the product of derivatives) to stay close to 1.
One way to achieve this is to build linear self-connections and give the self-connection a weight close to 1; such units are called leaky units. However, the weight of a leaky unit's linear self-connection is set by hand or learned as a fixed parameter, whereas the more effective gated RNNs regulate this weight through gates, so that the weight of the linear self-connection can be adjusted at every time step (see the sketch below). LSTM is one implementation of gated RNNs.
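As a minimal sketch of a leaky unit (my own illustration; the decay coefficient `alpha` and this particular update form are an assumed, common choice):

```python
def leaky_unit(xs, alpha=0.99):
    """A linear self-connection with a weight close to 1 (a 'leaky unit'):
    s_t = alpha * s_{t-1} + (1 - alpha) * x_t.
    With alpha near 1, old information decays slowly instead of vanishing
    or exploding under repeated multiplication."""
    s = 0.0
    states = []
    for x in xs:
        s = alpha * s + (1.0 - alpha) * x
        states.append(s)
    return states

print(leaky_unit([1.0] + [0.0] * 4))  # the initial input lingers in the state
```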
A first look at LSTM

LSTM (like other gated RNNs) takes the standard RNN, $h_t = \phi(W_{xh} \cdot x_t + W_{hh} \cdot h_{t-1} + b)$, and adds a number of gates that control magnitude. You can think of it as one neural network (the RNN) equipped with several additional small neural networks (the gates), and these gates do nothing but control magnitude, i.e. how much information is allowed to flow.
Mathematical formulas: below are the equations of a basic LSTM. Just skim them to get a first impression; there is no need to memorize or fully understand them yet.
$i_t = \mathrm{sigmoid}(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$
$f_t = \mathrm{sigmoid}(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$
$o_t = \mathrm{sigmoid}(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$
$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$
$h_t = o_t \odot \tanh(c_t)$
Although the formulas are not complex, they contain a lot of ideas. Next we analyze them and the reasoning behind them: for example, what $\odot$ means and why it is used, and why the sigmoid function is used.

Understanding the gate
The first step in understanding gated RNNs is to understand what role a gate plays. Physical meaning: a gate can itself be regarded as a small neural network with a very concrete physical meaning.
Input: the gate's input is the control basis, i.e. the information the control decision is based on. Output: the gate's output is a value in the range $(0, 1)$, indicating by how much other data should be scaled. Usage: the output produced by the gate is multiplied into other data to control its magnitude, so the gate plays the role of a filter.
Analogy: you can think of information as water flow; a gate is a valve that controls how much of the flow is allowed through.
For example, when a gate is used to control the vector $[20, 5, 7, 8]$:
if the gate's output is $[0.1, 0.2, 0.9, 0.5]$,
the original vector is multiplied element-wise and becomes
$[20, 5, 7, 8] \odot [0.1, 0.2, 0.9, 0.5] = [20 \cdot 0.1, 5 \cdot 0.2, 7 \cdot 0.9, 8 \cdot 0.5] = [2, 1, 6.3, 4]$; if the gate's output is $[0.5, 0.5, 0.5, 0.5]$,
the original vector is multiplied element-wise and becomes
$[20, 5, 7, 8] \odot [0.5, 0.5, 0.5, 0.5] = [10, 2.5, 3.5, 4]$.
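A minimal numpy sketch of this element-wise gating (my own illustration; the vector and gate values are simply the numbers from the example above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

v = np.array([20.0, 5.0, 7.0, 8.0])     # the information to be filtered
gate = np.array([0.1, 0.2, 0.9, 0.5])   # gate output, each entry in (0, 1)
print(v * gate)                          # [2.  1.  6.3 4. ]

# In a gated RNN the gate output itself is produced by a small network,
# e.g. gate = sigmoid(W_xg @ x_t + W_hg @ h_prev + b); sigmoid keeps it in (0, 1).
```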
Control basis: once the gate's output and how it is used are settled, what remains is to decide what information the control should be based on, that is, what the gate's input is.
Even LSTM itself has many variants, and one class of variants adjusts exactly this: the gate's input. For example, consider the following two kinds of gates:
$g = \mathrm{sigmoid}(W_{xg} \cdot x_t + W_{hg} \cdot h_{t-1} + b)$:
this gate's input consists of the current input $x_t$ and the previous hidden state $h_{t-1}$,
meaning the gate produces its output, the control basis, from these two streams of information.
$g = \mathrm{sigmoid}(W_{xg} \cdot x_t + W_{hg} \cdot h_{t-1} + W_{cg} \cdot c_{t-1} + b)$:
this gate's input consists of the current input $x_t$, the previous hidden state $h_{t-1}$, and the previous cell state $c_{t-1}$,
meaning the gate produces its output, the control basis, from these three streams of information. An LSTM built this way is said to have peephole connections.

Understanding LSTM again
Having understood gates, let us now look back at the LSTM equations.
Mathematical formulas:
$i_t = \mathrm{sigmoid}(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$
$f_t = \mathrm{sigmoid}(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$
$o_t = \mathrm{sigmoid}(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$
$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$
$h_t = o_t \odot \tanh(c_t)$
Gates: first, let us look at the first three equations, $i_t$, $f_t$, $o_t$, as a group. In LSTM, the network first builds three gates to control the flow of information.
Note: although the three gates share the same functional form, the subscripts of their $W$ and $b$ differ. Each gate has its own physical meaning, and each learns its own weights during training.
With these three gates in place, the next question is how to use them to control the information flow of a plain RNN. According to where they act, they are divided into:
* input gate $i_t$: controls how much new information may flow into the memory cell (the $c_t$ in the fourth equation);
* forget gate $f_t$: controls how much of the previous time step's memory cell may be accumulated into the current time step's memory cell;
* output gate $o_t$: controls how much of the current memory cell's information may flow into the current hidden state $h_t$.
Note: the gates do not provide additional information; they only limit the amount of information. Because the gates act as filters, their activation function is the sigmoid rather than tanh.
Information flow: there are only three sources of information: the current input $x_t$, the previous hidden state $h_{t-1}$, and the previous cell state $c_{t-1}$, where $c_{t-1}$ is an additionally created, linearly self-connected unit (recall the leaky units). Strictly speaking, the true sources of information are only two: the current input $x_t$ and the previous hidden state $h_{t-1}$. Both the control basis of the three gates and the candidate update are computed from these two.
Having analyzed the gates and the information flow, we now analyze the remaining two equations to see how LSTM accumulates historical information and computes the hidden state $h$.
Accumulating historical information: formula: $c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$
where $\mathrm{new} = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$ is the source of the information accumulated at this step. Rewriting: $c_t = f_t \odot c_{t-1} + i_t \odot \mathrm{new}$
So the accumulation of historical information does not rely on the hidden state $h$ itself, but on the self-connection of the memory cell.
During accumulation, the forget gate limits how much of the previous memory cell is kept, and the input gate limits how much new information enters. This realizes exactly the idea of the leaky units: the memory cell's self-connection accumulates linearly.
Computing the current hidden state: after all this work, the final step is, just as in a plain RNN, to compute the current hidden state. Formula: $h_t = o_t \odot \tanh(c_t)$
The current hidden state $h_t$ is computed from $c_t$. Because $c_t$ is updated in a purely linear way, it is first passed through the nonlinear function $\tanh$, giving $\tanh(c_t)$.
The output gate $o_t$ then filters this to yield the current hidden state $h_t$. One complete LSTM step is sketched in code below.
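A minimal numpy sketch of one LSTM step following the five equations above (my own illustration; the weight shapes, dictionary layout, and random initialization are assumptions, and a real implementation such as TensorFlow's differs in many details):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W is a dict of weight matrices, b a dict of bias vectors;
    e.g. W['xi'] has shape (hidden, input) and W['hi'] has shape (hidden, hidden)."""
    i_t = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + b['i'])   # input gate
    f_t = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + b['f'])   # forget gate
    o_t = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + b['o'])   # output gate
    new = np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])   # candidate information
    c_t = f_t * c_prev + i_t * new       # linear accumulation in the memory cell
    h_t = o_t * np.tanh(c_t)             # re-add nonlinearity, then filter with o_t
    return h_t, c_t

# Tiny usage example with random weights (illustration only).
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = {k: rng.normal(scale=0.1, size=(n_hid, n_in if k[0] == 'x' else n_hid))
     for k in ('xi', 'hi', 'xf', 'hf', 'xo', 'ho', 'xc', 'hc')}
b = {k: np.zeros(n_hid) for k in ('i', 'f', 'o', 'c')}
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```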
Comparing a plain RNN with LSTM

To deepen our understanding of the core of recurrent neural networks, let us compare a plain RNN with LSTM together with YJango.
Comparing formulas: the biggest difference is the three extra neural networks (gates) that control the flow of data. Plain RNN: $h_t = \tanh(W_{xh} \cdot x_t + W_{hh} \cdot h_{t-1} + b)$
LSTM: $h_t = o_t \odot \tanh(f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c))$
Comparison: in both, the source of new information has the form $\tanh(W_{xh} \cdot x_t + W_{hh} \cdot h_{t-1} + b)$;
the difference is that LSTM relies on three gates to build the accumulation of this information on a linearly self-connected memory cell, and uses the cell as an intermediary to compute the current $h_t$. For contrast with the LSTM step sketched above, one plain-RNN step is sketched below.
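A plain-RNN step, for contrast, is a single line (again my own numpy sketch with assumed shapes):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    """One plain-RNN step: h_t = tanh(W_xh @ x_t + W_hh @ h_prev + b).
    No gates and no memory cell: the hidden state is rewritten nonlinearly at
    every step, which is what makes long-range information hard to preserve."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)
```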
Figure comparison: the figures are taken from Understanding LSTM Networks, which is strongly recommended as companion reading. Plain RNN: (figure not shown here)
LSTM: in the figure, a circle with a plus sign indicates linear addition, and a circle with a multiplication sign indicates that a gate is filtering the information. (figure not shown here)
Comparison: the new information passes through the yellow tanh and is linearly accumulated into the memory cell; the red tanh then re-adds nonlinearity before the result enters the computation of the hidden state $h_t$.
LSTM relies on three gates to build the accumulation of information on a memory cell whose self-connection weight is close to 1, and uses the cell as an intermediary to compute the current $h_t$.

An analogy for remembering LSTM
To remember how LSTM realizes RNN memory, you can compare it to a mobile phone (purely as a memory aid, not an exact one-to-one correspondence).
A plain RNN is like a bare phone screen, while an LSTM RNN is like a phone screen with a protective film.
Accumulating a large amount of historical information through nonlinear steps causes gradients to vanish (or explode), just as a screen that is used directly is easily scratched.
LSTM accumulates information on a linear memory cell and relies on it as an intermediary to compute the current $h_t$, just as the protective film serves as the intermediary through which you use the screen.
The filtering done by the input gate, forget gate, and output gate is like three properties of the protective film: its reflectance, absorptance, and transmittance.

Variants of gated RNNs
To repeat: a neural network is called a network because connections can be created freely, as long as they are reasonable, and the LSTM described above is only the most basic LSTM. As long as a few key points are respected, readers can design their own gated RNNs according to their needs; how well different designs work on different tasks must be validated by experiment. Below, YJango briefly introduces the design directions of several gated-RNN variants as he understands them.
Information flow: a standard RNN has information flowing in two places: the input $x_t$ and the hidden state $h_t$.
In practice, however, there are often more than two information flows; and even when there are only two, they can be split into several parts, with the structural relations between the flows used to add prior knowledge. This reduces the amount of training data required and thereby improves the network's performance.
For example: Tree-LSTM applies exactly this kind of structure to natural language processing tasks.
How the gates are used: as famous as LSTM is the gated recurrent unit (GRU). GRU uses its gates differently from LSTM: it has only two gates, merging LSTM's input gate and forget gate into a single update gate, and it does not build the linear self-update on an extra memory cell; instead, the linear accumulation is built directly on the hidden state and is regulated by the gates, as sketched below.
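A minimal numpy sketch of one GRU step (my own illustration; GRU formulations differ slightly between papers, for example in whether $z_t$ or $1 - z_t$ weights the old state, and the weight shapes here are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, b):
    """One GRU step: two gates, no separate memory cell.
    z_t (update gate) plays the combined role of LSTM's input and forget gates;
    r_t (reset gate) controls how much of h_prev feeds the candidate."""
    z_t = sigmoid(W['xz'] @ x_t + W['hz'] @ h_prev + b['z'])          # update gate
    r_t = sigmoid(W['xr'] @ x_t + W['hr'] @ h_prev + b['r'])          # reset gate
    new = np.tanh(W['xh'] @ x_t + W['hh'] @ (r_t * h_prev) + b['h'])  # candidate
    return (1.0 - z_t) * h_prev + z_t * new  # linear mix directly on the hidden state
```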
The gates' control basis: the three gates in the LSTM described above are all based on $W_x x_t + W_h h_{t-1}$, but the control basis can be enlarged by adding a connection to the memory cell, or reduced by removing a gate's $W_x x_t$ or $W_h h_{t-1}$ term. For example, removing the $x_t$ from $z_t = \mathrm{sigmoid}(W_z \cdot [h_{t-1}, x_t])$ in the figure above gives $z_t = \mathrm{sigmoid}(W_z \cdot h_{t-1})$.
This concludes Recurrent Neural Network -- Implementing LSTM.
The next article, the third in the series, Recurrent Neural Network -- Code, will use TensorFlow to implement the network content described here from scratch.