Some time ago I read through the LSTM line of papers and meant to write up my notes right away, but other things kept intervening and it has dragged on until now, with my memory of them already fading, so I am hurrying to get this down. The article is organized as follows: first the problems with BPTT in RNNs, then the original LSTM structure, then the introduction of the forget gate, and finally the LSTM with peephole connections added, in the chronological order in which they were actually proposed. The article amounts to notes summarizing the core part of each paper, meant to provide a quick overview.
I. Problems with the BPTT learning algorithm of RNNs
First, take a look at a typical unfolded BPTT structure, shown below; only part of the diagram matters here, since the rest is not what is being discussed.
The error signal at time t is computed as follows:
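(The formula images are missing from this copy; the reconstructions below follow the notation of Hochreiter & Schmidhuber (1997), where $w_{ij}$ is the weight on the connection from unit j to unit i.) For a non-output unit j the error signal is:

$$ \vartheta_j(t) = f_j'(net_j(t)) \sum_{i} w_{ij}\, \vartheta_i(t+1) $$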
The weights are then updated as follows:
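(Presumably the standard BPTT update, with learning rate $\alpha$ and $y^l(t-1)$ the activation of unit l at the previous step:)

$$ \Delta w_{jl}(t) = \alpha\, \vartheta_j(t)\, y^{l}(t-1) $$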
The formulas above are standard in BPTT. Now follow the error signal back into the past: take an arbitrary unit u at time t and an arbitrary unit v at time t−q, and ask how the error occurring at u scales when it is propagated back to v. The error transmission relation can be written recursively as follows:
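(Reconstructed from the 1997 paper:)

$$ \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} = \begin{cases} f_v'(net_v(t-1))\, w_{uv} & q = 1 \\[4pt] f_v'(net_v(t-q)) \displaystyle\sum_{l=1}^{n} \frac{\partial \vartheta_l(t-q+1)}{\partial \vartheta_u(t)}\, w_{lv} & q > 1 \end{cases} $$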
Here n is the number of units in the network. The general meaning of this recursion is not hard to understand: to get the derivative of the error signal at time t−q with respect to the error signal at time t, first compute the derivative for time t−q+1, then propagate that result one step further back to t−q; the recursion bottoms out at q = 1, which is just the base case written at the start. Expanding the recursion gives:
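(Again a reconstruction; with $l_0 = u$ and $l_q = v$:)

$$ \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} = \sum_{l_1=1}^{n} \cdots \sum_{l_{q-1}=1}^{n}\; \prod_{m=1}^{q} f'_{l_m}(net_{l_m}(t-m))\, w_{l_{m-1} l_m} $$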
The paper says this can be proved by induction, which I have not scrutinized carefully, but looking at a single factor of the expansion makes the point easy to see. Write one factor of the product as $T = f'_{l_m}(net_{l_m}(t-m))\, w_{l_{m-1} l_m}$; the outer sums contribute n^(q−1) product terms in total, each a product of q such factors. Now it is clear where the problem lies.
If |T| > 1, the error grows exponentially as q increases, and the network parameter updates can oscillate violently.
If |T| < 1, the error vanishes and learning becomes ineffective. The usual activation function is the sigmoid, whose derivative has a maximum value of 0.25, so any weight with magnitude below 4 already guarantees |T| < 1.
Exponential growth of the error is comparatively rare; the error vanishing is the common case in BPTT. The original paper contains a more detailed mathematical analysis, but this much understanding is enough to see the problem.
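To make the decay concrete, here is a tiny numeric sketch of the |T| < 1 case (my own illustration, not from the paper): an error signal repeatedly scaled by T = f'(net) * w for a sigmoid unit with |w| < 4.

import numpy as np

def sigmoid_deriv(net):
    # derivative of the logistic sigmoid; its maximum value is 0.25 at net = 0
    s = 1.0 / (1.0 + np.exp(-net))
    return s * (1.0 - s)

w = 3.0        # |w| < 4, so |T| = |f'(net) * w| < 1 no matter what net is
net = 0.0      # best case for the error: f'(0) = 0.25, hence T = 0.75
error = 1.0
for q in range(1, 51):
    error *= sigmoid_deriv(net) * w   # one step further back in time
    if q % 10 == 0:
        print(f"q = {q:2d}, error = {error:.3e}")
# the error shrinks as 0.75**q and has effectively vanished by q = 50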
II. The original LSTM structure
To overcome the vanishing-error problem, some restrictions are needed. Assume first that there is only a single neuron, connected to itself, as shown in the following diagram:
By the formula above, the error signal at time t is then computed as follows:
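(Reconstructed:)

$$ \vartheta_j(t) = f_j'(net_j(t))\, w_{jj}\, \vartheta_j(t+1) $$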
To keep the error unchanged over time, this factor can be forced to equal 1:
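$$ f_j'(net_j(t))\, w_{jj} = 1 $$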
Integrating this equation, one obtains:
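(Reconstructing the missing formula from the 1997 paper:)

$$ f_j(net_j(t)) = \frac{net_j(t)}{w_{jj}} $$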
This means the activation function must be linear; one typically sets f_j(x) = x and w_jj = 1.0, obtaining constant error flow, also known as the CEC (constant error carousel).
But things are not that simple, because the cell's input and output weights suffer from conflicting update signals (the input weight conflict and output weight conflict; I find the original paper's explanation here somewhat unclear). To resolve this contradiction, two gates are added, an input gate and an output gate, giving the structure in the figure below:
The figure adds the two gates. The so-called gating means: the input to the CEC is multiplied by the output of the input gate, and the output of the CEC is multiplied by the output of the output gate. The whole enclosing box is called a block; the small circle in the middle is the CEC, and the y = x written inside indicates that this neuron's activation function is linear, with a self-connection weight of 1.0.
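In the notation of the 1997 paper (my reconstruction, writing $y^{in}(t)$ and $y^{out}(t)$ for the two gate outputs and $s_c(t)$ for the CEC state), the block then computes:

$$ s_c(t) = s_c(t-1) + y^{in}(t)\, g(net_c(t)), \qquad y^c(t) = y^{out}(t)\, h(s_c(t)) $$

where $g$ squashes the cell input and $h$ squashes the cell state; the CEC keeps its linear self-connection with weight 1.0.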
III. Adding the forget gate
One disadvantage of the initial LSTM structure is that the state value of the CEC may keep growing without bound; after adding the forget gate, the CEC state can be controlled. Its structure is as follows:
Here the self-connection weight is effectively no longer a fixed 1.0 but a dynamic value: the output of the forget gate. This dynamic value controls the CEC state; when necessary it can be driven to 0, which is the forgetting, and when it is 1 the structure behaves like the original one.
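Writing $y^{\varphi}(t)$ for the forget gate output, the state update becomes (as in Gers, Schmidhuber & Cummins, 2000):

$$ s_c(t) = y^{\varphi}(t)\, s_c(t-1) + y^{in}(t)\, g(net_c(t)) $$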
IV. The LSTM structure with peephole connections
A disadvantage of the forget-gate version is that the current CEC state cannot influence the outputs of the input gate and forget gate at the next time step, so peephole connections are added. The structure is as follows:
The gates gain an extra input source here: the input sources of the input gate and the forget gate are augmented with the CEC output of the previous time step, and the input source of the output gate is augmented with the CEC output of the current time step. The order of computation must therefore be guaranteed: first the inputs and outputs of the input gate and forget gate together with the cell input (giving the new cell state), then the input and output of the output gate, and finally the output of the cell (which is also the output of the block).
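To summarize this computation order, here is a minimal NumPy sketch of one forward step of a peephole LSTM layer; the weight layout W and the function name lstm_step are hypothetical, chosen for illustration only:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, b_prev, s_prev, W):
    """One forward step of a peephole LSTM layer with C cells.
    x: input at time t; b_prev, s_prev: block outputs and cell states at t-1;
    W: dict with matrices 'i', 'f', 'c', 'o' of shape (C, len(x)+C) and
       peephole vectors 'ci', 'cf', 'co' of shape (C,)."""
    z = np.concatenate([x, b_prev])            # shared input to every unit
    # 1) input and forget gates peek at the PREVIOUS cell state
    b_i = sigmoid(W['i'] @ z + W['ci'] * s_prev)
    b_f = sigmoid(W['f'] @ z + W['cf'] * s_prev)
    # 2) cell input, then the new state (forget gate scales the old state)
    s = b_f * s_prev + b_i * np.tanh(W['c'] @ z)
    # 3) output gate peeks at the CURRENT cell state
    b_o = sigmoid(W['o'] @ z + W['co'] * s)
    # 4) block output
    b = b_o * np.tanh(s)
    return b, s

# example: 3 inputs, 2 cells, random weights
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((2, 5)) for k in ('i', 'f', 'c', 'o')}
W.update({k: rng.standard_normal(2) for k in ('ci', 'cf', 'co')})
b, s = lstm_step(rng.standard_normal(3), np.zeros(2), np.zeros(2), W)

Note the peephole weights act elementwise: each gate peeks only at its own cell's state, which is why they are vectors rather than full matrices.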
V. Full BPTT derivation of LSTM (with error signals)

I remember that when I read the formula derivations in the papers, many places were hard to follow. Eventually, after some casual Googling, I found a well-written courseware PDF covering similar material, though I never learned its source; it presents the LSTM forward computation and the error backpropagation updates very readably. Here is its LSTM section. First, the complete structure of the network is as follows:
This structure is also the LSTM used in the RWTHLM source package. The notation of the formulas is as follows: w_ij denotes the connection weight from neuron i to neuron j (note that this is reversed relative to many papers); a denotes a neuron's input and b its output; the subscripts ι, φ and ω stand for the input gate, forget gate and output gate respectively; the subscript c stands for the cell, and the peephole weights from the cell to the input, forget and output gates are written w_cι, w_cφ and w_cω; s_c denotes the state of cell c; f is the activation function of the gates, while g and h are the cell's input and output activation functions; I is the number of neurons in the input layer, K the number of neurons in the output layer, and H the number of cells in the hidden layer. The forward pass is:
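(The formula block itself did not survive the copy; the notation matches Graves' thesis "Supervised Sequence Labelling with Recurrent Neural Networks", so the forward pass presumably reads, with C the number of memory cells:)

Input gates:
$$ a_\iota^t = \sum_{i=1}^{I} w_{i\iota} x_i^t + \sum_{h=1}^{H} w_{h\iota} b_h^{t-1} + \sum_{c=1}^{C} w_{c\iota} s_c^{t-1}, \qquad b_\iota^t = f(a_\iota^t) $$

Forget gates:
$$ a_\varphi^t = \sum_{i=1}^{I} w_{i\varphi} x_i^t + \sum_{h=1}^{H} w_{h\varphi} b_h^{t-1} + \sum_{c=1}^{C} w_{c\varphi} s_c^{t-1}, \qquad b_\varphi^t = f(a_\varphi^t) $$

Cells:
$$ a_c^t = \sum_{i=1}^{I} w_{ic} x_i^t + \sum_{h=1}^{H} w_{hc} b_h^{t-1}, \qquad s_c^t = b_\varphi^t s_c^{t-1} + b_\iota^t g(a_c^t) $$

Output gates:
$$ a_\omega^t = \sum_{i=1}^{I} w_{i\omega} x_i^t + \sum_{h=1}^{H} w_{h\omega} b_h^{t-1} + \sum_{c=1}^{C} w_{c\omega} s_c^t, \qquad b_\omega^t = f(a_\omega^t) $$

Cell outputs:
$$ b_c^t = b_\omega^t\, h(s_c^t) $$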
The error backpropagation updates:
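(The derivation block is also missing; in the same notation it presumably reads, defining $\epsilon_c^t = \partial \mathcal{L} / \partial b_c^t$, $\epsilon_s^t = \partial \mathcal{L} / \partial s_c^t$ and $\delta_j^t = \partial \mathcal{L} / \partial a_j^t$ for a loss $\mathcal{L}$:)

$$ \epsilon_c^t = \sum_{k=1}^{K} w_{ck}\, \delta_k^t + \sum_{g=1}^{G} w_{cg}\, \delta_g^{t+1} $$

(the second sum runs over the G hidden-layer units that receive the cell output at the next step). Then, working backwards through the block:

Output gates: $$ \delta_\omega^t = f'(a_\omega^t) \sum_{c=1}^{C} h(s_c^t)\, \epsilon_c^t $$

States: $$ \epsilon_s^t = b_\omega^t\, h'(s_c^t)\, \epsilon_c^t + b_\varphi^{t+1}\, \epsilon_s^{t+1} + w_{c\iota}\, \delta_\iota^{t+1} + w_{c\varphi}\, \delta_\varphi^{t+1} + w_{c\omega}\, \delta_\omega^t $$

Cells: $$ \delta_c^t = b_\iota^t\, g'(a_c^t)\, \epsilon_s^t $$

Forget gates: $$ \delta_\varphi^t = f'(a_\varphi^t) \sum_{c=1}^{C} s_c^{t-1}\, \epsilon_s^t $$

Input gates: $$ \delta_\iota^t = f'(a_\iota^t) \sum_{c=1}^{C} g(a_c^t)\, \epsilon_s^t $$

Finally each weight is updated from its accumulated gradient, $\partial \mathcal{L} / \partial w_{ij} = \sum_t \delta_j^t\, b_i^t$.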
Original address: http://blog.csdn.net/a635661820/article/details/45390671