In theory, as long as the RNN is large enough, it can generate arbitrarily complex sequences.
In practice, however, a standard RNN is not good at preserving information over long spans. Because the same transformation is applied to the hidden state at every time step (much like the transition structure of an HMM), the signal either explodes or decays exponentially, so information from the past is quickly lost. This "forgetful" character also makes the sequences such an RNN generates unstable: if the next step can only be predicted from the last few outputs, and each new prediction is fed back in to predict the step after that, then a single mistake sends the system off in the wrong direction, with almost no chance of drawing on earlier information to correct the error.
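A tiny numerical sketch of this effect (purely illustrative, not from the original notes): when the same linear map is applied to the hidden state at every step, the state's norm shrinks or blows up at a rate set by the scale of that map.

    import numpy as np

    rng = np.random.default_rng(0)
    h0 = rng.normal(size=64)

    for scale in (0.9, 1.1):                      # slightly contractive vs. slightly expansive recurrence
        Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
        W = scale * Q                             # orthogonal matrix times a scale: every singular value equals `scale`
        h = h0.copy()
        for _ in range(100):                      # the same transformation, applied 100 times
            h = W @ h
        print(scale, np.linalg.norm(h))           # tiny for 0.9 (information lost), huge for 1.1 (explosion)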
From this perspective, if the RNN had a "long-term memory", it would be much more stable: even when it is unsure whether the last few steps were correct, it could still draw some "inspiration" from earlier information to form the next prediction.
(One might object that when training the RNN you can add noise or use other tricks to keep it stable on strange inputs, but we still feel that introducing a better memory mechanism is the more efficient move and the more promising one in the long run.)
LSTM
LSTM stands for Long Short-Term Memory. It is a structure that dates back to around 1997.
The design of this structure is quite delicate: it contains an input gate, a forget gate, and an output gate. These three gates are controlled by the data flowing through the unit and take values in {0, 1} (in practice the values are approximated with a sigmoid or tanh function, because discrete 0s and 1s are not differentiable). The input gate, output gate, forget gate, and the cell are all vectors of the same size as h_t.
For example:
When the input gate is 0, the forget gate is 1, and the output gate is 1, the LSTM unit ignores the current input and simply emits the previously recorded memory again (similar to a read-only access).
When the input gate is 1, the forget gate is 0, and the output gate is 1, the LSTM unit wipes its previous "memory", keeps only the information carried from x_t to h_t, and records that as the new memory (similar to a refresh).
When the input gate is 1, the forget gate is 1, and the output gate is 0, the LSTM unit adds the current input to its memory but does not pass anything on for now (similar to a store).
And so on.
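A toy sketch of these three settings, with the gates hard-set to 0/1 vectors instead of sigmoid outputs (the names and the simplified candidate are illustrative; a real cell also mixes in h_{t-1} through learned weights):

    import numpy as np

    def lstm_step_hard(x_t, c_prev, i_gate, f_gate, o_gate):
        # Toy LSTM step with hard 0/1 gates; the candidate is just tanh(x_t) for simplicity.
        candidate = np.tanh(x_t)
        c_t = f_gate * c_prev + i_gate * candidate   # forget gate keeps old memory, input gate admits new info
        h_t = o_gate * np.tanh(c_t)                  # output gate decides how much of the memory is exposed
        return h_t, c_t

    x, c = np.ones(4), np.full(4, 2.0)               # current input and previously stored memory
    ones, zeros = np.ones(4), np.zeros(4)

    print(lstm_step_hard(x, c, zeros, ones, ones))   # "read-only": ignores x_t, re-emits the old memory
    print(lstm_step_hard(x, c, ones, zeros, ones))   # "refresh": wipes the memory, keeps only x_t
    print(lstm_step_hard(x, c, ones, ones, zeros))   # "store": adds x_t to the memory but outputs nothing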
If this still is not clear enough, it helps to look at the update formulas that connect them (where σ(x) denotes the sigmoid function).
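The formula image does not carry over here; for reference, a standard statement of the LSTM updates with peephole connections (the formulation these notes appear to describe) is:

    i_t = σ( W_xi x_t + W_hi h_{t-1} + W_ci c_{t-1} + b_i )
    f_t = σ( W_xf x_t + W_hf h_{t-1} + W_cf c_{t-1} + b_f )
    c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh( W_xc x_t + W_hc h_{t-1} + b_c )
    o_t = σ( W_xo x_t + W_ho h_{t-1} + W_co c_t + b_o )
    h_t = o_t ⊙ tanh( c_t )

Here i_t, f_t, o_t, and c_t are the input gate, forget gate, output gate, and cell vectors, all the same size as h_t, and ⊙ denotes elementwise multiplication.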
The cell-to-gate weight matrices W_ci, W_cf, and W_co are diagonal, which means element m of each gate vector receives cell input only from element m of the cell vector; the dimensions do not interfere with one another.
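A minimal NumPy sketch of one forward step through these equations, with the diagonal cell-to-gate matrices stored as plain vectors and applied elementwise (all parameter names here are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, p):
        # One LSTM step; the diagonal W_ci, W_cf, W_co act as elementwise products (w_c* * c).
        i = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["w_ci"] * c_prev + p["b_i"])
        f = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["w_cf"] * c_prev + p["b_f"])
        c = f * c_prev + i * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])
        o = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["w_co"] * c + p["b_o"])
        h = o * np.tanh(c)
        return h, c

    n_in, n_hid = 3, 5
    rng = np.random.default_rng(1)
    p = {k: rng.normal(scale=0.1, size=(n_hid, n_in)) for k in ("W_xi", "W_xf", "W_xc", "W_xo")}
    p.update({k: rng.normal(scale=0.1, size=(n_hid, n_hid)) for k in ("W_hi", "W_hf", "W_hc", "W_ho")})
    p.update({k: rng.normal(scale=0.1, size=n_hid)
              for k in ("w_ci", "w_cf", "w_co", "b_i", "b_f", "b_c", "b_o")})
    h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), p)
    print(h.shape, c.shape)    # (5,) (5,)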
The original LSTM algorithm used a custom approximate gradient method so that the weights could be updated at every point in time, but the exact gradient can also be computed with backpropagation through time, and that is the method used here. One problem still has to be solved: some of the derivatives become very large, which makes the computation difficult. To prevent this, all of the experiments in the paper clip the derivatives of the loss with respect to the inputs of the LSTM layer (that is, the values before the sigmoid and tanh are applied) to a fixed range. (More on this later.)
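A sketch of this kind of derivative clipping (the bound of 1.0 is an illustrative choice, not a value taken from the paper):

    import numpy as np

    def clip_derivatives(grad, bound=1.0):
        # Limit each component of the backpropagated derivative to [-bound, bound]
        # so that a few huge values cannot destabilise the weight updates.
        return np.clip(grad, -bound, bound)

    g = np.array([0.3, -7.2, 150.0, -0.01])   # derivatives w.r.t. the LSTM layer inputs
    print(clip_derivatives(g))                # [ 0.3  -1.    1.   -0.01]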