Refer to:
https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0
(The fall of RNN/LSTM)
"hierarchical neural attention encoder", shown in the figure below:
Figure: Hierarchical neural attention encoder
A better way to look into the past is to use attention modules to summarize all past encoded vectors into a context vector Ct.
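As a minimal sketch of what one such module does (plain dot-product attention and NumPy are assumed here; the article does not fix a particular scoring function):

```python
# Minimal sketch of one attention module: it summarizes all past encoded
# vectors into a single context vector Ct via dot-product attention.
import numpy as np

def attention_context(query, past_vectors):
    """query: (d,) current state; past_vectors: (T, d) encoded history."""
    scores = past_vectors @ query              # one score per past vector
    weights = np.exp(scores - scores.max())    # softmax over the history
    weights /= weights.sum()
    return weights @ past_vectors              # Ct: weighted sum of the past

# Example: summarize 50 past 16-dim encodings into one context vector Ct.
rng = np.random.default_rng(0)
Ct = attention_context(rng.standard_normal(16), rng.standard_normal((50, 16)))
print(Ct.shape)  # (16,)
```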
Notice there is a hierarchy of attention modules here, very similar to the hierarchy of neural networks. This is also similar to the temporal convolutional network (TCN), reported in Note 3 below.
In the hierarchical neural attention encoder, multiple layers of attention can look at a small portion of the recent past, say 100 vectors, while the layers above can look at 100 of these attention modules, effectively integrating the information of 100 x 100 vectors. This extends the ability of the hierarchical neural attention encoder to 10,000 past vectors.
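A rough sketch of that two-level scheme, with illustrative chunk sizes and dimensions (not prescribed by the article), might look like this:

```python
# Two-level hierarchy: each lower attention module summarizes one chunk of
# recent vectors, and an upper module attends over those summaries, so two
# hops cover chunk_size * n_chunks vectors.
import numpy as np

def attend(query, vectors):
    """Dot-product attention: weighted sum of `vectors` given `query`."""
    scores = vectors @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ vectors

def hierarchical_context(query, history, chunk_size=100):
    """Summarize `history` (T, d) with one attention hop per chunk,
    then one hop over the per-chunk summaries."""
    chunks = [history[i:i + chunk_size] for i in range(0, len(history), chunk_size)]
    summaries = np.stack([attend(query, c) for c in chunks])  # lower layer
    return attend(query, summaries)                           # upper layer

rng = np.random.default_rng(0)
history = rng.standard_normal((10_000, 16))   # 100 chunks of 100 vectors
Ct = hierarchical_context(rng.standard_normal(16), history)
print(Ct.shape)  # (16,) -- one context vector covering 10,000 past steps
```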
This is the way to look back into the past and be able to influence the future.
Importantly, look at the length of the path needed to propagate a representation vector to the output of the network: in hierarchical networks it is proportional to log(N), where N is the number of hierarchy layers. This is in contrast to the T steps that an RNN needs to take, where T is the maximum length of the sequence to be remembered, and T >> N.
It's easier to remember sequences if you hop 3–4 times, as opposed to hopping 100 times!
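A quick back-of-the-envelope check of that path-length argument (the fan-in of 100 per module is an assumed figure):

```python
# With a fan-in of 100 vectors per attention module, a hierarchy reaches
# T past vectors in only a few hops, whereas an RNN needs T sequential steps.
T = 10_000          # sequence length to be remembered
fan_in = 100        # vectors summarized by each attention module (assumption)

hops, reach = 0, 1
while reach < T:    # each extra hierarchy level multiplies the reach by fan_in
    reach *= fan_in
    hops += 1

print(f"hierarchical attention: {hops} hops")   # 2 hops
print(f"plain RNN:              {T} steps")     # 10000 steps
```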
The fall of RNN/LSTM - hierarchical neural attention encoder, temporal convolutional network (TCN)