The Fall of RNN/LSTM


We fell for recurrent neural networks (RNN), long short-term memory (LSTM), and all their variants. Now it is time to drop them!

It is the year 2014 and LSTM and RNN make a great comeback from the dead. We all read Colah's blog and Karpathy's ode to RNN. But we were all young and inexperienced. For a few years this was the way to solve sequence learning and sequence translation (seq2seq), which also produced amazing results in speech-to-text comprehension and the rise of Siri, Cortana, Google voice assistant, and Alexa. Let us also not forget machine translation, which gave us the ability to translate documents into different languages (neural machine translation), but also to translate images into text, text into images, and video, and ... you get the idea.

Then in the following years (2015–16) came ResNet and attention. One could then better understand that LSTM were a clever bypass technique. Attention also showed that an MLP network could be replaced by an averaging network that pools inputs into a context vector. More on this later.
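As a rough sketch of that idea (plain NumPy, with names and shapes of my own choosing, not code from the post): an attention module scores each past vector against a query, normalizes the scores into weights, and returns the weighted average as a context vector.

```python
import numpy as np

def attention_context(query, keys, values):
    """Attention as a weighted average: score the past, softmax, then average.

    query:  (d,)    what the current step is looking for
    keys:   (T, d)  encoded past vectors used for scoring
    values: (T, d)  encoded past vectors being summarized
    Returns the context vector of shape (d,).
    """
    scores = keys @ query / np.sqrt(len(query))   # similarity of each past step to the query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax -> attention weights summing to 1
    return weights @ values                       # weighted average = context vector

# Toy usage: summarize 5 past vectors of size 8 into one context vector.
rng = np.random.default_rng(0)
print(attention_context(rng.normal(size=8),
                        rng.normal(size=(5, 8)),
                        rng.normal(size=(5, 8))).shape)   # (8,)
```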

It only took 2 years, but we can definitely say: "Drop your RNN and LSTM, they are no good!"

But do not take our word for it; see also the evidence that attention-based networks are used more and more by Google and Salesforce, to name a few. All these companies have replaced RNN and variants with attention-based models, and this is just the beginning. RNN have their days counted in all applications, because they require more resources to train and run than attention-based models. See this post for more info. But why?

Remember that RNN, LSTM, and their derivatives use mainly sequential processing over time. See the horizontal arrow in the diagram below.

[Figure: sequential processing in RNN, from http://colah.github.io/posts/2015-08-Understanding-LSTMs/]

This arrow means that long-term information has to travel sequentially through all cells before getting to the present processing cell. This means it can be easily corrupted by being multiplied many times by small numbers less than 1. This is the cause of vanishing gradients.
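A toy numeric illustration of that effect (numbers are mine): a signal that is repeatedly scaled by a factor smaller than 1 shrinks exponentially with the number of sequential steps it travels through.

```python
# Repeatedly scaling by a per-step factor < 1, as happens along a long recurrent chain.
factor = 0.9                      # assumed per-step gain, purely illustrative
for steps in (10, 100, 1000):
    print(steps, factor ** steps)
# 10    0.348...
# 100   2.65e-05
# 1000  1.75e-46  -> the contribution of information 1000 steps back is effectively zero
```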

To the rescue came the LSTM module, which today can be seen as multiple switch gates, and a bit like ResNet it can bypass units and thus remember for longer time steps. LSTM thus have a way to remove some of the vanishing gradient problem.

[Figure: sequential processing in LSTM, from http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
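For reference, a minimal sketch of the standard LSTM step written out as its four gates (NumPy, my own variable names). The additive update of the cell state c is the ResNet-like bypass mentioned above: when the forget gate stays close to 1, the cell state is carried forward almost unchanged.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. W: (4h, d), U: (4h, h), b: (4h,) hold the four
    linear layers (input, forget, output, candidate) stacked together."""
    z = W @ x + U @ h + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # the three "switch" gates
    g = np.tanh(g)                                # candidate update
    c_new = f * c + i * g                         # additive bypass of the cell state
    h_new = o * np.tanh(c_new)                    # exposed hidden state
    return h_new, c_new
```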

But that is not all of it, as you can see in the figure above. We still have a sequential path from older past cells to the current one. In fact the path is now even more complicated, because it has additive and forget branches attached to it. No question, LSTM and GRU and derivatives are able to learn a lot of longer-term information (see results here), but they can remember sequences of 100s, not 1000s or 10,000s or more.

Another issue of RNN is that they are not hardware friendly. Let me explain: it takes a lot of resources we do not have to train these networks fast. It also takes a lot of resources to run them in the cloud, and given that demand for speech-to-text is growing rapidly, the cloud is not scalable. We will need to process at the edge, right inside the Amazon Echo! See below for more details. So what to do?

If sequential processing is to be avoided, then we can find units that "look ahead" or, better, "look back", since most of the time we deal with real-time causal data where we know the past and want to affect future decisions. Not so when translating sentences or analyzing recorded videos, for example, where we have all the data and can reason on it for more time. Such look-back/ahead units are neural attention modules, which we previously discussed here.

To the rescue, combining multiple neural attention modules, comes the "hierarchical neural attention encoder", shown in the figure below.

[Figure: hierarchical neural attention encoder]

A better way to look into the past is to use attention modules to summarize all past encoded vectors into a context vector Ct.

Notice that there is a hierarchy of attention modules here, very similar to the hierarchy of neural networks. This is also similar to the temporal convolutional network (TCN) reported in Note 3 below.

In the hierarchical neural attention encoder, multiple layers of attention can look at a small portion of the recent past, say 100 vectors, while layers above can look at 100 of these attention modules, effectively integrating the information of 100 x 100 vectors. This extends the ability of the hierarchical neural attention encoder to 10,000 past vectors. This is the way to look back into the past and be able to influence the future.
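A minimal sketch of this two-level look-back (NumPy, function names and sizes are mine, not the author's code): attend within each window of recent vectors, then attend over the window summaries.

```python
import numpy as np

def attend(query, mem):
    """Summarize a memory of shape (T, d) into one vector with attention weights."""
    scores = mem @ query / np.sqrt(mem.shape[1])
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ mem

def hierarchical_context(past, query, window=100):
    """Two attention hops: one inside each window of `window` past vectors,
    one over the window summaries. Two hops cover window**2 past steps."""
    summaries = np.stack([attend(query, past[s:s + window])
                          for s in range(0, len(past), window)])
    return attend(query, summaries)   # final context vector

# Toy usage: 10,000 past vectors of size 32 reached with only two attention hops.
rng = np.random.default_rng(0)
print(hierarchical_context(rng.normal(size=(10_000, 32)), rng.normal(size=32)).shape)  # (32,)
```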

But more importantly, look at the length of the path needed to propagate a representation vector to the output of the network: in hierarchical networks it is proportional to log(N), where N is the number of hierarchy layers. This is in contrast to the T steps that an RNN needs to take, where T is the maximum length of the sequence to be remembered, and T >> N. It is easier to remember sequences if you hop 3–4 times, as opposed to hopping hundreds of times!
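To make the hop count concrete (numbers are mine, reusing the 100-vector windows from above): the number of hops an attention hierarchy needs grows only logarithmically with the span it covers, while an RNN has to step through every position.

```python
import math

T = 10_000        # how far back we want information to travel
window = 100      # past vectors each attention layer summarizes
hops_attention = round(math.log(T, window))   # hierarchy depth ~ log_window(T)
hops_rnn = T                                  # worst-case sequential steps in an RNN
print(hops_attention, hops_rnn)               # 2 vs 10000
```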

This architecture is similar to a neural Turing machine, but lets the neural network decide what is read out from memory via attention. This means an actual neural network will decide which vectors from the past are important.

But what about storing to memory? The architecture above stores all previous representations in memory, unlike neural Turing machines. This can be rather inefficient: think about storing the representation of every frame in a video. Most of the time the representation vector does not change frame-to-frame, so we really are storing too much of the same! What we can do is add another unit to prevent correlated data from being stored, for example by not storing vectors too similar to previously stored ones (a sketch of such a filter follows below). But this is really a hack; the best would be to let the application guide which vectors should be saved or not. This is the focus of current research studies. Stay tuned for more information.

So, in summary: forget RNN and variants. Use attention. Attention really is all you need!
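Going back to the memory-write point above, here is a rough sketch of the "skip near-duplicates" filter (a plain cosine-similarity threshold of my own choosing, purely illustrative):

```python
import numpy as np

def maybe_store(memory, vec, threshold=0.95):
    """Append vec to memory only if it is not too similar to anything already stored."""
    v = vec / np.linalg.norm(vec)
    if memory:
        stored = np.stack(memory)
        sims = stored @ v / np.linalg.norm(stored, axis=1)   # cosine similarities
        if sims.max() > threshold:
            return memory                                    # near-duplicate: skip it
    memory.append(vec)
    return memory

# Toy usage: 100 almost identical "video frames" collapse to a handful of stored vectors.
rng = np.random.default_rng(0)
frame, memory = rng.normal(size=16), []
for _ in range(100):
    memory = maybe_store(memory, frame + 0.01 * rng.normal(size=16))
print(len(memory))   # far fewer than 100
```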

Tell your friends! It is very surprising to us to see so many companies still using RNN/LSTM for speech-to-text, many unaware that these networks are so inefficient and not scalable. Please tell them about this post.

Additional information

About training RNN/LSTM: RNN and LSTM are difficult to train because they require memory-bandwidth-bound computation, which is the worst nightmare for hardware designers and ultimately limits the applicability of neural network solutions. In short, LSTM require 4 linear layers (MLP layers) per cell, run at each and every sequence time-step. Linear layers require large amounts of memory bandwidth to be computed; in fact they often cannot use many compute units because the system does not have enough memory bandwidth to feed them. And it is easy to add more compute units, but hard to add more memory bandwidth (not enough lines on a chip, long wires from processors to memory, etc.). As a result, RNN/LSTM and variants are not a good match for hardware acceleration, and we talked about this issue before here and here. A solution would be to compute in memory devices, like the ones we work on at FWDNXT.

Notes
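A back-of-the-envelope calculation of that point (sizes are mine: hidden size 1024, input size equal to hidden, fp32, batch size 1): at every time step all four gate weight matrices must be re-read from memory, yet each weight is used for only one multiply-add.

```python
hidden = 1024                                  # assumed hidden size; input size taken equal
weights = 4 * (hidden * hidden + hidden * hidden + hidden)   # 4 gates x (input + recurrent + bias)
bytes_per_step = 4 * weights                   # fp32: all weights re-read every time step
flops_per_step = 2 * 4 * 2 * hidden * hidden   # 4 gates x 2 matvecs x (multiply + add)
print(f"{weights / 1e6:.1f} M parameters, {bytes_per_step / 1e6:.1f} MB read per step, "
      f"{flops_per_step / 1e6:.1f} MFLOPs per step, "
      f"{flops_per_step / bytes_per_step:.1f} FLOPs per byte")
# ~8.4 M parameters, ~33.6 MB and ~16.8 MFLOPs per step, i.e. ~0.5 FLOPs per byte:
# far below what modern accelerators need to keep their compute units busy.
```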

Note 1: hierarchical neural attention is similar to the ideas in WaveNet, but instead of a convolutional neural network we use hierarchical attention modules. Also, hierarchical neural attention can be bi-directional.

Note 2: RNN and LSTM are memory-bandwidth-limited problems (see here for details). The processing units need as much memory bandwidth as the number of operations per second they can provide, making it impossible to fully utilize them! The external bandwidth is never going to be enough, and a way to slightly ameliorate the problem is to use internal fast caches with high bandwidth. The best way is to use techniques that do not require large amounts of parameters to be moved back and forth from memory, or that can be re-used for multiple computations per byte transferred (high arithmetic intensity).
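A small illustration of arithmetic intensity for the stacked LSTM gate matmul (sizes are mine): at batch size 1 each weight byte supports roughly half a FLOP, and only large batches raise the re-use per byte, which is exactly what low-latency sequential inference does not allow.

```python
def arithmetic_intensity(hidden=1024, batch=1, bytes_per_value=4):
    """FLOPs per byte moved for the stacked (4h x h) gate matmul at a given batch size."""
    flops = 2 * (4 * hidden) * hidden * batch                # multiply-adds
    bytes_moved = bytes_per_value * ((4 * hidden) * hidden   # weights
                                     + hidden * batch        # activations in
                                     + (4 * hidden) * batch) # activations out
    return flops / bytes_moved

for b in (1, 8, 64, 512):
    print(b, round(arithmetic_intensity(batch=b), 1))
# 1: 0.5, 8: 4.0, 64: 29.7, 512: 157.5 -> intensity grows with batch size,
# because the same weights are re-used across the whole batch.
```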

Note 3: here is a paper comparing CNN to RNN. Temporal convolutional networks (TCN) "outperform canonical recurrent networks such as LSTMs across a range of tasks and datasets, while demonstrating longer effective memory".
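A minimal TCN-style building block, in case the idea is unfamiliar (PyTorch, my own layer sizes, not the paper's code): a causal dilated 1-D convolution, stacked so that the receptive field grows exponentially with depth while every layer runs in parallel over time.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv(nn.Module):
    """One TCN-style layer: a 1-D convolution that only looks back in time."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation    # pad the past only, keeping causality
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                               # x: (batch, channels, time)
        return torch.relu(self.conv(F.pad(x, (self.left_pad, 0))))

# Toy usage: 8 layers with dilations 1, 2, 4, ..., 128 see roughly 2**8 = 256 past steps.
tcn = nn.Sequential(*[CausalDilatedConv(16, dilation=2 ** i) for i in range(8)])
print(tcn(torch.randn(1, 16, 1000)).shape)   # torch.Size([1, 16, 1000])
```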

Note 4: related to this topic is the fact that we know little about how our human brain learns and remembers sequences. "We often learn and recall long sequences in smaller segments, such as a phone number 858 534 memorized as four segments. Behavioral experiments suggest that humans and some animals employ this strategy of breaking down cognitive or behavioral sequences into chunks in a wide variety of tasks." These chunks remind me of small convolutional or attention-like networks operating on smaller sequences, which are then hierarchically strung together, as in the hierarchical neural attention encoder and temporal convolutional networks (TCN). More studies make me think that working memory is similar to RNN networks that use recurrent real neuron networks, and that their capacity is very low. On the other hand, both the cortex and the hippocampus give us the ability to remember really long sequences of steps (like: where did I park my car at the airport 5 days ago), suggesting that more parallel pathways may be involved in recalling long sequences, where attention mechanisms gate important chunks and force hops in parts of the sequence that are not relevant to the final goal or task.

Note 5: the above evidence shows that we do not read sequentially; in fact we interpret characters, words, and sentences as a group. An attention-based or convolutional module perceives the sequence and projects a representation in our mind. We would not be misreading this if we processed this information sequentially! We would stop and notice the inconsistencies!

https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0

