Recurrent Neural Networks Tutorial, Part 1 – Introduction to RNNs
Recurrent Neural Networks (RNNs) are popular models that have shown great promise in many NLP tasks. But despite their recent popularity I've only found a limited number of resources that thoroughly explain how RNNs work and how to implement them. That's what this tutorial is about. It's a multi-part series in which I'm planning to cover the following:
- Introduction to RNNs (this post)
- Implementing an RNN using Python and Theano
- Understanding the Backpropagation Through Time (BPTT) algorithm and the vanishing gradient problem
- From RNNs to LSTM networks
As part of the tutorial we'll implement a recurrent neural network based language model. The applications of language models are two-fold: First, they allow us to score arbitrary sentences based on how likely they are to occur in the real world. This gives us a measure of grammatical and semantic correctness. Such models are typically used as part of machine translation systems. Secondly, a language model allows us to generate new text (I think that's the much cooler application). Training a language model on Shakespeare allows us to generate Shakespeare-like text. This fun post by Andrej Karpathy demonstrates what character-level language models based on RNNs are capable of.
I'm assuming that you are somewhat familiar with basic neural networks. If you're not, you may want to head over to Implementing a Neural Network from Scratch, which guides you through the ideas and implementation behind non-recurrent networks.
What are RNNs?
The idea behind RNNs is to make use of sequential information. In a traditional neural network we assume that all inputs (and outputs) are independent of each other. But for many tasks that's a very bad idea. If you want to predict the next word in a sentence you better know which words came before it. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output being dependent on the previous computations. Another way to think about RNNs is that they have a "memory" which captures information about what has been calculated so far. In theory RNNs can make use of information in arbitrarily long sequences, but in practice they are limited to looking back only a few steps (more on this later). Here is what a typical RNN looks like:
A recurrent neural network and the unfolding in time of the computation involved in its forward computation. Source: Nature
The above diagram shows an RNN being unrolled (or unfolded) into a full network. By unrolling we simply mean that we write out the network for the complete sequence. For example, if the sequence we care about is a sentence of 5 words, the network would be unrolled into a 5-layer neural network, one layer for each word. The formulas that govern the computation happening in an RNN are as follows:
- $x_t$ is the input at time step $t$. For example, $x_1$ could be a one-hot vector corresponding to the second word of a sentence.
- $s_t$ is the hidden state at time step $t$. It's the "memory" of the network. $s_t$ is calculated based on the previous hidden state and the input at the current step: $s_t = f(U x_t + W s_{t-1})$. The function $f$ usually is a nonlinearity such as tanh or ReLU. $s_{-1}$, which is required to calculate the first hidden state, is typically initialized to all zeroes.
- $o_t$ is the output at step $t$. For example, if we wanted to predict the next word in a sentence it would be a vector of probabilities across our vocabulary: $o_t = \mathrm{softmax}(V s_t)$.
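To make these formulas concrete, here is a minimal NumPy sketch of the forward pass, assuming one-hot word inputs represented by their indices. The function name, parameter layout, and shapes (forward_pass, U, V, W, hidden_dim) are illustrative assumptions, not the exact code we'll write later in the series.

```python
import numpy as np

def forward_pass(x, U, V, W):
    """Unroll the RNN over a sequence of word indices (one-hot inputs).

    U: input-to-hidden weights,  shape (hidden_dim, vocab_size)
    W: hidden-to-hidden weights, shape (hidden_dim, hidden_dim)
    V: hidden-to-output weights, shape (vocab_size, hidden_dim)
    """
    T = len(x)
    hidden_dim, vocab_size = U.shape
    s = np.zeros((T + 1, hidden_dim))   # s[-1] serves as the all-zero initial state
    o = np.zeros((T, vocab_size))
    for t in range(T):
        # s_t = tanh(U x_t + W s_{t-1}); indexing U picks the column for the one-hot x_t
        s[t] = np.tanh(U[:, x[t]] + W.dot(s[t - 1]))
        # o_t = softmax(V s_t)
        z = V.dot(s[t])
        o[t] = np.exp(z - z.max()) / np.sum(np.exp(z - z.max()))
    return o, s
```

For a vocabulary of, say, 8000 words and 100 hidden units, U would be 100x8000, W 100x100, and V 8000x100.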
There are a few things to note here:
- You can think of the hidden state $s_t$ as the memory of the network. $s_t$ captures information about what happened in all the previous time steps. The output at step $t$ is calculated solely based on the memory at time $t$. As briefly mentioned above, it's a bit more complicated in practice because $s_t$ typically can't capture information from too many time steps ago.
- Unlike a traditional deep neural network, which uses different parameters at each layer, an RNN shares the same parameters ($U$, $V$, $W$ above) across all steps. This reflects the fact that we are performing the same task at each step, just with different inputs. This greatly reduces the total number of parameters we need to learn.
- The above diagram has outputs at each time step, but depending on the task this may not be necessary. For example, when predicting the sentiment of a sentence we may only care about the final output, not the sentiment after each word. Similarly, we may not need inputs at each time step. The main feature of an RNN is its hidden state, which captures some information about a sequence.
What can RNNs do?
RNNs have shown great success in many NLP tasks. At this point I should mention that the most commonly used type of RNNs are LSTMs, which are much better at capturing long-term dependencies than vanilla RNNs are. But don't worry, LSTMs are essentially the same thing as the RNN we'll develop in this tutorial, they just have a different way of computing the hidden state. We'll cover LSTMs in more detail in a later post. Here are some example applications of RNNs in NLP (by no means an exhaustive list).
Language Modeling and Generating Text
Given a sequence of words we want to predict the probability of each word given the previous words. Language models allow us to measure how likely a sentence is, which is an important input for machine translation (since high-probability sentences are typically correct). A side-effect of being able to predict the next word is that we get a generative model, which allows us to generate new text by sampling from the output probabilities. And depending on what our training data is we can generate all kinds of stuff. In language modeling our input is typically a sequence of words (encoded as one-hot vectors for example), and our output is the sequence of predicted words. When training the network we set $o_t = x_{t+1}$ since we want the output at step $t$ to be the actual next word.
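As a small, hypothetical illustration of how inputs and targets line up for language modeling (the toy sentence and variable names here are mine, not the dataset we'll use later in the series):

```python
# For language modeling, the target at each step is the input shifted by one word,
# so that o_t is trained to predict x_{t+1}.
sentence = ["the", "cat", "sat", "on", "the", "mat"]
vocab = {w: i for i, w in enumerate(sorted(set(sentence)))}
ids = [vocab[w] for w in sentence]

x_train = ids[:-1]   # inputs  x_1 ... x_{T-1}
y_train = ids[1:]    # targets x_2 ... x_T
print(list(zip(x_train, y_train)))   # each input index paired with the index of the next word
```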
Papers about Language Modeling and Generating Text:
- Recurrent neural network based language model
- Extensions of recurrent neural network based language model
- Generating Text with Recurrent Neural Networks
Machine Translation
Machine translation is similar to language modeling in that our input is a sequence of words in our source language (e.g. German). We want to output a sequence of words in our target language (e.g. English). A key difference is that our output only starts after we have seen the complete input, because the first word of our translated sentence may require information captured from the complete input sequence.
RNN for machine translation. Image source: http://cs224d.stanford.edu/lectures/cs224d-lecture8.pdf
Papers about Machine Translation:
- A Recursive Recurrent Neural Network for Statistical Machine Translation
- Sequence to Sequence Learning with Neural Networks
- Joint Language and Translation Modeling with Recurrent Neural Networks
Speech Recognition
Given an input sequence of acoustic signals from a sound wave, we can predict a sequence of phonetic segments together with their probabilities.
Papers about Speech Recognition:
- Towards End-to-End Speech Recognition with Recurrent Neural Networks
Generating Image Descriptions
Together with convolutional neural networks, RNNs have been used as part of a model to generate descriptions for unlabeled images. It's quite amazing how well this seems to work. The combined model even aligns the generated words with features found in the images.
Deep Visual-Semantic Alignments for Generating Image Descriptions. Source: http://cs.stanford.edu/people/karpathy/deepimagesent/
Training RNNs
Training an RNN is similar to training a traditional neural network. We also use the backpropagation algorithm, but with a little twist. Because the parameters are shared by all time steps in the network, the gradient at each output depends not only on the calculations of the current time step, but also on the previous time steps. For example, in order to calculate the gradient at $t=4$ we would need to backpropagate 3 steps and sum up the gradients. This is called Backpropagation Through Time (BPTT). If this doesn't make a whole lot of sense yet, don't worry, we'll have a whole post on the gory details. For now, just be aware of the fact that vanilla RNNs trained with BPTT have difficulties learning long-term dependencies (e.g. dependencies between steps that are far apart) due to what is called the vanishing/exploding gradient problem. There exists some machinery to deal with these problems, and certain types of RNNs (like LSTMs) were specifically designed to get around them.
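To give a rough idea of what "summing up the gradients" looks like, here is a schematic NumPy sketch of truncated BPTT for the $W$ matrix only. The loss choice (cross-entropy on softmax outputs), the tanh derivative, and the bptt_truncate cutoff are assumptions I'm making for the sketch; the full derivation comes in the dedicated BPTT post.

```python
import numpy as np

def bptt_W_gradient(y, o, s, V, W, bptt_truncate=4):
    """Schematic truncated BPTT: the gradient for W sums contributions from
    every output step, each propagated a few steps back through time."""
    T = len(y)
    dLdW = np.zeros_like(W)
    for t in range(T):
        delta_o = o[t].copy()
        delta_o[y[t]] -= 1.0                              # d(cross-entropy)/dz for softmax outputs
        delta_t = V.T.dot(delta_o) * (1 - s[t] ** 2)      # backprop into s_t through the tanh
        for step in range(t, max(0, t - bptt_truncate) - 1, -1):
            dLdW += np.outer(delta_t, s[step - 1])        # contribution of this unrolled step
            delta_t = W.T.dot(delta_t) * (1 - s[step - 1] ** 2)  # go one step further back in time
    return dLdW
```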
RNN Extensions
Over the years researchers have developed more sophisticated types of RNNs to deal with some of the shortcomings of the vanilla RNN model. We'll cover them in more detail in a later post, but I want this section to serve as a brief overview so that you are familiar with the taxonomy of models.
Bidirectional RNNs are based on the idea that the output at time $t$ may not only depend on the previous elements in the sequence, but also on future elements. For example, to predict a missing word in a sequence you want to look at both the left and the right context. Bidirectional RNNs are quite simple. They are just two RNNs stacked on top of each other. The output is then computed based on the hidden states of both RNNs.
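Here is a minimal sketch of that idea, reusing the forward_pass function sketched earlier. Concatenating the two hidden states is one common way to combine them, and the parameter grouping is my own, so treat this as an illustration rather than a fixed recipe.

```python
import numpy as np

def bidirectional_hidden_states(x, params_fwd, params_bwd):
    """Run one RNN left-to-right and another right-to-left, then combine
    the two hidden states at each time step."""
    _, s_fwd = forward_pass(x, *params_fwd)                  # sees the left context of each step
    _, s_bwd = forward_pass(list(reversed(x)), *params_bwd)  # sees the right context of each step
    T = len(x)
    return [np.concatenate([s_fwd[t], s_bwd[T - 1 - t]]) for t in range(T)]
```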
Deep (Bidirectional) RNNs are similar to Bidirectional RNNs, only that we now have multiple layers per time step. In practice this gives us a higher learning capacity (but we also need a lot of training data).
LSTM networks are quite popular these days and we briefly talked about them above. LSTMs don't have a fundamentally different architecture from RNNs, but they use a different function to compute the hidden state. The memory in LSTMs is called a cell, and you can think of cells as black boxes that take as input the previous state and the current input. Internally these cells decide what to keep in (and what to erase from) memory. They then combine the previous state, the current memory, and the input. It turns out that these types of units are very efficient at capturing long-term dependencies. LSTMs can be quite confusing in the beginning, but if you're interested in learning more, this post has an excellent explanation.
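For the curious, here is a compact sketch of the standard LSTM cell update, with the gates written out explicitly. The weight names (Wf, Wi, Wo, Wc) and the convention of concatenating the previous hidden state with the input are one common formulation, not the only one; we'll go through LSTMs properly in the later post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wo, Wc):
    """One LSTM step: gates decide what to erase from and what to add to the cell memory."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(Wf.dot(z))           # forget gate: how much of the old memory to keep
    i = sigmoid(Wi.dot(z))           # input gate: how much new information to write
    o = sigmoid(Wo.dot(z))           # output gate: how much of the memory to expose
    c_tilde = np.tanh(Wc.dot(z))     # candidate memory content
    c_t = f * c_prev + i * c_tilde   # new cell memory combines old memory and candidate
    h_t = o * np.tanh(c_t)           # new hidden state
    return h_t, c_t
```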
Conclusion
So far so good. I hope you've gotten a basic understanding of what RNNs are and what they can do. In the next post we'll implement a first version of our language model RNN using Python and Theano. Please leave questions in the comments!