Recurrent Neural Networks (RNN)

Reference: Alex Graves, [Supervised Sequence Labelling with Recurrent Neural Networks]

Alex Graves is a star student of Jürgen Schmidhuber, the inventor of the LSTM (the most famous RNN variant), and has since joined the University of Toronto to work under Hinton.

1. Statistical Language Models and Sequence Learning

1.1 Language models based on frequency statistics

The most famous language model in the field of NLP is N-gram.

It is based on the Markov assumption; the simplest case is the 2-gram (bi-gram) model:

The probability that any word $w_{i}$ appears depends only on the word $w_{i-1}$ immediately before it.

Generalizing to the N-gram, it becomes:

In a sentence, the probability of the word at position $t$ depends on the $n$ words before it:

$P(w_{t}\mid w_{t-n},\dots,w_{t-1})$

The early way of computing bi-gram probabilities, as Dr. Wu Jun popularized in "The Beauty of Mathematics", is word-frequency counting:

$P(w_{i}\mid w_{i-1})=\frac{P(w_{i-1},w_{i})}{P(w_{i-1})}\approx \frac{\mathrm{cnt}(w_{i-1},w_{i})}{\mathrm{cnt}(w_{i-1})}$

All that is needed is to count how often $w_{i-1}$ and $w_{i}$ appear together, and how often $w_{i-1}$ appears on its own.
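
As a quick illustration of this counting estimate, here is a minimal Python sketch using dictionary counters; the toy corpus and whitespace tokenization are stand-ins invented for the example.

```python
# Counting-based bi-gram estimate: P(w | w_prev) ~= cnt(w_prev, w) / cnt(w_prev).
from collections import Counter

corpus = "the cat sat on the mat . the cat ate ."   # toy stand-in corpus
tokens = corpus.split()

unigram = Counter(tokens)
bigram = Counter(zip(tokens, tokens[1:]))

def p_bigram(w_prev, w):
    if unigram[w_prev] == 0:
        return 0.0                       # unseen history; a real LM would smooth here
    return bigram[(w_prev, w)] / unigram[w_prev]

print(p_bigram("the", "cat"))            # 2/3 in this toy corpus
```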

It looks simple and has real mathematical beauty. Of course, being a popular-science book, it will not tell you how costly this method is in practice.

For the implementation, you can use either of the following two string-matching algorithms:

① KMP: concatenate the two words $w_{i-1}$, $w_{i}$ into one pattern and run it once over the text string (see the sketch after this list).

② AC automaton: the same concatenation, but all the pattern strings are built into an Aho-Corasick automaton in advance, so the text string only needs to be scanned once.

But if you are an ACM competitor, you will have a deep appreciation of the AC automaton: it is simply a memory killer.

Of the two evils, choose the lesser: in actual use KMP is not considered, and the AC automaton, which trades space for time, is the preferred choice.
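
For reference, here is a sketch of option ① above: KMP over the raw text with the concatenated pair as the pattern. The function and the toy text are illustrative only; a real counter would respect token boundaries rather than raw characters.

```python
# KMP: count occurrences of the joined pair "w_prev w" with one pass over the text.
def kmp_count(pattern, text):
    # build the prefix (failure) function of the pattern
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # single scan over the text
    count = k = 0
    for ch in text:
        while k and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            count += 1
            k = fail[k - 1]
    return count

print(kmp_count("the cat", "the cat sat on the mat . the cat ate ."))   # 2
```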

In [cs224d Lecture7], Socher cited the state-of-the-art results of the N-gram frequency-counting approach [Heafield].

It's scary to look at the abstract:

Using one machine with the RAM for 2.8 days, we built an unpruned model on 126 billion tokens.

And it is still built on MapReduce. This way of devouring hardware is, I have to say, bad enough.

1.2 Language models based on neural networks

The Neural Network Language Model (NNLM) was first formally presented in [Bengio03].

Bengio uses a classic feedforward network to train an N-gram model; the difference is that the input layer itself is trainable.

The input layer, once trained, becomes the word vectors, which were mostly called word embeddings before [Mikolov13] introduced word2vec.

The specific method:

① For each word, build a vector of parameters of size $n \times dim$. This is different from the simplified approach of word2vec.

Word2vec drops the word-order information, so the per-word vector size is just $dim$.

② Given a sentence of length $T$, use the word indices to assemble an input matrix of size $T \times (n \times dim)$.

③ As in an ordinary neural network: softmax output, error backpropagation, parameter update.
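
To make steps ①–③ concrete, the following is a minimal NumPy sketch of a single NNLM forward pass. It is a hedged illustration rather than Bengio's exact architecture: the vocabulary size, dimensions, weight initialization, and word indices are all made up.

```python
# One NNLM forward pass: look up n context embeddings, concatenate, hidden layer, softmax.
import numpy as np

V, dim, n, hidden = 10000, 50, 4, 128      # vocab size, embedding dim, context length, hidden units
rng = np.random.default_rng(0)

C   = rng.normal(0, 0.1, (V, dim))         # trainable input layer = word embeddings
W_h = rng.normal(0, 0.1, (n * dim, hidden))
W_o = rng.normal(0, 0.1, (hidden, V))

def nnlm_forward(context_ids):
    """context_ids: indices of the previous n words; returns P(w_t | context)."""
    x = C[context_ids].reshape(-1)         # concatenate n embeddings -> n*dim input
    h = np.tanh(x @ W_h)                   # hidden layer
    logits = h @ W_o
    p = np.exp(logits - logits.max())
    return p / p.sum()                     # softmax over the vocabulary

probs = nnlm_forward(np.array([12, 7, 430, 9]))   # hypothetical word indices
```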

The NNLM approach reproduces the N-gram model in a lightweight way and needs far less memory. It also needs no smoothing such as [Katz Backoff].

1.3 Sequence Learning

Just like the naive Bayes assumption, the Markov assumption is a crude approximation.

For a word $w_{i}$, looking only at the previous $n$ words often misses the point.

The ideal solution, of course, is for each word $w_{i}$ to take all the words before it into account.

Borrowing the idea of dynamic programming, we can build a dynamic sequence model. This is what the Recurrent Neural Network (RNN) achieves.

In Chinese, RNN is usually translated as "cyclic neural network"; given its dynamic-programming-like principle, it can also be translated as "sequential recursive neural network".

There is also the structural variant, the Recursive Neural Network, which sees less and less use.

Usually "RNN" refers to the sequential, recurrent variant.

2. RNN Structure and Update

2.1 The classic: Elman's Simple Recurrent Network (SRN)

The SRN proposed by J. L. Elman is the simplest variant in the RNN family: compared with a traditional 2-layer fully connected feedforward network, it only adds a sequential feedback connection to the hidden (FC) layer.

The left diagram shows an incomplete structure, because the connections of the recurrent layer, including self-loops and cross-loops, are too cluttered to draw.

So an RNN is usually drawn unrolled over time, as on the right.

From the unrolled diagram it is easy to see that, at time step $t$, the SRN compresses the whole history together and feeds it into the current hidden layer.

Thus an RNN can be regarded as a graphical model with dynamic depth: as the sequence grows, the network gets deeper and deeper, which is why it counts as a deep neural network.

2.2 Forward Propagation

Compared with the 2-layer feedforward network, the only change is that at time step $t$ the hidden-layer input consists of two parts:

① The mapping from the input layer. (Non-recurrent)

② The activation output of the hidden-layer neurons at time step $t-1$. (Sequential recurrence)

Below, $a_{j}^{t}$ denotes the input of neuron $j$ at time $t$ and $b_{j}^{t}$ denotes its activation output; biases are omitted for clarity.

When the input reaches the hidden layer, we have:

$a_{h}^{t}=\sum_{i=1}^{I}w_{ih}x_{i}^{t}+\sum_{h'=1}^{H}w_{h'h}b_{h'}^{t-1}$

The $h'\rightarrow h$ transform, i.e. the hidden-to-hidden (recurrent) transform, is what gives the feedforward network a memory slot.

Before reaching the output layer, it remembers the full state of the hidden neurons over time steps $[1,t-1]$.

After the hidden-layer neurons are activated, we have:

$b_{h}^{t}=\mathrm{activation}(a_{h}^{t})$

At the output layer, we have:

$y_{k}^{t}=\mathrm{softmax}\left(\sum_{h=1}^{H}w_{hk}b_{h}^{t}\right)$
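
The forward pass above is easy to write down directly. Here is a minimal NumPy sketch of it; the layer sizes, weights, and input sequence are arbitrary placeholders, and tanh stands in for the unspecified activation.

```python
# SRN forward pass: a_h^t = x^t W_ih + b^{t-1} W_hh,  b^t = tanh(a^t),  y^t = softmax(b^t W_hk).
import numpy as np

I, H, K, T = 8, 16, 5, 10          # input, hidden, output sizes; sequence length
rng = np.random.default_rng(1)
W_ih = rng.normal(0, 0.1, (I, H))  # input  -> hidden
W_hh = rng.normal(0, 0.1, (H, H))  # hidden -> hidden (the recurrent h' -> h weights)
W_hk = rng.normal(0, 0.1, (H, K))  # hidden -> output

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

x = rng.normal(size=(T, I))        # a toy input sequence
b = np.zeros(H)                    # b_h^0: no history before the first step
for t in range(T):
    a_h = x[t] @ W_ih + b @ W_hh   # non-recurrent part + recurrent part
    b = np.tanh(a_h)               # b_h^t = activation(a_h^t)
    y = softmax(b @ W_hk)          # y_k^t
```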

2.3 Backward Propagation

As in an ordinary BP network, define the local gradient:

$\delta_{y}^{t}=\frac{\partial \mathcal{L}}{\partial b_{y}^{t}} \cdot \frac{\partial b_{y}^{t}}{\partial a_{y}^{t}}$

In other words, starting from the likelihood function, the chain rule is carried down to the step just before $\partial w$.

For the output layer, we have:

$\frac{\partial \mathcal{L}}{\partial w_{hk}}=\delta_{k}^{t}\cdot\frac{\partial a_{k}^{t}}{\partial w_{hk}}$

$\delta_{k}^{t}$ is just the familiar softmax error term, $\left(1\{y^{(i)}=j\}-P(y^{(i)}=j\mid x;\theta_{j})\right)$, for $j=1,2,\dots,K$.

For the hidden layer, we have:

$\frac{\partial \mathcal{L}}{\partial w_{h'h}}=\sum_{p=1}^{t}\frac{\partial \mathcal{L}}{\partial b_{k}^{t}}\cdot\frac{\partial b_{k}^{t}}{\partial a_{k}^{t}}\cdot\frac{\partial b_{h}^{t}}{\partial b_{h}^{p}}\cdot\frac{\partial b_{h}^{p}}{\partial w_{h'h}} \qquad \text{where} \quad \frac{\partial b_{h}^{t}}{\partial b_{h}^{p}}=\prod_{j=p+1}^{t}\frac{\partial b_{h}^{j}}{\partial b_{h}^{j-1}}$

The formula above, from [cs224d Lecture7], is the Achilles' heel of the RNN and needs to be kept in mind.

Because of the repeated multiplication, which amounts to computing the gradient of an extremely deep network, it brings severe gradient vanishing/exploding. This is described in detail below.

As for why it can be written this way, you can derive it with a single neuron and no activation function:

$a_{3}=wx_{3}+w'a_{2}=\dots=wx_{3}+w'wx_{2}+(w')^{2}wx_{1}$

$\frac{\partial a_{3}}{\partial w'}=\,?$

$\text{answer}=\frac{\partial a_{3}}{\partial a_{3}}\cdot\frac{\partial a_{3}}{\partial w'}+\frac{\partial a_{3}}{\partial a_{3}}\cdot\frac{\partial a_{3}}{\partial a_{2}}\cdot\frac{\partial a_{2}}{\partial w'}+\frac{\partial a_{3}}{\partial a_{3}}\cdot\frac{\partial a_{3}}{\partial a_{2}}\cdot\frac{\partial a_{2}}{\partial a_{1}}\cdot\frac{\partial a_{1}}{\partial w'}$
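
A quick numeric sanity check of this derivation, with arbitrary values for $w$, $w'$ and the inputs: the chain-rule sum equals a finite-difference estimate of $\partial a_{3}/\partial w'$.

```python
# a1 = w*x1, a2 = w*x2 + w'*a1, a3 = w*x3 + w'*a2; check d(a3)/d(w').
w, wp = 0.7, 0.9                   # wp stands for w'
x1, x2, x3 = 1.0, 2.0, 3.0

def a3(w, wp):
    a1 = w * x1
    a2 = w * x2 + wp * a1
    return w * x3 + wp * a2

analytic = w * x2 + 2 * wp * w * x1                      # from expanding a3 and differentiating
eps = 1e-6
numeric = (a3(w, wp + eps) - a3(w, wp - eps)) / (2 * eps)
print(analytic, numeric)                                 # the two agree
```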

Likewise, the trailing part can be abbreviated with a local gradient:

$\delta_{h'}^{t}=\sum_{p=1}^{t}\frac{\partial \mathcal{L}}{\partial b_{k}^{t}}\cdot\frac{\partial b_{k}^{t}}{\partial a_{k}^{t}}\cdot\frac{\partial b_{h}^{t}}{\partial b_{h}^{p}} \qquad \text{where} \quad \frac{\partial b_{h}^{t}}{\partial b_{h}^{p}}=\prod_{j=p+1}^{t}\frac{\partial b_{h}^{j}}{\partial b_{h}^{j-1}}$

Thus, for the hidden layer, we get the famous BPTT update rule, as written in [Alex]'s book:

$\delta_{h}^{t}=\mathrm{activation}'(a_{h}^{t})\left(\sum_{k=1}^{K}\delta_{k}^{t}w_{hk}+\sum_{h'=1}^{H}\delta_{h'}^{t+1}w_{hh'}\right)$

Of course, at the last time step $\delta_{h'}^{t+1}$ would run past the end of the sequence, so it is taken to be 0.

The biggest headache here is: why does the hidden layer's local gradient depend on time step $t+1$?

Think about it one step at a time: the hidden weights appear not only at time step $t$, as the hidden-layer parameters, but also at time step $t+1$, as the extra parameters of the recurrent (hidden-to-hidden) connection.

Thus the chain rule propagates through both time step $t$ and time step $t+1$, so the local gradient consists of two parts. This is the essence of the BPTT update.

Last part:

$\frac{\partial \mathcal{L}}{\partial w_{ih}}=\frac{\partial \mathcal{L}}{\partial a_{h}^{t}}\cdot\frac{\partial a_{h}^{t}}{\partial w_{ih}}=\delta_{h}^{t}x_{i}^{t}$
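
Putting the pieces together, here is a hedged NumPy sketch of BPTT for the SRN above, following the $\delta$ recursion. The sizes, the random data, and the per-step softmax cross-entropy loss are assumptions added for the illustration, not part of the original text.

```python
# Forward pass with stored activations, then BPTT:
# delta_h^t = f'(a_h^t) * (sum_k delta_k^t w_hk + sum_h' delta_h'^{t+1} w_hh').
import numpy as np

I, H, K, T = 8, 16, 5, 10
rng = np.random.default_rng(2)
W_ih = rng.normal(0, 0.1, (I, H))
W_hh = rng.normal(0, 0.1, (H, H))
W_hk = rng.normal(0, 0.1, (H, K))

x = rng.normal(size=(T, I))
targets = rng.integers(0, K, size=T)       # one class label per time step (toy task)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# ---- forward pass, storing activations ----
b = np.zeros((T + 1, H))                   # b[t] = b_h^t, with b[0] the zero history
y = np.zeros((T, K))
for t in range(T):
    b[t + 1] = np.tanh(x[t] @ W_ih + b[t] @ W_hh)
    y[t] = softmax(b[t + 1] @ W_hk)

# ---- backward pass (BPTT) ----
dW_ih, dW_hh, dW_hk = np.zeros_like(W_ih), np.zeros_like(W_hh), np.zeros_like(W_hk)
delta_h_next = np.zeros(H)                 # delta_h^{t+1} beyond the sequence is 0
for t in reversed(range(T)):
    delta_k = y[t].copy()
    delta_k[targets[t]] -= 1.0             # softmax + cross-entropy output delta
    delta_h = (1 - b[t + 1] ** 2) * (W_hk @ delta_k + W_hh @ delta_h_next)
    dW_hk += np.outer(b[t + 1], delta_k)
    dW_hh += np.outer(b[t], delta_h)       # pairs b_h'^{t-1} with delta_h^t
    dW_ih += np.outer(x[t], delta_h)
    delta_h_next = delta_h
```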

3. RNN and Semantic Analysis

The ancestor of the RNN is the Hopfield network, proposed in 1982.

Because it was hard to implement and lacked suitable applications, the Hopfield network was displaced by the feedforward networks of 1986.

The 1990s coincided with the decline of neural networks, when the feedforward MLP was overshadowed by the SVM.

On the representation side, the older generation in CV was still using hand-made features, and speech & NLP likewise emphasized statistical features.

The two SRN variants of the RNN, by Elman and Jordan, proposed around 1990, were also quickly ignored for lack of suitable practical applications.

More than a decade later, with the deep learning craze, the RNN's distributed-representation ability and its capacity for mining semantic information were studied again.

It was finally put to work on semantic-analysis tasks in speech and language modeling.

3.1 Memory characteristics

A simple RNN, unfolded over a sequence of length $T$, is essentially a feedforward network of depth $T$.

All the input information along the sequence, i.e. the nonlinearly transformed hidden information from the very first time step, is retained up to the current moment.

From a neuroscience perspective, this is a long-term memory property.

Feedforward networks are not omnipotent; despite their brilliance in CV, they are not really suited to logic problems.

Prolog once shone in this area, and many people believed that probabilistic models could not solve problems of logical intelligence, but the RNN has since proved otherwise on question-answering style problems.

Through its long-term memory, the RNN is able to pick out the key information in the input.

3.2 Gradient Vanish

The number-one problem of deep neural networks is gradient vanishing, and it hits the RNN especially hard, since an RNN is much deeper than an MLP.

★ Mathematical angle: [Bengio94] gives the reason why the simple RNN suffers from vanishing gradients:

$\left|\prod_{j=p+1}^{t}\frac{\partial b_{h}^{j}}{\partial b_{h}^{j-1}}\right|\leqslant (\beta_{W}\cdot\beta_{h})^{t-p} \quad \text{where} \quad \beta=\text{upper bound}$

Here $\beta_{W}$ and $\beta_{h}$ are upper bounds related to the weight matrix and the hidden activations; raising their product to the power $t-p$ makes the bound change extremely fast, either $\rightarrow 0$ or $\rightarrow \infty$.

If a large number of saturating nonlinearities such as the sigmoid family (logistic / tanh) are introduced, the most common outcome is $\text{gradient} \rightarrow 0$.
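
A small NumPy sketch makes the effect visible: repeatedly multiplying a gradient vector by the recurrent Jacobian (activation derivative times the weight matrix) shrinks its norm geometrically. The weight scale, sequence length, and random pre-activations are arbitrary choices for the demonstration.

```python
# Vanishing gradients: each backward step multiplies by diag(f'(a)) applied to W,
# so the gradient norm decays roughly like (beta_W * beta_h)^(t-p).
import numpy as np

H, T = 32, 50
rng = np.random.default_rng(3)
W = rng.normal(0, 0.1, (H, H))           # modest weights + saturating sigmoid -> vanishing
grad = np.ones(H)

for step in range(T):
    a = rng.normal(size=H)               # stand-in pre-activations
    s = 1.0 / (1.0 + np.exp(-a))         # sigmoid
    grad = (s * (1 - s)) * (W @ grad)    # one factor of the bounded product
    if step % 10 == 0:
        print(step, np.linalg.norm(grad))   # norm heads toward 0
```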

★ Biological Angle:

In those terms, long-term memory degrades into short-term memory: the network can only remember the recent past.

3.3 RNNLM

Although the simple RNN has many flaws, short-term memory is still better than nothing.

[Mikolov10] first proposed using an RNN as a language model, but did not use word embeddings.

RNNLM works at the sentence level: a sentence is treated as a sequence, and the time step advances word by word.
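
Concretely, scoring one sentence looks roughly like the sketch below: the hidden state is carried forward one word at a time, and each step contributes $\log P(\text{next word} \mid \text{history})$. The tiny vocabulary and the random, untrained weights are placeholders, not Mikolov's actual model.

```python
# Sentence-level processing in an RNN language model (toy, untrained weights).
import numpy as np

vocab = {"<s>": 0, "the": 1, "cat": 2, "sat": 3, "</s>": 4}
V, H = len(vocab), 16
rng = np.random.default_rng(4)
E     = rng.normal(0, 0.1, (V, H))       # input weights, one row per word
W_hh  = rng.normal(0, 0.1, (H, H))
W_out = rng.normal(0, 0.1, (H, V))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

sentence = ["<s>", "the", "cat", "sat", "</s>"]
h, log_prob = np.zeros(H), 0.0
for prev, nxt in zip(sentence, sentence[1:]):
    h = np.tanh(E[vocab[prev]] + h @ W_hh)    # advance the time step by one word
    p = softmax(h @ W_out)
    log_prob += np.log(p[vocab[nxt]])         # accumulate log P(next word | history)

print(log_prob)                               # log-probability of the whole sentence
```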

3.4 RNN for Speech understanding

[Mesnil13] then combined the RNN with the then-hot word embeddings.

The paper is a collaboration between Mesnil, from Bengio's group, during an internship at Microsoft, and two big names in speech at Microsoft Research Redmond, Xiaodong He and Li Deng.

My guess is that the advisor brought Theano along for the exchange with MS. (Inner monologue: look at you, still deriving gradients by hand, haha.)

3.4.1 Word embeddings

Looking back at [Mikolov13]'s word2vec: it was only from 2013 that people really started to play with word vectors.

[Mesnil13] summarizes some of the benefits of word vectors:

★ A lower-dimensional vector distills the word's information from its n-dimensional Euclidean space, commonly known as dimensionality reduction.

★ Semantic and syntactic information can be pre-trained on large corpora such as Wikipedia, and then fine-tuned on the actual task, which fits the deep learning philosophy.

★ Generalization is greatly improved.

3.4.2 Context Window
