Recurrent Neural Network (RNN) Model and Forward/Backward Propagation Algorithm


Earlier we discussed DNN, and CNN, a special case of DNN. Their models and forward/backward propagation algorithms are both feed-forward: the model's output has no association with the model itself. Today we discuss another type of neural network in which there is feedback between the output and the model: the recurrent neural network (Recurrent Neural Networks, hereinafter RNN). It is widely used in natural language processing tasks such as speech recognition, handwriting recognition, and machine translation.

1. RNN Overview

In the DNN and CNN discussed earlier, the inputs and outputs of the training samples are relatively fixed. But there is a class of problems that DNN and CNN find hard to handle: the training sample input is a sequence, and the lengths of the sequences differ, for example time-based sequences such as a segment of continuous speech or a segment of continuous handwritten text. These sequences are long and of varying length, and are hard to split directly into independent samples to be trained with DNN/CNN.

RNN, however, is relatively good at this kind of problem. So how does RNN do it? RNN assumes that our samples are sequence-based, for example a sequence running from index 1 to index $\tau$. For any sequence index $t$, the corresponding input in the sample sequence is $x^{(t)}$. The model's hidden state $h^{(t)}$ at sequence index $t$ is determined jointly by $x^{(t)}$ and the hidden state $h^{(t-1)}$ at position $t-1$. At any sequence index $t$ we also have a corresponding model prediction output $o^{(t)}$. Using the prediction output $o^{(t)}$, the true output $y^{(t)}$ of the training sequence, and the loss function $L^{(t)}$, we can train the model in a way similar to DNN, and then use it to predict the outputs at some positions of a test sequence.

Let's look at the RNN model below.

2. RNN Model

The RNN model has quite a few variants; here we describe the most mainstream RNN model structure:

The left side of the figure shows the RNN model without unrolling in time; if unrolled along the time sequence, it becomes the right part of the figure. Let's focus on the right side of the figure.

This figure depicts the RNN model near sequence index $t$, where:

1) $x^{(t)}$ represents the input of the training sample at sequence index $t$. Similarly, $x^{(t-1)}$ and $x^{(t+1)}$ represent the inputs of the training sample at sequence indices $t-1$ and $t+1$.

2) $h^{(t)}$ represents the hidden state of the model at sequence index $t$. $h^{(t)}$ is determined jointly by $x^{(t)}$ and $h^{(t-1)}$.

3) $o^{(t)}$ represents the output of the model at sequence index $t$. $o^{(t)}$ is determined only by the model's current hidden state $h^{(t)}$.

4) $L^{(t)}$ represents the loss function of the model at sequence index $t$.

5) $y^{(t)}$ represents the true output of the training sample sequence at sequence index $t$.

6) $U, W, V$: these three matrices are the linear relationship parameters of our model, and they are shared across the whole RNN network, which is very different from DNN. Precisely because they are shared, they embody the idea of "recurrent feedback" in the RNN model.
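To make the shared parameters concrete, here is a minimal NumPy sketch of how a single set of parameters $U, W, V, b, c$ might be created. The dimension names n_x, n_h, n_y are hypothetical and stand for the input, hidden state, and output sizes; the initialization scheme is just an illustrative assumption.

```python
import numpy as np

# Hypothetical sizes: n_x = input dimension, n_h = hidden state dimension, n_y = output dimension.
n_x, n_h, n_y = 8, 16, 4
rng = np.random.default_rng(0)

# One set of parameters, shared by every sequence position t.
U = rng.normal(0.0, 0.01, (n_h, n_x))   # input  -> hidden
W = rng.normal(0.0, 0.01, (n_h, n_h))   # hidden -> hidden (the recurrent feedback)
V = rng.normal(0.0, 0.01, (n_y, n_h))   # hidden -> output
b = np.zeros((n_h, 1))                  # hidden bias
c = np.zeros((n_y, 1))                  # output bias
```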

3. RNN Forward Propagation Algorithm

With the model above, the RNN forward propagation algorithm is easy to obtain.

For any sequence index $t$, the hidden state $h^{(t)}$ is obtained from $x^{(t)}$ and $h^{(t-1)}$: $$h^{(t)} = \sigma(z^{(t)}) = \sigma(Ux^{(t)} + Wh^{(t-1)} + b)$$

where $\sigma$ is the activation function of the RNN, generally $\tanh$, and $b$ is the bias of the linear relationship.

The expression for the model's output $o^{(t)}$ at sequence index $t$ is fairly simple: $$o^{(t)} = Vh^{(t)} + c$$

Finally, our predicted output at sequence index $t$ is: $$\hat{y}^{(t)} = \sigma(o^{(t)})$$

Usually, because RNN is a classification model for recognition tasks, this activation function is generally the softmax function.

Through a loss function $L^{(t)}$, such as the log-likelihood loss function, we can quantify the model's loss at the current position, i.e., the gap between $\hat{y}^{(t)}$ and $y^{(t)}$.
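As an illustration, here is a minimal NumPy sketch of the forward pass described above, assuming the parameter sketch from Section 2 and a hypothetical helper name rnn_forward. It loops over the sequence, computes $h^{(t)}$, $o^{(t)}$ and $\hat{y}^{(t)}$ with the tanh and softmax choices used in this article, and accumulates the log-likelihood loss.

```python
import numpy as np

def softmax(o):
    # Numerically stable softmax of a column vector o^(t).
    e = np.exp(o - o.max())
    return e / e.sum()

def rnn_forward(xs, ys, U, W, V, b, c):
    """xs: list of inputs x^(t), shape (n_x, 1); ys: list of one-hot targets y^(t), shape (n_y, 1)."""
    h = {0: np.zeros((W.shape[0], 1))}              # h^(0): initial hidden state
    y_hat, loss = {}, 0.0
    for t in range(1, len(xs) + 1):
        # h^(t) = tanh(U x^(t) + W h^(t-1) + b)
        h[t] = np.tanh(U @ xs[t - 1] + W @ h[t - 1] + b)
        # o^(t) = V h^(t) + c, then softmax gives the prediction y_hat^(t)
        y_hat[t] = softmax(V @ h[t] + c)
        # log-likelihood loss L^(t) for a one-hot y^(t), summed into the total loss L
        loss += -float(ys[t - 1].T @ np.log(y_hat[t] + 1e-12))
    return h, y_hat, loss
```

A call such as rnn_forward([x1, x2, x3], [y1, y2, y3], U, W, V, b, c) would return the hidden states and predictions that the backpropagation sketch in Section 4 consumes.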

4. Derivation of the RNN Backpropagation Algorithm

With the RNN forward propagation algorithm as a foundation, it is easy to derive the flow of the RNN backpropagation algorithm. The idea of RNN backpropagation is the same as in DNN: use gradient descent to iteratively obtain suitable RNN model parameters $U, W, V, b, c$. Since we backpropagate through time, RNN backpropagation is sometimes called BPTT (Back-Propagation Through Time). Of course, BPTT here also differs greatly from DNN in that $U, W, V, b, c$ are shared across all positions in the sequence, and during backpropagation we update these same shared parameters.

To simplify the description, the loss function used here is the log loss function, the output activation function is the softmax function, and the activation function of the hidden layer is the tanh function.

For RNN, because we have a loss function at every position in the sequence, the final loss $L$ is: $$L = \sum\limits_{t=1}^{\tau}L^{(t)}$$

The gradient calculations for $V$ and $c$ are relatively simple: $$\frac{\partial L}{\partial c} = \sum\limits_{t=1}^{\tau}\frac{\partial L^{(t)}}{\partial c} = \sum\limits_{t=1}^{\tau}\frac{\partial L^{(t)}}{\partial o^{(t)}} \frac{\partial o^{(t)}}{\partial c} = \sum\limits_{t=1}^{\tau}\hat{y}^{(t)}-y^{(t)}$$ $$\frac{\partial L}{\partial V} = \sum\limits_{t=1}^{\tau}\frac{\partial L^{(t)}}{\partial V} = \sum\limits_{t=1}^{\tau}\frac{\partial L^{(t)}}{\partial o^{(t)}} \frac{\partial o^{(t)}}{\partial V} = \sum\limits_{t=1}^{\tau}(\hat{y}^{(t)}-y^{(t)})(h^{(t)})^T$$

But the gradient calculations for $W, U, b$ are more complicated. From the RNN model it can be seen that, in backpropagation, the gradient loss at a certain sequence position $t$ is determined both by the gradient loss corresponding to the output at the current position and by the gradient loss at sequence index position $t+1$. The gradient loss of $W$ at a certain sequence position $t$ therefore needs to be computed by backpropagating step by step. We define the gradient of the hidden state at sequence index position $t$ as: $$\delta^{(t)} = \frac{\partial L}{\partial h^{(t)}}$$

In this way we can recursively obtain $\delta^{(t)}$ from $\delta^{(t+1)}$, as in DNN: $$\delta^{(t)} = \frac{\partial L}{\partial o^{(t)}} \frac{\partial o^{(t)}}{\partial h^{(t)}} + \frac{\partial L}{\partial h^{(t+1)}}\frac{\partial h^{(t+1)}}{\partial h^{(t)}} = V^T(\hat{y}^{(t)}-y^{(t)}) + W^T\text{diag}\big(1-(h^{(t+1)})^2\big)\delta^{(t+1)}$$ where the last term uses the fact that the derivative of $\tanh$ is $1-\tanh^2$, so $\frac{\partial h^{(t+1)}}{\partial h^{(t)}} = \text{diag}\big(1-(h^{(t+1)})^2\big)W$.

For $\delta^{(\tau)}$, since there is no other sequence index after it, we have: $$\delta^{(\tau)} = \frac{\partial L}{\partial o^{(\tau)}} \frac{\partial o^{(\tau)}}{\partial h^{(\tau)}} = V^T(\hat{y}^{(\tau)}-y^{(\tau)})$$

With $\delta^{(t)}$ in hand, it is easy to compute the gradients of $W, U, b$. Here are the gradient expressions for $W, U, b$: $$\frac{\partial L}{\partial W} = \sum\limits_{t=1}^{\tau}\frac{\partial L}{\partial h^{(t)}} \frac{\partial h^{(t)}}{\partial W} = \sum\limits_{t=1}^{\tau}\text{diag}\big(1-(h^{(t)})^2\big)\delta^{(t)}(h^{(t-1)})^T$$ $$\frac{\partial L}{\partial b} = \sum\limits_{t=1}^{\tau}\frac{\partial L}{\partial h^{(t)}} \frac{\partial h^{(t)}}{\partial b} = \sum\limits_{t=1}^{\tau}\text{diag}\big(1-(h^{(t)})^2\big)\delta^{(t)}$$ $$\frac{\partial L}{\partial U} = \sum\limits_{t=1}^{\tau}\frac{\partial L}{\partial h^{(t)}} \frac{\partial h^{(t)}}{\partial U} = \sum\limits_{t=1}^{\tau}\text{diag}\big(1-(h^{(t)})^2\big)\delta^{(t)}(x^{(t)})^T$$
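To tie these expressions together, here is a minimal NumPy sketch of BPTT under the same assumptions (log loss, softmax output, tanh hidden activation). The helper name rnn_backward is hypothetical; it consumes the hidden states and predictions returned by the rnn_forward sketch in Section 3, walks the sequence backwards, maintains $\delta^{(t)}$ via the recursion above, and accumulates the gradients of the shared parameters.

```python
import numpy as np

def rnn_backward(xs, ys, h, y_hat, U, W, V):
    """Backpropagation through time for one sequence, following the formulas above."""
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    db, dc = np.zeros((W.shape[0], 1)), np.zeros((V.shape[0], 1))
    tau = len(xs)
    delta_next = np.zeros((W.shape[0], 1))       # there is no delta^(tau+1), so treat it as zero
    for t in range(tau, 0, -1):
        dy = y_hat[t] - ys[t - 1]                # dL/do^(t) = y_hat^(t) - y^(t)
        dV += dy @ h[t].T                        # dL/dV accumulates (y_hat^(t) - y^(t)) (h^(t))^T
        dc += dy
        if t == tau:
            delta = V.T @ dy                     # delta^(tau) = V^T (y_hat^(tau) - y^(tau))
        else:
            # delta^(t) = V^T dy + W^T diag(1 - (h^(t+1))^2) delta^(t+1)
            delta = V.T @ dy + W.T @ ((1 - h[t + 1] ** 2) * delta_next)
        grad_z = (1 - h[t] ** 2) * delta         # diag(1 - (h^(t))^2) delta^(t)
        dW += grad_z @ h[t - 1].T
        dU += grad_z @ xs[t - 1].T
        db += grad_z
        delta_next = delta
    return dU, dW, dV, db, dc
```

A gradient descent step would then simply subtract a learning rate times each of these accumulated gradients from the corresponding shared parameter.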

Apart from the gradient expressions, the RNN backpropagation algorithm is not much different from that of DNN, so we will not repeat the summary here.

5. RNN Summary

This concludes the summary of the general RNN model and its forward/backward propagation algorithms. Of course, some RNN models differ somewhat from this one, so their forward and backward propagation formulas will naturally differ a bit as well, but the principle is basically the same.

Although RNN can in theory solve the training of sequence data very nicely, it suffers from the vanishing gradient problem just like DNN, and the problem is especially serious when the sequence is very long. Therefore, the RNN model above can rarely be used directly in practice. In NLP fields such as speech recognition, handwriting recognition, and machine translation, what is widely used is a special case based on the RNN model: LSTM. We will discuss the LSTM model in the next article.

(Reprints are welcome; please indicate the source. Discussion is also welcome: [email protected])
