dl4nlp -- Neural Networks (b) Recurrent neural networks: BPTT algorithm derivation; gradient vanishing and gradient exploding


There are many derivations of the simple RNN BPTT algorithm online; here I organize one using my own notation.

I am used to using a subscript to denote the sample index, but that cannot be done here, because the subscript is needed to denote the time step.

The typical simple RNN structure is as follows:

Image source: [3]

First, fix some notation:

Input sequence $\textbf x_{(1:T)} = (\textbf x_1, \textbf x_2, \dots, \textbf x_T)$; the value at each time step is a one-hot column vector whose dimension is the vocabulary size;

Label sequence $\textbf y_{(1:T)} = (\textbf y_1, \textbf y_2, \dots, \textbf y_T)$; the value at each time step is a one-hot column vector whose dimension is the vocabulary size;

Output sequence $\hat{\textbf y}_{(1:T)} = (\hat{\textbf y}_1, \hat{\textbf y}_2, \dots, \hat{\textbf y}_T)$; the value at each time step is a column vector whose dimension is the vocabulary size;

Hidden layer output $\textbf h_t \in \mathbb R^{H}$;

Hidden layer input $\textbf s_t \in \mathbb R^{H}$;

Output-layer values before the softmax, $\textbf z_t$.

(i) BPTT

For a simple RNN, the forward propagation process is as follows (omitting the biases):

$$\textbf s_t = U \textbf h_{t-1} + W \textbf x_t$$

$$\textbf h_t = f(\textbf s_t)$$

$$\textbf z_t = V \textbf h_t$$

$$\hat{\textbf y}_t = \text{softmax}(\textbf z_t)$$

where $f$ is the activation function. Note that the three weight matrices are shared along the time dimension. This can be understood as follows: every time step performs the same task, so the parameters are shared.
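As a concrete illustration, here is a minimal numpy sketch of this forward pass (not from the original post); the names (`rnn_forward`, `U`, `W`, `V`) and the choice of `tanh` as $f$ are assumptions made to match the equations above.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(xs, U, W, V, f=np.tanh):
    """Forward pass of a simple RNN, biases omitted as in the text.

    xs : list of one-hot column vectors of shape (vocab, 1)
    U  : (H, H) hidden-to-hidden weights (shared over time)
    W  : (H, vocab) input-to-hidden weights (shared over time)
    V  : (vocab, H) hidden-to-output weights (shared over time)
    """
    H = U.shape[0]
    h = np.zeros((H, 1))              # h_0
    ss, hs, zs, yhats = [], [h], [], []
    for x in xs:
        s = U @ h + W @ x             # s_t = U h_{t-1} + W x_t
        h = f(s)                      # h_t = f(s_t)
        z = V @ h                     # z_t = V h_t
        yhat = softmax(z)             # y_hat_t = softmax(z_t)
        ss.append(s); hs.append(h); zs.append(z); yhats.append(yhat)
    return ss, hs, zs, yhats          # hs = [h_0, h_1, ..., h_T]
```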

Since every time step has an output $\hat{\textbf y}_t$, correspondingly every time step also has a loss. Denote the loss at time $t$ by $E_t$; then for the sample $\textbf x_{(1:T)}$ the total loss $E$ is

$$E = \sum_{t=1}^{T} E_t$$

Using the cross-entropy loss function, we have

$$E_t = -\textbf y_t^{\top} \log \hat{\textbf y}_t$$
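Continuing the numpy sketch above (the list layout of `ys` and `yhats` is an assumption), the total loss can be computed as:

```python
import numpy as np

def total_loss(ys, yhats):
    """E = sum_t E_t with E_t = -y_t^T log(yhat_t).

    ys, yhats : lists of (vocab, 1) column vectors; ys are one-hot labels.
    """
    eps = 1e-12                        # avoid log(0)
    return sum(float(-(y * np.log(yhat + eps)).sum())
               for y, yhat in zip(ys, yhats))
```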

1. The gradient of $E$ with respect to $V$

We first take the gradient of $E$ with respect to $V$. By the chain rule (in the layout convention used here, $\dfrac{\partial \textbf z}{\partial \textbf x}=\dfrac{\partial \textbf y}{\partial \textbf x}\dfrac{\partial \textbf z}{\partial \textbf y}$ and $\dfrac{\partial z}{\partial x_{ij}}=\left(\dfrac{\partial z}{\partial \textbf y}\right)^{\top}\dfrac{\partial \textbf y}{\partial x_{ij}}$), we have

$$\frac{\partial E_t}{\partial V_{ij}} = \left(\frac{\partial E_t}{\partial \textbf z_t}\right)^{\top} \frac{\partial \textbf z_t}{\partial V_{ij}}$$

This is actually the same as in ordinary BP: the first factor is equivalent to the error term $\delta$, and the second factor equals

$$\frac{\partial \textbf z_t}{\partial V_{ij}} = \frac{\partial V \textbf h_t}{\partial V_{ij}} = (0, \dots, [\textbf h_t]_j, \dots, 0)^{\top}$$

Only the $i$-th entry is nonzero, and $[\textbf h_t]_j$ denotes the $j$-th element of $\textbf h_t$. Referring to the end of the previous post, the first factor equals

$$\frac{\partial E_t}{\partial \textbf z_t} = \hat{\textbf y}_t - \textbf y_t$$

(There are some tricks for deriving this; for a derivation that works through the algebra directly, see [6].)
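As a quick numerical sanity check (an illustration added here, not part of the original derivation), a finite-difference sketch with arbitrary sizes and seed confirms this gradient:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
z = rng.normal(size=(5, 1))
y = np.zeros((5, 1)); y[2] = 1.0              # one-hot label

analytic = softmax(z) - y                     # claimed dE_t/dz_t
numeric = np.zeros_like(z)
eps = 1e-6
for i in range(z.shape[0]):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps; zm[i] -= eps
    Ep = -(y * np.log(softmax(zp))).sum()     # E_t at z + eps e_i
    Em = -(y * np.log(softmax(zm))).sum()     # E_t at z - eps e_i
    numeric[i] = (Ep - Em) / (2 * eps)        # central difference

print(np.allclose(analytic, numeric, atol=1e-6))   # expected: True
```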

So we have

$$\frac{\partial E_t}{\partial V_{ij}} = [\hat{\textbf y}_t - \textbf y_t]_i \, [\textbf h_t]_j$$

and therefore

$$\frac{\partial E_t}{\partial V} = (\hat{\textbf y}_t - \textbf y_t) \textbf h_t^{\top} = (\hat{\textbf y}_t - \textbf y_t) \otimes \textbf h_t$$

The outer product of two vectors is the special case of the Kronecker product of matrices when both arguments are vectors. So

$$\frac{\partial E}{\partial V} = \sum_{t=1}^{T} (\hat{\textbf y}_t - \textbf y_t) \otimes \textbf h_t$$
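A minimal numpy sketch of this accumulation, reusing the (hypothetical) `hs` and `yhats` lists returned by the forward-pass sketch above:

```python
import numpy as np

def grad_V(ys, yhats, hs):
    """dE/dV = sum_t (yhat_t - y_t) h_t^T.

    hs contains h_0, h_1, ..., h_T as returned by the rnn_forward sketch,
    so hs[t] is h_t (time is 1-indexed).
    """
    dV = np.zeros((ys[0].shape[0], hs[0].shape[0]))
    for t, (y, yhat) in enumerate(zip(ys, yhats), start=1):
        dV += (yhat - y) @ hs[t].T     # outer product (yhat_t - y_t) ⊗ h_t
    return dV
```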

2. The gradient of $E$ with respect to $U$

We continue with the gradient of $E$ with respect to $U$. In computing $\frac{\partial E_t}{\partial U}$, it is important to note that not only the hidden state at time $t$ depends on $U$: the hidden states at all previous time steps also depend on $U$. So, by the chain rule:

$$\frac{\partial E_t}{\partial U} = \sum_{k=1}^{t} \frac{\partial \textbf s_k}{\partial U} \frac{\partial E_t}{\partial \textbf s_k}$$

We solve this with a routine similar to the one above: first take the gradient with respect to a single element of the matrix.

$$\frac{\partial E_t}{\partial U_{ij}} = \sum_{k=1}^{t} \left(\frac{\partial E_t}{\partial \textbf s_k}\right)^{\top} \frac{\partial \textbf s_k}{\partial U_{ij}}$$

Define the first factor as $\delta_{t,k}=\dfrac{\partial E_t}{\partial \textbf s_k}$. For the second factor:

$$\frac{\partial \textbf s_k}{\partial U_{ij}} = \frac{\partial (U \textbf h_{k-1} + W \textbf x_k)}{\partial U_{ij}} = (0, \dots, [\textbf h_{k-1}]_j, \dots, 0)^{\top}$$

Only the $i$-th entry is nonzero, and $[\textbf h_{k-1}]_j$ denotes the $j$-th element of $\textbf h_{k-1}$. Now solve for $\delta_{t,k}=\dfrac{\partial E_t}{\partial \textbf s_k}$, using the same routine as for $\delta^{(l)}$ in the previous post:

$$\begin{aligned}\delta_{t,k} &= \frac{\partial E_t}{\partial \textbf s_k} \\ &= \frac{\partial \textbf h_k}{\partial \textbf s_k} \frac{\partial \textbf s_{k+1}}{\partial \textbf h_k} \frac{\partial E_t}{\partial \textbf s_{k+1}} \\ &= \text{diag}(f'(\textbf s_k)) \, U^{\top} \delta_{t,k+1} \\ &= f'(\textbf s_k) \odot (U^{\top} \delta_{t,k+1}) \end{aligned}$$

A special case is $\delta_{t,t}$, for which

$$\begin{aligned}\delta_{t,t} &= \frac{\partial E_t}{\partial \textbf s_t} \\ &= \frac{\partial \textbf h_t}{\partial \textbf s_t} \frac{\partial \textbf z_t}{\partial \textbf h_t} \frac{\partial E_t}{\partial \textbf z_t} \\ &= \text{diag}(f'(\textbf s_t)) \, V^{\top} (\hat{\textbf y}_t - \textbf y_t) \\ &= f'(\textbf s_t) \odot (V^{\top} (\hat{\textbf y}_t - \textbf y_t)) \end{aligned}$$

So

$$\frac{\partial E_t}{\partial U_{ij}} = \sum_{k=1}^{t} [\delta_{t,k}]_i \, [\textbf h_{k-1}]_j$$

$$\frac{\partial E_t}{\partial U} = \sum_{k=1}^{t} \delta_{t,k} \textbf h_{k-1}^{\top} = \sum_{k=1}^{t} \delta_{t,k} \otimes \textbf h_{k-1}$$

So

$$\frac{\partial E}{\partial U} = \sum_{t=1}^{T} \sum_{k=1}^{t} \delta_{t,k} \otimes \textbf h_{k-1}$$
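The two sums translate directly into a (deliberately unoptimized) double loop. The sketch below assumes the list layout of the earlier forward-pass sketch and tanh as the activation, so `df` is its derivative; in practice the per-$t$ deltas would be merged into a single backward pass rather than recomputed for every $t$.

```python
import numpy as np

def grad_U(ys, yhats, ss, hs, U, V, df=lambda s: 1.0 - np.tanh(s) ** 2):
    """dE/dU = sum_t sum_{k<=t} delta_{t,k} h_{k-1}^T, with
    delta_{t,t} = f'(s_t) * (V^T (yhat_t - y_t)) and
    delta_{t,k} = f'(s_k) * (U^T delta_{t,k+1}).

    ss[k-1] is s_k, hs[k] is h_k (hs[0] is h_0); df is f' (tanh assumed).
    """
    dU = np.zeros_like(U)
    T = len(ys)
    for t in range(1, T + 1):
        delta = df(ss[t - 1]) * (V.T @ (yhats[t - 1] - ys[t - 1]))  # delta_{t,t}
        for k in range(t, 0, -1):
            dU += delta @ hs[k - 1].T                   # delta_{t,k} ⊗ h_{k-1}
            if k > 1:
                delta = df(ss[k - 2]) * (U.T @ delta)   # delta_{t,k-1}
    return dU
```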

(ii) Gradient vanishing and gradient exploding

First of all

$$\frac{\partial E_t}{\partial U} = \frac{\partial \textbf h_t}{\partial U} \frac{\partial \hat{\textbf y}_t}{\partial \textbf h_t} \frac{\partial E_t}{\partial \hat{\textbf y}_t}$$

The term $\dfrac{\partial \textbf h_t}{\partial U}$ here is troublesome, because the parameters are shared across time steps: $\textbf h_t$ depends on $\textbf h_{t-1}$ and $U$, $\textbf h_{t-1}$ depends on $\textbf h_{t-2}$ and $U$, and so on. Therefore, following [5], it can be written in the following form (note that the forward-propagation formulation in [5] and [4] is the same as in this post):

$$\frac{\partial E_t}{\partial U} = \sum_{k=1}^{t} \frac{\partial \textbf h_k}{\partial U} \frac{\partial \textbf h_t}{\partial \textbf h_k} \frac{\partial \hat{\textbf y}_t}{\partial \textbf h_t} \frac{\partial E_t}{\partial \hat{\textbf y}_t}$$

where

$$\begin{aligned}\frac{\partial \textbf h_t}{\partial \textbf h_k} &= \prod_{i=k+1}^{t} \frac{\partial \textbf h_i}{\partial \textbf h_{i-1}} \\ &= \prod_{i=k+1}^{t} \frac{\partial \textbf s_i}{\partial \textbf h_{i-1}} \frac{\partial f(\textbf s_i)}{\partial \textbf s_i} \\ &= \prod_{i=k+1}^{t} U^{\top} \text{diag}(f'(\textbf s_i)) \end{aligned}$$

As can be seen from this equation, when the tanh or logistic activation function is used, the derivative values lie in $(0, 1]$ (for the logistic function, in $(0, 1/4]$). So if the norm of the weight matrix $U$ is not very large, then after propagating over $t-k$ steps the norm of $\dfrac{\partial \textbf h_t}{\partial \textbf h_k}$ tends to 0, which leads to the gradient vanishing problem.
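The following small numpy experiment (an illustration added here, not from the original post) tracks the norm of the product $\prod_i U^{\top}\text{diag}(f'(\textbf s_i))$ for tanh with randomly drawn stand-in pre-activations; a modest weight scale tends to drive the norm toward 0, while a large one tends to make it blow up.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 50
for scale in (0.5, 2.0):                       # modest vs. large weight scale
    U = rng.normal(scale=scale / np.sqrt(H), size=(H, H))
    J = np.eye(H)                              # accumulates dh_t/dh_k
    for step in range(50):
        s = rng.normal(size=(H, 1))            # stand-in pre-activation s_i
        J = U.T @ np.diag((1.0 - np.tanh(s) ** 2).ravel()) @ J
    print(f"scale={scale}: ||dh_t/dh_k|| after 50 steps = {np.linalg.norm(J):.3e}")
```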

To mitigate gradient vanishing, one can use ReLU or PReLU as the activation function, and initialize $U$ to the identity matrix instead of initializing it randomly.

In other words, although a simple RNN can in theory maintain dependencies between states separated by long intervals, in practice it can only learn short-term dependencies.

This is called the long-term dependency problem, which is mitigated by the LSTM unit.

For the gradient exploding problem, a relatively simple strategy is usually used, such as gradient clipping: in each iteration, if the norm of the gradient over all weights (the square root of the sum of squared gradients) exceeds a threshold, all gradients are multiplied by a scaling factor (the threshold divided by that norm) so that the weight matrices are not updated too quickly.
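A minimal sketch of this clipping rule, assuming the gradients are collected in a dict keyed by parameter name (e.g. `{'U': dU, 'V': dV, 'W': dW}`):

```python
import numpy as np

def clip_gradients(grads, threshold):
    """Global-norm gradient clipping.

    If the global norm of all gradients exceeds `threshold`, every gradient
    is scaled by threshold / global_norm so the update is not too large.
    """
    global_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads.values()))
    if global_norm > threshold:
        scale = threshold / global_norm
        grads = {name: g * scale for name, g in grads.items()}
    return grads
```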

Resources:

[1] The lecture notes on neural networks and deep learning

[2] Recurrent Neural Networks Tutorial, Part 3: Backpropagation Through Time and Vanishing Gradients

[3] BPTT algorithm derivation

[4] On the difficulty of training recurrent neural networks

[5] Recurrent nets and LSTM

[6] BPTT derivation of LSTM

