1. Vanishing gradient
The gradient of the RNN's error at time step t with respect to the hidden-to-hidden weights is:
\(\frac{\partial e_t}{\partial W}=\sum_{k=1}^{t}\frac{\partial e_t}{\partial y_t}\frac{\partial y_t}{\partial h_t}\frac{\partial h_t}{\partial h_k}\frac{\partial h_k}{\partial W}\),
where \(h\) is the output of the hidden layer, \(y_t\) is the network output at time t, and \(W\) is the hidden-to-hidden weight matrix. The factor \(\frac{\partial h_t}{\partial h_k}\) is the derivative chained over the interval [k, t]; this interval can be very long, which is what causes the gradient to vanish or explode. Expanding \(\frac{\partial h_t}{\partial h_k}\) over time:

\(\frac{\partial h_t}{\partial h_k}=\prod_{j=k+1}^{t}\frac{\partial h_j}{\partial h_{j-1}}=\prod_{j=k+1}^{t}W^{T}\,\mathrm{diag}\left[\frac{\partial\sigma(h_{j-1})}{\partial h_{j-1}}\right]\).

What is the diag matrix in the formula above? An example makes it clear. Suppose we want \(\frac{\partial h_5}{\partial h_4}\). Recall how \(h_5\) is obtained in the forward pass: \(h_5=W\sigma(h_4)+W^{hx}x_4\), so \(\frac{\partial h_5}{\partial h_4}=W\frac{\partial\sigma(h_4)}{\partial h_4}\). Note that \(\sigma(h_4)\) and \(h_4\) are both vectors, so \(\frac{\partial\sigma(h_4)}{\partial h_4}\) is a Jacobian matrix:

\(\frac{\partial\sigma(h_4)}{\partial h_4}=\begin{bmatrix}\frac{\partial\sigma_1(h_{41})}{\partial h_{41}} & \cdots & \frac{\partial\sigma_1(h_{41})}{\partial h_{4D}}\\ \vdots & \ddots & \vdots\\ \frac{\partial\sigma_D(h_{4D})}{\partial h_{41}} & \cdots & \frac{\partial\sigma_D(h_{4D})}{\partial h_{4D}}\end{bmatrix}\)

Clearly, the off-diagonal entries are 0, because the sigmoid (logistic) function \(\sigma\) acts element-wise: \(\sigma_i(h_{4i})\) depends only on \(h_{4i}\), so the Jacobian reduces to a diag matrix.
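To make the diag matrix and the repeated Jacobian product concrete, here is a minimal NumPy sketch (the hidden size D, the weight scales, and the dummy input sequence are illustrative assumptions, not taken from the original text). It builds one factor \(W^{T}\,\mathrm{diag}[\sigma'(h_{j-1})]\) per step, multiplies the factors over [k, t], and prints the norm of the product so the vanishing behaviour is visible:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
D = 8                                     # hidden size (assumed for illustration)
W = rng.normal(scale=0.5, size=(D, D))    # hidden-to-hidden weights
Whx = rng.normal(scale=0.5, size=(D, D))  # input-to-hidden weights
x = rng.normal(size=(50, D))              # dummy input sequence

# Forward pass, following the convention in the text: h_t = W*sigmoid(h_{t-1}) + W^{hx}*x
h = [np.zeros(D)]
for t in range(50):
    h.append(W @ sigmoid(h[-1]) + Whx @ x[t])

# Backward product: d h_t / d h_k = prod_{j=k+1}^{t} W^T diag(sigma'(h_{j-1}))
t, prod = 50, np.eye(D)
for j in range(t, 0, -1):
    sigma_prime = sigmoid(h[j - 1]) * (1.0 - sigmoid(h[j - 1]))  # element-wise derivative
    prod = prod @ (W.T @ np.diag(sigma_prime))                   # one factor of the product
    if (t - j + 1) % 10 == 0:
        k = j - 1
        print(f"||d h_{t} / d h_{k}|| = {np.linalg.norm(prod, 2):.3e}")
```

With small weights like these the printed norms shrink rapidly toward zero (vanishing gradient); scaling \(W\) up instead makes the same product blow up (exploding gradient).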
The rest of the vanishing/exploding-gradient derivation is straightforward and is not written out here; see the part after formula (14) in http://cs224d.stanford.edu/lecture_notes/LectureNotes4.pdf.
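For reference, a brief sketch of the conclusion of that derivation, in the notation of those notes (where \(\beta_W\) and \(\beta_h\) bound the norms of \(W^{T}\) and of the diagonal Jacobian, respectively): \(\left\|\frac{\partial h_j}{\partial h_{j-1}}\right\|\le\|W^{T}\|\left\|\mathrm{diag}\left[\frac{\partial\sigma(h_{j-1})}{\partial h_{j-1}}\right]\right\|\le\beta_W\beta_h\), hence \(\left\|\frac{\partial h_t}{\partial h_k}\right\|\le(\beta_W\beta_h)^{t-k}\). When \(\beta_W\beta_h<1\) this factor shrinks exponentially as t-k grows (vanishing gradient); when \(\beta_W\beta_h>1\) it can grow exponentially (exploding gradient).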
2. Summing the derivatives of the nodes
To be continued ...