The previous article introduced the working principle of the RNN and its application to image captioning; this article introduces an RNN variant, the LSTM.
To understand why the LSTM exists, first look at what is wrong with the RNN. Because of its activation function and its structure, the RNN suffers from vanishing gradients, which causes two problems:
(1) The network cannot be very deep; otherwise the gradients reaching the deeper layers become negligible, those layers contribute almost nothing, and the extra depth only increases training time.
(2) It can only form short-term memory, not long-term memory. Because the gradient shrinks layer by layer, only nearby timesteps receive gradients of meaningful size, so memory of recent information is good while memory of distant information is poor.
Now let's see how the LSTM solves this problem.
All RNNs have the form of a chain of repeating neural network modules. In a standard RNN, this repeating module has a very simple structure, such as a single tanh layer.
Precisely because the derivative of tanh is less than 1, the gradient keeps shrinking as it is propagated back.
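As a rough illustration (a minimal sketch with made-up numbers, ignoring the recurrent weight matrix for simplicity), multiplying together the tanh derivatives from many timesteps drives the gradient toward zero:

```python
import numpy as np

np.random.seed(0)
# Hypothetical pre-activation values at 50 timesteps of a plain RNN.
a = np.random.randn(50)

# The local derivative of tanh at each step is 1 - tanh(a)^2, which is <= 1.
local_derivs = 1 - np.tanh(a) ** 2

# Backpropagating through all 50 steps multiplies these factors together,
# so the gradient reaching the earliest timestep is vanishingly small.
print(local_derivs.max())     # never exceeds 1
print(np.prod(local_derivs))  # a tiny number, on the order of 1e-10
```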
The LSTM overcomes the gradient problem with a more fine-grained computation at each timestep: where the RNN updates its state with a single computation, the LSTM updates it with four, the "four gates."
Previously I called $h$ the state, but in this new structure it is no longer quite right to call $h$ the state: there is a new variable $c$ that is the real state, and most materials call $c$ the cell state. As for $h$, I am not sure what to call it. It does carry state information forward, but it is in a sense optional, because the value of $h$ can be computed entirely from $c$, so passing it along explicitly causes no problem.
In an RNN: $x_t \in \mathbb{R}^{D}$, $h_t \in \mathbb{R}^{H}$, $W_x \in \mathbb{R}^{H \times D}$, $W_h \in \mathbb{R}^{H \times H}$, $b \in \mathbb{R}^{H}$.
In an LSTM: $x_t \in \mathbb{R}^{D}$, $h_t \in \mathbb{R}^{H}$, $W_x \in \mathbb{R}^{4H \times D}$, $W_h \in \mathbb{R}^{4H \times H}$, $b \in \mathbb{R}^{4H}$.
The first step is still the same: compute $a = W_x x_t + W_h h_{t-1} + b$, now with $a \in \mathbb{R}^{4H}$. An RNN would activate $a$ directly to get the next state; the LSTM instead turns it into four outputs. (The code below uses the transposed convention, with $W_x$ of shape $(D, 4H)$ and $W_h$ of shape $(H, 4H)$.)
$$
\begin{align*}
i = \sigma(a_i) \hspace{2pc}
f = \sigma(a_f) \hspace{2pc}
o = \sigma(a_o) \hspace{2pc}
g = \tanh(a_g)
\end{align*}
$$
$i$, $f$, $o$, $g$ are called the input gate, forget gate, output gate, and candidate (block) gate, with $i, f, o, g \in \mathbb{R}^{H}$.
$$
c_t = f \odot c_{t-1} + i \odot g \hspace{4pc}
h_t = o \odot \tanh(c_t)
$$
Now let's look at what these formulas mean.
"Forgetting" can be understood as "how much of the previous content to keep." The forget gate is essentially a sigmoid, whose output lies in $(0, 1)$, multiplied elementwise (the $\odot$ above, the pink circle in the usual LSTM diagram) with the previous cell state; after training, the network itself decides how much of the earlier content to remember.
The next step is to decide what new information to store in the cell state. This has two parts. First, a sigmoid layer called the "input gate" decides which values we are going to update. Then a tanh layer, the candidate gate, creates a vector of new candidate values $g$ that could be added to the state. The two are then combined to produce the update to the state.
Finally, we need to decide what to output. The output is based on the new cell state we just computed and on the output gate.
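A tiny numeric sketch (made-up gate values, not from the article) of how the forget and input gates act on the cell state:

```python
import numpy as np

# Hypothetical values for three cell-state units.
prev_c = np.array([2.0, -1.5, 0.3])   # previous cell state
g = np.array([0.5, 0.5, 0.5])         # candidate values from the tanh gate
i = np.array([0.1, 0.9, 0.5])         # input gate: how much new info to write
f = np.array([0.95, 0.05, 0.5])       # forget gate: how much old state to keep
o = np.array([0.8, 0.8, 0.8])         # output gate

c = f * prev_c + i * g
h = o * np.tanh(c)

print(c)  # first unit mostly keeps prev_c; second unit mostly overwrites it
print(h)
```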
With the forget gate and the input gate, the derivative along the state path is no longer constantly less than 1, which overcomes the vanishing-gradient problem. (The contrast between the RNN and LSTM gradients can be seen clearly in the code below; you can also derive the gradients by hand and verify that the gradient of the LSTM state is not constantly less than 1.)
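To make that concrete (a short derivation added here from the formulas above): along the direct cell-state path,
$$
\frac{\partial c_t}{\partial c_{t-1}} = f
$$
and $f$ is the output of a learned sigmoid, so it can stay close to 1 for as many steps as the task requires. The gradient flowing along $c$ is therefore not forced to shrink at every step, unlike the repeated $\tanh$ derivatives in a plain RNN.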
Forward pass and backward (gradient) computation:

```python
import numpy as np


def sigmoid(x):
    # Helper used by the LSTM functions below; the original CS231n file
    # provides a numerically stable implementation of this function.
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):
    """
    Forward pass for a single timestep of an LSTM.

    The input data has dimension D, the hidden state has dimension H, and we
    use a minibatch size of N.

    Inputs:
    - x: Input data, of shape (N, D)
    - prev_h: Previous hidden state, of shape (N, H)
    - prev_c: Previous cell state, of shape (N, H)
    - Wx: Input-to-hidden weights, of shape (D, 4H)
    - Wh: Hidden-to-hidden weights, of shape (H, 4H)
    - b: Biases, of shape (4H,)

    Returns a tuple of:
    - next_h: Next hidden state, of shape (N, H)
    - next_c: Next cell state, of shape (N, H)
    - cache: Tuple of values needed for the backward pass.
    """
    H = Wh.shape[0]
    # One affine transform produces the pre-activations for all four gates.
    a = np.dot(x, Wx) + np.dot(prev_h, Wh) + b
    i = sigmoid(a[:, 0:H])          # input gate
    f = sigmoid(a[:, H:2 * H])      # forget gate
    o = sigmoid(a[:, 2 * H:3 * H])  # output gate
    g = np.tanh(a[:, 3 * H:4 * H])  # candidate (block) gate
    next_c = f * prev_c + i * g
    next_h = o * np.tanh(next_c)
    cache = (i, f, o, g, x, Wx, Wh, prev_c, prev_h, next_c)
    return next_h, next_c, cache


def lstm_step_backward(dnext_h, dnext_c, cache):
    """
    Backward pass for a single timestep of an LSTM.

    Inputs:
    - dnext_h: Gradients of next hidden state, of shape (N, H)
    - dnext_c: Gradients of next cell state, of shape (N, H)
    - cache: Values from the forward pass

    Returns a tuple of:
    - dx: Gradient of input data, of shape (N, D)
    - dprev_h: Gradient of previous hidden state, of shape (N, H)
    - dprev_c: Gradient of previous cell state, of shape (N, H)
    - dWx: Gradient of input-to-hidden weights, of shape (D, 4H)
    - dWh: Gradient of hidden-to-hidden weights, of shape (H, 4H)
    - db: Gradient of biases, of shape (4H,)
    """
    i, f, o, g, x, Wx, Wh, prev_c, prev_h, next_c = cache
    do = dnext_h * np.tanh(next_c)
    # Gradient reaching the cell state comes both from dnext_c and from next_h.
    dnext_c = dnext_c + o * (1 - np.tanh(next_c) ** 2) * dnext_h
    di = dnext_c * g
    df = dnext_c * prev_c
    dg = dnext_c * i
    dprev_c = dnext_c * f
    # Backprop through the nonlinearities and stack the four gate gradients.
    da = np.hstack([i * (1 - i) * di,
                    f * (1 - f) * df,
                    o * (1 - o) * do,
                    (1 - g * g) * dg])
    dx = np.dot(da, Wx.T)
    dWx = np.dot(x.T, da)
    dprev_h = np.dot(da, Wh.T)
    dWh = np.dot(prev_h.T, da)
    db = np.sum(da, axis=0)
    return dx, dprev_h, dprev_c, dWx, dWh, db


def lstm_forward(x, h0, Wx, Wh, b):
    """
    Forward pass for an LSTM over an entire sequence of data. We assume an
    input sequence composed of T vectors, each of dimension D. The LSTM uses a
    hidden size of H, and we work over a minibatch containing N sequences.
    After running the LSTM forward, we return the hidden states for all
    timesteps.

    Note that the initial hidden state is passed as input, while the initial
    cell state is set to zero. Also note that the cell state is not returned;
    it is an internal variable of the LSTM and is not accessed from outside.

    Inputs:
    - x: Input data of shape (N, T, D)
    - h0: Initial hidden state of shape (N, H)
    - Wx: Weights for input-to-hidden connections, of shape (D, 4H)
    - Wh: Weights for hidden-to-hidden connections, of shape (H, 4H)
    - b: Biases of shape (4H,)

    Returns a tuple of:
    - h: Hidden states for all timesteps of all sequences, of shape (N, T, H)
    - cache: Values needed for the backward pass.
    """
    N, T, D = x.shape
    H = h0.shape[1]
    h = np.zeros((N, T, H))
    cache = {}
    prev_h = h0
    prev_c = np.zeros((N, H))
    for t in range(T):
        xt = x[:, t, :]
        next_h, next_c, cache[t] = lstm_step_forward(xt, prev_h, prev_c, Wx, Wh, b)
        prev_h = next_h
        prev_c = next_c
        h[:, t, :] = prev_h
    return h, cache


def lstm_backward(dh, cache):
    """
    Backward pass for an LSTM over an entire sequence of data.

    Inputs:
    - dh: Upstream gradients of hidden states, of shape (N, T, H)
    - cache: Values from the forward pass

    Returns a tuple of:
    - dx: Gradient of input data of shape (N, T, D)
    - dh0: Gradient of initial hidden state of shape (N, H)
    - dWx: Gradient of input-to-hidden weight matrix of shape (D, 4H)
    - dWh: Gradient of hidden-to-hidden weight matrix of shape (H, 4H)
    - db: Gradient of biases, of shape (4H,)
    """
    N, T, H = dh.shape
    D = cache[0][4].shape[1]  # cache[t][4] is the input x at timestep t
    dprev_h = np.zeros((N, H))
    dprev_c = np.zeros((N, H))
    dx = np.zeros((N, T, D))
    dWx = np.zeros((D, 4 * H))
    dWh = np.zeros((H, 4 * H))
    db = np.zeros(4 * H)
    # Walk backwards through time, accumulating the weight gradients.
    for t in reversed(range(T)):
        step_cache = cache[t]
        dnext_h = dh[:, t, :] + dprev_h
        dnext_c = dprev_c
        dx[:, t, :], dprev_h, dprev_c, dWxt, dWht, dbt = lstm_step_backward(
            dnext_h, dnext_c, step_cache)
        dWx, dWh, db = dWx + dWxt, dWh + dWht, db + dbt
    dh0 = dprev_h
    return dx, dh0, dWx, dWh, db
```
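As a quick sanity check (a minimal sketch with hypothetical small sizes, not part of the original assignment code), you can run the functions above on random data and confirm the output shapes:

```python
import numpy as np

np.random.seed(0)
N, T, D, H = 2, 5, 4, 3  # hypothetical minibatch/sequence/input/hidden sizes

x = np.random.randn(N, T, D)
h0 = np.random.randn(N, H)
Wx = np.random.randn(D, 4 * H) / np.sqrt(D)
Wh = np.random.randn(H, 4 * H) / np.sqrt(H)
b = np.zeros(4 * H)

h, cache = lstm_forward(x, h0, Wx, Wh, b)
print(h.shape)  # (2, 5, 3)

# Backward pass with a dummy upstream gradient, just to check shapes;
# the CS231n assignment does a proper numerical gradient check.
dh = np.random.randn(*h.shape)
dx, dh0, dWx, dWh, db = lstm_backward(dh, cache)
print(dx.shape, dh0.shape, dWx.shape, dWh.shape, db.shape)
```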
The rest of the code is almost identical to yesterday's, and yesterday's post left an open question that should now be easy to answer.
References:
80572098
http://cs231n.github.io
Image captioning Python implementation - LSTM chapter