-notes for the "Deep Learning book, Chapter Sequence modeling:recurrent and recursive Nets.
Meta info: I'd like to thank the authors of the original book for their great work. For brevity, figures and text from the original book are used here without detailed attribution. Also, many thanks to Colah and Shi for their excellent blog posts on LSTMs, from which we use some figures.

Introduction
Recurrent neural networks (RNNs) are neural networks specialized for handling sequential data.
RNNs share parameters across different positions / indexes of time / time steps of the sequence, which makes it possible to generalize to examples of different sequence lengths. An RNN is usually a better alternative to position-independent classifiers and to sequential models that treat each position differently.
How does an RNN share parameters? Each member of the output is produced using the same update rule applied to the previous outputs. Such an update rule is often a (same) NN layer, such as the "A" in the figure below (figure from Colah).
Notation: we refer to RNNs as operating on a sequence that contains vectors x(t), with the time step index t ranging from 1 to τ. Usually, there is also a hidden state vector h(t) for each time step t.

10.1 Unfolding Computational Graphs
The basic formula of an RNN is equation (10.4) of the book. It basically says that the current hidden state h(t) is a function f of the previous hidden state h(t-1) and the current input x(t). θ denotes the parameters of the function f. The network typically learns to use h(t) as a kind of lossy summary of the task-relevant aspects of the past sequence of inputs up to t.
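Since the equation and figure screenshots from the book are not reproduced in these notes, here is a reconstruction of (10.4) in the book's notation:

```latex
% Recurrent state update, eq. (10.4): the same f and the same
% parameters \theta are reused at every time step.
h^{(t)} = f\left(h^{(t-1)},\, x^{(t)};\, \theta\right)
```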
Unfolding maps the circuit-style graph on the left to the unrolled graph on the right in the figure below (both are computational graphs of an RNN without output o),
where the black square indicates an interaction with a delay of a single time step, i.e., from the state at time t to the state at time t + 1.
Unfolding/parameter sharing is better than using different parameters per position: fewer parameters to estimate, and it generalizes to sequences of various lengths.

10.2 Recurrent Neural Networks
Variation 1 of RNN (basic form): hidden2hidden connections, sequence output, as in Fig. 10.3.
The basic equations that define the above RNN are shown in (10.6) below (on pp. 385 of the book).
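The screenshot is omitted here as well; reconstructing the defining equations in the book's notation, where U, W and V are the input-to-hidden, hidden-to-hidden and hidden-to-output weight matrices, and b, c are bias vectors:

```latex
% Forward propagation of the basic RNN of Fig. 10.3
a^{(t)} = b + W h^{(t-1)} + U x^{(t)}            \\
h^{(t)} = \tanh\left(a^{(t)}\right)              \\
o^{(t)} = c + V h^{(t)}                          \\
\hat{y}^{(t)} = \operatorname{softmax}\left(o^{(t)}\right)
```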
The total loss for a given sequence of x values paired with a sequence of y values is then just the sum of the losses over all the time steps. For example, if L(t) is the negative log-likelihood of y(t) given x(1), ..., x(t), then summing them up gives the loss for the whole sequence, as shown in (10.7).

Forward pass: the runtime is O(τ) and cannot be reduced by parallelization, because the forward propagation graph is inherently sequential; each time step may only be computed after the previous one.

Backward pass: see Section 10.2.2.
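As a concrete illustration, here is a minimal NumPy sketch of this forward pass, accumulating the total loss as the sum of per-step negative log-likelihoods L = sum_t L(t) = -sum_t log p(y(t) | x(1), ..., x(t)) as in (10.7). The shapes and the random initialization are my own assumptions, not from the book.

```python
import numpy as np

def rnn_forward(x_seq, y_seq, U, W, V, b, c, h0):
    """Forward pass of the basic RNN (Fig. 10.3 style).

    x_seq: list of input vectors x(1..tau); y_seq: list of target class indices.
    Returns hidden states, softmax outputs, and the summed NLL loss (10.7).
    """
    h, hs, ys_hat, loss = h0, [], [], 0.0
    for x, y in zip(x_seq, y_seq):      # inherently sequential: O(tau) steps
        a = b + W @ h + U @ x           # a(t) = b + W h(t-1) + U x(t)
        h = np.tanh(a)                  # h(t) = tanh(a(t))
        o = c + V @ h                   # o(t) = c + V h(t)
        y_hat = np.exp(o - o.max())
        y_hat /= y_hat.sum()            # yhat(t) = softmax(o(t))
        loss += -np.log(y_hat[y])       # L(t) = -log p(y(t) | x(1..t))
        hs.append(h)
        ys_hat.append(y_hat)
    return hs, ys_hat, loss

# Toy usage with assumed sizes: 3-dim input, 4-dim hidden state, 2 classes.
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
b, c, h0 = np.zeros(4), np.zeros(2), np.zeros(4)
xs = [rng.normal(size=3) for _ in range(5)]
ys = [0, 1, 1, 0, 1]
_, _, L = rnn_forward(xs, ys, U, W, V, b, c, h0)
print("total loss:", L)
```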
Variation 2 of RNN: output2hidden connections, sequence output. As shown in Fig. 10.4, it produces an output at each time step and has recurrent connections only from the output at one time step to the hidden units at the next time step.
Teacher forcing (Section 10.2.1, pp. 385) can be used to train RNNs like the one in Fig. 10.4 (above), where only output2hidden connections exist, i.e., hidden2hidden connections are absent.
In teacher forcing, the model is trained to maximize the conditional probability of the current output y(t), given both the x sequence so far and the previous output y(t-1), i.e., the gold-standard output of the previous time step is fed back in during training.
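A minimal sketch of the training-time vs. test-time difference, with a toy output2hidden update rule and parameter names of my own (not the book's equations):

```python
import numpy as np

def step(x, y_prev, params):
    """One step of a toy output2hidden RNN: the recurrence goes through
    the previous *output*, not through the previous hidden state."""
    U, R, V = params
    h = np.tanh(U @ x + R @ y_prev)
    return V @ h                      # o(t), treated here as the prediction

def run(x_seq, y_gold, params, teacher_forcing=True):
    outputs, y_prev = [], np.zeros_like(y_gold[0])
    for t, x in enumerate(x_seq):
        o = step(x, y_prev, params)
        outputs.append(o)
        # Teacher forcing (training): feed the gold-standard y(t) into step t+1.
        # Free running (test time): feed the model's own output back in.
        y_prev = y_gold[t] if teacher_forcing else o
    return outputs

rng = np.random.default_rng(1)
params = (rng.normal(size=(3, 2)), rng.normal(size=(3, 2)), rng.normal(size=(2, 3)))
xs = [rng.normal(size=2) for _ in range(4)]
ys = [rng.normal(size=2) for _ in range(4)]
print(run(xs, ys, params, teacher_forcing=True)[0])
```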
Variation 3 of RNN: hidden2hidden connections, single output. As in Fig. 10.5, it has recurrent connections between hidden units; it reads an entire sequence and then produces a single output.

10.2.2 Computing the Gradient in a Recurrent Neural Network
How? Use the back-propagation through time (BPTT) algorithm on the unrolled graph. Basically, it is the application of the chain rule to the unrolled graph, for the parameters U, V, W, b and c as well as for the sequence of nodes indexed by t: x(t), h(t), o(t) and L(t).
Hopefully you will find the following derivations elementary ... if not, reading the book probably will not help, either.
The derivations are w.r.t. the basic form of RNN, namely Fig. 10.3 and equation (10.6). We copy Fig. 10.3 again here:
From pp. 389:
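The screenshot from pp. 389 is not reproduced here; the following is my reconstruction of the internal-node gradients for the softmax output with negative log-likelihood loss (the diag(1 - (h)^2) factor is the Jacobian of tanh):

```latex
(\nabla_{o^{(t)}} L)_i = \hat{y}^{(t)}_i - \mathbf{1}_{i = y^{(t)}}   \\
\nabla_{h^{(\tau)}} L = V^\top \nabla_{o^{(\tau)}} L                   \\
\nabla_{h^{(t)}} L = W^\top \operatorname{diag}\!\left(1 - \left(h^{(t+1)}\right)^2\right) \nabla_{h^{(t+1)}} L
    + V^\top \nabla_{o^{(t)}} L  \qquad (t < \tau)
```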
Once the gradients on the internal nodes of the computational graph are obtained, we can obtain the gradients on the parameter nodes, which have descendants at all the time steps:
pp. 390:
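Likewise, a reconstruction of the parameter gradients from pp. 390; each is a sum of contributions from all time steps, because the parameters are shared across time:

```latex
\nabla_{c} L = \sum_t \nabla_{o^{(t)}} L                                                              \\
\nabla_{b} L = \sum_t \operatorname{diag}\!\left(1 - \left(h^{(t)}\right)^2\right) \nabla_{h^{(t)}} L \\
\nabla_{V} L = \sum_t \left(\nabla_{o^{(t)}} L\right) h^{(t)\top}                                     \\
\nabla_{W} L = \sum_t \operatorname{diag}\!\left(1 - \left(h^{(t)}\right)^2\right) \left(\nabla_{h^{(t)}} L\right) h^{(t-1)\top} \\
\nabla_{U} L = \sum_t \operatorname{diag}\!\left(1 - \left(h^{(t)}\right)^2\right) \left(\nabla_{h^{(t)}} L\right) x^{(t)\top}
```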
Note: we leave out Section 10.2.3 and Section 10.2.4, both of which are about the graphical-model interpretation of RNNs, as they are not essential for the idea flow, in my opinion ...
Note 2: one may want to jump to Section 10.7 and read till the end before coming back to 10.3–10.6, as those sections run in parallel with Sections 10.7–10.13, which coherently center on the long-term dependency problem.

10.3 Bidirectional RNNs
In many applications we want to output a prediction of y(t) that may depend on the whole input sequence, e.g., co-articulation in speech recognition, right neighbors in POS tagging, etc.
Bidirectional RNNs combine an RNN that moves forward through time, beginning from the start of the sequence, with another RNN that moves backward through time, beginning from the end of the sequence.
Fig. 10.11 (below) illustrates the typical bidirectional RNN, where h(t) and g(t) stand for the (hidden) state of the sub-RNN that moves forward and backward through time, respectively. This allows the output units o(t) to compute a representation that depends on both the past and the future, but is most sensitive to the input values around time t.
Figure 10.11: computation of a typical bidirectional recurrent neural network, meant to learn to map input sequences x to target sequences y, with loss L(t) at each step t.
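A minimal NumPy sketch of the idea, with a toy parameterization of my own rather than the book's exact equations: one sub-RNN runs left to right, the other right to left, and the output at each step sees both states.

```python
import numpy as np

def bidirectional_rnn(x_seq, Wf, Uf, Wb, Ub, Vf, Vb, c):
    """h[t]: forward (past) states, g[t]: backward (future) states,
    o[t] depends on both, so it sees the whole sequence."""
    T, H = len(x_seq), Wf.shape[0]
    h = np.zeros((T, H))
    g = np.zeros((T, H))
    for t in range(T):                              # forward sub-RNN
        prev = h[t - 1] if t > 0 else np.zeros(H)
        h[t] = np.tanh(Wf @ prev + Uf @ x_seq[t])
    for t in reversed(range(T)):                    # backward sub-RNN
        nxt = g[t + 1] if t + 1 < T else np.zeros(H)
        g[t] = np.tanh(Wb @ nxt + Ub @ x_seq[t])
    return [c + Vf @ h[t] + Vb @ g[t] for t in range(T)]   # o(t)

rng = np.random.default_rng(2)
D, H, O = 3, 4, 2
o_seq = bidirectional_rnn(
    [rng.normal(size=D) for _ in range(5)],
    rng.normal(size=(H, H)), rng.normal(size=(H, D)),
    rng.normal(size=(H, H)), rng.normal(size=(H, D)),
    rng.normal(size=(O, H)), rng.normal(size=(O, H)), np.zeros(O))
print(len(o_seq), o_seq[0].shape)
```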
Footnote: this idea can be naturally extended to 2-dimensional input, such as images, by having four RNNs ...

10.4 Encoder-Decoder Sequence-to-Sequence Architectures
Encoder-decoder architecture, basic idea:
(1) An encoder or reader or input RNN processes the input sequence. The encoder emits the context C, usually as a simple function of its final hidden state.
(2) A decoder or writer or output RNN is conditioned on that fixed-length vector to generate the output sequence y = (y(1), ..., y(n_y)).
Highlight: the lengths of the input and output sequences can differ from each other. This architecture is now widely used in machine translation, question answering, etc.
See Fig. 10.12 below.
Training: the two RNNs are trained jointly to maximize the average of log P(y(1), ..., y(n_y) | x(1), ..., x(n_x)) over all pairs of x and y sequences in the training set.
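A minimal NumPy sketch of the idea, with made-up parameter names and greedy decoding (a real system would be trained jointly as described above, and would usually decode with beam search):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(x_seq, U_e, W_e):
    """Encoder RNN: here the context C is simply the final hidden state."""
    h = np.zeros(W_e.shape[0])
    for x in x_seq:
        h = np.tanh(U_e @ x + W_e @ h)
    return h

def decode(C, emb, W_d, V_d, max_len=10, eos=0):
    """Decoder RNN conditioned on C, used here as the initial state
    (it could instead be fed to the hidden units at every step)."""
    h, y, out = C, eos, []
    for _ in range(max_len):
        h = np.tanh(W_d @ h + emb[y])          # previous output token feeds back
        y = int(np.argmax(softmax(V_d @ h)))   # greedy choice of the next token
        out.append(y)
        if y == eos:
            break
    return out

rng = np.random.default_rng(3)
D, H, VOCAB = 3, 4, 5
C = encode([rng.normal(size=D) for _ in range(6)],
           rng.normal(size=(H, D)), rng.normal(size=(H, H)))
print(decode(C, rng.normal(size=(VOCAB, H)), rng.normal(size=(H, H)),
             rng.normal(size=(VOCAB, H))))
```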
Variations: if the context C is a vector, then the decoder RNN is simply a vector-to-sequence RNN. As we have seen (in Sec. 10.2.4), there are at least two ways for a vector-to-sequence RNN to receive input: the input can be provided as the initial state of the RNN, or the input can be connected to the hidden units at each time step. These two ways can also be combined.

10.5 Deep Recurrent Networks
The computation in most RNNs can be decomposed into three blocks of parameters and associated transformations:
1. from the input to the hidden state, x(t) → h(t)
2. from the previous hidden state to the next hidden state, h(t-1) → h(t)
3. from the hidden state to the output, h(t) → o(t)
In the models discussed previously, each of these transformations is a shallow transformation, i.e., one that would be represented by a single layer within a deep MLP. However, we can use multiple layers for each of the above transformations, which results in deep recurrent networks.
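For concreteness, here is a minimal NumPy sketch of one way to get a deep RNN: stacking, where each layer's hidden state is the input of the layer above. The shapes and update rule are toy choices of my own.

```python
import numpy as np

def deep_rnn_step(x, hs_prev, layers):
    """One time step of a stacked RNN.
    layers: list of (W, U) pairs; layer l sees its own previous state and the
    current state of layer l-1 (the raw input x for the first layer)."""
    hs, inp = [], x
    for (W, U), h_prev in zip(layers, hs_prev):
        h = np.tanh(W @ h_prev + U @ inp)
        hs.append(h)
        inp = h          # this layer's state is the next layer's input
    return hs

rng = np.random.default_rng(4)
D, H = 3, 4
layers = [(rng.normal(size=(H, H)), rng.normal(size=(H, D))),   # layer 1 sees x
          (rng.normal(size=(H, H)), rng.normal(size=(H, H)))]   # layer 2 sees h1
hs = [np.zeros(H), np.zeros(H)]
for x in [rng.normal(size=D) for _ in range(5)]:
    hs = deep_rnn_step(x, hs, layers)
print([h.shape for h in hs])
```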
Fig. 10.13 (below) shows the resulting deep RNNs, if we:
(a) break the hidden recurrent state into groups organized hierarchically (stacked hidden2hidden layers),
(b) introduce deeper architectures (e.g., an MLP) for all of transformations 1, 2 and 3 above, and
(c) add "skip connections" for RNNs whose hidden2hidden transformation is deep, to shorten the path between time steps.

10.6 Recursive Neural Networks
A recursive network has a computational graph that generalizes that of the recurrent network from a chain to a tree.
Pro: compared with an RNN, for a sequence of the same length τ, the depth (measured as the number of compositions of nonlinear operations) can be drastically reduced from τ to O(log τ).
Con: how to best structure the tree? A balanced binary tree is an option, but it is not optimal for many kinds of data. For natural sentences, one can use a parser to yield the tree structure, but this is both expensive and inaccurate. Thus recursive NNs are not that popular.

10.7 The Challenge of Long-Term Dependencies

Comments: this is the core challenge of RNNs, which drives the rest of the chapter.
The long-term dependency challenge motivates various solutions such as echo state networks (Section 10.8), leaky units (Sec. 10.9) and the famous LSTM (Sec. 10.10), as well as gradient clipping and neural Turing machines (Sec. 10.11).
Recurrent networks involve the composition of the same function multiple times, once per time step. These compositions can result in extremely nonlinear behavior. But let's focus on a linear simplification of the RNN, where all non-linearities are removed, for an easier demonstration of why long-term dependencies can be problematic.
Without non-linearity, the recurrence relation for h(t) w.r.t. h(t-1) is now simply a matrix multiplication. If we recurrently apply this relation until we reach h(0), and note that W admits an eigendecomposition, the recurrence can be simplified further, as sketched below.
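Since the equation screenshots are omitted, here is a reconstruction of the chain of simplifications (notation slightly simplified relative to the book, which carries a transpose and an orthogonal Q; the conclusion is the same):

```latex
h^{(t)} = W h^{(t-1)}
\;\Longrightarrow\;
h^{(t)} = W^{t} h^{(0)},
\qquad
W = Q \Lambda Q^{-1}
\;\Longrightarrow\;
h^{(t)} = Q \Lambda^{t} Q^{-1} h^{(0)}
```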
In the "other words", the recurrence means that eigenvalues are raised to the power of T. This means is eigenvalues with magnitude less than one to vanish to zero and eigenvalues with magnitude greater one to explode. The above analysis shows the essence to the vanishing and exploding gradient for problem.
Comment: the same pattern of repeated matrix multiplication shows up in the gradients of an actual RNN, if we look back at Section 10.2.2 "Computing the gradient in a recurrent neural network".
Bengio et al. (1993, 1994) show that whenever the model is able to represent long-term dependencies, the gradient of a long-term interaction has exponentially smaller magnitude than the gradient of a short-term interaction. This means it can be time-consuming, if not impossible, to learn long-term dependencies. The following sections are all devoted to solving this problem.
Practical tip: the maximum sequence length that an SGD-trained traditional RNN can handle is only about 10 ~ 20.

10.8 Echo State Networks
Note: this approach seems to be non-salient in the literature, so knowing the basic concept is probably enough. The techniques are only explained at an abstract level in the book, anyway.
Basic idea: since the recurrence causes the vanishing/exploding problems, we can fix the recurrent weights such that the recurrent hidden units do a good job of capturing the history of past inputs (thus "echo"), and only learn the output weights.