Deep Learning Notes (iv): Recurrent neural networks — concept, structure and annotated code


Deep Learning Notes (i): Logistic classification
Deep Learning Notes (ii): Simple neural networks, backpropagation and implementation
Deep Learning Notes (iii): Activation functions and loss functions
Deep Learning Notes: A summary of optimization methods (BGD, SGD, Momentum, Adagrad, RMSProp, Adam)
Deep Learning Notes (iv): The concept, structure and annotated code of recurrent neural networks
Deep Learning Notes (v): LSTM
Deep Learning Notes (vi): The encoder-decoder model and the attention model

The concept and structure sections of this article are excerpted from The Unreasonable Effectiveness of Recurrent Neural Networks (part one); the code section comes from Karpathy's minimal character-level RNN language model in Python/numpy, to which I have added detailed comments.


Recurrent neural networks

An obvious limitation of vanilla neural networks (and convolutional networks as well) is that their API is too constrained: they accept a fixed-size vector as input (such as an image) and produce a fixed-size vector as output (for example, probabilities of different classes). Moreover, these models perform this mapping with a fixed amount of computation (such as the number of layers in the model). The core reason RNNs are so exciting is that they allow us to operate over sequences of vectors: the input can be a sequence, the output can be a sequence, or in the most general case both can be sequences. Here are some intuitive examples:

Each square in the image above represents a vector, and each arrow represents a function (such as matrix multiplication). Input vectors are red, output vectors are blue, and the green vectors hold the RNN's state (more on this soon). From left to right:

Ordinary processing without an RNN, from fixed-size input to fixed-size output (such as image classification).
Sequence output (for example, image captioning: the input is an image and the output is a sentence of words).
Sequence input (for example, sentiment analysis: the input is a sentence and the output is a classification of whether the sentence expresses positive or negative sentiment).
Sequence input and sequence output (such as machine translation: the RNN reads a sentence in English and outputs a sentence in French).
Synced sequence input and output (such as video classification, where we label every frame of the video).
Note that in every case there are no pre-specified constraints on sequence length, because the recurrent transformation (the green part) is fixed and can be applied as many times as we like.

As you might expect, the sequence regime of operation is much more powerful than fixed networks whose computational steps are predetermined from the start, and hence much more appealing to those of us who want to build more intelligent systems. As we'll see later, an RNN combines its input vector with its state vector using a fixed (but learned) function to produce a new state vector. In programming terms, this can be understood as running a fixed program with certain inputs and some internal variables. Viewed this way, RNNs essentially describe programs. In fact, RNNs are Turing-complete: they can simulate arbitrary programs given the right weights. But much like the universal approximation theorems for neural networks, you shouldn't read too much into this. In fact, I suggest you forget I said anything.

If training a vanilla neural network is optimization over functions, training a recurrent network is optimization over programs.

Sequential processing in the absence of sequences. You might be thinking that it is relatively rare to have sequences as inputs or outputs, but it is important to realize that even if the input or output is a fixed-size vector, you can still use this powerful formalism to process it sequentially. For example, the figure below comes from two very good papers from DeepMind. The left animation shows an algorithm that learns a recurrent network policy that steers its attention around an image; more specifically, it learns to read out house numbers from left to right. The right animation shows a recurrent network that learns to sequentially add color to a canvas, painting a picture of a digit.

Left: an RNN learns to read house numbers. Right: an RNN learns to paint house numbers.

The important point is that even when the data is not in the form of sequences, you can still build and train powerful models that learn to process it sequentially. In other words, you let the model learn a stateful, multi-step program that operates on fixed-size data.

RNN computation. So how do these things actually work? At its core, an RNN has a deceptively simple API: it accepts an input vector x and returns an output vector y. Crucially, however, the contents of this output vector are influenced not only by the input you just fed in, but also by the entire history of inputs. Written as a class, the RNN's API consists of a single step method:

rnn = RNN()
y = rnn.step(x)  # x is an input vector, y is the RNN's output vector

Every time the step method is called, the RNN's internal state is updated. In the simplest case, this state consists of a single hidden vector h. Here is an implementation of the step method in a vanilla RNN:

import numpy as np

class RNN:
  # ... (the weight matrices W_hh, W_xh, W_hy are assumed to be initialized elsewhere)
  def step(self, x):
    # update the hidden state
    self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
    # compute the output vector
    y = np.dot(self.W_hy, self.h)
    return y

The code above specifies the forward pass of a vanilla RNN. The RNN's parameters are the three matrices W_hh, W_xh, W_hy, and the hidden state self.h is initialized to the zero vector. The np.tanh function implements a nonlinearity that squashes the activations into the range [-1, 1]. Notice briefly how this works: there are two terms inside the tanh, one based on the previous hidden state and one based on the current input. In numpy, np.dot is matrix multiplication. The two intermediate results are added, and the sum is squashed by tanh into the new state vector. If you prefer mathematical notation, the update is:
h_t = tanh(W_hh * h_{t-1} + W_xh * x_t)
where tanh is applied elementwise.
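To make the recurrence concrete, here is a minimal sketch (the weight shapes are hand-picked for illustration and are not from the original post) that drives the step method above over a short sequence; the same weights are reused at every time step while h accumulates the history:

import numpy as np

rnn = RNN()  # the class sketched above
hidden_size, vocab_size = 3, 4  # illustrative sizes, matching the figure later on
rnn.h = np.zeros(hidden_size)
rnn.W_xh = np.random.randn(hidden_size, vocab_size) * 0.01
rnn.W_hh = np.random.randn(hidden_size, hidden_size) * 0.01
rnn.W_hy = np.random.randn(vocab_size, hidden_size) * 0.01

for i in (0, 1, 2, 2):         # a toy index sequence
    x = np.eye(vocab_size)[i]  # 1-of-k input vector
    y = rnn.step(x)            # y depends on the entire prefix through rnn.h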

We initialize the RNN's matrices with random numbers, and the bulk of the work during training goes into finding the matrices that give rise to desirable behavior, as measured by some loss function expressing your preference for which outputs y you would like to see in response to your inputs x.

Going deep. RNNs are neural networks, and everything works monotonically better (if done right) when you stack the models up like pancakes and go deeper. For example, we can form a 2-layer recurrent network as follows:

y1 = rnn1.step(x)
y = rnn2.step(y1)

In other words, we have two separate RNNs: one receives the input vectors, and the second receives the output of the first RNN as its input. Except neither of these RNNs knows or cares which is which: it's all just vectors coming in and going out, and gradients flowing through each module during backpropagation.
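Generalizing the two-line snippet above, a deeper stack is just a loop. A minimal sketch, assuming each RNN in the list has its weights initialized as before with matching dimensions:

def deep_step(rnns, x):
    # feed each layer's output into the layer above it
    for rnn in rnns:
        x = rnn.step(x)
    return x

# e.g. one step of a 3-layer recurrent network:
# y = deep_step([rnn1, rnn2, rnn3], x)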

A slightly better network. I should mention briefly that in practice a slightly different formulation is commonly used: the Long Short-Term Memory network I mentioned earlier, or LSTM. The LSTM is a particular type of recurrent network that works slightly better in practice, owing to its more powerful update equation and more appealing backpropagation dynamics. I won't go into detail here, but everything this article says about RNNs stays exactly the same, except that the single line computing the state update (self.h = ...) gets a little more complicated. From here on I will use the terms "RNN" and "LSTM" interchangeably, but all the experiments in this article use an LSTM.
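As a rough illustration of how the state update "gets a little more complicated", here is a sketch of a single LSTM step using the standard gate equations (this is the textbook formulation, not code from the original post):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM step; W maps the concatenation [x; h] to 4 stacked gate pre-activations."""
    z = np.dot(W, np.concatenate([x, h])) + b
    H = h.size
    i = sigmoid(z[:H])       # input gate: how much new information to let in
    f = sigmoid(z[H:2*H])    # forget gate: how much old cell state to keep
    o = sigmoid(z[2*H:3*H])  # output gate: how much of the cell to expose
    g = np.tanh(z[3*H:])     # candidate cell update
    c = f * c + i * g        # new cell state: gated blend of old and new
    h = o * np.tanh(c)       # new hidden state
    return h, c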


Character-level language model

Now that we understand what RNNs are, why they are exciting, and how they work, let's ground this in a fun application: we will train an RNN character-level language model. That is, we will give the RNN a huge chunk of text and ask it to model the probability distribution of the next character in the sequence given the characters that came before it. This then allows us to generate new text one character at a time.

In the following example, suppose our vocabulary consists of only the four letters "helo", and we train the RNN on the training sequence "hello". This training sequence is actually a source of 4 separate training examples: 1. the probability of "e" should be highest given the context "h"; 2. "l" should be likely given the context "he"; 3. "l" should also be likely given the context "hel"; 4. "o" should be likely given the context "hell". The pairs are constructed in code as shown below.
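In code, extracting these 4 (input, target) pairs from "hello" looks like this (a small sketch mirroring the data preparation in the full program further down):

vocab = ['h', 'e', 'l', 'o']
char_to_ix = {ch: i for i, ch in enumerate(vocab)}

seq = 'hello'
inputs = [char_to_ix[ch] for ch in seq[:-1]]  # 'h','e','l','l' -> [0, 1, 2, 2]
targets = [char_to_ix[ch] for ch in seq[1:]]  # 'e','l','l','o' -> [1, 2, 2, 3]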

Concretely, we will encode each character into a vector using 1-of-k encoding (all zeros except for a single 1 at the index of the character in the vocabulary), and feed them into the RNN one at a time with the step method. We then observe a sequence of 4-dimensional output vectors (one dimension per character), which we interpret as the confidence the RNN currently assigns to each character coming next in the sequence. Here is the flow diagram:
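For instance, with the "helo" vocabulary, the 1-of-k encoding of the first input "h" is a 4-dimensional column vector with a 1 in position 0 (a sketch reusing char_to_ix from the previous snippet):

import numpy as np

vocab_size = 4
x = np.zeros((vocab_size, 1))
x[char_to_ix['h']] = 1  # x is now [[1], [0], [0], [0]]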


An example RNN: the input and output layers are 4-dimensional, and the hidden layer has 3 neurons. The diagram shows the activations in the forward pass when the RNN is fed the characters "hell" as input. The output layer contains the RNN's confidence about which character comes next (the vocabulary is "helo"). We want the green numbers to be high and the red numbers to be low.

For example: in the first time step, when the RNN saw the character "h", it assigned confidence 1.0 to "h", 2.2 to "e", -3.0 to "l", and 4.1 to "o". Since in our training data (the string "hello") the correct next character is "e", we would like to increase its confidence (green) and decrease the confidence of all other letters (red). Similarly, at every one of the 4 time steps there is a target character that we'd like the network to assign greater confidence to. Since the RNN consists entirely of differentiable operations, we can run the backpropagation algorithm (a recursive application of the chain rule from calculus) to figure out in what direction we should adjust every weight to increase the scores of the correct targets (the bold green numbers). We then perform a parameter update, nudging every weight a tiny amount in that direction. If we fed the same inputs to the RNN after the update, we would find that the scores of the correct characters (such as "e" in the first step) would be slightly higher (say 2.3 instead of 2.2), and the scores of incorrect characters slightly lower. We repeat this process many times until the network converges and its predictions become consistent with the training data, always correctly predicting the next character.

A more technical explanation is that we use the standard Softmax classifier (also commonly called the cross-entropy loss) on every output vector simultaneously. The RNN is trained with mini-batch stochastic gradient descent, and RMSProp or Adam can be used to stabilize the parameter updates.
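Concretely, for a single time step the loss and its gradient on the output scores take only a few lines; this sketch uses the scores from the first step of the figure and matches the lossFun in the full program below:

import numpy as np

y = np.array([1.0, 2.2, -3.0, 4.1])  # scores after feeding 'h'
p = np.exp(y) / np.sum(np.exp(y))    # softmax turns scores into probabilities
target = 1                           # index of the correct next character, 'e'
loss = -np.log(p[target])            # cross-entropy loss for this step

dy = np.copy(p)
dy[target] -= 1                      # gradient of the loss with respect to the scores y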

Note also that the first time the character "l" is input, the target is "l", but the second time the target is "o". The RNN therefore cannot rely on the input alone; it must use its recurrent connections to keep track of the context in order to achieve the task.

At test time, we feed a character into the RNN and get a distribution over which characters are likely to come next. We sample a character from this distribution and feed it right back in to get the next one. Repeat this process and you're generating text! Now let's train an RNN on a different dataset and see what happens.

To make this easier to follow, the code below was written for teaching purposes and comes in at a little over 100 lines:

"" "Minimal character-level Vanilla RNN model. Written by Andrej Karpathy (@karpathy) BSD License "" "Import NumPy as NP import Jieba # data I/O data = open ('/home/mult Iangle/download/280.txt ', ' RB '). Read () # should is simple plain text file = Data.decode (' GBK ') data = List (Jieba.cut (  Data,cut_all=false)) chars = List (set (data)) data_size, vocab_size = Len (data), Len (chars) print (' data has%d characters,  %d unique. '% (data_size, vocab_size)) Char_to_ix = {ch:i for i,ch in Enumerate (chars)} Ix_to_char = {i:ch for i,ch in Enumerate (chars)} # Hyperparameters hidden_size = # size of hidden layer of neurons seq_length = # number of S Teps to unroll the RNN for learning_rate = 1e-1 # model Parameters WxH = NP.RANDOM.RANDN (hidden_size, vocab_size) *0.01 # Input to hidden whh = Np.random.randn (hidden_size, hidden_size) *0.01 # hidden to hidden Why = Np.random.randn (Vocab_size, hidden_size) *0.01 # hidden to Output BH = Np.zeros ((hidden_size, 1)) # hidden bias by = Np.zeros (VOcab_size, 1)) # output bias def lossfun (inputs, targets, Hprev): "" "Inputs,targets are both list of integers.
    Hprev is Hx1 array of initial hidden state returns the loss, gradients in model parameters, and last hidden state "" "Xs, HS, YS, PS = {}, {}, {}, {} Hs[-1] = Np.copy (hprev) # Hprev The value of the middle tier, stored as-1, ready for the first loss = 0 # fo
        Rward pass to T in range (len (inputs)): xs[t] = Np.zeros ((vocab_size,1)) # Encode in 1-of-k representation XS[T][INPUTS[T]] = 1 # X[t] is a T-input vector # hyperbolic tangent of a word, activating function that acts like a sigmoid # h (t) = Tanh (Wxh*x + whh*h ( t-1) + BH) generates a new middle layer hs[t] = Np.tanh (Np.dot (WxH, xs[t)) + Np.dot (WHH, hs[t-1]) + BH) # hidden State Tanh # Y (t) = why*h (t) + by ys[t] = Np.dot (Why, hs[t]) + by # unnormalized log probabilities for next chars # SOF Tmax regularization # p (t) = Softmax (Y (t)) ps[t] = Np.exp (ys[t))/Np.sum (Np.exp (Ys[t))) # probabilities F or next chars the output as SoftmaX # Loss + =-log (value) expected output is 1, so the value here is the cost function, using-log (*) so that the farther away from the correct output, the higher the cost function loss + =-np.log (Ps[t][tar gets[t],0]) # Softmax (cross-entropy loss) cost function is cross entropy # after the input loop, get the H, Y and P # of each time period to get the cumulative loss at this time, prepare to update the matrix # BA Ckward Pass:compute gradients going backwards, dwxh dwhh, dwhy = Np.zeros_like (WxH), Np.zeros_like (WHH), Np.zeros_lik E (Why) # The parameters of each matrix are dbh, Dby = np.zeros_like (BH), Np.zeros_like (by) Dhnext = Np.zeros_like (hs[0)) # The potential layer of the next time period, initialize is the zero vector for T in reversed (range (len (inputs)): # The time as a dimension, then the gradient calculation should be along the time backtracking dy = np.copy (ps[t]) # set DY as the actual output, and the desired output (unit to ) is Y, and the cost function is the cross-entropy function dy[targets[t]] = 1 # backprop into Y., http://cs231n.github.io/neural-networks-case-study/#gr Ad if confused here dwhy + = Np.dot (dy, hs[t]. T) # dy * H (t). The greater the value of the T-H layer, the more severe the penalty is if the error occurs. Conversely, the more rewards (this does not seem to consider the derivation of Softmax. Dby + dy # This is nothing to say, just like dwhy, except H = 1, so directly equals dy DH = Np.dot (why.t, dy) + Dhnext # backprop into h z_t = why*h_t + b_y h_t = Tanh (whh*h_t-1 + whx*x_t), the first phase derivative Dhraw = (1-hs[t] * hs[t]) * DH # Backprop through Tanh nonlinearity the second phase of derivation, pay attention to the derivation of Tanh DBH + = Dhraw # DBH Indicates the error passed to the H-level dwxh + = Np.dot (Dhraw, xs[t). T) # Correction to WxH, with why dwhh + = Np.dot (Dhraw, hs[t-1).
        T) # correction of whh Dhnext = Np.dot (whh.t, Dhraw) # H-layer error accumulated for WHH by Dparam in [Dwxh, Dwhh, dwhy, DBH, Dby]: Np.clip (Dparam, -5, 5, Out=dparam) # Clip to mitigate exploding-gradients return loss, Dwxh, dwhh, dwhy, DBH, Dby, Hs[len (inputs)-1] def sample (H, Seed_ix, N): "" "" "" "sample a sequence of integers of the model H is Memor Y state, Seed_ix are seed letter for the "" "X = Np.zeros ((vocab_size, 1)) X[seed_ix] = 1 ixes = [] for T in range (n): h = Np.tanh (Np.dot (WxH, x) + Np.dot (whh, h) + BH) # update middle tier y = Np.dot (Why, H + by # get output P = np.exp (y)/np.sum (Np.exp (y)) # Softmax IX = Np.random.choice (range (vocab_ Size), p=p.Ravel ()) # According to the result of Softmax, the next character x = Np.zeros ((vocab_size, 1)) is generated by probability to generate the next round of input x[ix] = 1 IX Es.append (ix) return ixes N, p = 0, 0 mwxh, mwhh, mwhy = Np.zeros_like (WxH), Np.zeros_like (WHH), Np.zeros_like (Why) m BH, Mby = np.zeros_like (BH), Np.zeros_like (by) # Memory variables for Adagrad Smooth_loss =-np.log (1.0/vocab_size) *seq_le
    Ngth # Loss at Iteration 0 while True: # Prepare inputs (we ' re sweeping from left to right in steps seq_length long) If p+seq_length+1 >= len (data) or n = 0: # If n=0 or P is too large Hprev = Np.zeros ((hidden_size,1)) # Reset RNN Me Mory Middle-tier content initialization, 0 initialize p = 0 # Go from start of data # p Reset inputs = [CHAR_TO_IX[CH] for ch in data[p:p+ Seq_length]] # A batch of input seq_length characters targets = [char_to_ix[ch] for CH in data[p+1:p+seq_length+1]] # targets is the corresponding inputs expectation

    Output. # sample from the model now and then if n% 100 = 0: # 100 words per cycle, sample once, display results Sample_ix = sample (Hprev, Inputs[0], 200)
        txt = '. Join (Ix_to_char[ix] for IX in Sample_ix) print ('----\ n%s \ n----'% (TXT,)) # forward SEQ _length characters through the net and fetch gradient loss, Dwxh, dwhh, dwhy, DBH, dby, Hprev = Lossfun (inputs, target S, hprev) Smooth_loss = Smooth_loss * 0.999 + loss * 0.001 # combine the original loss with the new loss if n% = 0:print (' iter%d , loss:%f '% (n, smooth_loss)) # Print Progress # perform parameter update with Adagrad for Param, Dparam, mem i
                                  n Zip ([WxH, WHH, Why, BH, by], [Dwxh, Dwhh, dwhy, DBH, Dby], [Mwxh, Mwhh, mwhy, mbh, Mby]): Mem + + Dparam * Dparam # gradient cumulative param + =-learning_rate * dparam/n P.SQRT (mem + 1e-8) # Adagrad Update as the number of iterations increases, the change in parameters is getting smaller p + + seq_length # move data pointer n = 1 # Iteration C Ounter, Cycle times
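A note on running this: it needs numpy and jieba installed (pip install jieba), plus a plain-text corpus at the path hard-coded near the top. Because the author's file is GBK-encoded Chinese, the script decodes GBK and segments it into words with jieba; for an English corpus you would drop the decode and jieba steps and iterate over characters directly, as in Karpathy's original file.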
