"Turn" cyclic neural network (RNN, recurrent neural Networks) study notes: Basic theory

Reposted from http://blog.csdn.net/xingzhedai/article/details/53144126

More information: http://blog.csdn.net/mafeiyu80/article/details/51446558

http://blog.csdn.net/caimouse/article/details/70225998

http://kubicode.me/2017/05/15/Deep%20Learning/Understanding-about-RNN/

An RNN (Recurrent Neural Network) is a neural network for modeling sequence data. Following the success of Bengio's neural-network-based probabilistic language model, Mikolov proposed using RNNs for language modeling in 2010, and Sundermeyer applied the improved LSTM variant to language modeling in 2012. In the past two years, RNNs have been rapidly adopted in natural language processing, image recognition, speech recognition, and other fields. Because of project needs, I have recently focused on studying these kinds of models (DNN, RNN, LSTM, and so on). I will record and publish my study notes below, first to deepen my own understanding, and second in the hope that they may be of some help to others.

Recurrent neural networks (RNNs) have been very successful and are widely used in many natural language processing (NLP) tasks, so a search for "RNN" will turn up plenty of material. This article therefore only covers, from my own point of view, the principles of RNNs and how to implement them; a later post will analyze actual source code:

1. The basic principle and derivation of RNN

2. About RNN

1. The basic principle and derivation of RNN

(1) What are RNNs?

RNNs are designed to process sequence data. In a traditional neural network, the layers from input to hidden to output are fully connected layer to layer, while the nodes within each layer are unconnected. Such an ordinary network is powerless for many problems. For example, to predict the next word of a sentence you generally need the preceding words, because the words in a sentence are not independent. RNNs are called recurrent because the current output of a sequence also depends on the previous outputs. Concretely, the network remembers earlier information and applies it to the computation of the current output; that is, the nodes of the hidden layer are no longer unconnected but connected across time steps, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment. In theory, RNNs can process sequence data of any length, but in practice, to reduce complexity, it is often assumed that the current state depends only on the previous few states. Here is a typical RNN:

Unlike a traditional machine learning model, in which the hidden-layer units are completely symmetric, the hidden layer of an RNN carries a temporal order from left to right (Arabic readers sometimes read right to left, ha ha), so the order of the hidden-layer units matters. Here is one more picture showing the local structure:

(2) How RNNs work

An RNN contains input units, whose set is labeled {x0, x1, ..., xt, xt+1, ...} and written as the vector x(t), and output units, whose set is labeled {y0, y1, ..., yt, yt+1, ...} and written as the vector y(t). The RNN also contains hidden units, whose output set we label {s0, s1, ..., st, st+1, ...} and write as the vector s(t); these units do the most important work. In the figure you will find one one-way flow of information from the input units to the hidden units, and another one-way flow from the hidden units to the output units. In some cases an RNN breaks the latter restriction and guides information from the output units back to the hidden units (these connections are called "back projections"); the input of the hidden layer then also includes the state of the previous hidden layer, i.e. the nodes of the hidden layer can be self-connected or interconnected.
The recurrent neural network can be unrolled into a full feed-forward network. For example, for a sentence containing 5 words, the unrolled network is a five-layer neural network, with each layer corresponding to one word. The computation of this network proceeds as follows:
Step 1: x(t) denotes the input at time t (t = 1, 2, 3, ...); for example, x1 is the word vector of the second word in the current input sentence. PS: to process natural language with a computer, the language must first be turned into symbols the machine can recognize, and machine learning additionally requires numeric values for computation. Words are the basis of natural language understanding and processing, so they need to be numericized; word vectors (word representation, word embedding) [1] are a feasible and effective method. What is a word vector? Simply put, a word is represented by a real-valued vector v of a specified length. One of the simplest representations is the one-hot vector: given a vocabulary of size |V|, generate a |V| x 1 vector in which one position is 1 and all the others are 0; that vector then represents one word. This requires building a dictionary before training (no small amount of work), so a more efficient word-vector scheme exists: train words through a neural network or deep learning model and output a vector of specified dimension, which then serves as the representation of the input word, e.g. word2vec (also the work of the great Mikolov, done at Google).
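To make the one-hot representation concrete, here is a minimal Python sketch (the toy vocabulary and helper names are my own illustration, not from the original post):

```python
import numpy as np

# Toy vocabulary of size |V| = 5 (hypothetical).
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector with a 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[word_to_index[word]] = 1.0
    return v

print(one_hot("cat"))   # [0. 1. 0. 0. 0.]
```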
Step 2: s(t) is the state of the hidden layer at time t, the memory unit of the network. s(t) is computed from the output of the current input layer and the hidden-layer state of the previous step: s(t) = f(U·x(t) + W·s(t-1)), where f() is generally a nonlinear activation function such as tanh, ReLU, or sigmoid. When computing s(0), the hidden state for the first word, the formula needs s(-1), which does not exist; in implementations it is generally set to the zero vector.
Step 3: o(t) is the output at time t, the vector representation of the next word: o(t) = softmax(V·s(t)).
Note that the hidden-layer state s(t), the memory unit of the network, summarizes the hidden states of all preceding steps, while the output o(t) depends only on the current s(t). In practice, to reduce the complexity of the network, s(t) is often computed from only a few preceding steps rather than all of them. Also, unlike a traditional neural network, where each layer has its own parameters, an RNN shares the parameters U, V, W across all steps: every step does the same computation, just on different inputs, which greatly reduces the number of parameters the network must learn.
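Putting the three steps together, here is a minimal numpy sketch of the forward pass over an unrolled sequence (the sizes, initialization, and function names are illustrative assumptions, not from the original):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

vocab_size, hidden_size = 8, 4                      # toy sizes (assumptions)
rng = np.random.default_rng(0)
U = rng.normal(0, 0.1, (hidden_size, vocab_size))   # input -> hidden
W = rng.normal(0, 0.1, (hidden_size, hidden_size))  # hidden -> hidden (recurrent)
V = rng.normal(0, 0.1, (vocab_size, hidden_size))   # hidden -> output

def forward(xs):
    """xs: list of one-hot input vectors, one per time step."""
    s = np.zeros(hidden_size)          # s(-1) initialized to 0, as in Step 2
    states, outputs = [], []
    for x in xs:
        s = np.tanh(U @ x + W @ s)     # Step 2: s(t) = f(U x(t) + W s(t-1))
        o = softmax(V @ s)             # Step 3: o(t) = softmax(V s(t))
        states.append(s)
        outputs.append(o)
    return states, outputs
```

For a 5-word sentence this loop performs exactly the five-layer unrolled computation described above, with U, V, W shared across all steps.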

(3) How an RNN works, once more (detailed derivation). This part comes from the slides of the great Mikolov himself; of all the versions I have seen it is the easiest to understand, and first-hand material is always the most valuable.

    • Input layer w and output layer y have the same dimensionality as the vocabulary (10k-200k);
    • Hidden layer s is orders of magnitude smaller (50-1000 neurons);
    • U is the matrix of weights between the input and hidden layers, V is the matrix of weights between the hidden and output layers;
    • Without the recurrent weights W, this model would be a bigram neural network language model (see the shape sketch after this list).
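A small sketch of the weight shapes these bullets imply (the concrete sizes are my own illustrative choices within the quoted ranges):

```python
import numpy as np

vocab_size = 10_000      # input w(t) and output y(t) are vocabulary-sized (10k-200k)
hidden_size = 100        # hidden layer s(t) is orders of magnitude smaller (50-1000)

U = np.zeros((hidden_size, vocab_size))   # input -> hidden weights
V = np.zeros((vocab_size, hidden_size))   # hidden -> output weights
W = np.zeros((hidden_size, hidden_size))  # recurrent hidden -> hidden weights;
                                          # dropping W leaves a bigram neural LM
```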

The input signal appears in the upper-left corner of the figure; in the derivation below it is written as x(t), so as not to confuse it with the weight matrix W.

The output of the hidden layer is s(t):  s(t) = f(U·x(t) + W·s(t-1))    (1)

The output of the output layer is y(t):  y(t) = g(V·s(t))    (2)

where f(z) is the sigmoid activation function and g(z) is the softmax activation function:

f(z) = 1 / (1 + e^(-z)),    g(z_m) = e^(z_m) / Σ_k e^(z_k)    (3)

Training uses stochastic gradient descent (SGD); U, V, and W are updated each time a word is input, using the backpropagation algorithm. The error vector at the output layer (derived from the cross-entropy criterion) is given by formula (4):

e(t) = d(t) - y(t)    (4)

where d(t) is the target vector representing the word w(t+1), encoded as a 1-of-V (one-hot) vector.

Update of the coefficient matrix V:

V(t+1) = V(t) + α · e(t) · s(t)^T    (α is the learning rate)

The error gradient is propagated from the output layer back to the hidden layer as:

e_h(t) = d_h(e(t)^T · V, t)

where the error vector is transformed using the function d_h(), applied element-wise:

d_hj(x, t) = x · s_j(t) · (1 - s_j(t))

Note: the x here is the argument of d_h(), not the input signal.

Update of the coefficient matrix U (note that the w(t) in the slide is the input signal, written here as x(t)):

U(t+1) = U(t) + α · e_h(t) · x(t)^T
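Collecting the updates so far, here is a minimal numpy sketch of one SGD training step for V and U, following the formulas above (the learning rate value and function names are my own assumptions; the W update is handled by BPTT below):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def train_step(x, s_prev, d, U, V, W, alpha=0.1):
    """One SGD step on a single word, without unrolling through time.

    x: one-hot input vector x(t); d: one-hot target for w(t+1);
    s_prev: hidden state s(t-1); alpha: learning rate (assumed value).
    """
    s = sigmoid(U @ x + W @ s_prev)   # (1) hidden layer
    y = softmax(V @ s)                # (2) output layer
    e = d - y                         # (4) output error, cross-entropy gradient
    V += alpha * np.outer(e, s)       # V(t+1) = V(t) + alpha * e(t) s(t)^T
    e_h = (V.T @ e) * s * (1.0 - s)   # e_h(t) = d_h(e(t)^T V, t), sigmoid derivative
    U += alpha * np.outer(e_h, x)     # U(t+1) = U(t) + alpha * e_h(t) x(t)^T
    return s
```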

The RNN can also be unrolled further into the recursive structure shown above; correspondingly, the error propagation of the hidden layer can be written in the following recursive form:

e_h(t-τ-1) = d_h(e_h(t-τ)^T · W, t-τ-1)

The update of the recurrent weight matrix W, written in recursive form, is:

W(t+1) = W(t) + α · Σ_{τ=0}^{T} e_h(t-τ) · s(t-τ-1)^T
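And here is a sketch of the corresponding truncated backpropagation through time (BPTT) step for W, under the same assumptions as the previous snippet (the truncation depth tau_max is illustrative):

```python
import numpy as np

def bptt_update_W(states, e_h_t, W, alpha=0.1, tau_max=4):
    """Truncated BPTT for the recurrent weights, per the recursion above.

    states: hidden states [s(0), ..., s(t)] from the forward pass;
    e_h_t:  hidden-layer error e_h(t) at the current step t.
    """
    t = len(states) - 1
    dW = np.zeros_like(W)
    e_h = e_h_t
    for tau in range(min(tau_max, t)):
        s_prev = states[t - tau - 1]
        dW += np.outer(e_h, s_prev)                  # e_h(t-tau) s(t-tau-1)^T term
        e_h = (W.T @ e_h) * s_prev * (1.0 - s_prev)  # e_h(t-tau-1), sigmoid derivative
    return W + alpha * dW                            # W(t+1) = W(t) + alpha * sum(...)
```

Accumulating the update in dW and applying it once at the end keeps the error propagation consistent with the old W during the backward sweep.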

2. About RNN and Tomas Mikolov (including his lecture at the SIGIR 2016 workshop on neural information retrieval, the Neu-IR workshop)

To learn about RNNs, one has to mention Tomas Mikolov. He is the one who proposed RNN-based language modeling (not the inventor of the RNN itself). He first did natural language processing research at Google, where, as a member of the Google Brain team, he took part in developing the word2vec project; in 2014 he joined the Facebook AI lab as a research scientist. On his personal Facebook page he writes that his long-term research goal is to "develop intelligent machines that can learn and communicate with humans in natural language". Interested readers can add him as a friend on FB and chat with him :-)

His lecture at the SIGIR 2016 workshop on neural information retrieval (the Neu-IR workshop): http://chuansong.me/n/464503442191

"Turn" cyclic neural network (RNN, recurrent neural Networks) study notes: Basic theory

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.