Derivation of the BP Algorithm for Deep Learning (with Notes on the RNN and LSTM Derivations)

Last Update:2016-10-30 Source: Internet

Author: User

Note: 1) This article mainly follows Alex Graves's doctoral dissertation "Supervised Sequence Labelling with Recurrent Neural Networks" and explains its derivation of the BP algorithm in detail.

2) The thesis does not cover the handling of biases, but if you patiently work through the formulas it gives, adding them is very simple.

3) Because the thesis is written with speech training in mind, the final softmax output is a probability over a finite set of results, and cross-entropy is then used as the objective function. This may differ from other networks, but the backward derivation should be the same, apart from the step between the output layer and the last hidden layer.

If you want to see the detailed RNN and LSTM formulas, look at the thesis (it only gives the formulas, not the derivation, but readers who are good at math should find them "obvious").

4) I recently found a very good paper, "Deep Sentence Embedding Using Long Short-Term Memory Networks: Analysis and Application to Information Retrieval".

Its derivation of the LSTM is very good. After a read-through it should be easy to relate it to the BP code for the LSTM structure that Yu, a senior colleague at Baidu, wrote for Kaldi (I have not looked at that code yet).

When reprinting, please retain the source: http://blog.csdn.net/hongmaodaxia/article/details/41809341

1. MLP (Multilayer Perceptron)

Before looking at the formulas, a few words about BP. The purpose of building a network is to fit a nonlinear function; in the end the whole network is one function whose parameters are the weights, so the main task of training is to find those weights. In general we define an objective function O and optimize it. Because O depends on the weights w, we use gradient descent to update them, so the ultimate goal throughout this article is to compute dO/dw. I will not go over the basics further. The following diagram is one I drew based on the DNN code in Kaldi. I am not sure whether it helps with what follows; I think the Kaldi code and the thesis agree here, except that the code treats the weights and the layer as separate components.

1.1 Forward propagation

Ordinary forward propagation is simple: for any layer other than the final softmax layer, the inputs x are multiplied by the weights and summed (plus a bias), and the result is passed through an activation function (e.g. sigmoid) to give the output. To keep things simple, the formulas are given directly, followed by an explanation of each step.

Here I is the number of input units, the input data is the vector x, and the subscript h denotes a unit in a hidden layer. The two formulas on the left give the output of one unit of the first hidden layer. Equation 3.1 can be seen as an intermediate step, but it is very important: the later derivation is built around the derivative of the objective function with respect to this quantity a (I guess this also makes the later derivative with respect to the weights w more convenient). The two formulas on the right give the output of the remaining hidden layers (this is a deep network, so there are multiple hidden layers).
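The two steps described above (weighted sum, then activation) can be sketched in C++ as follows. This is only a minimal illustration under my own assumptions: the names weighted_sums and activate and the w[i][h] layout are hypothetical, the activation is fixed to the sigmoid, and biases are left out as in the thesis.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Logistic sigmoid, one common choice for the activation theta.
double sigmoid(double a) { return 1.0 / (1.0 + std::exp(-a)); }

// Hypothetical helper: w[i][h] is the weight from input unit i to hidden unit h.
// Returns the pre-activation sums a_h = sum_i w[i][h] * x[i] (no bias term).
std::vector<double> weighted_sums(const std::vector<std::vector<double>>& w,
                                  const std::vector<double>& x) {
    assert(!w.empty() && w.size() == x.size());
    std::vector<double> a(w[0].size(), 0.0);
    for (std::size_t i = 0; i < x.size(); ++i) {
        assert(w[i].size() == a.size());
        for (std::size_t h = 0; h < a.size(); ++h) a[h] += w[i][h] * x[i];
    }
    return a;
}

// Applies the activation elementwise: b_h = theta(a_h).
std::vector<double> activate(const std::vector<double>& a) {
    std::vector<double> b(a.size());
    for (std::size_t h = 0; h < a.size(); ++h) b[h] = sigmoid(a[h]);
    return b;
}
```

For a deeper network, the same two calls would be repeated per layer, with the b of one layer used as the x of the next.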

Here θ is a nonlinear function (the activation function); the following two choices are common:
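The two usual choices are the hyperbolic tangent and the logistic sigmoid; I am assuming these are the two meant here. They are written out below together with their derivatives, since the derivative θ' is needed in the backward pass.

```latex
\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1},
\qquad
\sigma(x) = \frac{1}{1 + e^{-x}}

\tanh'(x) = 1 - \tanh^2(x),
\qquad
\sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)
```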

A nonlinear function is chosen because it lets the network fit nonlinear functions and find nonlinear classification boundaries, whereas a composition of linear functions is still linear; the nonlinearity also lets the network better "remember" the characteristics of the input data. (Going further would just be translating the thesis; I want to focus on the derivation, and in any case everyone's grasp of ANN basics is better than mine.)

After forward propagation reaches the last layer, the output is of course compared with the target data and the weights are then updated (this is supervised learning). Those familiar with the CD-DNN-HMM structure used in speech recognition know that the final output of the network is a probability for each phoneme or triphone, and that this probability is produced by the final softmax layer. This layer is special: in the earlier layers each node computes its output b from its own a, but here the a values of all nodes in the layer are computed first and then used together to obtain each b, a "probability". The formula is as follows:

The one on the left is the standard softmax function, which gives the probability of class C_k. In the formula on the right, z is the label (the correct result), a vector made up of a single 1 with 0s elsewhere, such as [0,1,0,0,0]. The 1 and the 0s are in fact probability values that sum to 1; this example can be read as five classes, with the input belonging to the second class.
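The softmax step can be sketched as follows. The function name softmax is my own, and subtracting the largest a_k before exponentiating is a standard stabilisation that I have added; it does not change the result and is not part of the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax over the output-layer sums a_k:
//   y_k = exp(a_k) / sum_k' exp(a_k')
// Every y_k depends on all a values of the layer, not only on its own a_k.
std::vector<double> softmax(const std::vector<double>& a) {
    assert(!a.empty());
    // Shifting by max(a) leaves y unchanged but keeps exp() from overflowing.
    const double a_max = *std::max_element(a.begin(), a.end());
    std::vector<double> y(a.size());
    double sum = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) {
        y[k] = std::exp(a[k] - a_max);
        sum += y[k];
    }
    for (std::size_t k = 0; k < y.size(); ++k) y[k] /= sum;
    return y;
}
```

The returned values are positive and sum to 1, so they can be compared directly against a one-hot label such as [0,1,0,0,0].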

1.2 Objective function O

Combining equation 2.11 (below, which computes the cross-entropy) with 3.13 gives 3.15, the objective function we need to minimize (the smaller the cross-entropy, the closer the model is to the true result).
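For reference, here is how I would write the objective for a single training pair (x, z), with y_k the softmax outputs and z the one-hot label over K classes. This is a reconstruction from the surrounding description, so treat the exact form as an assumption rather than a quotation from the thesis.

```latex
p(z \mid x) = \prod_{k=1}^{K} y_k^{\,z_k}

O = -\ln p(z \mid x) = -\sum_{k=1}^{K} z_k \ln y_k
```

Because z is one-hot, only the term for the correct class survives, so O is simply minus the log-probability the network assigns to the correct class.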

(Note: if O = f(x), we minimize f(x); if instead O = -f(x), minimizing O is the same as maximizing f(x).)

Earlier, the mean squared error was commonly used as the objective function.

Minimizing the objective function O requires gradient descent (strictly, stochastic gradient descent, because the weights are updated after each training example or mini-batch rather than after a pass over all the training data).

You need to understand gradient descent clearly, otherwise you will not know why things are done this way; see the blog post http://www.cnblogs.com/iamccme/archive/2013/05/14/3078418.html. Simply put, to update a weight w you need to know the update amount, and you can take it along the gradient (the derivative), which is the direction of steepest change: adding it increases the objective function, subtracting it decreases the objective function. To keep the step from being too large or too small, multiply it by a factor (the learning rate).
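A single update step can be sketched like this. The name sgd_step and the flat weight vector are assumptions made for illustration; the gradient dO/dw is taken as given here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One gradient-descent step: w <- w - learning_rate * dO/dw.
// Subtracting the gradient moves the weights towards a smaller objective O.
void sgd_step(std::vector<double>& w, const std::vector<double>& grad,
              double learning_rate) {
    assert(w.size() == grad.size());
    for (std::size_t i = 0; i < w.size(); ++i) {
        w[i] -= learning_rate * grad[i];
    }
}
```

In stochastic gradient descent this step is called once per training example or mini-batch, with grad computed from that example or batch only.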

1.3 Back propagation

This is the really difficult part. Roughly speaking, because the last layer is the special softmax layer, it is derived separately; the remaining layers are all handled the same way, by finding a recursive formula, which makes the formulas very concise.

Again, a_h is the weighted sum of the inputs x before it passes through the node, and b_h is the value of a_h after the node's activation function θ. b_h is then multiplied by the weights w_hk and summed to give a_k in the layer above. The author of the thesis takes dO/da as the core quantity and derives a recursive formula from it.

First, the derivation for the softmax layer.

From 3.15 and 3.13 it is not hard to arrive at 3.20 (remember that each unit's softmax value depends on all the a values in this layer).

For equation 3.22, interested readers can derive it themselves. My own derivation follows; I am not sure it is fully rigorous.
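One way to carry out this derivation is sketched below. It assumes the cross-entropy objective O = -Σ_k z_k ln y_k with softmax outputs y_k and a one-hot label z, and uses δ_{kk'} for the Kronecker delta (1 if k = k', else 0); it is my reconstruction and not necessarily the same route the original working took.

```latex
\frac{\partial O}{\partial y_{k'}} = -\frac{z_{k'}}{y_{k'}},
\qquad
\frac{\partial y_{k'}}{\partial a_k} = y_{k'}\,\delta_{kk'} - y_{k'}\,y_k

\frac{\partial O}{\partial a_k}
  = \sum_{k'} \frac{\partial O}{\partial y_{k'}}\,
              \frac{\partial y_{k'}}{\partial a_k}
  = -\sum_{k'} z_{k'}\bigl(\delta_{kk'} - y_k\bigr)
  = -z_k + y_k \sum_{k'} z_{k'}
  = y_k - z_k
```

The sum over k' is needed because every output y_{k'} depends on a_k through the softmax denominator, and the last step uses the fact that the entries of z sum to 1.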

Deriving the other layers recursively

I keep emphasizing dO/da because, when I derived the RNN and LSTM formulas, I came to fully appreciate how elegant it is. The author introduces it directly in equation 3.23 below:

δ_k is already known from 3.22. The thesis then obtains the recursive formula through a simple derivation. Let us go through it slowly, starting with the second-to-last layer, that is, the last hidden layer:

By analogy with 3.1 we have:

Explanation: the equation above quietly involves two layers: a_h produces b_h, and b_h produces a_k. Remember that a is only the result of the matrix multiplication and has not yet passed through the node, while b is the value of a after passing through the node.

Here is a picture of it:

Combining 3.24 with 3.2 and 3.9 gives 3.25, so every layer other than the final softmax layer can be computed recursively as in 3.36.
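The recursion can be sketched in code as below, writing δ for dO/da. The function names and the w[h][k] layout are hypothetical, and I assume sigmoid hidden units so that θ'(a_h) = b_h (1 - b_h); with another activation only that factor changes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Output layer (softmax + cross-entropy): delta_k = y_k - z_k.
std::vector<double> output_deltas(const std::vector<double>& y,
                                  const std::vector<double>& z) {
    assert(y.size() == z.size());
    std::vector<double> delta(y.size());
    for (std::size_t k = 0; k < y.size(); ++k) delta[k] = y[k] - z[k];
    return delta;
}

// Hidden layer: delta_h = theta'(a_h) * sum_k delta_next[k] * w[h][k],
// where w[h][k] is the weight from unit h to unit k of the layer above
// and b[h] is the sigmoid output of unit h from the forward pass.
std::vector<double> hidden_deltas(const std::vector<double>& b,
                                  const std::vector<std::vector<double>>& w,
                                  const std::vector<double>& delta_next) {
    assert(b.size() == w.size());
    std::vector<double> delta(b.size(), 0.0);
    for (std::size_t h = 0; h < b.size(); ++h) {
        assert(w[h].size() == delta_next.size());
        double sum = 0.0;
        for (std::size_t k = 0; k < delta_next.size(); ++k) {
            sum += delta_next[k] * w[h][k];
        }
        delta[h] = b[h] * (1.0 - b[h]) * sum;  // sigmoid derivative times sum
    }
    return delta;
}
```

Calling hidden_deltas repeatedly, each time passing the deltas of the layer above, walks the recursion from the output layer back to the first hidden layer.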

Although the recursive relationship has been found, some readers will ask: weren't we supposed to take the derivative with respect to the weights w? Why hasn't it appeared? My answer is that if we had started directly from the derivative with respect to w, the formulas above would be much more complicated. The LSTM formulas listed later in the thesis (which also gives no derivation, so you have to work it out yourself) are likewise summarized in terms of dO/da. Now look at the following to see how simple dO/dw becomes!
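Written out, with δ_j = ∂O/∂a_j and w_ij the weight from unit i to unit j, the step is a single application of the chain rule. The index convention here is my own assumption, chosen to match a_j = Σ_i w_ij b_i (for the first hidden layer, b_i is just the input x_i).

```latex
\frac{\partial O}{\partial w_{ij}}
  = \frac{\partial O}{\partial a_j}\,
    \frac{\partial a_j}{\partial w_{ij}}
  = \delta_j\, b_i
```

So once the δ values have been propagated backwards, each weight gradient is just the product of the δ at the receiving unit and the activation at the sending unit.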

2. RNN (Recurrent Neural Network)

In this network, each hidden layer also takes as input the output of the hidden layer at the previous time step. The original text is as follows:

2.1 Forward Propagation

Just look at the formulas; they are simple. Note that the superscript is the time t.

Of course, a for the output layer is the same as before.
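One time step of the forward pass can be sketched as follows. The names rnn_step, w_in and w_rec are hypothetical, tanh is assumed as the activation, and biases are again omitted.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<double>> Matrix;

// One time step of a single recurrent hidden layer:
//   a_h^t = sum_i w_in[i][h] * x_i^t + sum_h' w_rec[h'][h] * b_h'^(t-1)
//   b_h^t = tanh(a_h^t)
// b_prev holds the hidden outputs of the previous time step.
std::vector<double> rnn_step(const Matrix& w_in, const Matrix& w_rec,
                             const std::vector<double>& x_t,
                             const std::vector<double>& b_prev) {
    assert(w_in.size() == x_t.size() && w_rec.size() == b_prev.size());
    const std::size_t H = b_prev.size();
    std::vector<double> b_t(H);
    for (std::size_t h = 0; h < H; ++h) {
        double a = 0.0;
        for (std::size_t i = 0; i < x_t.size(); ++i) {
            assert(w_in[i].size() == H);
            a += w_in[i][h] * x_t[i];
        }
        for (std::size_t hp = 0; hp < H; ++hp) {
            assert(w_rec[hp].size() == H);
            a += w_rec[hp][h] * b_prev[hp];
        }
        b_t[h] = std::tanh(a);
    }
    return b_t;
}
```

To run over a whole sequence, start with b_prev set to all zeros and feed each returned b_t back in as b_prev for the next time step.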

2.2 Back propagation

First, the formulas:

There is nothing really difficult here, but one point deserves careful thought: look closely at where t+1 appears in formula 3.23. If your derivation is clear you can work it out. It can also be understood this way: in forward propagation, the output b_h of hidden layer h at time t also affects layer h at time t+1, so by the chain rule for composite functions the influence through time t+1 must be added as well. Put simply, treat layer h as feeding not only layer k but also layer h at time t+1; that is why δ(t+1) appears. Understanding this is very important, otherwise the LSTM formulas will be much harder to follow.

The derivative with respect to w is as simple as before.
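Under the same notation as in the MLP case (δ_j^t = ∂O/∂a_j^t, w_hk the hidden-to-output weights, w_hh' the recurrent weights), the backward pass described above can be written as follows. This is my reconstruction from the description, for a sequence of length T, so treat the exact form as an assumption.

```latex
\delta_h^{t} = \theta'\!\bigl(a_h^{t}\bigr)
  \left( \sum_{k} \delta_k^{t}\, w_{hk}
       + \sum_{h'} \delta_{h'}^{t+1}\, w_{hh'} \right),
\qquad
\delta_j^{T+1} = 0

\frac{\partial O}{\partial w_{ij}}
  = \sum_{t=1}^{T} \frac{\partial O}{\partial a_j^{t}}\,
                   \frac{\partial a_j^{t}}{\partial w_{ij}}
  = \sum_{t=1}^{T} \delta_j^{t}\, b_i^{t}
```

The second sum in δ_h^t is the extra contribution from time t+1, and the weight gradient is summed over time because the same weights are reused at every time step. For the recurrent weights, the sending activation is the one from the previous step, b_i^{t-1}.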

3. LSTM (Long Short-Term Memory)

(This picture is inaccurate in places: for example, h should be applied to the value coming out of the CEC; the output should at least indicate a loop back; and each large circle should have two incoming arrows, whereas three are drawn here.)

At this point readers may be disappointed, because I will not paste the structure diagram or the formulas here either. Instead, here are some tips for those who want to derive the LSTM formulas.

1) First study the structure diagram and match it against the definitions in the thesis, together with the forward pass. Note that the figure shows c=1, whereas in general a memory block can have multiple outputs. Do not look at the backward pass yet.

2) Always keep a close eye on the time indices t, t+1 and t-1.

3) Then start deriving the backward pass. Do not be intimidated; it really is no more than this.

In the derivation, the quantity being differentiated and the variable it is differentiated with respect to may be several layers apart rather than one, so it is important to be clear about which variable the final derivative is taken with respect to. In general, apart from cases like 3.24 that need a change of variable, you can follow the arrows in the diagram backwards and build up the result step by step with the chain rule. Of course, this is only my understanding and may not be right. Those who are good at math may say: does this even need deriving? It is obvious, just like 1+1=2.

I spent a long time on this material: the three derivations took about five days and filled many sheets of printed paper with notes. I can only pick out a few points here, to keep myself from forgetting and to offer some small help to newcomers.

4. Notes on the CTC derivation

Let us go straight to the author's formula 7.29. Combining formulas 7.26 and 7.27 with my detailed derivation of 3.22 above, you will find that only z needs to be changed and the result follows immediately.

(Summary: as in 7.26, whenever the derivative of O with respect to y has the form f/y, you can apply my derivation above directly and obtain the result by replacing z with f.)

When writing the program, note that the error terms are kept in log scale and converted back with exp when needed. The log-scale formulas given in the thesis can be derived directly.
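The identity behind log-scale addition is short enough to write out. Here a = ln x and b = ln y are the stored log values, and I assume a ≥ b (swap them otherwise).

```latex
\ln(x + y) = \ln\!\bigl(e^{a} + e^{b}\bigr)
           = \ln\!\Bigl(e^{a}\bigl(1 + e^{\,b-a}\bigr)\Bigr)
           = a + \ln\!\bigl(1 + e^{\,b-a}\bigr)
```

Factoring out the larger term keeps the exponent b - a at or below zero, so the exponential stays between 0 and 1 and cannot overflow, even when x and y themselves are far too small to represent directly.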

C++ implementation of log-scale addition

#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Returns log(x + y). If hasdone is false, a and b are the plain values x and y;
// if it is true, they are already log(x) and log(y).
double chj_lossfunc_log_add(double a, double b, bool hasdone = false) {
    if (!hasdone) {
        assert(a > 0 && b > 0);
        a = std::log(a);
        b = std::log(b);
    }
    if (b > a) std::swap(a, b);  // make sure a is the larger one
    return a + std::log1p(std::exp(b - a));
}

// Returns the log of the sum of all elements of vec. Note that it empties vec.
// A recursive version would apply a + log(1 + exp(rest - a)) to the log-sum of the rest.
double chj_lossfunc_log_add_mul(std::vector<double>& vec, bool hasdone = false) {
    assert(!vec.empty());
    double a = vec.back();
    vec.pop_back();
    if (!hasdone) {
        assert(a > 0);
        a = std::log(a);
    }
    for (std::vector<double>::iterator it = vec.begin(); it != vec.end(); ++it) {
        double b = *it;
        if (!hasdone) {
            assert(b > 0);
            b = std::log(b);
        }
        // a is already in log scale here, so it must not be converted again.
        a = chj_lossfunc_log_add(a, b, true);
    }
    vec.clear();
    return a;
}


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
