Backpropagation algorithm (process and formula derivation)


I. The origin of backpropagation

Before starting to study deep learning, we need a brief explanation of the artificial neural network (ANN) and the BP algorithm.
I will not go into the structure of an ANN here, since there is plenty of learning material online; the main goal is to clarify some terms:
input layer / input neurons, output layer / output neurons, hidden layer / hidden neurons, weights, biases, activation function.

Next we need to know how an ANN is trained. Assume the network has already been built. In every application problem (no matter how the network structure or the training method changes), our goal stays the same: the weights and biases of the network must eventually reach their best values, values that let us obtain the desired output from the input. So the problem becomes y = f(x, w, b), where x is the input, w is the weight and b is the bias (each of these quantities can be multiple, e.g. several inputs x1, x2, x3, ...). Finally, f() is our network; it can certainly be represented by a function, but we do not need to know the explicit form of f(x). We tend to think that a function must be written out explicitly, like f(x) = sin(x), but please reject this misconception. We only need to know that a particular set of w and b determines a function f(x) that lets us compute a reasonable y from the input.

The final goal is to try different values of w and b so that in the end y = f(x) is as close as possible to the value t that we want to obtain.

But the problem is still quite complicated, so let us simplify it: make the value of (y - t)^2 as small as possible. The original question is thus converted into making C(w, b) = (f(x, w, b) - t)^2 take a value as small as possible. This is not a hard problem in principle: no matter how complex the function is, if C is decreased to a value that can no longer be decreased, then the minimum has been reached (assuming we do not consider local minima).

How do we make it decrease? Mathematics tells us that for a function of several variables f(a, b, c, d, ...), we can compute a vector called the gradient of the function. It is important to note that the gradient is a direction vector: it points in the direction in which the function changes fastest at that point (this theorem is not explained in detail here; it can be found in any calculus textbook). So the change ΔC of C(w, b) can be expressed as

ΔC ≈ (∂C/∂w) Δw + (∂C/∂b) Δb = ∇C · Δv,

where Δv = (Δw, Δb) is a small change at the current point. We can specify these small changes arbitrarily; we only need to guarantee that ΔC < 0. But to descend faster, why not choose the change along the (negative) gradient direction?

In fact, that is exactly the idea of gradient descent: we make sure that C keeps decreasing, and for w (and likewise for b) each update is simply

w → w' = w - η ∂C/∂w,   b → b' = b - η ∂C/∂b,

where η is a small positive learning rate.

OK, at this point it seems that all the problems have been solved. Let us rearrange our thoughts; we have transformed the problem through quite a few steps:
updating the network's weights and biases ==> making f(x, w, b) approximate t ==> minimizing C(w, b) = (f(x, w, b) - t)^2 ==> decreasing C(w, b) by gradient descent ==> when the minimum is reached, the network is optimal
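As a tiny illustration of this chain of reformulations, here is a minimal sketch of gradient descent on a single sigmoid neuron y = f(x, w, b) = sigmoid(w*x + b). The input, target, starting values and learning rate are illustrative assumptions, not taken from the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, t = 1.5, 0.8          # one input and the value we want to reach (assumed)
w, b = 0.1, 0.0          # initial weight and bias (assumed)
eta = 0.5                # learning rate

for step in range(2000):
    y = sigmoid(w * x + b)               # y = f(x, w, b)
    C = (y - t) ** 2                     # C(w, b) = (f(x, w, b) - t)^2
    # gradient of C with respect to w and b (chain rule, sigmoid' = y * (1 - y))
    dC_dw = 2 * (y - t) * y * (1 - y) * x
    dC_db = 2 * (y - t) * y * (1 - y)
    # move against the gradient so that C keeps decreasing
    w -= eta * dC_dw
    b -= eta * dC_db

print(C, y)   # C is now very small and y is close to t = 0.8
```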

But do not forget one thing!! The derivation above is based on the premise that we already know the gradient at the current point in advance. That, however, is not the case!!
This problem plagued neural-network researchers for many years. In 1969 M. Minsky and S. Papert published the book "Perceptrons", which analysed single-layer neural networks in depth and proved mathematically that the capability of such networks is limited, and that they cannot even solve a simple logic problem such as "XOR". They also found that many patterns cannot be trained on a single-layer network, that no efficient low-complexity algorithm existed for multilayer networks, and they even concluded that neural networks could not handle nonlinear problems. In 1974, however, Paul Werbos first gave a learning algorithm for training general networks: backpropagation. This algorithm can efficiently compute the gradient needed at each iteration, making the derivation above realizable!!
Unfortunately, hardly anyone in the artificial neural network community was aware of the algorithm Paul had proposed. It was not until the mid-1980s that the BP algorithm was rediscovered independently by David Rumelhart, Geoffrey Hinton and Ronald Williams, by David Parker, and by Yann LeCun; it then gained wide attention and triggered the second upsurge in the field of artificial neural networks.

II. Introduction to the principle

As mentioned above, so-called backpropagation is simply the method for computing the gradient. Before introducing its principle, note that many articles dive straight into the formulas, which makes them hard to understand. So let us first look at an answer by an expert on Zhihu.

Source: Zhihu, https://www.zhihu.com/question/27239198?rf=24827633

The answer considers the computational graph e = (a + b) * (b + 1), with intermediate nodes c = a + b and d = b + 1 (the figure is omitted here). Assuming the inputs a = 2 and b = 1, it is easy to find the partial derivatives between adjacent nodes.

Using the chain rule:

∂e/∂a = ∂e/∂c · ∂c/∂a

and

∂e/∂b = ∂e/∂c · ∂c/∂b + ∂e/∂d · ∂d/∂b.

The value of ∂e/∂a equals the product of the partial derivatives along the path from a to e, and the value of ∂e/∂b equals the product of the partial derivatives along path 1 (b-c-e) from b to e plus the product of the partial derivatives along path 2 (b-d-e). In other words, for an upper node p and a lower node q, find all the paths from node q to node p, take the product of all the partial derivatives along each path, and then add up these "products" over all the paths to obtain the value of ∂p/∂q.

In this example the partial derivatives are easy to obtain because we already know the function computed by the network, e = (a + b) * (b + 1); it is a network with no weights involved, and the relation between input and output is known. In practice we only know the relationship between the error and the output, namely C = (y - t)^2, and tens of thousands of weights and biases take part in the derivative computation. So let us change the point of view: can we compute the partial derivatives starting from the output?

Re-use the relations above. Node e sends ∂e/∂c = 2 down to node c, where it is stacked up; it sends ∂e/∂d = 3 down to node d, where it is stacked up. The second layer is now finished; take the total stacked at each node and continue sending it down one more layer. Node c sends 2*1 to a, which stacks it up; node c sends 2*1 to b, which stacks it up; node d sends 3*1 to b, which stacks it up. The third layer is now finished: node a has stacked up a total of 2, and node b has stacked up a total of 2*1 + 3*1 = 5, that is, the partial derivative of the top node e with respect to b is 5. To summarize briefly: start from the top node e and process layer by layer. For every child node of e in the next layer, multiply 1 by the partial derivative along the edge from e to that child, and "stack" the result in that child node. After the layer below e has been processed, every node in that second layer has "stacked" some values; summing all the values stacked in a node gives the partial derivative of e with respect to that node. Then take the nodes of the second layer as the new starting vertices, set their initial values to the partial derivatives of e with respect to them, and repeat the above propagation layer by layer; in this way the partial derivative of e with respect to every node of every layer is obtained.
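The "stacking" procedure just described can be written out directly. Below is a minimal Python sketch of it for the graph e = (a + b) * (b + 1); the node names c and d and the dictionary layout are illustrative choices, not from the original answer.

```python
# Minimal sketch of the "stacking" procedure for the graph e = (a + b) * (b + 1),
# with intermediate nodes c = a + b and d = b + 1.

a, b = 2.0, 1.0
c, d = a + b, b + 1.0           # c = 3, d = 2
e = c * d                       # e = 6

# local partial derivatives along each edge of the graph
local = {
    ("e", "c"): d,              # de/dc = d = 2
    ("e", "d"): c,              # de/dd = c = 3
    ("c", "a"): 1.0,            # dc/da = 1
    ("c", "b"): 1.0,            # dc/db = 1
    ("d", "b"): 1.0,            # dd/db = 1
}
children = {"e": ["c", "d"], "c": ["a", "b"], "d": ["b"]}

# reverse sweep: start from e with value 1 and "stack" products layer by layer
stacked = {"e": 1.0, "c": 0.0, "d": 0.0, "a": 0.0, "b": 0.0}
for node in ["e", "c", "d"]:                 # process layers from top to bottom
    for child in children.get(node, []):
        stacked[child] += stacked[node] * local[(node, child)]

print(stacked["a"], stacked["b"])            # 2.0 5.0, matching the text
```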

III. A concrete example with weights

Now let us take the weights into account. Here is a good example that will help us understand backpropagation. Source: Charlotte77's blog, http://www.cnblogs.com/charlotte77/p/5629865.html

Suppose we have a network like this:

The first layer is the input layer and contains two neurons, i1 and i2, plus the intercept term b1; the second layer is the hidden layer and contains two neurons, h1 and h2, plus the intercept term b2; the third layer is the output layer with neurons o1 and o2. Each edge is labelled with a weight wi, the weight of the connection between the layers. We take the sigmoid function as the default activation function.

Now give them initial values, for example:

the input data: i1 = 0.05, i2 = 0.10;

the target output data: o1 = 0.01, o2 = 0.99;

the initial weights: w1 = 0.15, w2 = 0.20, w3 = 0.25, w4 = 0.30;

w5 = 0.40, w6 = 0.45, w7 = 0.50, w8 = 0.55.

Goal: given the input data i1, i2 (0.05 and 0.10), make the output as close as possible to the target output o1, o2 (0.01 and 0.99).

Step 1 Forward Propagation

1. Input layer----> Hidden layer:

Calculate the weighted input of the neuron h1:

net_h1 = w1 * i1 + w2 * i2 + b1 * 1

The output of the neuron h1, out_h1 (here the activation function is the sigmoid function):

out_h1 = 1 / (1 + e^(-net_h1))

In the same way, the output out_h2 of the neuron h2 can be calculated.

2. Hidden layer ----> Output layer:

Calculate the values of the output-layer neurons o1 and o2:

net_o1 = w5 * out_h1 + w6 * out_h2 + b2 * 1,   out_o1 = 1 / (1 + e^(-net_o1)),

and similarly for net_o2 and out_o2.

  

So the forward propagation process is over. We obtain the output values [0.75136079, 0.772928465], which are still very far from the actual (target) values [0.01, 0.99]. Now we backpropagate the error, update the weights, and recompute the output.
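The following is a minimal sketch of this forward pass in Python. The bias values b1 = 0.35 and b2 = 0.60 are not shown on this page; they are assumptions taken from the original Charlotte77 walkthrough, and with them the code reproduces the outputs quoted above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# values given in the text
i1, i2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60   # assumed from the source blog; not shown on this page

# input layer -> hidden layer
net_h1 = w1 * i1 + w2 * i2 + b1 * 1
net_h2 = w3 * i1 + w4 * i2 + b1 * 1
out_h1, out_h2 = sigmoid(net_h1), sigmoid(net_h2)

# hidden layer -> output layer
net_o1 = w5 * out_h1 + w6 * out_h2 + b2 * 1
net_o2 = w7 * out_h1 + w8 * out_h2 + b2 * 1
out_o1, out_o2 = sigmoid(net_o1), sigmoid(net_o2)

print(out_o1, out_o2)   # approximately 0.7514 and 0.7729
```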

Step 2 Backpropagation

1. Calculate the total error

Total error (squared error):

E_total = Σ (1/2) (target - output)^2

However, there are two outputs, so we calculate the errors of o1 and o2 separately; the total error is the sum of the two:

E_o1 = (1/2) (target_o1 - out_o1)^2,   E_o2 = (1/2) (target_o2 - out_o2)^2,   E_total = E_o1 + E_o2.
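As a quick arithmetic check using the forward-pass output values quoted above (rounded to four decimal places):

\[
E_{o1} = \tfrac{1}{2}(0.01 - 0.75136079)^2 \approx 0.2748, \qquad
E_{o2} = \tfrac{1}{2}(0.99 - 0.772928465)^2 \approx 0.0236,
\]
\[
E_{total} = E_{o1} + E_{o2} \approx 0.2984,
\]

which agrees with the total error of 0.298371109 quoted at the end of this example.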

2. Weight update for the hidden layer ----> output layer:

Taking the weight parameter w5 as an example: if we want to know how much influence w5 has on the total error, we take the partial derivative of the total error with respect to w5 using the chain rule:

∂E_total/∂w5 = ∂E_total/∂out_o1 · ∂out_o1/∂net_o1 · ∂net_o1/∂w5

The figure below shows more intuitively how the error is propagated backwards:


Now let us calculate each factor separately.

Calculate ∂E_total/∂out_o1: since E_total = (1/2)(target_o1 - out_o1)^2 + (1/2)(target_o2 - out_o2)^2, we get

∂E_total/∂out_o1 = -(target_o1 - out_o1) = out_o1 - target_o1.

Calculate ∂out_o1/∂net_o1: since out_o1 = 1 / (1 + e^(-net_o1)), we get

∂out_o1/∂net_o1 = out_o1 (1 - out_o1).

(This step is simply the derivative of the sigmoid function, which is relatively easy; you can derive it yourself.)

Calculate ∂net_o1/∂w5: since net_o1 = w5 * out_h1 + w6 * out_h2 + b2 * 1, we get

∂net_o1/∂w5 = out_h1.

Finally, multiply the three together:

∂E_total/∂w5 = (out_o1 - target_o1) · out_o1 (1 - out_o1) · out_h1.

This gives us the partial derivative of the total error E_total with respect to w5.

Looking back at the formula above, we find:

∂E_total/∂w5 = (out_o1 - target_o1) · out_o1 (1 - out_o1) · out_h1.

For ease of expression, we use δ_o1 to denote the error of the output-layer neuron o1:

δ_o1 = ∂E_total/∂out_o1 · ∂out_o1/∂net_o1 = (out_o1 - target_o1) · out_o1 (1 - out_o1).

Thus, the partial derivative of the total error E_total with respect to w5 can be written as:

∂E_total/∂w5 = δ_o1 · out_h1.

If the output-layer error is defined with the opposite sign, δ_o1 = -(out_o1 - target_o1) · out_o1 (1 - out_o1), the same result can also be written as:

∂E_total/∂w5 = -δ_o1 · out_h1.

Finally, let us update the value of w5:

w5' = w5 - η · ∂E_total/∂w5

(where η is the learning rate; here we take η = 0.5).

Similarly, we can update w6, w7 and w8.
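Plugging in the numbers quoted above (out_o1 = 0.75136079, target_o1 = 0.01) together with out_h1 ≈ 0.5933 from the forward-pass sketch earlier (a value that relies on the assumed bias b1 = 0.35), the w5 update works out roughly as:

\[
\delta_{o1} = (0.75136079 - 0.01)\times 0.75136079 \times (1 - 0.75136079) \approx 0.1385,
\]
\[
\frac{\partial E_{total}}{\partial w_5} = \delta_{o1}\cdot out_{h1} \approx 0.1385 \times 0.5933 \approx 0.0822,
\qquad
w_5' = 0.40 - 0.5 \times 0.0822 \approx 0.359.
\]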

3. Weight update for the input layer ----> hidden layer:

The method is in fact similar to the above, but one thing changes: when computing the partial derivative of the total error with respect to w5 above, the path was out(o1) ----> net(o1) ----> w5, whereas for the weight update between the input layer and the hidden layer the path is out(h1) ----> net(h1) ----> w1, and out(h1) receives error from both E(o1) and E(o2), so both of them must be counted here.

Calculate ∂E_total/∂w1 = ∂E_total/∂out_h1 · ∂out_h1/∂net_h1 · ∂net_h1/∂w1, starting with ∂E_total/∂out_h1 = ∂E_o1/∂out_h1 + ∂E_o2/∂out_h1.

First calculate ∂E_o1/∂out_h1:

∂E_o1/∂out_h1 = ∂E_o1/∂net_o1 · ∂net_o1/∂out_h1 = δ_o1 · w5.

In the same way, calculate ∂E_o2/∂out_h1 (where δ_o2 is defined for neuron o2 in the same way as δ_o1):

∂E_o2/∂out_h1 = ∂E_o2/∂net_o2 · ∂net_o2/∂out_h1 = δ_o2 · w7.

Adding the two gives the total:

∂E_total/∂out_h1 = δ_o1 · w5 + δ_o2 · w7.

Then calculate ∂out_h1/∂net_h1 = out_h1 (1 - out_h1) (the sigmoid derivative again).

Then calculate ∂net_h1/∂w1: since net_h1 = w1 * i1 + w2 * i2 + b1 * 1, we get ∂net_h1/∂w1 = i1.

Finally, multiply the three together:

∂E_total/∂w1 = (δ_o1 · w5 + δ_o2 · w7) · out_h1 (1 - out_h1) · i1.

To simplify the formula, we use δ_h1 to denote the error of the hidden-layer neuron h1:

δ_h1 = (δ_o1 · w5 + δ_o2 · w7) · out_h1 (1 - out_h1),   so that   ∂E_total/∂w1 = δ_h1 · i1.

Finally, update the weight w1:

w1' = w1 - η · ∂E_total/∂w1.

Similarly, the weights w2, w3 and w4 can be updated.

This completes the error backpropagation method. Finally, we recompute with the updated weights and keep iterating. In this example, after the first iteration the total error E_total drops from 0.298371109 to 0.291027924. After 10,000 iterations the total error is 0.000035085 and the output is [0.015912196, 0.984065734] (the target output is [0.01, 0.99]), which shows that the method works well.
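Below is a minimal Python sketch of the whole example: forward pass, backpropagation of δ_o and δ_h, and gradient-descent updates with η = 0.5. As before, the bias values b1 = 0.35 and b2 = 0.60 are assumptions taken from the source blog (they are not shown on this page), and the biases are left unchanged during training, as in the worked example above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# values from the text; b1 and b2 are assumed from the source blog (not shown here)
i1, i2 = 0.05, 0.10
t1, t2 = 0.01, 0.99
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
eta = 0.5                      # learning rate used in the text

for step in range(10000):
    # forward propagation
    out_h1 = sigmoid(w1 * i1 + w2 * i2 + b1)
    out_h2 = sigmoid(w3 * i1 + w4 * i2 + b1)
    out_o1 = sigmoid(w5 * out_h1 + w6 * out_h2 + b2)
    out_o2 = sigmoid(w7 * out_h1 + w8 * out_h2 + b2)
    E_total = 0.5 * (t1 - out_o1) ** 2 + 0.5 * (t2 - out_o2) ** 2

    # output-layer errors (delta_o = dE/dnet_o)
    d_o1 = (out_o1 - t1) * out_o1 * (1 - out_o1)
    d_o2 = (out_o2 - t2) * out_o2 * (1 - out_o2)
    # hidden-layer errors (delta_h = dE/dnet_h), each collecting both output errors
    d_h1 = (d_o1 * w5 + d_o2 * w7) * out_h1 * (1 - out_h1)
    d_h2 = (d_o1 * w6 + d_o2 * w8) * out_h2 * (1 - out_h2)

    # gradient-descent updates (biases are left unchanged, as in the worked example)
    w5, w6 = w5 - eta * d_o1 * out_h1, w6 - eta * d_o1 * out_h2
    w7, w8 = w7 - eta * d_o2 * out_h1, w8 - eta * d_o2 * out_h2
    w1, w2 = w1 - eta * d_h1 * i1, w2 - eta * d_h1 * i2
    w3, w4 = w3 - eta * d_h2 * i1, w4 - eta * d_h2 * i2

    if step == 0:
        print(E_total)         # about 0.2984 before any update
print(out_o1, out_o2)          # close to [0.0159, 0.9841] after 10,000 iterations
```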

IV. The most general case

Consider a three-layer artificial neural network, where layer 1 to layer 3 are the input layer, the hidden layer and the output layer respectively. First define some variables:

w^l_jk denotes the weight connecting the k-th neuron in layer l-1 to the j-th neuron in layer l;
b^l_j denotes the bias of the j-th neuron in layer l;
z^l_j denotes the input of the j-th neuron in layer l, i.e. z^l_j = Σ_k w^l_jk · a^(l-1)_k + b^l_j;
a^l_j denotes the output of the j-th neuron in layer l, i.e. a^l_j = σ(z^l_j), where σ is the activation function;
L denotes the maximum number of layers of the neural network and can also be read as the index of the output layer.

The error of the j-th neuron in layer l is defined as δ^l_j = ∂C/∂z^l_j, where C is the cost function (which measures the error between the actual value and the predicted value). With these definitions, backpropagation rests on the following four equations:

(BP1) δ^L = ∇_a C ⊙ σ'(z^L),  i.e.  δ^L_j = (∂C/∂a^L_j) · σ'(z^L_j);
(BP2) δ^l = ((w^(l+1))^T δ^(l+1)) ⊙ σ'(z^l);
(BP3) ∂C/∂b^l_j = δ^l_j;
(BP4) ∂C/∂w^l_jk = a^(l-1)_k · δ^l_j.

In the above four equations, the first equation is not difficult to understand: it is simply the partial derivative of the cost function C with respect to the output, combined with the derivative of the activation function.

The only difficult one is the second equation, which computes δ^l from the error δ^(l+1) of the next layer. To prove it, we want to re-express δ^l_j = ∂C/∂z^l_j in terms of δ^(l+1)_k = ∂C/∂z^(l+1)_k. Applying the chain rule:

δ^l_j = ∂C/∂z^l_j = Σ_k (∂C/∂z^(l+1)_k) · (∂z^(l+1)_k/∂z^l_j) = Σ_k (∂z^(l+1)_k/∂z^l_j) · δ^(l+1)_k.

In the last step we swapped the two factors on the right and substituted the definition of δ^(l+1)_k. To evaluate the first factor, note that

z^(l+1)_k = Σ_j w^(l+1)_kj · a^l_j + b^(l+1)_k = Σ_j w^(l+1)_kj · σ(z^l_j) + b^(l+1)_k.

Differentiating, we get

∂z^(l+1)_k/∂z^l_j = w^(l+1)_kj · σ'(z^l_j).

Substituting this back, we get

δ^l_j = Σ_k w^(l+1)_kj · δ^(l+1)_k · σ'(z^l_j),

which is exactly (BP2) written in component form. The remaining two equations are not very difficult once the proof of (BP2) is understood; they are left to the reader.

V. Summary and proof of the backpropagation algorithm

The backpropagation algorithm is currently the most common and effective algorithm for training artificial neural networks (ANN). Its main idea is:

(1) feed the training data to the input layer of the ANN, pass it through the hidden layer(s) and finally reach the output layer, which produces a result; this is the forward propagation phase of the ANN;
(2) because the output of the ANN differs from the actual result, compute the error between the estimate and the actual value and propagate this error backwards, from the output layer through the hidden layer(s) until it reaches the input layer;
(3) during backpropagation, adjust the values of the parameters according to the error; iterate the above process until convergence.

The idea of the backpropagation algorithm is easy to understand, but the concrete formulas have to be derived step by step, so this section focuses on the derivation.

1. Variable definitions

Consider a three-layer artificial neural network, where layer 1 to layer 3 are the input layer, the hidden layer and the output layer respectively. As in section IV, w^l_jk is the weight connecting the k-th neuron in layer l-1 to the j-th neuron in layer l, b^l_j is the bias of the j-th neuron in layer l, z^l_j = Σ_k w^l_jk · a^(l-1)_k + b^l_j is the input of the j-th neuron in layer l, a^l_j = σ(z^l_j) is its output, and σ denotes the activation function.

2. Cost function

The cost function is used to calculate the error between the ANN output value and the actual value. A common choice is the quadratic cost function

C = (1/2n) Σ_x || y(x) - a^L(x) ||^2,

where x denotes an input sample, y(x) the actual classification, a^L(x) the predicted output, and L the maximum number of layers of the neural network.

3. Formulas and their derivation

This part lists the four formulas used by the backpropagation algorithm and derives them. If you do not want to follow the derivation, skip directly to the algorithm steps in part 4. First, the error produced at the j-th neuron of layer l is defined as δ^l_j = ∂C/∂z^l_j. This article takes a single input sample as an example, in which case the cost function is written as C = (1/2) Σ_j (y_j - a^L_j)^2.

Equation 1 (the error produced by the last layer of the neural network):

δ^L = ∇_a C ⊙ σ'(z^L),

where ⊙ denotes the Hadamard product, the element-wise multiplication of matrices or vectors. Derivation: δ^L_j = ∂C/∂z^L_j = Σ_k (∂C/∂a^L_k)(∂a^L_k/∂z^L_j) = (∂C/∂a^L_j) σ'(z^L_j), because a^L_k depends on z^L_j only when k = j.

Equation 2 (from back to front, the error produced by each layer of the network):

δ^l = ((w^(l+1))^T δ^(l+1)) ⊙ σ'(z^l).

Derivation: see the proof of (BP2) in section IV above.

Equation 3 (the gradient of the weights):

∂C/∂w^l_jk = a^(l-1)_k · δ^l_j.

Derivation: ∂C/∂w^l_jk = (∂C/∂z^l_j)(∂z^l_j/∂w^l_jk) = δ^l_j · a^(l-1)_k.

Equation 4 (the gradient of the biases):

∂C/∂b^l_j = δ^l_j.

Derivation: ∂C/∂b^l_j = (∂C/∂z^l_j)(∂z^l_j/∂b^l_j) = δ^l_j · 1 = δ^l_j.

4. Pseudo-code of the backpropagation algorithm
    • Input the training set.

    • For each sample x in the training set, set the activation corresponding to the input layer: a^(x,1) = x.
      • Forward propagation: for l = 2, 3, ..., L compute z^(x,l) = w^l a^(x,l-1) + b^l and a^(x,l) = σ(z^(x,l)).
      • Calculate the error produced by the output layer: δ^(x,L) = ∇_a C_x ⊙ σ'(z^(x,L)).
      • Backpropagate the error: for l = L-1, L-2, ..., 2 compute δ^(x,l) = ((w^(l+1))^T δ^(x,l+1)) ⊙ σ'(z^(x,l)).
    • Use gradient descent to train the parameters: for l = L, L-1, ..., 2 update
      w^l → w^l - (η/m) Σ_x δ^(x,l) (a^(x,l-1))^T   and   b^l → b^l - (η/m) Σ_x δ^(x,l),
      where m is the number of training samples in the batch and η is the learning rate.
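Below is a minimal NumPy sketch of this pseudo-code for a fully connected network with sigmoid activations and quadratic cost. The class name, the layer sizes and the finite-difference check at the end are illustrative assumptions, not part of the original text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

class Network:
    """Fully connected network; sizes = [n_input, n_hidden, ..., n_output]."""

    def __init__(self, sizes):
        self.weights = [np.random.randn(m, n) for n, m in zip(sizes[:-1], sizes[1:])]
        self.biases = [np.random.randn(m, 1) for m in sizes[1:]]

    def feedforward(self, a):
        for w, b in zip(self.weights, self.biases):
            a = sigmoid(w @ a + b)
        return a

    def backprop(self, x, y):
        """Return (dC/dw, dC/db) for one sample, using equations 1-4 above."""
        a, activations, zs = x, [x], []
        for w, b in zip(self.weights, self.biases):   # forward propagation
            z = w @ a + b
            zs.append(z)
            a = sigmoid(z)
            activations.append(a)
        grad_w = [None] * len(self.weights)
        grad_b = [None] * len(self.biases)
        # Equation 1: error of the output layer (quadratic cost => grad_a C = a^L - y)
        delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
        grad_w[-1] = delta @ activations[-2].T         # Equation 3
        grad_b[-1] = delta                             # Equation 4
        # Equation 2: propagate the error backwards through the earlier layers
        for l in range(2, len(self.weights) + 1):
            delta = (self.weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
            grad_w[-l] = delta @ activations[-l - 1].T
            grad_b[-l] = delta
        return grad_w, grad_b

    def train_step(self, batch, eta):
        """One gradient-descent update over a list of (x, y) samples."""
        sum_w = [np.zeros_like(w) for w in self.weights]
        sum_b = [np.zeros_like(b) for b in self.biases]
        for x, y in batch:
            gw, gb = self.backprop(x, y)
            sum_w = [s + g for s, g in zip(sum_w, gw)]
            sum_b = [s + g for s, g in zip(sum_b, gb)]
        m = len(batch)
        self.weights = [w - (eta / m) * s for w, s in zip(self.weights, sum_w)]
        self.biases = [b - (eta / m) * s for b, s in zip(self.biases, sum_b)]

# quick check: the analytic gradient for one weight matches a finite-difference estimate
np.random.seed(0)
net = Network([2, 3, 2])
x = np.array([[0.05], [0.10]])
y = np.array([[0.01], [0.99]])
gw, _ = net.backprop(x, y)
eps = 1e-6
net.weights[0][0, 0] += eps
c_plus = 0.5 * np.sum((net.feedforward(x) - y) ** 2)
net.weights[0][0, 0] -= 2 * eps
c_minus = 0.5 * np.sum((net.feedforward(x) - y) ** 2)
print(gw[0][0, 0], (c_plus - c_minus) / (2 * eps))   # the two values should agree
```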
 

