Backpropagation algorithm
The previous section covered how forward propagation is computed: given x, each node computes its output in turn until we obtain y.
The question now is how to determine the weights of each neuron, i.e. how to train a neural network.
In traditional machine learning we use the gradient descent algorithm to update the weights:
\[\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)\]
That is, the output values are used to construct a loss function, the gradient descent update rule is applied with the weights as parameters, and the loss function is minimized.
In fact, the BP algorithm (from here on, backpropagation is abbreviated as the BP algorithm, out of personal habit) does something similar, but with some differences, and those differences are determined by the topology of the neural network.
We can compare the forward computation of a neural network with that of a traditional machine learning algorithm:
Neural Networks:
\[\begin{align}\nonumber &\vec{a}_1=f(W_1\cdot\vec{x})\\
\nonumber &\vec{a}_2=f(W_2\cdot\vec{a}_1)\\
\nonumber &\vec{a}_3=f(W_3\cdot\vec{a}_2)\\
\nonumber &\vec{y}=f(W_4\cdot\vec{a}_3)
\end{align}\]
Perceptron:
\[y=h(x)=\sum_{i=0}^n\theta_i x_i=\theta^T x\]
From the point of view of topology, the perceptron can in fact be regarded as a single-layer neural network. In this single-layer structure, deriving the weight update is simple: the partial derivatives can be taken directly. In a multi-layered neural network, however, which is a complex composite function, a derivative is needed at every node. The most obvious tool for differentiating a composite function is the chain rule, but the chain rule has some problems.
We first use a simple example to illustrate the chain rule, and why forward differentiation is inefficient; this then leads to reverse differentiation, which explains the backpropagation algorithm.
A simple example
Consider a slightly more complex expression, \(e=(a+b)\cdot(b+1)\). We introduce two intermediate variables, \(c=a+b\) and \(d=b+1\), and set \(a=2\), \(b=1\).
Following the chain rule, the partial derivative of \(e\) with respect to \(b\) is:
\[\frac{\partial e}{\partial b}=\frac{\partial e}{\partial c}\cdot\frac{\partial c}{\partial b}+\frac{\partial e}{\partial d}\cdot\frac{\partial d}{\partial b}=d\cdot 1+c\cdot 1=2b+a+1=5\]
The partial derivative of \(e\) with respect to \(a\) is:
\[\frac{\partial e}{\partial a}=\frac{\partial e}{\partial c}\cdot\frac{\partial c}{\partial a}=d\cdot 1=b+1=2\]
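These two results can be checked numerically with central differences — a minimal sketch (the helper `numeric_partial` is mine, not from the text):

```python
# Numerical check of the chain-rule results above:
# e = (a + b) * (b + 1), with a = 2, b = 1.

def e(a, b):
    c = a + b          # intermediate variable c
    d = b + 1          # intermediate variable d
    return c * d

def numeric_partial(f, args, idx, h=1e-6):
    """Central-difference approximation of df/d(args[idx])."""
    lo = list(args)
    hi = list(args)
    lo[idx] -= h
    hi[idx] += h
    return (f(*hi) - f(*lo)) / (2 * h)

a, b = 2.0, 1.0
de_da = numeric_partial(e, (a, b), 0)   # expected b + 1 = 2
de_db = numeric_partial(e, (a, b), 1)   # expected 2b + a + 1 = 5
print(round(de_da, 4), round(de_db, 4))
```

The finite-difference values agree with the analytic partial derivatives derived above.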
Understanding via the computation graph
We can also use the computation graph to understand this process.
The original formula represented as a computation graph:
Forward computation:
The partial-derivative values computed separately at each node:
So how do we use this computation graph to understand how the partial derivatives of the composite function are obtained?
For \(\frac{\partial e}{\partial a}\), the path on the computation graph is a→c→e; the values on the nodes along a single path are multiplied together: \(\frac{\partial e}{\partial a}=1\cdot 2\).
For \(\frac{\partial e}{\partial b}\), the only difference is that the partial derivative flows through two paths, b→c→e and b→d→e; the products along different paths are then added together: \(\frac{\partial e}{\partial b}=1\cdot 2+1\cdot 3\).
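The "multiply along a path, sum over paths" rule can be sketched in a few lines. The edge values below are the local derivatives at \(a=2\), \(b=1\) (so \(c=3\), \(d=2\)); the graph representation and the `partial` helper are my own illustration:

```python
# Edges of the computation graph, labelled with local partial derivatives:
#   a->c: dc/da=1, b->c: dc/db=1, b->d: dd/db=1, c->e: de/dc=d=2, d->e: de/dd=c=3
edges = {
    ("a", "c"): 1.0, ("b", "c"): 1.0, ("b", "d"): 1.0,
    ("c", "e"): 2.0, ("d", "e"): 3.0,
}

def partial(src, dst):
    """Sum over all paths src->dst of the product of edge derivatives."""
    if src == dst:
        return 1.0
    total = 0.0
    for (u, v), val in edges.items():
        if u == src:
            total += val * partial(v, dst)   # multiply along the path
    return total                             # sum over distinct paths

print(partial("a", "e"))   # one path  a->c->e: 1*2
print(partial("b", "e"))   # two paths b->c->e and b->d->e: 1*2 + 1*3
```

Note that this enumerates every path explicitly, which is exactly the redundancy discussed next.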
Disadvantages
The partial derivatives of a composite function can be obtained either by the forward chain rule or by traversing the paths of the computation graph. So why do we need the BP algorithm?
A neural network is often a very large network, so computational cost matters, and the method described above has a fatal flaw in this respect. When we differentiated \(e\) with respect to \(a\) and with respect to \(b\), the path c→e was used twice; that is, \(\frac{\partial e}{\partial c}\) was computed twice.
In other words: forward chain-rule differentiation contains computational redundancy.
Reverse differentiation
Take the diagram as an example. Forward differentiation proceeds from the bottom up; the figure shows the partial derivative of each node with respect to b, so one traversal of all nodes yields the partial derivatives with respect to only a single input.
Reverse differentiation proceeds from the top down; once the traversal is complete, the partial derivatives of the output with respect to all inputs are obtained at once.
However, there is no free lunch: backpropagation beats forward propagation in time complexity, but before the backward pass it must first run and store the forward computation. In effect, backpropagation trades space complexity for time complexity.
Analysis of algorithmic complexity
For a complexity analysis, see reference 2.
Backpropagation in neural networks
We take this diagram as an example to illustrate the computational process of the BP algorithm in a neural network.
In the previous section I skipped the reason for using the sigmoid function; the explanation is that the sigmoid has a very convenient derivative, which can be expressed in terms of the function itself:
\[y'=y(1-y)\]
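A quick numerical check of this identity (a minimal sketch; the test point `x = 0.7` is arbitrary):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 0.7
y = sigmoid(x)
analytic = y * (1 - y)                               # y' = y(1 - y)
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference
print(abs(analytic - numeric) < 1e-6)
```

Because the derivative is a simple function of the already-computed activation, the backward pass never needs to re-evaluate the exponential.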
We use the mean square error as the loss function; the training sample is \((\vec{x},\vec{t})\).
We want to minimize the loss function:
\[loss=\frac{1}{2}\sum_i(t_i-y_i)^2\]
(the factor \(\frac{1}{2}\) cancels the 2 produced by differentiation, which is what makes the error terms below come out clean).
The error term of output-layer node \(i\):
\[\delta_i=y_i(1-y_i)(t_i-y_i)\]
Substituting a concrete node, e.g. node 8 (the first output node):
\[\delta_8=y_1(1-y_1)(t_1-y_1)\]
For hidden-layer nodes:
\[\delta_i=a_i(1-a_i)\sum_{k\in outputs}w_{ki}\delta_k\]
Substituting node 4:
\[\delta_4=a_4(1-a_4)(w_{84}\delta_8+w_{94}\delta_9)\]
The formulas above give the error terms; substituting the error terms into gradient descent yields the weight update rule:
\[w_{ji}\gets w_{ji}+\eta\delta_jx_{ji}\]
Here \(x_{ji}\) is the value that node \(i\) passes to node \(j\).
So, for example, the update rule for weight \(w_{84}\) is:
\[w_{84}\gets w_{84}+\eta\delta_8 a_4\]
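One concrete update step can be sketched as follows. The activation values, weights, and targets below are made-up numbers for illustration only (the text does not specify them), and \(\delta_9\) is simply assumed given:

```python
# One BP update on the small network above (hypothetical numbers).
a4 = 0.6                    # output of hidden node 4
y1, t1 = 0.8, 1.0           # output node 8's value and its target
w84, w94 = 0.4, -0.2        # weights 4->8 and 4->9 (made up)
delta9 = 0.05               # error term of output node 9 (assumed given)
eta = 0.1                   # learning rate

delta8 = y1 * (1 - y1) * (t1 - y1)                       # output-layer error
delta4 = a4 * (1 - a4) * (w84 * delta8 + w94 * delta9)   # hidden-layer error
w84 = w84 + eta * delta8 * a4                            # gradient-descent update
print(round(delta8, 4), round(w84, 5))
```

Each weight update only needs the error term of the downstream node and the activation of the upstream node, which is why one backward sweep suffices for the whole network.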
References
- An article that explains the backpropagation algorithm using computation graphs, highly recommended
- An analysis of the complexity of the BP algorithm
- Deep Learning for Beginners (3): Neural networks and the backpropagation algorithm