Summary
This article is an extension of the backpropagation section of Andrew Ng's machine learning course on Coursera. It is divided into three parts: the first part presents a simple neural network model and the specific flow of the backpropagation (hereinafter BP) algorithm. The second part explains the BP algorithm by computing the gradient of the first parameter in the first and second layers (parameters are also called weights in a neural network) and gives the detailed derivation. The third part uses a more intuitive illustration to explain how the BP algorithm works.
Note: 1. There are many formulas in the text; reading on a PC gives better typesetting.
2. For ease of discussion, the bias units are omitted, and the regularization term of the cost function is omitted in the discussion of Part II.
3. If you are familiar with the notation used in Ng's course, the recommended reading order is: Part I, Part III, Part II.
Part I The specific flow of the BP algorithm
Figure 1.1 shows a simple neural network model (the bias units are omitted):
Figure 1.1 A simple neural network model
The notation used is consistent with Ng's course:
\(x_1, x_2, x_3\) are the input values, i.e. the three features of \(x^{(i)}\);
\(z^{(l)}_j\) is the input value of the j-th unit in layer l;
\(a^{(l)}_j\) is the output value of the j-th unit in layer l, where a = g(z) and g is the sigmoid function;
\(\Theta_{ij}^{(l)}\) is the parameter (weight) matrix mapping layer l to layer l+1.
Table 1.1 The specific flow of the BP algorithm (MATLAB pseudocode)
1 for i = 1:m,
2 \(a^{(1)} = x^{(i)};\)
3 compute \(a^{(2)}, a^{(3)}\) using the feedforward propagation algorithm;
4 \(\delta^{(3)} = a^{(3)} - y^{(i)};\)
5 \(\delta^{(2)} = (\Theta^{(2)})^T \delta^{(3)} .* g'(z^{(2)});\) # the second operator, '.*', denotes element-wise multiplication
6 \(\Delta^{(2)} = \Delta^{(2)} + \delta^{(3)} (a^{(2)})^T;\)
7 \(\Delta^{(1)} = \Delta^{(1)} + \delta^{(2)} (a^{(1)})^T;\)
8 end;
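To make the pseudocode in Table 1.1 concrete, here is a minimal runnable sketch in MATLAB/Octave of one feedforward and one backward pass for a single training example. It assumes a 3-3-1 architecture (three inputs, three hidden units, one output, matching the three-term sums used in Part II), no bias units, and illustrative variable names (Theta1, Theta2, etc.) that are not taken from the course code.

% A minimal sketch of one BP pass (assumed 3-3-1 network, no bias units).
g  = @(z) 1 ./ (1 + exp(-z));           % sigmoid
gp = @(z) g(z) .* (1 - g(z));           % its derivative, g'(z) = g(z)(1 - g(z))
Theta1 = randn(3, 3) * 0.1;             % Theta^(1): maps layer 1 to layer 2
Theta2 = randn(1, 3) * 0.1;             % Theta^(2): maps layer 2 to layer 3
x = [0.5; -0.2; 0.1];                   % one training example (three features)
y = 1;                                  % its label
% Feedforward propagation (lines 2-3 of Table 1.1)
a1 = x;
z2 = Theta1 * a1;   a2 = g(z2);
z3 = Theta2 * a2;   a3 = g(z3);
% Backpropagation (lines 4-7 of Table 1.1; for a single example, so the
% accumulators hold just this example's contribution to the gradient)
delta3 = a3 - y;                        % line 4
delta2 = (Theta2' * delta3) .* gp(z2);  % line 5
Delta2 = delta3 * a2';                  % line 6: gradient w.r.t. Theta2
Delta1 = delta2 * a1';                  % line 7: gradient w.r.t. Theta1

In the full algorithm, Delta1 and Delta2 would be initialized to zero before the loop, accumulated over all m examples, and then divided by m to obtain the gradient of J(Θ).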
Part II Detailed steps and derivation of the BP algorithm
The goal of the BP algorithm is to provide gradient values to an optimization routine (such as gradient descent or other advanced optimization methods). Concretely, the BP algorithm computes the partial derivative of the cost function with respect to each parameter, \(\frac{\partial}{\partial \Theta^{(l)}_{ij}} J(\Theta)\), and the resulting values are accumulated in the matrices \(\Delta^{(l)}\).
If the neural network has K outputs (K classes), then its J(Θ) is:
\[J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y^{(i)}_k \log\left(h_\Theta(x^{(i)})_k\right) + (1-y^{(i)}_k)\log\left(1-h_\Theta(x^{(i)})_k\right)\right]\]
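As a small, hedged illustration of this cost function, the unregularized cost can be computed in vectorized MATLAB/Octave as follows; A3 and Y are illustrative names for a K x m matrix of network outputs \(h_\Theta(x^{(i)})\) and the matching K x m matrix of one-hot labels.

% Example data: K = 2 classes, m = 3 examples (values are purely illustrative).
A3 = [0.9 0.2 0.7;
      0.1 0.8 0.3];                     % columns are the outputs h_Theta(x^(i))
Y  = [1 0 1;
      0 1 0];                           % columns are the one-hot labels y^(i)
m  = size(Y, 2);
% Unregularized K-class cross-entropy cost, as in the formula above
J  = -(1/m) * sum(sum( Y .* log(A3) + (1 - Y) .* log(1 - A3) ));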
Next, take the calculation of \(\Theta_{11}^{(1)}\) and \(\Theta_{11}^{(2)}\) as examples to give the detailed steps of the BP algorithm. For the k-th output of the i-th training example, the cost function is:
\[J(\Theta) = -\left[y^{(i)}_k \log\left(h_\Theta(x^{(i)})_k\right) + (1-y^{(i)}_k)\log\left(1-h_\Theta(x^{(i)})_k\right)\right] \quad \text{(Formula 1)}\]
where \(h_\Theta(x) = a^{(L)} = g(z^{(L)})\) and g is the sigmoid function.
Calculating the gradient of \(\Theta_{11}^{(2)}\):
\[\frac{\partial J(\Theta)}{\partial \Theta_{11}^{(2)}} = \frac{\partial J(\Theta)}{\partial a_1^{(3)}} \cdot \frac{\partial a_1^{(3)}}{\partial z_1^{(3)}} \cdot \frac{\partial z_1^{(3)}}{\partial \Theta_{11}^{(2)}} \quad \text{(Formula 2)}\]
Take the first two factors on the right-hand side of Formula 2 and denote them \(\delta_1^{(3)}\):
\[\delta_1^{(3)} = \frac{\partial J(\Theta)}{\partial a_1^{(3)}} \cdot \frac{\partial a_1^{(3)}}{\partial z_1^{(3)}} \quad \text{(Formula 3)}\]
This gives the definition of \(\delta^{(l)}\), which is:
\[\delta^{(l)} = \frac{\partial}{\partial z^{(l)}} J(\Theta) \quad \text{(Formula 4)}\]
Now calculate Formula 3 in detail, taking the partial derivative of J(Θ) with respect to \(z_1^{(3)}\) (which is abbreviated to z in the calculation below):
\[\delta_1^{(3)} = \frac{\partial J(\Theta)}{\partial a_1^{(3)}} \cdot \frac{\partial a_1^{(3)}}{\partial z_1^{(3)}}\]
\[= -\left[y \cdot \frac{1}{g(z)} \cdot g'(z) + (1-y) \cdot \frac{1}{1-g(z)} \cdot (-g'(z))\right]\]
\[= -\left[y(1-g(z)) + (y-1)g(z)\right]\]
\[= g(z) - y = a_1^{(3)} - y\]
Here a useful property of the sigmoid function is used:
\[g'(z) = g(z)(1-g(z)) \quad \text{(easy to verify)}\]
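For completeness, here is a short verification of this identity (added as a worked step; it is not spelled out in the original). Starting from the definition of the sigmoid:
\[g(z) = \frac{1}{1+e^{-z}} \quad\Rightarrow\quad g'(z) = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}} = g(z)\left(1-g(z)\right),\]
since \(1-g(z) = \frac{e^{-z}}{1+e^{-z}}\).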
This gives the fourth line of the BP algorithm in table 1.1.
Next, consider the last factor on the right-hand side of Formula 2, \(\frac{\partial z_1^{(3)}}{\partial \Theta_{11}^{(2)}}\):
where \(z_1^{(3)} = \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}\), so it is easy to see that:
\[\frac{\partial z_1^{(3)}}{\partial \Theta_{11}^{(2)}} = a_1^{(2)} \quad \text{(Formula 5)}\]
Looking back at Formula 2 and substituting Formula 3 and Formula 5, we get:
\[\frac{\partial J(\Theta)}{\partial \Theta_{11}^{(2)}} = \delta_1^{(3)} \cdot a_1^{(2)}\]
In this way, the sixth line of BP algorithm in table 1.1 is deduced.
This completes the calculation of \(\Theta_{11}^{(2)}\).
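As an added sanity check (not part of the original derivation), this result can be verified numerically with a two-sided finite difference. The snippet below reuses g, Theta1, Theta2, x, y, a2, and a3 from the sketch after Table 1.1; eps_ is an illustrative perturbation size.

% Numerical check of dJ/dTheta2(1,1) = delta_1^(3) * a_1^(2) for one example
eps_ = 1e-4;
T2p = Theta2;  T2p(1,1) = T2p(1,1) + eps_;
T2m = Theta2;  T2m(1,1) = T2m(1,1) - eps_;
a3p = g(T2p * g(Theta1 * x));           % forward pass with perturbed parameter
a3m = g(T2m * g(Theta1 * x));
Jp  = -( y*log(a3p) + (1-y)*log(1-a3p) );
Jm  = -( y*log(a3m) + (1-y)*log(1-a3m) );
numGrad = (Jp - Jm) / (2*eps_);         % numerical estimate of the derivative
bpGrad  = (a3 - y) * a2(1);             % delta_1^(3) * a_1^(2), as derived above
disp([numGrad bpGrad])                  % the two values should agree closely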
Calculating the gradient of \(\Theta_{11}^{(1)}\):
\[\frac{\partial J(\Theta)}{\partial \Theta_{11}^{(1)}} = \frac{\partial J(\Theta)}{\partial a_1^{(3)}} \cdot \frac{\partial a_1^{(3)}}{\partial z_1^{(3)}} \cdot \frac{\partial z_1^{(3)}}{\partial a_1^{(2)}} \cdot \frac{\partial a_1^{(2)}}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial \Theta_{11}^{(1)}} \quad \text{(Formula 6)}\]
Similarly, according to the definition of \(\delta^{(l)}\) in Formula 4, the first four factors on the right-hand side of Formula 6 can be denoted \(\delta_1^{(2)}\). That is:
\[\delta_1^{(2)} = \frac{\partial J(\Theta)}{\partial a_1^{(3)}} \cdot \frac{\partial a_1^{(3)}}{\partial z_1^{(3)}} \cdot \frac{\partial z_1^{(3)}}{\partial a_1^{(2)}} \cdot \frac{\partial a_1^{(2)}}{\partial z_1^{(2)}} \quad \text{(Formula 7)}\]
Notice that \(\delta_1^{(3)}\) from Formula 3 is exactly the first two factors on the right-hand side of this equation.
This reveals the meaning of \(\delta^{(l)}\): it saves the intermediate results of the calculation at layer l. When calculating \(\delta^{(l-1)}\), these saved results can be reused to continue the layer-by-layer differentiation toward the input layer. In this way, when the neural network is complex and the amount of computation is large, a great deal of repeated calculation is avoided, which effectively improves the learning speed of the neural network.
Continuing with Formula 7, the third factor on the right-hand side is easy to calculate (recall that \(z_1^{(3)} = \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}\)):
\[\frac{\partial z_1^{(3)}}{\partial a_1^{(2)}} = \Theta_{11}^{(2)} \quad \text{(Formula 8)}\]
The last factor on the right-hand side of Formula 7 is:
\[\frac{\partial a_1^{(2)}}{\partial z_1^{(2)}} = g'(z_1^{(2)}) \quad \text{(Formula 9)}\]
Substituting \(\delta_1^{(3)}\), Formula 8, and Formula 9 into Formula 7, we obtain:
\[\delta_1^{(2)} = \delta_1^{(3)} \cdot \Theta_{11}^{(2)} \cdot g'(z_1^{(2)}) \quad \text{(Formula 10)}\]
In this way, the fifth line of BP algorithm in table 1.1 is deduced.
Next, we calculate the last factor on the right-hand side of Formula 6. Since \(z_1^{(2)} = \Theta_{11}^{(1)} a_1^{(1)} + \Theta_{12}^{(1)} a_2^{(1)} + \Theta_{13}^{(1)} a_3^{(1)}\), it is easy to obtain:
\[\frac{\partial z_1^{(2)}}{\partial \Theta_{11}^{(1)}} = a_1^{(1)} \quad \text{(Formula 11)}\]
Substituting Formula 10 and Formula 11 back into Formula 6, we get:
\[\frac{\partial J(\Theta)}{\partial \Theta_{11}^{(1)}} = \delta_1^{(2)} \cdot a_1^{(1)}\]
Thus, the seventh line of the BP algorithm in table 1.1 can be obtained.
This completes the calculation of \(\Theta_{11}^{(1)}\).
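The same kind of numerical sanity check (again an addition, reusing g, Theta1, Theta2, x, y, a1, and delta2 from the earlier sketches) can be applied to this first-layer result.

% Numerical check of dJ/dTheta1(1,1) = delta_1^(2) * a_1^(1) for one example
eps_ = 1e-4;
T1p = Theta1;  T1p(1,1) = T1p(1,1) + eps_;
T1m = Theta1;  T1m(1,1) = T1m(1,1) - eps_;
a3p = g(Theta2 * g(T1p * x));
a3m = g(Theta2 * g(T1m * x));
Jp  = -( y*log(a3p) + (1-y)*log(1-a3p) );
Jm  = -( y*log(a3m) + (1-y)*log(1-a3m) );
numGrad = (Jp - Jm) / (2*eps_);
bpGrad  = delta2(1) * a1(1);            % delta_1^(2) * a_1^(1), as derived above
disp([numGrad bpGrad])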
Part III An intuitive illustration of the BP algorithm
An overview of the neural network learning algorithm
Given a function f(x), with respect to what is it differentiated? Its input value, that is, its argument x. What about f(g(x))? There, g(x) as a whole is the input value and the argument of f, so f is differentiated with respect to g(x) as a whole. In short, a function is differentiated with respect to its input value, that is, its argument. Keeping this in mind, we will not get lost when applying the chain rule to the partial derivatives of multivariate composite functions.
In Figure 3.1, reading from bottom to top, each box is the input value of the box above it, that is, the argument of the function in the box above. The diagram makes the relationships between the quantities in the neural network very clear: which one is the input value and which one is the function. Since, as noted above, a function is differentiated with respect to its input value, Figure 3.1 makes it very convenient to apply the chain rule, and the flow of the BP algorithm can also be observed clearly (the next section gives a more specific BP flowchart).
With the neural network model of Figure 1.1 given in Part I in mind, Figure 3.1 should be easy to understand: it roughly shows the learning (training) process of the neural network. The feedforward propagation algorithm computes from the bottom up and finally yields \(a^{(3)}\), from which J(Θ) can be computed. The BP algorithm differentiates layer by layer from the top down and finally yields the gradient of each parameter. The next section takes a closer look at the topic of this article: an illustrated walk-through of the BP algorithm.
Figure 3.1 Overview of neural network learning Algorithms
An intuitive illustration of the BP algorithm
Figure 3.2 shows the calculation flow of the BP algorithm, with the specific calculation steps attached. The flow of the BP algorithm is clearly visible in this diagram: differentiation proceeds layer by layer from top to bottom (in the neural network model, from the output layer to the input layer). Because of the complexity of neural networks, it is easy to get bogged down in the quagmire of differentiating multivariate composite functions: with respect to which variable should the derivative be taken? Figure 3.2 straightens out the relationships between the quantities in the neural network, making it clear which is the input value and which is the function, so that the chain rule can then be applied freely.
Figure 3.2 BP Algorithm flow
So the BP algorithm, that is, the backpropagation algorithm, is the process of differentiating the cost function J(Θ) with respect to each parameter \(\Theta_{ij}^{(l)}\), which in the neural network model corresponds to propagating the derivatives layer by layer from the output layer to the input layer. In Figure 3.2, when the backward propagation reaches the \(a_1^{(2)}\) node, it meets a fork: choosing to differentiate with respect to \(\Theta_{11}^{(2)}\) yields the gradient of the second-layer parameter, while choosing the \(a_1^{(2)}\) path and continuing to differentiate downward (that is, toward the input layer) yields the gradient of the first-layer parameter, and so the purpose of the BP algorithm is achieved. Before the fork is chosen, \(\delta^{(l)}\) is used to save the partial results computed up to that point (\(\delta^{(l)}\) is defined precisely in Part II of this article). If we then choose to continue differentiating downward, these partial results can be reused. Thus a large number of repeated computations are avoided, which effectively improves the learning speed of the neural network algorithm.
Therefore, two salient features of the BP algorithm can be observed:
1) It propagates from the output layer to the input layer (i.e. backward), differentiating layer by layer, and in this process the gradients of the parameters of each layer are obtained step by step.
2) During this backward propagation, \(\delta^{(l)}\) is used to save intermediate results, avoiding a large amount of repeated computation, which is why the algorithm performs well.
1. If reprinting, please indicate the source at the beginning of the article, thank you.
2. If you find this article helpful, you are also welcome to buy me a cup of tea. : )