These are notes on Andrew Ng's machine learning course on Coursera.
Overall steps
- Determine the Network Model
- Initialize Weight Parameters
- For each sample, repeat the following steps until convergence
- Compute the model output: forward propagation
- Compute the cost function: measure the gap between the model output and the actual output
- Update the weight parameters: back propagation
Determine the Network Model
The neural network model consists of three parts: the input layer (layer 1), the intermediate layers (layers 2, ..., L-1), and the output layer (layer L). Each unit in the input layer represents a feature, and each unit in the output layer represents a category.
If our goal is to recognize a digit (0 to 9) written on a 20x20 pixel image, the input layer is a 400-dimensional vector in which each dimension is the pixel value at a certain position. The first unit of the output layer outputs a number between 0 and 1 that indicates the probability that the digit is 0, the second unit does the same for the digit 1, and so on.
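To make the output encoding concrete, here is a minimal Python sketch (the course itself uses Octave) of the target vector for one training label. The helper name `one_hot` is hypothetical, and the sketch assumes, as in the paragraph above, that the first output unit stands for the digit 0.

```python
def one_hot(digit, num_classes=10):
    """Return a target vector with 1.0 at the unit for `digit`, 0.0 elsewhere.

    Assumption: unit 0 stands for the digit 0, unit 1 for the digit 1, etc.
    """
    if not 0 <= digit < num_classes:
        raise ValueError("digit out of range")
    return [1.0 if k == digit else 0.0 for k in range(num_classes)]


target = one_hot(3)
print(target)
```

The network's output vector is compared against this target unit by unit when the cost is computed.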
Each edge between adjacent layers carries a weight parameter, written θ. For example, \(\Theta_{53}^{(1)}\) is the weight of the edge from the 3rd node of layer 1 to the 5th node of layer 2. These weights determine the behavior of the model, and the goal of training is to learn them from samples.
Each node in the intermediate layers and the output layer (the orange circles in the figure) applies the logistic function \(g(z) = 1/(1 + e^{-z}) = a\). For example, \(z_1^{(2)}\) is the input value of the first node of layer 2, and its output \(a_1^{(2)}\) (a stands for activation) is obtained by applying the logistic function to \(z_1^{(2)}\).
Initialize Weight Parameters
If we initialize all the weights θ to 0, every node in a layer computes the same value and receives the same update during training, so a whole layer is no more useful than a single node. To avoid this, initialize each weight randomly in [-ε, ε], where ε is a small preset value.
% L_in, L_out: number of nodes in the incoming and outgoing layers
epsilon_init = 0.12;
% the extra column (1 + L_in) holds the bias weights
W = rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init;
Training Model
The training process of neural networks is similar to that of linear regression and logistic regression:
- Compute the cost function J(θ)
- Adjust the parameters θ to make the cost function as small as possible
So for each sample we compute the output of the current model, evaluate the cost function, and then update the weights based on that output.
Model output: Forward Propagation
Computing the output of the neural network model is relatively simple:
- \(a^{(1)} = x\); add the bias unit (+1), that is, \(a^{(1)} = [1;\ a^{(1)}]\)
- \(z^{(2)} = \Theta^{(1)} a^{(1)}\)
- \(a^{(2)} = g(z^{(2)})\); add the bias unit (+1), that is, \(a^{(2)} = [1;\ a^{(2)}]\)
- \(z^{(3)} = \Theta^{(2)} a^{(2)}\)
- \(a^{(3)} = g(z^{(3)})\), which is the output of the model.
Note the bias unit that is added to each layer's activations before moving on to the next layer.
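The steps above can be sketched in Python (the course itself uses Octave). This is an illustration under assumptions rather than the course's code: the names `sigmoid`, `mat_vec` and `forward` and the tiny example weights are all hypothetical, and each weight matrix is assumed to store the bias weight in its first column.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def mat_vec(theta, a):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, a)) for row in theta]


def forward(x, thetas):
    """Run forward propagation and return the activations of every layer.

    x:      feature vector without the bias unit
    thetas: list of weight matrices; each has one row per node of the next
            layer and one column per node of the current layer plus bias
    """
    activations = []
    a = list(x)
    for theta in thetas:
        a = [1.0] + a                    # add the bias unit (+1)
        activations.append(a)
        z = mat_vec(theta, a)            # z^(l+1) = Theta^(l) a^(l)
        a = [sigmoid(v) for v in z]      # a^(l+1) = g(z^(l+1))
    activations.append(a)                # output layer, no bias unit
    return activations


# Hypothetical network: 2 inputs, 2 hidden units, 1 output.
theta1 = [[0.1, 0.2, -0.3],
          [-0.2, 0.4, 0.1]]
theta2 = [[0.3, -0.5, 0.2]]
layers = forward([1.0, 0.5], [theta1, theta2])
print(layers[-1])
```

Returning every layer's activations, not just the last one, is a deliberate choice: back propagation needs them later.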
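Cost function
The formula is as follows:
\(J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log (h_\Theta(x^{(i)}))_k + (1 - y_k^{(i)}) \log (1 - (h_\Theta(x^{(i)}))_k) \right]\)
\(\quad + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (\Theta_{ji}^{(l)})^2\)
This formula is similar to the cost function defined for logistic regression. Here m is the number of samples, \(h_\Theta(x)\) is the output of the network, \(s_l\) is the number of nodes in layer l (not counting the bias unit), and λ is the regularization strength. Because the neural network has K outputs, the cost function sums the cost over all K outputs. The second line of the formula is the regularization term, which is there to prevent overfitting.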
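As a rough illustration of how such a cost could be evaluated, here is a Python sketch. It assumes the usual regularized cross-entropy form with regularization strength `lam`, and it leaves the bias weights (column 0 of each matrix) out of the penalty; the function name `cost` and the sample numbers are hypothetical.

```python
import math


def cost(outputs, targets, thetas, lam):
    """Regularized cross-entropy cost J(Theta).

    outputs: list of m model outputs, each a list of K values in (0, 1)
    targets: list of m target vectors, each a list of K values (0 or 1)
    thetas:  list of weight matrices; column 0 of each holds bias weights
    lam:     regularization strength (0 disables the penalty)
    """
    m = len(outputs)
    data_term = 0.0
    for h, y in zip(outputs, targets):
        for h_k, y_k in zip(h, y):
            data_term += y_k * math.log(h_k) + (1.0 - y_k) * math.log(1.0 - h_k)
    data_term *= -1.0 / m

    # Assumption: bias weights (column 0) are not penalized.
    penalty = sum(w * w for theta in thetas for row in theta for w in row[1:])
    return data_term + lam / (2.0 * m) * penalty


outputs = [[0.9, 0.1], [0.2, 0.8]]
targets = [[1.0, 0.0], [0.0, 1.0]]
thetas = [[[0.5, 1.0, -1.0]]]
print(cost(outputs, targets, thetas, lam=0.0))
print(cost(outputs, targets, thetas, lam=1.0))
```

With `lam=0.0` only the data term remains, which makes it easy to see how much the penalty adds.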
With the cost function in hand, our next goal is to adjust the parameters to minimize it. This is done with the back propagation algorithm.
Back Propagation
We use back propagation to obtain the partial derivative of the cost function with respect to each weight, and then update each weight. The key to back propagation is the chain rule for derivatives.
Take the figure on the right as an example. Suppose we need \(\frac{\partial J}{\partial \Theta_{45}^{(3)}}\), the partial derivative with respect to the weight of the edge from the 5th node of layer 3 to the 4th node of layer 4. We have:
\(\frac{\partial J}{\partial \Theta_{45}^{(3)}} = \left[ \frac{\partial J}{\partial a_4^{(4)}} \cdot \frac{\partial a_4^{(4)}}{\partial z_4^{(4)}} \right] \cdot \frac{\partial z_4^{(4)}}{\partial \Theta_{45}^{(3)}}\)
This is a direct application of the chain rule. The factors on the right-hand side are:
① \(\frac{\partial J}{\partial a_4^{(4)}} \cdot \frac{\partial a_4^{(4)}}{\partial z_4^{(4)}} = a_4^{(4)} - y_4\)
② \(\frac{\partial z_4^{(4)}}{\partial \Theta_{45}^{(3)}} = a_5^{(3)}\)
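Step ① is stated without derivation. A short check, assuming a single sample, the unregularized cross-entropy cost \(J = -[y_4 \log a_4^{(4)} + (1 - y_4) \log (1 - a_4^{(4)})]\) for this output node, and the logistic activation \(a = g(z)\) with \(g'(z) = g(z)(1 - g(z))\):

```latex
\begin{aligned}
\frac{\partial J}{\partial a_4^{(4)}}
  &= -\frac{y_4}{a_4^{(4)}} + \frac{1 - y_4}{1 - a_4^{(4)}}
   = \frac{a_4^{(4)} - y_4}{a_4^{(4)}\,(1 - a_4^{(4)})} \\
\frac{\partial a_4^{(4)}}{\partial z_4^{(4)}}
  &= g'(z_4^{(4)}) = a_4^{(4)}\,(1 - a_4^{(4)}) \\
\frac{\partial J}{\partial a_4^{(4)}} \cdot \frac{\partial a_4^{(4)}}{\partial z_4^{(4)}}
  &= a_4^{(4)} - y_4
\end{aligned}
```

The factor \(a_4^{(4)}(1 - a_4^{(4)})\) cancels, which is why the output-layer error takes such a simple form.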
Next, we calculate \(\frac{\partial J}{\partial \Theta_{53}^{(2)}}\), the partial derivative with respect to the weight of the edge from the 3rd node of layer 2 to the 5th node of layer 3. We have:
\(\frac{\partial J}{\partial \Theta_{53}^{(2)}} = \left[ \left[ \frac{\partial J}{\partial a_1^{(4)}} \cdot \frac{\partial a_1^{(4)}}{\partial z_1^{(4)}} \right] \cdot \frac{\partial z_1^{(4)}}{\partial a_5^{(3)}} \cdot \frac{\partial a_5^{(3)}}{\partial z_5^{(3)}} \right] \cdot \frac{\partial z_5^{(3)}}{\partial \Theta_{53}^{(2)}} + \frac{\partial J}{\partial a_2^{(4)}} \cdots\)
with one term of the same shape for each node of layer 4.
The partial-derivative formula for layer l-1 overlaps with the one for layer l, so we can reuse the intermediate results directly; this is what "propagation" means. In the formulas above, the [bracketed] parts mark the quantities we call delta.
Delta (layer l, node j)
This section describes the relationship between \(\delta^{(l)}\) and \(\delta^{(l-1)}\).
Write each [bracketed] part of the formulas as \(\delta_j^{(l)}\) (layer l, node j). Substituting it for the overlapping parts of the formulas above gives:
- \(\delta_j^{(4)} = a_j^{(4)} - y_j\)
- \(\delta^{(3)} = (\Theta^{(3)})^T \delta^{(4)} \;.*\; g'(z^{(3)})\), where .* is the element-wise product
- \(\delta^{(2)} = (\Theta^{(2)})^T \delta^{(3)} \;.*\; g'(z^{(2)})\)
- Note: there is no \(\delta^{(1)}\), because the input layer holds the observed features and has no error.
\(\delta_j^{(l)}\) can be understood as the error of node j in layer l, that is, the gap between its ideal value and its actual value. In the output layer, the error is simply the difference between the actual output and the target value. The error in an intermediate layer is obtained by propagating the output-layer error backward; it can be read as the share of responsibility that the node bears for the error in the next layer.
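The recurrence can be sketched in Python as follows. This is an illustration under assumptions: the function name `backward_deltas` is hypothetical, the activations are assumed to come from forward propagation with the bias unit stored first in every layer except the last, and \(g'(z)\) is computed as \(a(1 - a)\), which holds for the logistic function. The entry that corresponds to the bias unit is dropped, since the bias unit has no incoming edges.

```python
def backward_deltas(activations, thetas, y):
    """Return the delta vectors of layers 2..L for one sample.

    activations: per-layer activations; every layer except the last
                 starts with the bias unit 1.0
    thetas:      weight matrices; thetas[0] is Theta^(1), thetas[1] is
                 Theta^(2), and so on
    y:           target vector for the output layer
    """
    # Output layer: delta^(L) = a^(L) - y
    delta = [a - t for a, t in zip(activations[-1], y)]
    deltas = [delta]
    # Hidden layers, walking backward; stop before the input layer.
    for l in range(len(thetas) - 1, 0, -1):
        theta, a = thetas[l], activations[l]
        # Theta^T delta: one entry per node of this layer, bias included
        back = [sum(theta[i][j] * delta[i] for i in range(len(delta)))
                for j in range(len(a))]
        # .* g'(z) with g'(z) = a (1 - a); skip the bias entry (index 0)
        delta = [back[j] * a[j] * (1.0 - a[j]) for j in range(1, len(a))]
        deltas.insert(0, delta)
    return deltas


# Hypothetical 2-2-1 network; the activations match these weights
# for the input x = [1.0, 0.5].
theta1 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
theta2 = [[0.0, -0.4, 0.4]]
activations = [[1.0, 1.0, 0.5], [1.0, 0.5, 0.5], [0.5]]
deltas = backward_deltas(activations, [theta1, theta2], y=[1.0])
print(deltas)
```

The returned list has one entry per non-input layer, so `deltas[0]` is the delta of layer 2.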
Relationship between partial derivative and Delta
This section describes the relationship between delta and the partial derivatives. It is given by the following formula:
\(\frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(l)}} = a_j^{(l)} \delta_i^{(l+1)}\)
So once the deltas of layer l+1 are known, the partial derivatives for the weights of layer l follow directly.
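In code, this relationship is an outer product of a delta vector and an activation vector. The following Python sketch shows it for a single sample; the name `gradients` and the example numbers are hypothetical, and regularization is left out.

```python
def gradients(activations, deltas):
    """Partial derivatives dJ/dTheta^(l)_ij = a_j^(l) * delta_i^(l+1).

    activations: per-layer activations, bias unit first where present
    deltas:      deltas[l] is the delta vector of the layer that follows
                 activations[l]
    Returns one gradient matrix per weight matrix, for a single sample.
    """
    return [[[d_i * a_j for a_j in a] for d_i in delta]
            for a, delta in zip(activations, deltas)]


# Hypothetical values for a 2-2-1 network.
activations = [[1.0, 1.0, 0.5], [1.0, 0.5, 0.5], [0.5]]
deltas = [[0.05, -0.05], [-0.5]]
grads = gradients(activations, deltas)
print(grads[0])
print(grads[1])
```

Each gradient matrix has the same shape as the weight matrix it belongs to, so it can be subtracted from the weights directly in a gradient descent step.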
Summary
The principle of back propagation (BP) is not complicated. The key is applying the chain rule for partial derivatives during the backward pass. In practice, remember the following:
- The recurrence relation between the delta of one layer and the delta of the next
- The relationship between the partial derivatives and delta
These details are easy to forget after a while, so review them regularly.
Training Process
The complete training process is as follows:
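One way to assemble the steps from the previous sections into a training loop is sketched below in Python. It is an illustration under stated assumptions, not the course's reference implementation: it updates the weights after every sample with plain gradient descent, omits regularization, and the names `train`, `backprop`, `mean_cost`, the learning rate `alpha`, the epoch count, and the toy AND data are all hypothetical choices.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, thetas):
    """Return the activations of every layer (bias unit prepended)."""
    acts, a = [], list(x)
    for theta in thetas:
        a = [1.0] + a
        acts.append(a)
        a = [sigmoid(sum(w * v for w, v in zip(row, a))) for row in theta]
    acts.append(a)
    return acts


def backprop(acts, thetas, y):
    """Return one gradient matrix per weight matrix for a single sample."""
    delta = [a - t for a, t in zip(acts[-1], y)]
    grads = []
    for l in range(len(thetas) - 1, -1, -1):
        a = acts[l]
        grads.insert(0, [[d * a_j for a_j in a] for d in delta])
        if l > 0:  # propagate the error one layer back, skipping the bias
            delta = [sum(thetas[l][i][j] * delta[i] for i in range(len(delta)))
                     * a[j] * (1.0 - a[j]) for j in range(1, len(a))]
    return grads


def mean_cost(samples, thetas):
    """Unregularized cross-entropy cost averaged over the samples."""
    total = 0.0
    for x, y in samples:
        h = forward(x, thetas)[-1]
        total -= sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                     for p, t in zip(h, y))
    return total / len(samples)


def train(samples, layer_sizes, alpha=0.5, epochs=2000, eps=0.12, seed=0):
    rng = random.Random(seed)
    # Initialize every weight randomly in [-eps, eps].
    thetas = [[[rng.uniform(-eps, eps) for _ in range(n_in + 1)]
               for _ in range(n_out)]
              for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
    for _ in range(epochs):
        for x, y in samples:
            acts = forward(x, thetas)                # forward propagation
            grads = backprop(acts, thetas, y)        # back propagation
            for theta, grad in zip(thetas, grads):   # gradient descent step
                for row, g_row in zip(theta, grad):
                    for j, g in enumerate(g_row):
                        row[j] -= alpha * g
    return thetas


# Hypothetical toy data: the logical AND of two inputs.
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [0.0]),
        ([1.0, 0.0], [0.0]), ([1.0, 1.0], [1.0])]
thetas = train(data, layer_sizes=[2, 2, 1])
print(mean_cost(data, thetas))
```

Calling `train` with `epochs=0` returns the freshly initialized weights, which gives a baseline cost to compare against.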
Gradient checking
An implementation of the method above can easily contain subtle bugs that make training give poor results. A simple gradient check catches most of these bugs.
The idea of a gradient check is simple. Running the BP algorithm over all the samples gives the gradient of J with respect to each θ (efficient but error-prone). We then compute the same gradient numerically (reliable but slow) and compare the two. If the difference is small, the BP implementation is working correctly. A gradient check can be performed as follows:
% theta: all weights unrolled into one vector of length n
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) = thetaPlus(i) + EPSILON;
  thetaMinus = theta;
  thetaMinus(i) = thetaMinus(i) - EPSILON;
  % two-sided difference approximates the i-th partial derivative
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
end;
Gradient checking is slow, so use it only for the first few rounds and remember to turn it off afterwards.
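The same check can be sketched in Python and compared against an analytic gradient. To keep the example short, the cost here is that of a single logistic unit with cross-entropy cost, whose analytic gradient is \((a - y)\,x\); the names `numerical_gradient` and `analytic_gradient`, the step size `epsilon`, and the sample values are hypothetical.

```python
import math


def numerical_gradient(J, theta, epsilon=1e-4):
    """Two-sided finite-difference estimate of the gradient of J at theta."""
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += epsilon
        minus[i] -= epsilon
        grad.append((J(plus) - J(minus)) / (2.0 * epsilon))
    return grad


# Hypothetical example: one logistic unit with cross-entropy cost.
x = [1.0, 0.5, -1.5]   # input with the bias unit first
y = 1.0


def J(theta):
    a = 1.0 / (1.0 + math.exp(-sum(t * v for t, v in zip(theta, x))))
    return -(y * math.log(a) + (1.0 - y) * math.log(1.0 - a))


def analytic_gradient(theta):
    a = 1.0 / (1.0 + math.exp(-sum(t * v for t, v in zip(theta, x))))
    return [(a - y) * v for v in x]   # delta times the input activation


theta = [0.1, -0.2, 0.3]
approx = numerical_gradient(J, theta)
exact = analytic_gradient(theta)
max_diff = max(abs(p - q) for p, q in zip(approx, exact))
print(max_diff)
```

For a full network, `J` would take all the weights unrolled into one vector, and `exact` would come from back propagation.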