Solving Neural Networks with Gradient Descent


This document contains notes from Andrew Ng's machine learning course on Coursera.

Overall steps
  • Determine the Network Model
  • Initialize the Weight Parameters
  • For each sample, repeat the following steps until convergence:
  1. Compute the model output: Forward Propagation
  2. Calculate the cost function: measure the gap between the model output and the actual output
  3. Update the weight parameters: Back Propagation
Determine the Network Model

The neural network model consists of three parts: the input layer (layer 1), the hidden layers (layers 2, ..., L-1), and the output layer (layer L). Each unit in the input layer represents a feature, and each unit in the output layer represents a category.

Suppose our goal is to recognize a handwritten digit (0-9) on a 20x20 pixel image. The input layer is then a 400-dimensional vector, where each dimension is the pixel value at a certain position. The first unit of the output layer outputs a number between 0 and 1 indicating the probability that the image is the digit 0; the other units correspond to the remaining digits in the same way.

The lines between adjacent layers carry the weight parameters, denoted by \(\Theta\). For example, \(\Theta_{53}^{(1)}\) denotes the weight on the edge connecting the 3rd node of layer 1 to the 5th node of layer 2. These weights determine the behavior of the model, and the goal of the neural network is to learn them from the samples.

Each node in the hidden layers and the output layer (the orange circles in the figure) applies the logistic function \(g(z) = a\). For example, \(z_1^{(2)}\) denotes the input value of the first node in the second layer, and its output \(a_1^{(2)}\) (a stands for activation) is obtained by passing \(z_1^{(2)}\) through the logistic function.
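As a small sketch, the logistic activation can be written in Octave as follows (the file name sigmoid.m is our own choice):

% Logistic (sigmoid) activation g(z) = 1 / (1 + e^(-z)).
% Works element-wise, so z may be a scalar, a vector, or a matrix.
function a = sigmoid(z)
  a = 1 ./ (1 + exp(-z));
end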

Initialize Weight Parameters

If we initialize all weight parameters θ to 0, every node in a layer receives exactly the same updates during training, so all nodes in a layer end up computing the same thing and the whole layer behaves like a single node. To avoid this, we randomly initialize each weight in the interval [-ε, ε], where ε is a preset, sufficiently small value.

% L_in / L_out: number of nodes in the incoming / outgoing layer.
epsilon_init = 0.12;
W = rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init;
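For example, for the digit-recognition network above, each layer's weight matrix could be initialized this way; the hidden-layer size of 25 and the names Theta1 / Theta2 are assumptions for illustration:

% Assumed layer sizes for the digit example: 400 inputs, 25 hidden units, 10 outputs.
input_layer_size  = 400;
hidden_layer_size = 25;
num_labels        = 10;
epsilon_init = 0.12;
Theta1 = rand(hidden_layer_size, 1 + input_layer_size) * 2 * epsilon_init - epsilon_init;
Theta2 = rand(num_labels, 1 + hidden_layer_size) * 2 * epsilon_init - epsilon_init;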
Training the Model

The training process of neural networks is similar to that of linear regression and logistic regression:

  • Calculate the cost function J(θ)
  • Adjust the parameters θ to make the cost function value as small as possible

For each sample, the next task is to compute the output of the current model, evaluate the cost function, and then update the weight parameters based on the result.

Model output: Forward Propagation

Computing the output of the neural network model is relatively simple:

  • Set \(a^{(1)} = x\), then add the bias term (+1), that is, \(a^{(1)} = [1;\ a^{(1)}]\)
  • \(z^{(2)} = \Theta^{(1)} a^{(1)}\)
  • \(a^{(2)} = g(z^{(2)})\), then add the bias term (+1), that is, \(a^{(2)} = [1;\ a^{(2)}]\)
  • \(z^{(3)} = \Theta^{(2)} a^{(2)}\)
  • \(a^{(3)} = g(z^{(3)})\), which is the final output.

Note the bias terms added at each step.
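A minimal Octave sketch of these steps, assuming the sigmoid function sketched earlier and weight matrices Theta1 and Theta2 initialized as above (x is one sample as a column vector):

% Forward propagation for one sample x (a 400x1 column vector of pixel values).
a1 = [1; x];            % a^(1) = x with the bias term added
z2 = Theta1 * a1;       % z^(2) = Theta^(1) a^(1)
a2 = [1; sigmoid(z2)];  % a^(2) = g(z^(2)) with the bias term added
z3 = Theta2 * a2;       % z^(3) = Theta^(2) a^(2)
a3 = sigmoid(z3);       % a^(3) = g(z^(3)), the model output (one value per class)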

Cost function

The formula is as follows:
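The figure with the formula is not reproduced here; for reference, the regularized cost function used in the course, matching the description below, is:

\[ J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \Big[ y_k^{(i)} \log\big(h_\Theta(x^{(i)})_k\big) + \big(1 - y_k^{(i)}\big) \log\big(1 - h_\Theta(x^{(i)})_k\big) \Big] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(\Theta_{ji}^{(l)}\big)^2 \]

where \(s_l\) is the number of units in layer \(l\) and \(\lambda\) is the regularization parameter.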

This formula is similar to the cost function defined for logistic regression. Here m is the number of samples. Because the neural network has K outputs, the cost function sums the cost over all K outputs. The second term of the formula is the regularization term, which prevents overfitting.

With the cost function in hand, our next goal is to adjust the parameters to minimize it. This is done with the back propagation algorithm.

Update the weight parameters: Back Propagation

We use the back propagation algorithm to obtain the partial derivative of the cost function with respect to each weight and then update each weight. The key to back propagation is the chain rule.

Take the figure on the right as an example. Suppose we want \({\partial J}/{\partial \Theta_{45}^{(3)}}\), the partial derivative with respect to the weight on the edge from node 5 of layer 3 to node 4 of layer 4. We have:

\({\partial J}/{\partial \Theta_{45}^{(3)}} = [{\partial J}/{\partial a_4^{(4)}} \cdot {\partial a_4^{(4)}}/{\partial z_4^{(4)}}] \cdot {\partial z_4^{(4)}}/{\partial \Theta_{45}^{(3)}}\)

This is an application of the chain rule. The terms on the right are computed as follows:

① \({\partial J}/{\partial a_4^{(4)}} \cdot {\partial a_4^{(4)}}/{\partial z_4^{(4)}} = a_4^{(4)} - y_4\)

② \({\partial z_4^{(4)}}/{\partial \Theta_{45}^{(3)}} = a_5^{(3)}\)

Next, we calculate \({\partial J}/{\partial \Theta_{53}^{(2)}}\), the partial derivative with respect to the weight on the edge from node 3 of layer 2 to node 5 of layer 3. We have:

\({\partial J}/{\partial \Theta_{53}^{(2)}} = [[{\partial J}/{\partial a_1^{(4)}} \cdot {\partial a_1^{(4)}}/{\partial z_1^{(4)}}] \cdot {\partial z_1^{(4)}}/{\partial a_5^{(3)}} \cdot {\partial a_5^{(3)}}/{\partial z_5^{(3)}}] \cdot {\partial z_5^{(3)}}/{\partial \Theta_{53}^{(2)}} + {\partial J}/{\partial a_2^{(4)}} \cdots\)

It can be seen that the partial-derivative formula for one layer's weights overlaps with the formula for the next layer's weights, so we can directly reuse the intermediate results; this is what "propagation" means. In the formulas above, the [brackets] mark the delta terms.

Delta: \(\delta_j^{(l)}\) (node j of layer l)

This section describes the relationship between \(\delta^{(l)}\) and \(\delta^{(l-1)}\).

Denote the bracketed terms in the formulas above by \(\delta_j^{(l)}\) (node j of layer l). Substituting them into the overlapping parts, we get:

  • \(\delta_j^{(4)} = a_j^{(4)} - y_j\)
  • \(\delta^{(3)} = (\Theta^{(3)})^T \delta^{(4)} \;.\!* \;g'(z^{(3)})\)
  • \(\delta^{(2)} = (\Theta^{(2)})^T \delta^{(3)} \;.\!* \;g'(z^{(2)})\)
  • Note: there is no \(\delta^{(1)}\)

\(\delta_j^{(l)}\) can be understood as the error between the ideal value and the actual value at node j of layer l. For the output layer, the error is simply the difference between the ideal and actual values; for the hidden layers, the error is obtained by propagating the output-layer error backward. Intuitively, each node takes its share of responsibility for the errors of the next layer.
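A minimal Octave sketch of these delta computations for a single sample, continuing from the forward-propagation sketch above (y is assumed to be a one-hot label column vector; note that the bias column of Theta2 is dropped so the sizes match z2):

% Back propagation of the error for one sample.
% y is assumed to be the label as a one-hot column vector (e.g. 10x1 for the digits).
delta3 = a3 - y;                                      % output-layer error
g_prime_z2 = sigmoid(z2) .* (1 - sigmoid(z2));        % sigmoid gradient g'(z2)
delta2 = (Theta2(:, 2:end)' * delta3) .* g_prime_z2;  % hidden-layer error (bias column dropped)
% There is no delta1 for the input layer.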

Relationship between partial derivative and Delta

This section describes the relationship between delta and the partial derivative, given by the following formula:

\(\partial J(\Theta)/\partial \Theta_{ij}^{(l)} = a_j^{(l)} \, \delta_i^{(l+1)}\)

From this we can see that the partial derivatives for a layer's weights can be obtained directly from that layer's activations and the next layer's deltas.
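Continuing the same sketch, this formula becomes one outer product per layer (the gradient variable names are our own):

% Gradients of J with respect to the weights, one outer product per layer:
% each entry equals a_j^(l) * delta_i^(l+1).
Theta2_grad = delta3 * a2';   % size: num_labels x (1 + hidden_layer_size)
Theta1_grad = delta2 * a1';   % size: hidden_layer_size x (1 + input_layer_size)
% Over a training set, these per-sample gradients are summed and divided by m.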

Summary

The principle of back propagation is not complex. The key is applying the chain rule for partial derivatives during the backward pass. In practice, pay attention to the following:

  • The recurrence relationship between the deltas of adjacent layers
  • The relationship between the partial derivatives and the deltas

These two relationships are easy to forget after a while, so review them promptly.

Training Process

The complete training process is as follows:
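The original figure is not reproduced here. Below is a rough Octave sketch of one way to put the pieces above together, assuming X (m x 400) and one-hot labels Y (m x 10) are given, Theta1 and Theta2 are initialized as above, and plain batch gradient descent with a fixed learning rate is used; regularization is omitted for brevity:

% One sketchy training loop: forward propagation, back propagation,
% gradient accumulation over all m samples, then a gradient-descent step.
m = size(X, 1);              % number of training samples
alpha = 1.0;                 % learning rate (assumed value)
num_iterations = 100;        % number of gradient-descent steps (assumed value)
for iter = 1:num_iterations
  Delta1 = zeros(size(Theta1));   % gradient accumulators
  Delta2 = zeros(size(Theta2));
  for i = 1:m
    % Forward propagation (as above)
    a1 = [1; X(i, :)'];
    z2 = Theta1 * a1;
    a2 = [1; sigmoid(z2)];
    z3 = Theta2 * a2;
    a3 = sigmoid(z3);
    % Back propagation of the error (as above)
    delta3 = a3 - Y(i, :)';
    delta2 = (Theta2(:, 2:end)' * delta3) .* (sigmoid(z2) .* (1 - sigmoid(z2)));
    % Accumulate the per-sample gradients
    Delta2 = Delta2 + delta3 * a2';
    Delta1 = Delta1 + delta2 * a1';
  end
  % Gradient-descent update (regularization omitted for brevity)
  Theta1 = Theta1 - alpha * Delta1 / m;
  Theta2 = Theta2 - alpha * Delta2 / m;
end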

Gradient checking

If you train with the method above, a bug may creep in somewhere and make the result wrong. A simple gradient check can catch most of these bugs.

The idea of the gradient check is simple. The BP algorithm runs over all the samples and produces the gradient of J with respect to each θ (error-prone but efficient); the gradient check computes the same gradient numerically (reliable but time-consuming). Compare the two: if the difference is small, the BP implementation is working correctly. The following shows how to perform a gradient check:

Because the gradient check is time-consuming, use it only for the first few iterations and then turn it off.

for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) = thetaPlus(i) + EPSILON;
  thetaMinus = theta;
  thetaMinus(i) = thetaMinus(i) - EPSILON;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*EPSILON);
end;
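One way to use the result (the name DVec for the unrolled back-propagation gradient and the tolerance value are assumptions):

% Compare the numerical gradient with the back-propagation gradient.
% DVec is assumed to hold the unrolled back-propagation gradient, same shape as gradApprox.
relative_diff = norm(gradApprox - DVec) / norm(gradApprox + DVec);
if relative_diff < 1e-9
  disp('Gradient check passed: back propagation looks correct.');
else
  disp('Gradient check failed: inspect the back propagation code.');
end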
