The backpropagation (BP) algorithm for sparse autoencoders


Given a training set of m training examples, we train the neural network with gradient descent. For a single training example (x, y), the loss function of that example is defined as:
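In the notation of the UFLDL tutorial cited at the end of this post, where h_W,b(x) denotes the network's output (hypothesis) for input x, this per-example cost is the one-half squared error:

J(W,b; x, y) = (1/2) · || h_W,b(x) − y ||^2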

So the loss function for the entire training set is defined as follows:
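Again following the cited tutorial's notation, with m training examples, n_l layers, s_l units in layer l, and weight-decay parameter λ, the overall cost is:

J(W,b) = (1/m) · Σ_{i=1..m} J(W,b; x^(i), y^(i)) + (λ/2) · Σ_{l=1..n_l−1} Σ_{i=1..s_l} Σ_{j=1..s_{l+1}} ( W_ji^(l) )^2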

The first term is the average squared error over all examples. The second term is a regularization term (also called weight decay), which tends to decrease the magnitude of the connection weights and helps prevent overfitting.

Our goal is to minimize J(W,b) as a function of W and b. To train the neural network, each parameter is initialized to a small random value near zero (for example, sampled from a normal distribution Normal(0, ε²) with ε set to 0.01), and then batch gradient descent is used to optimize the cost. Since J(W,b) is a non-convex function, gradient descent may converge to a local optimum, but in practice it usually achieves good results. Finally, note the importance of random initialization: the parameters must not all be initialized to 0. If all parameters start with equal values, then all hidden units compute identical outputs for every training input, their gradients are identical, and the parameters remain equal no matter how many updates are performed. Random initialization of each parameter breaks this symmetry.
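As a concrete illustration of this initialization scheme, here is a minimal NumPy sketch (not code from the tutorial; the layer sizes and names below are assumptions made for the example only):

```python
import numpy as np

def init_params(layer_sizes, epsilon=0.01, seed=0):
    """Draw every weight and bias from Normal(0, epsilon^2), as described above."""
    rng = np.random.default_rng(seed)
    # W[l] maps layer l to layer l+1; small random values break the symmetry between hidden units.
    W = [rng.normal(0.0, epsilon, size=(s_out, s_in))
         for s_in, s_out in zip(layer_sizes[:-1], layer_sizes[1:])]
    b = [rng.normal(0.0, epsilon, size=s_out)
         for s_out in layer_sizes[1:]]
    return W, b

# Hypothetical layer sizes for a small sparse autoencoder: 64 inputs, 25 hidden units, 64 outputs.
W, b = init_params([64, 25, 64])
```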

Each iteration of gradient descent updates the parameters W and b as follows:
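In the cited tutorial's notation, with W_ij^(l) the weight from unit j in layer l to unit i in layer l+1 and b_i^(l) the corresponding bias, the update is:

W_ij^(l) := W_ij^(l) − α · ∂J(W,b)/∂W_ij^(l)
b_i^(l) := b_i^(l) − α · ∂J(W,b)/∂b_i^(l)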

where α is the learning rate. The key step in the iteration above is computing these partial derivatives. Below we describe the backpropagation algorithm, which computes them efficiently.

From the overall loss function above, the partial derivatives can be written in terms of the per-example derivatives as follows:
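As given in the tutorial cited at the end of this post, the weight-decay term contributes only to the weight derivatives:

∂J(W,b)/∂W_ij^(l) = (1/m) · Σ_{k=1..m} ∂J(W,b; x^(k), y^(k))/∂W_ij^(l) + λ · W_ij^(l)
∂J(W,b)/∂b_i^(l) = (1/m) · Σ_{k=1..m} ∂J(W,b; x^(k), y^(k))/∂b_i^(l)

so it suffices to compute, for a single example (x, y), the derivatives of J(W,b; x, y); this is exactly what backpropagation does.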

The idea of the backpropagation algorithm is as follows. Given a training example (x, y), we first run a "forward pass" to compute the activation values of all nodes in the network, including the output values of the output nodes. Then, for node i of layer l, we compute a "residual" δ_i^(l), which measures how much that node contributed to the error in the output. For an output node we can measure the difference between the network's activation value and the true target value directly, which gives δ_i^(nl) (layer nl is the output layer). For a hidden node, δ_i^(l) is computed as a weighted average of the residuals of the nodes in layer l+1 that take the activation a_i^(l) as an input.

The steps of the backpropagation algorithm are given in detail below:

1. Run feedforward propagation to compute the activation values of all nodes in every layer.

2. For each node i of the output layer (layer nl), compute its residual:

δ_i^(nl) = −( y_i − a_i^(nl) ) · f'( z_i^(nl) )

It is important to note that z_i^(l) denotes the total weighted sum of inputs to node i of layer l (including the bias term), and f is the activation function, so that, for example, a_i^(l) = f(z_i^(l)). The output value of the hypothesis h_W,b(x) is simply the activation value a^(nl) of the output-layer nodes.

3. For l = nl−1, nl−2, ..., 2, compute the residual of each node i in layer l:

δ_i^(l) = ( Σ_{j=1..s_{l+1}} W_ji^(l) · δ_j^(l+1) ) · f'( z_i^(l) )

4. Compute the desired partial derivatives:

∂J(W,b; x, y)/∂W_ij^(l) = a_j^(l) · δ_i^(l+1)
∂J(W,b; x, y)/∂b_i^(l) = δ_i^(l+1)

The algorithm can be rewritten more compactly using matrix–vector operations. Below, "•" denotes element-wise multiplication (the ".*" operation in MATLAB), and f and f' are likewise applied element-wise to vectors, i.e. f'([z_1, z_2, z_3]) = [f'(z_1), f'(z_2), f'(z_3)].
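For instance (a purely illustrative numeric example), if a = (1, 2, 3) and b = (4, 5, 6), then a • b = (1·4, 2·5, 3·6) = (4, 10, 18).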

The BP algorithm is rewritten as follows:

1. Run feedforward propagation to compute the activations a^(l) of all nodes in every layer.

2. Compute the residual vector of the output layer (layer nl):

δ^(nl) = −( y − a^(nl) ) • f'( z^(nl) )

3. For l = nl−1, nl−2, ..., 2, compute the residual vector of layer l:

δ^(l) = ( (W^(l))^T · δ^(l+1) ) • f'( z^(l) )

4. Compute the desired partial derivatives:

∇_{W^(l)} J(W,b; x, y) = δ^(l+1) · ( a^(l) )^T
∇_{b^(l)} J(W,b; x, y) = δ^(l+1)

Note: In steps 2 and 3 above we need f'(z_i^(l)) for every node i. If f is the sigmoid activation function and the activation values of all nodes were stored during forward propagation, we can take advantage of the stored values and obtain f'(z_i^(l)) = a_i^(l) · (1 − a_i^(l)) without recomputing z_i^(l).


Derivation of the derivative of the sigmoid activation function: for the sigmoid f(z) = 1/(1 + exp(−z)), the derivative is f'(z) = f(z)(1 − f(z)). Because the forward pass already stores a_i^(l) = f(z_i^(l)), the derivative can be computed directly as a_i^(l)(1 − a_i^(l)); this is the shortcut used in the note above.
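To make the vectorized steps concrete, here is a minimal NumPy sketch for a single training example and a sigmoid network; it is an illustration only, and the function and variable names (backprop_single, W, b, and so on) are assumptions made for this sketch, not code from the tutorial.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_single(W, b, x, y):
    """Vectorized BP for one example (x, y).

    W[l], b[l] map layer l to layer l+1 (0-indexed). Returns the per-example
    gradients grad_W[l] = delta^(l+1) (a^(l))^T and grad_b[l] = delta^(l+1).
    """
    # Step 1: feedforward pass, storing every activation a^(l).
    a = [x]
    for Wl, bl in zip(W, b):
        a.append(sigmoid(Wl @ a[-1] + bl))

    # Step 2: residual of the output layer, using f'(z) = a (1 - a) for the sigmoid.
    delta = -(y - a[-1]) * a[-1] * (1.0 - a[-1])

    grad_W = [None] * len(W)
    grad_b = [None] * len(W)
    # Steps 3-4: propagate residuals backwards and form the gradients.
    for l in range(len(W) - 1, -1, -1):
        grad_W[l] = np.outer(delta, a[l])
        grad_b[l] = delta
        if l > 0:
            delta = (W[l].T @ delta) * a[l] * (1.0 - a[l])

    return grad_W, grad_b
```

The returned grad_W and grad_b correspond to ∇_{W^(l)} J(W,b; x, y) and ∇_{b^(l)} J(W,b; x, y) for a single example; the weight-decay term is added later, in the parameter update.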

Finally, the complete batch gradient descent procedure is given. In the pseudo-code below, ΔW^(l) is a matrix with the same dimensions as W^(l), and Δb^(l) is a vector with the same dimensions as b^(l).

1. For every layer l, set ΔW^(l) := 0 and Δb^(l) := 0 (a matrix and a vector of all zeros).

2. For i = 1 to m (looping over the training examples, the i-th being (x^(i), y^(i))):

A. Use backpropagation to compute ∇_{W^(l)} J(W,b; x^(i), y^(i)) and ∇_{b^(l)} J(W,b; x^(i), y^(i)).

B. Set ΔW^(l) := ΔW^(l) + ∇_{W^(l)} J(W,b; x^(i), y^(i)).

C. Set Δb^(l) := Δb^(l) + ∇_{b^(l)} J(W,b; x^(i), y^(i)).

3. Update the parameters:

W^(l) := W^(l) − α · [ (1/m) · ΔW^(l) + λ · W^(l) ]
b^(l) := b^(l) − α · [ (1/m) · Δb^(l) ]
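A minimal NumPy sketch of one such batch update, reusing the backprop_single function from the earlier sketch (again, the names, default learning rate, and weight-decay value are illustrative assumptions, not the tutorial's code):

```python
def batch_gradient_step(W, b, X, Y, alpha=0.1, lam=1e-4):
    """One iteration of batch gradient descent over m examples (columns of X and Y)."""
    m = X.shape[1]
    # Step 1: zero the accumulators Delta W^(l) and Delta b^(l).
    acc_W = [np.zeros_like(Wl) for Wl in W]
    acc_b = [np.zeros_like(bl) for bl in b]

    # Step 2: accumulate the per-example gradients computed by backpropagation.
    for i in range(m):
        grad_W, grad_b = backprop_single(W, b, X[:, i], Y[:, i])
        for l in range(len(W)):
            acc_W[l] += grad_W[l]
            acc_b[l] += grad_b[l]

    # Step 3: update the parameters, adding the weight-decay term for W only.
    for l in range(len(W)):
        W[l] -= alpha * (acc_W[l] / m + lam * W[l])
        b[l] -= alpha * (acc_b[l] / m)
    return W, b
```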

Now we can repeat the iterative steps of the gradient descent method to reduce the value of the loss function and train our neural network.
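For example, repeating the batch step from the sketches above might look like this (the synthetic data and iteration count are placeholders chosen for illustration; for a sparse autoencoder the targets equal the inputs):

```python
rng = np.random.default_rng(0)
X = rng.random((64, 200))      # 200 examples with 64 features each (one example per column)
Y = X                          # autoencoder: the network learns to reconstruct its input

W, b = init_params([64, 25, 64])
for it in range(400):          # repeat the gradient descent iteration
    W, b = batch_gradient_step(W, b, X, Y)
```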

Learning source: http://deeplearning.stanford.edu/wiki/index.php/Backpropagation_Algorithm
