Solving for the parameters of a neural network: the forward and backward propagation algorithms


For the fundamentals, refer to the earlier article on neural network basics; with that background in place, this article explains how the parameters of a neural network are solved for.

First, some notation:

The circle labeled "+1" is called the bias node; it corresponds to the intercept term.

In this example the neural network has parameters $(W, b)$, where $W^{(l)}_{ij}$ (in the equations below) is the weight on the connection between unit $j$ of layer $l$ and unit $i$ of layer $l+1$ (in other words, the weight on the connecting line; note the order of the subscripts), and $b^{(l)}_i$ is the bias of unit $i$ of layer $l+1$.

$s_l$ denotes the number of nodes in layer $l$ (not counting the bias unit).

$a^{(l)}_i$ denotes the activation (output value) of unit $i$ in layer $l$. For the input layer, $a^{(1)}_i = x_i$, the $i$-th feature of the input sample.

$z^{(l)}_i$ denotes the total weighted input (including the bias term) to unit $i$ of layer $l$; for example, $z^{(2)}_i = \sum_{j} W^{(1)}_{ij} x_j + b^{(1)}_i$, so that $a^{(l)}_i = f(z^{(l)}_i)$. Here $f(\cdot)$ is the activation function, usually the sigmoid function $f(z) = \frac{1}{1 + e^{-z}}$.

$h_{W,b}(x)$ denotes the final output of the network. If the network produces a single value, it is a scalar; if it produces more than one value (in which case the sample's label $y$ is also a vector), it is a vector.

For unit $i$ of layer $l+1$ we have:

$$z^{(l+1)}_i = \sum_{j=1}^{s_l} W^{(l)}_{ij}\, a^{(l)}_j + b^{(l)}_i, \qquad a^{(l+1)}_i = f\big(z^{(l+1)}_i\big)$$

Writing these formulas in vector form (and extending the activation function to apply element-wise, i.e. $f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]$) gives:

$$z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}, \qquad a^{(l+1)} = f\big(z^{(l+1)}\big)$$

This is why $W^{(l)}_{ij}$ was defined with the subscripts in the order above: the weights can be assembled into the matrix $W^{(l)}$ directly according to their subscripts, with no reordering or transposition needed.

Assuming $W$ and $b$ are known, we can compute the activation value of every unit in the network by iterating the last two formulas layer by layer.

The above calculation steps are called forward propagation.
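As a concrete illustration of the forward pass described above, here is a minimal NumPy sketch. The 3-3-1 layer sizes, the sigmoid choice, and all variable names are illustrative assumptions, not taken from the article's figure:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Return the activations of every layer, starting with the input layer."""
    activations = [x]
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b          # z^(l+1) = W^(l) a^(l) + b^(l)
        a = sigmoid(z)         # a^(l+1) = f(z^(l+1))
        activations.append(a)
    return activations

# Example with random parameters (for illustration only).
rng = np.random.default_rng(0)
weights = [rng.normal(0.0, 0.01, (3, 3)), rng.normal(0.0, 0.01, (1, 3))]
biases = [np.zeros(3), np.zeros(1)]
x = np.array([0.5, -0.2, 0.1])
print(forward(x, weights, biases)[-1])   # h_{W,b}(x)
```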

Because $W$ and $b$ are not known in advance, we have to learn these parameters from training data. How is this training done? We are given a set of training samples $(x, y)$ (note that $y$ is not necessarily a scalar; it may be a vector), and we use gradient descent to obtain converged parameter values by iteration.

In each iteration we treat the current $W$ and $b$ as known and compute all the activation values in the model with one forward propagation pass.

We then compute the squared error between the model's final output for each sample and that sample's label (if both are vectors, the squared error is the sum of the squared differences of the corresponding elements), sum this over all the samples, and add an L2 regularization term to obtain the cost function for the iteration:

$$J(W,b) = \frac{1}{m}\sum_{k=1}^{m} \frac{1}{2}\left\| h_{W,b}\big(x^{(k)}\big) - y^{(k)} \right\|^2 + \frac{\lambda}{2}\sum_{l=1}^{n_l-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}} \left(W^{(l)}_{ji}\right)^2$$

Although each of these neurons resembles a logistic regression unit, the cost function here does not use the likelihood-based form from logistic regression; the mean squared error is used instead. In fact, the two approaches give the same result (recall that the formula obtained from the likelihood function has the same form as the mean-squared-error formula). Note also that the L2 regularization term does not include the bias parameters $b^{(l)}_i$, which play the role of the constant (intercept) terms.
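Continuing the NumPy sketch above, a minimal illustration of this cost function might look as follows; forward() is the helper defined earlier, and lam is an assumed name for the weight-decay coefficient $\lambda$:

```python
def cost(samples, labels, weights, biases, lam):
    """Average squared-error over the samples plus the L2 weight-decay term."""
    m = len(samples)
    data_term = sum(0.5 * np.sum((forward(x, weights, biases)[-1] - y) ** 2)
                    for x, y in zip(samples, labels)) / m
    # Only the connection weights are penalized; the bias parameters are left out.
    reg_term = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)
    return data_term + reg_term
```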

If the activation function used in the model is the sigmoid function, then the output of every node in the final layer lies in $(0, 1)$, so the elements of the sample's label vector must also lie in $(0, 1)$; we therefore first need to rescale the sample labels in some way (for example, by normalization). If the model uses the tanh activation function instead, the outputs lie in $(-1, 1)$, and the label values must be rescaled to that range.
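One simple way to rescale labels into the required range is min-max normalization; the sketch below is an assumption about how this could be done, not a prescription from the article:

```python
def rescale_labels(y, y_min, y_max):
    """Map label values from [y_min, y_max] into [0, 1] for a sigmoid output layer."""
    return (y - y_min) / (y_max - y_min)
```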

Now that we have the cost function, we need to minimize it. We use gradient descent, whose update rules are the following two formulas:

$$W^{(l)}_{ij} := W^{(l)}_{ij} - \alpha \frac{\partial}{\partial W^{(l)}_{ij}} J(W,b), \qquad b^{(l)}_i := b^{(l)}_i - \alpha \frac{\partial}{\partial b^{(l)}_i} J(W,b) \qquad (1)$$

where $\alpha$ is the learning rate.
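A sketch of one such gradient-descent step, assuming the gradients grad_W and grad_b have already been computed with the same shapes as the parameters (alpha stands for the learning rate $\alpha$):

```python
def gradient_step(weights, biases, grad_W, grad_b, alpha):
    """Apply formula (1): move every W^(l) and b^(l) against its gradient, in place."""
    for l in range(len(weights)):
        weights[l] -= alpha * grad_W[l]
        biases[l] -= alpha * grad_b[l]
```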

However, we cannot obtain the partial derivative of the cost function with respect to each parameter directly, because the parameters are coupled through the network and there is no explicit expression for these derivatives; we have to take an indirect route. How, then, do we compute the partial derivatives? We use the backpropagation algorithm. Specifically:

We do not work with the overall cost function directly. Instead we handle one sample at a time: for a single sample the cost is the squared error $J(W,b;x,y) = \frac{1}{2}\left\|h_{W,b}(x) - y\right\|^2$, and backpropagation computes $\frac{\partial}{\partial W^{(l)}_{ij}} J(W,b;x,y)$ and $\frac{\partial}{\partial b^{(l)}_i} J(W,b;x,y)$ for that sample. Once the partial derivatives for every sample are available, the derivatives of the overall cost function follow:

$$\frac{\partial}{\partial W^{(l)}_{ij}} J(W,b) = \frac{1}{m}\sum_{k=1}^{m} \frac{\partial}{\partial W^{(l)}_{ij}} J\big(W,b;x^{(k)},y^{(k)}\big) + \lambda W^{(l)}_{ij}, \qquad \frac{\partial}{\partial b^{(l)}_i} J(W,b) = \frac{1}{m}\sum_{k=1}^{m} \frac{\partial}{\partial b^{(l)}_i} J\big(W,b;x^{(k)},y^{(k)}\big) \qquad (2)$$
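Formula (2) can be sketched in code as follows; backprop_single is assumed to return the per-sample derivatives and is itself sketched further below:

```python
def total_gradient(samples, labels, weights, biases, lam):
    """Average the per-sample derivatives and add the weight-decay term, per formula (2)."""
    m = len(samples)
    grad_W = [np.zeros_like(W) for W in weights]
    grad_b = [np.zeros_like(b) for b in biases]
    for x, y in zip(samples, labels):
        dW, db = backprop_single(x, y, weights, biases)
        for l in range(len(weights)):
            grad_W[l] += dW[l]
            grad_b[l] += db[l]
    grad_W = [gW / m + lam * W for gW, W in zip(grad_W, weights)]   # biases not regularized
    grad_b = [gb / m for gb in grad_b]
    return grad_W, grad_b
```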

Here is how the partial derivatives for a single sample are computed.

Given a sample $(x, y)$, we first perform a forward propagation pass to compute all the activation values in the network, including the output value $h_{W,b}(x)$.

At this point we do not differentiate with respect to each individual input; instead we treat the weighted input $z^{(l)}_i$ of each unit (the sum of all its weighted inputs plus the bias) as the intermediate quantity and consider the partial derivative of the per-sample cost with respect to it.

These partial derivatives are called the "residuals" of the units, denoted $\delta^{(l)}_i = \frac{\partial}{\partial z^{(l)}_i} J(W,b;x,y)$, and they are computed as follows:

(1) Residual of a unit in the last layer (the output layer):

$$\delta^{(n_l)}_i = \frac{\partial}{\partial z^{(n_l)}_i}\,\frac{1}{2}\big\|y - h_{W,b}(x)\big\|^2 = -\big(y_i - a^{(n_l)}_i\big)\, f'\big(z^{(n_l)}_i\big)$$

All the activation values in the network are available from forward propagation, so $f'(z^{(n_l)}_i)$ can also be evaluated; for example, when $f$ is the sigmoid function, $f'(z^{(n_l)}_i) = a^{(n_l)}_i\,(1 - a^{(n_l)}_i)$, and the residual is easily computed.
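In code, the output-layer residual for the sigmoid case reduces to a one-liner (a sketch consistent with the formula above, using the activations returned by forward()):

```python
def output_residual(a_out, y):
    """delta^(n_l): -(y - a) * f'(z), with f'(z) = a * (1 - a) for the sigmoid."""
    return -(y - a_out) * a_out * (1.0 - a_out)
```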

(2) Residuals of the units in the remaining layers:

We first derive the residuals of the units in layer $l = n_l - 1$:

$$\delta^{(n_l-1)}_i = \frac{\partial J(W,b;x,y)}{\partial z^{(n_l-1)}_i} = \sum_{k=1}^{s_{n_l}} \frac{\partial J}{\partial z^{(n_l)}_k}\cdot\frac{\partial z^{(n_l)}_k}{\partial z^{(n_l-1)}_i} = \sum_{k=1}^{s_{n_l}} \delta^{(n_l)}_k\,\frac{\partial}{\partial z^{(n_l-1)}_i}\Big(\sum_{j=1}^{s_{n_l-1}} W^{(n_l-1)}_{kj}\, f\big(z^{(n_l-1)}_j\big) + b^{(n_l-1)}_k\Big) = \sum_{k=1}^{s_{n_l}} \delta^{(n_l)}_k\, W^{(n_l-1)}_{ki}\, f'\big(z^{(n_l-1)}_i\big) = \Big(\sum_{k=1}^{s_{n_l}} W^{(n_l-1)}_{ki}\,\delta^{(n_l)}_k\Big) f'\big(z^{(n_l-1)}_i\big)$$

Because $a^{(n_l-1)}_j = f(z^{(n_l-1)}_j)$, the bracketed term is a composite function of $z^{(n_l-1)}_i$, so the chain rule is used to differentiate it.

From the third expression to the second-to-last one, the derivative is nonzero only for the term with $j = i$, leaving $W^{(n_l-1)}_{ki}$ as a factor, so this step is also easy to follow.

The meaning of the final result is that the residual of unit $i$ in layer $n_l - 1$ is the weighted sum of the residuals $\delta^{(n_l)}_k$ of the units it connects to, with the weights $W^{(n_l-1)}_{ki}$ as coefficients, multiplied by $f'(z^{(n_l-1)}_i)$.

Then the residuals of the units in every earlier layer $l = n_l - 2, n_l - 3, \ldots, 2$ are obtained in the same way:

Replacing the relationship between layers $n_l - 1$ and $n_l$ in the derivation above with the relationship between layers $l$ and $l+1$, we get

$$\delta^{(l)}_i = \Big(\sum_{j=1}^{s_{l+1}} W^{(l)}_{ji}\,\delta^{(l+1)}_j\Big) f'\big(z^{(l)}_i\big)$$
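The general hidden-layer residual can likewise be sketched directly from this formula (again assuming sigmoid activations, so that $f'(z^{(l)}) = a^{(l)}(1 - a^{(l)})$):

```python
def hidden_residual(W_l, delta_next, a_l):
    """delta^(l) = (W^(l))^T delta^(l+1), multiplied element-wise by f'(z^(l))."""
    return (W_l.T @ delta_next) * a_l * (1.0 - a_l)
```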

Once the residual of every unit has been calculated, the partial derivatives we need follow from the formulas:

$$\frac{\partial}{\partial W^{(l)}_{ij}} J(W,b;x,y) = a^{(l)}_j\,\delta^{(l+1)}_i, \qquad \frac{\partial}{\partial b^{(l)}_i} J(W,b;x,y) = \delta^{(l+1)}_i$$
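Putting the pieces together, a sketch of the per-sample backward pass (the backprop_single referred to earlier) could look like this; it returns the per-sample derivatives for every layer, with all names being illustrative:

```python
def backprop_single(x, y, weights, biases):
    """Forward pass, then back-propagate residuals and form the per-sample derivatives."""
    activations = forward(x, weights, biases)      # a^(1), ..., a^(n_l)
    delta = output_residual(activations[-1], y)    # residual of the output layer
    grad_W, grad_b = [], []
    for l in range(len(weights) - 1, -1, -1):
        grad_W.insert(0, np.outer(delta, activations[l]))   # dJ/dW^(l) = delta^(l+1) a^(l)^T
        grad_b.insert(0, delta)                              # dJ/db^(l) = delta^(l+1)
        if l > 0:
            delta = hidden_residual(weights[l], delta, activations[l])
    return grad_W, grad_b
```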

After the partial derivatives of the per-sample cost function have been computed for every sample, all the partial derivatives of the overall cost function follow from formula (2). Then, by formula (1), we obtain the parameters after one gradient-descent iteration; repeating this process, the parameters eventually converge.

It should be emphasized that the parameters must be initialized randomly at the start, not all set to the same value. If all parameters start from the same value, every hidden unit ends up computing the same function of the input (that is, all the $W^{(1)}_{ij}$ stay equal, so $a^{(2)}_1 = a^{(2)}_2 = \cdots$ for any input $x$). The purpose of random initialization is to break this symmetry. To solve a neural network we therefore initialize each parameter to a small random value near zero (for example, values drawn from a normal distribution $\mathrm{Normal}(0, \epsilon^2)$ with $\epsilon$ set to $0.01$).
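Finally, a sketch of random initialization and the outer training loop that ties the previous pieces together (the layer sizes, the 0.01 scale, and the hyper-parameter defaults are illustrative assumptions):

```python
def train(samples, labels, layer_sizes, alpha=0.1, lam=1e-4, iterations=1000):
    """Initialize parameters with small random values, then run gradient descent."""
    rng = np.random.default_rng(0)
    weights = [rng.normal(0.0, 0.01, (n_out, n_in))          # break symmetry with small noise
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
    biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]
    for _ in range(iterations):
        grad_W, grad_b = total_gradient(samples, labels, weights, biases, lam)
        gradient_step(weights, biases, grad_W, grad_b, alpha)
    return weights, biases
```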

Reference: http://deeplearning.stanford.edu/wiki/index.php/Backpropagation_Algorithm
