Principle and derivation of multi-layer neural network BP algorithm


First, what is an artificial neural network? Simply put, we take a single perceptron as a neural network node and connect such nodes into a layered network structure; we call this network an artificial neural network (my own understanding). When the network has three or more layers (an input layer, one or more hidden layers, and an output layer), we call it a multi-layer artificial neural network.

1. Choosing the neuron unit

So what kind of perceptron should we use as the neural network node? The previous article introduced the perceptron algorithm, but using it directly raises the following problems:

1) The output of the perceptron training rule is $o = \mathrm{sgn}(\vec{w} \cdot \vec{x})$.

Since the sign function is not continuous, it is not differentiable, so we cannot use the gradient descent algorithm to minimize the loss function.

2) The output of the delta (increment) rule is $o = \vec{w} \cdot \vec{x}$.

Each output is just a linear combination of the inputs, so when multiple linear units are connected together, the result is still only a linear combination of the inputs, which is not much different from a single perceptron node.
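For example, stacking two purely linear layers with weight matrices $W_1$ and $W_2$ gives $W_2 (W_1 \vec{x}) = (W_2 W_1)\,\vec{x}$, which is again a single linear map, so the extra layer adds no representational power.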

To solve these problems, on the one hand we cannot output the linear combination directly; we need to add a processing function at the output. On the other hand, the added processing function must be differentiable, so that we can use the gradient descent algorithm.

Many functions satisfy these conditions, but the most classic is the sigmoid function, also called the logistic function. It compresses any number into the interval (0, 1), so it is also called the squashing function. To normalize the input of this function, we add a threshold (bias) term to the linear combination of inputs so that the net input is centered around 0.

sigmoid function: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$

Its curve is shown in Figure 1.1.

Figure 1.1 sigmoid function curve [2]

One important property of this function is its derivative: $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$.

With this property, the gradient descent computation becomes much easier.

The hyperbolic tangent function tanh can also be used in place of the sigmoid function; the two curves are similar in shape.
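As a quick illustration of the squashing behaviour and the convenient derivative, here is a minimal Python/NumPy sketch (the function names are mine, not from the original article):

```python
import numpy as np

def sigmoid(x):
    # Squashes any real number into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)):
    # the derivative is expressed through the output itself,
    # which is what makes the gradient computation so cheap.
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0))             # 0.5, the value at the cutoff point of the net input
print(sigmoid_derivative(0.0))  # 0.25, the maximum slope
```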

2. The back-propagation (BP) algorithm

Now we can build a multi-layer neural network using the sigmoid units described above. For simplicity, we analyze a three-layer network here. Assume the network topology shown in Figure 2.1.

Figure 2.1 BP network topology [3]

The network operates as follows: when a sample is presented, its feature vector is taken as input; each perceptron's net input is computed from its weight vector, each perceptron's output is computed with the sigmoid function, and that output becomes the input of the next layer's perceptrons, and so on, until the output layer.
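As a hedged sketch of this forward pass for the three-layer case (NumPy, bias terms omitted for brevity; the names W_hidden and W_output are illustrative, not from the article):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W_hidden, W_output):
    """Propagate one sample through a 3-layer (single-hidden-layer) network.

    x        : input feature vector, shape (n_in,)
    W_hidden : hidden-layer weights, shape (n_hidden, n_in)
    W_output : output-layer weights, shape (n_out, n_hidden)
    """
    net_hidden = W_hidden @ x         # net input of each hidden unit
    o_hidden = sigmoid(net_hidden)    # hidden outputs, fed to the next layer
    net_output = W_output @ o_hidden  # net input of each output unit
    o_output = sigmoid(net_output)    # final network output
    return o_hidden, o_output
```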

So how do we determine the weight vector of each perceptron? Here we need the back-propagation algorithm to optimize them step by step. Before formally introducing the back-propagation algorithm, let us first analyze the problem.

In the previous article on the perceptron, we adjusted the weight vector by minimizing a loss function, and the same method applies here. First we need to define the loss function. Because the output layer of the network has multiple output nodes, we sum the squared difference over every output node, so the loss function for each training sample is as follows (the factor 0.5 in front makes the later differentiation convenient):

$E(\vec{w}) = \dfrac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$

where $t_k$ is the target output of output node $k$ and $o_k$ is its actual output.

In a multilayer neural network, the error surface may have multiple local minima, which means that a gradient descent algorithm may find a local minimum rather than the global minimum.

Now that we have a loss function, we can adjust the weight vectors entering the output nodes according to it, much as in the stochastic gradient descent algorithm of the perceptron, and then adjust the weights layer by layer from back to front; this is the idea of the back-propagation algorithm.

Back-propagation algorithm for feedforward networks with two layers of sigmoid units:

1) Randomly initialize all weights in the network.

2) For each training sample, perform the following actions:

A) Feed the instance's input into the network and compute the output of every unit from front to back; then, starting from the output layer, propagate the error term of every unit in every layer backwards.

B) For each unit $k$ of the output layer, compute its error term:

$\delta_k = o_k (1 - o_k)(t_k - o_k)$

C) For each hidden unit $h$ in the network, compute its error term:

$\delta_h = o_h (1 - o_h) \sum_{k \in outputs} w_{kh}\, \delta_k$

D) Update each weight:

$w_{ji} \leftarrow w_{ji} + \Delta w_{ji}$, where $\Delta w_{ji} = \eta\, \delta_j\, x_{ji}$

Symbol Description:

$x_{ji}$: the input from node $i$ to node $j$, and $w_{ji}$ denotes the corresponding weight. ($\eta$ is the learning rate and $\delta_j$ is the error term of unit $j$.)

$outputs$: the set of nodes in the output layer.

This algorithm is similar to the gradient descent algorithm of the delta rule; we analyze it as follows:

1) The weight update, like the delta rule, depends mainly on the learning rate, the input associated with the weight, and the error term of the unit.

2) For an output layer unit, its error term is $(t_k - o_k)$ multiplied by the derivative of the sigmoid function, $o_k (1 - o_k)$; this differs from the delta rule, whose error is simply $(t - o)$.

3) For a hidden layer unit, since there is no direct target value with which to compute its error, the error must be computed indirectly: take the weighted sum of the error terms of the units influenced by hidden unit $h$, where each error term is weighted by $w_{kh}$, the weight from hidden unit $h$ to output unit $k$. A code sketch of the full procedure is given below.
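Putting steps A)–D) together, the following is a minimal sketch of one stochastic-gradient training pass for a single-hidden-layer network, written directly from the error terms above (NumPy; variable names are illustrative and biases are omitted for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bp_train_step(x, t, W_hidden, W_output, eta=0.1):
    """One back-propagation update for a single training sample (x, t)."""
    # A) Forward pass: compute every unit's output from front to back.
    o_hidden = sigmoid(W_hidden @ x)
    o_output = sigmoid(W_output @ o_hidden)

    # B) Error term of each output unit k: delta_k = o_k (1 - o_k) (t_k - o_k).
    delta_output = o_output * (1.0 - o_output) * (t - o_output)

    # C) Error term of each hidden unit h:
    #    delta_h = o_h (1 - o_h) * sum_k w_kh * delta_k.
    delta_hidden = o_hidden * (1.0 - o_hidden) * (W_output.T @ delta_output)

    # D) Update every weight: w_ji <- w_ji + eta * delta_j * x_ji.
    W_output += eta * np.outer(delta_output, o_hidden)
    W_hidden += eta * np.outer(delta_hidden, x)
    return W_hidden, W_output
```

Randomly initializing the weights (step 1) and looping this update over the training samples reproduces the procedure above.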

3. Derivation of the back-propagation algorithm

The derivation of the algorithm is essentially the process of minimizing the loss function with gradient descent. The loss function for a training sample $d$ is:

$E_d(\vec{w}) = \dfrac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$

For each weight $w_{ji}$ in the network, we compute the derivative of the loss with respect to it. Since $w_{ji}$ influences the network only through the net input $net_j = \sum_i w_{ji} x_{ji}$ of unit $j$:

$\dfrac{\partial E_d}{\partial w_{ji}} = \dfrac{\partial E_d}{\partial net_j} \cdot \dfrac{\partial net_j}{\partial w_{ji}} = \dfrac{\partial E_d}{\partial net_j} \cdot x_{ji}$

1) If $j$ is an output layer unit of the network

The net input $net_j$ affects the error only through the output $o_j$, so differentiating with respect to $net_j$ gives:

$\dfrac{\partial E_d}{\partial net_j} = \dfrac{\partial E_d}{\partial o_j} \cdot \dfrac{\partial o_j}{\partial net_j}$

where

$\dfrac{\partial E_d}{\partial o_j} = \dfrac{\partial}{\partial o_j} \dfrac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2 = -(t_j - o_j)$

$\dfrac{\partial o_j}{\partial net_j} = \dfrac{\partial\, \sigma(net_j)}{\partial net_j} = o_j (1 - o_j)$

So we have:

$\dfrac{\partial E_d}{\partial net_j} = -(t_j - o_j)\, o_j (1 - o_j)$

To make the expression concise, we write:

$\delta_j = -\dfrac{\partial E_d}{\partial net_j} = (t_j - o_j)\, o_j (1 - o_j)$

The weights change in the direction of the negative gradient of the loss function, hence the amount by which the weight changes is:

$\Delta w_{ji} = -\eta\, \dfrac{\partial E_d}{\partial w_{ji}} = \eta\, \delta_j\, x_{ji}$

2) If $j$ is a hidden unit of the network

Because the weight of a hidden unit affects the error only indirectly, through the units of the next layer, the derivative is taken layer by layer with the chain rule:

$\dfrac{\partial E_d}{\partial net_j} = \sum_{k \in Downstream(j)} \dfrac{\partial E_d}{\partial net_k} \cdot \dfrac{\partial net_k}{\partial net_j}$

where $Downstream(j)$ denotes the set of units in the next layer whose inputs include the output of unit $j$.

Because:

$\dfrac{\partial net_k}{\partial net_j} = \dfrac{\partial net_k}{\partial o_j} \cdot \dfrac{\partial o_j}{\partial net_j} = w_{kj}\, o_j (1 - o_j)$ and $\dfrac{\partial E_d}{\partial net_k} = -\delta_k$

So:

$\dfrac{\partial E_d}{\partial net_j} = -\, o_j (1 - o_j) \sum_{k \in Downstream(j)} \delta_k\, w_{kj}$

Again, we write:

$\delta_j = o_j (1 - o_j) \sum_{k \in Downstream(j)} \delta_k\, w_{kj}$

So the amount by which the weight changes is:

$\Delta w_{ji} = \eta\, \delta_j\, x_{ji}$
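One way to convince yourself of this derivation is a numerical gradient check: perturb a single weight, recompute the loss, and compare the finite-difference slope with the analytic value $-\delta_j x_{ji}$. A small sketch under the same assumptions as the earlier snippets (illustrative shapes, single sample):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(x, t, W_hidden, W_output):
    # E_d = 0.5 * sum_k (t_k - o_k)^2 for a single sample.
    o_hidden = sigmoid(W_hidden @ x)
    o_output = sigmoid(W_output @ o_hidden)
    return 0.5 * np.sum((t - o_output) ** 2)

rng = np.random.default_rng(0)
x, t = rng.random(3), rng.random(2)
W_hidden, W_output = rng.random((4, 3)), rng.random((2, 4))

# Analytic derivative for one output-layer weight: dE/dw_kh = -delta_k * x_kh,
# where x_kh is the hidden output feeding that weight.
o_hidden = sigmoid(W_hidden @ x)
o_output = sigmoid(W_output @ o_hidden)
delta_output = o_output * (1.0 - o_output) * (t - o_output)
analytic = -delta_output[0] * o_hidden[0]

# Numerical derivative for the same weight by central finite differences.
eps = 1e-6
W_plus, W_minus = W_output.copy(), W_output.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numerical = (loss(x, t, W_hidden, W_plus) - loss(x, t, W_hidden, W_minus)) / (2 * eps)

print(analytic, numerical)  # the two values agree to several decimal places
```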

4. Improvements to the algorithm

The back-propagation algorithm is applied very widely, and to meet various needs many variants have been produced. Two of them are described below:

1) Adding a momentum term

This method mainly modifies the weight update rule. Its main idea is to make the weight update of the $n$-th iteration depend partly on the update made in the $(n-1)$-th iteration:

$\Delta w_{ji}(n) = \eta\, \delta_j\, x_{ji} + \alpha\, \Delta w_{ji}(n-1)$

where $0 \le \alpha < 1$ is a coefficient called the momentum. Adding momentum increases the effective search step to some extent, so convergence is faster. On the other hand, because the loss function of a multilayer network easily converges to a local minimum, the momentum term can sometimes carry the search through narrow local minima and reach a lower point.
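A minimal sketch of the momentum update built on top of the earlier training step (the array velocity holds the previous iteration's update $\Delta w(n-1)$; names and default values are illustrative):

```python
import numpy as np

def momentum_update(W, grad_term, velocity, eta=0.1, alpha=0.9):
    """Weight update with a momentum term.

    grad_term : the plain BP update direction delta_j * x_ji,
                arranged in the same shape as W.
    velocity  : Delta w from the previous iteration (same shape as W).
    """
    # Delta w(n) = eta * delta_j * x_ji + alpha * Delta w(n-1)
    velocity = eta * grad_term + alpha * velocity
    W += velocity
    return W, velocity
```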

2) Learning in acyclic (feedforward) networks of arbitrary depth

The back-propagation algorithm introduced above actually covers only three layers, that is, a single hidden layer. How should networks with many hidden layers be handled?

Suppose the neural network has $m+2$ layers, that is, $m$ hidden layers. In this case a back-propagation algorithm for $m$ hidden layers is obtained by changing only one place: the error term of unit $r$ in layer $k$ is computed from the error terms of the deeper layer $k+1$:

$\delta_r = o_r (1 - o_r) \sum_{s \in \text{layer } k+1} w_{sr}\, \delta_s$
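A hedged sketch of this backward sweep for an arbitrarily deep feedforward network, assuming a forward pass has already stored every layer's activations (the list and index conventions below are mine, not from the article):

```python
import numpy as np

def backward_deltas(outputs, weights, t):
    """Compute the error term of every layer in an arbitrarily deep network.

    outputs : list of activation vectors; outputs[0] is the input layer,
              outputs[-1] the output layer (all computed by a forward pass).
    weights : weights[k] maps layer k to layer k+1,
              shape (len(outputs[k+1]), len(outputs[k])).
    t       : target vector for the output layer.
    """
    o_out = outputs[-1]
    deltas = [o_out * (1.0 - o_out) * (t - o_out)]  # output-layer error terms
    # Walk backwards: the deltas of layer k come from the deltas of layer k+1.
    for k in range(len(outputs) - 2, 0, -1):
        o_k = outputs[k]
        deltas.insert(0, o_k * (1.0 - o_k) * (weights[k].T @ deltas[0]))
    return deltas  # deltas[0] belongs to the first hidden layer
```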

5. Summary

The back-propagation algorithm is summarized mainly from the following aspects:

1) Local minima

For multilayer networks, the error surface may contain several different local minima, and gradient descent may get stuck in one of them. Methods for mitigating local minima mainly include adding a momentum term, using stochastic gradient descent, and training the network with several different initial weights.

2) Too many weights

The more hidden nodes and layers there are, the more weights the network has. More weights mean a higher-dimensional hypothesis space, and this high dimensionality easily causes overfitting in the later stages of training.

3) Algorithm termination policy

Training can be terminated when the number of iterations reaches a preset limit, when the loss function falls below a set threshold, or when some other stopping criterion is met.

4) Overfitting

When the network is trained too many times, overfitting may occur. Two main methods address this: one is weight decay, that is, shrinking every weight by a small factor in each iteration; the other is to use a validation set to find the weights that minimize the validation error, and to use cross-validation when the training set is small. A small sketch of the weight-decay update follows.
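For the first method, a minimal weight-decay sketch (the decay factor is illustrative; the article only says each weight is shrunk by a small factor every iteration):

```python
def weight_decay_update(W, grad_term, eta=0.1, decay=1e-4):
    """Ordinary BP update followed by shrinking every weight slightly."""
    W += eta * grad_term   # the usual back-propagation step
    W *= (1.0 - decay)     # weight decay: pull all weights toward zero
    return W
```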

In addition, there are many other issues in neural networks that could be discussed, such as the number of hidden nodes and whether the step size (learning rate) is fixed; they are not discussed here.

Outlook:

There has been much research on neural networks, and many new extended algorithms have been produced, such as convolutional neural networks, deep neural networks, and spiking neural networks. In particular, spiking neural networks are called the third-generation neural networks. These algorithms will find more and more applications in the future; for example, deep neural networks have achieved very good results in image recognition, speech recognition, and other fields.

Finally, this article mainly follows Mitchell's Machine Learning textbook; corrections are welcome if there are mistakes!

Reference documents:

[1] Tom M. Mitchell, Machine Learning.

[2] Keep going, http://www.cnblogs.com/startover/p/3143763.html

[3] Hahack, http://hahack.com/reading/ann2/
