Learning Pattern Recognition and Machine Learning (PRML), Chapter 5.2-5.3: Neural Networks, network training (the BP algorithm)

Source: Internet
Author: User

This is the core of Chapter 5. It focuses on the method used to train neural networks, the backpropagation (BP) algorithm. In the nearly 30 years since it was proposed the algorithm has barely changed; it is a true classic and one of the cornerstones of deep learning. As before, what follows is basic reading notes (sentence-level translation plus my own understanding): I go through the contents of the book, record the what and the why, and keep the notes for later use.

5.2 Network Training

We can think of a neural network (NN) as a general nonlinear function that transforms an input vector x into an output vector y, by analogy with the polynomial curve-fitting problem in Chapter 1. Given a set of input vectors {x_n} and a set of target vectors {t_n}, the sum-of-squares error function is defined as:
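For reference, a sketch of this error function in PRML's notation, where y(x_n, w) is the network output for input x_n under weights w, and the equation number follows the book:

$$
E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left\| \mathbf{y}(\mathbf{x}_n, \mathbf{w}) - \mathbf{t}_n \right\|^2 \tag{5.11}
$$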

The main point of this section is that the error function can also be derived from maximum likelihood estimation; see (5.12)-(5.14). This part is straightforward, so I do not go through it in detail.

(Case 1) For regression, the output activation function can be the identity, i.e. y_k = a_k.

(Case 2) The output can also be that of a binary classification model, as in logistic regression (see the logistic regression material in Chapter 4), which handles a single two-class problem.

The conditional distribution of a sample's class label given the input is a Bernoulli distribution:

The error function defined on the dataset is cross-entropy:
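As a reference for case 2, the Bernoulli conditional distribution and the resulting cross-entropy can be written as below. This is a sketch in PRML's notation, assuming a single output y(x, w) in (0, 1), targets t in {0, 1}, and y_n as shorthand for y(x_n, w):

$$
p(t \mid \mathbf{x}, \mathbf{w}) = y(\mathbf{x}, \mathbf{w})^{t} \, \{1 - y(\mathbf{x}, \mathbf{w})\}^{1 - t} \tag{5.20}
$$

$$
E(\mathbf{w}) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln (1 - y_n) \} \tag{5.21}
$$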

PRML cites empirical results showing that, for classification problems, using the cross-entropy instead of the sum-of-squares error as the objective function leads to better generalization and faster training.

(Case 3) If the task consists of K separate binary classifications, the conditional distribution above becomes:

Error function:
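For case 3, a sketch of both expressions in PRML's notation, assuming K sigmoid outputs y_k, binary targets t_k, and y_nk as shorthand for y_k(x_n, w):

$$
p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}) = \prod_{k=1}^{K} y_k(\mathbf{x}, \mathbf{w})^{t_k} \, [1 - y_k(\mathbf{x}, \mathbf{w})]^{1 - t_k} \tag{5.22}
$$

$$
E(\mathbf{w}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} \{ t_{nk} \ln y_{nk} + (1 - t_{nk}) \ln (1 - y_{nk}) \} \tag{5.23}
$$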

Note the weight sharing here: the first-layer weights of the network are shared by all the units in the output layer, whereas in a linear model each classification problem is solved independently. This sharing can save some computation and improve generalization.

(Case 4) When the classification problem is not a set of independent labels but 1-of-K, meaning that the classes are mutually exclusive, we need to use softmax classification:

The error function defined on the dataset is:

where the softmax activation function is defined as
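For case 4, a sketch of the multiclass cross-entropy and the softmax outputs in PRML's notation, assuming 1-of-K target vectors with components t_kn and output-unit inputs a_k:

$$
E(\mathbf{w}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \ln y_k(\mathbf{x}_n, \mathbf{w}) \tag{5.24}
$$

$$
y_k(\mathbf{x}, \mathbf{w}) = \frac{\exp(a_k(\mathbf{x}, \mathbf{w}))}{\sum_{j} \exp(a_j(\mathbf{x}, \mathbf{w}))} \tag{5.25}
$$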

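The book then points out an invariance of softmax: the outputs are unchanged if the same constant is added to every a_k, so the error function is constant along some directions in weight space. This degeneracy disappears once a suitable regularization term is added to the error function.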
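The shift invariance of softmax is easy to check numerically. The following is a minimal NumPy sketch, not code from the book; the function names `softmax` and `cross_entropy` and the sample values are chosen here for illustration only.

```python
import numpy as np

def softmax(a):
    """Softmax over the last axis. Subtracting the row maximum is itself
    an application of the shift invariance and avoids overflow in exp."""
    shifted = a - np.max(a, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

def cross_entropy(y, t):
    """Multiclass cross-entropy for 1-of-K targets t and probabilities y."""
    return -np.sum(t * np.log(y))

# Illustrative values: output-unit inputs a_k for two samples, three classes.
a = np.array([[2.0, 1.0, 0.1],
              [0.5, 0.5, 3.0]])
t = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])

y = softmax(a)
y_shifted = softmax(a + 100.0)   # add the same constant to every a_k

# Both checks confirm that outputs and error do not change under the shift.
print(np.allclose(y, y_shifted))
print(np.isclose(cross_entropy(y, t), cross_entropy(y_shifted, t)))
```

Because the error does not change along this direction, the minimizer is not unique without regularization, which is the degeneracy the text refers to.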

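To summarize: each type of problem has a natural pairing of output activation function and error function. Regression uses linear outputs with the sum-of-squares error; one or several independent binary classifications use logistic sigmoid outputs with the cross-entropy error; multiclass (1-of-K) classification uses softmax outputs with the multiclass cross-entropy error.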

Let's talk about the optimization method:

5.2.4 Gradient descent optimization

The gradient descent (GD) update formula is:
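A sketch of the update in PRML's notation, where the learning rate η > 0 and the iteration index τ are the book's symbols:

$$
\mathbf{w}^{(\tau + 1)} = \mathbf{w}^{(\tau)} - \eta \, \nabla E(\mathbf{w}^{(\tau)}) \tag{5.41}
$$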

This is also called the batch method: the gradient is defined on the entire data set, so every iteration needs the whole data set. At each step the weights move in the direction in which the error function decreases fastest (the negative gradient), which is why this is called gradient descent, or steepest descent. However, this method easily gets stuck in local optima, as in the following illustration from Leftnoteasy:

We start at a random position and want to reach the lowest point of the objective, but in practice we cannot know whether we have found the global optimum. There are more efficient ways to optimize in batch mode, such as conjugate gradients and quasi-Newton methods. To find a sufficiently good minimum, it may be necessary to run the algorithm several times, each time from a different randomly chosen starting point, and to compare the results on a validation set.

There is also an on-line version of gradient descent (also called sequential gradient descent or stochastic gradient descent), which has proved very useful in practice for training neural networks. The error function defined on the dataset is the sum of the error functions of the individual samples:

So, the update formula for on-line GD is:
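A sketch of the two expressions in PRML's notation, where E_n is the error on sample n and η is the learning rate:

$$
E(\mathbf{w}) = \sum_{n=1}^{N} E_n(\mathbf{w}) \tag{5.42}
$$

$$
\mathbf{w}^{(\tau + 1)} = \mathbf{w}^{(\tau)} - \eta \, \nabla E_n(\mathbf{w}^{(\tau)}) \tag{5.43}
$$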

Each update is based on a single sample, taken either by cycling through the data in sequence or by selecting samples at random with replacement. There is also an intermediate scheme between on-line GD and batch GD, in which each update is based on a batch of samples. The benefits of on-line GD are a lower computational cost per update and the ability to escape from some local optima.
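To make the difference between the batch and on-line updates concrete, here is a minimal NumPy sketch on a toy linear least-squares problem. The data, the learning rates, the step counts and the names `batch_gd`, `online_gd` and `grad_En` are all assumptions made for this illustration, not values from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed for illustration): t_n = w_true . x_n + small noise.
X = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.5])
t = X @ w_true + 0.01 * rng.normal(size=200)

def grad_En(w, x_n, t_n):
    """Gradient of the per-sample error E_n = 0.5 * (w . x_n - t_n)^2."""
    return (w @ x_n - t_n) * x_n

def batch_gd(w, eta=0.002, steps=200):
    """Batch GD: every step uses the gradient of E = sum_n E_n."""
    for _ in range(steps):
        grad = X.T @ (X @ w - t)          # sum of all per-sample gradients
        w = w - eta * grad
    return w

def online_gd(w, eta=0.02, steps=4000):
    """On-line (stochastic) GD: every step uses one sample,
    drawn at random with replacement."""
    for _ in range(steps):
        n = rng.integers(len(X))
        w = w - eta * grad_En(w, X[n], t[n])
    return w

w_batch = batch_gd(np.zeros(3))
w_online = online_gd(np.zeros(3))
print(w_batch)
print(w_online)
```

Both runs should end close to `w_true`; the on-line version touches one sample per update, so each update is far cheaper, at the price of a noisier trajectory.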

5.3 Error Backpropagation

In this section, we discuss an efficient method for evaluating the gradient of the error function E(w) of a feed-forward network, known as the error backpropagation algorithm, or simply backprop.

It is worth noting that the term backpropagation is used in several ways elsewhere. For example, the multilayer perceptron (MLP) is often called a backpropagation network, and backpropagation is also used to mean training an MLP by gradient descent. In fact, most training algorithms involve an iterative procedure for minimizing an objective function, and each iteration has two stages. In the first, the derivatives of the error function with respect to the weights are evaluated; BP provides a fast and efficient way to compute all of these derivatives. In the second, the derivatives are used to update the weights, most commonly by gradient descent. The two stages are independent of each other. This means that the idea behind BP is not limited to networks such as the MLP, nor to the mean-square error function; it can be used with many other models and error functions.

5.3.1 Evaluation of Error-function derivatives

Next, let's derive the BP algorithm for a feed-forward network of arbitrary topology, with arbitrary differentiable nonlinear activation functions and a broad class of error functions (so the setting is very general). The derivation is then illustrated with a neural network that has a single hidden layer and a sum-of-squares error function.

A common form of error function, defined on an i.i.d. (independent and identically distributed) dataset, is a sum over the samples:

Below we consider the gradient of a single term E_n of the error function. The result can be used directly for sequential optimization, or summed over the samples for batch optimization. (Note: sequential optimization is what is now widely known as stochastic gradient descent.)

First, let's consider the simplest case, a linear model whose outputs are:

Here y_k is the k-th output for a sample x (assuming the output layer has more than one unit) and is a linear combination of the components x_i of x. We then define the error function for a single sample x_n as:

The gradient of this error function with respect to a weight w_ji is:

This result can be seen as a "local" computation: it is a product of two parts, an error signal associated with the output end of the weight and the variable associated with the input end of the weight. The same form also appears in logistic regression (Section 4.3.2) and similarly with softmax, and, as we will see, it carries over to the more general multilayer feed-forward network.
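The linear-model case described above can be sketched in PRML's notation as follows, with y_nk denoting y_k(x_n, w). The error signal is (y_nj - t_nj) and the input-side variable is x_ni:

$$
y_k = \sum_{i} w_{ki} x_i \tag{5.45}
$$

$$
E_n = \frac{1}{2} \sum_{k} (y_{nk} - t_{nk})^2 \tag{5.46}
$$

$$
\frac{\partial E_n}{\partial w_{ji}} = (y_{nj} - t_{nj}) \, x_{ni} \tag{5.47}
$$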

In a feed-forward network of general structure, each unit (other than the input units) computes a weighted sum of its inputs:

where z_i is the activation (output) of a unit, also called a node, that sends a connection to unit j, and w_ji is the weight of that connection. As introduced in the earlier PRML 5.1 notes, the bias can be absorbed into this sum by adding an extra input unit whose activation is fixed at +1, so we treat the bias like any other weight to be estimated. The activation of unit j is then obtained by applying a nonlinear activation function h to a_j:

These two formulas describe how a unit computes its input and then its output activation. For each sample in the training set, we apply the two formulas in turn to compute the activations of all the hidden and output units. This process is called forward propagation, since it can be viewed as a forward flow of information through the network.
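The two formulas of forward propagation, sketched in PRML's notation, where h is the unit's nonlinear activation function:

$$
a_j = \sum_{i} w_{ji} z_i \tag{5.48}
$$

$$
z_j = h(a_j) \tag{5.49}
$$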

Next we derive the derivative of the error function E_n with respect to a weight w_ji. The values at each unit depend on the particular sample n, but to keep the notation clear we omit the subscript n. E_n depends on the weight w_ji only through the summed input a_j to unit j, so we apply the chain rule:

Define delta_j as the derivative of E_n with respect to a_j.

This quantity is important and is usually referred to as the error (or residual). From (5.48) we obtain:

So you can get:

As with the linear model discussed earlier, the derivative is the product of the error delta at the output end of the weight and the value z at the input end of the weight (z = 1 for a bias term). Therefore, the key is to compute the value of delta for every hidden unit and output unit in the network.
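The chain-rule steps just described can be sketched in PRML's notation as:

$$
\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}} \tag{5.50}
$$

$$
\delta_j \equiv \frac{\partial E_n}{\partial a_j} \tag{5.51}
$$

$$
\frac{\partial a_j}{\partial w_{ji}} = z_i \tag{5.52}
$$

$$
\frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i \tag{5.53}
$$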

For the output layer, each output unit k gives:

Note: this follows directly from (5.46). The book derives it for linear outputs, i.e. y_k = a_k. If the output is not linear but y_k = f(a_k), then with the sum-of-squares error the result must also be multiplied by the derivative f'(a_k).

For the hidden layers, we use the chain rule:

Here the sum runs over all units k in the next layer to which unit j sends connections. The formula expresses the fact that a_j affects the error function only through the variables a_k. Substituting the definition (5.51) and using (5.48)-(5.49), we get

This is called the backpropagation formula. Here you can see why the method is called "backpropagation", which Figure 5.7 illustrates further: the errors propagate backwards from the output layer, layer by layer. The errors are computed first at the output layer, where they give the derivatives for the last layer of weights, and are then propagated back to the earlier layers.
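The output-layer and hidden-layer results can be sketched in PRML's notation as follows, where w_kj is the weight from unit j to unit k in the next layer:

$$
\delta_k = y_k - t_k \tag{5.54}
$$

$$
\delta_j \equiv \frac{\partial E_n}{\partial a_j} = \sum_{k} \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j} \tag{5.55}
$$

$$
\delta_j = h'(a_j) \sum_{k} w_{kj} \delta_k \tag{5.56}
$$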

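Finally, we summarize the steps of the BP algorithm (being lazy, I borrow the book's summary directly): (1) apply an input vector x_n and forward propagate through the network to find the activations of all the hidden and output units; (2) evaluate delta_k for all the output units; (3) backpropagate the deltas to obtain delta_j for each hidden unit; (4) multiply delta_j by z_i to obtain the required derivatives.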

The idea is clear. For batch methods such as conventional gradient descent, the derivatives for all the samples must be summed and the total used in the gradient descent update.

5.3.2 A Simple Example

Here is a slightly more concrete example: a two-layer neural network (Figure 5.1) with linear outputs and a sum-of-squares error, whose hidden units use the hyperbolic tangent activation function tanh,

and derivative:

The error function for a single sample n is defined as:


where y_k is the output value and t_k is the target value. The forward propagation process can be described as:

Then the residuals of each output neuron and the residual of the hidden layer neurons are calculated:

Finally, the derivatives with respect to the first-layer and second-layer weights are obtained and used in the gradient descent update.
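The whole example fits in a few lines of NumPy. The sketch below follows the steps described above for a single sample: forward propagation with tanh hidden units (whose derivative is 1 - z_j^2) and linear outputs, output errors delta_k = y_k - t_k, hidden errors delta_j = (1 - z_j^2) * sum_k w_kj delta_k, and finally the weight derivatives. The layer sizes, the random values and the names `forward` and `backprop` are assumptions made here; the second-layer bias is left out to keep the code short.

```python
import numpy as np

def forward(x, W1, W2):
    """Forward propagation: tanh hidden units, linear output units."""
    a = W1 @ x            # a_j = sum_i w_ji^(1) x_i
    z = np.tanh(a)        # z_j = tanh(a_j)
    y = W2 @ z            # y_k = sum_j w_kj^(2) z_j
    return z, y

def backprop(x, t, W1, W2):
    """Gradients of E_n = 0.5 * sum_k (y_k - t_k)^2 for one sample."""
    z, y = forward(x, W1, W2)
    delta_k = y - t                               # output-unit errors
    delta_j = (1.0 - z**2) * (W2.T @ delta_k)     # hidden-unit errors
    grad_W1 = np.outer(delta_j, x)                # dE_n/dw_ji^(1) = delta_j * x_i
    grad_W2 = np.outer(delta_k, z)                # dE_n/dw_kj^(2) = delta_k * z_j
    return grad_W1, grad_W2

# Illustrative sizes: D inputs, M hidden units, K outputs.
rng = np.random.default_rng(0)
D, M, K = 3, 4, 2
x = np.append(rng.normal(size=D), 1.0)   # extra input fixed at +1 carries the first-layer bias
t = rng.normal(size=K)
W1 = 0.5 * rng.normal(size=(M, D + 1))
W2 = 0.5 * rng.normal(size=(K, M))

grad_W1, grad_W2 = backprop(x, t, W1, W2)
print(grad_W1.shape, grad_W2.shape)
```

For on-line gradient descent these two gradient arrays are used directly in the update; for batch gradient descent they are first summed over all samples.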

5.3.3 Efficiency of Backpropagation

One of the main issues with neural networks is the computational cost. In the model above, if there are W weights in total (connections, or synapses), then a single forward propagation costs O(W), and W is generally far greater than the number of units.

As (5.48) shows, each weight requires one multiplication and one addition per forward propagation.

Another way to obtain the derivatives is numerical, using finite differences:

Accuracy is the issue with numerical differentiation: we can improve it by making epsilon smaller, until we run into the limits of numerical precision. Using symmetrical central differences greatly reduces this accuracy problem:

But this takes roughly twice as much computation as (5.68). More importantly, numerical differentiation cannot reuse earlier intermediate results: each derivative has to be computed independently with its own forward propagations, so the cost cannot be reduced. With W weights and O(W) per forward propagation, the total is O(W^2), compared with O(W) for backpropagation.
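The two finite-difference formulas compared above can be sketched in PRML's notation as follows, where ε is a small perturbation applied to one weight while all other weights are held fixed:

$$
\frac{\partial E_n}{\partial w_{ji}} = \frac{E_n(w_{ji} + \epsilon) - E_n(w_{ji})}{\epsilon} + O(\epsilon) \tag{5.68}
$$

$$
\frac{\partial E_n}{\partial w_{ji}} = \frac{E_n(w_{ji} + \epsilon) - E_n(w_{ji} - \epsilon)}{2 \epsilon} + O(\epsilon^2) \tag{5.69}
$$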

Interestingly, numerical derivatives are useful in another place: gradient checking! We can compare the central-difference results with the derivatives computed by the BP algorithm to verify that the BP implementation is correct.
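A gradient check along these lines might look like the following NumPy sketch. It assumes the same kind of small network as in the simple example (tanh hidden units, linear outputs, sum-of-squares error, biases omitted); the names `error`, `backprop` and `numerical_grad`, the sizes and the step `eps` are choices made here for illustration.

```python
import numpy as np

def error(x, t, W1, W2):
    """E_n = 0.5 * sum_k (y_k - t_k)^2 with tanh hidden units, linear outputs."""
    y = W2 @ np.tanh(W1 @ x)
    return 0.5 * np.sum((y - t) ** 2)

def backprop(x, t, W1, W2):
    """Analytic gradients from backpropagation for one sample."""
    z = np.tanh(W1 @ x)
    delta_k = W2 @ z - t
    delta_j = (1.0 - z**2) * (W2.T @ delta_k)
    return np.outer(delta_j, x), np.outer(delta_k, z)

def numerical_grad(f, W, eps=1e-6):
    """Central differences: perturb one weight at a time in place,
    with two error evaluations per weight."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        old = W[idx]
        W[idx] = old + eps
        e_plus = f()
        W[idx] = old - eps
        e_minus = f()
        W[idx] = old                      # restore the original weight
        grad[idx] = (e_plus - e_minus) / (2.0 * eps)
    return grad

rng = np.random.default_rng(1)
x, t = rng.normal(size=3), rng.normal(size=2)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))

g1, g2 = backprop(x, t, W1, W2)
f = lambda: error(x, t, W1, W2)           # reads the current W1 and W2 when called
n1, n2 = numerical_grad(f, W1), numerical_grad(f, W2)

max_diff = max(np.max(np.abs(g1 - n1)), np.max(np.abs(g2 - n2)))
print(max_diff)                           # should be tiny if backprop is correct
```

The check is slow, since it needs two forward propagations per weight, so it is normally run on a small network or a handful of weights rather than during training.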
