Neural Networks: Those Things (II)

Source: Internet
Author: User

In the previous article, we saw how neural networks use gradient descent to learn their weights and biases. However, we left a gap in the explanation: we did not discuss how to compute the gradient of the loss function. This article explains the well-known BP algorithm, a fast algorithm for computing gradients.

The backpropagation algorithm (BP) was introduced in the 1970s, but its importance was not widely appreciated until a famous 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams. That paper described several neural networks in which backpropagation works far faster than earlier learning algorithms, making it possible to use neural networks to solve problems that had previously been intractable. Today, backpropagation is the core learning algorithm for neural networks.

The core of the BP algorithm is an expression for the partial derivative ∂C/∂w of the loss function C with respect to any weight w (or bias b) in the network. This expression tells us how quickly the loss changes when we change the weights and biases. BP is not only a fast learning algorithm; it actually gives us detailed insight into how changing the weights and biases changes the overall behavior of the network.

Warm-up: a fast matrix-based method for computing the network's output

Before we discuss backpropagation, let's warm up with a fast matrix-based algorithm for computing the output of a neural network. We actually saw this algorithm briefly at the end of the previous article. In particular, it is a natural way to become familiar with the notation used in backpropagation.

We will use w^l_{jk} to denote the weight of the connection from the k-th neuron in layer l−1 to the j-th neuron in layer l. For example, w^3_{24} is the weight of the connection from the 4th neuron in the 2nd layer to the 2nd neuron in the 3rd layer.

We use a similar notation for biases and activations. Explicitly, b^l_j denotes the bias of the j-th neuron in layer l, and a^l_j denotes the activation of the j-th neuron in layer l.

The activation a^l_j of the j-th neuron in layer l is related to the activations in layer l−1 by the equation

    a^l_j = σ( Σ_k w^l_{jk} a^{l-1}_k + b^l_j ),    (23)

where the sum runs over all neurons k in layer l−1. To rewrite this in matrix form, we define a weight matrix w^l for each layer l. The entries of w^l are simply the weights connecting into layer l; that is, the entry in row j, column k is w^l_{jk}. Similarly, for each layer l we define a bias vector b^l, with one component b^l_j per neuron in layer l, and an activation vector a^l whose components are the activations a^l_j. The last ingredient is vectorizing a function such as σ: if f is a scalar function and v is a vector, the vectorized form of f applies f to every component, so that f(v)_j = f(v_j).

So we can rewrite equation (23) compactly as

    a^l = σ( w^l a^{l-1} + b^l ).    (25)

This way of writing is not only notationally much simpler, it also helps you understand the equation globally: the activations of layer l are obtained by multiplying the activations of the previous layer (l−1) by the weight matrix connecting into layer l, adding the layer's bias vector, and then applying the sigmoid function. Just as important, many linear algebra libraries compute matrix operations very quickly.

When using equation (25) we compute an intermediate quantity z^l ≡ w^l a^{l-1} + b^l along the way. This quantity turns out to be useful enough to deserve its own name: we call z^l the weighted input to the neurons in layer l. Equation (25) is then sometimes written in terms of the weighted input, as a^l = σ(z^l). Note also that z^l has components z^l_j = Σ_k w^l_{jk} a^{l-1}_k + b^l_j; that is, z^l_j is just the weighted input to the activation function of neuron j in layer l.
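As a sketch, the matrix-based forward pass is only a few lines of numpy. The layer sizes and the random initialization below are illustrative assumptions, not something from the article:

```python
import numpy as np

def sigmoid(z):
    # Vectorized sigmoid: applies elementwise to the array z.
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a, weights, biases):
    # a: activation column vector of the input layer (a^1).
    for w, b in zip(weights, biases):
        z = w @ a + b      # weighted input z^l = w^l a^(l-1) + b^l
        a = sigmoid(z)     # activation a^l = sigma(z^l)
    return a

# Illustrative 2-3-1 network with random weights.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [rng.standard_normal((3, 1)), rng.standard_normal((1, 1))]
x = np.array([[0.5], [-0.2]])
out = feedforward(x, weights, biases)
print(out.shape)  # (1, 1)
```

Each weight matrix has one row per neuron in the layer it feeds into and one column per neuron in the layer it comes from, so the `@` products line up automatically.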

Two assumptions we need about the loss function

The goal of the BP algorithm is to compute the partial derivatives ∂C/∂w and ∂C/∂b of the loss function C with respect to any weight w or bias b in the network. As an example, consider the quadratic loss function

    C = (1/2n) Σ_x ‖ y(x) − a^L(x) ‖²,

where n is the total number of training samples, the sum runs over the individual training samples x, y = y(x) is the corresponding desired output, L denotes the number of layers in the network, and a^L(x) is the vector of activations output by the network when x is the input.

The first assumption is that the loss function can be written as an average C = (1/n) Σ_x C_x over loss functions C_x for individual training samples x. We will follow this assumption throughout.

The reason we need this assumption is that what backpropagation actually lets us do is compute the partial derivatives ∂C_x/∂w and ∂C_x/∂b for an individual training sample; we then recover ∂C/∂w and ∂C/∂b by averaging over all training samples. With this in mind, we can assume the training sample x is fixed, drop the x subscript, and write the loss C_x simply as C.

The second assumption is that the loss can be written as a function of the outputs of the neural network: C = C(a^L).

For example, the quadratic loss function satisfies this requirement, since the quadratic loss for a single training sample x may be written as

    C = ½ ‖ y − a^L ‖² = ½ Σ_j (y_j − a^L_j)²,

which is indeed a function of the output activations.

The Hadamard product, s ⊙ t

The backpropagation algorithm is based on common linear algebra operations such as vector addition and multiplying a vector by a matrix. But one of the operations it uses is less common. Suppose s and t are two vectors with the same number of dimensions. Then we use s ⊙ t to denote the elementwise product of the two vectors, whose components are (s ⊙ t)_j = s_j t_j. For example:

    [1, 2] ⊙ [3, 4] = [1·3, 2·4] = [3, 8].

This elementwise multiplication is sometimes called the Hadamard product. A good matrix library usually provides a fast implementation of the Hadamard product, which is handy when implementing backpropagation.
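In numpy, for instance, the Hadamard product is simply the `*` operator on arrays:

```python
import numpy as np

s = np.array([1, 2])
t = np.array([3, 4])
hadamard = s * t  # elementwise product: [1*3, 2*4]
print(hadamard)   # [3 8]
```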

The four fundamental equations behind backpropagation

Backpropagation is about understanding how changing the weights and biases of a network changes the loss function. Ultimately, this means computing the partial derivatives ∂C/∂w^l_{jk} and ∂C/∂b^l_j. But to compute those, we first introduce an intermediate quantity δ^l_j, which we call the error of the j-th neuron in layer l. BP gives a procedure for computing this error, and then relates δ^l_j to ∂C/∂w^l_{jk} and ∂C/∂b^l_j.

To understand how the error is defined, imagine there is a demon in our neural network:

The demon sits at the j-th neuron in layer l. As the input to the neuron comes in, the demon messes with the neuron's operation: it adds a small change Δz^l_j to the neuron's weighted input, so that instead of outputting σ(z^l_j), the neuron outputs σ(z^l_j + Δz^l_j). This change propagates through the later layers of the network, finally causing the overall loss to change by an amount (∂C/∂z^l_j) Δz^l_j.

Now, this demon is a good demon and is trying to help you improve the loss; that is, it is trying to find a Δz^l_j that makes the loss smaller. Suppose ∂C/∂z^l_j has a large value (either positive or negative). Then the demon can lower the loss quite a bit by choosing Δz^l_j with the opposite sign to ∂C/∂z^l_j. By contrast, if ∂C/∂z^l_j is close to 0, the demon can hardly improve the loss at all by perturbing the weighted input; as far as the demon can tell, the neuron is already pretty close to optimal. So there is a heuristic sense in which ∂C/∂z^l_j is a measure of the error in the neuron. Motivated by this, we define the error δ^l_j of neuron j in layer l by

    δ^l_j ≡ ∂C/∂z^l_j.

As per our usual conventions, we use δ^l to denote the vector of errors associated with layer l. Backpropagation will tell us how to compute δ^l for every layer, and then how to relate these errors to the quantities of real interest, ∂C/∂w^l_{jk} and ∂C/∂b^l_j.

You may wonder why the demon changes the weighted input z^l_j. It might seem more natural to imagine the demon changing the output activation a^l_j, and to use ∂C/∂a^l_j as our measure of error. In fact, if you do this the result works out quite similarly to the discussion below. However, it makes the presentation of BP algebraically a little more complicated. So we will stick with δ^l_j = ∂C/∂z^l_j as our measure of error.

Plan of attack:

BP is based on four fundamental equations. Together, these equations give us a way to compute both the error δ^l (which can also be translated as the residual) and the gradient of the loss function.

An equation for the error in the output layer, δ^L: the components of δ^L are given by

    δ^L_j = (∂C/∂a^L_j) σ′(z^L_j).    (BP1)

This is a very natural expression. The first term on the right, ∂C/∂a^L_j, just measures how fast the loss changes as a function of the j-th output activation. For example, if C does not depend much on a particular output neuron j, then δ^L_j will be small, which is exactly what we would expect. The second term on the right, σ′(z^L_j), measures how fast the activation function σ changes at the point z^L_j. Notice that everything in (BP1) is easy to compute. In particular, we compute z^L_j while computing the behavior of the network, and computing σ′(z^L_j) adds only a small extra cost. The exact form of ∂C/∂a^L_j will of course depend on the form of the loss function. However, provided the loss function is known, there is little trouble computing it. For example, if we use the quadratic loss, then C = ½ Σ_j (y_j − a^L_j)², and so ∂C/∂a^L_j = (a^L_j − y_j), which is obviously easy to compute.

Equation (BP1) is a componentwise expression. It's a perfectly good expression, but not the matrix-based form we want for backpropagation. However, it is also easy to rewrite in matrix form:

    δ^L = ∇_a C ⊙ σ′(z^L).

Here, ∇_a C is defined to be the vector whose components are the partial derivatives ∂C/∂a^L_j. You can think of ∇_a C as expressing the rate of change of C with respect to the output activations. For the quadratic loss we have ∇_a C = (a^L − y), so the final matrix-based form is:

    δ^L = (a^L − y) ⊙ σ′(z^L).
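For the quadratic loss, (BP1) is one line of numpy. The weighted inputs and targets below are made-up values for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z)).
    s = sigmoid(z)
    return s * (1 - s)

# Weighted inputs z^L and desired outputs y for an imaginary 2-neuron output layer.
z_L = np.array([[0.5], [-1.0]])
y = np.array([[1.0], [0.0]])
a_L = sigmoid(z_L)

# BP1 for the quadratic loss: delta^L = (a^L - y) * sigma'(z^L).
delta_L = (a_L - y) * sigmoid_prime(z_L)
print(delta_L.shape)  # (2, 1)
```

Note the sign behavior: the first neuron's activation is below its target, so its error is negative; the second is above its target of 0, so its error is positive.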

An equation for the error δ^l in terms of the error in the next layer, δ^{l+1}: in particular,

    δ^l = ( (w^{l+1})ᵀ δ^{l+1} ) ⊙ σ′(z^l).    (BP2)

Suppose we know the error δ^{l+1} at layer l+1. When we apply the transposed weight matrix (w^{l+1})ᵀ, we can think of this intuitively as moving the error backward through the network, giving us a measure of the error at the output of layer l. Taking the Hadamard product with σ′(z^l) then moves the error backward through the activation function of layer l, giving us the error δ^l in the weighted input to layer l.

Combining (BP2) with (BP1), we can compute the error δ^l for any layer in the network. We first compute δ^L with (BP1), then apply (BP2) to compute δ^{L-1}, then apply (BP2) again to compute δ^{L-2}, and so on, moving backward through the network.
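A single (BP2) step might look like this in numpy; the shapes (a layer of 3 neurons feeding a layer of 2) and all the values are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

# Error already known at layer l+1 (2 neurons); w^(l+1) has shape (2, 3).
delta_next = np.array([[0.1], [-0.3]])
w_next = np.ones((2, 3)) * 0.5
z_l = np.zeros((3, 1))  # weighted inputs at layer l

# BP2: delta^l = ((w^(l+1))^T delta^(l+1)) * sigma'(z^l).
delta_l = (w_next.T @ delta_next) * sigmoid_prime(z_l)
print(delta_l.shape)  # (3, 1)
```

With these numbers each backward-propagated component is 0.5·0.1 + 0.5·(−0.3) = −0.1, and σ′(0) = 0.25, so every entry of delta_l is −0.025.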

An equation for the rate of change of the loss with respect to any bias in the network: in particular,

    ∂C/∂b^l_j = δ^l_j.    (BP3)

That is, the error δ^l_j is exactly equal to the rate of change ∂C/∂b^l_j. This is good news, since (BP1) and (BP2) have already told us how to compute δ^l_j. We can rewrite (BP3) in the shorthand

    ∂C/∂b = δ,

where it is understood that δ is evaluated at the same neuron as the bias b.

An equation for the rate of change of the loss with respect to any weight in the network: in particular,

    ∂C/∂w^l_{jk} = a^{l-1}_k δ^l_j.    (BP4)

This tells us how to compute the partial derivative ∂C/∂w^l_{jk} in terms of the quantities δ^l and a^{l-1}, which we already know how to compute. It can be rewritten in a less index-heavy notation as

    ∂C/∂w = a_in · δ_out,

where a_in is the activation of the neuron feeding into the weight w, and δ_out is the error of the neuron that the weight w feeds out of. You can picture this as a weight w on the connection from the neuron with activation a_in to the neuron with error δ_out.
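Given the error δ^l and the previous layer's activations, (BP3) and (BP4) yield the gradients with one outer product. The vectors below are made up for illustration:

```python
import numpy as np

# Error at layer l (3 neurons) and activations of layer l-1 (2 neurons).
delta_l = np.array([[0.2], [-0.1], [0.05]])
a_prev = np.array([[1.0], [0.5]])

grad_b = delta_l             # BP3: dC/db^l_j = delta^l_j
grad_w = delta_l @ a_prev.T  # BP4: dC/dw^l_jk = a^(l-1)_k * delta^l_j
print(grad_w.shape)  # (3, 2)
```

The outer product produces exactly one gradient entry per weight: grad_w[j, k] pairs the error of the downstream neuron j with the activation of the upstream neuron k.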

A nice consequence of equation (BP4) is that when the activation a_in is very small, the gradient term ∂C/∂w will also tend to be very small. In this case, we say the weight learns slowly, meaning it changes very little during gradient descent. In other words, one consequence of (BP4) is that weights output from low-activation neurons learn slowly.

Other insights can be obtained from (BP1)-(BP4). Let's start with the output layer. Consider the σ′(z^L_j) term in (BP1). Recall from the graph of the sigmoid function in the previous chapter that the function becomes very flat when its value approaches 0 or 1. When this happens, we get σ′(z^L_j) ≈ 0. So if the output neuron is low-activation (≈ 0) or high-activation (≈ 1), the weights in the final layer will learn slowly. In this case it is usually said that the output neuron is saturated and, as a result, the weight has stopped learning (or is learning slowly). Similar conclusions hold for the biases of the output neurons.

We can obtain similar insights for earlier layers. In particular, note the σ′(z^l) term in (BP2). This means δ^l_j is likely to become small if the neuron is close to saturation. And this, in turn, means that any weights input to a saturated neuron will also learn slowly.

Summing up, we have learned that a weight will learn slowly if either the input neuron is low-activation or the output neuron is saturated, that is, either high-activation or low-activation. None of these observations is too surprising. Still, they help improve our mental model of what happens as a neural network learns.

Proof of the four fundamental equations

We will now prove the four fundamental equations (BP1)-(BP4). All four are consequences of the chain rule of multivariable calculus. Let's start with equation (BP1), which gives an expression for the output error δ^L. To prove it, recall the definition

    δ^L_j = ∂C/∂z^L_j.

Applying the chain rule, we can re-express this partial derivative in terms of partial derivatives with respect to the output activations:

    δ^L_j = Σ_k (∂C/∂a^L_k)(∂a^L_k/∂z^L_j),

where the sum runs over all neurons k in the output layer. Of course, the output activation a^L_k of the k-th neuron depends on the weighted input z^L_j of the j-th neuron only when k = j. So ∂a^L_k/∂z^L_j vanishes when k ≠ j, and the preceding equation simplifies to

    δ^L_j = (∂C/∂a^L_j)(∂a^L_j/∂z^L_j).

Since a^L_j = σ(z^L_j), the second term on the right can be written σ′(z^L_j), and the equation becomes

    δ^L_j = (∂C/∂a^L_j) σ′(z^L_j),

which is exactly (BP1) in component form. The proofs of (BP2) and (BP3) are left as exercises for the reader.

The backpropagation algorithm

The BP algorithm gives us a procedure for computing the gradient of the loss function. Written out explicitly as an algorithm:

    1. Input x: set the corresponding activation a^1 for the input layer.
    2. Feedforward: for each l = 2, 3, ..., L compute z^l = w^l a^{l-1} + b^l and a^l = σ(z^l).
    3. Output error δ^L: compute the vector δ^L = ∇_a C ⊙ σ′(z^L).
    4. Backpropagate the error: for each l = L−1, L−2, ..., 2 compute δ^l = ((w^{l+1})ᵀ δ^{l+1}) ⊙ σ′(z^l).
    5. Output: the gradient of the loss function is given by ∂C/∂w^l_{jk} = a^{l-1}_k δ^l_j and ∂C/∂b^l_j = δ^l_j.
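The five steps above can be sketched as one function. This is a minimal sketch for the quadratic loss and sigmoid activations; the 2-3-1 network and its random initialization are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def backprop(x, y, weights, biases):
    """Return (grad_b, grad_w) for the quadratic loss on one sample."""
    # Steps 1-2: feedforward, storing all weighted inputs and activations.
    a, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # Step 3: output error (BP1 for the quadratic loss).
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_b = [None] * len(biases)
    grad_w = [None] * len(weights)
    grad_b[-1] = delta                         # BP3
    grad_w[-1] = delta @ activations[-2].T     # BP4
    # Step 4: backpropagate the error (BP2), filling in BP3/BP4 per layer.
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grad_b[-l] = delta
        grad_w[-l] = delta @ activations[-l - 1].T
    return grad_b, grad_w

rng = np.random.default_rng(1)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [rng.standard_normal((3, 1)), rng.standard_normal((1, 1))]
grad_b, grad_w = backprop(np.array([[1.0], [0.0]]), np.array([[1.0]]),
                          weights, biases)
print([g.shape for g in grad_w])  # [(3, 2), (1, 3)]
```

Each gradient has the same shape as the parameter it belongs to, which is exactly what a gradient descent update needs.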

We compute the error vectors δ^l backward, starting from the final layer, and it may seem odd that we traverse the network backward. But if you think about the proof of backpropagation, the backward movement is a consequence of the fact that the loss is a function of the outputs of the network. To understand how the loss varies with earlier weights and biases, we need to apply the chain rule repeatedly, working backward through the layers to obtain usable expressions. The backpropagation algorithm as stated computes the gradient for a single training sample, C = C_x. In practice, backpropagation is usually combined with a learning algorithm such as stochastic gradient descent, in which we compute the gradient over many training samples. In particular, given a mini-batch of m training samples, the following algorithm applies a gradient descent learning step based on that mini-batch:

    1. Input a set of training samples.
    2. For each training sample x: set the corresponding input activation a^{x,1}, and perform the following steps:
      1. Feedforward: for each l = 2, 3, ..., L compute z^{x,l} = w^l a^{x,l-1} + b^l and a^{x,l} = σ(z^{x,l}).
      2. Output error δ^{x,L}: compute the vector δ^{x,L} = ∇_a C_x ⊙ σ′(z^{x,L}).
      3. Backpropagate the error: for each l = L−1, L−2, ..., 2 compute δ^{x,l} = ((w^{l+1})ᵀ δ^{x,l+1}) ⊙ σ′(z^{x,l}).
    3. Gradient descent: for each l = L, L−1, ..., 2 update the weights according to the rule w^l → w^l − (η/m) Σ_x δ^{x,l} (a^{x,l-1})ᵀ, and the biases according to the rule b^l → b^l − (η/m) Σ_x δ^{x,l}.

Of course, to implement stochastic gradient descent in practice you also need an outer loop that generates mini-batches from the training samples, and an outer loop that steps through multiple epochs of training.
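The mini-batch update in step 3 is just an average of per-sample gradients followed by a descent step. This is a sketch; the `backprop` callback, the fake constant-gradient stub, and the learning rate η = 0.5 are all illustrative assumptions:

```python
import numpy as np

def update_mini_batch(batch, weights, biases, eta, backprop):
    """One SGD step: w -> w - (eta/m) sum_x grad_w, and likewise for b."""
    m = len(batch)
    sum_b = [np.zeros_like(b) for b in biases]
    sum_w = [np.zeros_like(w) for w in weights]
    for x, y in batch:
        grad_b, grad_w = backprop(x, y)  # per-sample gradients
        sum_b = [sb + gb for sb, gb in zip(sum_b, grad_b)]
        sum_w = [sw + gw for sw, gw in zip(sum_w, grad_w)]
    weights = [w - (eta / m) * sw for w, sw in zip(weights, sum_w)]
    biases = [b - (eta / m) * sb for b, sb in zip(biases, sum_b)]
    return weights, biases

# Smoke test with a fake backprop that returns constant all-ones gradients.
weights = [np.ones((1, 2))]
biases = [np.zeros((1, 1))]
fake_backprop = lambda x, y: ([np.ones((1, 1))], [np.ones((1, 2))])
batch = [(None, None), (None, None)]
weights, biases = update_mini_batch(batch, weights, biases, 0.5, fake_backprop)
print(weights[0])  # each weight moves from 1.0 to 1.0 - 0.5*1 = 0.5
```

Passing the per-sample gradient routine in as a callback keeps the update rule independent of the network architecture.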

Backpropagation code implementation:

For an analysis of the code, see my next article, "Neural Network Code Analysis."

Reprinted from http://www.gumpcs.com/index.php/archives/962
