Reading notes: Neural Networks and Deep Learning, Chapter 3 (1)


(This article consists of reading notes on Chapter 3 of Neural Networks and Deep Learning, "Improving the way neural networks learn"; the material has been trimmed according to personal taste.)

In the previous chapter, we got a first look at the most important algorithm in neural networks: backpropagation (BP). It makes the training of neural networks possible and is the foundation of other, more advanced algorithms. In this chapter, we continue with other methods that improve the network's training results.

These methods include:

    • A better cost function: the cross-entropy function
    • Four regularization methods: L1, L2, dropout, and artificial expansion of the training data
    • A better method of initializing weights
    • A series of heuristic strategies for selecting hyper-parameters
    • Some other tips
The cross-entropy cost function

In real life, we all have this experience: we often learn the most from obvious mistakes, and if we are vague about our mistakes, progress slows down.

Similarly, we would like neural networks to learn faster from larger mistakes. Is that what actually happens? Let's look at a simple example.

This example contains only one neuron with a single input. We will train this neuron so that when the input is 1, the output is 0. We initialize the weight and bias to 0.6 and 0.9 respectively. With an input of 1, the network outputs 0.82 (\(\frac{1}{1+e^{-1.5}} \approx 0.82\)). We use the quadratic cost function to train the network and set the learning rate to 0.15.

This network has essentially degenerated into a simple logistic-regression model: a single sigmoid unit. The original notes include an animation demonstrating the training process; the key observations are summarized below.

From the animation we can see that the neuron learns the parameters quickly, and the final output is 0.09 (already close to 0). Now, we initialize both the weight and the bias to 2.0, so the initial output of the network is 0.98 (far from the result we want), and keep the learning rate at 0.15. Let's see how this network learns:

Although the learning rate is the same as before, the network starts learning slowly: for roughly the first 150 epochs the weight and bias barely change, then the learning speed suddenly increases and the neuron's output quickly drops towards 0. This is very different from how we learn: when the neuron's output is badly wrong, its learning speed is not fast at all.
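To make the two runs concrete, here is a minimal Python sketch of the same experiment. The initial values, target, and learning rate of 0.15 come from the text above; the 300-epoch count and the function name are my own assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_single_neuron(w, b, x=1.0, y=0.0, eta=0.15, epochs=300):
    """Gradient descent on the quadratic cost C = (y - a)^2 / 2
    for one sigmoid neuron with a single input."""
    for _ in range(epochs):
        a = sigmoid(w * x + b)
        sigma_prime = a * (1.0 - a)           # sigma'(z) for the sigmoid
        w -= eta * (a - y) * sigma_prime * x  # cf. equation (55) below
        b -= eta * (a - y) * sigma_prime      # cf. equation (56) below
    return w, b, sigmoid(w * x + b)

# Starts at a ~ 0.82: learns quickly.
print(train_single_neuron(0.6, 0.9))
# Starts at a ~ 0.98: barely moves at first, then speeds up.
print(train_single_neuron(2.0, 2.0))
```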

Below we look at the root cause of the problem. When the neuron is trained, its learning speed is determined by the learning rate and the partial derivatives \(\partial C/\partial w\) and \(\partial C/\partial b\). Saying that learning is slow is really saying that these partial derivatives are small. For the quadratic cost
\[C=\frac{(y-a)^2}{2} \tag{54}\]
(where \(a=\sigma(z)\) and \(z=wx+b\)), we find (in the following two equations the values of \(x\) and \(y\) have been replaced with 1 and 0):
\[\frac{\partial C}{\partial w} = (a-y)\sigma'(z)x = a\sigma'(z) \tag{55}\]

\[\frac{\partial C}{\partial b} = (a-y)\sigma'(z) = a\sigma'(z) \tag{56}\]

To understand these two equations in depth, we need to recall the shape of the sigmoid function (the original notes show a plot of \(\sigma(z)\) here).

From the plot we can see that when the function value is close to 1 or 0, its derivative tends to 0, which drives the values of equations (55) and (56) towards 0. This is why the neuron learns slowly at the start of the second experiment and much faster in the middle stretch.
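A quick numerical check (assuming nothing beyond the sigmoid definition) shows how sharply \(\sigma'(z)\) collapses once the output saturates:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Near z = 0 the derivative is at its maximum (0.25);
# once the output saturates towards 0 or 1 it collapses.
for z in (0.0, 2.0, 4.0, 6.0):
    print(f"z = {z}: sigma(z) = {sigmoid(z):.4f}, sigma'(z) = {sigmoid_prime(z):.6f}")
```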

Introducing the cross-entropy cost function

To solve the problem of the declining learning speed, we have to work on those two partial derivatives: either change the cost function, or replace the \(\sigma\) function. Here we take the first approach and replace the cost function with the cross-entropy.

First, an example is used to introduce the cross-entropy function.

Suppose we have a neuron with several inputs \(x_1, x_2, \dots\), weights \(w_1, w_2, \dots\) and a bias \(b\) (the original notes show a diagram of this neuron).

The cross-entropy function is defined as follows (this assumes that \(y\) is a probability value between 0 and 1, so that it is comparable with \(a\)):
\[C=-\frac{1}{n}\sum_x{[y \ln a + (1-y) \ln (1-a)]} \tag{57}\]
Of course, it is not obvious at a glance that this function solves the problem of the declining learning rate; it is not even obvious that it can serve as a cost function at all.

Let's first explain why this function can be used as a cost function. First, it is non-negative, i.e. \(C > 0\) (note that \(a\) lies between 0 and 1). Second, when the actual output of the neuron is close to the desired output, the value of the cross-entropy tends towards 0. The cross-entropy therefore satisfies the basic requirements of a cost function.

In addition, the cross-entropy solves the problem of the declining learning rate. Substituting \(a=\sigma(z)\) into (57) and applying the chain rule (here \(w_j\) refers to the weights of the last layer, i.e. \(w_j^l\)), we get:
\[\begin{eqnarray} \frac{\partial C}{\partial w_j} & = & -\frac{1}{n} \sum_x \left( \frac{y}{\sigma(z)}-\frac{(1-y)}{1-\sigma(z)} \right) \frac{\partial \sigma}{\partial w_j} \tag{58}\\ & = & -\frac{1}{n} \sum_x \left( \frac{y}{\sigma(z)}-\frac{(1-y)}{1-\sigma(z)} \right) \sigma'(z) x_j. \tag{59}\end{eqnarray}\]
Simplifying, and substituting \(\sigma(z)=\frac{1}{1+e^{-z}}\), gives:
\[\frac{\partial C}{\partial w_j}=\frac{1}{n}\sum_x {x_j(\sigma(z)-y)} \tag{61}\]
This expression is exactly what we want! It shows that the learning speed is controlled by \(\sigma(z)-y\), i.e. by the output error: the larger the error, the faster the learning. It also avoids the slowdown caused by the \(\sigma'(z)\) term in the quadratic cost.

Similarly, we can calculate:
\[\frac{\partial C}{\partial b}=\frac{1}{n}\sum_x{(\sigma(z)-y)} \tag{62}\]
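As a sanity check, the sketch below reuses the single-neuron setup from the earlier experiment (the variable names are mine) and compares the weight gradient under the quadratic cost, equation (55), with the cross-entropy gradient, equation (61), for a single training example:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = 1.0, 0.0
for w, b in [(0.6, 0.9), (2.0, 2.0)]:
    a = sigmoid(w * x + b)
    quad_grad_w = (a - y) * a * (1.0 - a) * x  # quadratic cost, equation (55)
    ce_grad_w = (a - y) * x                    # cross-entropy, equation (61), one example
    print(f"a = {a:.2f}: quadratic dC/dw = {quad_grad_w:.4f}, "
          f"cross-entropy dC/dw = {ce_grad_w:.4f}")
```

For the badly wrong output near 0.98, the quadratic gradient is tiny while the cross-entropy gradient stays large, which is exactly the behaviour described above.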
Now let's apply the cross-entropy to the earlier example and see how the training of the neuron changes.

First, the example where the weight and bias are initialized to 0.6 and 0.9:

You can see that the network now trains at close to the ideal speed.

Then the example where the weight and bias are both initialized to 2.0:

This time, as we hoped, the neuron learns very quickly.

In these two experiments the learning rate was 0.005. In fact, for different cost functions the learning rate should be adjusted accordingly.

The discussion of the cross-entropy above covers only a single neuron, but it is easy to extend it to a network with many output neurons. Suppose \(y = y_1, y_2, \dots\) are the desired outputs of the network and \(a_1^l, a_2^l, \dots\) are its actual outputs; then the cross-entropy can be defined as:
\[C=-\frac{1}{n}\sum_x \sum_j {[y_j \ln a_j^l + (1-y_j) \ln (1-a_j^l)]} \tag{63}\]
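A small NumPy sketch of the multi-output cost (63); the function name, the example arrays, and the clipping that guards against \(\log 0\) are my own additions, not from the notes:

```python
import numpy as np

def cross_entropy_cost(a, y, eps=1e-12):
    """Equation (63): a and y are arrays of shape (n_samples, n_outputs),
    where y holds the desired outputs and a the actual final-layer activations.
    The clipping by eps only protects against log(0)."""
    a = np.clip(a, eps, 1.0 - eps)
    return -np.mean(np.sum(y * np.log(a) + (1 - y) * np.log(1 - a), axis=1))

# Example: two samples, three output neurons.
y = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
a_good = np.array([[0.95, 0.03, 0.02], [0.05, 0.90, 0.05]])
a_bad = np.array([[0.20, 0.50, 0.30], [0.60, 0.10, 0.30]])
print(cross_entropy_cost(a_good, y))  # small cost: outputs close to the targets
print(cross_entropy_cost(a_bad, y))   # much larger cost
```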
Having introduced all this, when should we use the quadratic cost and when the cross-entropy? The author's opinion is that cross-entropy is almost always the better choice. The reason is similar to the one given above: the quadratic cost tends to start training slowly when the output is badly wrong, whereas the cross-entropy does not have this problem. Of course, this comparison rests on the premise that the quadratic cost is used with sigmoid neurons.

What exactly is cross-entropy and how does it come about?

In this section, we want to understand how someone might have first come up with the cross-entropy function.

Suppose we have identified the \(\sigma'(z)\) term as the root cause of the declining learning rate; how do we get rid of it? There are many possible approaches. Here we consider this idea: can we find a new cost function that eliminates the \(\sigma'(z)\) term altogether? Concretely, we would like the final derivatives to have the form:
\[\frac{\partial C}{\partial w_j}=x_j(a-y) \tag{71}\]

\[\frac{\partial C}{\partial b}=(a-y) \tag{72}\]

With partial derivatives of this form, the more wrong the network's output is, the faster it learns.

Recalling the four BP formulas, we have:
\[\frac{\partial C}{\partial b}=\frac{\partial C}{\partial a}\sigma'(z) \tag{73}\]
The \(\sigma()\) function here is the sigmoid, so \(\sigma'(z)=\sigma(z)(1-\sigma(z))=a(1-a)\). Substituting this into (73) gives:
\[\frac{\partial C}{\partial b}=\frac{\partial C}{\partial a}\,a(1-a)\]
Comparing with our goal (72), we need:
\[\frac{\partial C}{\partial a}=\frac{a-y}{a(1-a)} \tag{75}\]
Integrating (75) with respect to \(a\) gives:
\[C=-\frac{1}{n}\sum_x{[y\ln a + (1-y) \ln (1-a)]}+\mathrm{constant} \tag{77}\]
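The integration step is not spelled out above; written out with partial fractions (for a single training example, before averaging over the \(n\) examples), it is:
\[\frac{\partial C}{\partial a} = \frac{a-y}{a(1-a)} = -\frac{y}{a} + \frac{1-y}{1-a}
\quad\Longrightarrow\quad
C = -\bigl[\,y \ln a + (1-y)\ln(1-a)\,\bigr] + \mathrm{constant},\]
and averaging this over all training inputs \(x\) yields (77).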
So far, we have introduced the form of the cross-entropy function.

Of course, the real origin of the cross-entropy is information theory; a more detailed introduction is beyond the scope of these notes, so we will not go deeper here.

Softmax

In the previous sections, we focused on how the cross-entropy solves the problem of the declining training speed, which attacks the problem from the cost-function side. In fact, there is another route: replace the \(\sigma()\) function itself. Here is a brief introduction to a new kind of output function: softmax.

The role of softmax is similar to that of the sigmoid, except that its form is:
\[a_j^l=\frac{e^{z_j^l}}{\sum_k{e^{z_k^l}}} \tag{78}\]
The denominator is the sum over all the output neurons. This means that after the softmax function, the outputs of all the neurons form a probability distribution.

As the output of one neuron increases, the outputs of the other neurons decrease, and the total decrease equals the increase of the former; and vice versa. This is because the sum of the outputs of all the neurons is always 1.

In addition, the output of the Softmax is always positive.
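A minimal softmax sketch in NumPy; the max-subtraction is a standard numerical-stability trick rather than part of equation (78), and the example numbers are arbitrary:

```python
import numpy as np

def softmax(z):
    """Equation (78): exponentiate the weighted inputs of a layer and
    normalise so the activations sum to 1. Subtracting max(z) leaves the
    result unchanged but prevents overflow for large z."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
a = softmax(z)
print(a)        # all entries positive
print(a.sum())  # always 1.0: raising one z_j lowers the other activations
```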

How softmax solves the learning-slowdown problem

This time we define the log-likelihood cost function, and use it to understand how softmax alleviates the learning-slowdown problem.

The log-likelihood cost function is:
\[C \equiv -\ln a_y^l \tag{80}\]
First, let's explain \(a_y^l\). In the MNIST data set, for example, we want to decide which of the 10 classes an image belongs to, so the output is a 10-dimensional vector \(a^l\), and the true label is a digit \(y\), say 7. Then \(a_y^l\) denotes the probability value of the corresponding entry \(a_7^l\). If that probability is high (close to 1), the prediction is close to correct and the value of \(C\) is small; the more wrong the prediction, the larger \(C\) becomes.

With this cost function, we again compute the partial derivatives:
\[\frac{\partial C}{\partial b_j^l}=a_j^l-y_j \tag{81}\]

\[\frac{\partial C}{\partial w_{jk}^l}=a_k^{l-1}(a_j^l-y_j) \tag{82}\]

There is no sigmoid-derivative term here to make the learning rate drop.
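A sketch of the log-likelihood cost (80) and the gradients (81)–(82) for a softmax output layer, for one training example; the random initialization, shapes, and variable names are assumptions made for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def log_likelihood_cost(a, y_index):
    """Equation (80): C = -ln a_y, where y_index is the correct class."""
    return -np.log(a[y_index])

# Single example: previous-layer activations, last-layer weights and biases.
rng = np.random.default_rng(0)
a_prev = rng.random(4)           # a^{l-1}
W = rng.standard_normal((3, 4))  # w^l, shape (outputs, inputs)
b = rng.standard_normal(3)       # b^l
y = np.array([0.0, 1.0, 0.0])    # one-hot desired output, correct class = 1

a = softmax(W @ a_prev + b)
print(log_likelihood_cost(a, 1))

grad_b = a - y                    # equation (81)
grad_W = np.outer(a - y, a_prev)  # equation (82)
print(grad_b, grad_W.shape)
```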

(Writing this, I suddenly have a doubt: whether with softmax or with the cross-entropy, we have only taken the partial derivatives with respect to the last layer's weights and biases, and have not computed those of the earlier layers. How can we be sure that the gradients of the earlier layers do not run into the same problem of \(\sigma'()\) tending to 0? Recall that in the BP algorithm the error propagates backwards as \(\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)\); notice that \(\sigma'()\) still appears here, and the weight and bias gradients of the earlier layers are computed from this error. Doesn't that mean the learning slowdown in the earlier layers is still unsolved? I leave this question open for now and will see whether the author addresses it.)

Having written so much, we should ask one more question: when should we use sigmoid with cross-entropy, and when softmax with log-likelihood? In fact, both choices give good results in most cases; of course, if you want the output to form a probability distribution, softmax is undoubtedly better.

Reference
    • Neural Networks and Deep Learning, Chapter 3: Improving the way neural networks learn
