Reading notes: Neural Networks and Deep Learning, CHAPTER 5


(This article is based on the fifth chapter of the book Neural Networks and Deep Learning: "Why are deep neural networks hard to train?")

In the previous notes we learned the core BP algorithm of neural networks, as well as some improvements (such as the cross-entropy cost function) that speed up training. But if you think about it carefully, you will notice that the networks in the previous examples were "shallow", with at most one or two hidden layers; as soon as the number of layers grows, many problems inevitably surface. Today we look at one of the most frustrating of them: deep neural networks are very difficult to train.

The deeper the network, the better the effect.

Research in recent years has shown that the deeper a network is, the more expressive the model becomes. In image recognition, the first layer of the network learns to recognize edges; the second layer builds on the first and learns to recognize more complex shapes, such as triangles; the third layer builds on the second and recognizes shapes that are more complex still; and so on, until the network learns to recognize high-level semantic information. This expressive power of deep neural networks is why deep learning has made such breakthroughs in recent years.

However, in the course of training deep neural networks, people ran into a serious problem: while the later layers of the network train quickly, the earlier layers are often "frozen" and their parameters stop updating; sometimes it is the other way around, with the earlier layers training quickly while the later layers stall.

By the end of this section, we will understand the underlying reasons behind all this.

Vanishing gradients

Continuing with the MNIST example from before, let's start with a few experiments to see what the vanishing gradient looks like.

In these experiments, the network structures are as follows:

    net = network2.Network([784, 30, 10])
    net = network2.Network([784, 30, 30, 10])
    net = network2.Network([784, 30, 30, 30, 10])
    net = network2.Network([784, 30, 30, 30, 30, 10])

The only difference between these networks is that each has one more hidden layer of 30 neurons than the previous one. In the experiments, all other settings, including the training data, are exactly the same. On the MNIST data set, the classification accuracies of the four experiments are: 96.48%, 96.90%, 96.57%, 96.53%.
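As a reference point, here is a minimal sketch of how this experiment can be reproduced with the network2.py and mnist_loader.py scripts that accompany the book; the hyperparameters shown (30 epochs, mini-batch size 10, learning rate 0.1, regularization 5.0) are assumptions for illustration, not necessarily those behind the accuracies quoted above.

    # Sketch: train networks of increasing depth on MNIST.
    # Assumes Nielsen's mnist_loader.py and network2.py are on the path;
    # the hyperparameters are illustrative only.
    import mnist_loader
    import network2

    training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
    training_data = list(training_data)
    validation_data = list(validation_data)

    architectures = [
        [784, 30, 10],
        [784, 30, 30, 10],
        [784, 30, 30, 30, 10],
        [784, 30, 30, 30, 30, 10],
    ]

    for sizes in architectures:
        net = network2.Network(sizes, cost=network2.CrossEntropyCost)
        net.SGD(training_data, 30, 10, 0.1,
                lmbda=5.0,
                evaluation_data=validation_data,
                monitor_evaluation_accuracy=True)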

As you can see, the second network trains to a better result than the first, but as more hidden layers are added, the accuracy drops. This is surprising: isn't a deeper network supposed to perform better? Moreover, even if the extra hidden layers learned nothing at all, they should not actively hurt performance.

To understand the reason behind this, we will examine the network's gradients to check whether it is really being trained.

For simplicity, let's analyze the gradients of the two hidden layers of the second network ([784, 30, 30, 10]). The figure below shows the gradient value of each neuron in these two layers at the beginning of training; for convenience, only the first six neurons of each layer are shown:

The bar attached to each neuron in the figure represents the gradient value \(\partial C/\partial b\). From the four BP equations we know:

\[\frac{\partial C}{\partial b_j^l} = \delta_j^l \tag{BP3}\]

\[\frac{\partial C}{\partial w_{jk}^l} = a_k^{l-1} \delta_j^l \tag{BP4}\]

Therefore, the bars not only show the gradients of the biases but also, to a large extent, reflect the gradients of the weights.

Because the weights are initialized randomly, the gradients of individual neurons differ; but it is clear that the gradients in the 2nd hidden layer are, on the whole, larger than those in the 1st, and a larger gradient means faster learning.

To check whether this is a coincidence (perhaps these particular neurons just happen to differ), we use a global gradient vector \(\delta^l\) to compare the overall gradients of the two hidden layers. We define \(\delta_j^l = \partial C/\partial b_j^l\), so \(\delta^l\) can be viewed as the vector of gradients of all neurons in layer \(l\). We then use the vector length \(\|\delta^l\|\) to represent the learning speed of the \(l\)-th hidden layer.

When there are only two hidden layers (that is, the [784, 30, 30, 10] network), \(\|\delta^1\| = 0.07\) and \(\|\delta^2\| = 0.31\), which further confirms that the second hidden layer learns faster than the first.

What if there are three hidden layers? The result is \(\|\delta^1\| = 0.012\), \(\|\delta^2\| = 0.060\), \(\|\delta^3\| = 0.283\). Again, each hidden layer learns faster than the one before it.
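Here is a hedged sketch of one way to estimate these per-layer speeds \(\|\delta^l\|\) at the start of training, assuming Nielsen's network2.py, whose backprop(x, y) method returns the per-layer gradients for a single training example; the batch size of 1000 is an arbitrary choice for illustration.

    # Sketch: estimate the learning speed ||delta^l|| of each hidden layer
    # at the start of training (assumes Nielsen's network2.py; backprop(x, y)
    # returns the per-layer gradients (nabla_b, nabla_w) for one example).
    import numpy as np
    import mnist_loader
    import network2

    training_data, _, _ = mnist_loader.load_data_wrapper()
    training_data = list(training_data)

    net = network2.Network([784, 30, 30, 10], cost=network2.CrossEntropyCost)

    # Average the bias gradients over a small batch of examples.
    batch = training_data[:1000]
    nabla_b = [np.zeros(b.shape) for b in net.biases]
    for x, y in batch:
        delta_nabla_b, _ = net.backprop(x, y)
        nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
    nabla_b = [nb / len(batch) for nb in nabla_b]

    # The last entry of nabla_b belongs to the output layer; the rest are
    # the hidden layers, whose vector norms give the learning speeds.
    for l, nb in enumerate(nabla_b[:-1], start=1):
        print("||delta^%d|| = %.3f" % (l, np.linalg.norm(nb)))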

One might object that the gradients above were measured at a single moment at the start of training; do they change as training proceeds? To answer this question, we compute the gradients over many more epochs of training and plot the following graph:

Obviously, no matter how many hidden layers there are, the later layers learn 5 to 10 times faster than the earlier ones, and the first hidden layer may learn at only 1% of the speed of the last. While the parameters of the later layers are being trained, the parameters of the earlier layers are essentially stagnant. This phenomenon is called the vanishing gradient. Note that vanishing gradients do not mean the network has converged: in the experiment we deliberately measured the gradients at the very start of training, and a randomly initialized network is virtually never converged at that point, so the vanishing gradients cannot be explained by convergence.

In addition, as training progresses, we sometimes find the opposite: the gradients of the earlier layers do not vanish but instead become very large, hundreds of times larger than those of the later layers, even overflowing to NaN, as if they had "exploded". In this case we speak of an exploding gradient.

Whether the gradient vanishes or explodes, neither is something we want. Below, we investigate the cause of this phenomenon and look for ways to solve it.

The cause of vanishing gradients

In this section, we explore: why do gradients vanish? Or, more generally, why are the gradients of deep neural networks so unstable?

For simplicity, let's analyze a network with only one neuron in each layer:

Here \(b_j\) and \(w_j\) denote the biases and weights, \(C\) is the cost function, the activation function is the sigmoid, and the output of each layer is \(a_j = \sigma(z_j)\) with \(z_j = w_j a_{j-1} + b_j\).

Below, we work out \(\partial C/\partial b_1\) to see what makes this value small.

From the BP equations one can derive:

\[\frac{\partial C}{\partial b_1} = \sigma'(z_1)\, w_2\, \sigma'(z_2)\, w_3\, \sigma'(z_3)\, w_4\, \sigma'(z_4)\, \frac{\partial C}{\partial a_4}\]

The formula looks more complicated than it really is, so let's see where it comes from. Since the network is very simple (it is just a single chain of neurons), we will derive it from another, more intuitive angle (although BP is perfectly capable of producing the same equation).

Suppose we make a small change \(\Delta b_1\) to \(b_1\). Since \(a_1 = \sigma(z_1) = \sigma(w_1 a_0 + b_1)\), this induces a change in \(a_1\):

\(\Delta a_1 \approx \frac{\partial \sigma(w_1 a_0 + b_1)}{\partial b_1} \Delta b_1 = \sigma'(z_1) \Delta b_1\) (note that \(\Delta a_1\) is not a derivative but the increment caused by \(\Delta b_1\), hence the slope multiplied by \(\Delta b_1\)).

Going further, \(\Delta a_1\) in turn causes \(z_2\) to change; from \(z_2 = w_2 a_1 + b_2\) we get:

\(\Delta z_2 \approx \frac{\partial z_2}{\partial a_1} \Delta a_1 = w_2 \Delta a_1.\)

Substituting the earlier expression for \(\Delta a_1\) gives:

\(\Delta z_2 \approx \sigma'(z_1)\, w_2\, \Delta b_1.\)

As you can see, this already looks very much like the formula we started with. Continuing in the same way along the rest of the chain, we obtain the resulting change in \(C\):

\(\Delta C \approx \sigma'(z_1)\, w_2\, \sigma'(z_2) \ldots \sigma'(z_4)\, \frac{\partial C}{\partial a_4}\, \Delta b_1 \tag{120}\)

Dividing both sides by \(\Delta b_1\), we recover the equation we started with:

\(\frac{\partial C}{\partial b_1} = \sigma'(z_1)\, w_2\, \sigma'(z_2) \ldots \sigma'(z_4)\, \frac{\partial C}{\partial a_4}. \tag{121}\)
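Equation (121) is easy to check numerically on this toy chain. The sketch below uses made-up weights, biases, input and a quadratic cost (all assumptions for illustration) and compares the product formula against a finite-difference estimate of \(\partial C/\partial b_1\).

    # Sketch: check equation (121) numerically on the one-neuron-per-layer chain.
    # Weights, biases, input a0, target y and the quadratic cost are made up.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(z):
        return sigmoid(z) * (1.0 - sigmoid(z))

    w = [0.8, -1.2, 0.5, 0.9]   # w1..w4
    b = [0.1, -0.3, 0.2, 0.4]   # b1..b4
    a0, y = 0.5, 1.0            # input and target

    def forward(b1):
        """Run the chain with bias b1 in the first layer; return (z's, a4, C)."""
        bs = [b1] + b[1:]
        a, zs = a0, []
        for wj, bj in zip(w, bs):
            z = wj * a + bj
            zs.append(z)
            a = sigmoid(z)
        return zs, a, 0.5 * (a - y) ** 2   # quadratic cost

    zs, a4, _ = forward(b[0])

    # Analytic gradient from equation (121); dC/da4 = (a4 - y) for this cost.
    grad = sigmoid_prime(zs[0])
    for wj, zj in zip(w[1:], zs[1:]):
        grad *= wj * sigmoid_prime(zj)
    grad *= (a4 - y)

    # Finite-difference estimate of dC/db1 for comparison.
    eps = 1e-6
    _, _, c_plus = forward(b[0] + eps)
    _, _, c_minus = forward(b[0] - eps)
    print("formula (121):", grad)
    print("finite diff. :", (c_plus - c_minus) / (2 * eps))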

Why the gradient vanishes

With the above formula in hand, can you guess why the gradient vanishes? Yes, it is just like \(0.9^n \approx 0\) for large \(n\): a product of many factors, each smaller than 1, shrinks toward zero.

First, let's recall the graph of the \(\sigma'\) function:

The maximum value of this function is only 1/4. In addition, our weights \(w\) are initialized from a Gaussian distribution with mean 0 and standard deviation 1, so typically \(|w_j| < 1\) and therefore \(|w_j \sigma'(z_j)| < 1/4\). When many such terms are multiplied together, the result becomes very small. Note also that different hidden layers have different numbers of these factors in their gradients, which is why the layers learn at such different speeds.
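To see concretely why different layers learn at different speeds, apply the same chain-rule argument starting from \(b_3\) instead of \(b_1\):

\[\frac{\partial C}{\partial b_3} = \sigma'(z_3)\, w_4\, \sigma'(z_4)\, \frac{\partial C}{\partial a_4}\]

Comparing this with equation (121), \(\partial C/\partial b_1\) contains two extra factors of the form \(w_j \sigma'(z_j)\); since each such factor is typically smaller than 1/4, the first hidden layer's gradient tends to be roughly an order of magnitude (or more) smaller than the third's.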

Although the derivation above is not very rigorous, it is enough to reveal the root cause of the problem.

We will not dwell on the exploding gradient problem here; its principle is the same as that of the vanishing gradient: when each factor is greater than 1, the repeated multiplication makes the product very large.
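As a toy illustration (the numbers are made up), take a 10-layer chain: with small weights every factor \(w_j \sigma'(z_j)\) stays below 1/4 and the product collapses, while with large weights, say \(w_j = 100\) and biases chosen so that each \(z_j \approx 0\) (where \(\sigma'(z_j) = 1/4\)), every factor is about 25 and the product blows up:

    # Sketch: products of the factors w_j * sigma'(z_j) over a 10-layer chain.
    # The weights and z-values are made-up illustrations of the two regimes.
    import numpy as np

    def sigmoid_prime(z):
        s = 1.0 / (1.0 + np.exp(-z))
        return s * (1.0 - s)

    n_layers = 10
    z = 0.0                      # z near 0, where sigma'(z) is largest (1/4)

    vanish = np.prod([0.8 * sigmoid_prime(z)] * n_layers)     # |w| < 1
    explode = np.prod([100.0 * sigmoid_prime(z)] * n_layers)  # |w| = 100

    print("all factors 0.8 * 1/4 = 0.2:", vanish)    # ~1e-7, gradient vanishes
    print("all factors 100 * 1/4 = 25:", explode)    # ~9.5e13, gradient explodes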

Recall the question raised at the end of the previous note: although the cross-entropy cost function solves the problem of learning slowdown, it does so only for the last layer; for the hidden layers in front of it, learning may still slow down. I sidestepped this problem before because the networks we considered had few layers; in this article the source of the problem has finally been identified and analyzed.

Gradients in complex networks are equally unstable

The example above was deliberately simple, but in more complex networks a similar argument explains the instability of the gradients.

For example, for the following complex network:

We can use the BP equations to derive:
\[\begin{eqnarray} \delta^l = \sigma'(z^l) (w^{l+1})^T \sigma'(z^{l+1}) (w^{l+2})^T \ldots \sigma'(z^L) \nabla_a C \tag{124} \end{eqnarray}\]
Here \(\sigma'(z^l)\) is a diagonal matrix whose diagonal entries are the values of \(\sigma'(z)\) for the weighted inputs of layer \(l\), and \(\nabla_a C\) is the vector of partial derivatives of \(C\) with respect to the output activations.

Although this formula has more terms, it has exactly the same form, and the cumulative effect of the repeated matrix multiplications can still make the gradient vanish or explode.
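The following sketch (random Gaussian weights and a made-up input and target, purely illustrative) applies this backward recursion to a randomly initialized [784, 30, 30, 30, 30, 10] network and prints \(\|\delta^l\|\) for each hidden layer; with this kind of initialization and sigmoid activations, the norms typically shrink sharply toward the input layer.

    # Sketch: apply the recursion behind equation (124) to a randomly
    # initialized [784, 30, 30, 30, 30, 10] network and print ||delta^l||.
    # Weights ~ N(0, 1); the input and target are random, purely illustrative.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    sp = lambda z: sigmoid(z) * (1.0 - sigmoid(z))   # sigma'(z)

    sizes = [784, 30, 30, 30, 30, 10]
    rng = np.random.default_rng(0)
    weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
    biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]

    # Forward pass on a random input.
    a = rng.standard_normal((784, 1))
    zs, activations = [], [a]
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    # Backward pass: delta^L = sigma'(z^L) * (a^L - y) for a quadratic cost,
    # then delta^l = sigma'(z^l) * (w^{l+1})^T delta^{l+1}.
    y = rng.standard_normal((10, 1))
    delta = sp(zs[-1]) * (activations[-1] - y)
    deltas = [delta]
    for w, z in zip(reversed(weights[1:]), reversed(zs[:-1])):
        delta = sp(z) * (w.T @ delta)
        deltas.insert(0, delta)

    # Print the norms for the hidden layers (the last entry is the output layer).
    for l, d in enumerate(deltas[:-1], start=1):
        print("||delta^%d|| = %.4g" % (l, np.linalg.norm(d)))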

Other barriers to deep learning

Although this chapter has only dealt with the problem of unstable gradients, many studies show that deep learning faces other obstacles as well.

For example, the choice of activation function affects how well the network learns (see the paper "Understanding the difficulty of training deep feedforward neural networks").

Another example: the way the parameters are initialized also affects training (see the paper "On the importance of initialization and momentum in deep learning").

Clearly, the difficulty of training deep neural networks is a complex problem that needs further research. In the next chapter, we will study some deep learning techniques that, to some extent, overcome these obstacles.

Reference
    • Why are deep neural networks hard to train?
