Reading notes: Neural Networks and Deep Learning, Chapter 3 (2)

Last Update:2018-01-07 Source: Internet

Author: User


(These are reading notes on Chapter 3 of the book Neural Networks and Deep Learning, "Improving the way neural networks learn", abridged according to personal taste.)

In the previous chapter, we covered a cost function that improves network training: the cross-entropy function. Here we introduce overfitting, a problem that neural networks run into easily, and a way to address it: regularization.

The overfitting phenomenon

Before discussing what overfitting is, let's run an experiment.

Suppose we use a network with 30 hidden neurons and 23,860 parameters to classify the MNIST dataset, but train it on only 1000 images from the dataset. The training process is the same as before: the cost function is the cross-entropy function, the learning rate is \(\eta = 0.5\), the mini-batch size is 10, and we train for 400 epochs.
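The parameter count can be checked with a little arithmetic. The sketch below assumes a fully connected network with 784 inputs (one per pixel of a 28x28 MNIST image), 30 hidden neurons, and 10 outputs; the 784 and 10 are assumptions not stated in these notes, and `count_parameters` is a hypothetical helper.

```python
def count_parameters(layer_sizes):
    """Count the weights and biases of a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # weights between two adjacent layers
        total += n_out         # one bias per neuron in the receiving layer
    return total

# Assumed layout: 784 inputs, 30 hidden neurons, 10 outputs.
n_params = count_parameters([784, 30, 10])
print(n_params)  # 23860
```

With only 1000 training images, there are far more parameters than samples, which is why this setup overfits so readily.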

The change in cost on the training data during training:

As you can see, the cost gradually decreases. But does this mean the network is getting better and better? Let's look at the accuracy on the test data after each epoch:

Up to about epoch 280, the accuracy of the network does rise slowly, but after that it shows no real improvement and hovers around 82.20%. This is at odds with the steadily falling cost. The network appears to be learning, but the results are in fact poor: this is overfitting.

Overfitting occurs because the model generalizes poorly. In other words, the model fits the training data very well but fits new, unseen data badly.

To understand overfitting better, let's look at some other experiments.

The cost on the test data (the earlier plot was on the training data) during training:

The cost improves over roughly the first 15 epochs, but then starts to rise again. This is one sign that the network has begun to overfit.

Another sign of overfitting:

This is the accuracy on the training set. As you can see, it rises all the way to 100%. One might ask: isn't a high accuracy a good thing? High accuracy is indeed what we want, but it has to be accuracy on the test set. High accuracy on the training set is not necessarily a good thing. It may mean the network has fixated on the peculiarities of the training data: instead of learning how to recognize handwritten digits, it has simply memorized what the training data looks like. In other words, it has fitted the training data too closely.

Overfitting is a common problem in modern neural networks because they have a huge number of parameters; when the training samples are not rich enough, some of those parameters may not be trained properly. To train networks effectively, we need techniques that reduce overfitting.

Validation set

To deal with overfitting, we need to introduce another dataset: the validation set (validation dataset).

The validation set can be thought of as a second layer of insurance. The techniques used to combat overfitting come with parameters of their own (the hyperparameters). If we evaluate them only on the test set, we may end up choosing measures that are tailored to the test set; in other words, we overfit to the test set. Using a separate validation set to evaluate these measures, and only then measuring performance on the test set, gives a network that generalizes better.
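As a minimal sketch of the idea, the data can be shuffled once and divided into three disjoint parts. The function name `split_dataset`, the split sizes, and the stand-in data are all hypothetical, chosen only for illustration.

```python
import random

def split_dataset(samples, n_validation, n_test, seed=0):
    """Shuffle the samples and split them into (training, validation, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    validation = shuffled[:n_validation]
    test = shuffled[n_validation:n_validation + n_test]
    training = shuffled[n_validation + n_test:]
    return training, validation, test

# Stand-in data: integers in place of real (image, label) pairs.
training, validation, test = split_dataset(range(100), n_validation=15, n_test=15)
print(len(training), len(validation), len(test))  # 70 15 15
```

Hyperparameters would then be tuned against `validation` only, with `test` touched once at the end.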

Three small ways to reduce overfitting

We call them small because, although they work, they either have limited effect or are of little practical use.

Early stopping

One obvious way to detect overfitting is to track the accuracy on the test set and stop training when it no longer rises (early stopping). Strictly speaking, this is not a necessary and sufficient condition for overfitting, since the accuracy on the training set and on the test set may stop rising at the same time. But the strategy still helps mitigate overfitting.

In practice, however, we typically track the accuracy rate on the validation set, not the test set.
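A minimal sketch of early stopping on validation accuracy follows. The `patience` rule (stop after a fixed number of epochs without improvement) is one common way to decide that accuracy has stopped rising; it is an assumption here, not something these notes prescribe, and the accuracy values are made up.

```python
def early_stopping_epoch(validation_accuracies, patience=10):
    """Return the index of the epoch at which training would stop.

    Training stops once `patience` consecutive epochs pass without the
    validation accuracy beating the best value seen so far.
    """
    best_accuracy = float("-inf")
    epochs_without_improvement = 0
    for epoch, accuracy in enumerate(validation_accuracies):
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(validation_accuracies) - 1  # never triggered: all epochs ran

# Made-up accuracies: improvement stalls after the fourth epoch.
history = [0.60, 0.72, 0.80, 0.82, 0.82, 0.81, 0.82, 0.81]
stop_at = early_stopping_epoch(history, patience=3)
print(stop_at)  # 6
```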

Increase training data

The accuracy on the training set and the test set when training with all the training data:

As you can see, the gap between the accuracy on the training set and on the test set is now only 2.53%, compared with 17.73% when training on 1000 samples. In other words, increasing the training data relieves overfitting to a great extent. Adding training data is therefore one way to reduce overfitting (and the simplest and most effective one: as the saying goes, a good algorithm is no match for good data). However, adding data does not mean simply duplicating existing samples; it means making the data more varied.

In the real world, training data is hard to obtain, so this method is often difficult to put into practice.

Reduce model parameters

Reducing the number of model parameters is essentially equivalent to increasing the training data. For neural networks, however, more parameters generally give better results, so we usually do not adopt this method unless we have to.

Regularization: L2 regularization

Regularization is a common method for reducing overfitting. In this section we cover the most common regularization technique: L2 regularization (weight decay).

L2 regularization adds a regularization term to the cost function. For example, the regularized cross-entropy function is:
\[C = -\frac{1}{n}\sum_{xj}\left[y_j \ln a_j^L + (1-y_j)\ln(1-a_j^L)\right] + \frac{\lambda}{2n}\sum_w w^2 \tag{85}\]
The regularization term is simply the sum of the squares of all the weights in the network, scaled by \(\lambda/2n\), where \(n\) is the size of the training set and \(\lambda\) is a hyperparameter. We return to the choice of \(\lambda\) below. Note that the regularization term does not include the biases: regularizing the biases has little noticeable effect, so usually only the weights are regularized.

L2 regularization can also be used with other cost functions, such as the quadratic cost:
\[C = \frac{1}{2n}\sum_x \|y - a^L\|^2 + \frac{\lambda}{2n}\sum_w w^2 \tag{86}\]
In general, the L2-regularized cost can be written as:
\[\begin{eqnarray} C = C_0 + \frac{\lambda}{2n}\sum_w w^2,\tag{87}\end{eqnarray}\]
where \(C_0\) is the original cost function.
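Equation (87) translates almost directly into code. The following is a minimal NumPy sketch; the function name `l2_regularized_cost` and the toy weight values are hypothetical, and the unregularized cost \(C_0\) is assumed to have been computed elsewhere.

```python
import numpy as np

def l2_regularized_cost(unregularized_cost, weight_matrices, lmbda, n):
    """C = C_0 + (lambda / 2n) * sum of all squared weights (biases excluded)."""
    sum_of_squares = sum(np.sum(w ** 2) for w in weight_matrices)
    return unregularized_cost + (lmbda / (2.0 * n)) * sum_of_squares

# Toy values, chosen only for illustration: two small weight matrices.
weights = [np.array([[1.0, -2.0], [0.5, 0.0]]), np.array([[3.0, -1.0]])]
cost = l2_regularized_cost(0.40, weights, lmbda=0.1, n=1000)
```

Here the squared weights sum to 15.25, so the penalty added to \(C_0 = 0.40\) is \(0.1 / 2000 \times 15.25\).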

Intuitively, regularization makes the network prefer to learn small weights. It can be seen as a compromise between minimizing the original cost function and finding small weights, with the relative importance of the two controlled by \(\lambda\): when \(\lambda\) is large the network favors small weights, and when \(\lambda\) is small it favors minimizing the original cost function.

Let's start with some experiments to see how this regularization works.

When a regularization term is added, the partial derivatives used in gradient descent change slightly:
\[\begin{eqnarray} \frac{\partial C}{\partial w} & = & \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w \tag{88}\\ \frac{\partial C}{\partial b} & = & \frac{\partial C_0}{\partial b}. \tag{89}\end{eqnarray}\]
Here \(\partial C_0/\partial w\) and \(\partial C_0/\partial b\) can be computed by backpropagation, so the new partial derivatives are easy to compute, and the update rules become:
\[\begin{eqnarray} w & \rightarrow & w - \eta \frac{\partial C_0}{\partial w} - \frac{\eta \lambda}{n} w \tag{91}\\ & = & \left(1 - \frac{\eta \lambda}{n}\right) w - \eta \frac{\partial C_0}{\partial w}. \tag{92}\end{eqnarray}\]

\[\begin{eqnarray} b & \rightarrow & b - \eta \frac{\partial C_0}{\partial b}. \tag{90}\end{eqnarray}\]

With mini-batch training, the gradient descent update for the weights becomes:
\[\begin{eqnarray} w \rightarrow \left(1 - \frac{\eta \lambda}{n}\right) w - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial w}, \tag{93}\end{eqnarray}\]
(Note that the first term uses the training set size \(n\), while the second term uses the mini-batch size \(m\).)
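The mini-batch update can be sketched in a few lines of NumPy. The function name `l2_sgd_step` is hypothetical, and the gradients of the unregularized cost are assumed to have been summed over the mini-batch already (for example by backpropagation). The bias update is the mini-batch form of equation (90), which is an extrapolation from the text above.

```python
import numpy as np

def l2_sgd_step(w, b, grad_w_sum, grad_b_sum, eta, lmbda, n, m):
    """One mini-batch step with L2 weight decay, following equation (93).

    grad_w_sum and grad_b_sum are gradients of the unregularized cost,
    summed over the m samples of the mini-batch.
    """
    w_new = (1.0 - eta * lmbda / n) * w - (eta / m) * grad_w_sum
    b_new = b - (eta / m) * grad_b_sum  # biases are not decayed
    return w_new, b_new

# With zero gradients, only the decay factor (1 - eta * lambda / n) acts.
w = np.array([1.0, -2.0])
b = np.array([0.5])
w_new, b_new = l2_sgd_step(w, b, np.zeros(2), np.zeros(1),
                           eta=0.5, lmbda=0.1, n=1000, m=10)
```

With these values each weight is multiplied by 0.99995 per step, while the bias is unchanged.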

Now, in the example with 1000 training samples, we add a regularization term (with \(\lambda\) set to 0.1 and the other parameters the same as before) and look at the training results:

The accuracy is noticeably higher than the earlier 82.27%; that is, regularization does suppress overfitting to some extent.

Now we train with all 50000 images to see whether regularization still helps. Here we set \(\lambda\) to 5.0, because \(n\) has changed from 1000 to 50000: if \(\lambda\) kept its earlier value, \(\frac{\eta \lambda}{n}\) would become much smaller and the weight decay effect would be greatly weakened.

As you can see, the accuracy rate rises to 96.49%, and the gap between the accuracy of the test set and the accuracy of the training set is further narrowed.
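The choice of \(\lambda = 5.0\) for the full training set can be checked with a little arithmetic. The sketch below assumes the learning rate stays at \(\eta = 0.5\) in both experiments; `decay_step` is a hypothetical helper that returns the fraction of each weight removed per update.

```python
def decay_step(eta, lmbda, n):
    """Fraction of each weight removed per update: eta * lambda / n."""
    return eta * lmbda / n

eta = 0.5  # assumed to be unchanged between the two experiments
small_set = decay_step(eta, lmbda=0.1, n=1000)      # 1000 images, lambda = 0.1
full_set_old = decay_step(eta, lmbda=0.1, n=50000)  # 50000 images, lambda unchanged
full_set_new = decay_step(eta, lmbda=5.0, n=50000)  # 50000 images, lambda = 5.0
```

Under these assumptions, scaling \(\lambda\) by the same factor of 50 as \(n\) keeps the per-step decay the same as in the 1000-image experiment, whereas leaving \(\lambda\) at 0.1 would make it 50 times weaker.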

Why regularization can reduce overfitting

This question can be explained with Occam's razor. The idea of Occam's razor is that if two models both fit the data, we should prefer the simpler one.

The effect of regularization on a neural network is that the weights become smaller in absolute value. The advantage of small weights is that when the input changes slightly, the output of the network does not fluctuate much. Conversely, if the weights are large in absolute value, a small change in the input (including noise) can produce a large response. From this point of view, a regularized network can be regarded as a relatively simple model.

Of course, a simple model is not necessarily the right one; what matters is whether the model generalizes well enough. So far no one has found a systematic scientific explanation of why regularization works. But because regularization often works well for neural networks in practice, in most cases we regularize the network.
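The sensitivity argument can be illustrated on a single neuron's weighted input \(z = w \cdot x + b\). In this sketch, all values are made up, `weighted_input` is a hypothetical helper, and the "large" weights are simply the small ones scaled by 20.

```python
import numpy as np

def weighted_input(w, x, b=0.0):
    """Weighted input z = w . x + b of a single neuron."""
    return float(np.dot(w, x) + b)

x = np.array([0.2, 0.7, 0.1])
noise = np.array([0.01, -0.01, 0.01])  # a small perturbation of the input
small_w = np.array([0.5, -0.3, 0.8])
large_w = 20.0 * small_w

small_change = abs(weighted_input(small_w, x + noise) - weighted_input(small_w, x))
large_change = abs(weighted_input(large_w, x + noise) - weighted_input(large_w, x))
```

The same perturbation moves the weighted input 20 times further with the large weights, so noise in the input has a correspondingly larger effect.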

Other regularization techniques: L1 regularization

L1 regularization has a form similar to L2, except that the regularization term is slightly different:
\[C = C_0 + \frac{\lambda}{n}\sum_w |w| \tag{95}\]
Let's look at the impact of L1 regularization on the network.

First, we take the partial derivative of equation (95):
\[\begin{eqnarray} \frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} \, {\rm sgn}(w), \tag{96}\end{eqnarray}\]
where \({\rm sgn}(w)\) is the sign of \(w\): +1 if \(w\) is positive and -1 if \(w\) is negative.

The gradient descent update rule then becomes:
\[w \rightarrow w' = w - \frac{\eta \lambda}{n}{\rm sgn}(w) - \eta \frac{\partial C_0}{\partial w} \tag{97}\]
Comparing this with the L2 update rule (93), we see that both shrink the weights, which is consistent with the earlier analysis of why regularization works. But the way the weights shrink is different. In L1 regularization, each weight moves toward 0 by a fixed amount, whether the weight is positive or negative. In L2 regularization, the amount of shrinkage is proportional to the weight itself, so the smaller the weight, the smaller the shrinkage. Therefore, when the absolute value of a weight is large, L2 shrinks it more than L1 does; when it is small, L1 shrinks it more than L2 does.

There is one flaw in the formula above: when \(w=0\), \(|w|\) is not differentiable. In that case we simply set \({\rm sgn}(w)=0\).
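The L1 update and its comparison with L2 can be sketched with NumPy. `np.sign` returns 0 for a zero input, which matches the convention \({\rm sgn}(0)=0\). The function name `l1_sgd_step` and the sample weights are hypothetical, and the gradient of \(C_0\) is set to zero so that only the regularization shrinkage is visible.

```python
import numpy as np

def l1_sgd_step(w, grad_w, eta, lmbda, n):
    """Update rule (97); np.sign gives 0 for zero weights, i.e. sgn(0) = 0."""
    return w - (eta * lmbda / n) * np.sign(w) - eta * grad_w

eta, lmbda, n = 0.5, 0.1, 1000
w = np.array([2.0, -0.5, 0.0])
w_new = l1_sgd_step(w, grad_w=np.zeros(3), eta=eta, lmbda=lmbda, n=n)

# Size of the shrinkage applied to each weight by the two schemes.
l1_shrink = np.abs(w - w_new)             # fixed step for every nonzero weight
l2_shrink = (eta * lmbda / n) * np.abs(w)  # proportional to |w|
```

For the weight 2.0 the L2 shrinkage is the larger of the two, for the weight -0.5 the L1 shrinkage is larger, and the zero weight is left untouched by both.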

Dropout

Dropout is very different from L1 and L2 regularization: it does not modify the cost function; instead, it modifies the structure of the network.

Suppose we want to train the following network:

During a gradient descent step, dropout randomly and temporarily deletes half of the neurons in the hidden layer, as follows (dashed lines indicate deleted neurons):

The network is then trained in this "incomplete" state.

When the next mini-batch starts, we restore the complete network, again randomly delete half of the neurons in the hidden layer, and train the network. This is repeated until training is over.

When we use the network for prediction, we restore all the neurons. Since only half of the hidden neurons were active during training, the learned weights are roughly twice as large as they should be for the complete network, so at prediction time we halve the weights leaving the hidden layer.

The idea behind dropout can be understood as follows. Suppose we train many networks with the same structure in the standard way (no dropout). Because each network is initialized differently and its training batches differ, the outputs of the networks will differ. We then take the average of all the networks' results as the final result (similar to the voting mechanism of a random forest). For example, if we train 5 networks and 3 of them classify a digit as "3", we can take the result to be "3", because the other two networks have probably made a mistake. This averaging strategy is powerful because different networks may overfit in different ways, and averaging can alleviate overfitting to some degree. Dropout drops some neurons in each training step, which is like training different networks, so the dropout procedure as a whole is like averaging the results of many networks, and it therefore ends up reducing overfitting.
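The training and prediction passes can be sketched as follows for a single hidden layer. This is a simplified illustration, not a full training loop: the function names, layer sizes, and random weights are hypothetical, the sigmoid activation is an assumption, and the functions return the weighted input of the output layer rather than its activation.

```python
import numpy as np

def hidden_activations(x, w_hidden, b_hidden):
    """Sigmoid activations of the hidden layer."""
    return 1.0 / (1.0 + np.exp(-(w_hidden @ x + b_hidden)))

def forward_train(x, w_hidden, b_hidden, w_out, b_out, rng):
    """Training pass: randomly silence half of the hidden neurons."""
    h = hidden_activations(x, w_hidden, b_hidden)
    keep = np.zeros(h.shape[0], dtype=bool)
    keep[rng.permutation(h.shape[0])[: h.shape[0] // 2]] = True
    # In a real training loop the same mask would be reused for a whole mini-batch.
    return w_out @ (h * keep) + b_out, keep

def forward_predict(x, w_hidden, b_hidden, w_out, b_out):
    """Prediction pass: all neurons active, outgoing hidden weights halved."""
    h = hidden_activations(x, w_hidden, b_hidden)
    return (0.5 * w_out) @ h + b_out

rng = np.random.default_rng(0)
w_hidden, b_hidden = rng.normal(size=(6, 4)), np.zeros(6)
w_out, b_out = rng.normal(size=(2, 6)), np.zeros(2)
x = rng.normal(size=4)
train_out, keep = forward_train(x, w_hidden, b_hidden, w_out, b_out, rng)
predict_out = forward_predict(x, w_hidden, b_hidden, w_out, b_out)
```

Halving the outgoing weights keeps the expected input to the output layer at prediction time in line with what it saw during training, when on average half of the hidden activations were missing.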
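The voting example with 5 networks can be written as a short sketch. The function name `majority_vote` and the individual predictions are hypothetical; only the 3-out-of-5 outcome comes from the example above.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the largest number of networks."""
    return Counter(predictions).most_common(1)[0][0]

# Five networks classify the same image; three of them answer "3".
votes = [3, 3, 8, 3, 5]
print(majority_vote(votes))  # 3
```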

Manually expanding training data

In addition to dropout, expanding the training data is also an effective strategy for mitigating overfitting.

To understand how the size of the training set affects the results, we run several sets of experiments. Each set uses a different training set size; the number of epochs and the regularization parameter \(\lambda\) are adjusted accordingly, and the other parameters remain unchanged.

As the figure shows, increasing the amount of training data helps improve classification accuracy. The curve appears to be leveling off, but on a logarithmic axis the continued improvement is easier to see:

Therefore, if we could expand the data set many times over, the accuracy should continue to rise.

Getting more training data is difficult, but fortunately there is another technique that achieves a similar effect: artificially expanding the data.

For example, take an MNIST training image:

After rotating it by 15°, we get another sample image:

Both images are recognizably a "5", but at the pixel level they are very different, so the rotated image is a useful new training sample. By repeating this practice (rotation, translation, and so on), we can obtain a training set several times the size of the original.

This approach is effective and has been successful in many experiments. Moreover, the idea is not confined to image recognition; it works equally well in other tasks, such as speech recognition.

In addition, more data can compensate for a weaker machine learning algorithm. With the same amount of data, algorithm A may outperform algorithm B, but given more data, algorithm B will often outperform A. And even with the same amount of data, if B's data is richer than A's, B may still do better. This is the sense of the saying that a good algorithm is no match for good data.
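As a minimal sketch of artificial expansion, the code below uses one-pixel translations rather than rotation, since translation needs nothing beyond NumPy. The function names `shift_image` and `augment`, the choice of shifts, and the stand-in image are all hypothetical.

```python
import numpy as np

def shift_image(image, dx, dy):
    """Shift a 2-D image dx pixels right and dy pixels down, zero-filling."""
    shifted = np.zeros_like(image)
    h, w = image.shape
    src = image[max(0, -dy): h - max(0, dy), max(0, -dx): w - max(0, dx)]
    shifted[max(0, dy): max(0, dy) + src.shape[0],
            max(0, dx): max(0, dx) + src.shape[1]] = src
    return shifted

def augment(images, labels, shifts=((1, 0), (-1, 0), (0, 1), (0, -1))):
    """Return the originals plus one shifted copy per (dx, dy) in shifts."""
    aug_images, aug_labels = list(images), list(labels)
    for image, label in zip(images, labels):
        for dx, dy in shifts:
            aug_images.append(shift_image(image, dx, dy))
            aug_labels.append(label)  # a shifted digit keeps its label
    return aug_images, aug_labels

# Stand-in for an MNIST digit: a 28x28 array with a single bright pixel.
image = np.zeros((28, 28))
image[10, 10] = 1.0
aug_images, aug_labels = augment([image], [5])
print(len(aug_images))  # 5
```

Each original image yields four extra samples here, so the training set becomes five times larger while every label stays valid.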

Reference

Neural Networks and Deep Learning, Chapter 3: Improving the way neural networks learn
