Regularization methods: L1 and L2 regularization, data set amplification, dropout


Original: http://blog.csdn.net/u012162613/article/details/44261657

This article is based on part of the third chapter of "Neural Networks and Deep Learning" and covers the regularization methods commonly used in machine learning / deep learning algorithms. (This article will continue to be updated.)

Regularization methods: preventing overfitting and improving generalization ability

When there is not enough training data, or when the network is overtrained, overfitting often results. As training progresses, the complexity of the model increases and the error on the training data keeps decreasing, but the error on the validation set gradually increases, because the trained network has overfitted the training set and no longer generalizes to data outside it.

There are many ways to prevent overfitting, which are discussed below. One concept needs to be explained first: in machine learning we usually divide the original data set into three parts: training data, validation data, and testing data. What is the validation data for? It is used to avoid overfitting. During training we normally use it to determine certain hyper-parameters (for example, deciding the epoch at which to stop early based on the accuracy on the validation data, or choosing the learning rate based on the validation data). So why not do this directly on the testing data? Because if we did, then as training progresses our network would in effect be gradually overfitting the testing data, and the final testing accuracy would lose its value as an unbiased reference. In short: the training data is used to compute the gradients and update the weights, the validation data is used as described above, and the testing data gives an accuracy that judges the quality of the final network.
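As a concrete illustration of how the validation set is used for early stopping, here is a minimal Python sketch (my own, not from the article); train_one_epoch, accuracy, get_weights and set_weights are hypothetical helpers standing in for whatever training framework is used:

```python
def train_with_early_stopping(model, train_data, val_data, max_epochs=100, patience=10):
    """Train while monitoring validation accuracy; keep the best weights.

    train_one_epoch / accuracy / get_weights / set_weights are hypothetical
    helpers, not part of any specific library.
    """
    best_val_acc = 0.0
    best_weights = model.get_weights()
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model, train_data)    # gradients are computed on training data only
        val_acc = accuracy(model, val_data)   # hyper-parameter decisions use validation data

        if val_acc > best_val_acc:
            best_val_acc = val_acc
            best_weights = model.get_weights()
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                         # stop early: validation accuracy has plateaued

    model.set_weights(best_weights)
    return model
```

The testing data is touched only once, at the very end, to report the final accuracy.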

There are many ways to avoid overfitting: early stopping, dataset augmentation (data augmentation), regularization including L1 and L2 (L2 regularization is also called weight decay), and dropout.

L2 regularization (weight decay)

L2 regularization adds a regularization term to the cost function:

C = C0 + (λ/2n)·Σ w²   (the sum runs over all weights w)

Here C0 is the original cost function, and the added term is the L2 regularization term: the sum of the squares of all the weights w, multiplied by λ and divided by the training-set size n. λ is the regularization coefficient, which weighs the regularization term against the C0 term. The extra factor of 1/2 is there mainly for convenience in the derivation that follows: differentiating the square produces a factor of 2, which the 1/2 cancels.

How does L2 regularization avoid overfitting? Let's look at the derivatives first:

∂C/∂w = ∂C0/∂w + (λ/n)·w
∂C/∂b = ∂C0/∂b

You can see that the L2 regularization term has no effect on the update of b, but it does affect the update of w:

b → b − η·∂C0/∂b
w → w − η·∂C0/∂w − (ηλ/n)·w = (1 − ηλ/n)·w − η·∂C0/∂w

Without L2 regularization, the coefficient in front of w in the update rule is 1; with it, the coefficient becomes 1 − ηλ/n. Since η, λ and n are all positive, 1 − ηλ/n is less than 1, so the effect is to shrink w, which is where the name weight decay comes from. Of course, once the derivative term is taken into account, the final w may still end up larger or smaller.

In addition, it is worth mentioning that for mini-batch based stochastic gradient descent, the update formulas for w and b differ slightly from the above:

w → (1 − ηλ/n)·w − (η/m)·Σx ∂Cx/∂w
b → b − (η/m)·Σx ∂Cx/∂b

Comparing with the w update formula above, the last term has changed: it is now the sum of the per-sample derivatives over the mini-batch, multiplied by η and divided by m, where m is the number of samples in a mini-batch.
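A minimal NumPy sketch of this update rule (a sketch of mine, not the article's code; the function and variable names are assumptions, and grad_w / grad_b stand for the gradients of the unregularized cost C0 already averaged over the mini-batch):

```python
import numpy as np

def sgd_step_l2(w, b, grad_w, grad_b, eta=0.5, lmbda=0.1, n=50000):
    """One mini-batch SGD step with L2 regularization (weight decay).

    grad_w, grad_b: gradients of the unregularized cost C0, already averaged
    over the mini-batch, i.e. (1/m) * sum of the per-sample gradients.
    n: size of the full training set, matching the (lambda/2n)*sum(w**2) term.
    """
    w = (1 - eta * lmbda / n) * w - eta * grad_w   # weight decay factor (1 - eta*lambda/n)
    b = b - eta * grad_b                           # the bias update is unaffected by L2
    return w, b

# usage sketch with random placeholder values
w = np.random.randn(30, 784)
b = np.random.randn(30, 1)
grad_w = np.random.randn(30, 784)   # stand-in for the averaged dC0/dw
grad_b = np.random.randn(30, 1)     # stand-in for the averaged dC0/db
w, b = sgd_step_l2(w, b, grad_w, grad_b)
```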

So far we have only shown that L2 regularization makes w "smaller", but we have not explained why a smaller w helps prevent overfitting. A common, supposedly "obvious" explanation is that smaller weights mean, in a sense, a lower-complexity network, one that fits the data just well enough (this principle is also known as Occam's razor). In practice it is also repeatedly observed that L2 regularization often works better than no regularization. Of course, to many people (including me) this explanation does not seem so obvious, so here is a slightly more mathematical one:

Why are the coefficients of an overfitted function often very large? Overfitting means the fitted function tries to accommodate every single data point, so the resulting curve fluctuates wildly: in some very small intervals the function value changes very sharply. That means the derivative of the function is very large in absolute value in those intervals, and since the inputs themselves can be either large or small, only sufficiently large coefficients can produce such large derivatives.

Regularization constrains the norm of the parameters so that they cannot become too large, and therefore reduces overfitting to some extent.
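A tiny illustration of this point (a toy example of mine, not from the article): fit a degree-9 polynomial to 10 noisy samples of a sine wave, once by plain least squares and once with an L2 penalty on the coefficients, and compare how large the coefficients get.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)

X = np.vander(x, 10, increasing=True)      # design matrix: 1, x, x^2, ..., x^9

# plain least squares: w = argmin ||Xw - y||^2  (tends to produce very large coefficients)
w_plain = np.linalg.lstsq(X, y, rcond=None)[0]

# L2-regularized (ridge) solution: w = (X^T X + lambda*I)^(-1) X^T y
lmbda = 1e-3
w_l2 = np.linalg.solve(X.T @ X + lmbda * np.eye(10), X.T @ y)

print("max |coefficient|, no regularization:", np.abs(w_plain).max())
print("max |coefficient|, L2 regularization:", np.abs(w_l2).max())
```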

L1 regularization

The L1 regularization term is added to the original cost function: the sum of the absolute values of all the weights w, multiplied by λ/n (here, unlike the L2 term, there is no extra factor of 1/2, for the reason explained above):

C = C0 + (λ/n)·Σ |w|

Again, compute the derivative first:

∂C/∂w = ∂C0/∂w + (λ/n)·sgn(w)

where sgn(w) denotes the sign of w. The update rule for the weight w is then:

w → w − (ηλ/n)·sgn(w) − η·∂C0/∂w

Compared with the unregularized update rule there is an extra term (ηλ/n)·sgn(w). When w is positive this term makes the updated w smaller; when w is negative it makes the updated w larger, i.e. closer to zero. So its effect is to push w towards 0, making as many weights in the network as possible equal to 0, which reduces the complexity of the network and helps prevent overfitting.

One issue not mentioned above: what do we do when w = 0? When w = 0, |w| is not differentiable, so we can only update w in the original, unregularized way, which amounts to dropping the (ηλ/n)·sgn(w) term. We can therefore define sgn(0) = 0, which folds the w = 0 case into the same rule. (In programming, set sgn(0) = 0, sgn(w) = 1 for w > 0, and sgn(w) = −1 for w < 0.)
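A minimal NumPy sketch of the L1 update (my own sketch, not the article's code; grad_w again stands for the averaged gradient of C0). np.sign already returns 0 at 0, which matches the sgn(0) = 0 convention above:

```python
import numpy as np

def sgd_step_l1(w, grad_w, eta=0.5, lmbda=0.1, n=50000):
    """One SGD step with L1 regularization.

    np.sign(w) is +1 for positive weights, -1 for negative weights and 0 at
    w == 0, so the w == 0 case is handled exactly as described in the text.
    """
    return w - (eta * lmbda / n) * np.sign(w) - eta * grad_w

# usage sketch
w = np.array([0.8, -0.3, 0.0])
grad_w = np.array([0.05, -0.02, 0.0])   # stand-in for the averaged dC0/dw
print(sgd_step_l1(w, grad_w))
```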

Dropout

L1 and L2 regularization are implemented by modifying the cost function, whereas dropout is implemented by modifying the neural network itself; it is a trick used while training the network. Its procedure is as follows:

Suppose we want to train a given network. At the start of training, we randomly "delete" half of the hidden units, treating them as if they did not exist, which leaves a thinned network.

Keeping the input and output layers unchanged, we update the weights of this thinned network with the BP algorithm (the temporarily "deleted" units are not updated).

That is one iteration. In the second iteration we do the same thing, but the half of the hidden units we delete is chosen at random again, so it is generally not the same half as last time. We continue like this, third iteration, fourth iteration, and so on, until training is finished.
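A minimal NumPy sketch of the dropout idea for one hidden layer (a toy illustration of mine, not the article's code; the ReLU activation and the test-time scaling are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden_forward_with_dropout(x, W, b, p_drop=0.5, training=True):
    """Forward pass through one hidden layer with dropout.

    During training, each hidden unit is kept with probability (1 - p_drop);
    dropped units output 0 and therefore receive no updates. At test time all
    units are used and the activations are scaled by (1 - p_drop) so that
    their expected magnitude matches what was seen during training.
    """
    h = np.maximum(0.0, W @ x + b)              # hidden activations (ReLU assumed)
    if training:
        mask = rng.random(h.shape) > p_drop     # randomly "delete" half the units
        return h * mask
    return h * (1.0 - p_drop)                   # test-time scaling

# usage sketch with random placeholder shapes
W = rng.standard_normal((30, 784))
b = rng.standard_normal(30)
x = rng.standard_normal(784)
h_train = hidden_forward_with_dropout(x, W, b, training=True)
h_test = hidden_forward_with_dropout(x, W, b, training=False)
```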

That is dropout. Why does it help prevent overfitting? A simple explanation: training with dropout is like training many different networks that each use only half of the hidden units (call them "half-networks"). Each half-network can produce a classification result, some correct and some wrong. As training progresses, the majority of these half-networks give the correct classification, so the few wrong ones no longer have a large influence on the final result.

For a deeper understanding, see the 2012 paper by Hinton and Alex Krizhevsky, "ImageNet Classification with Deep Convolutional Neural Networks".

Dataset amplification (data augmentation)

"Sometimes it's not because the algorithm wins, but because it has more data to win." ”

I don't remember which expert said this; Hinton, perhaps? It shows how important training data is, especially for deep learning methods: more training data means you can train a better model with a deeper network.

In that case, why not just collect more data? If you can collect more usable data, that is of course great. But most of the time, collecting more data costs a lot of manpower and resources; anyone who has done manual labeling knows how inefficient and tedious it is.

Instead, you can apply transformations to the original data to obtain more data. Taking image data sets as an example, a variety of transformations can be applied, such as the following (a small code sketch of a few of these transforms appears after the list):

    • Rotate the original image by a small angle

    • Add random noise

    • Apply elastic distortions; the paper "Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis" amplifies MNIST with a variety of such distortions.

    • Crop parts of the original image. In DeepID, for example, 100 small patches are cropped from each face image as training data, which greatly enlarges the data set. Interested readers can see "Deep Learning Face Representation from Predicting 10,000 Classes".
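Here is the small sketch of a few of these transforms on a single grayscale image (a toy illustration of mine, assuming the image is a 2D NumPy array such as a 28x28 MNIST digit; scipy.ndimage is used for the rotation):

```python
import numpy as np
from scipy.ndimage import rotate

rng = np.random.default_rng(0)

def augment(image, max_angle=15.0, noise_std=0.05, crop_size=24):
    """Produce a few augmented variants of one grayscale image (2D array)."""
    variants = []

    # rotate by a small random angle, keeping the original shape
    angle = rng.uniform(-max_angle, max_angle)
    variants.append(rotate(image, angle, reshape=False, mode="nearest"))

    # add random Gaussian noise
    variants.append(image + noise_std * rng.standard_normal(image.shape))

    # crop a random patch (a sub-window of the original image)
    h, w = image.shape
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    variants.append(image[top:top + crop_size, left:left + crop_size])

    return variants

# usage sketch on a random 28x28 "image"
image = rng.random((28, 28))
for v in augment(image):
    print(v.shape)
```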

What does more data mean?

Training an SVM on 50,000 MNIST samples reaches an accuracy of 94.48%, while training a neural network on only 5,000 MNIST samples reaches 93.24%, so more data can make an algorithm perform better. In machine learning, the algorithm itself does not fully determine the result; you cannot simply say that one algorithm is better than another, because the amount of data has a large effect on how the algorithms perform.
