Regularization methods: L1 and L2 regularization, data set amplification, dropout


This article is part of the third chapter of an overview of neural networks and deep learning. It covers regularization methods commonly used in machine learning and deep learning algorithms. (More methods may be added to this article later.)

Regularization methods: preventing overfitting and improving generalization

When the training data is insufficient, or when the network is overtrained, the result is often overfitting. Its typical symptom: as training progresses, model complexity increases and the error on the training data gradually falls, but the error on the validation set rises. The network has fit the training set so closely that it no longer works on data outside the training set.

There are many ways to prevent overfitting, and we will go through them below. First, one concept needs explaining: in machine learning we usually divide the original data set into three parts: training data, validation data, and testing data. What is validation data for? It exists precisely to avoid overfitting. During training we use it to choose hyperparameters (for example, deciding the stopping epoch for early stopping based on accuracy on the validation data, or tuning the learning rate against it). Why not do this directly on the testing data? If we tuned hyperparameters on the testing data, then as training progressed the network would gradually overfit the testing data, and the final testing accuracy would lose any value as a reference. So: training data is used to compute gradients and update the weights, validation data is used as described above, and testing data yields an accuracy that judges how good the network is.
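The three-way split described above can be sketched as follows. This is a minimal illustration; the function name and the 70,000-sample sizes below are assumptions for the example, not from the article.

```python
import numpy as np

def train_val_test_split(X, y, n_val, n_test, seed=0):
    """Shuffle the data once, then carve off validation and test sets.

    Training data computes gradient updates, validation data tunes
    hyperparameters (early stopping, learning rate), and testing data
    is touched only for the final accuracy estimate.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))

# Illustrative MNIST-like sizes (assumed for the example):
X = np.zeros((70000, 784))
y = np.zeros(70000)
train, val, test = train_val_test_split(X, y, n_val=10000, n_test=10000)
print(train[0].shape, val[0].shape, test[0].shape)
# → (50000, 784) (10000, 784) (10000, 784)
```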

There are many ways to avoid overfitting: early stopping, data set amplification (data augmentation), regularization, including L1 and L2 (L2 regularization is also called weight decay), and dropout.

L2 regularization (weight decay)

L2 regularization adds a regularization term to the cost function:

    C = C0 + (λ / 2n) · Σw w²

C0 is the original cost function, and the second term is the L2 regularization term: the sum of the squares of all the weights w, divided by the training set size n.

λ is the regularization coefficient, which weighs the regularization term against the C0 term. The extra factor 1/2 is there purely for convenience in the derivation that follows: differentiating the square produces a factor of 2, and the 1/2 cancels it exactly.

How does L2 regularization avoid overfitting? Look at the derivatives first:

    ∂C/∂w = ∂C0/∂w + (λ/n) · w
    ∂C/∂b = ∂C0/∂b

We find that the L2 regularization term has no effect on the update of b, but it does affect the update of w:

    w → w − η · ∂C0/∂w − (ηλ/n) · w = (1 − ηλ/n) · w − η · ∂C0/∂w

Without L2 regularization, the coefficient in front of w in the update rule is 1; with it, the coefficient is 1 − ηλ/n. Since η, λ, and n are all positive, 1 − ηλ/n is less than 1, so its effect is to shrink w. This is the origin of the name weight decay.

Of course, once the gradient term is taken into account, the value of w may still increase or decrease overall.

In addition, it is worth mentioning that for mini-batch stochastic gradient descent, the update formulas for w and b differ slightly from the above:

    w → (1 − ηλ/n) · w − (η/m) · Σx ∂Cx/∂w
    b → b − (η/m) · Σx ∂Cx/∂b

Comparing against the formula for w above, the gradient term has become a sum of per-sample derivatives over the mini-batch, multiplied by η and divided by m, where m is the number of samples in a mini-batch.
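The mini-batch update can be written out directly. This is a minimal numpy sketch; the function name and the hyperparameter values in the example are illustrative assumptions.

```python
import numpy as np

def sgd_step_l2(w, b, grad_w_sum, grad_b_sum, eta, lam, n, m):
    """One mini-batch SGD step with L2 regularization (weight decay).

    w is scaled by (1 - eta*lam/n), the 'decay', before the usual
    gradient step; b gets no decay, since the L2 term does not touch it.
    grad_*_sum are sums of per-sample derivatives over the mini-batch
    of size m; n is the full training-set size.
    """
    w = (1 - eta * lam / n) * w - (eta / m) * grad_w_sum
    b = b - (eta / m) * grad_b_sum
    return w, b

# With a zero gradient, the update only shrinks w and leaves b alone:
w, b = sgd_step_l2(w=2.0, b=1.0, grad_w_sum=0.0, grad_b_sum=0.0,
                   eta=0.1, lam=1.0, n=10, m=1)
print(w, b)  # w has decayed by the factor 1 - 0.1*1/10 = 0.99
```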

So far we have only shown that L2 regularization makes w "smaller", but not why a "smaller" w prevents overfitting. The usual "obvious" explanation is that smaller weights mean, in some sense, lower network complexity, so the data is fitted just well enough (this principle is also known as Occam's razor), and in practice this is borne out: L2-regularized models usually outperform unregularized ones. Of course, for many people (myself included) this explanation seems less than obvious, so here is a slightly more mathematical one:

When a model overfits, the coefficients of the fitted function are often very large. Why? Overfitting means the fitting function is forced to pass carefully through every data point, so the resulting function fluctuates wildly: in some very small intervals the function value changes sharply.

That means the derivative (in absolute value) of the function is very large in those intervals; and since the input values themselves can be large or small, only sufficiently large coefficients can produce such large derivatives.

Regularization constrains the norm of the parameters so that it cannot grow too large, and can therefore reduce overfitting to some extent.
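The claim that overfitting inflates coefficients is easy to check numerically. The sketch below is an illustrative experiment (not from the article): it fits a degree-9 polynomial to 10 noisy points, once by plain least squares, which interpolates every point, and once with an L2 (ridge) penalty, then compares coefficient magnitudes.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)

V = np.vander(x, 10)   # degree-9 polynomial features (Vandermonde matrix)
lam = 0.1

# Plain least squares: passes through every noisy point.
w_ols = np.linalg.lstsq(V, y, rcond=None)[0]

# Ridge (L2-regularized) solution: (V^T V + lam*I) w = V^T y.
w_ridge = np.linalg.solve(V.T @ V + lam * np.eye(10), V.T @ y)

# The penalty shrinks the coefficient vector:
print(np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))  # → True
```

Increasing λ shrinks the coefficient norm further, trading training-set fit for smoothness; λ = 0 recovers the unregularized solution.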

L1 regularization

L1 regularization adds an L1 term to the original cost function: the sum of the absolute values of all weights w, multiplied by λ/n (unlike the L2 term, this one is not multiplied by 1/2, for the reason explained above):

    C = C0 + (λ/n) · Σw |w|

As before, first compute the derivative:

    ∂C/∂w = ∂C0/∂w + (λ/n) · sgn(w)

where sgn(w) denotes the sign of w. The update rule for the weight w is then:

    w → w − (ηλ/n) · sgn(w) − η · ∂C0/∂w

which has one term more than the original update rule: −(ηλ/n) · sgn(w).

When w is positive, the update makes w smaller; when w is negative, the update makes w larger. The effect is therefore to push w toward 0. Driving as many weights as possible to 0 is equivalent to reducing the network's complexity, which prevents overfitting.

One problem was not mentioned above: what to do when w is 0? When w = 0, |w| is not differentiable, so we can only update w by the original unregularized rule, which amounts to dropping the (ηλ/n) · sgn(w) term. We therefore stipulate sgn(0) = 0, which folds the w = 0 case into the same formula.

(In code: sgn(0) = 0, sgn(w > 0) = 1, sgn(w < 0) = −1.)
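The L1 update rule, including the sgn(0) = 0 convention, can be sketched as follows (illustrative numpy code; the function name and hyperparameter values are made up for the example):

```python
import numpy as np

def sgd_step_l1(w, grad_w, eta, lam, n):
    """One gradient step with an L1 penalty.

    np.sign implements exactly the convention above: sign(0) = 0,
    sign(w > 0) = 1, sign(w < 0) = -1, so weights that are already
    zero receive no regularization push.
    """
    return w - (eta * lam / n) * np.sign(w) - eta * grad_w

# With zero gradient, positive weights shrink, negative weights grow,
# and zero weights stay put: everything is pushed toward 0.
w = np.array([2.0, -2.0, 0.0])
print(sgd_step_l1(w, np.zeros(3), eta=0.5, lam=2.0, n=10))
# → [ 1.9 -1.9  0. ]
```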

Dropout

L1 and L2 regularization work by modifying the cost function; dropout works by modifying the neural network itself. It is a trick used while training the network. Its procedure is as follows:

Suppose we want to train a network. At the start of training, we randomly "delete" half of the hidden units, treating them as if they did not exist, as in the network shown below:

Keeping the input and output layers unchanged, we update the weights of the neural network by the backpropagation algorithm (the units on dashed connections are not updated, because they have been "temporarily deleted").

That is the process of one iteration. In the second iteration we do the same thing, again deleting half of the hidden units, and the deleted set will almost certainly differ from last time, because each iteration's deletion is random. The third, fourth iterations proceed likewise, until training ends.

That is dropout. Why does it help prevent overfitting? A simple explanation: training with dropout is equivalent to training many different neural networks, each with only half of the hidden units (call each one a "half network"). Each half network can give a classification result; some of these results are correct, some wrong.

As training progresses, most of the half networks come to give the correct classification, so the minority of wrong classifications will not sway the final outcome much.

For a deeper treatment, see the 2012 paper by Hinton and Alex Krizhevsky, "ImageNet Classification with Deep Convolutional Neural Networks".
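The per-iteration masking can be sketched as a dropout forward pass. Note one assumption: the code below uses the common "inverted dropout" variant, which scales surviving units by 1/(1−p) so that no rescaling is needed at test time; the article's description omits this scaling. The drop probability p = 0.5 matches the article's "delete half of the hidden units".

```python
import numpy as np

def dropout_forward(h, p, rng):
    """Randomly 'delete' a fraction p of the hidden activations for
    this iteration. A fresh mask is drawn every iteration, so each
    mini-batch effectively trains a different 'half network'.
    Survivors are scaled by 1/(1-p) ('inverted dropout') so the
    expected activation is unchanged.
    """
    mask = (rng.random(h.shape) >= p) / (1.0 - p)
    return h * mask, mask   # backprop multiplies upstream grads by mask

rng = np.random.default_rng(0)
h = np.ones((4, 8))                  # hidden activations, batch of 4
h_dropped, mask = dropout_forward(h, p=0.5, rng=rng)
print(h_dropped.shape)  # → (4, 8)
```

At test time no units are dropped and, thanks to the 1/(1−p) scaling during training, the activations need no further adjustment.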

Data set amplification (data augmentation)

"Sometimes it is not the better algorithm that wins; it is the one with more data."

I do not remember which expert said this (Hinton?), but it shows how important training data is, especially for deep learning methods. More training data means a deeper, better model can be trained.

In that case, why not just collect more data? If you can collect more usable data, excellent. But very often, collecting more data means spending far more human labor; anyone who has done manual labeling knows how inefficient and menial it is.

So instead, we can apply transformations to the original data to obtain more data. Taking image data sets as an example, we can apply various transformations, such as:

    • Rotate the original image by a small angle

    • Add random noise

    • Apply elastic distortions; the paper "Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis" amplified MNIST with various such variants

    • Crop parts of the original image. In DeepID, for example, 100 small patches were cropped from each face image as training data, greatly enlarging the data set. Those interested can see "Deep Learning Face Representation from Predicting 10,000 Classes".
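The list of transformations above can be sketched with pure numpy (the shift amounts, noise level, and crop size are illustrative choices; real pipelines also do rotations and elastic distortions, which need an image-processing library):

```python
import numpy as np

def augment(img, rng):
    """Produce simple variants of a 28x28 grayscale image: a small
    translation (a cheap stand-in for a small rotation), additive
    random noise, and a random crop padded back to full size."""
    out = []

    # Shift by up to 2 pixels in a random direction.
    dx, dy = rng.integers(-2, 3, size=2)
    out.append(np.roll(np.roll(img, dy, axis=0), dx, axis=1))

    # Add Gaussian pixel noise, keeping values in [0, 1].
    out.append(np.clip(img + 0.1 * rng.standard_normal(img.shape), 0, 1))

    # Random 24x24 crop, zero-padded back to 28x28.
    top, left = rng.integers(0, 5, size=2)
    crop = img[top:top + 24, left:left + 24]
    padded = np.zeros_like(img)
    padded[2:26, 2:26] = crop
    out.append(padded)

    return out

rng = np.random.default_rng(0)
img = rng.random((28, 28))
variants = augment(img, rng)
print(len(variants), variants[0].shape)  # → 3 (28, 28)
```

Each original image yields several training examples, multiplying the effective size of the data set at negligible cost.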

How much difference does more data make? Training an SVM on 50,000 MNIST samples gives 94.48% accuracy, while training a neural network on only 5,000 MNIST samples gives 93.24%; more data can make an algorithm perform better. So in machine learning the algorithm itself does not decide the outcome, and one cannot flatly rank algorithms as better or worse, because the data has a large effect on how an algorithm performs.

If reprinting, please credit the source: http://blog.csdn.net/u012162613/article/details/44261657

