Reprint: http://blog.csdn.net/u012162613/article/details/44261657
This article is based on notes from Chapter 3 of Neural Networks and Deep Learning and covers regularization methods commonly used in machine learning / deep learning algorithms. (It will continue to be expanded.)

Regularization methods: preventing overfitting and improving generalization
When the training data is insufficient, or when the network is overtrained, overfitting often results. Its intuitive symptom, which the original post illustrates with training/validation error curves, is that as training proceeds and the complexity of the model grows, the error on the training data keeps decreasing while the error on the validation set gradually increases: the trained network fits the training set so closely that it no longer works on data outside the training set.
There are a number of methods for preventing overfitting, which are expanded on below. One concept needs to be explained first: in machine learning we usually divide the original dataset into three parts: training data, validation data, and testing data. What is validation data for? It is used precisely to avoid overfitting: during training we use it to choose certain hyperparameters (for example, deciding the early-stopping epoch from the accuracy on the validation data, or tuning the learning rate on the validation data). Why not do this directly on the testing data? Because if we did, then as training progressed the network would, little by little, overfit the testing data, and the reported testing accuracy would lose any value as a reference. In short, the training data is used to compute the gradients and update the weights, the validation data is used as described above, and the testing data yields an accuracy with which to judge the quality of the network. (A small sketch of the early-stopping idea follows.)
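A minimal sketch of choosing the stopping epoch from validation accuracy, under my own assumptions (synthetic data, a plain logistic-regression model, and a simple patience rule), not code from the original post:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary-classification data, split into train / validation / test.
X = rng.standard_normal((1500, 20))
y = (X[:, :5].sum(axis=1) + 0.5 * rng.standard_normal(1500) > 0).astype(float)
X_tr, y_tr = X[:1000], y[:1000]          # used to compute gradients / update weights
X_va, y_va = X[1000:1250], y[1000:1250]  # used only to decide when to stop
X_te, y_te = X[1250:], y[1250:]          # touched once, for the final accuracy

w, b, eta = np.zeros(20), 0.0, 0.1
best_va, best_epoch, patience, bad = 0.0, 0, 10, 0

def accuracy(w, b, X, y):
    return ((X @ w + b > 0) == y).mean()

for epoch in range(1, 501):
    # one epoch of gradient descent on the logistic loss, training set only
    p = 1.0 / (1.0 + np.exp(-(X_tr @ w + b)))
    w -= eta * X_tr.T @ (p - y_tr) / len(y_tr)
    b -= eta * (p - y_tr).mean()

    va = accuracy(w, b, X_va, y_va)
    if va > best_va:
        best_va, best_epoch, bad = va, epoch, 0
    else:
        bad += 1
    if bad >= patience:          # validation accuracy stopped improving
        break

print("stopped after epoch", epoch, "; best validation epoch:", best_epoch)
print("test accuracy:", accuracy(w, b, X_te, y_te))
```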
There are many ways to avoid overfitting: early stopping, dataset augmentation (data augmentation), regularization including L1 and L2 (L2 regularization is also known as weight decay), and dropout.

L2 regularization (weight decay)
L2 regularization adds a regularization term to the cost function:
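(The formula appears only as an image in the original post; restated here from the description that follows:)

$$C = C_0 + \frac{\lambda}{2n} \sum_w w^2$$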
C0 is the original cost function, and the second term is the L2 regularization term: the sum of the squares of all the weights w, divided by the training set size n and weighted by λ. λ is the regularization coefficient, which balances the regularization term against the C0 term. There is also a factor of 1/2, which you will often see; it is there mainly to make the derivative convenient: differentiating w² produces a factor of 2, which cancels the 1/2 exactly.
How does the L2 regularization term help avoid overfitting? Let's look at the derivatives first:
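(Restated from the surrounding text, since the original shows them as an image:)

$$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w, \qquad \frac{\partial C}{\partial b} = \frac{\partial C_0}{\partial b}$$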
You can see that the L2 regularization term has no effect on the update of b, but it does change the update of w:
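(Again restated from the surrounding text:)

$$b \to b - \eta \frac{\partial C_0}{\partial b}$$

$$w \to w - \eta \frac{\partial C_0}{\partial w} - \frac{\eta\lambda}{n} w = \left(1 - \frac{\eta\lambda}{n}\right) w - \eta \frac{\partial C_0}{\partial w}$$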
Without L2 regularization, the coefficient in front of w in the update is 1; with it, the coefficient becomes 1 − ηλ/n. Because η, λ, and n are all positive, 1 − ηλ/n is less than 1, so the effect is to shrink w. This is where the name weight decay comes from. Of course, because of the gradient term that follows, the final value of w may still end up larger or smaller.
It is also worth mentioning that for stochastic gradient descent based on mini-batches, the update formulas for w and b differ slightly from those given above:
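(For a mini-batch of m samples, restated from the description below:)

$$w \to \left(1 - \frac{\eta\lambda}{n}\right) w - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial w}, \qquad b \to b - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial b}$$

where the sums run over the samples x in the mini-batch and C_x is the cost for sample x.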
Compared with the update formula for w above, the second term has changed: it is now the sum of the gradients over the mini-batch, multiplied by η and divided by m, where m is the number of samples in a mini-batch.
So far we have only explained that L2 regularization makes w "smaller"; we have not explained why a smaller w helps prevent overfitting. The usual "obvious" explanation is that smaller weights indicate, in some sense, a network of lower complexity, which fits the data just well enough (this principle is also known as Occam's razor). In practice this is borne out as well: regularized networks often perform better than unregularized ones. Of course, for many people (myself included) this explanation is not so obvious, so here is a slightly more mathematical one (borrowed from an answer on Zhihu):

When a model overfits, the coefficients of the fitted function are often very large. Why? As illustrated by the figure in the original post, an overfitted curve has to accommodate every single data point, so the resulting function fluctuates strongly: in some very small intervals the value of the function changes dramatically. This means the derivative of the function (in absolute value) is very large in those intervals, and since the input values themselves may be large or small, only sufficiently large coefficients can produce such large derivatives.
Regularization constrains the norm of the parameters so that they cannot grow too large, and can therefore reduce overfitting to some extent. (The small sketch below illustrates this with a polynomial fit.)
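A numpy-only sketch of my own (not from the original post): fit a degree-9 polynomial to a few noisy points with and without an L2 penalty, and compare the size of the learned coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)

# Design matrix of polynomial features x^0 .. x^9
X = np.vander(x, 10, increasing=True)

lam = 1e-3
I = np.eye(10)

# Ordinary least squares fit
w_plain = np.linalg.lstsq(X, y, rcond=None)[0]
# Ridge (L2-regularized) solution: w = (X^T X + lam*I)^-1 X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)

print("max |coef|, no regularization :", np.abs(w_plain).max())
print("max |coef|, L2 regularization :", np.abs(w_ridge).max())
```

The unregularized fit typically needs very large coefficients to pass through every point, while the L2-penalized solution keeps them small.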
L1 regularization

Add an L1 regularization term to the original cost function: the sum of the absolute values of all the weights w, multiplied by λ/n (unlike the L2 regularization term, it is not multiplied by 1/2, for the reason mentioned above):
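(Restated from the description, since the original shows it as an image:)

$$C = C_0 + \frac{\lambda}{n} \sum_w |w|$$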
Again, compute the derivative first:
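(Restated:)

$$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n}\,\mathrm{sgn}(w)$$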
Here sgn(w) denotes the sign of w. The update rule for the weight w then becomes:
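(Restated:)

$$w \to w - \frac{\eta\lambda}{n}\,\mathrm{sgn}(w) - \eta \frac{\partial C_0}{\partial w}$$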
Compared with the original update rule, there is an extra term ηλ·sgn(w)/n. When w is positive, the update makes w smaller; when w is negative, the update makes w larger (closer to zero). The overall effect is to push w toward 0, so that as many of the network's weights as possible become 0, which amounts to reducing the network's complexity and preventing overfitting.
One issue not addressed above: what do we do when w = 0? At w = 0, |w| is not differentiable, so we can only update w by the original, unregularized rule, which amounts to dropping the ηλ·sgn(w)/n term. We can therefore simply define sgn(0) = 0, which folds the w = 0 case into the same formula. (In code, set sgn(0) = 0, sgn(w) = 1 for w > 0, and sgn(w) = −1 for w < 0; a small sketch follows.)
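A minimal numpy sketch of the L1 weight update described above (the function name and example values are my own illustration). Note that np.sign already returns 0 at 0, 1 for positive inputs, and −1 for negative inputs, which is exactly the convention adopted in the text:

```python
import numpy as np

def l1_update(w, grad_C0, eta, lam, n):
    """One L1-regularized step: w <- w - eta*lam*sgn(w)/n - eta*dC0/dw."""
    return w - eta * lam * np.sign(w) / n - eta * grad_C0

# With a zero data gradient, every weight is nudged toward 0 (and 0 stays at 0).
w = np.array([0.5, -0.5, 0.0])
print(l1_update(w, grad_C0=np.zeros_like(w), eta=0.1, lam=1.0, n=10))
```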
Dropout

L1 and L2 regularization work by modifying the cost function, whereas dropout modifies the neural network itself; it is a trick used while training the network. Its procedure is as follows:

Suppose we are going to train a network. At the start of training, we randomly "delete" half of the hidden-layer units, treating them as if they did not exist, and obtain a thinned network (the original post shows before/after diagrams).

Keeping the input and output layers unchanged, we update the weights of this thinned network with the backpropagation algorithm (the units connected by dotted lines in the original figure are "temporarily deleted", so their weights are not updated).
That is one iteration. In the second iteration we do the same thing, except that the half of the hidden units deleted this time will generally differ from the half deleted last time, because each iteration deletes half of them at random. The third, fourth, ... iterations proceed in the same way, until training ends. (A minimal sketch of the masking step is given after this paragraph.)
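A minimal numpy sketch of the masking step, using the common "inverted dropout" variant (my own illustration, not code from the original post):

```python
import numpy as np

def dropout_forward(h, p_drop=0.5, training=True):
    """Randomly zero a fraction p_drop of the hidden activations during training.

    Scaling the survivors by 1/(1 - p_drop) keeps the expected activation
    unchanged, so no extra rescaling is needed at test time.
    """
    if not training:
        return h
    mask = np.random.rand(*h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

# Example: roughly half of the hidden units are silenced on each call.
h = np.ones((1, 8))
print(dropout_forward(h))
```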
That is dropout. Why does it help prevent overfitting? A simple explanation: training with dropout is roughly equivalent to training many neural networks that each have only half of the hidden units (call them "half-networks"). Each half-network can produce a classification result, some correct and some wrong. As training progresses, most of the half-networks give the correct classification, so the few wrong classifications have little influence on the final result.

For a deeper understanding, see the 2012 paper by Alex Krizhevsky and Hinton (two heavyweights of the field), "ImageNet Classification with Deep Convolutional Neural Networks".

Dataset augmentation (data augmentation)
"Sometimes it's not because the algorithm wins, but because it has more data to win," he said. ”
I don't remember what Daniel said in the exact words, Hinton. The importance of training data, especially in depth learning methods, means that more training data can be used to train better models with deeper networks.
In that case, it would be fine to collect more data. If you can collect more available data, of course, good. However, many times, the collection of more data means that the need to spend more manpower and material resources, have been the manual annotation of the students know that the efficiency is particularly low, is simply menial.
So instead you can derive more data by transforming the original data. Taking image datasets as an example, you can apply various transformations, such as the following (a small code sketch follows the list):

Rotate the original image by a small angle

Add random noise

Apply elastic distortions; the paper "Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis" augments MNIST with such deformations

Crop parts of the original image. In DeepID, for example, 100 small patches were cropped from each face image as training data, which greatly enlarged the dataset; if interested, see "Deep Learning Face Representation from Predicting 10,000 Classes".
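A small illustration of the rotation, noise, and crop transformations above, applied to a random stand-in image (my own sketch, not code from the original post):

```python
import numpy as np
from scipy.ndimage import rotate

rng = np.random.default_rng(0)
img = rng.random((28, 28))  # stand-in for e.g. an MNIST digit

# 1) rotate by a small random angle
angle = rng.uniform(-10, 10)
rotated = rotate(img, angle, reshape=False, mode="nearest")

# 2) add random noise
noisy = img + rng.normal(scale=0.05, size=img.shape)

# 3) crop a random 24x24 patch out of the original image
top, left = rng.integers(0, 28 - 24, size=2)
patch = img[top:top + 24, left:left + 24]

print(rotated.shape, noisy.shape, patch.shape)
```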
What does more data mean?

Training an SVM on 50,000 MNIST samples gives an accuracy of 94.48%, while training a neural network on only 5,000 MNIST samples gives an accuracy of 93.24%; more data can therefore make an algorithm perform better. In machine learning, the algorithm alone does not decide the outcome, and one cannot arbitrarily declare which algorithm is superior, because the data has a very large influence on an algorithm's performance.

When reprinting, please cite the source: http://blog.csdn.net/u012162613/article/details/44261657