This article is based on the third chapter of "Neural Networks and Deep Learning" and covers the regularization methods commonly used in machine learning / deep learning algorithms. (This article will continue to be updated.)
Regularization methods: preventing overfitting and improving generalization ability
When the training data is insufficient, or when the network is overtrained, overfitting often results. Its intuitive symptom is shown in the figure: as training progresses, the network's error on the training data gradually decreases, but its error on the validation set gradually increases, because the trained network fits the training set so closely that it no longer works on data outside the training set.
There are many ways to prevent overfitting, which will be presented below. One concept needs explaining first. In machine learning we usually divide the original dataset into three parts: training data, validation data, and testing data. What is the validation data for? It is used to avoid overfitting: during training we usually use it to decide certain hyper-parameters (for example, deciding the epoch at which to apply early stopping based on the accuracy on the validation data, deciding the learning rate, and so on). Why not just do this on the testing data? Because if we did, then as training progressed our network would in effect be overfitting the testing data bit by bit, and the final testing accuracy would lose its value as a reference. So the role of the training data is to compute the gradients and update the weights, the validation data serves the purpose just described, and the testing data yields an accuracy that judges the quality of the network.
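As a concrete illustration of choosing the stopping epoch on the validation data, here is a minimal sketch. The `train_one_epoch` and `val_accuracy` callables and the "patience" rule are hypothetical illustrations, not something from the original text:

```python
def early_stopping(train_one_epoch, val_accuracy, max_epochs=100, patience=10):
    """Pick the stopping epoch using validation data only.

    train_one_epoch(): runs one epoch of weight updates on the training data.
    val_accuracy(): evaluates accuracy on the validation data.
    Both are hypothetical caller-supplied callables.
    """
    best_acc, best_epoch = 0.0, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        acc = val_accuracy()
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch   # validation accuracy improved
        elif epoch - best_epoch >= patience:
            break                               # no improvement for `patience` epochs
    return best_epoch
```

Note that the testing data never enters this loop; it is only used at the very end to report an accuracy.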
There are many ways to avoid overfitting: early stopping, dataset augmentation (data augmentation), regularization including L1 and L2 (L2 regularization is also called weight decay), and dropout.
L2 regularization (weight decay)
L2 regularization adds a regularization term to the cost function:

$$C = C_0 + \frac{\lambda}{2n}\sum_w w^2$$
Here C0 represents the original cost function, and the second term is the L2 regularization term: the sum of the squares of all the weights w, divided by the training set size n, and scaled by λ. λ is the regularization coefficient, which weighs the regularization term against the C0 term. There is also the frequently seen factor of 1/2, which is mainly there for convenience in the later differentiation: differentiating produces a 2, which cancels exactly with the 1/2.
How does L2 regularization avoid overfitting? Let's look at the derivatives first:

$$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w, \qquad \frac{\partial C}{\partial b} = \frac{\partial C_0}{\partial b}$$
You can see that the L2 regularization term has no effect on the update of b, but it does affect the update of w:

$$w \to w - \eta\frac{\partial C_0}{\partial w} - \frac{\eta\lambda}{n}w = \left(1 - \frac{\eta\lambda}{n}\right)w - \eta\frac{\partial C_0}{\partial w}$$
Without L2 regularization, the coefficient in front of w in the update rule is 1; with it, the coefficient becomes 1 - ηλ/n. Since η, λ, and n are all positive, 1 - ηλ/n is less than 1, and its effect is to shrink w. This is the origin of the name weight decay. Of course, once the derivative term is taken into account, the final w may end up larger or smaller.
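For a sense of scale, here is a small worked example with assumed values η = 0.5, λ = 5.0, n = 50,000 (illustrative numbers, not from the text):

$$1 - \frac{\eta\lambda}{n} = 1 - \frac{0.5 \times 5.0}{50000} = 0.99995$$

so each update first shrinks every weight by 0.005% before the gradient term is applied.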
In addition, it should be mentioned that for mini-batch-based stochastic gradient descent, the update formulas for w and b differ somewhat from the above:

$$w \to \left(1 - \frac{\eta\lambda}{n}\right)w - \frac{\eta}{m}\sum_x \frac{\partial C_x}{\partial w}$$

$$b \to b - \frac{\eta}{m}\sum_x \frac{\partial C_x}{\partial b}$$
Comparing this with the w update formula above, you can see that the gradient term has changed: it becomes the sum of the derivatives over all samples in the mini-batch, multiplied by η and divided by m, where m is the number of samples in a mini-batch.
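A minimal NumPy sketch of this update rule follows; the hyper-parameter values and the assumption that the caller supplies gradients of C0 summed over the mini-batch are illustrative:

```python
import numpy as np

def sgd_step_l2(w, b, grad_w_sum, grad_b_sum, m, eta=0.5, lam=5.0, n=50000):
    """One mini-batch SGD step with L2 regularization (weight decay).

    grad_w_sum, grad_b_sum: gradients of the unregularized cost C0,
    summed over the m samples of the mini-batch. n is the size of
    the full training set, not the mini-batch.
    """
    w = (1.0 - eta * lam / n) * w - (eta / m) * grad_w_sum  # decay factor, then gradient step
    b = b - (eta / m) * grad_b_sum                          # biases are not regularized
    return w, b
```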
So far we have only explained that L2 regularization makes w "smaller", but not why a smaller w prevents overfitting. The commonly accepted explanation is that smaller weights mean, in some sense, a lower-complexity network, one that fits the data just well enough (this heuristic is also known as Occam's razor). And in practical applications, L2 regularization has been shown to perform better than no regularization.
L1 regularization
L1 regularization appends to the original cost function the sum of the absolute values of the weights w, multiplied by λ/n (unlike the L2 regularization term, there is no need to multiply by 1/2, for the reason stated above):

$$C = C_0 + \frac{\lambda}{n}\sum_w |w|$$
Again, compute the derivative first:

$$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n}\,\mathrm{sgn}(w)$$
where sgn(w) denotes the sign of w. The update rule for the weight w is then:

$$w \to w - \frac{\eta\lambda}{n}\,\mathrm{sgn}(w) - \eta\frac{\partial C_0}{\partial w}$$
This has one more term than the original update rule: (ηλ/n)·sgn(w). When w is positive, the updated w becomes smaller; when w is negative, the updated w becomes larger. So its effect is to push w toward 0, making as many weights in the network as possible equal to 0, which amounts to reducing the network's complexity and thus preventing overfitting.
One issue was not mentioned above: what do we do when w = 0? When w = 0, |w| is not differentiable, so we can only update w by the original, unregularized rule, which amounts to dropping the (ηλ/n)·sgn(w) term. We can therefore simply specify sgn(0) = 0, which folds the w = 0 case into the general rule. (In code: sgn(0) = 0, sgn(w) = 1 for w > 0, sgn(w) = -1 for w < 0.)
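A minimal NumPy sketch of this update; note that np.sign already returns 0 at w = 0, which is exactly the sgn(0) = 0 convention above (the hyper-parameter values are illustrative):

```python
import numpy as np

def sgd_step_l1(w, grad_w, eta=0.5, lam=5.0, n=50000):
    """One gradient step with L1 regularization.

    grad_w is the gradient of the unregularized cost C0 with respect
    to w. np.sign(0.0) == 0.0, so the w = 0 case needs no special
    handling.
    """
    return w - (eta * lam / n) * np.sign(w) - eta * grad_w
```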
Dropout
L1 and L2 regularization work by modifying the cost function; dropout instead modifies the neural network itself. It is a technique (trick) used while training a network. Its procedure is as follows:
Suppose we want to train the network shown in the figure. At the start of training, we randomly "delete" half of the hidden-layer units, treating them as if they did not exist, and obtain the following network:
Keeping the input and output layers unchanged, we update the weights of the network according to the BP algorithm (the dashed units are not updated, since they have been "temporarily deleted").
That is the process of one iteration. In the second iteration we do the same, except that the half of the hidden units deleted this time will certainly differ from last time, because in every iteration we delete half at random. The third, fourth iteration... and so on, until training ends.
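A minimal NumPy sketch of this training-time masking for a single hidden layer; the sigmoid activation and the p_drop value are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_forward_train(x, W, b, p_drop=0.5):
    """Forward pass through one hidden layer with dropout.

    Each call draws a fresh random mask that "deletes" roughly half
    of the hidden units (p_drop = 0.5), mirroring the procedure
    described above. Dropped units output 0 and receive no updates.
    """
    h = sigmoid(x @ W + b)                 # hidden activations (sigmoid assumed)
    mask = rng.random(h.shape) > p_drop    # True = unit kept this iteration
    return h * mask
```

At test time no units are dropped; common implementations compensate by rescaling activations or weights (for example, inverted dropout) so that expected activations match between training and testing.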
That is dropout. Why does it help prevent overfitting? A simple explanation: training with dropout is like training many neural networks that each contain only half of the hidden units (call them "half-networks"). Each such half-network can produce a classification result; some of those results are correct and some are wrong. As training progresses, the majority of the half-networks come to give the correct classification, so the few wrong classifications will not have a large effect on the final result.
For a more in-depth understanding, see the 2012 paper by the two big names Hinton and Alex Krizhevsky, "ImageNet Classification with Deep Convolutional Neural Networks".
Dataset augmentation (data augmentation)
"Sometimes it's not because the algorithm wins, but because it has more data to win." ”
I don't remember which big name said this, Hinton perhaps? It shows how important training data is, especially for deep learning methods: more training data means a better model can be trained with a deeper network.
In that case, can't we just collect more data? If more data can be collected, that is of course good. But most of the time, collecting more data means spending more manpower and material resources; anyone who has done manual labeling knows how inefficient it is. It is simply drudgery.
So instead, you can make some alterations to the existing data to obtain more data. Taking image datasets as an example, various transformations can be applied, such as the following (a code sketch follows the list):
Rotating the original image by a small angle
Adding random noise
Applying elastic distortions; the paper "Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis" augments MNIST with a variety of such variants
Cropping (crop) parts of the original image. In DeepID, for example, 100 small patches were cropped from each face image as training data, which greatly enlarged the dataset. Those interested can read "Deep Learning Face Representation from Predicting 10,000 Classes".
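As promised above, here is a minimal sketch of the first two transformations with NumPy and SciPy; the rotation range and noise scale are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import rotate

rng = np.random.default_rng(0)

def augment(image, max_angle=10.0, noise_std=0.05):
    """Return a randomly perturbed copy of a grayscale image in [0, 1]."""
    angle = rng.uniform(-max_angle, max_angle)
    rotated = rotate(image, angle, reshape=False, mode="nearest")   # small random rotation
    noisy = rotated + rng.normal(0.0, noise_std, size=image.shape)  # additive Gaussian noise
    return np.clip(noisy, 0.0, 1.0)
```

Applying such a function several times to each training image multiplies the effective size of the dataset at negligible labeling cost.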
What does more data mean?
An SVM trained on 50,000 MNIST samples reaches 94.48% accuracy, while a NN trained on only 5,000 MNIST samples reaches 93.24% accuracy; more data allows an algorithm to perform better. So in machine learning the algorithm itself does not decide the outcome, and we cannot casually say that one algorithm is better or worse than another, because the data has a huge influence on an algorithm's performance.
When reprinting, please credit the source: http://blog.csdn.net/u012162613/article/details/44261657