Overfitting and regularization
After epoch 280 our network no longer generalizes to the test data, so further training is not useful. We say that the network is overfitting (or overtraining) beyond epoch 280.
The network is actually learning peculiarities of the training set rather than learning to recognize digits in general. It is almost as if the network were simply memorizing the training set, without understanding digits well enough to generalize to the test set.
An obvious way to detect overfitting:
Track the accuracy on the test data as training proceeds. If the accuracy on the test data stops improving, we stop training.
More specifically, we compute the classification accuracy on Validation_data at the end of each epoch. Once the classification accuracy has saturated, we stop training. This strategy is called early stopping. The validation set can be thought of as a special kind of training data that helps us learn good hyper-parameters. This way of finding good hyper-parameters is sometimes called the hold-out method, because Validation_data is set aside, or "held out", from Training_data.
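The early stopping rule above can be sketched in a few lines of Python. This is only an illustration: the function name `early_stopping_epoch` and the `patience` parameter (how many epochs without improvement count as "saturated") are assumptions made for this example, and the per-epoch validation accuracies are taken as given rather than computed by a real network.

```python
def early_stopping_epoch(val_accuracies, patience=10):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch
    validation accuracies.

    Training is considered saturated once `patience` epochs have passed
    without the validation accuracy exceeding its best value so far.
    """
    best_acc = float("-inf")
    best_epoch = -1
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            # New best accuracy: remember it and keep training.
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop here.
            return epoch, best_epoch
    # Never saturated within the recorded epochs.
    return len(val_accuracies) - 1, best_epoch
```

In practice one would keep the weights from `best_epoch` rather than those from the epoch at which training stopped.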
L2 regularization (also known as weight decay)
The idea of L2 regularization is to add an extra term to the cost function, called the regularization term. Here is the regularized cross-entropy:
$$C = -\frac{1}{N} \sum_{x} \sum_{j} \left[ y_j \ln a^L_j + (1 - y_j) \ln (1 - a^L_j) \right] + \frac{\lambda}{2N} \sum_{w} w^2$$
where λ > 0 is called the regularization parameter and N is the size of the training set. (Note that the regularization term does not include the biases.)
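Of course, other cost functions can be regularized too, for example the quadratic cost:
$$C = \frac{1}{2N} \sum_{x} \| y - a^L \|^2 + \frac{\lambda}{2N} \sum_{w} w^2$$
Both can be written in the form:
$$C = C_0 + \frac{\lambda}{2N} \sum_{w} w^2$$
where C0 is the original, unregularized cost function.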
Regularization can be viewed as a compromise between finding small weights and minimizing the original cost function. The relative importance of the two parts is controlled by the value of λ: the smaller λ is, the more we favor minimizing the original cost function; the larger λ is, the more we favor small weights.
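To make the role of λ concrete, here is a minimal Python sketch that evaluates the cost C0 plus the penalty (λ / 2N) times the sum of squared weights. The function name `l2_regularized_cost` and the layout of `weights` (one flat list of numbers per layer) are assumptions made for this example; the biases are deliberately not passed in, since they are not penalized.

```python
def l2_regularized_cost(c0, weights, lam, n):
    """Return C = C0 + (lam / (2 * n)) * sum of all squared weights.

    c0      -- value of the original (unregularized) cost
    weights -- list of layers, each a flat list of weight values
    lam     -- regularization parameter (lambda >= 0)
    n       -- size of the training set
    """
    sum_sq = sum(w * w for layer in weights for w in layer)
    return c0 + (lam / (2.0 * n)) * sum_sq
```

With `lam = 0` the function returns `c0` unchanged; increasing `lam` makes large weights more expensive relative to the original cost.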
Applying stochastic gradient descent to a regularized neural network
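Taking the partial derivatives of the regularized cost function gives:
$$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{N} w, \qquad \frac{\partial C}{\partial b} = \frac{\partial C_0}{\partial b}$$
So the gradient descent learning rule for the biases does not change (η is the learning rate):
$$b \rightarrow b - \eta \frac{\partial C_0}{\partial b}$$
while the learning rule for the weights becomes:
$$w \rightarrow w - \eta \frac{\partial C_0}{\partial w} - \frac{\eta \lambda}{N} w = \left( 1 - \frac{\eta \lambda}{N} \right) w - \eta \frac{\partial C_0}{\partial w}$$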
This is the same as the usual gradient descent learning rule, except that the weight w is first rescaled by the factor 1 − ηλ/N. This rescaling is sometimes called weight decay, because it makes the weights smaller.
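As a quick worked example of how small this rescaling is per step, assume purely illustrative values η = 0.5, λ = 5.0 and N = 50,000 (these numbers are not taken from the text above). The second line considers the hypothetical case where the gradient of C0 is zero, so that only the decay acts on the weight:

```latex
1 - \frac{\eta \lambda}{N} = 1 - \frac{0.5 \times 5.0}{50000} = 1 - 5 \times 10^{-5} = 0.99995

w_{1000} = 0.99995^{1000} \, w_0 \approx e^{-0.05} \, w_0 \approx 0.951 \, w_0
```

Each individual update shrinks a weight only slightly, but the effect accumulates over many updates unless the gradient of C0 pushes the weight back.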
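For stochastic gradient descent, the regularized learning rule for the weights becomes:
$$w \rightarrow \left( 1 - \frac{\eta \lambda}{N} \right) w - \frac{\eta}{m} \sum_{x} \frac{\partial C_x}{\partial w}$$
where the sum is over the m training examples x in the mini-batch and C_x is the unregularized cost for example x. The regularized learning rule for the biases is unchanged:
$$b \rightarrow b - \frac{\eta}{m} \sum_{x} \frac{\partial C_x}{\partial b}$$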
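The mini-batch update can be sketched in plain Python as follows. This is a simplified illustration, not the code of any particular library: the function name `sgd_step_l2`, the flat-list representation of the parameters, and the assumption that the per-example gradients of the unregularized cost have already been computed (for instance by backpropagation) are all choices made for this example.

```python
def sgd_step_l2(weights, biases, grads_w, grads_b, eta, lam, n):
    """Apply one mini-batch SGD step with L2 regularization.

    weights, biases  -- flat lists of current parameter values
    grads_w, grads_b -- one list per training example in the mini-batch,
                        aligned with weights / biases, holding the
                        gradients of the unregularized cost C_x
    eta -- learning rate; lam -- regularization parameter
    n   -- size of the whole training set (not of the mini-batch)
    """
    m = len(grads_w)                 # mini-batch size
    decay = 1.0 - eta * lam / n      # weight decay factor
    new_weights = [
        decay * w - (eta / m) * sum(g[i] for g in grads_w)
        for i, w in enumerate(weights)
    ]
    # Biases are not regularized, so they get the ordinary update.
    new_biases = [
        b - (eta / m) * sum(g[i] for g in grads_b)
        for i, b in enumerate(biases)
    ]
    return new_weights, new_biases
```

Note that the decay factor uses the full training set size `n`, while the gradient term is averaged over the mini-batch size `m`.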
Regularization gives the network better generalization ability and significantly reduces the effect of overfitting.
Note: An unregularized network is more at the mercy of chance and can get stuck, apparently trapped in local minima of the cost function. As a result, different runs can give quite different results. By contrast, a regularized network gives results that are much easier to reproduce.
Why regularization helps reduce overfitting
Small weights mean, in some sense, lower complexity. A network with small weights therefore gives a simpler yet more powerful explanation of the data, and so should be preferred.
Neural Networks and Deep Learning (3.2): Improving the way neural networks learn