These are notes on Professor Geoffrey Hinton's Neural Networks for Machine Learning course. The eighth lecture is optional and rather difficult, so I have skipped it for now and will come back to fill it in later when it becomes useful. The ninth lecture introduces how to avoid overfitting and improve the generalization ability of the model.
This is the course link on Coursera.
Overview of ways to improve generalization
In this section, we describe how to improve the generalization ability of a network by reducing overfitting, which occurs when the network has too much capacity relative to the training data. We cover a few ways to control network capacity, along with how to set the associated meta-parameters. We first review what overfitting is; since it was covered in an earlier lecture, the slide is simply pasted here without further narration.
There are four ways to prevent overfitting. The second of them, regulating the network's capacity appropriately, is the focus of this lecture; the last two will be covered in later lectures.
We usually control network capacity through the following four approaches, or a combination of them:
- Architecture: limit the number of hidden layers and the number of units per layer.
- Early stopping: start with small weights and stop the learning before it overfits.
- Weight decay: penalize large weights using penalties or constraints on their squared values (L2 penalty) or absolute values (L1 penalty).
- Noise: add noise to the weights or the activities.
When using these methods, we need to set some meta-parameters, such as the number of hidden units, the number of layers, and the penalty strengths. We might try a range of values for the meta-parameters and then pick the set that performs best on the test data. However, meta-parameters chosen this way are tuned to that particular test set and may perform poorly if the test set is swapped for another one. An extreme example: suppose we have a test set whose labels are not derived from the inputs at all but are assigned randomly. Meta-parameters selected on this test set certainly cannot withstand evaluation on other test sets.
A better approach is to divide the whole data set into a training set, a validation set, and a test set: the training set is used to fit the model's parameters, the validation set is used to select the best-performing meta-parameters, and the test set is used to obtain an unbiased estimate of the model's performance. This unbiased estimate will usually be lower than the performance on the validation set, for the reason just discussed: the validation set was itself used for selection. To reduce the likelihood of overfitting the validation set, we can divide the data into N+1 equal-sized subsets: one is reserved as the final test set, and the other N (denoted S_1 to S_N) are used for training and validation. Each round, one subset S_i is chosen as the validation set, the remaining N-1 are used as the training set, and the error rate E_i on the validation set is computed. As S_i traverses from S_1 to S_N, we obtain N error rates; the meta-parameters are then selected by the average of these N error rates, and the unbiased estimate of model performance is obtained on the final test set. This method is called N-fold cross-validation. My narrative here is not especially clear, so here is a link: cross-validation introduction.
It is important to note that the N estimates (that is, the N error rates) obtained by N-fold cross-validation are not independent of one another. As an extreme example, if one subset happens to contain only a single class of data, generalization will be poor regardless of whether that subset is used for training or for validation.
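As a concrete sketch of the procedure, here is a minimal N-fold cross-validation loop in Python (numpy only; the `train` and `error_rate` callables are hypothetical stand-ins for whatever model and metric are being tuned):

```python
import numpy as np

def n_fold_cross_validation(X, y, meta_params, train, error_rate, n=5, seed=0):
    """Estimate validation error for one meta-parameter setting.

    train(X_tr, y_tr, meta_params) -> model   (hypothetical callable)
    error_rate(model, X_va, y_va) -> float    (hypothetical callable)
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))       # shuffle so folds are not ordered by class
    folds = np.array_split(idx, n)      # n roughly equal subsets S_1 .. S_n
    errors = []
    for i in range(n):
        val_idx = folds[i]              # S_i is the validation set this round
        tr_idx = np.concatenate([folds[j] for j in range(n) if j != i])
        model = train(X[tr_idx], y[tr_idx], meta_params)
        errors.append(error_rate(model, X[val_idx], y[val_idx]))
    return np.mean(errors)  # average of E_1 .. E_n; pick meta_params minimizing this
```

The meta-parameter setting with the lowest average error is then retrained and evaluated once on the held-out final test set.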
For large data sets and large models, it is extremely expensive to retrain the model over and over with different meta-parameters. A low-cost alternative is early stopping: start training with small weights and let them grow as training proceeds, until the model's performance on the validation set starts to get worse. In practice it is difficult to tell exactly when performance starts to get worse, so we can keep training until we are sure performance has degraded, and then go back to the point where validation performance was best.
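A minimal sketch of this early-stopping loop, assuming we have `train_one_epoch` and `validation_error` callables for the model (the `patience` threshold for deciding that performance has clearly gotten worse is also an illustrative choice):

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=1000, patience=10):
    """Train until validation error has clearly degraded, then return
    the weights from the best validation point."""
    best_err = float("inf")
    best_model = copy.deepcopy(model)
    epochs_since_best = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                 # weights start small and gradually grow
        err = validation_error(model)
        if err < best_err:
            best_err = err
            best_model = copy.deepcopy(model)  # remember the best point so far
            epochs_since_best = 0
        else:
            epochs_since_best += 1
        if epochs_since_best >= patience:      # now we are sure it has gotten worse
            break
    return best_model                          # go back to the best-performance point
```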
Hinton also noted that early stopping limits the model's capacity because the weights do not have time to grow large. The following explains why small weights limit capacity.
Consider the neural network shown in the slide, whose hidden units are logistic units. Small weights make the total input to these units very close to 0, so their outputs fall in the middle, nearly linear region of the logistic curve. In other words, small weights make the logistic hidden units behave like linear units, so the whole network is close to a linear network that maps inputs directly to outputs. As the weights grow, the hidden units recover the full capacity of logistic units, and the model's fit to the training set gradually increases. Meanwhile the fit to the validation set first improves and then starts to get worse; the moment it starts to get worse is the time to stop training.
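To make the near-linearity concrete: around zero, the logistic function is well approximated by its tangent line, sigmoid(z) ≈ 0.5 + z/4. A quick numerical check (my own illustration, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-1.0, 1.0, 5)               # fixed inputs to one logistic unit
for w in (0.05, 1.0, 5.0):
    z = w * x                                # total input scales with the weight
    linear_approx = 0.5 + z / 4.0            # tangent line of the sigmoid at z = 0
    max_dev = np.max(np.abs(sigmoid(z) - linear_approx))
    print(f"w = {w}: max deviation from linear = {max_dev:.5f}")

# Small w keeps the input near 0, where the unit behaves linearly;
# large w pushes it into the curved, saturating region of the sigmoid.
```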
Limiting the size of the weights
This section describes how to control the capacity of a network by limiting the sizes of the weights; the standard method is to introduce a penalty that prevents the weights from becoming too large. Along with this go some implicit assumptions: a network with small weights is in some sense simpler than one with large weights. Several different methods can be used to limit the sizes of the weights, for example keeping the incoming weight vector of each hidden unit from exceeding a certain length.
The standard method is the L2 weight penalty, which adds the sum of the squared weights to the loss function: C = E + (λ/2) Σ_i w_i². In neural networks the L2 penalty is often called weight decay, because the derivative of the penalty term acts like a force that keeps pulling the weights toward zero, preventing them from growing too large. The coefficient λ on the sum of squared weights is called the weight cost, and it determines the strength of the penalty. Setting the derivative of the loss to zero shows that at a minimum w_i = -(1/λ) ∂E/∂w_i, so a weight can only end up large if it also has a large error derivative.
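Here is a minimal sketch of one gradient-descent step on that penalized loss (the learning rate, λ, and the `grad_E` callable are illustrative assumptions). Note how the penalty term becomes a steady pull of the weights toward zero, which is why the method is called weight decay:

```python
import numpy as np

def l2_penalized_step(w, grad_E, lr=0.1, lam=0.01):
    """One step of gradient descent on C = E + (lam/2) * sum(w**2).

    grad_E(w) -> dE/dw  (hypothetical gradient of the data error E)
    """
    grad_C = grad_E(w) + lam * w     # dC/dw = dE/dw + lam * w
    return w - lr * grad_C           # the lam*w term decays the weights each step

# Toy usage: E(w) = 0.5 * ||w - target||^2, so dE/dw = w - target.
target = np.array([3.0, -2.0])
w = np.zeros(2)
for _ in range(200):
    w = l2_penalized_step(w, grad_E=lambda w: w - target)
print(w)   # converges to target / (1 + lam): shrunk slightly toward zero
```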
The effects of the L2 weight penalty are listed in the slide.
The L1 weight penalty is also given; it uses the absolute values of the weights, so its graph is V-shaped (see the slide). A nice property of the L1 penalty is that it drives many weights to exactly zero, which helps us understand what is going on in the network: we only need to pay attention to the few weights that are far from zero. Sometimes we also use penalties that allow a small number of weights to remain large.
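To see the difference between the two penalties, compare the closed-form one-dimensional minimizers (my own illustration): L2 shrinks every weight proportionally, while L1 subtracts a constant amount and clips at zero, so small weights become exactly zero.

```python
import numpy as np

target = np.array([2.0, 0.3, -0.1])   # unpenalized optimum of 0.5*(w - target)**2
lam = 0.5

# L2: minimizer of 0.5*(w - t)**2 + (lam/2)*w**2 is proportional shrinkage.
w_l2 = target / (1.0 + lam)

# L1: minimizer of 0.5*(w - t)**2 + lam*|w| is soft thresholding.
w_l1 = np.sign(target) * np.maximum(np.abs(target) - lam, 0.0)

print("L2:", w_l2)   # [ 1.3333  0.2    -0.0667]  all shrunk, none exactly zero
print("L1:", w_l1)   # [ 1.5     0.     -0.    ]  small weights driven exactly to zero
```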
Besides weight penalties, we can also introduce weight constraints: for example, for the incoming weight vector of each unit, we can constrain its squared length to not exceed a certain limit. The advantages of weight constraints are enumerated in the slide.
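A minimal sketch of such a constraint: after each weight update, rescale any incoming weight vector whose squared length exceeds the limit (the limit value and layer shapes are illustrative assumptions):

```python
import numpy as np

def constrain_incoming_weights(W, max_sq_length=4.0):
    """Constrain the squared length of each hidden unit's incoming weight
    vector (one column of W, shape (n_inputs, n_hidden)) to max_sq_length.
    Columns already within the limit are left unchanged."""
    sq_lengths = np.sum(W ** 2, axis=0)   # per-unit squared lengths
    scale = np.sqrt(max_sq_length / np.maximum(sq_lengths, max_sq_length))
    return W * scale                      # scale == 1 where already within limit

W = np.random.default_rng(0).normal(size=(10, 5)) * 2.0
W = constrain_incoming_weights(W)
print(np.sum(W ** 2, axis=0))             # every squared length is now <= 4.0
```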
Using noise as a regularizer
This section describes another way to limit network capacity: using noise as a regularizer. We can add noise to the weights or to the activities (that is, the outputs of the units) to limit capacity and prevent overfitting.
Suppose we add Gaussian noise to the inputs. The variance of the noise is amplified by the squared weight before reaching the next layer. As the slide shows, in a simple network whose output is a linear function of the input, the amplified noise is added to the output and thus increases the squared error. So when the inputs are noisy, minimizing the squared error actually also minimizes the sum of the squared weights.
The slide gives a mathematical derivation; I did not originally understand why the middle term of the squared expansion is omitted. (It vanishes because the noise has zero mean and is independent of everything else; see the sketch below.) From the derivation we can see that adding noise to the inputs is effectively equivalent to adding an L2 weight penalty.
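My reconstruction of the slide's derivation, including the step in question: for a linear output y = Σ_i w_i x_i with independent zero-mean noise ε_i ~ N(0, σ_i²) added to each input, the expected squared error is

```latex
\begin{aligned}
E\!\left[\Big(y + \sum_i w_i \epsilon_i - t\Big)^{2}\right]
  &= (y - t)^{2}
   + 2\,(y - t)\,E\!\left[\sum_i w_i \epsilon_i\right]
   + E\!\left[\Big(\sum_i w_i \epsilon_i\Big)^{2}\right] \\
  &= (y - t)^{2} + \sum_i w_i^{2}\sigma_i^{2}.
\end{aligned}
```

The middle term vanishes because each ε_i has zero mean, and the cross terms inside the final expectation vanish because the ε_i are independent of one another. What remains, Σ_i w_i² σ_i², is exactly an L2 weight penalty with weight cost σ_i².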
In more complex networks, adding Gaussian noise to the weights is not exactly equivalent to a weight penalty, but it can work better, especially in recurrent neural networks. Alex Graves added noise to the weights of his recurrent network for handwriting recognition, and the results showed a significant improvement in performance.
We can also use noise in the activities as a regularizer. Roughly: a hidden unit that uses the logistic function has an output between 0 and 1. On the forward pass, we replace the logistic output with a stochastic binary value, outputting 1 with the probability given by the logistic and 0 otherwise. On the backward pass, we use the real-valued logistic output to do backpropagation correctly. The resulting model may perform somewhat worse on the training set and train more slowly, but its performance on the test set is significantly improved.
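A minimal sketch of this trick for one hidden layer (my own illustration; it is essentially a straight-through-style estimator, with stochastic binary outputs on the forward pass and the logistic probabilities reused on the backward pass):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_hidden(x, W):
    """Forward pass: sample binary activities from the logistic probabilities."""
    p = sigmoid(x @ W)                            # what the logistic unit would output
    h = (rng.random(p.shape) < p).astype(float)   # stochastic 0/1 output instead
    return h, p

def backward_hidden(x, p, grad_h):
    """Backward pass: pretend the unit output the real-valued probability p
    and backpropagate through the logistic as usual."""
    grad_z = grad_h * p * (1.0 - p)               # derivative of the logistic
    return x.T @ grad_z                           # gradient w.r.t. the weights W

x = rng.normal(size=(4, 3))          # batch of 4 examples, 3 input features
W = rng.normal(size=(3, 5)) * 0.1    # 5 stochastic binary hidden units
h, p = forward_hidden(x, W)
grad_W = backward_hidden(x, p, grad_h=np.ones_like(p))
```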
Introduction to the full Bayesian approach
In this section we introduce the Bayesian approach through a simple coin-tossing example. The main idea of the Bayesian approach is not to go straight for the single most likely setting of the model's parameters, but to consider all possible parameter settings and use the available data to work out how probable each setting is. The Bayesian framework is given in the slide:
Here is an example of a coin toss to introduce the Bayesian method.
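As a concrete version of the coin-toss example, here is a grid-based posterior over the coin's head probability p (the uniform prior and the observed counts are illustrative assumptions):

```python
import numpy as np

# Consider every possible parameter setting: p = P(heads) on a grid.
p_grid = np.linspace(0.0, 1.0, 101)
prior = np.ones_like(p_grid) / p_grid.size        # uniform prior over settings

# Observed data: say 53 heads out of 100 tosses.
heads, tails = 53, 47
likelihood = p_grid ** heads * (1.0 - p_grid) ** tails

# Bayes' rule: posterior is proportional to likelihood * prior, renormalized.
posterior = likelihood * prior
posterior /= posterior.sum()

print("Maximum-likelihood p:", p_grid[np.argmax(likelihood)])  # 0.53
print("Posterior mean of p:", np.sum(p_grid * posterior))      # ~0.53, hedged by the prior
```

Rather than committing to the single most likely p, the Bayesian approach keeps the whole posterior distribution over settings.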
I'll post this part first; the rest will be filled in tomorrow.
The Bayesian interpretation of weight decay
MacKay's quick and dirty method of setting weight costs
Neural Networks for Machine Learning, Ninth Lecture Notes