When a model becomes more accurate on the training data set, its accuracy on the test data set may go either up or down. Why is that?
Training error and generalization error
Before explaining this phenomenon, we need to distinguish between the training error and the generalization error: the former is the error the model exhibits on the training data set, while the latter is the expected error of the model on any test data sample.
We assume that every sample in the training data set and the test data set is generated independently from the same probability distribution. Under this independent and identically distributed (i.i.d.) assumption, for any given machine learning model (with its parameters and hyperparameters fixed), the expectation of its training error and its generalization error are the same.
However, because the model's parameters are learned by fitting the model to the training data set, the expectation of the training error is less than or equal to the generalization error. In other words, the parameters learned from the training data set typically make the model perform at least as well on the training data set as on the test data set.
Because the generalization error cannot be estimated from the training error, simply reducing the training error does not guarantee that the generalization error will decrease. We hope to reduce the generalization error indirectly by appropriately reducing the training error.
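As a minimal sketch (NumPy, with a synthetic linear data-generating process chosen purely for illustration and not taken from the text), the snippet below fits a simple model by least squares on i.i.d. samples and compares the squared-loss training error with the error on a large independent sample, which serves as an estimate of the generalization error.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Samples drawn i.i.d. from the same distribution: y = 2x + 1 + noise (assumed ground truth).
    x = rng.uniform(-1, 1, size=n)
    y = 2 * x + 1 + 0.1 * rng.normal(size=n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(1000)   # a large i.i.d. sample approximates the expectation

# Fit y = w * x + b by least squares (squared loss).
w, b = np.polyfit(x_train, y_train, deg=1)

def squared_error(x, y):
    return np.mean((w * x + b - y) ** 2)

print('training error:', squared_error(x_train, y_train))
print('estimated generalization error:', squared_error(x_test, y_test))
```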
Under-fitting and over-fitting
Given a test data set, we usually use the model's error on that test data set to approximate the generalization error. When the model cannot achieve a low training error, we call this phenomenon underfitting. When the model's training error is much smaller than its error on the test data set, we call this phenomenon overfitting.
In practice, we should try to avoid both underfitting and overfitting as far as possible. Although many factors can lead to these two problems, here we focus on two: model complexity and the size of the training data set.
Complexity of the model
To explain model complexity, let us take polynomial function fitting as an example. Given a training data set consisting of a scalar feature x and the corresponding scalar label y, the goal of polynomial function fitting is to find a K-th order polynomial function
\[\hat{y} = b + \sum_{k=1}^{K} x^k w_k\]
to approximate y. Here, w_k denotes the model's weight parameters and b is the bias parameter. As with linear regression, polynomial function fitting uses a squared loss function. In particular, first-order polynomial function fitting is also called linear function fitting.
Because a higher-order polynomial function has more parameters and a larger space of candidate functions, it is more complex than a lower-order one and can more easily achieve a lower training error on the same training data set. Given a training data set, the relationship between model complexity and error is typically as follows: as model complexity increases, the training error keeps decreasing, while the generalization error first decreases and then increases. If the model's complexity is too low, underfitting is likely to occur; if the model's complexity is too high, overfitting is likely to occur.
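As a concrete sketch (NumPy, with an arbitrarily chosen cubic ground truth that is not from the original text), the code below builds the feature vector (x, x^2, ..., x^K) from the formula above, fits the bias and weights by minimizing the squared loss, and compares training and test errors for a first-order (too simple), third-order (matching) and ninth-order (overly complex) model.

```python
import numpy as np

rng = np.random.default_rng(42)

def cubic_data(n):
    # Hypothetical ground truth: y = 1.2x - 3.4x^2 + 5.6x^3 + 5 plus Gaussian noise.
    x = rng.uniform(-2, 2, size=n)
    y = 1.2 * x - 3.4 * x**2 + 5.6 * x**3 + 5.0 + rng.normal(size=n)
    return x, y

def poly_features(x, K):
    # Leading column of ones for the bias b, then x^1, ..., x^K for the weights w_k.
    return np.column_stack([np.ones_like(x)] + [x**k for k in range(1, K + 1)])

def fit_and_evaluate(K, x_train, y_train, x_test, y_test):
    # Minimize the squared loss over [b, w_1, ..., w_K] with ordinary least squares.
    params, *_ = np.linalg.lstsq(poly_features(x_train, K), y_train, rcond=None)
    train_err = np.mean((poly_features(x_train, K) @ params - y_train) ** 2)
    test_err = np.mean((poly_features(x_test, K) @ params - y_test) ** 2)
    return train_err, test_err

x_train, y_train = cubic_data(20)
x_test, y_test = cubic_data(1000)

for K in [1, 3, 9]:
    train_err, test_err = fit_and_evaluate(K, x_train, y_train, x_test, y_test)
    print(f'K={K}: training error={train_err:.3f}, test error={test_err:.3f}')
```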
Training data set size
Another important factor affecting underfitting and overfitting is the size of the training data set. In general, if the training data set contains too few samples, especially fewer than the number of model parameters (counted by elements), overfitting is more likely to occur.
In addition, the generalization error does not increase as the number of samples in the training data set grows. Therefore, within the limits of available computing resources, we usually prefer a larger training data set, especially when the model is complex, such as a deep learning model with many layers.
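The following hypothetical experiment (a NumPy sketch, again using an arbitrary cubic ground truth) fixes a relatively complex model, a ninth-order polynomial, and varies only the number of training samples; with few samples the gap between training and test error tends to be large, and it shrinks as the training set grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def cubic_data(n):
    # Hypothetical ground truth: a noisy cubic in one scalar feature.
    x = rng.uniform(-2, 2, size=n)
    y = 1.2 * x - 3.4 * x**2 + 5.6 * x**3 + 5.0 + rng.normal(size=n)
    return x, y

x_test, y_test = cubic_data(1000)

# Fit a fixed 9th-order polynomial on training sets of increasing size.
for n in [15, 100, 1000]:
    x_train, y_train = cubic_data(n)
    coeffs = np.polyfit(x_train, y_train, deg=9)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f'n={n:5d}  train error={train_err:.3f}  test error={test_err:.3f}')
```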
Model selection
When selecting a model, we can split the original training data set: most of the samples form a new training data set, and the remaining samples form a validation data set.
We train the model on the new training data set and select the model based on the model's performance on the validation data set.
Finally, we evaluate the performance of the model on the test data set.
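A minimal sketch of this procedure (plain NumPy; the candidate models are polynomial fits of different orders, chosen only for illustration): split the original training data, fit each candidate on the new training set, pick the one with the lowest validation error, and evaluate only the selected model on the test set.

```python
import numpy as np

rng = np.random.default_rng(7)

def make_data(n):
    # Synthetic data standing in for the original training set and the test set.
    x = rng.uniform(-2, 2, size=n)
    y = 1.2 * x - 3.4 * x**2 + 5.6 * x**3 + 5.0 + rng.normal(size=n)
    return x, y

x_orig, y_orig = make_data(200)   # original training data set
x_test, y_test = make_data(1000)  # test data set, used only once at the end

# Split: most samples form the new training set, the rest the validation set.
n_valid = 40
x_train, y_train = x_orig[:-n_valid], y_orig[:-n_valid]
x_valid, y_valid = x_orig[-n_valid:], y_orig[-n_valid:]

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# Candidate models: polynomial fits of different orders, trained on the new training set.
candidates = {K: np.polyfit(x_train, y_train, deg=K) for K in range(1, 8)}

# Select the model with the lowest validation error.
best_K = min(candidates, key=lambda K: mse(candidates[K], x_valid, y_valid))
print('selected order:', best_K)
print('validation error:', mse(candidates[best_K], x_valid, y_valid))

# Only the selected model is evaluated on the test data set.
print('test error:', mse(candidates[best_K], x_test, y_test))
```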
K-fold cross-validation
In K-fold cross-validation, we split the original training data set into K non-overlapping sub-datasets and then perform K rounds of model training and validation. In each round, we use one sub-dataset to validate the model and the other K−1 sub-datasets to train it. Across these K rounds, the sub-dataset used for validation is different each time. Finally, we average the K training errors and the K validation errors to obtain the final training error and validation error.
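The following is a compact NumPy sketch of K-fold cross-validation; the model trained in each round is again a polynomial fit, standing in for whatever model is actually being validated.

```python
import numpy as np

def k_fold_cv(x, y, K=5, degree=3):
    """Return the averaged training and validation errors over K folds."""
    # Shuffle once, then split the indices into K non-overlapping folds.
    idx = np.random.default_rng(0).permutation(len(x))
    folds = np.array_split(idx, K)

    train_errs, valid_errs = [], []
    for i in range(K):
        valid_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != i])

        # Train on the other K-1 folds, validate on the held-out fold.
        coeffs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
        train_errs.append(np.mean((np.polyval(coeffs, x[train_idx]) - y[train_idx]) ** 2))
        valid_errs.append(np.mean((np.polyval(coeffs, x[valid_idx]) - y[valid_idx]) ** 2))

    # Average the K training errors and the K validation errors.
    return np.mean(train_errs), np.mean(valid_errs)

# Example usage with synthetic data (arbitrary cubic ground truth).
rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=100)
y = 1.2 * x - 3.4 * x**2 + 5.6 * x**3 + 5.0 + rng.normal(size=x.shape)
print(k_fold_cv(x, y, K=5, degree=3))
```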
MXNet: Under-fitting, over-fitting and model selection