We already know that we want models learned through machine learning to have good generalization ability. Put simply, this means that the learned model should not only perform well on the training samples, but also perform well on new samples.
Usually we refer to the ratio of the number of misclassified samples to the total number of samples as the error rate, and the accuracy = 1 - the error rate. For example, if a out of m samples are misclassified, the error rate is a/m and the accuracy is 1 - a/m.
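As a concrete illustration, here is a minimal sketch (the labels below are made-up example data) that computes the error rate and accuracy from predicted and true labels:

```python
def error_rate(y_true, y_pred):
    """Fraction of samples whose predicted label differs from the true label."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

# Made-up labels: m = 10 samples, a = 2 of them misclassified.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]

err = error_rate(y_true, y_pred)
print(f"error rate = {err:.2f}, accuracy = {1 - err:.2f}")  # error rate = 0.20, accuracy = 0.80
```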
More generally, we refer to the error of the model on the training set as the training error, and the error on the test set as the test error. On the premise that the test data are independent and identically distributed with the real data, the test error can be used as an approximation of the generalization error. Of course, we hope to obtain a model with a small generalization error, but since we do not know what the test set looks like in advance, we do not fix the parameters beforehand when running a machine learning algorithm; instead, we adjust the parameters to reduce the training error. Throughout this process, the expectation of the generalization error will be greater than or equal to the expectation of the training error.
To ensure that the model works well, that is, that the generalization error is relatively small, we can work on the following two factors:
Reduce the training error.
Reduce the gap between the training error and the generalization error.
These two factors correspond to the two main challenges of machine learning: underfitting and overfitting. Underfitting means that the model cannot achieve a sufficiently low error on the training set, and overfitting means that the gap between the training error and the generalization error is too large.
Underfitting is relatively easy to understand; overfitting may be less intuitive, so here is a vivid metaphor. When I was in school, some students used the "question-sea" tactic of memorizing the solution to every single problem. But as soon as a problem was changed even slightly, they could not solve it, because they had memorized the procedure for each specific question without abstracting the general rules. Intuitively, this situation is overfitting.
Let's use a picture to visually understand the difference between underfitting and overfitting. Suppose we have a dataset containing a feature X and a label Y; for simplicity, we consider only one feature. Now we fit the same data with three hypothesis functions, and the results are as follows:
Explanation of the figure: (left) The model obtained by fitting the data with a linear function underfits: it cannot capture the curvature in the data. (middle) The model obtained by fitting the data with a quadratic function generalizes well to unobserved points and shows neither significant underfitting nor overfitting. (right) The model obtained by fitting the data with a high-order polynomial overfits.
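The same idea can be reproduced in code. Below is a hedged sketch using NumPy's polyfit (the quadratic ground truth, noise level, and sample size are assumptions chosen for the illustration): a degree-1 fit underfits, a degree-2 fit is about right, and a degree-9 fit achieves a low training error but a much larger test error, i.e., it overfits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed ground truth: a quadratic relationship plus noise.
x = np.sort(rng.uniform(-3, 3, size=15))
y = 0.5 * x**2 + rng.normal(scale=0.5, size=x.shape)

x_test = np.linspace(-3, 3, 200)
y_test = 0.5 * x_test**2  # noiseless target for judging generalization

for degree in (1, 2, 9):  # underfit, reasonable fit, overfit
    coeffs = np.polyfit(x, y, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```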
Compared with underfitting, overfitting is more common in practical work. Common causes of overfitting include:
The training set and the test set have inconsistent distributions.
The size of the training set does not match the complexity of the model, i.e., the amount of training data is too small relative to the model's complexity.
There is too much noise in the samples, so much that the model memorizes the noise characteristics and ignores the true relationship between the input and the output.
No free lunch theorem
We often hear people say things like "which algorithm is better" or "algorithm A is better than algorithm B". In fact, such statements ignore a premise: they only make sense for a specific problem (task). Why? Because if we consider all potential problems, all learning algorithms are equally good. Talking about the relative merits of an algorithm is only meaningful with respect to a specific problem and task.
For example, suppose we have two datasets, each divided into a training set and a test set, and we fit the two training sets with hypothesis functions A and B respectively.
Under the data distribution in (a), both models A and B fit all the training samples perfectly, but model A outperforms model B on the test samples; under the data distribution in (b), models A and B again fit all the training samples perfectly, but this time model B outperforms model A.
The above example illustrates that no algorithm is absolutely perfect; each algorithm has data sets, that is, tasks, to which it is suited.
The No Free Lunch (NFL) theorem says that, averaged over all possible problems, the most advanced algorithm we can imagine and the clumsiest one (such as random guessing) have the same expected performance.
This conclusion may seem inconceivable, but it does hold. However, the premise is that we average over all possible data distributions. In real life, we usually only care about the problem we are trying to solve and want to find a good solution for it. Take an everyday scenario: if the subway station is 800 m from your home and you want to reach it quickly on your way to work, then a shared bicycle is a good solution; but if you want to get from Beijing to Shanghai quickly, a shared bicycle is obviously not a good choice. So what we care about is finding a suitable solution for the problem (or task) at hand.
Therefore, the goal of machine learning research is not to find a universal learning algorithm or the absolute best learning algorithm. Instead, our goal is to understand what kinds of distributions are relevant to the "real world" from which machine learning acquires experience, and what learning algorithms perform best on the data-generating distributions we care about.
Bias and variance
We hope not only to obtain a model with a small generalization error, but also to be able to explain why the model's generalization error is relatively small (or relatively large). We can decompose the model's expected generalization error into two parts, the sum of the bias and the variance, that is:
Generalization error = bias + variance. (Strictly speaking, for squared loss the decomposition is the squared bias plus the variance plus an irreducible noise term, but the simpler form above is enough for the discussion that follows.)
The bias measures how far the expected prediction of the learning algorithm deviates from the true result, i.e., the fitting ability of the learning algorithm itself. The variance measures how much the learned model's performance changes when the training set (of the same size) is changed, i.e., the effect of data perturbations, or the stability of the learning algorithm.
The bias-variance decomposition shows that generalization performance is determined by the ability of the learning algorithm, the adequacy of the data, and the difficulty of the learning task itself. For a given learning task, to achieve good generalization performance we need the bias to be small, so that the data can be fitted well, and the variance to be small, so that the influence of data perturbations is small.
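This decomposition can also be estimated empirically. Below is a minimal sketch (the data-generating function, noise level, model degree, and query point are assumptions made for the illustration): we repeatedly draw training sets of the same size, refit the same model, and measure the squared bias and the variance of its prediction at one fixed point.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(x)  # assumed ground-truth function

def sample_training_set(n=20):
    x = rng.uniform(0, 2 * np.pi, size=n)
    y = true_f(x) + rng.normal(scale=0.3, size=n)
    return x, y

x0 = 1.0      # fixed query point
degree = 1    # try degree=1 (higher bias) vs. degree=9 (higher variance)

preds = []
for _ in range(500):                  # many training sets of the same size
    x, y = sample_training_set()
    coeffs = np.polyfit(x, y, deg=degree)
    preds.append(np.polyval(coeffs, x0))

preds = np.array(preds)
bias_sq = (preds.mean() - true_f(x0)) ** 2  # squared bias at x0
variance = preds.var()                      # variance at x0
print(f"bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```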
Here we explain bias and variance with a picture of throwing darts.
The low bias and low variance in the upper-left corner correspond to the ideal model: the darts land close to the bull's-eye and are tightly clustered. The low bias and high variance in the upper-right corner correspond to an overfitted model: on average the darts land close to the bull's-eye, but they are scattered. The high bias and low variance in the lower-left corner correspond to an underfitted model: the darts land far from the bull's-eye, but they are concentrated in one spot. The high bias and high variance in the lower-right corner correspond to the worst model: the darts land on the board, but they are both far from the bull's-eye and scattered.
Generally speaking, bias and variance are in conflict. That is, when the complexity of the model is low, the bias of the model is high and the variance is low; when the complexity of the model is high, the bias of the model is low and the variance is high. The figure below shows the relationship between them.
Once we know the bias and variance of a model, we know how to optimize the algorithm in the next step.
Suppose our classifier model has an error rate of 1% on the training set and 11% on the test set. We estimate the bias as 1% and the variance as 11% - 1% = 10%. This indicates overfitting, so the next optimization step should focus on reducing the variance.
If a classifier model has an error rate of 15% on the training set and 16% on the test set, we estimate the bias as 15% and the variance as 16% - 15% = 1%. This indicates underfitting, so the next optimization step should focus on reducing the bias.
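The two examples above can be turned into a rough rule of thumb. Here is a hedged sketch (the function name and the decision rule of simply comparing the two numbers are made up for illustration, not a standard recipe):

```python
def diagnose(train_error, test_error):
    """Rough reading of bias and variance from train/test error rates.

    Treats the training error as an estimate of the bias and the gap
    between test error and training error as an estimate of the variance.
    """
    bias = train_error
    variance = test_error - train_error
    if variance > bias:
        advice = "variance dominates: likely overfitting, reduce variance next"
    else:
        advice = "bias dominates: likely underfitting, reduce bias next"
    return bias, variance, advice

print(diagnose(0.01, 0.11))  # bias 0.01, variance 0.10 -> reduce variance
print(diagnose(0.15, 0.16))  # bias 0.15, variance 0.01 -> reduce bias
```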
Bayes error
The Bayes error is also called the optimal error. Generally speaking, it refers to the error that remains even when the best possible model is used, i.e., the lowest error achievable on the task.
Humans are good at many tasks involving natural data, such as image recognition and speech recognition, and on such tasks human-level performance is not far from the Bayes level. Therefore, the Bayes error is usually approximated by the human-level error; in other words, human error can be approximately regarded as the Bayes error.
With the Bayes error, we can decompose the bias into the sum of the Bayes error and the avoidable bias, that is:
Bias = Bayes error + avoidable bias
Suppose we train a classifier model with a 15% error rate on the training set and a 30% error rate on the test set. If the Bayes error is 14%, then its avoidable bias is 15% - 14% = 1%, and its variance is 30% - 15% = 15%. In this case we should consider how to reduce the variance rather than the bias.
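The earlier sketch can be refined once an estimate of the Bayes error (for example, human-level error) is available, so that only the avoidable bias is compared against the variance:

```python
def diagnose_with_bayes(train_error, test_error, bayes_error):
    """Same rough rule as before, but subtracting an estimate of the Bayes
    error (e.g. human-level error) so only the avoidable bias is compared."""
    avoidable_bias = train_error - bayes_error
    variance = test_error - train_error
    focus = "reduce variance" if variance > avoidable_bias else "reduce bias"
    return avoidable_bias, variance, focus

# 15% training error, 30% test error, 14% (human-level) Bayes error:
print(diagnose_with_bayes(0.15, 0.30, 0.14))  # ~0.01 avoidable bias, ~0.15 variance -> reduce variance
```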
Ways to reduce bias and variance
Reducing the bias of the model reduces the risk of underfitting; reducing the variance of the model reduces the risk of overfitting. Here are some common methods.
Reduce model bias
Add new features. For example, mine combination features, context features, and ID-class features.
Increase model complexity. For example, add high-order terms to a linear model, or increase the number of layers or neurons in a neural network (see the sketch after this list).
Reduce or remove regularization. For example, lower the L1 or L2 penalty coefficient or the dropout rate.
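As an illustration of the first two points, here is a minimal scikit-learn sketch (the synthetic data and the specific pipeline are assumptions chosen for the example): adding a quadratic feature to a plain linear model increases its capacity and noticeably lowers its training error, i.e., reduces the bias.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.3, size=100)  # assumed quadratic target

plain_linear = LinearRegression().fit(X, y)
with_quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print("plain linear model, training R^2:    ", round(plain_linear.score(X, y), 3))    # underfits the curvature
print("with quadratic feature, training R^2:", round(with_quadratic.score(X, y), 3))  # fits much better
```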
Reduce model variance
Add more training data.
Reduce model complexity. For example, in a decision tree model, reduce the tree depth and perform pruning.
Add regularization. Regularization constrains the model's parameters and helps avoid overfitting (see the sketch after this list).
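And a corresponding sketch for the variance side (the degree-9 pipeline, noise level, and query point are again assumptions made for illustration): we refit the same model on many resampled training sets and compare how much its prediction at one fixed point fluctuates with and without an L2 (Ridge) penalty. Regularization shrinks the fitted parameters, so the spread of the predictions is typically much smaller.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(1)

def sample_training_set(n=20):
    # Assumed data-generating process: a noisy sine curve.
    x = rng.uniform(-3, 3, size=(n, 1))
    y = np.sin(x[:, 0]) + rng.normal(scale=0.3, size=n)
    return x, y

def make_model(alpha):
    reg = Ridge(alpha=alpha) if alpha > 0 else LinearRegression()
    return make_pipeline(PolynomialFeatures(degree=9, include_bias=False),
                         StandardScaler(), reg)

x0 = np.array([[2.5]])  # fixed query point

for alpha in (0.0, 1.0):  # no regularization vs. L2 regularization
    preds = []
    for _ in range(200):  # many training sets of the same size
        x, y = sample_training_set()
        preds.append(make_model(alpha).fit(x, y).predict(x0)[0])
    label = "no regularization" if alpha == 0 else f"Ridge(alpha={alpha})"
    print(f"{label}: prediction variance at x0 = {np.var(preds):.4f}")
```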