Course Introduction
This article summarizes how to estimate a model's out-of-sample performance with a validation set and how to use that estimate to choose among models. The topics covered are model selection and cross-validation.
Course Outline:
1. The Validation Set
2. Model Selection
3. Cross Validation
1. The Validation Set
Our goal is to find the hypothesis that makes E_out smallest. Since E_out cannot be measured directly, we have to estimate it indirectly.
From the previous lecture we know: E_out = E_in + overfit penalty (the cost of overfitting).
The effect of the overfit penalty can be reduced by regularization.
That approach works from the training sample alone. Now let's look at how data from outside the sample can be used to estimate E_out.
First, we need data from outside the training sample. Suppose for the moment that we have only one such data point: (x, y).
Then e(h(x), y) is our estimate of E_out. Here e is a pointwise error measure; squared error or any other appropriate measure will do.
Taking the expectation of this error over the choice of the point:
E[e(h(x), y)] = E_out(h)
How good the estimate is can be measured by its variance:
Var[e(h(x), y)] = E[(e(h(x), y) - E_out(h))^2] = sigma^2
With only one point, the estimate is unreliable. But with many points the situation improves: if we have a validation set of K points outside the training sample, then E_val(h) = (1/K)(e_1 + e_2 + ... + e_K), where e_k = e(h(x_k), y_k).
We use E_val as our estimate of E_out.
Similarly:
E[E_val(h)] = E_out(h)
Var[E_val(h)] = (1/K^2) * sum of Var[e_k] = sigma^2 / K
In the second line, cross terms between different points would normally appear, but since the validation points are drawn independently, the expectation of each cross term is 0, so they are omitted.
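The following is a minimal sketch of this estimate in Python. The hypothesis h, the squared-error measure, and the arrays x_val, y_val are illustrative assumptions, not the lecture's data; the point is simply that E_val is the average of K pointwise errors.

```python
import numpy as np

def e(h_x, y):
    return (h_x - y) ** 2          # pointwise error measure (squared error here)

def E_val(h, x_val, y_val):
    # average the pointwise errors over the K validation points
    return np.mean([e(h(x), y) for x, y in zip(x_val, y_val)])

# toy example with K = 5 validation points
h = lambda x: 2.0 * x + 1.0
x_val = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y_val = np.array([1.1, 2.2, 2.9, 4.1, 5.0])
print(E_val(h, x_val, y_val))      # estimates E_out(h); its variance shrinks like sigma^2 / K
```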
Clearly, as K increases, E_val becomes a better estimate of E_out. But the question is: where does this out-of-sample data come from?
In practice our data is limited, so the validation set is obtained by taking points away from the training set. As K increases, E_out itself increases (because the model is trained on fewer points). So although we estimate E_out more accurately, the estimate loses its value because the E_out being estimated has gotten worse.
So the question is, how do we choose K?
K small -> E_val is a poor estimate of E_out (high variance).
K large -> E_out increases (less data left for training).
Rule of thumb:
K = N/5
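As a small illustration of this rule of thumb, the sketch below holds out one fifth of the N sample points for validation and keeps the rest for training. The array names and the random seed are assumptions made for the example.

```python
import numpy as np

def split_train_val(x, y, seed=0):
    N = len(x)
    K = N // 5                                   # rule of thumb: K = N/5
    idx = np.random.default_rng(seed).permutation(N)
    val_idx, train_idx = idx[:K], idx[K:]
    return x[train_idx], y[train_idx], x[val_idx], y[val_idx]
```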
To make better use of the limited data set, we can evaluate a hypothesis this way and then, if it looks good, retrain it on all the data so that none of the data is wasted. Extending this idea gives the following method.
2. Model Selection
Main idea: set aside part of the sample data as a validation set and train several candidate models on the rest. Each candidate is then evaluated on the validation set, the model with the smallest E_val(h_i) is chosen as the final model, and that model is retrained on all of the sample data to get the final result.
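Below is a minimal sketch of this selection procedure. The candidates are polynomial fits of different degrees, an illustrative choice rather than the lecture's specific models, and squared error is assumed as the error measure.

```python
import numpy as np

def val_error(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

def select_model(x_train, y_train, x_val, y_val, degrees):
    # train each candidate on the training part, score it on the validation part
    scores = {d: val_error(np.polyfit(x_train, y_train, d), x_val, y_val)
              for d in degrees}
    best = min(scores, key=scores.get)           # smallest E_val(h_i) wins
    return best, scores

# After choosing the winner, retrain it on ALL of the sample data, e.g.:
#   best, _ = select_model(x_tr, y_tr, x_va, y_va, degrees=range(1, 6))
#   final = np.polyfit(np.concatenate([x_tr, x_va]), np.concatenate([y_tr, y_va]), best)
```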
However, the collected data inevitably contains noise, so the model that wins on the validation set is not necessarily the truly best one (analogous to a point made in an earlier lecture). To find the best model more reliably, we have two conflicting wishes: we want K to be small (a smaller K leaves more training data, so E_out is smaller and overfitting is reduced), and we also want K to be large (a larger K makes E_val a better estimate of E_out). Can this contradiction be resolved? How?
The answer is yes. Cross-validation resolves this contradiction well, at the cost of extra training time.
3. Cross Validation
Main idea: in each round, take a small, different subset out of the sample (for example, a single data point) to serve as validation data, and train on the remaining data. Repeat this procedure several times (for example, N times, where N is the sample size), then average the results to obtain the cross-validation estimate E_cv(h) (the averaged E_val) for each model. Because only a small amount of data is withheld each round (K small), training is barely affected; and because the procedure is repeated many times, the estimate effectively uses many data points (as if K were large).
In this way we can get a good estimate of predictive performance, similar to the second lecture. Finally, we pick the best-performing model and retrain it on all the data to get the final result.
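The sketch below shows the extreme case just described, leave-one-out cross validation, for a single model. A polynomial fit stands in for the model and squared error for the error measure; both are assumptions for the example.

```python
import numpy as np

def loo_cv_error(x, y, degree=1):
    N = len(x)
    errors = []
    for i in range(N):
        mask = np.arange(N) != i                     # leave point i out
        coeffs = np.polyfit(x[mask], y[mask], degree)
        errors.append((np.polyval(coeffs, x[i]) - y[i]) ** 2)
    return np.mean(errors)                           # E_cv: average of the N errors

# Compare candidate models by their E_cv, pick the smallest,
# then retrain the winner on all N points.
```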
Two questions remain: how do we choose the validation data in each round, and how many rounds of training are needed?
In general, number of rounds = total sample size / validation-set size.
How much data should be held out as validation data in each round?
The different requirements have different options, but one rule of thumb is:
Out-of-sample data size = Total size of the sample/10.
As for how the validation data should be chosen in each round, the following approaches can be used:
1. Each round, choose a different subset as the validation data, with no point selected twice (i.e., partition the data into disjoint folds; see the sketch after this list).
2. Randomly select a certain number of validation points in each round.
...
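Here is a minimal sketch combining approach 1 with the N/10 rule of thumb: shuffle once, cut the data into 10 disjoint folds, and use each fold as the validation set exactly once. The fit and error callables, names, and seed are assumptions for the example.

```python
import numpy as np

def ten_fold_cv_error(x, y, fit, error, n_folds=10, seed=0):
    N = len(x)
    idx = np.random.default_rng(seed).permutation(N)
    folds = np.array_split(idx, n_folds)
    errs = []
    for k in range(n_folds):
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = fit(x[train_idx], y[train_idx])        # train on the other 9 folds
        errs.append(error(model, x[folds[k]], y[folds[k]]))  # validate on fold k
    return np.mean(errs)             # average validation error over the 10 folds
```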
Reference: Caltech open course "Machine Learning and Data Mining" (Learning From Data), Lecture 13: Validation.