Introduction to the idea of the cross-validation (Cross Validation) method
Cross validation (Cross Validation), hereafter CV, is a statistical analysis method used to verify the performance of a classifier. The basic idea is to divide the original data (dataset) into groups in some sense: one part serves as the training set (train set) and the other part as the validation set (validation set). The classifier is first trained on the training set, and the trained model is then tested on the validation set; the resulting classification accuracy serves as the performance index of the classifier. Common CV methods are as follows:
1). Hold-out Method
The original data is randomly divided into two groups, one used as the training set and the other as the validation set. The classifier is trained on the training set and then verified on the validation set, and the final classification accuracy is recorded as the performance index of the classifier under the hold-out method. The benefit of this approach is that it is simple to carry out: just randomly divide the original data into two groups. Strictly speaking, however, the hold-out method is not really CV, because it does not embody the idea of cross validation. Since the original data are grouped randomly, the classification accuracy obtained on the validation set depends heavily on how the data happen to be split, so the result is not very persuasive.
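The following is a minimal sketch of the hold-out method, assuming scikit-learn and an SVM classifier as placeholder choices (the dataset, classifier, and split ratio are illustrative, not prescribed by the original text):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Randomly split the original data into a training set and a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = SVC().fit(X_train, y_train)                 # train on the training set
acc = accuracy_score(y_val, clf.predict(X_val))   # verify on the validation set
print(f"hold-out validation accuracy: {acc:.3f}")
```

Note that a different random_state (i.e. a different random split) can give a noticeably different accuracy, which is exactly the weakness described above.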
2). K-fold Cross Validation (recorded as K-CV)
The original data is divided into K groups (generally of equal size). Each subset in turn serves as the validation set while the remaining K-1 subsets form the training set, which yields K models. The average classification accuracy of these K models on their respective validation sets is the performance index of the classifier under K-CV. K is generally at least 2; in practice one usually starts from 3, and K = 2 is tried only when the original dataset is very small. K-CV effectively avoids both over-fitting and under-fitting, and its result is more persuasive.
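A minimal sketch of K-CV, assuming K = 5 and again using scikit-learn's KFold with a placeholder SVM classifier; the key point is that the reported index is the mean accuracy over the K validation folds:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, val_idx in kf.split(X):
    # Train on the K-1 folds, validate on the held-out fold.
    clf = SVC().fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[val_idx], clf.predict(X[val_idx])))

# K-CV performance index: average accuracy over the K validation folds.
print(f"5-fold CV accuracy: {np.mean(scores):.3f}")
```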
3). Leave-one-out Cross Validation (recorded as LOO-CV)
If the original data contains n samples, then LOO-CV is n-CV: each sample in turn serves as the validation set while the remaining n-1 samples form the training set, so LOO-CV produces n models. The average classification accuracy of these n models on their single-sample validation sets is used as the performance index of the classifier under LOO-CV. Compared with the previous K-CV, LOO-CV has two distinct advantages:
A. Almost all of the samples are used to train the model in every round, so the training set is closest to the distribution of the original data, which makes the evaluation more reliable.
B. There are no random factors in the experimental procedure that could affect the results, so the experiment is reproducible.
The disadvantage of LOO-CV, however, is its high computational cost: the number of models that must be built equals the number of samples in the original data, so when the sample count is large, LOO-CV becomes difficult to carry out in practice, unless each training run produces a model quickly or parallel computation can be used to reduce the time required.
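A minimal sketch of LOO-CV, assuming scikit-learn's LeaveOneOut splitter and the same placeholder SVM; with n samples it fits n models, which illustrates the computational cost discussed above:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
loo = LeaveOneOut()

hits = []
for train_idx, val_idx in loo.split(X):
    # Train on n-1 samples, validate on the single remaining sample.
    clf = SVC().fit(X[train_idx], y[train_idx])
    hits.append(clf.predict(X[val_idx])[0] == y[val_idx][0])

# LOO-CV performance index: fraction of held-out samples classified correctly.
print(f"LOO-CV accuracy: {np.mean(hits):.3f}")
```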