1. Principle
1.1 Concepts
Cross-validation is primarily used when training models for applications such as classification prediction, PCR, and PLS regression modeling. In a given sample space, most of the samples are taken as a training set to train the model, the remaining samples are predicted with the newly built model, and the prediction error or prediction accuracy on that small held-out subset is computed; these results are summed and averaged. The process is repeated K times, which is K-fold cross-validation. The sum of the squared prediction errors over all samples is called the PRESS (predicted error sum of squares).
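As a minimal sketch (not from the original), PRESS can be computed with scikit-learn by summing the squared held-out prediction errors over the K folds; an ordinary linear regressor stands in here for whatever model is being validated:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def press(X, y, n_splits=5):
    # Sum of squared prediction errors over all held-out samples (PRESS).
    total = 0.0
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True,
                                     random_state=0).split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        residuals = y[test_idx] - model.predict(X[test_idx])
        total += np.sum(residuals ** 2)
    return total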
Training set vs. test set: in pattern recognition and machine learning research, a dataset is usually divided into two subsets, the training set and the test set. The former is used to build the model, and the latter to evaluate how accurately the model predicts unknown samples; the formal term for this is generalization ability. Dividing a complete dataset into a training set and a test set must follow these key points:
1. Only the training set may be used during model training; the test set may be used only after modeling is complete, as the basis for evaluating the model's merits.
2. The number of samples in the training set must be sufficient, generally at least 50% of the total number of samples.
3. Both subsets must be sampled uniformly from the full dataset.
The last point is particularly important. Uniform sampling is intended to reduce the bias between the training/test sets and the complete set, but this is not easy to achieve. The usual practice is random sampling, which approximates uniform sampling when the sample size is large enough; randomness, however, is also the blind spot of this practice, and a place where results are often tampered with. For example, when the recognition rate is unsatisfactory, one might resample training/test sets again and again until the recognition rate on the test set is satisfactory, but strictly speaking that is cheating.
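A minimal sketch of such a split, assuming scikit-learn; the stratify argument keeps the class proportions roughly equal in both subsets, which approximates the uniform sampling required above:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 70% training / 30% test; stratify=y keeps the class proportions
# roughly equal in the two subsets, approximating uniform sampling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)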
1.2 Purpose
The purpose of cross-validation is to obtain a reliable and stable model. When building a PCR or PLS model for classification, a very important factor is how many principal components to take. Cross-validation is used to compute the PRESS value for each number of principal components; one then selects the number of components with a small PRESS value, or the number at which the PRESS value stops decreasing.
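A sketch of this component-selection procedure, assuming scikit-learn's PLSRegression; cross-validated mean squared error (proportional to PRESS for a fixed sample count) is computed for each candidate component count and the smallest is chosen:

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

def select_n_components(X, y, max_components=10, cv=5):
    # Return the component count whose cross-validated PRESS is smallest.
    press_values = []
    for n in range(1, max_components + 1):
        scores = cross_val_score(PLSRegression(n_components=n), X, y,
                                 cv=cv, scoring="neg_mean_squared_error")
        # Negated MSE times the sample count recovers a PRESS-like quantity.
        press_values.append(-scores.mean() * len(y))
    return int(np.argmin(press_values)) + 1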
The most common accuracy-testing method is cross-validation, e.g. 10-fold cross-validation: the dataset is divided into 10 parts; in turn, 9 parts are used for training and 1 for validation, and the mean of the 10 results serves as the estimate of the algorithm's accuracy. It is also common to average several runs of 10-fold cross-validation, for example 10 repetitions of 10-fold cross-validation, to obtain a more accurate estimate. Cross-validation is sometimes also called rotation estimation.
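Both variants are a few lines with scikit-learn; the classifier below is only a placeholder, and RepeatedStratifiedKFold performs the 10 x 10-fold averaging described above:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# 10 repetitions of 10-fold CV: 100 scores averaged for a stabler estimate.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)
print(scores.mean(), scores.std())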
1.3 Common forms of cross-validation:
Holdout validation
Method: the original data is randomly divided into two groups, one used as the training set and the other as the validation set. The classifier is trained on the training set and then verified on the validation set, and the final classification accuracy is recorded as the performance index of the hold-out classifier. Relative to K-fold cross-validation (K-CV), the hold-out method is also called 2-fold cross-validation (2-CV).
Strictly speaking, holdout validation is not cross-validation, because the data are never used crosswise. A portion of the initial sample is randomly selected to form the validation data, and the remainder serves as the training data. Typically, less than one third of the original sample is selected as validation data.
Pros: simple to carry out; the original data just needs to be randomly divided into two groups.
Cons: strictly speaking, the hold-out method is not CV, because it does not embody the idea of crossing. Since the raw data are grouped randomly, the final classification accuracy on the validation set depends strongly on how the data happened to be grouped, so the results of this approach are not persuasive. (The main reason is that the training set has too few samples, usually not enough to represent the distribution of the parent sample, so the recognition rate in the testing phase tends to vary widely. In addition, the variability between the two halves in 2-CV is too large to satisfy the requirement that an experimental procedure be reproducible.)
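A sketch illustrating this disadvantage (scikit-learn assumed; the classifier is arbitrary): repeating the holdout split with different random seeds shows how strongly the measured accuracy depends on the grouping:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# The holdout accuracy shifts from split to split, which is exactly
# why a single random grouping is not persuasive.
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=seed)
    acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
    print(f"seed={seed}: accuracy={acc:.3f}")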
K-fold cross-validation
In K-fold cross-validation, the initial sample is divided into K sub-samples. A single sub-sample is retained as the data for validating the model, and the other K-1 sub-samples are used for training. Cross-validation is repeated K times, with each sub-sample validated exactly once; the K results are averaged (or otherwise combined) to produce a single estimate. The advantage of this approach is that randomly generated sub-samples are used repeatedly for both training and validation, and each result is validated once. 10-fold cross-validation is the most common.
Pros: K-CV can effectively avoid both overfitting and underfitting, and the final results are more persuasive.
Cons: the value of K has to be selected.
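A minimal K-CV sketch using scikit-learn's KFold splitter directly, with K = 10 and a placeholder classifier:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

scores = []
# Each of the 10 sub-samples is held out exactly once for validation.
for train_idx, test_idx in KFold(n_splits=10, shuffle=True,
                                 random_state=0).split(X):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
print(np.mean(scores))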
Leave-one-out validation
As the name suggests, leave-one-out cross-validation (LOOCV) means that only one sample of the original data is used as the validation data each time, and all the rest are kept as training data. This step is repeated until every sample has served once as validation data. It is in fact equivalent to K-fold cross-validation with K equal to the number of original samples. In some cases efficient algorithms exist, for example with kernel regression and Tikhonov regularization.
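With scikit-learn, LOOCV is this KFold special case packaged as LeaveOneOut (a sketch; the classifier is again a placeholder):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# One sample is held out per iteration, so len(X) models are trained.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=LeaveOneOut())
print(scores.mean())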
2. In-depth
There are 3 main purposes of using the cross-validation method:
(1) To obtain as much effective information as possible from the limited learning data;
(2) By approaching the learning samples from multiple directions, cross-validation can effectively avoid falling into a local minimum;
(3) It can avoid the overfitting problem to some extent.
The cross-validation approach divides the learning data into two parts: training samples and validation samples. To obtain a good learning result, both training and validation samples should participate in learning as much as possible. In general, choosing 10-fold cross-validation achieves a good learning effect. The algorithm below is designed along these principles; its main steps are as shown below.
Algorithm
Step 1: divide the learning sample space C into K equal-sized parts.
Step 2: for i = 1 to K:
            take part i as the test set
            for j = 1 to K:
                if j != i:
                    add part j to the training set
                end if
            end for
        end for
Step 3: for each of the K training sets:
            train on that training set to obtain a classification model
            test the model on the corresponding held-out part; compute and save the evaluation metric
        end for
Step 4: compute the average performance of the K models.
Step 5: use the average classification accuracy of the K models on their test sets as the performance index of this K-CV classifier.
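A from-scratch sketch of Steps 1-5 in plain numpy, assuming a model_factory that returns any object with scikit-learn-style fit/score methods:

import numpy as np

def k_fold_cv(model_factory, X, y, k=10, seed=0):
    # Steps 1-5: split into K parts, train K models, average their scores.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)   # Step 1
    scores = []
    for i in range(k):
        test_idx = folds[i]                              # Step 2: part i tests
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = model_factory()                          # Step 3: train, evaluate
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return np.mean(scores)                               # Steps 4-5: average

For example, k_fold_cv(lambda: LogisticRegression(max_iter=1000), X, y) reproduces the K-CV estimate with any scikit-learn classifier.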
3. Mistakes often made when using cross-validation
Since many studies in the laboratory use evolutionary algorithms (EA) together with classifiers, the fitness function usually involves the classifier's recognition rate, and there are many cases where cross-validation is used incorrectly. As mentioned earlier, only training data may be used for model construction, so only the recognition rate on the training data may appear in the fitness function. The EA is the method by which the training process adjusts the model's best parameters, so the test data may only be used after the EA has finished evolving and the model parameters are fixed. How, then, should the EA be combined with cross-validation? The essence of cross-validation is to estimate the generalization error of a classification method on a dataset, not to design the classifier, so cross-validation cannot be used inside the EA's fitness function: if the samples involved in the fitness function belong to the training set, then which samples form the test set? If a fitness function uses the cross-validation training or test recognition rate, such an experimental method cannot properly be called cross-validation.
The correct way to combine EA with K-CV is to divide the dataset into K equal subsets; each time, take 1 subset as the test set and the remaining K-1 as the training set, and apply that training set to the EA's fitness function computation (there is no restriction on how the training set may be further exploited). Therefore, a correct K-CV performs K EA evolutions in total and builds K classifiers. The test recognition rate of the K-CV is the average of the recognition rates of the K classifiers, obtained by EA training, on their corresponding K test sets.
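A structural sketch of this pairing. evolve_parameters is a hypothetical stand-in for an EA (here just a small search over the neighbor count of a kNN classifier); the essential point is that its fitness computation only ever sees the training fold:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

def evolve_parameters(X_train, y_train):
    # Hypothetical EA stand-in: fitness may only touch the training fold.
    best_k, best_fit = 1, -np.inf
    for k in (1, 3, 5, 7):  # a real EA would search far more broadly
        fit = KNeighborsClassifier(k).fit(X_train, y_train).score(X_train, y_train)
        if fit > best_fit:
            best_k, best_fit = k, fit
    return best_k

X, y = load_iris(return_X_y=True)
scores = []
# One EA evolution per fold: K evolutions and K classifiers in total.
for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=0).split(X):
    best_k = evolve_parameters(X[train_idx], y[train_idx])  # test fold unseen
    clf = KNeighborsClassifier(best_k).fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))      # only after evolving
print(np.mean(scores))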