LIBSVM cross-validation and grid search (parameter selection)

1. Cross-validation
Cross-validation is a technique for assessing how well a statistical analysis or machine learning algorithm generalizes to a dataset independent of the training data; it helps avoid overfitting.
Cross-validation schemes should satisfy the following as far as possible:
1) The training set should be large enough, generally more than half of the data.
2) The training and test sets should be sampled uniformly from the full dataset.
Cross-validation is mainly divided into the following categories:
1) Double Cross-validation
Double cross-validation, also known as 2-fold cross-validation (2-CV), divides the dataset into two equal-sized subsets and trains the classifier in two rounds. In the first round, one subset is used as the training set and the other as the test set; in the second round the two are swapped and the classifier is trained again. What we care about is the recognition rate on the two test sets. In practice, however, 2-CV is rarely used. The main reason is that with only half the data, the training set is usually too small to represent the distribution of the parent population, so the recognition rates in the two test rounds can differ considerably. In addition, the variability of the random split in 2-CV often fails to meet the requirement that "the experimental procedure must be reproducible".
2) K-fold cross-validation (K-CV)
K-fold cross-validation (K-CV) is an extension of double cross-validation: the dataset is divided into k subsets, and each subset in turn serves as the test set while the remaining k-1 subsets form the training set. The procedure is repeated k times, each time selecting a different subset as the test set, and the k cross-validation recognition rates are averaged to give the result.
Advantage: every sample is used for both training and testing, and each sample is validated exactly once. 10-fold cross-validation (K = 10) is the most common choice. A minimal sketch of the K-CV loop is shown below.
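As a small illustration (my sketch, not LIBSVM code), the K-CV loop can be written as follows; evaluate is a hypothetical callback that, given the fold assignment and the index of the held-out fold, trains on the other folds and returns the recognition rate on the held-out fold:

#include <functional>
#include <vector>

double k_fold_cv(int n, int k,
                 const std::function<double(const std::vector<int>&, int)>& evaluate)
{
    std::vector<int> fold(n);
    for (int i = 0; i < n; ++i)
        fold[i] = i % k;                // uniform assignment of samples to folds

    double sum = 0.0;
    for (int j = 0; j < k; ++j)         // fold j is the test set in this round
        sum += evaluate(fold, j);       // train on the other folds, test on fold j
    return sum / k;                     // average of the k recognition rates
}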
3) Leave-one-out cross-validation (LOOCV)
If the dataset contains n samples, LOOCV is simply n-CV: each sample in turn forms a test set by itself, and the remaining n-1 samples form the training set.
Advantages:
1) In each round almost all of the samples are used to train the model, so the training set is closest to the distribution of the parent population and the estimated generalization error is more reliable. LOOCV is therefore worth considering when the experimental dataset is small.
2) There is no random partitioning involved, so the experimental procedure is fully reproducible.
The disadvantage of LOOCV is its computational cost: the number of models to train equals the total number of samples, so when the dataset is large LOOCV is impractical in practice, unless each model trains very quickly or the computation can be parallelized to reduce the time required.
LIBSVM provides the function void svm_cross_validation(const struct svm_problem *prob, const struct svm_parameter *param, int nr_fold, double *target); its parameters have the following meanings (a usage sketch follows the list):
prob: the problem to be solved, i.e. the sample data.
param: the SVM training parameters.
nr_fold: as the name implies, the K in K-fold cross-validation; if k = n this is leave-one-out.
target: the predicted values; for a classification problem these are the class labels.
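As a usage sketch (my own, assuming prob and param have already been filled in and checked with svm_check_parameter), the cross-validation accuracy for a classification problem can be computed from target like this:

#include <cstdlib>
#include "svm.h"

double cv_accuracy(const struct svm_problem *prob,
                   const struct svm_parameter *param, int nr_fold)
{
    double *target = (double *)malloc(prob->l * sizeof(double));
    int correct = 0;

    svm_cross_validation(prob, param, nr_fold, target);  // fills target with CV predictions
    for (int i = 0; i < prob->l; i++)
        if (target[i] == prob->y[i])                     // compare with the true label
            ++correct;

    free(target);
    return 100.0 * correct / prob->l;                    // accuracy in percent
}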
2. Parameter selection
When using an SVM, parameters must be set, whether with LIBSVM or SVMlight. Taking the RBF kernel as an example, the authors of "A Practical Guide to Support Vector Classification" point out that the RBF kernel involves two parameters: C and g (gamma). For a given problem we do not know in advance which values of C and g are best, so model selection (a parameter search) is needed. The goal is to find a good (C, g) pair so that the classifier can accurately predict unknown data, such as a test set. Note that pursuing high accuracy on the training set alone may be useless (this is what generalization means). As described in the previous section, cross-validation is the tool for measuring generalization ability.
In the guide, the authors recommend using grid search to find the best C and g. Grid search simply tries all candidate (C, g) pairs and then selects, via cross-validation, the pair with the highest cross-validation accuracy. The grid search approach is straightforward but looks primitive; there are indeed more advanced methods, such as approximate algorithms or heuristic searches that reduce the cost. Even so, we tend to use the simple grid search, because:
1) Psychologically, an approximate or heuristic algorithm leaves people uneasy, since it does not search the parameter space exhaustively.
2) With only a few parameters, the cost of grid search is not much higher than that of the advanced algorithms.
3) Grid search is easy to parallelize, because each (C, g) pair is independent of the others.
To put it plainly, grid search is just an n-layer nested loop, where n is the number of parameters. Taking the RBF kernel as the example again, the skeleton of the program looks like this:
for (double c = c_begin; c < c_end; c += c_step)
{
    for (double g = g_begin; g < g_end; g += g_step)
    {
        /* run cross-validation with this (c, g) pair and record its accuracy */
    }
}
The optimal C and g can be found with this two-layer loop.
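Putting the pieces together, here is a hedged sketch of the search over an exponential grid, as recommended in the practical guide; the ranges (C = 2^-5 ... 2^15, gamma = 2^-15 ... 2^3) and the cv_accuracy() helper from the sketch above are my assumptions, not something fixed by LIBSVM:

#include <cmath>
#include <cstdio>
#include "svm.h"

double cv_accuracy(const struct svm_problem *prob,
                   const struct svm_parameter *param, int nr_fold);  // sketch above

void grid_search(const struct svm_problem *prob, struct svm_parameter *param, int nr_fold)
{
    double best_acc = -1.0, best_c = 1.0, best_g = 1.0;

    for (double log2c = -5; log2c <= 15; log2c += 2)       // C = 2^-5, 2^-3, ..., 2^15
        for (double log2g = -15; log2g <= 3; log2g += 2)   // gamma = 2^-15, ..., 2^3
        {
            param->C = pow(2.0, log2c);
            param->gamma = pow(2.0, log2g);
            double acc = cv_accuracy(prob, param, nr_fold);
            if (acc > best_acc)                            // remember the best pair so far
            {
                best_acc = acc;
                best_c = param->C;
                best_g = param->gamma;
            }
        }

    printf("best C = %g, best gamma = %g, CV accuracy = %g%%\n", best_c, best_g, best_acc);
    param->C = best_c;                                     // train the final model with these
    param->gamma = best_g;
}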
Appendix:
Common mistakes when using cross-validation
Since much of the work in our laboratory uses evolutionary algorithms (EA) together with classifiers, and the fitness function is usually based on the classifier's recognition rate, cross-validation is frequently misused. As mentioned earlier, only the training data may be used to build the model, so only the recognition rate on the training data may appear in the fitness function. The EA is part of the training process that tunes the model's parameters, so the test data may only be used after the EA has finished evolving and the model parameters are fixed. (Of course, if you let the test data leak into model training, the model will look better, because it already contains prior knowledge of the test set, which is then no longer unknown data.)
So how should an EA be combined with cross-validation? The essence of cross-validation is to estimate the generalization error of a classification method on a given dataset, not to design the classifier itself, so cross-validation must not appear inside the EA's fitness function: the samples involved in the fitness function belong to the training set, so which samples would be left as the test set? If a fitness function uses a cross-validation training or test recognition rate, such an experimental procedure cannot be called cross-validation.
The correct way to combine an EA with K-CV is to divide the dataset into K equal subsets, take one subset as the test set and the remaining K-1 as the training set, and use only that training set in the EA's fitness computation (there is no restriction on how the training set may be further exploited). A correct K-CV therefore runs the EA K times and builds K classifiers. The K-CV test recognition rate is the average of the recognition rates of the K classifiers obtained by EA training, each evaluated on its corresponding test set.
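To make the protocol concrete, here is a self-contained sketch (my illustration, not the author's code): the "classifier" is a trivial 1-D threshold and the "EA" is a plain random search, both stand-ins, but the structure of the loop is the point: within each fold the search (the fitness function) sees only the K-1 training folds, and the held-out fold is scored only after the search has finished.

#include <cstdlib>
#include <iostream>
#include <vector>

int main()
{
    const int K = 5, n = 200;
    std::vector<double> x(n);
    std::vector<int> y(n), fold(n);
    for (int i = 0; i < n; ++i)                     // synthetic data: label = (x > 0.5)
    {
        x[i] = (double)std::rand() / RAND_MAX;
        y[i] = x[i] > 0.5 ? 1 : 0;
        fold[i] = i % K;                            // assign sample i to a fold
    }

    // accuracy of threshold classifier thr on either the test fold k or the K-1 training folds
    auto accuracy = [&](double thr, int k, bool on_test)
    {
        int correct = 0, total = 0;
        for (int i = 0; i < n; ++i)
        {
            if ((fold[i] == k) != on_test) continue;
            correct += ((x[i] > thr ? 1 : 0) == y[i]);
            ++total;
        }
        return total ? (double)correct / total : 0.0;
    };

    double sum = 0.0;
    for (int k = 0; k < K; ++k)
    {
        // "EA" phase: random search whose fitness is the TRAINING-fold recognition rate only
        double best_thr = 0.0, best_fit = -1.0;
        for (int t = 0; t < 100; ++t)
        {
            double thr = (double)std::rand() / RAND_MAX;
            double fit = accuracy(thr, k, false);
            if (fit > best_fit) { best_fit = fit; best_thr = thr; }
        }
        // test phase: the held-out fold is used only after the search has finished
        sum += accuracy(best_thr, k, true);
    }
    std::cout << "K-CV test recognition rate: " << sum / K << "\n";
    return 0;
}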