Libsvm cross-validation and grid search (parameter selection)


First, cross-validation.

Cross-validation is a method for evaluating how well a statistical analysis or machine learning algorithm generalizes to data independent of the training data; it helps avoid over-fitting.
Cross-validation should generally meet the following requirements:

1) The training set should be sufficiently large, generally more than half of the data.
2) The training and test sets should be sampled uniformly from the data.

Cross-validation is mainly divided into the following types:
1) Double cross-validation
Double cross-validation, also known as 2-fold cross-validation (2-CV), divides the dataset into two subsets of equal size and trains the classifier in two rounds. In the first round, one subset is used as the training set and the other as the test set; in the second round, the roles are swapped and the classifier is trained again. The recognition rates on the two test sets are what we care about. In practice, however, 2-CV is not commonly used, mainly because the training set is too small to represent the distribution of the parent sample, which leads to a large gap between recognition rates in the test phase. In addition, the variability of the subsets in 2-CV is too large to satisfy the requirement that the experiment be replicable.

2) k-fold cross-validation (k-CV)
K-fold cross-validation (k-CV) is an extension of double cross-validation. The dataset is divided into k subsets; each subset in turn serves as the test set while the remaining k-1 subsets form the training set. The procedure is repeated k times, selecting a different subset as the test set each time, and the average of the k test recognition rates is reported as the result.
Advantage: every sample is used for both training and testing, and each sample is tested exactly once. 10-fold cross-validation is the most common choice.

3) Leave-one-out cross-validation (LOOCV)
Assume the dataset contains n samples; LOOCV is then simply n-CV: each sample in turn is used as the test set, and the remaining n-1 samples form the training set.
Advantages:
1) Almost all samples are used to train the model in each round, so the training set comes closest to the parent sample distribution and the estimated generalization error is reliable. LOOCV is therefore useful when the dataset contains few samples.
2) No random factors enter the experiment, so the procedure is fully reproducible.
The disadvantage of LOOCV is its computational cost: the number of models to be trained equals the total number of samples, so LOOCV is impractical when the dataset is large, unless each model trains quickly or parallel computing can be used to reduce the runtime.

Libsvm provides the function void svm_cross_validation(const struct svm_problem *prob, const struct svm_parameter *param, int nr_fold, double *target). Its parameters are as follows:

prob: the classification problem to be solved, i.e. the sample data.

param: the SVM training parameters.

nr_fold: as the name implies, the k in k-fold cross-validation. If k = n, this is leave-one-out cross-validation.

target: the predicted values; for a classification problem, these are the predicted class labels.
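As a minimal sketch of how this is typically called: the cross_validation_accuracy wrapper below is our own illustration (only svm_cross_validation itself comes from libsvm), and it assumes prob and param have already been filled in elsewhere.

#include <stdlib.h>
#include "svm.h"   /* libsvm header */

/* Run nr_fold-fold cross-validation and return the accuracy.
 * prob and param are assumed to be set up (and validated) elsewhere. */
double cross_validation_accuracy(const struct svm_problem *prob,
                                 const struct svm_parameter *param,
                                 int nr_fold)
{
    double *target = malloc(prob->l * sizeof(double));
    int correct = 0;

    /* libsvm fills target[i] with the cross-validated prediction
       for sample i */
    svm_cross_validation(prob, param, nr_fold, target);

    for (int i = 0; i < prob->l; i++)
        if (target[i] == prob->y[i])
            correct++;

    free(target);
    return (double)correct / prob->l;
}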

 

Next, we discuss parameter selection.

Parameters must be set for libsvm and SVMlight. Take the RBF kernel as an example: in A Practical Guide to Support Vector Classification, the authors note that the RBF kernel involves two parameters, C and g. For a given problem, we do not know in advance which values of C and g are best, so model selection (parameter search) is needed. The goal is to find a good (C, g) pair so that the classifier predicts unknown data (such as a test set) accurately. Note that pursuing high accuracy on the training set alone may be useless, since what matters is generalization ability. As described in the previous section, cross-validation is the tool for measuring generalization ability.
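For reference, the RBF kernel with width parameter g (often written as gamma) is

K(x_i, x_j) = \exp\left(-g \, \lVert x_i - x_j \rVert^2\right), \qquad g > 0

and C is the penalty parameter of the error term in the SVM training objective.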

In that guide, the authors recommend a "grid search" to find the best C and g: try many possible (C, g) pairs, run cross-validation for each, and pick the pair that yields the highest cross-validation accuracy. Grid search is intuitive but looks somewhat primitive; there are in fact more advanced approaches, such as approximate algorithms and heuristic search, that reduce the complexity. We nevertheless tend to prefer the simple grid search, for three reasons:

1) Psychologically, using an approximate or heuristic algorithm instead of a comprehensive parameter search makes people feel insecure.

2) When the number of parameters is small, the complexity of grid search is not much higher than that of the advanced algorithms.

3) "grid search" is highly parallel, because each (C, g) pair is independent of each other.

In short, grid search is just an n-level nested loop, where n is the number of parameters. Still using the RBF kernel as an example, the implementation looks like this:

for (double c = c_begin; c < c_end; c += c_step)
{
    for (double g = g_begin; g < g_end; g += g_step)
    {
        // Perform cross-validation here and record the accuracy for (c, g).
    }
}

The optimal C and g are the pair with the highest accuracy found by this two-level loop.
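As a concrete sketch: the Practical Guide suggests trying exponentially growing sequences, e.g. C = 2^-5, 2^-3, ..., 2^15 and g = 2^-15, 2^-13, ..., 2^3. A minimal C version, reusing the cross_validation_accuracy helper sketched earlier (the helper and the fold count of 5 are our own choices, not libsvm requirements):

#include <math.h>
#include "svm.h"

/* Exhaustive search over an exponential (C, g) grid; returns the best
 * cross-validation accuracy and reports the winning pair through
 * best_c and best_g. */
double grid_search(const struct svm_problem *prob,
                   struct svm_parameter *param,
                   double *best_c, double *best_g)
{
    double best_acc = 0.0;

    for (int log2c = -5; log2c <= 15; log2c += 2) {
        for (int log2g = -15; log2g <= 3; log2g += 2) {
            param->C = pow(2.0, log2c);      /* penalty parameter */
            param->gamma = pow(2.0, log2g);  /* RBF kernel width */

            double acc = cross_validation_accuracy(prob, param, 5);
            if (acc > best_acc) {
                best_acc = acc;
                *best_c = param->C;
                *best_g = param->gamma;
            }
        }
    }
    return best_acc;
}

Because every (C, g) pair is independent, the two loops can also be distributed across machines or threads without any coordination beyond collecting the results.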

Appendix:

Common mistakes when using cross-validation

Many laboratory studies apply evolutionary algorithms (EA) together with classifiers, usually using the classifier recognition rate as the fitness function, yet cross-validation is still frequently misused in these studies. As mentioned above, only training data may be used for model construction, so only the recognition rate on training data may appear in the fitness function. The EA is the method used to tune the model parameters during training; the test data may be used only after the EA has finished evolving and the model parameters are fixed. (Of course, if you want to cheat, you can include test-set data in model training. This will inflate the measured performance, because the model already contains prior knowledge of the test set, which is then no longer unknown data.)

How, then, should EA and cross-validation work together? The essence of cross-validation is to estimate the generalization error of a classification method on a dataset, not to design the classifier itself, so cross-validation cannot be used inside the EA's fitness function: the samples the fitness function touches belong to the training set, so which samples would be left as the test set? If a fitness function uses cross-validation's training or test recognition rate, such an experimental procedure cannot be called cross-validation.

The correct way to combine EA with k-CV is to divide the dataset into k equal subsets, take one subset as the test set each time, and use the remaining k-1 subsets as the training set, applying only the training set in the EA's fitness computation (there is no restriction on how the training set is further used). Done correctly, k-CV therefore runs the EA k times and builds k classifiers, and the k-CV test recognition rate is the average of the recognition rates of these k classifiers on their corresponding k test sets.
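A pseudocode-style sketch in C of this procedure; Dataset, Model, split_fold, evolve_on, and test_accuracy are hypothetical placeholders, not libsvm or real EA library calls:

/* Correct EA + k-CV combination: the EA only ever sees the training
 * folds; each test fold is touched once, after evolution has finished. */
double ea_kfold_accuracy(Dataset data, int k)
{
    double sum = 0.0;

    for (int i = 0; i < k; i++) {
        Dataset train, test;
        split_fold(data, k, i, &train, &test);  /* fold i is the test set */

        Model m = evolve_on(train);     /* EA fitness computed on train only */
        sum += test_accuracy(m, test);  /* parameters fixed before this call */
    }
    return sum / k;  /* k-CV recognition rate: average over the k folds */
}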

References:

http://blog.sina.com.cn/s/blog_4998f4be0100awon.html

http://www.shamoxia.com/html/y2010/2245.html

http://fuliang.javaeye.com/blog/769440
