[Repost] On experimental validation

Source: Internet
Author: User

See: http://blog.sciencenet.cn/home.php?mod=space&uid=830496&do=blog&id=676326

Cross-validation (sometimes also called rotation estimation) is a statistically useful method of cutting a data sample into smaller subsets. The analysis is first performed on one subset, while the remaining subsets are used to validate and verify that analysis. The subset used at the start is called the training set; the other subsets are called validation sets or test sets.

Cross-validation provides strong guidance and verification for artificial intelligence, machine learning, pattern recognition, classifier design, and so on.
The basic idea is to partition the original data (dataset) in some sense: one part serves as the training set (train set) and the other as the validation set (validation set or test set). The classifier is first trained on the training set, and the trained model is then tested on the validation set; the resulting accuracy serves as the performance index of the classifier.

The three major CV methods
1). Hold-out Method

    • Method: The original data are randomly divided into two groups, one used as the training set and the other as the validation set. The classifier is trained on the training set, the model is then verified on the validation set, and the final classification accuracy is recorded as the performance index of the hold-out classifier. Relative to K-fold cross-validation (K-CV), the hold-out method is also called 2-fold cross-validation (2-CV) or double cross-validation.
    • Pros: simple to handle; the original data only need to be randomly divided into two groups.
    • Cons: strictly speaking, the hold-out method is not a true CV, because it does not embody the idea of crossing. Since the raw data are grouped randomly, the final validation-set accuracy depends heavily on how the data happen to be grouped, so the result is not persuasive. (The main reason is that the training set is often too small to represent the distribution of the parent sample, so the recognition rate measured in the test phase can vary widely. Moreover, in 2-CV the variation between the two half-splits is too large to satisfy the requirement that an experiment be reproducible.)
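The single random split described above can be sketched in a few lines. This is a minimal illustration using only NumPy; `holdout_split` and its parameter names are my own, not from the original post.

```python
import numpy as np

def holdout_split(X, y, test_frac=0.3, seed=0):
    """Hold-out method: one random partition into a training set
    and a validation set; no crossing is involved."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # random grouping of the raw data
    n_test = int(len(X) * test_frac)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Example: 10 samples, 30% held out for validation
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_tr, X_te, y_tr, y_te = holdout_split(X, y)
print(len(X_tr), len(X_te))  # 7 3
```

Because a different `seed` produces a different grouping, the measured accuracy varies from split to split, which is exactly the weakness described above.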

2). K-fold Cross Validation (recorded as K-CV)

    • Method: The original data are divided into K groups (usually of equal size). Each subset in turn serves as the validation set while the remaining K-1 subsets form the training set, yielding K models. The mean classification accuracy of the K validation sets is the performance index of the K-CV classifier. K must be at least 2; in practice one usually starts from 3, and K = 2 is tried only when the dataset is very small. A K-CV experiment therefore builds K models and averages the recognition rate over the K test sets. In practice K should be large enough that each round has a sufficient number of training samples; K = 10 (as an empirical value) is usually quite adequate.

Figure: a 5-fold cross-validation scheme.

    • Advantages: K-CV effectively avoids both overfitting and underfitting, and the final result is more persuasive.
    • Disadvantages: the result depends on the choice of the value of K.
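The K-fold procedure can be sketched with plain NumPy. The helper name `kfold_indices` is mine; the fold accuracies here are stand-ins, since the original post does not fix a particular classifier.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs for K-fold CV: each of the k
    roughly equal folds serves once as the validation set."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

# 12 samples, K = 3: three models, each trained on 8 and validated on 4
accs = []
for train, val in kfold_indices(12, 3):
    assert len(train) == 8 and len(val) == 4
    accs.append(1.0)  # stand-in for this fold's classification accuracy
print(sum(accs) / len(accs))  # the K-CV performance index is the mean
```

Every sample appears in exactly one validation fold, which is what distinguishes K-CV from repeated hold-out splits.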

3). Leave-one-out Cross Validation (recorded as LOO-CV)

    • Method: If the original data contain N samples, LOO-CV is N-CV: each sample in turn serves as the validation set while the remaining N-1 samples form the training set. LOO-CV thus yields N models, and the mean classification accuracy of their N validation sets is the performance index of the LOO-CV classifier.
    • Advantages: Compared with K-CV, LOO-CV has two obvious advantages: A. In every round almost all samples are used to train the model, so the training data are closest to the distribution of the original sample, which makes the evaluation more reliable. B. No random factor in the experimental process can affect the results, so the experiment is guaranteed to be reproducible.
    • Cons: The computational cost is high, because the number of models to build equals the number of samples. When the sample count is large, LOO-CV is hard to carry out in practice, unless each model trains quickly or parallelization is used to cut the computation time.
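LOO-CV is just the K = N special case, and needs no randomness at all. A minimal sketch (the helper name `loo_indices` is mine):

```python
import numpy as np

def loo_indices(n):
    """Leave-one-out CV: n folds, each holding out exactly one sample.
    Deterministic, so the experiment is fully reproducible."""
    idx = np.arange(n)
    for i in range(n):
        yield np.delete(idx, i), idx[i:i + 1]

n_models = sum(1 for _ in loo_indices(5))
print(n_models)  # 5: one model per sample
```

The cost disadvantage is visible directly: with N samples you must train N models, one per iteration of the loop.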

In pattern recognition and machine learning research, the dataset is often divided into two subsets: the training set, used to build the model, and the test set, used to evaluate how accurately the model predicts unknown samples, formally called its generalization ability.

Cross-validation core principles
Cross-validation is an experimental method designed to estimate the generalization error effectively.

Only the training set may be used while training the model; the test set must be used only to evaluate the model after training is complete.

  • Common erroneous application: Many studies combine an evolutionary algorithm (EA, e.g. a genetic algorithm) with a classifier, and the fitness function used is usually the classifier's recognition rate; here cross-validation is misused very often. As noted above, only training data may be used to build the model, so only the recognition rate on the training data may appear in the fitness function. The EA is the method used during training to tune the model's best parameters, so the test data may be used only after the EA has finished evolving and the model parameters are fixed.
  • Research methods that combine EA and CV: The essence of cross-validation is to estimate the generalization error of a classification method on a given dataset, not to design the classifier itself, so cross-validation must not appear in the EA's fitness function. If the samples involved in the fitness function all belong to the training set, then which samples form the test set? If a fitness function uses the cross-validation training or test recognition rate, such an experimental method cannot be called cross-validation.
  • The correct way to combine EA and K-CV: Divide the dataset into K equal subsets; each time take one subset as the test set and the remaining K-1 as the training set, and apply that training set to the EA's fitness-function computation (there is no restriction on how the training set may be further exploited). The correct K-CV therefore runs the EA evolution K times and builds K classifiers. The K-CV test recognition rate is the average of the recognition rates that the K classifiers obtained by EA training achieve on the corresponding K test sets.

Dataset segmentation principles
When cross-validation splits the original dataset into a training set and a test set, two key points must be followed:

  1. The number of training samples must be large enough, generally at least 50% of the total number of samples.
  2. Two sets of subsets must be sampled evenly from the full set.

The second point is particularly important. Uniform sampling aims to reduce the bias between the training/test sets and the complete set, but it is not easy to achieve. The usual practice is random sampling; when the sample size is large enough, the effect of uniform sampling is approached. But this randomness is also a blind spot of the practice, and often a place where data can be rigged: for example, when the recognition rate is unsatisfactory, the training and test sets are re-sampled again and again until the test recognition rate is satisfactory, which is, strictly speaking, cheating.
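One common way to approximate uniform sampling for classification data is to split each class separately (stratified sampling), so the class proportions survive the split. A minimal NumPy sketch; `stratified_holdout` is my own helper name.

```python
import numpy as np

def stratified_holdout(y, test_frac=0.25, seed=0):
    """Split indices so each class keeps (roughly) the same proportion
    in the training and test subsets, reducing sampling bias."""
    rng = np.random.default_rng(seed)
    train, test = [], []
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))  # shuffle within class
        n_test = int(round(len(idx) * test_frac))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return np.array(train), np.array(test)

y = np.array([0] * 8 + [1] * 4)      # imbalanced labels: 8 vs 4
train, test = stratified_holdout(y)
print(len(train), len(test))         # 9 3
print(y[test].tolist())              # the 2:1 class ratio is preserved
```

With plain random sampling on a small, imbalanced set, a split can easily miss the minority class entirely; stratification rules that out.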

See: http://blog.sciencenet.cn/home.php?mod=space&uid=830496&do=blog&id=676326

http://hi.baidu.com/872911713/blog/item/d9420ff6767cbf0fb07ec5f4.html

http://blog.sina.com.cn/s/blog_688077cf0100zqpj.html

