K-fold cross-validation randomly divides the original data into K parts. Of these K parts, one is chosen as the test data and the remaining K-1 parts serve as the training data.
The cross-validation process repeats the experiment K times. Each time, a different one of the K parts is selected as the test data (so that every part is tested exactly once) and the remaining K-1 parts are used as the training data. Finally, the results of the K experiments are averaged.
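The partitioning described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `k_fold_splits` and its parameters are assumptions made for this example and do not come from the original post.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random division of the data
    # Striding gives K parts whose sizes differ by at most one sample.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # The other K-1 parts together form the training data.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_splits(n_samples=10, k=5))
for train, test in splits:
    print("train:", sorted(train), "test:", sorted(test))
```

Each index appears in exactly one test part, so every sample is tested once over the K experiments.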
Http://www.ilovematlab.cn/thread-49143-1-1.html
Introduction to the idea of the Cross Validation method
In the following, Cross Validation is abbreviated as CV. CV is a statistical analysis method used to verify the performance of a classifier. The basic idea is to divide the original data (dataset) into groups in some way, with one part used as the training set (train set) and the other part as the validation set (validation set). The classifier is first trained on the training set, and the trained model is then tested on the validation set; the result serves as the performance index for evaluating the classifier. Common CV methods are as follows:
1). Hold-out Method
The original data is randomly divided into two groups, one used as the training set and the other as the validation set. The classifier is trained on the training set and the model is then validated on the validation set; the resulting classification accuracy is recorded as the performance index of the classifier under the Hold-out Method. The advantage of this method is that it is simple to carry out: the original data only needs to be randomly divided into two groups. Strictly speaking, however, the Hold-out Method is not really CV, because it involves no crossing of the groups. Since the original data is grouped randomly, the classification accuracy on the validation set depends heavily on how the data happened to be split, so the results of this method are not very convincing.
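The sensitivity of the hold-out result to the random grouping can be tried out with a small sketch. Everything here is an assumption made for illustration: the toy one-dimensional data, the 70/30 split, and the 1-nearest-neighbour classifier are not taken from the original post.

```python
import random

def hold_out_split(samples, train_fraction=0.7, seed=0):
    """Randomly divide the samples into a training set and a validation set."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def accuracy(train, validation):
    """Fraction of validation samples classified correctly."""
    hits = sum(predict_1nn(train, x) == y for x, y in validation)
    return hits / len(validation)

# Hypothetical toy data: (feature, label) pairs from two overlapping classes.
rng = random.Random(42)
data = ([(rng.gauss(0.0, 1.0), 0) for _ in range(30)]
        + [(rng.gauss(1.5, 1.0), 1) for _ in range(30)])

# Only the seed of the split changes between runs; the data stays the same.
for seed in range(5):
    train, validation = hold_out_split(data, seed=seed)
    print(f"seed={seed}  validation accuracy={accuracy(train, validation):.2f}")
```

Any variation in the printed accuracies comes solely from how the data was grouped, which is the weakness described above.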
2). K-fold Cross Validation (denoted K-CV)
The original data is divided into K groups (usually of equal size). Each subset is used once as the validation set while the remaining K-1 subsets together form the training set, which yields K models. The average of the classification accuracies of these K models on their validation sets is used as the performance index of the classifier under K-CV. K must be at least 2 and in practice is usually chosen starting from 3; K = 2 is tried only when the original dataset is very small. K-CV helps to avoid overfitting and underfitting, and its results are also more convincing.
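The K-CV performance index, the average of the K validation accuracies, can be sketched as below. The 1-nearest-neighbour classifier, the toy data, and the name `k_fold_accuracy` are assumptions chosen to keep the example self-contained; any classifier could be substituted.

```python
import random
from statistics import mean

def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def k_fold_accuracy(samples, k, seed=0):
    """Return (average accuracy, list of the K per-fold accuracies)."""
    if not 2 <= k <= len(samples):
        raise ValueError("k must be between 2 and the number of samples")
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # K groups of near-equal size
    scores = []
    for i, validation in enumerate(folds):
        # Train on the other K-1 groups, validate on group i.
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        hits = sum(predict_1nn(train, x) == y for x, y in validation)
        scores.append(hits / len(validation))
    return mean(scores), scores

# Hypothetical toy data: (feature, label) pairs from two overlapping classes.
rng = random.Random(42)
data = ([(rng.gauss(0.0, 1.0), 0) for _ in range(30)]
        + [(rng.gauss(1.5, 1.0), 1) for _ in range(30)])

average, per_fold = k_fold_accuracy(data, k=5)
print("per-fold accuracy:", [round(s, 2) for s in per_fold])
print("K-CV performance index:", round(average, 2))
```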
3). Leave-One-Out Cross Validation (denoted LOO-CV)
If the original data contains N samples, LOO-CV is simply N-CV: each sample in turn serves alone as the validation set while the remaining N-1 samples form the training set, so LOO-CV yields N models. The average of the classification accuracies of these N models on their validation sets is used as the performance index of the classifier under LOO-CV. Compared with K-CV, LOO-CV has two distinct advantages:
① Almost all of the samples are used to train the model in every round, so the training set is as close as possible to the distribution of the original data and the resulting evaluation is more reliable.
② No random factors in the experimental process affect the results, which ensures that the experiment is reproducible.
The disadvantage of LOO-CV is its high computational cost: the number of models to be built equals the number of samples in the original data. When the number of samples is large, LOO-CV is almost impractical, unless each model can be trained very quickly or parallelization can be used to reduce the computation time.
If you understand K-fold cross-validation, leave-one-out means much the same thing. K-fold takes 1/K of the whole sample as the prediction set and (K-1)/K as the training set. A model is built on the training samples and then used to predict the prediction samples.
Leave-one-out uses N-1 samples as the training set and leaves one sample out as the prediction set. This is repeated in a loop so that every sample serves as the prediction set once, and the cross-validation accuracy is then computed.
http://blog.xuite.net/x5super/studyroom/61471385-%E4%B8%80%E7%AF%87%E5%BE%88%E6%A3%92%E7%9A%84%E6%B8%AC%E8 %a9%a6%28%e5%9b%9e%e6%b8%ac%29%e6%8a%80%e8%a1%93%e6%96%87%e7%ab%a0
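The leave-one-out loop can be sketched in a few lines. The 1-nearest-neighbour classifier and the six toy samples are assumptions made for illustration only. Note that no shuffling is needed, since every sample is left out exactly once regardless of order.

```python
def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def loo_accuracy(samples):
    """Leave-one-out accuracy: one model per sample, N models in total."""
    hits = 0
    for i, (x, y) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]  # the other N-1 samples
        hits += predict_1nn(train, x) == y     # predict the single left-out sample
    return hits / len(samples)

# Hypothetical toy data: (feature, label) pairs.
data = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]
print("models built:", len(data))
print("LOO-CV accuracy:", loo_accuracy(data))
```

Because the procedure involves no random step, repeated runs on the same data give the same accuracy, while the loop length shows why the cost grows with the number of samples.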