Big data processing methods (experimental methods, part two)

Source: Internet
Author: User

One: Introduction to cross-validation (with three experimental methods attached)

(1) Definition: Cross-validation is used mainly in modeling applications, for example regression modeling with PCR (principal component regression) or PLS (partial least squares regression). Given a set of modeling samples, most of them are used to build the model and a small held-out portion is predicted with that model; the prediction errors on the held-out portion are computed and their sum of squares is recorded. The process repeats until every sample has been predicted exactly once. The sum of the squared prediction errors over all samples is called PRESS (predicted error sum of squares). "From Wikipedia: https://zh.wikipedia.org/wiki/%E4%BA%A4%E5%8F%89%E9%A9%97%E8%AD%89#K-fold_Cross-validation"
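Written as a formula, the definition above reads as follows. The symbols are assumptions introduced here for illustration: n is the number of samples, y_i is the observed value of sample i, and \hat{y}_{(i)} is the prediction for sample i from the model that was built without the held-out group containing sample i.

```latex
\mathrm{PRESS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_{(i)} \right)^2
```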

(2) Classification:

Cross-validation is generally divided into three categories: 2-fold CV (also called double-fold cross-validation), K-fold CV (most often 10-fold), and LOO (leave-one-out) CV.


2-fold: The original dataset is split into two parts, one used as the training set and the other as the test set. The model is trained on the training set and validated on the test set; then the two sets swap roles and the procedure is repeated. The errors from the two rounds together give the overall prediction error. (Note: the dataset must be split into two equal halves. A training set should not be smaller than its test set, and because the two parts swap roles, the only split that satisfies this in both rounds is an even one.)
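A minimal sketch of the 2-fold procedure, under stated assumptions: the "model" is a toy predictor that always returns the mean of its training targets, the error is the mean squared error, and the two round errors are combined by averaging. The names fit_mean, mse and two_fold_cv are hypothetical helpers, not part of any library.

```python
from statistics import mean

def fit_mean(train_y):
    """Toy model (an assumption): always predict the mean of the training targets."""
    return mean(train_y)

def mse(prediction, test_y):
    """Mean squared error of a constant prediction on the test targets."""
    return mean((y - prediction) ** 2 for y in test_y)

def two_fold_cv(ys):
    """Split ys into two equal halves, train on one, test on the other, then swap."""
    if len(ys) % 2 != 0:
        raise ValueError("an even split needs an even number of samples")
    half = len(ys) // 2
    first, second = ys[:half], ys[half:]
    err_a = mse(fit_mean(first), second)   # train on first half, test on second
    err_b = mse(fit_mean(second), first)   # roles swapped
    return (err_a + err_b) / 2             # average of the two round errors

print(two_fold_cv([1.0, 1.0, 3.0, 3.0]))
```

In practice the samples would normally be shuffled before splitting; the sketch keeps the original order so that the result is easy to check by hand.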

K-fold: The dataset is divided into K subsets. Each subset in turn serves as the test set while the remaining K-1 subsets together form the training set, and the mean of the K test errors is taken as the result. K-fold validation is a method for evaluating the results of supervised learning algorithms; the dataset is usually partitioned evenly or at random.
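The partitioning described above can be sketched in plain Python. This is one possible implementation, not a reference one: k_fold_indices and k_fold_cv are hypothetical names, folds are made as equal as possible (sizes differ by at most one), and fit and error are caller-supplied functions.

```python
import random

def k_fold_indices(n, k, shuffle=False, seed=0):
    """Yield (train_idx, test_idx) pairs for K-fold cross-validation over n samples."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    idx = list(range(n))
    if shuffle:
        random.Random(seed).shuffle(idx)   # random assignment of samples to folds
    base, extra = divmod(n, k)             # the first `extra` folds get one more sample
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = idx[start:start + size]
        train_idx = idx[:start] + idx[start + size:]
        yield train_idx, test_idx
        start += size

def k_fold_cv(ys, k, fit, error):
    """Return the mean of the K per-fold test errors."""
    fold_errors = []
    for train_idx, test_idx in k_fold_indices(len(ys), k):
        model = fit([ys[i] for i in train_idx])
        fold_errors.append(error(model, [ys[i] for i in test_idx]))
    return sum(fold_errors) / k

# Usage with a toy mean predictor and mean squared error (both assumptions).
fit = lambda train: sum(train) / len(train)
error = lambda m, test: sum((y - m) ** 2 for y in test) / len(test)
print(k_fold_cv([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3, fit, error))
```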


LOO: This method is a special case of K-fold in which the N samples are divided into N parts, so each part is a single sample. The procedure therefore iterates N times, and the resulting error is taken as the prediction error.
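As an illustration, leave-one-out can be combined with the PRESS statistic from the definition in (1). The sketch below assumes a one-predictor least-squares line as the model; fit_line and loo_press are hypothetical helper names, and the x values must not all be equal once a sample is removed.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line y = a + b*x for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx                      # fails if all x values are identical
    return my - b * mx, b

def loo_press(xs, ys):
    """Leave each sample out once, predict it, and sum the squared errors."""
    press = 0.0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]  # every sample except sample i
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_line(train_x, train_y)
        press += (ys[i] - (a + b * xs[i])) ** 2
    return press

print(loo_press([0.0, 1.0, 2.0], [0.0, 0.0, 3.0]))
```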


(3) Measurement: Every round of validation produces an error, so K-fold validation, which iterates K times, produces K errors. There are different ways to combine them into a final validation error, and this choice is the measurement method: for example the mean error (ME), the standard error of the mean, or both.
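One way to combine the K fold errors, sketched with the standard library only. The function name summarize_fold_errors is hypothetical, and the choice of the sample standard deviation divided by the square root of K as the standard error is an assumption about what is meant.

```python
from math import sqrt
from statistics import mean, stdev

def summarize_fold_errors(fold_errors):
    """Return (mean error, standard error of the mean) over the K fold errors."""
    k = len(fold_errors)
    if k < 2:
        raise ValueError("need at least two folds to estimate a standard error")
    return mean(fold_errors), stdev(fold_errors) / sqrt(k)

mean_error, std_error = summarize_fold_errors([0.2, 0.4, 0.6])
print(mean_error, std_error)
```

Reporting both values shows not only the typical error but also how much it varies from fold to fold.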

(4) Example: 3-fold cross-validation (the original figure is not reproduced here)
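A 3-fold split can be pictured with a small text diagram. The sketch below is an illustration only, not the original figure: fold_layout is a hypothetical helper that marks test samples with "T" and training samples with "." for each round, assuming 9 samples kept in their original order.

```python
def fold_layout(n_samples, k):
    """One text row per round: 'T' marks test samples, '.' marks training samples."""
    rows = []
    for fold in range(k):
        lo = fold * n_samples // k          # first test index of this round
        hi = (fold + 1) * n_samples // k    # one past the last test index
        rows.append("".join("T" if lo <= i < hi else "." for i in range(n_samples)))
    return rows

for round_no, row in enumerate(fold_layout(9, 3), start=1):
    print(f"round {round_no}: {row}")
# round 1: TTT......
# round 2: ...TTT...
# round 3: ......TTT
```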


Two: Experimental data processing methods


(1) Validation: The data is split into two parts, a training set and a test set, with the former taking 70% of the data and the latter 30%. The model is trained on the training set and tested on the test set, and the resulting error is taken as the overall prediction error.
(Note: this differs from 2-fold cross-validation mainly in how the dataset is split and how the parts are used for testing: the split is 70/30 rather than even, and the two parts do not swap roles.)
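A minimal sketch of the 70/30 split, assuming the samples are shuffled first with a fixed seed so the split is reproducible. holdout_split is a hypothetical helper name.

```python
import random

def holdout_split(samples, train_fraction=0.7, seed=0):
    """Shuffle the samples, then cut them into a training part and a test part."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)   # 70% of the data by default
    return shuffled[:cut], shuffled[cut:]

train, test = holdout_split(range(10))
print(len(train), len(test))
```

Unlike 2-fold cross-validation, each sample lands in exactly one role, so the error estimate depends on which samples happen to fall in the test part.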

(2) Cross-validation: Not described again here. The cross-validation result is used as the criterion for selecting parameters, and also as the criterion for judging how good a model is.

(3) Experiment three (from "Introduction to Machine Learning"): The data is first split into two parts, a training set and a test set. Cross-validation on the training set is used to choose the optimal parameters, and the test set is used to select the optimal model. The error on the test set is again taken as the final prediction error.
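The third procedure might look roughly like the following sketch. Everything specific in it is an assumption made for illustration: the model is a one-dimensional nearest-neighbour regressor, the parameter being tuned is its number of neighbours, the data is synthetic, and all function names are hypothetical.

```python
import random

def knn_predict(train, x, n_neighbors):
    """Predict y at x as the mean y of the n_neighbors closest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:n_neighbors]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train, test, n_neighbors):
    """Mean squared error on `test` of the model built from `train`."""
    return sum((y - knn_predict(train, x, n_neighbors)) ** 2 for x, y in test) / len(test)

def cv_error(train, n_neighbors, folds=5):
    """Mean K-fold error computed on the training set only (folds taken by striding)."""
    errors = []
    for f in range(folds):
        held_out = train[f::folds]
        rest = [p for i, p in enumerate(train) if i % folds != f]
        errors.append(mse(rest, held_out, n_neighbors))
    return sum(errors) / folds

def select_and_evaluate(data, candidates=(1, 3, 5, 9), train_fraction=0.7, seed=0):
    """Choose the parameter by cross-validation on the training set, then score on the test set."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]
    best = min(candidates, key=lambda k: cv_error(train, k))
    return best, mse(train, test, best)   # test data is never used for the choice

# Synthetic data (an assumption): a noisy straight line.
rng = random.Random(1)
data = [(x / 10, x / 10 + rng.gauss(0, 0.1)) for x in range(100)]
best_k, test_error = select_and_evaluate(data)
print(best_k, test_error)
```

Keeping the test set out of the parameter search is the point of this design: the reported test error is then not biased by the selection step.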


Those are the three experimental procedures. Which one is best I cannot say; pick the one that suits your own taste, the taste of your field, or the taste of your audience ...
