A little note on cross-validation

Source: Internet
Author: User

1.

Our data are a sample, and we often need to resample them. Cross-validation is a resampling method.

A lower training error does not guarantee a lower test error: if we overfit, the test error can rise even as the training error falls.

Model complexity: in a linear model, for example, it is the number of features, or the number of coefficients, that we fit in the model. Low complexity means few features or predictors; high complexity means many.

For example, fitting polynomial models of higher and higher degree.

As can be seen from the figure, as model complexity increases the training error keeps shrinking, because the fitted model matches the training data more and more closely. The test error first decreases to a minimum value and then begins to increase due to overfitting; this diagram illustrates overfitting well.
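The monotone drop in training error is easy to verify directly. A minimal sketch in R, assuming the ISLR package's Auto data that the code later in this note also uses:

```r
# Training MSE of polynomial fits of increasing degree on the Auto data.
# The models are nested, so a higher degree can only fit the training data
# at least as well, and the training MSE never rises.
library(ISLR)
fits <- lapply(1:5, function(d) lm(mpg ~ poly(horsepower, d), data = Auto))
train_mse <- sapply(fits, function(f) mean(residuals(f)^2))
train_mse  # non-increasing in the degree
```

Note that this only shows the training error; the point of cross-validation is that the test error behaves differently.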

Bias and variance are the two components of prediction error. Bias is how far off, on average, the model is from the truth; variance is how much the estimate varies around its average. That is, when we do not fit very hard, the model is farther from the truth, so the bias is larger, while the variance is smaller, because in this case the number of features is small.

To find a suitable point, that is, the appropriate degree or the appropriate model complexity, cross-validation is very useful.

2.

When our data set is relatively small, we use the cross-validation method.

First, hold out some of the data, use the rest of the data for training, and then test the fitted model on the held-out data.
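As a concrete sketch of this validation-set approach, again assuming the ISLR Auto data used later in this note:

```r
# Validation-set approach: hold out half the data, train on the rest,
# then compute the test MSE on the held-out half.
library(ISLR)
set.seed(1)                                    # arbitrary seed for reproducibility
train <- sample(nrow(Auto), nrow(Auto) / 2)    # indices of the training half
fit <- lm(mpg ~ horsepower, data = Auto, subset = train)
val_mse <- mean((Auto$mpg - predict(fit, Auto))[-train]^2)
val_mse  # estimated test MSE
```

Re-running with a different seed gives a noticeably different estimate, which is exactly the variability problem discussed next.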


For example, the picture shows twofold validation: the data are divided into two parts, the blue part for training and the orange part for testing.


As can be seen from the figure, when the degree is greater than or equal to 2, the MSE is small. The right figure shows the disadvantage of twofold cross-validation: each time we re-select the two parts used for training and testing, the shape of the resulting curve is roughly the same, but the MSE varies from about 16 to 24, so the variability is large. How many folds are more appropriate, then? K = 5 to 10 is more appropriate.

3. K-fold cross-validation

The data are divided into K parts; one part is the validation set, and the remaining K-1 parts are the training set.



For example, the first part serves as the validation set and the remaining four parts together fit the model; the model is then tested on the validation set. Next, the second part serves as the validation set and the same steps are repeated, and so on, for a total of 5 times.
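The steps above can be sketched as a hand-rolled loop (assuming the same Auto data; the cv.glm function used later in this note automates this):

```r
# Manual 5-fold cross-validation on the Auto data.
library(ISLR)
set.seed(2)                                          # arbitrary seed
k <- 5
folds <- sample(rep(1:k, length.out = nrow(Auto)))   # assign each row to a fold
fold_mse <- rep(0, k)
for (i in 1:k) {
  fit <- lm(mpg ~ horsepower, data = Auto[folds != i, ])   # train on the other k-1 parts
  pred <- predict(fit, newdata = Auto[folds == i, ])       # predict the held-out part
  fold_mse[i] <- mean((Auto$mpg[folds == i] - pred)^2)
}
mean(fold_mse)  # the 5-fold CV estimate of the test MSE
```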



Two points can be seen from this figure. First, when the degree equals 2 the MSE is small and the 10 curves basically coincide there, which settles the question of model complexity. Second, when K = 10, that is, 10-fold, the variability is very small: the 10 runs are basically consistent. Compared with the earlier twofold diagram, this shows that 5- to 10-fold is more appropriate.

(1) Since each training set is only (K-1)/K as big as the original training set, the estimates of prediction error will typically be biased upward.

(2) This bias is minimized when K = N (LOOCV), but this estimate has high variance, as noted earlier.

(3) K = 5 or 10 provides a good compromise for this bias-variance tradeoff.
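For least squares, LOOCV does not need to refit the model N times. Formula (5.2) in ISLR computes it from a single fit using the leverage values, and this is the identity the simple function defined in the code below relies on:

```latex
\mathrm{CV}_{(n)} \;=\; \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^{2}
```

where \(\hat{y}_i\) is the i-th fitted value from the full-data fit and \(h_i\) is its leverage (the i-th diagonal entry of the hat matrix).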

LOOCV R code (leave-one-out cross-validation):

require(ISLR)
require(boot)
?cv.glm   # cv.glm is the cross-validation function; view its help page for detailed usage
plot(mpg ~ horsepower, data = Auto)


## LOOCV
glm.fit <- glm(mpg ~ horsepower, data = Auto)
cv.glm(Auto, glm.fit)$delta  # pretty slow (doesn't use formula (5.2)); delta is the prediction error


## Let's write a simple function to use formula (5.2)
loocv <- function(fit) {
  h <- lm.influence(fit)$h   # leverage (hat) values
  mean((residuals(fit) / (1 - h))^2)
}

# Now we try it out
loocv(glm.fit)


cv.error <- rep(0, 5)
degree <- 1:5
for (d in degree) {
  glm.fit <- glm(mpg ~ poly(horsepower, d), data = Auto)
  cv.error[d] <- loocv(glm.fit)
}
plot(degree, cv.error, type = "b")


## 10-fold CV
cv.error10 <- rep(0, 5)
for (d in degree) {
  glm.fit <- glm(mpg ~ poly(horsepower, d), data = Auto)
  cv.error10[d] <- cv.glm(Auto, glm.fit, K = 10)$delta[1]
}
lines(degree, cv.error10, type = "b", col = "red")
