A little note on cross-validation

Source: Internet
Author: User

1.

Our data are a sample, and we often need to resample them. Cross-validation is a resampling method.

A lower training error does not guarantee a lower test error: if we overfit, the test error can rise even as the training error falls.

Model complexity: in a linear model, for example, it is the number of features, or the number of coefficients, that we fit in the model. Low complexity means few features or predictors; high complexity means many.

For example, fitting polynomial models of higher and higher degree.

As can be seen from the figure, as model complexity increases the training error keeps shrinking, because the fitted model matches the training data more and more closely. The test error first decreases to a minimum value and then begins to increase due to overfitting; this diagram illustrates overfitting well.
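The monotone drop in training error is easy to verify directly. A minimal sketch in R, assuming the ISLR package's Auto data that the code later in this note also uses:

```r
# Training MSE of polynomial fits of increasing degree on the Auto data.
# The models are nested, so a higher degree can only fit the training data
# at least as well, and the training MSE never rises.
library(ISLR)
fits <- lapply(1:5, function(d) lm(mpg ~ poly(horsepower, d), data = Auto))
train_mse <- sapply(fits, function(f) mean(residuals(f)^2))
train_mse  # non-increasing in the degree
```

Note that this only shows the training error; the point of cross-validation is that the test error behaves differently.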

Bias and variance are the two components of prediction error. Bias is how far off, on average, the model is from the truth; variance is how much the estimate varies around its average. That is, when we do not fit very hard, the model is farther from the truth, so the bias is larger, while the variance is smaller, because in this case the number of features is small.

To find a suitable point, that is, the appropriate degree or the appropriate model complexity, cross-validation is very useful.

2.

When our data set is relatively small, we use the cross-validation method.

First, hold out some of the data, use the rest of the data for training, and then test the fitted model on the held-out data.
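As a concrete sketch of this validation-set approach, again assuming the ISLR Auto data used later in this note:

```r
# Validation-set approach: hold out half the data, train on the rest,
# then compute the test MSE on the held-out half.
library(ISLR)
set.seed(1)                                    # arbitrary seed for reproducibility
train <- sample(nrow(Auto), nrow(Auto) / 2)    # indices of the training half
fit <- lm(mpg ~ horsepower, data = Auto, subset = train)
val_mse <- mean((Auto$mpg - predict(fit, Auto))[-train]^2)
val_mse  # estimated test MSE
```

Re-running with a different seed gives a noticeably different estimate, which is exactly the variability problem discussed next.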


For example, the picture shows twofold validation: the data are divided into two parts, the blue part for training and the orange part for testing.


As can be seen from the figure, when the degree is greater than or equal to 2, the MSE is small. The right figure shows the disadvantage of twofold cross-validation: each time we re-select the two parts used for training and testing, the shape of the resulting curve is roughly the same, but the MSE varies from about 16 to 24, so the variability is large. How many folds are more appropriate, then? K = 5 to 10 is more appropriate.

3. K-fold cross-validation

The data are divided into K parts; one part is the validation set, and the remaining K-1 parts are the training set.



For example, the first part serves as the validation set and the remaining four parts together fit the model; the model is then tested on the validation set. Next, the second part serves as the validation set and the same steps are repeated, and so on, for a total of 5 times.
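The steps above can be sketched as a hand-rolled loop (assuming the same Auto data; the cv.glm function used later in this note automates this):

```r
# Manual 5-fold cross-validation on the Auto data.
library(ISLR)
set.seed(2)                                          # arbitrary seed
k <- 5
folds <- sample(rep(1:k, length.out = nrow(Auto)))   # assign each row to a fold
fold_mse <- rep(0, k)
for (i in 1:k) {
  fit <- lm(mpg ~ horsepower, data = Auto[folds != i, ])   # train on the other k-1 parts
  pred <- predict(fit, newdata = Auto[folds == i, ])       # predict the held-out part
  fold_mse[i] <- mean((Auto$mpg[folds == i] - pred)^2)
}
mean(fold_mse)  # the 5-fold CV estimate of the test MSE
```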



Two points can be seen from this figure. First, when the degree equals 2 the MSE is small and the 10 curves basically coincide there, which settles the question of model complexity. Second, when K = 10, that is, 10-fold, the variability is very small: the 10 runs are basically consistent. Compared with the earlier twofold diagram, this shows that 5- to 10-fold is more appropriate.

(1) Since each training set is only (K-1)/K as big as the original training set, the estimates of prediction error will typically be biased upward.

(2) This bias is minimized when K = N (LOOCV), but this estimate has high variance, as noted earlier.

(3) K = 5 or 10 provides a good compromise for this bias-variance tradeoff.
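For least squares, LOOCV does not need to refit the model N times. Formula (5.2) in ISLR computes it from a single fit using the leverage values, and this is the identity the simple function defined in the code below relies on:

```latex
\mathrm{CV}_{(n)} \;=\; \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^{2}
```

where \(\hat{y}_i\) is the i-th fitted value from the full-data fit and \(h_i\) is its leverage (the i-th diagonal entry of the hat matrix).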

LOOCV R code (leave-one-out cross-validation):

require(ISLR)
require(boot)
?cv.glm   # cv.glm is the cross-validation function; view its help page for detailed usage
plot(mpg ~ horsepower, data = Auto)


## LOOCV
glm.fit <- glm(mpg ~ horsepower, data = Auto)
cv.glm(Auto, glm.fit)$delta  # pretty slow (doesn't use formula (5.2)); delta is the prediction error


## Let's write a simple function to use formula (5.2)
loocv <- function(fit) {
  h <- lm.influence(fit)$h   # leverage (hat) values
  mean((residuals(fit) / (1 - h))^2)
}

# Now we try it out
loocv(glm.fit)


cv.error <- rep(0, 5)
degree <- 1:5
for (d in degree) {
  glm.fit <- glm(mpg ~ poly(horsepower, d), data = Auto)
  cv.error[d] <- loocv(glm.fit)
}
plot(degree, cv.error, type = "b")


## 10-fold CV
cv.error10 <- rep(0, 5)
for (d in degree) {
  glm.fit <- glm(mpg ~ poly(horsepower, d), data = Auto)
  cv.error10[d] <- cv.glm(Auto, glm.fit, K = 10)$delta[1]
}
lines(degree, cv.error10, type = "b", col = "red")
