8.6 Choosing the "Best" regression model
8.6.1 Comparing models
You can compare the goodness of fit of two nested models with the anova() function in the base installation. A nested model is one whose terms are completely contained in another model.

Using the anova() function to compare two nested models:
> states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
> fit1 <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states)
> fit2 <- lm(Murder ~ Population + Illiteracy, data=states)
> anova(fit2, fit1)
Analysis of Variance Table

Model 1: Murder ~ Population + Illiteracy
Model 2: Murder ~ Population + Illiteracy + Income + Frost
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     47 289.25
2     45 289.17  2  0.078505 0.0061 0.9939
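The F statistic in this table can be reproduced by hand: the extra sum of squares from adding Income and Frost, divided by its degrees of freedom, is compared against the full model's residual mean square. A quick sketch (RSS values are rounded from the output above, so the match is approximate):

```r
# F = ((RSS_reduced - RSS_full) / extra df) / (RSS_full / residual df of full model)
ss_extra <- 0.078505   # "Sum of Sq" column from the anova() output
rss_full <- 289.167    # Model 2 RSS (shown rounded as 289.17)
f <- (ss_extra / 2) / (rss_full / 45)
p <- pf(f, 2, 45, lower.tail = FALSE)
round(c(F = f, p = p), 4)   # close to the reported F = 0.0061, Pr(>F) = 0.9939
```

Because the p-value is far above 0.05, Income and Frost add nothing over the two-predictor model.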
The AIC (Akaike Information Criterion) can also be used to compare models; it takes into account both a model's statistical fit and the number of parameters needed to achieve that fit. Models with smaller AIC values are preferred: they achieve an adequate fit with fewer parameters.
> AIC(fit1, fit2)
     df      AIC
fit1  6 241.6429
fit2  4 237.6565
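For a linear model, AIC is just minus twice the log-likelihood plus twice the number of estimated parameters (the df column counts the intercept, the slopes, and the residual variance). A small sketch verifying this against AIC() for the two-predictor model:

```r
states <- as.data.frame(state.x77[, c("Murder", "Population",
                                      "Illiteracy", "Income", "Frost")])
fit2 <- lm(Murder ~ Population + Illiteracy, data = states)
# df = 4: intercept, two slopes, and the residual variance
manual_aic <- -2 * as.numeric(logLik(fit2)) + 2 * 4
c(manual = manual_aic, builtin = AIC(fit2))   # both about 237.66
```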
8.6.2 Variable Selection
1. Stepwise regression
In stepwise regression, variables are added to or deleted from the model one at a time until some stopping criterion is reached. In forward stepwise regression, a predictor variable is added at each step until adding further variables no longer improves the model. In backward stepwise regression, you start with a model that contains all the predictor variables and delete one variable at a time until removing a variable would degrade model quality. Forward-backward stepwise regression (often simply called stepwise regression) combines the two approaches: variables are entered one at a time, but at each step the variables already in the model are re-evaluated, and any that no longer contribute are removed; a predictor variable may therefore be added and deleted several times before the final model is reached. The stepAIC() function in the MASS package performs stepwise model selection (forward, backward, or stepwise) using the exact AIC criterion.
> library(MASS)
> fit1 <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states)
> stepAIC(fit1, direction="backward")
Start:  AIC=97.75
Murder ~ Population + Illiteracy + Income + Frost

             Df Sum of Sq    RSS     AIC
- Frost       1     0.021 289.19  95.753
- Income      1     0.057 289.22  95.759
<none>                    289.17  97.749
- Population  1    39.238 328.41 102.111
- Illiteracy  1   144.264 433.43 115.986

Step:  AIC=95.75
Murder ~ Population + Illiteracy + Income

             Df Sum of Sq    RSS     AIC
- Income      1     0.057 289.25  93.763
<none>                    289.19  95.753
- Population  1    43.658 332.85 100.783
- Illiteracy  1   236.196 525.38 123.605

Step:  AIC=93.76
Murder ~ Population + Illiteracy

             Df Sum of Sq    RSS     AIC
<none>                    289.25  93.763
- Population  1    48.517 337.76  99.516
- Illiteracy  1   299.646 588.89 127.311

Call:
lm(formula = Murder ~ Population + Illiteracy, data = states)

Coefficients:
(Intercept)   Population   Illiteracy
  1.6515497    0.0002242    4.0807366
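The same function handles the combined forward-backward approach. A minimal sketch (direction="both" starts from the supplied model and considers both additions and deletions at every step; trace=FALSE suppresses the step-by-step log):

```r
library(MASS)
states <- as.data.frame(state.x77[, c("Murder", "Population",
                                      "Illiteracy", "Income", "Frost")])
fit1 <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
# stepwise (both directions) selection on the exact AIC criterion
best <- stepAIC(fit1, direction = "both", trace = FALSE)
formula(best)   # for these data, the same two-predictor model is selected
```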
2. All subsets regression
All subsets regression can be performed with the regsubsets() function in the leaps package. You can choose R-squared, adjusted R-squared, or the Mallows Cp statistic as the criterion for selecting the "best" model.
> library("leaps", lib.loc="D:/programfiles/r/r-3.1.3/library")
> leaps <- regsubsets(Murder ~ Population + Illiteracy + Income + Frost, data=states, nbest=4)
> plot(leaps, scale="adjr2")
> library(car)
> subsets(leaps, statistic="cp", main="Cp Plot for All Subsets Regression")
> abline(1, 1, lty=2, col="red")
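Instead of reading the criteria off the plots, you can extract them from the regsubsets object directly. A sketch, rebuilding the same fit and then picking the subset with the highest adjusted R-squared:

```r
library(leaps)
states <- as.data.frame(state.x77[, c("Murder", "Population",
                                      "Illiteracy", "Income", "Frost")])
leaps <- regsubsets(Murder ~ Population + Illiteracy + Income + Frost,
                    data = states, nbest = 4)
sub <- summary(leaps)
# sub$adjr2 and sub$cp hold one value per candidate subset
best <- which.max(sub$adjr2)
sub$which[best, ]   # logical row: which predictors the best subset includes
sub$adjr2[best]
```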
8.7 Taking the analysis further
8.7.1 Cross-validation
In cross-validation, a portion of the data is selected as the training sample and the remainder as the hold-out sample. The regression equation is fitted on the training sample and then used to predict the hold-out sample. Because the hold-out sample was not involved in selecting the model parameters, its performance gives a more accurate estimate of how the model will perform on new data. In k-fold cross-validation, the sample is divided into k subsamples; in turn, each combination of k-1 subsamples serves as the training set, with the remaining subsample as the hold-out set. This yields k prediction equations; the predictive performance on the k hold-out samples is recorded and averaged. (When k equals n, the total number of observations, this approach is known as jackknifing.) The crossval() function in the bootstrap package can perform k-fold cross-validation.
fit <- lm(mpg ~ hp + wt + hp:wt, data=mtcars)
shrinkage <- function(fit, k=10) {
  require(bootstrap)
  theta.fit <- function(x, y) { lsfit(x, y) }
  theta.predict <- function(fit, x) { cbind(1, x) %*% fit$coef }
  x <- fit$model[, 2:ncol(fit$model)]
  y <- fit$model[, 1]
  results <- crossval(x, y, theta.fit, theta.predict, ngroup=k)
  r2 <- cor(y, fit$fitted.values)^2
  r2cv <- cor(y, results$cv.fit)^2
  cat("Original R-square =", r2, "\n")
  cat(k, "Fold Cross-Validated R-square =", r2cv, "\n")
  cat("Change =", r2 - r2cv, "\n")
}
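Once defined, the function is called on a fitted model object. A usage sketch, assuming the shrinkage() function above is already in the workspace (the cross-validated R-square varies with the random fold assignment, so a seed is set for reproducibility):

```r
# install.packages("bootstrap")  # needed once, for crossval()
fit <- lm(mpg ~ hp + wt + hp:wt, data = mtcars)
set.seed(1234)         # make the random fold assignment reproducible
shrinkage(fit)         # prints original and 10-fold cross-validated R-square
shrinkage(fit, k = 5)  # fewer folds: larger hold-out sets, more shrinkage variance
```

A large drop from the original to the cross-validated R-square suggests the model will not generalize well to new data.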
R in Action reading notes (11) - Chapter 8: Regression - Selecting the "Best" regression model