R--Linear Regression Diagnostics (II)


"Please cite the source when reproducing": http://www.cnblogs.com/runner-ljt/

Ljt: Never forget your beginner's mind; fear not the future.

As a beginner my level is limited; exchanges and corrections are welcome.

R--Linear Regression Diagnostics (I) introduced the main content and basic methods of linear regression diagnostics.

As a further extension, this post focuses on linear regression diagnostics using the relevant functions in the car package.

> head(Bank)
       y     x1     x2    x3     x4
1 1018.4  96259 2239.1 50760 1132.3
2 1258.9  97542 2619.4 39370 1146.4
3 1359.4  ...
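The model object `fline` used in all of the diagnostics below is never shown being created. A minimal sketch of how it is presumably fitted, assuming the `Bank` data frame above and the car package for the diagnostic functions:

```r
# Load car for qqPlot(), crPlots(), ncvTest(), vif(), etc.
library(car)

# Presumed definition of the model used throughout this post:
# regress y on all four predictors of the Bank data.
fline <- lm(y ~ x1 + x2 + x3 + x4, data = Bank)
```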

  

Normality test:

qqPlot

Draws a studentized-residual Q-Q plot, which is more accurate than the normal Q-Q plot produced by the plot() function.

> qqPlot(fline)

The dashed curves on either side of the line mark a confidence envelope; points falling outside the two curves can be treated as outliers.

Linearity test:

crPlots

Component-plus-residual plots make it possible to determine whether the relationship between the dependent variable and an individual independent variable is nonlinear; they remove the influence of the other independent variables when testing each linear relationship.

In each panel the horizontal axis is the predictor xi, and the vertical axis is the component plus residual, θi*xi + ε.

Linearity can be judged by whether the red and green lines follow the same trend (the red line is y = θi*xi; the green line is a smooth of the scatter).

> crPlots(fline)

  

Constant-variance (homoscedasticity) test:

ncvTest

Null hypothesis: the random errors have constant variance. A p-value > 0.05 means we fail to reject the null, i.e., there is no evident heteroscedasticity.

> ncvTest(fline)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 0.2017512    Df = 1     p = 0.653311

  

Since p = 0.653311 > 0.05, we fail to reject the null hypothesis: there is no evidence of heteroscedasticity.

spreadLevelPlot

Creates a scatter plot of the absolute studentized residuals against the fitted values.

If the suggested power transformation in the output is close to 1, heteroscedasticity is not evident and no transformation is needed;

if the suggested power is 0.5, replace y with sqrt(y);

if the suggested power is 0, use a log transformation.
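Applying the suggested transformation is a one-line refit. A sketch, hypothetical because the actual suggested power is cut off in the transcript below:

```r
# Illustrative only: the suggested power from spreadLevelPlot(fline) is not
# shown in this post, so both refits below are hypothetical examples.

# Suggested power near 0.5 -> model sqrt(y) instead of y:
fline_sqrt <- lm(sqrt(y) ~ x1 + x2 + x3 + x4, data = Bank)

# Suggested power near 0 -> use a log transformation instead:
fline_log <- lm(log(y) ~ x1 + x2 + x3 + x4, data = Bank)
```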

> spreadLevelPlot(fline)
Suggested power transformation:

  

Autocorrelation test:

The basic assumptions of the linear regression model include cov(εi, εj) = 0; if a model violates this assumption, the random error terms are autocorrelated.

Note: autocorrelation here does not refer to correlation between two or more variables, but to correlation between successive values of the same variable.

Null hypothesis: the random errors are uncorrelated. A p-value > 0.05 means we fail to reject the null, i.e., there is no autocorrelation.

> durbinWatsonTest(fline)
 lag Autocorrelation D-W Statistic p-value
   1       0.3578255       1.25138

The results show p < 0.05, so the null hypothesis is rejected: the errors exhibit serious autocorrelation.

Collinearity test:

vif

VIF: variance inflation factor.

In general, VIF > 10 indicates a multicollinearity problem, and the variables with VIF > 10 are the ones responsible for the multicollinearity in the equation.
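The VIF of a predictor is 1/(1 - R²), where R² comes from regressing that predictor on all the others, so it can be reproduced by hand. A sketch for x2, assuming the `Bank` data above:

```r
# VIF by hand: regress x2 on the remaining predictors of fline1 (x1 and x4)
# and apply 1 / (1 - R^2); this should reproduce car::vif(fline1)["x2"].
r2_x2  <- summary(lm(x2 ~ x1 + x4, data = Bank))$r.squared
vif_x2 <- 1 / (1 - r2_x2)
```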

> fline1 <- lm(y ~ x1 + x2 + x4, data = Bank)
> vif(fline1)
       x1        x2        x4
 4.830666 91.196064 88.411675
> cor(Bank[, c(2, 3, 5)])
          x1        x2        x4
x1 1.0000000 0.8904046 0.8867331
x2 0.8904046 1.0000000 0.9943239
x4 0.8867331 0.9943239 1.0000000
> fline2 <- lm(y ~ x1 + x4, data = Bank)
> vif(fline2)
     x1 ...

From the results of the regression fline1 it can be seen that the VIF values of x2 and x4 are well above 10, indicating collinearity between these variables.

At the same time, it can be seen from the simple correlation coefficient matrix that the correlation coefficient between X2 and X4 is 0.9943239, indicating a high correlation between the two.

Multicollinearity can be mitigated by removing the variable with the largest VIF; after deleting x2, the refitted equation fline2 no longer shows obvious multicollinearity.

Outlier test:

Outliers fall into two cases: (1) observations that are unusual in the dependent variable y, and (2) observations that are unusual in the independent variables x.

(1) Outliers: points that are poorly predicted by the model, i.e., have large residuals.

outlierTest

Determines whether outliers are present based on the significance of the largest studentized residual.

If the Bonferroni p-value is not significant, there are no outliers;

if the Bonferroni p < 0.05, the point with the largest studentized residual is an outlier. Delete it, refit the model, and run the outlier test again on the reduced fit.

> outlierTest(fline)
No Studentized residuals with Bonferonni p < 0.05
Largest |rstudent|:
     rstudent unadjusted p-value Bonferonni p
16  -2.879438           0.011463      0.24071
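Here nothing was flagged (the Bonferroni p of 0.24 for observation 16 is not significant), so no deletion is needed. Had it been significant, the delete-and-retest loop described above could be sketched with update():

```r
# Hypothetical sketch: if observation 16 had a Bonferroni p < 0.05, refit
# the model without it and run the outlier test again on the reduced fit.
fline_drop <- update(fline, subset = -16)
outlierTest(fline_drop)   # repeat until no observation is flagged
```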

  

(2) High leverage points

Points that lie far from the bulk of the sample in predictor space; they exert a larger influence on the regression parameters.

A point can be considered a high leverage point when its hat value exceeds two to three times the average hat value.

> hatvalues(fline)
        1         2         3         4         5         6         7
0.4453268 0.1937509 0.1943925 0.1376962 0.2137907 0.1647341 0.2542901
        8         9        10        11        12        13        14
0.1114443 0.1203456 0.1075918 0.1372937 0.1113233 0.2690678 0.2546604
...
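The two-to-three-times-average rule of thumb can be checked directly; for a model with an intercept the average hat value equals (p + 1)/n, where p is the number of predictors:

```r
# Flag observations whose leverage exceeds 2x (or 3x) the average hat value.
h     <- hatvalues(fline)
avg_h <- mean(h)        # equals (p + 1) / n for a model with an intercept
which(h > 2 * avg_h)    # candidate high leverage points
which(h > 3 * avg_h)    # stricter cutoff
```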

  

From the hat values of the sample points we can see that the 21st observation has a markedly larger leverage value, making it a high leverage point.

(3) Influential points

Points that strongly affect the model's parameter estimates (combining large residuals with high leverage); deleting one would substantially change the fitted model.

They can be identified using Cook's distance.

> cooks.distance(fline)
           1            2            3            4            5
1.146928e-01 8.816365e-06 1.721683e-03 1.180151e-02 5.950745e-02
           6            7            8            9           10
1.188010e-02 1.049215e-03 1.595864e-02 5.529126e-03 7.215198e-04
          11           12           13           14           15
1.040969e-05 5.131290e-04 3.465269e-01 1.077292e-01 1.045665e-01
          16           17           18           19           20
1.554358e-01 1.388942e-03 1.154579e-01 1.330203e-01 5.371479e-03
...

The Cook's distance of the 21st observation is clearly too large, so it is an influential point.
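A common cutoff for Cook's distance is 4/(n - k - 1), with n observations and k predictors; a sketch of applying it:

```r
# Identify observations whose Cook's distance exceeds 4 / (n - k - 1).
d      <- cooks.distance(fline)
cutoff <- 4 / (nrow(Bank) - length(coef(fline)))  # n - (k + 1)
which(d > cutoff)

# Base-R Cook's distance plot with the cutoff drawn in:
plot(fline, which = 4)
abline(h = cutoff, lty = 2)
```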

influence.measures

Observations with a strong influence are flagged with a * in the rightmost column.

> influence.measures(fline)
Influence measures of
	 lm(formula = y ~ x1 + x2 + x3 + x4, data = Bank) :

     dfb.1_    dfb.x1    dfb.x2    dfb.x3    dfb.x4    dffit cov.r   cook.d   hat inf
1   0.45826 -0.568459  0.055623  0.512937  0.009836  0.75016 1.981 1.15e-01 0.445   *
2  -0.00507  0.004561 -0.000964  0.000134  0.000275 -0.00643 1.713 8.82e-06 0.194
3  -0.06156  0.064613 -0.021146 -0.035703  0.013871 -0.08994 1.695 1.72e-03 0.194
4  -0.15306  0.124729 -0.077062  0.020415  0.062390 -0.23797 1.425 1.18e-02 0.138
5  -0.19112  0.045121 -0.136219  0.376396  0.126321 -0.54719 1.232 5.95e-02 0.214
6  -0.05657 -0.004141 -0.023041  0.165138  0.024055 -0.23824 1.503 1.19e-02 0.165
7   0.00462  0.013033 -0.013136 -0.058289  0.012084  0.07016 1.843 1.05e-03 0.254
8  -0.01447 -0.000661 -0.180381  0.032928  0.169471  0.27911 1.269 1.60e-02 0.111
9  -0.03729  0.017606 -0.096442  0.056441  0.084377  0.16202 1.473 5.53e-03 0.120
10 -0.02597  0.030394 -0.027975 -0.018465  0.020941  0.05821 1.533 7.22e-04 0.108
11 -0.00312  0.002135 -0.000075  0.004076 -0.000837  0.00699 1.600 1.04e-05 0.137
12  0.03346 -0.031808  0.013247 -0.004924 -0.005163 -0.04908 1.544 5.13e-04 0.111
13 -0.88710  1.114394 -0.618970 -0.968065  0.435245  1.51702 0.331 3.47e-01 0.269   *
14 -0.46471  0.322235 -0.157021  0.463529  0.054450  0.74845 1.103 1.08e-01 0.255
15  0.54324 -0.482824 -0.052544 -0.244757  0.172698 -0.76306 0.704 1.05e-01 0.171
16  0.76658 -0.744780  0.123881 -0.025557  0.011793 -1.06364 0.174 1.55e-01 0.120
17  0.01766 -0.010056 -0.043461 -0.040832  0.045946 -0.08000 1.758 1.39e-03 0.221
18  0.05049  0.055714  0.583416 -0.186662 -0.554400  0.76448 1.399 1.15e-01 0.328
19  0.20494 -0.110758  0.593949 -0.153020 -0.518155  0.81642 1.627 1.33e-01 0.392
20 -0.04819  0.059309 -0.002337 -0.027989 -0.022679 -0.15900 1.909 5.37e-03 0.291
21 -0.20629  0.281238  0.822119  0.113466 -0.986662 -1.30680 4.861 3.52e-01 0.762   *

The results show that observations 1, 13, and 21 are flagged as having strong influence.

influencePlot

Combines outliers, high leverage points, and influential points into a single diagram:

points beyond -2 to 2 on the vertical axis (studentized residuals) can be treated as outliers;

the horizontal axis shows the leverage (hat) values;

the size of each circle reflects the magnitude of the point's influence.
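The call itself does not appear above; a typical invocation is simply:

```r
# Studentized residuals vs. hat values, with circle area proportional to
# Cook's distance; influencePlot() also returns the noteworthy observations.
influencePlot(fline)
```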

