8.4 Unusual observations
8.4.1 Outliers
The car package also provides a statistical test for outliers. The outlierTest() function reports the Bonferroni-adjusted p-value for the largest studentized residual:
> library(car)
> outlierTest(fit)
       rstudent unadjusted p-value Bonferonni p
Nevada 3.542929         0.00095088     0.047544
You can see that Nevada is identified as an outlier (p = 0.048). Note that the function only tests the single largest (positive or negative) studentized residual for significance. If it is not significant, there are no outliers in the data set; if it is significant, you should delete that outlier, refit the model, and test again to see whether other outliers are present.
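The fit object tested above is not defined in this excerpt. A plausible reconstruction, assuming the states data frame built later in these notes and the murder-rate regression used throughout the chapter, is:

# Assumed setup (not shown in this excerpt): the states data and the fitted model
states <- data.frame(state.region, state.x77)
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)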
8.4.2 High-leverage points
High-leverage observations are outliers with respect to the other predictor variables; that is, they have an unusual combination of predictor values, while the value of the response variable plays no role in determining leverage. High-leverage observations can be identified with the hat statistic. For a given data set, the average hat value is p/n, where p is the number of parameters estimated in the model (including the intercept) and n is the sample size. As a rule of thumb, an observation whose hat value is greater than 2 or 3 times the average hat value can be considered a high-leverage point.
hat.plot <- function(fit) {
  p <- length(coefficients(fit))
  n <- length(fitted(fit))
  plot(hatvalues(fit), main = "Index Plot of Hat Values")
  abline(h = c(2, 3) * p / n, col = "red", lty = 2)
  identify(1:n, hatvalues(fit), names(hatvalues(fit)))
}
hat.plot(fit)
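If an interactive graphics device is not available for identify(), the same threshold check can be done non-interactively. A minimal sketch, assuming the fit object above:

# List observations whose hat value exceeds twice the average hat value p/n
p <- length(coefficients(fit))
n <- length(fitted(fit))
hv <- hatvalues(fit)
hv[hv > 2 * p / n]    # high-leverage points by the 2*(p/n) rule of thumb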
8.4.3 Influential observations
Influential observations are points that have a disproportionate impact on the model's parameter estimates. For example, if removing a single observation changes the model dramatically, you need to check whether the data contain influential observations. There are two common methods for detecting them: Cook's distance (the D statistic) and added-variable plots. Roughly speaking, an observation with a Cook's D value greater than 4/(n-k-1), where n is the sample size and k is the number of predictor variables, is an influential observation. A Cook's D plot can be drawn with the following code:
> cutoff <- 4/(nrow(states) - length(fit$coefficients) - 2)
> plot(fit, which = 4, cook.levels = cutoff)
> abline(h = cutoff, lty = 2, col = "red")
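To list the flagged observations directly rather than reading them off the plot, a small sketch using the same cutoff:

# Observations whose Cook's distance exceeds the cutoff computed above
cd <- cooks.distance(fit)
cd[cd > cutoff]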
The Cook's D plot helps identify influential observations, but it provides no information about how those points affect the model. Added-variable plots compensate for this shortcoming. In an added-variable plot for a predictor Xk, the residuals from regressing the response variable on the other k-1 predictors are plotted against the residuals from regressing Xk on those same k-1 predictors. The avPlots() function in the car package produces added-variable plots. With the influencePlot() function in the car package, you can also combine the information from the outlier, leverage, and influence diagnostics into a single plot:
> library(car)
> avPlots(fit, ask = FALSE, onepage = TRUE, id.method = "identify")
> influencePlot(fit, id.method = "identify", main = "Influence Plot", sub = "Circle size is proportional to Cook's distance")
The resulting plot shows that Nevada and Rhode Island are outliers; New York, California, Hawaii, and Washington have high leverage; and Nevada, Alaska, and Hawaii are influential observations.
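Base R can also summarize several of these diagnostics (hat values, Cook's distance, DFBETAS, and so on) in one table; a quick sketch:

# Base-R summary of influence diagnostics; potentially influential
# observations are flagged with an asterisk
summary(influence.measures(fit))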
8.5 Corrective measures
There are four approaches to dealing with violations of the regression assumptions:
• deleting observations;
• transforming variables;
• adding or deleting variables;
• using another regression approach.
8.5.1 Deleting observations
Deleting outliers can often improve a data set's fit to the normality assumption, and influential observations, because they distort the results, are also frequently deleted. After deleting the largest outlier or influential observation, the model should be refitted. If outliers or influential observations still remain, repeat the process until a satisfactory fit is obtained.
8.5.2 Variable Transformation
When a model does not meet the normality, linearity, or homoscedasticity assumptions, transforming one or more variables can often improve or correct the model.
When a model violates the normality assumption, you can usually try transforming the response variable. The powerTransform() function in the car package uses maximum likelihood to estimate the power λ most likely to normalize the variable X^λ. Applying a Box-Cox transformation to normality:
> library(car)
> states <- data.frame(state.region, state.x77)
> summary(powerTransform(states$Murder))
bcPower Transformation to Normality

              Est.Power Std.Err. Wald Lower Bound Wald Upper Bound
states$Murder    0.6055   0.2639           0.0884           1.1227

Likelihood ratio tests about transformation parameters
                           LRT df       pval
LR test, lambda = (0) 5.665991  1 0.01729694
LR test, lambda = (1) 2.122763  1 0.14512456
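The estimated power 0.6 is close to 0.5, so a square-root transformation of Murder would be the natural candidate; note, however, that λ = 1 lies inside the Wald interval and the LR test for λ = 1 is not significant (p = 0.145), so a transformation is not strictly required here. If you did want to apply it, a minimal sketch (refitting the model assumed earlier):

# Refit the assumed murder-rate model with a square-root-transformed response
fit_sqrt <- lm(sqrt(Murder) ~ Population + Illiteracy + Income + Frost,
               data = states)
summary(fit_sqrt)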
When the linearity assumption is violated, transforming the predictor variables is often useful. The boxTidwell() function in the car package obtains maximum-likelihood estimates of the predictor powers that improve the linear relationship. The following example applies the Box-Tidwell transformation to a model that predicts a state's murder rate from its population and illiteracy rate:
> boxTidwell(Murder ~ Population + Illiteracy, data = states)
           Score Statistic   p-value MLE of lambda
Population      -0.3228003 0.7468465     0.8693882
Illiteracy       0.6193814 0.5356651     1.3581188

iterations = 19
8.5.3 Adding and deleting variables
Changing the variables in a model affects its fit. Sometimes adding an important variable solves many problems, and deleting a redundant variable can have the same effect.
Deleting variables is a particularly important approach for dealing with multicollinearity. If you are only making predictions, multicollinearity is not a problem, but if you also want to interpret each predictor variable, it must be addressed. The most common approach is to delete one of the variables involved in the multicollinearity (a variable with √vif > 2). Another option is ridge regression, a variant of multiple regression designed to handle multicollinearity.
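The √vif > 2 screen mentioned above can be computed directly with the vif() function from the car package; a quick sketch against the fitted model:

library(car)
vif(fit)              # variance inflation factors for each predictor
sqrt(vif(fit)) > 2    # TRUE indicates a multicollinearity problem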