R in Action reading notes (10) - Chapter 8: Regression - unusual observations and corrective measures

8.4 Unusual observations
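
The examples in this section assume a fitted regression object fit, which is not defined in this excerpt. A minimal sketch of the setup, with the model formula being an assumption carried over from earlier sections of the chapter, might look like this:

# Assumed setup (not shown in this excerpt): the murder-rate regression on the
# state.x77 data that the chapter's diagnostic examples appear to use.
states <- data.frame(state.region, state.x77)
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)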

8.4.1 Outliers

The car package also provides a statistical test for outliers. The outlierTest() function reports the Bonferroni-adjusted p-value for the largest studentized residual:

> library(car)
> outlierTest(fit)
       rstudent unadjusted p-value Bonferroni p
Nevada 3.542929         0.00095088     0.047544

You can see that Nevada is judged to be an outlier (p = 0.048). Note that this function only tests the significance of the single largest (positive or negative) studentized residual. If it isn't significant, there are no outliers in the dataset; if it is significant, you should delete that outlier and then test again to see whether other outliers are present.

8.4.2 High-leverage points

High-leverage observations are outliers with regard to the other predictor variables. In other words, they have an unusual combination of predictor values; the value of the response variable plays no role. High-leverage observations can be identified with the hat statistic. For a given dataset, the average hat value is p/n, where p is the number of parameters estimated in the model (including the intercept) and n is the sample size. Roughly speaking, an observation whose hat value is greater than 2 or 3 times the average hat value can be considered a high-leverage point.

hat.plot <- function(fit) {
  p <- length(coefficients(fit))
  n <- length(fitted(fit))
  plot(hatvalues(fit), main = "Index Plot of Hat Values")
  abline(h = c(2, 3) * p / n, col = "red", lty = 2)
  identify(1:n, hatvalues(fit), names(hatvalues(fit)))
}

hat.plot(fit)

8.4.3 Influential observations

Influential observations are points that have a disproportionate impact on the values of the model's parameter estimates. For example, if removing a single observation changes the model dramatically, you need to check whether the data contain influential observations. There are two methods for identifying them: Cook's distance (the D statistic) and added-variable plots. Roughly speaking, a Cook's D value greater than 4/(n-k-1), where n is the sample size and k is the number of predictor variables, indicates an influential observation. A Cook's D plot can be drawn with the following code:

> cutoff <- 4/(nrow(states) - length(fit$coefficients) - 2)
> plot(fit, which = 4, cook.levels = cutoff)
> abline(h = cutoff, lty = 2, col = "red")

Cook's D plots help identify influential observations, but they don't provide information about how those observations affect the model. Added-variable plots fill this gap. For each predictor Xk, an added-variable plot graphs the residuals from regressing the response variable on the other k-1 predictors against the residuals from regressing Xk on those same k-1 predictors. The avPlots() function in the car package produces added-variable plots. With the influencePlot() function in the car package, you can also combine the information from outlier, leverage, and influence diagnostics into a single plot:

> library(car)
> avPlots(fit, ask = FALSE, onepage = TRUE, id.method = "identify")
> influencePlot(fit, id.method = "identify", main = "Influence Plot",
                sub = "Circle size is proportional to Cook's distance")

The influence plot shows that Nevada and Rhode Island are outliers; New York, California, Hawaii, and Washington have high leverage; and Nevada, Alaska, and Hawaii are influential observations.
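
If you prefer a non-interactive check, the same flags can be listed numerically using the thresholds described above. A small sketch of my own (not from the book):

# Sketch (not from the book): list flagged observations by the rules above.
p <- length(coefficients(fit))
n <- length(fitted(fit))
which(abs(rstudent(fit)) > 2)               # rough outlier screen
names(which(hatvalues(fit) > 2 * p / n))    # high-leverage points
cutoff <- 4 / (n - (p - 1) - 1)             # Cook's D rule: 4/(n - k - 1)
names(which(cooks.distance(fit) > cutoff))  # influential observations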

8.5 Corrective measures

There are four approaches to dealing with violations of the regression assumptions:

• Deleting observations;

• Transforming variables;

• Adding or deleting variables;

• Using another regression approach.

8.5.1 Deleting observations

Deleting outliers can often improve a dataset's fit to the normality assumption, and influential observations, which distort the results, are often deleted as well. After the largest outlier or influential observation has been deleted, the model should be refit. If outliers or influential observations are still present, repeat the process until an acceptable fit is obtained.
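
A minimal sketch of this delete-and-refit loop, assuming fit is the murder-rate model sketched at the start of this section and Nevada is the outlier flagged by outlierTest():

# Sketch (not from the book): drop the flagged observation, refit, and retest.
states2 <- states[rownames(states) != "Nevada", ]   # remove the flagged outlier
fit2 <- update(fit, data = states2)                  # refit on the reduced data
outlierTest(fit2)                                    # retest; repeat if needed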

8.5.2 Variable Transformation

When a model violates the normality, linearity, or homoscedasticity assumption, transforming one or more variables can often improve or correct the model.

When a model violates the normality assumption, a transformation of the response variable is usually attempted. The powerTransform() function in the car package uses maximum likelihood to estimate the power λ most likely to normalize the variable X^λ. The following applies a Box-Cox transformation to normality:

> library(car)
> states <- data.frame(state.region, state.x77)
> summary(powerTransform(states$Murder))
bcPower Transformation to Normality

              Est.Power Std.Err. Wald Lower Bound Wald Upper Bound
states$Murder    0.6055   0.2639           0.0884           1.1227

Likelihood ratio tests about transformation parameters
                           LRT df       pval
LR test, lambda = (0) 5.665991  1 0.01729694
LR test, lambda = (1) 2.122763  1 0.14512456
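
If you decided the transformation was worth using, you could apply the estimated power (roughly 0.6) to the response and refit. A sketch, assuming the murder-rate model formula from the start of this section:

# Sketch (not from the book): refit with the Box-Cox-transformed response.
fit_bc <- lm(Murder^0.6 ~ Population + Illiteracy + Income + Frost, data = states)
summary(fit_bc)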

When the linearity assumption is violated, transforming the predictor variables can often help. The boxTidwell() function in the car package uses maximum likelihood to estimate the powers of the predictor variables that improve linearity. The following example applies a Box-Tidwell transformation to a model that predicts a state's murder rate from its population and illiteracy rate:

> boxTidwell(Murder ~ Population + Illiteracy, data = states)
          Score Statistic   p-value MLE of lambda
Population     -0.3228003 0.7468465     0.8693882
Illiteracy      0.6193814 0.5356651     1.3581188

iterations = 19

8.5.3 Adding and deleting variables

Changing the variables in a model will change its fit. Sometimes adding an important variable will solve many of the problems discussed; deleting a redundant or troublesome variable can have the same effect.

Deleting variables is a particularly important approach for dealing with multicollinearity. If your only goal is to make predictions, multicollinearity isn't a problem; but if you also want to interpret the individual predictor variables, it must be addressed. The most common approach is to delete one of the variables involved in the multicollinearity (for example, one with √vif > 2). An alternative is ridge regression, a variant of multiple regression designed to handle multicollinearity.
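
For reference, the √vif rule of thumb can be checked directly with the vif() function from the car package. A small sketch:

# Sketch: flag predictors involved in multicollinearity (rule of thumb: sqrt(vif) > 2).
library(car)
vif(fit)
sqrt(vif(fit)) > 2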
