R-Regression-ch8


1. The multi-faceted nature of regression

(1) Usage scenarios for OLS regression

OLS regression predicts a quantitative dependent variable (the response variable) from a weighted sum of the predictor variables (the explanatory variables), where the weights are parameters estimated from the data.
2. OLS regression

OLS regression fits a model of the form: y = b0 + b1*X1 + b2*X2 + ... + bk*Xk, where the b's are the estimated regression coefficients (weights).

(1) Fitting the regression model with lm():

Functions for plotting the observed data together with the fitted curve:

abline()

lines()

(2) Simple linear regression

fit <- lm(weight ~ height, data = women)
plot(women$height, women$weight)
abline(fit)

(3) Polynomial regression

Form: y = b0 + b1*x + b2*x^2 + b3*x^3

fit2 <- lm(weight ~ height + I(height^2), data = women)
plot(women$height, women$weight)
lines(women$height, fitted(fit2))

The scatterplot() function in the car package makes it easy to draw a bivariate plot. It shows the height vs. weight scatter plot, a linear fit line, and a smoothed (loess) fit curve, along with box plots of each variable in the margins. From it you can judge whether a linear or a polynomial regression is more appropriate.
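As an illustration, a minimal sketch of such a plot for the women data used above (the exact arguments are only one possible choice):

library(car)                                 # provides scatterplot()
scatterplot(weight ~ height, data = women)   # scatter plot with linear fit, loess curve, and marginal box plots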

(4) Multivariate linear regression

Because the lm() function requires a data frame (and the state.x77 dataset is a matrix), the object is first converted with the as.data.frame() function.

In multivariate regression analysis, a good first step is to examine the correlations among the variables, both the explanatory variables and the response variable. The cor() function gives the pairwise correlation coefficients, and the scatterplotMatrix() function in the car package generates a scatter plot matrix.
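A minimal sketch of these steps, assuming the state.x77 dataset mentioned above; the columns chosen here (Murder as the response, the rest as predictors) are only an illustrative selection:

library(car)
# convert the matrix to a data frame, keeping an illustrative set of columns
states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
cor(states)                 # pairwise correlation coefficients
scatterplotMatrix(states)   # scatter plot matrix with fitted and smoothed curves
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
summary(fit)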
(5) Multiple linear regression with interaction terms
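As an illustration (the dataset and variables here are only an example, using the built-in mtcars data), an interaction term can be added to an lm() formula with a colon:

# hp:wt adds the interaction between horsepower and weight
fit_int <- lm(mpg ~ hp + wt + hp:wt, data = mtcars)
summary(fit_int)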

3. Regression diagnosis

The lm() function fits an OLS regression model, but relying on that model requires the data to satisfy the statistical assumptions behind OLS. The summary() function describes the model as a whole, yet it provides no information about the extent to which the model satisfies those assumptions. Hence the following regression diagnostics.

(1) Standard method

The base R installation provides a large number of methods for checking the statistical assumptions of a regression analysis. The most common approach is to apply the plot() function to the object returned by lm(), which generates four graphs for evaluating the fit of the model.
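For example, applying plot() to the simple regression fitted earlier produces the four diagnostic plots on one device:

fit <- lm(weight ~ height, data = women)
par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2 x 2 grid
plot(fit)
par(mfrow = c(1, 1))   # restore the default layout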
Statistical assumptions for OLS regression:

* Normality. For fixed values of the predictor variables, the dependent variable is normally distributed, so the residuals should also be normally distributed with mean 0. The normal Q-Q plot shows the standardized residuals against the values expected under a normal distribution. If the normality assumption is satisfied, the points should fall on a straight line at a 45-degree angle.

* Independence. The values of the dependent variable are independent of each other (equivalently, the residuals are independent of each other). This cannot be judged from the four graphs; it must be assessed from how the data were collected.

* Linearity. Calling this assumption "linearity" is somewhat narrow: if the fitted OLS model is adequate, the residuals should be uncorrelated with the fitted values. In other words, the model has extracted all the systematic information and what remains is white noise. Check this in the residuals vs. fitted plot.

* Homoscedasticity. The variance of the dependent variable does not change with the levels of the predictor variables. If the constant variance assumption is satisfied, the points around the horizontal line in the scale-location plot should be randomly distributed.

The fourth graph, residuals vs. leverage, provides information about individual observations you may want to examine: outliers, high-leverage points, and influential points.

(2) Methods of improvement

* Normality:

The following two methods check the normality of the residuals.

The qqPlot() function in the car package plots the studentized residuals against a t distribution with n - p - 1 degrees of freedom.

The residplot() function generates a histogram of the studentized residuals and overlays a normal curve, a kernel density curve, and a rug plot.
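A minimal sketch of the qqPlot() check, assuming fit is an lm object such as the illustrative states model sketched earlier:

library(car)
states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
qqPlot(fit)   # studentized residuals vs. t quantiles; the points should follow the line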

* Independence of errors: testing for autocorrelation of the errors

Durbin-Watson test

The durbinWatsonTest() function provided by the car package tests for serially correlated errors.
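A minimal sketch, assuming the same illustrative states model as above:

library(car)
states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
durbinWatsonTest(fit)   # a non-significant p-value suggests no serial correlation in the errors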

* Linearity

A component-plus-residual plot (also called a partial residual plot) shows whether the dependent variable is nonlinearly related to an individual predictor variable, and whether there is systematic deviation from the linear model that has been specified. If a plot is nonlinear, the functional form of that predictor may not have been modeled adequately and curvilinear components may need to be added. These plots are drawn with the crPlots() function in the car package.
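A minimal sketch, again assuming the illustrative states model:

library(car)
states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
crPlots(fit)   # one component-plus-residual plot per predictor; nonlinearity suggests adding curve terms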

* Homoscedasticity:

To determine whether the error variance is constant, the car package provides two functions.

The ncvTest() function produces a score test whose null hypothesis is that the error variance is constant, against the alternative hypothesis that the error variance changes with the level of the fitted values.

The spreadLevelPlot() function creates a scatter plot of the absolute standardized residuals against the fitted values, with a best-fit line superimposed.

If heteroscedasticity is present, the output includes a suggested power transformation: raising Y to the suggested power p should stabilize the non-constant error variance. For example, if the plot shows a non-horizontal trend and the suggested power transformation is 0.5, replacing Y with the square root of Y in the regression equation may allow the model to satisfy homoscedasticity.
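A minimal sketch of both checks, assuming the illustrative states model:

library(car)
states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
ncvTest(fit)           # score test; a significant p-value indicates non-constant error variance
spreadLevelPlot(fit)   # also prints a suggested power transformation for the response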

(3) Global validation of linear model assumptions
The gvlma() function in the gvlma package performs a comprehensive validation of the linear model assumptions and also evaluates skewness, kurtosis, and heteroscedasticity. In other words, it provides a single omnibus (pass/fail) test of the model assumptions. If the test fails, use the methods above to determine which assumptions are violated.
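A minimal sketch, assuming the illustrative states model:

library(gvlma)
states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
gvmodel <- gvlma(fit)   # global validation of linear model assumptions
summary(gvmodel)        # reports which assumptions are acceptable and which are not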

(4) Multicollinearity

For multivariate regression, check whether the explanatory variables are correlated with one another.

Scenario: when the F test is significant but none of the regression coefficients of the individual explanatory variables is significant, multicollinearity should be suspected.

A regression coefficient measures the effect of one predictor variable on the response variable while the other predictor variables are held constant.
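The notes do not name a function for this check; one common approach (not mentioned above) is the variance inflation factor from the car package, sketched here on the illustrative states model:

library(car)
states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
vif(fit)             # variance inflation factors
sqrt(vif(fit)) > 2   # common rule of thumb: TRUE flags a possible multicollinearity problem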

4. Unusual observations

(1) Outliers

(2) High-leverage points

High-leverage observations are outliers with respect to the other predictor variables: they have an unusual combination of predictor values, and the value of the response variable plays no role in determining leverage.
High-leverage observations can be identified with the hat statistic. For a given data set, the average hat value is p/n, where p is the number of parameters estimated by the model (including the intercept) and n is the sample size. In general, an observation whose hat value is more than 2 or 3 times the average hat value can be considered a high-leverage point.
hatvalues() function
High-leverage points may or may not be influential points; that depends on whether they are also outliers.
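A minimal sketch of the hat-value check described above, using the illustrative states model:

states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
hv <- hatvalues(fit)
plot(hv)                                  # index plot of hat values
abline(h = c(2, 3) * mean(hv), lty = 2)   # reference lines at 2 and 3 times the average hat value p/n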

(3) Influential points

An influential point is an observation that has a disproportionate impact on the model's parameter estimates.

There are two methods for detecting influential points: Cook's distance (the D statistic) and added-variable plots. In general, a Cook's D value greater than 4/(n - k - 1), where n is the sample size and k is the number of predictor variables, indicates an influential point.
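A minimal sketch of the Cook's distance check, using the illustrative states model and the 4/(n - k - 1) cutoff given above:

states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
k <- length(coef(fit)) - 1             # number of predictor variables (intercept excluded)
cutoff <- 4 / (nrow(states) - k - 1)   # cutoff from the rule of thumb above
plot(fit, which = 4)                   # Cook's distance plot
abline(h = cutoff, lty = 2)            # observations above this line are candidate influential points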

Not read

5. Improvement measures

(1) Deleting observations

Proceed with caution.

(2) Variable transformation

When the linearity assumption is violated, transforming the predictor variables is often useful. The boxTidwell() function in the car package obtains maximum-likelihood estimates of the powers to which the predictor variables should be raised in order to improve linearity.

Transforming the response variable can also help with heteroscedasticity (non-constant error variance); see the power transformation suggested by the spreadLevelPlot() function in the car package.
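A minimal sketch of both transformations, using the illustrative states data; the predictors passed to boxTidwell() are only an example:

library(car)
states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
boxTidwell(Murder ~ Population + Illiteracy, data = states)   # ML estimates of predictor powers for linearity
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
spreadLevelPlot(fit)                                          # suggested power transformation of the response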

(3) Adding and deleting variables

(4) Try another method

6. Select the "Best" regression model

No single model is optimal; the judgment rests with the analyst. Selecting the final regression model involves a trade-off between predictive accuracy (model goodness of fit) and model simplicity.

(1) Comparison of models

Method one: use the anova() function in the base installation to compare the goodness of fit of two nested models. A nested model is one whose terms are completely contained in the other model.

Method two: the AIC (Akaike Information Criterion) can also be used to compare models; it takes into account both the statistical fit of the model and the number of parameters being fitted. The model with the smaller AIC value is preferred, indicating that it achieves an adequate fit with fewer parameters.
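A minimal sketch of both comparisons, using two nested models built from the illustrative states data:

states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
fit1 <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)   # full model
fit2 <- lm(Murder ~ Population + Illiteracy, data = states)                    # nested (reduced) model
anova(fit2, fit1)   # non-significant result: the extra terms add little, prefer the simpler model
AIC(fit1, fit2)     # the model with the smaller AIC is preferred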

(2) Variable selection

There are two popular ways to select the final predictor variables from a large set of candidates: stepwise regression and all-subsets regression.

* Stepwise regression

In the output, the AIC value on the <none> row is the AIC of the current model when no further variable is deleted.
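The notes do not name the function used; the output described (a <none> row with an AIC value) matches backward stepwise selection with stepAIC() from the MASS package, sketched here on the illustrative states model:

library(MASS)
states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
stepAIC(fit, direction = "backward")   # drop variables one at a time while the AIC keeps decreasing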

Drawback: stepwise regression does not evaluate every possible model, so the model it eventually finds, although good, is not necessarily the best one. This is what motivated all-subsets regression.

* All-subsets regression

Method One:

Method Two:

The Mallows Cp statistic is also used as a stopping rule in stepwise regression. Extensive research has shown that, for a good model, the Cp statistic is very close to the number of parameters in the model (including the intercept).
Plots can be drawn with the subsets() function in the car package.
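A minimal sketch of all-subsets regression, assuming regsubsets() from the leaps package for the fitting (the notes name only the plotting function) and the illustrative states model:

library(leaps)
library(car)
states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
leaps_fit <- regsubsets(Murder ~ Population + Illiteracy + Income + Frost, data = states, nbest = 4)
plot(leaps_fit, scale = "adjr2")       # best subsets ranked by adjusted R-squared
subsets(leaps_fit, statistic = "cp")   # Mallows Cp for each subset
abline(1, 1, lty = 2)                  # good models lie close to the line Cp = number of parameters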

In most cases, all-subsets regression is better than stepwise regression because more models are considered. However, when there are many predictor variables, all-subsets regression becomes slow. In general, automated variable selection should be regarded as an aid to model selection rather than a substitute for it: a model that fits well but makes no substantive sense does not help you, and understanding of the subject matter should ultimately guide you to the ideal model.

7. Going further
This section introduces methods for evaluating a model's generalization ability and the relative importance of its variables.

(1) Cross-validation

Cross-validation is used to evaluate the generalization ability of the regression equation, that is, how well the equation predicts new observations.

In cross-validation, a certain proportion of the data is selected as the training sample and the remaining observations as the holdout sample. The regression equation is first obtained from the training sample and then used to make predictions on the holdout sample. Because the holdout sample is not involved in estimating the model parameters, it gives a more realistic estimate of how the equation will perform on new data.
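The notes do not name a specific function; a minimal manual k-fold sketch in base R, using the illustrative states model, might look like this:

set.seed(1234)
states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
k <- 5
folds <- sample(rep(1:k, length.out = nrow(states)))   # randomly assign each observation to a fold
cv_mse <- numeric(k)
for (i in 1:k) {
  train <- states[folds != i, ]   # training sample
  test  <- states[folds == i, ]   # holdout sample
  m <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = train)
  pred <- predict(m, newdata = test)
  cv_mse[i] <- mean((test$Murder - pred)^2)   # prediction error on the holdout fold
}
mean(cv_mse)   # average holdout error estimates how well the equation generalizes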
(2) Relative importance
Which explanatory variables are most important for prediction?

If the predictor variables were uncorrelated, the task would be much simpler: rank them by the correlation coefficients between the predictor variables and the response variable. In most cases, however, the predictor variables are correlated with each other, which makes the assessment much more complicated.

Method One:

The simplest approach is to compare standardized regression coefficients, which express the expected change in the response variable (in standard deviation units) for a one standard deviation change in a predictor variable, with the other predictors held constant. Before the regression analysis, the scale() function can be used to standardize the data so that each variable has mean 0 and standard deviation 1; running the regression in R on the standardized data then yields standardized regression coefficients. (Note that scale() returns a matrix while lm() requires a data frame, so an intermediate conversion step is needed.)
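A minimal sketch, using the illustrative states model; note the as.data.frame() conversion mentioned above:

states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
zstates <- as.data.frame(scale(states))   # standardize: mean 0, standard deviation 1
zfit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = zstates)
coef(zfit)   # standardized regression coefficients; larger absolute value = greater relative importance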

Method Two:

Relative weights.










