R Language Regression Chapter

Source: Internet
Author: User
Tags: mathematical functions
1. The multiple facets of regression

Regression type and typical use:

Simple linear: uses one quantitative explanatory variable to predict a quantitative response variable (one dependent variable, one independent variable).

Polynomial: uses one quantitative explanatory variable to predict a quantitative response variable, with the model including powers of that variable (an n-th order polynomial).

Multiple linear: uses two or more quantitative explanatory variables to predict a quantitative response variable.

Multivariate: uses one or more explanatory variables to predict more than one response variable.

Logistic: uses one or more explanatory variables to predict a categorical response variable.

Poisson: uses one or more explanatory variables to predict a response variable representing counts (frequencies).

Cox proportional hazards: uses one or more explanatory variables to predict the time until an event (death, failure, or relapse) occurs.

Time series: models time-series data with correlated error terms.

Nonlinear: uses one or more quantitative explanatory variables to predict a quantitative response variable, but the model is nonlinear.

Nonparametric: uses one or more quantitative explanatory variables to predict a quantitative response variable; the form of the model is derived from the data rather than specified in advance.

Robust: uses one or more quantitative explanatory variables to predict a quantitative response variable, with methods that resist interference from strong influential points.
2. OLS regression

OLS regression predicts a quantitative dependent variable from a weighted sum of the predictor variables, where the weights are parameters estimated from the data.

The parameters are chosen to minimize the sum of squared residuals.
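
In standard notation (added here for reference), the fitted model and the least-squares criterion are

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \cdots + \hat{\beta}_k X_{ki}, \qquad \min_{\hat{\beta}_0,\ldots,\hat{\beta}_k} \sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2$$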

To properly interpret the coefficients of the OLS model, the data must satisfy the following statistical assumptions:

(1) Normality: for fixed values of the independent variables, the dependent variable is normally distributed.

(2) Independence: the Yi values are independent of each other.

(3) Linearity: the dependent variable is linearly related to the independent variables.

(4) Homoscedasticity: the variance of the dependent variable does not change with the levels of the independent variables (constant variance, also called homogeneity of variance).
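
Together, these four assumptions amount to the standard error model (stated here for reference, not from the original text):

$$Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} + \varepsilon_i, \qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$$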


3. Using lm() to fit the regression model

The most basic function for fitting a linear model is lm(), which has the following format:

myfit <- lm(formula, data)

formula specifies the form of the model to be fitted; data is a data frame containing the data used to fit the model.

A formula takes the form: y ~ x1 + x2 + ... + xk (the response variable is to the left of ~, the predictor variables are to the right, separated by + symbols).

The symbols commonly used in R formulas:

Symbol    Use

~    Separates the response variable (left) from the explanatory variables (right). E.g., to predict y from x, z, and w, the code is y ~ x + z + w.

+    Separates predictor variables.

:    Denotes an interaction between predictor variables. E.g., to predict y from x, z, and the interaction of x and z, the code is y ~ x + z + x:z.

*    A shorthand for all possible interactions. The code y ~ x*z*w expands to y ~ x + z + w + x:z + x:w + z:w + x:z:w.

^    Denotes interactions up to a specified degree. The code y ~ (x + z + w)^2 expands to y ~ x + z + w + x:z + x:w + z:w.

.    Denotes all variables other than the dependent variable. E.g., if a data frame contains the variables x, y, z, and w, the code y ~ . expands to y ~ x + z + w.

-    A minus sign removes a variable from the equation. E.g., y ~ (x + z + w)^2 - x:w expands to y ~ x + z + w + x:z + z:w.

-1    Removes the intercept. E.g., y ~ x - 1 fits a regression of y on x and forces the line through the origin.

I()    Interprets the elements inside the parentheses arithmetically. E.g., y ~ x + (z + w)^2 expands to y ~ x + z + w + z:w; in contrast, y ~ x + I((z + w)^2) expands to y ~ x + h, where h is a new variable equal to the square of the sum of z and w.

function    Mathematical functions can be used in formulas. For example, log(y) ~ x + z + w predicts log(y) from x, z, and w.
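
As a quick check of how these operators expand (a minimal sketch with made-up data; the variable names x, z, w, y are illustrative only), you can inspect the design-matrix columns R builds for a formula:

d <- data.frame(x=rnorm(10), z=rnorm(10), w=rnorm(10), y=rnorm(10))
colnames(model.matrix(y ~ x*z, data=d))            # intercept, x, z, and x:z
colnames(model.matrix(y ~ (x+z+w)^2, data=d))      # adds all two-way interactions only
colnames(model.matrix(y ~ x + I((z+w)^2), data=d)) # one column equal to (z+w)^2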

Other functions that are useful when fitting linear models:

Function    Use

summary()    Displays detailed results for the fitted model

coefficients()    Lists the model parameters (intercept and slopes) of the fitted model

confint()    Provides confidence intervals for the model parameters (95% by default)

fitted()    Lists the predicted values of the fitted model

residuals()    Lists the residuals of the fitted model

anova()    Generates an ANOVA table for a fitted model, or compares the fit of two or more models

vcov()    Lists the covariance matrix of the model parameters

AIC()    Prints the Akaike Information Criterion

plot()    Generates diagnostic plots for evaluating the fit of a model

predict()    Uses the fitted model to predict response values for a new dataset
4. Simple linear regression

Example:

fit <- lm(weight ~ height, data=women)
summary(fit)


In the Pr(>|t|) column, you can see that the regression coefficient (3.45) is significantly different from 0 (p < 0.001), indicating that weight is expected to increase by 3.45 pounds for every 1-inch increase in height.

The Multiple R-squared (0.991) indicates that the model explains 99.1% of the variance in weight; it is also the squared correlation between the actual and predicted values (R² = r²).

The residual standard error (1.53 lbs) can be thought of as the average error in predicting weight from height with this model.

The F statistic tests whether the predictor variables, taken together, predict the response variable above a chance level.

fitted(fit)     # predicted values from the fitted model

residuals(fit)  # residuals of the fitted model


plot(women$height, women$weight,
     xlab="Height (in inches)",
     ylab="Weight (in pounds)")
abline(fit)
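
The helper functions from the table above can then be applied directly to the fitted model; for instance (a small illustration; the new height value of 66 inches is made up):

confint(fit)                                 # 95% confidence intervals for intercept and slope
predict(fit, newdata=data.frame(height=66))  # predicted weight at a height of 66 inches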


5. Polynomial regression

fit2 <- lm(weight ~ height + I(height^2), data=women)
summary(fit2)


plot(women$height, women$weight,
     xlab="Height (in inches)",
     ylab="Weight (in lbs)")
lines(women$height, fitted(fit2))


In general, an n-th degree polynomial produces a curve with n-1 bends.
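
For example, a cubic term could be added in the same way (a sketch; whether it actually improves the fit should be checked with summary()):

fit3 <- lm(weight ~ height + I(height^2) + I(height^3), data=women)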

The scatterplot() function in the car package makes it easy and convenient to draw an enhanced bivariate plot.

scatterplot(weight ~ height,
            data=women,
            spread=FALSE,
            lty.smooth=2,
            pch=19,
            main="Women Age 30-39",
            xlab="Height (inches)",
            ylab="Weight (lbs.)")


6. Multiple linear regression

Data set used: state.x77

states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])

Examining the relationships between pairs of variables:

cor(states)


library(car)
scatterplotMatrix(states, spread=FALSE, lty.smooth=2, main="Scatter Plot Matrix")


By default, the scatterplotMatrix() function draws scatter plots between each pair of variables in the off-diagonal cells and adds smoothed (loess) and linear fit lines.

Multiple linear regression:

fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states)
summary(fit)


7. Multiple linear regression with interaction terms

fit <- lm(mpg ~ hp + wt + hp:wt, data=mtcars)
summary(fit)


The effect() function in the effects package can display the results of interaction terms graphically.

install.packages("effects")
library(effects)
plot(effect("hp:wt", fit,
            xlevels=list(wt=c(2.2, 3.2, 4.2))),
     multiline=TRUE)


8. Regression diagnostics (1) The standard method

Simple linear regression:
fit <- lm(weight ~ height, data=women)
par(mfrow=c(2,2))
plot(fit)


Normality: when the values of the predictor variables are fixed, the residuals should be normally distributed with mean 0. The Normal Q-Q plot shows the standardized residuals plotted against their theoretical normal quantiles; if the normality assumption is satisfied, the points should fall on the 45-degree line; if not, the normality assumption is violated.

Independence: can only be judged from how the data were collected.

Linearity: if the dependent variable is linearly related to the independent variables, there should be no systematic relationship between the residuals and the predicted (fitted) values; if there is, the regression model probably needs to be adjusted.

Homoscedasticity: if the constant-variance assumption is satisfied, the points in the Scale-Location plot should be randomly scattered around a horizontal line.


Diagnostic plots for the quadratic fit:

fit2 <- lm(weight ~ height + I(height^2), data=women)
par(mfrow=c(2,2))
plot(fit2)


(2) Improved methods

Regression diagnostic utility functions (in the car package):

Function    Purpose

qqPlot()    Quantile comparison plot
durbinWatsonTest()    Durbin-Watson test for autocorrelated errors
crPlots()    Component-plus-residual plots
ncvTest()    Score test for non-constant error variance
spreadLevelPlot()    Spread-level plot
outlierTest()    Bonferroni outlier test
avPlots()    Added-variable plots
influencePlot()    Regression influence plot
scatterplot()    Enhanced scatter plot
scatterplotMatrix()    Enhanced scatter plot matrix
vif()    Variance inflation factors

In addition, the gvlma package provides a global test of linear model assumptions.


Normality:

Compared with the plot() function, qqPlot() provides a more accurate test of the normality assumption: it plots the studentized residuals against a t distribution with n-p-1 degrees of freedom, where n is the sample size and p is the number of regression parameters (including the intercept).

Example:

library(car)
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states)
qqPlot(fit, labels=row.names(states), id.method="identify",
       simulate=TRUE, main="Q-Q Plot")


A function for plotting studentized residuals:

residplot <- function(fit, nbreaks=10) {
  z <- rstudent(fit)
  hist(z, breaks=nbreaks, freq=FALSE,
       xlab="Studentized Residual",
       main="Distribution of Errors")
  rug(jitter(z), col="brown")
  curve(dnorm(x, mean=mean(z), sd=sd(z)),
        add=TRUE, col="blue", lwd=2)
  lines(density(z)$x, density(z)$y,
        col="red", lwd=2, lty=2)
  legend("topright",
         legend=c("Normal Curve", "Kernel Density Curve"),
         lty=1:2, col=c("blue", "red"), cex=0.7)
}
residplot(fit)


Independence of errors:

As mentioned earlier, whether the dependent-variable values are independent can only be judged from how the data were collected.

The car package provides the durbinWatsonTest() function, which performs a Durbin-Watson test to detect serially correlated errors.

durbinWatsonTest(fit)


Linearity:

Whether the dependent variable relates to the independent variables nonlinearly, i.e., whether there are systematic departures from the specified linear model, can be judged from component-plus-residual plots, drawn with the crPlots() function in the car package.

library(car)
crPlots(fit)


If a plot is nonlinear, it suggests that the functional form of that predictor may not be adequately modeled.

The car package provides two useful functions for determining whether the error variance is constant:

The ncvTest() function generates a score test; the null hypothesis is that the error variance is constant.

The spreadLevelPlot() function creates a scatter plot of the absolute standardized residuals versus the fitted values, with a line of best fit superimposed.


Checking homoscedasticity:

library(car)
ncvTest(fit)
spreadLevelPlot(fit)



(3) Global validation of linear model assumptions

The gvlma() function in the gvlma package:

install.packages("gvlma")
library(gvlma)
gvmodel <- gvlma(fit)
summary(gvmodel)


(4) Multicollinearity

VIF (Variance Inflation Factor) is used for detection.

As a general rule, sqrt(VIF) > 2 indicates a multicollinearity problem.

library(car)
vif(fit)
sqrt(vif(fit)) > 2




9. Unusual observations (1) Outliers

Outliers are observations that the model predicts poorly; they usually have large positive or negative residuals. A positive residual indicates that the model underestimates the response value; a negative residual indicates that it overestimates the response value.

library(car)
outlierTest(fit)


The outlierTest() function judges whether there is an outlier based on the significance of the single largest (positive or negative) studentized residual. If it is not significant, there are no outliers in the dataset; if it is significant, you must delete that outlier and re-run the test to check for others. (2) High-leverage points

High-leverage observations are outliers with respect to the other predictor variables: they are identified by an unusual combination of predictor values, and the value of the response variable plays no role.

High-leverage observations can be identified with the hat statistic. For a given dataset, the average hat value is p/n, where p is the number of parameters estimated by the model (including the intercept) and n is the sample size. In general, an observation whose hat value is greater than 2 or 3 times the average hat value can be considered a high-leverage point.

hat.plot <- function(fit) {
  p <- length(coefficients(fit))
  n <- length(fitted(fit))
  plot(hatvalues(fit), main="Index Plot of Hat Values")
  abline(h=c(2,3)*p/n, col="red", lty=2)
  identify(1:n, hatvalues(fit), names(hatvalues(fit)))
}
hat.plot(fit)


(3) Influential points

An influential point is one that has a disproportionate impact on the estimates of the model parameters. For example, if the model changes dramatically when a single observation is removed, you should check whether the data contain influential points.

Detection methods:

Cook's distance (the D statistic): a Cook's D value greater than 4/(n-k-1), where n is the sample size and k is the number of predictor variables, indicates an influential point (this helps identify influential points but does not tell you how they affect the model).

Added-variable plots (which compensate for that shortcoming): for each predictor xk, plot the residuals of the response variable regressed on the other k-1 predictors against the residuals of xk regressed on the other k-1 predictors.

cutoff <- 4/(nrow(states)-length(fit$coefficients)-2)
plot(fit, which=4, cook.levels=cutoff)
abline(h=cutoff, lty=2, col="red")


library(car)
avPlots(fit, ask=FALSE, onepage=TRUE, id.method="identify")


The influencePlot() function in the car package consolidates information about outliers, leverage points, and influential points into a single plot.

library(car)
influencePlot(fit, id.method="identify", main="Influence Plot",
              sub="Circle size is proportional to Cook's distance")

Influence plot: states with studentized residuals above +2 or below -2 may be considered outliers; states with hat values above 0.2 or 0.3 have high leverage (an unusual combination of predictor values). Circle size is proportional to influence; observations drawn with large circles may have a disproportionate impact on the model estimates.


10. Corrective measures (1) Deleting observations

Deleting observations can improve a dataset's fit to the normality assumption; influential points that distort the results are often deleted. After deleting the largest outlier or influential point, the model must be refitted; if outliers or influential points remain, repeat the process until an acceptable fit is obtained.

Deleting observations should be done with caution. (2) Variable transformation

When a model does not meet the normality, linearity, or homoscedasticity assumptions, transforming one or more variables can often improve or adjust the model.

When a model violates the normality assumption, you can usually try a transformation of the response variable.

The powerTransform() function in the car package performs a Box-Cox transformation to normality:

library(car)
summary(powerTransform(states$Murder))


(3) Adding or deleting variables

Changing which variables appear in the model affects its fit: variables can be added or removed.

For a multicollinearity problem: ridge regression.
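
Ridge regression itself is not demonstrated in this article; a minimal sketch using lm.ridge() from the MASS package might look like the following (the lambda grid here is arbitrary):

library(MASS)
# Fit ridge regression over a grid of penalty values; lambda=0 reproduces OLS
ridge <- lm.ridge(Murder ~ Population + Illiteracy + Income + Frost,
                  data=states, lambda=seq(0, 10, 0.1))
MASS::select(ridge)   # suggests a lambda (HKB, L-W, and GCV criteria)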


11. Selecting the "best" regression model (1) Model comparison

The anova() function can compare the goodness of fit of two nested models.

A nested model is one whose terms are completely contained in another model.


Comparing models with the anova() function:

fit1 <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states)
fit2 <- lm(Murder ~ Population + Illiteracy, data=states)
anova(fit2, fit1)


In the output, Model 1 (fit2) is nested within Model 2 (fit1). The test is not significant, so we conclude that Income and Frost do not need to be added to the linear model and can be removed.


The AIC (Akaike Information Criterion) can also be used to compare models; it takes into account a model's statistical fit and the number of parameters used to achieve that fit.

Models with smaller AIC values are preferred: they achieve an adequate fit with fewer parameters.

fit1 <- lm(Murder ~ Population + Illiteracy + Income + Frost,
           data=states)
fit2 <- lm(Murder ~ Population + Illiteracy, data=states)
AIC(fit1, fit2)



(2) Variable selection. Stepwise regression (stepwise method):

Forward stepwise regression adds one predictor variable to the model at a time, stopping when adding further variables no longer improves the model.

Backward stepwise regression starts with a model containing all the predictor variables and deletes one variable at a time until doing so would degrade the model's quality.

Stepwise stepwise regression (usually just called stepwise) combines the forward and backward approaches.

The stepAIC() function in the MASS package implements stepwise regression based on the AIC criterion.

Backward stepwise regression:
library(MASS)
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states)
stepAIC(fit, direction="backward")


Full subset regression (all-subsets regression):

In full subset regression, all possible combinations of predictors are examined. You can display all possible results, or display the best model for each subset size (one predictor, two predictors, and so on).

It can be performed with the regsubsets() function in the leaps package.

The "best" model can be selected using R-squared, adjusted R-squared, or the Mallows Cp statistic.

R-squared measures the degree to which the predictor variables explain the response variable.

Adjusted R-squared is similar but also takes the number of model parameters into account.

The Mallows Cp statistic is also used as a stopping rule in stepwise regression; for a good model, the Cp statistic is close to the number of model parameters (including the intercept).

install.packages("leaps")
library(leaps)
leaps <- regsubsets(Murder ~ Population + Illiteracy + Income + Frost,
                    data=states, nbest=4)
plot(leaps, scale="adjr2")



library(car)
subsets(leaps, statistic="cp", main="Cp Plot for All Subsets Regression")
abline(1, 1, lty=2, col="red")


12. Going deeper (1) Cross-validation

In cross-validation, a certain proportion of the data is selected as the training sample and the rest as the hold-out sample. The regression equation is first obtained from the training sample and then used to make predictions on the hold-out sample. Because the hold-out sample is not involved in selecting the model or its parameters, it gives a more accurate estimate of performance on new data than the training sample does.

In k-fold cross-validation, the sample is divided into k subsamples; k-1 subsamples serve as the training set and the remaining one as the hold-out set. This yields k prediction equations; the prediction performance on the k hold-out samples is recorded and then averaged. (When n is the total number of observations and k equals n, the method is known as the jackknife.)

The crossval() function in the bootstrap package can perform k-fold cross-validation.

install.packages("bootstrap")
library(bootstrap)
shrinkage <- function(fit, k=10) {
  require(bootstrap)
  theta.fit <- function(x, y) { lsfit(x, y) }
  theta.predict <- function(fit, x) { cbind(1, x) %*% fit$coef }
  x <- fit$model[, 2:ncol(fit$model)]
  y <- fit$model[, 1]
  results <- crossval(x, y, theta.fit, theta.predict, ngroup=k)
  r2 <- cor(y, fit$fitted.values)^2
  r2cv <- cor(y, results$cv.fit)^2
  cat("Original R-square =", r2, "\n")
  cat(k, "Fold Cross-Validated R-square =", r2cv, "\n")
  cat("Change =", r2 - r2cv, "\n")
}
fit <- lm(Murder ~ Population + Income + Illiteracy + Frost, data=states)
shrinkage(fit)
fit2 <- lm(Murder ~ Population + Illiteracy, data=states)
shrinkage(fit2)


(2) Relative importance

zstates <- as.data.frame(scale(states))
zfit <- lm(Murder ~ Population + Income + Illiteracy + Frost, data=zstates)
coef(zfit)


Relative weights: an approximation of the average increase in R-squared obtained by adding a predictor variable across all possible submodels.

relweights <- function(fit, ...) {
  R <- cor(fit$model)
  nvar <- ncol(R)
  rxx <- R[2:nvar, 2:nvar]
  rxy <- R[2:nvar, 1]
  svd <- eigen(rxx)
  evec <- svd$vectors
  ev <- svd$values
  delta <- diag(sqrt(ev))
  lambda <- evec %*% delta %*% t(evec)
  lambdasq <- lambda^2
  beta <- solve(lambda) %*% rxy
  rsquare <- colSums(beta^2)
  rawwgt <- lambdasq %*% beta^2
  import <- (rawwgt / rsquare) * 100
  lbls <- names(fit$model[2:nvar])
  rownames(import) <- lbls
  colnames(import) <- "Weights"
  barplot(t(import), names.arg=lbls,
          ylab="% of R-Square",
          xlab="Predictor Variables",
          main="Relative Importance of Predictor Variables",
          sub=paste("R-Square =", round(rsquare, digits=3)),
          ...)
  return(import)
}
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states)
relweights(fit, col="lightgrey")
