This article covers Chapter 13 of "R in Action": generalized linear models.
The generalized linear model extends the linear model framework to cover the analysis of non-normal dependent variables.
Two popular models are logistic regression (the dependent variable is categorical) and Poisson regression (the dependent variable is a count).
Parameters of the glm() function:
Distribution family | Default link function |
binomial | (link = "logit") |
gaussian | (link = "identity") |
Gamma | (link = "inverse") |
inverse.gaussian | (link = "1/mu^2") |
poisson | (link = "log") |
quasi | (link = "identity", variance = "constant") |
quasibinomial | (link = "logit") |
quasipoisson | (link = "log") |
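As a rough illustration of how the family and link are passed to glm(), the sketch below checks a few default links and fits a small logistic model. The mtcars data set and the am ~ wt formula are assumptions chosen only because they ship with R; they are not from the book.

```r
# Each family object records its default link function.
binomial()$link          # "logit"
poisson()$link           # "log"
Gamma()$link             # "inverse"
inverse.gaussian()$link  # "1/mu^2"

# Illustrative fit on the built-in mtcars data (an assumption, not the
# book's example): transmission type (am, 0/1) modelled from weight.
fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

# Omitting the link gives the same model, because logit is the default.
fit.default <- glm(am ~ wt, data = mtcars, family = binomial())
all.equal(coef(fit), coef(fit.default))
```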
Functions commonly used with glm():
Function | Description |
summary() | Shows details of the fitted model |
coefficients(), coef() | Lists the parameters (intercept and slopes) of the fitted model |
confint() | Gives confidence intervals for the model parameters (95% by default) |
residuals() | Lists the residuals of the fitted model |
anova() | Produces an analysis-of-variance table comparing two fitted models |
plot() | Generates diagnostic plots for evaluating the fitted model |
predict() | Uses the fitted model to predict values for a new dataset |
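The helper functions in this table can be tried on any fitted model. The sketch below uses the built-in warpbreaks data with a Poisson family; the data set, formula, and object names are assumptions for illustration only.

```r
# Illustrative Poisson model on the built-in warpbreaks data.
fit  <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson())
fit0 <- update(fit, . ~ . - tension)       # nested model without tension

summary(fit)                               # coefficient table and deviances
est <- coef(fit)                           # intercept and slopes (log scale)
ci  <- suppressMessages(confint(fit))      # 95% confidence intervals
res <- residuals(fit, type = "deviance")   # deviance residuals
anova(fit0, fit, test = "Chisq")           # compare the two nested models

# Predict the expected count for one new observation.
newdata <- data.frame(
  wool    = factor("A", levels = levels(warpbreaks$wool)),
  tension = factor("H", levels = levels(warpbreaks$tension))
)
pred <- predict(fit, newdata = newdata, type = "response")
pred
```

Calling plot(fit) on the same object draws the standard diagnostic plots.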
Model fitting and regression diagnostics:
# Diagnostic plot (model below is a model fitted with glm())
plot(predict(model, type = "response"), residuals(model, type = "deviance"))
# Hat values, studentized residuals, and Cook's distance
plot(hatvalues(model))
plot(rstudent(model))
plot(cooks.distance(model))
# Combined diagnostic plot
library(car)
influencePlot(model)
Diagnostic plots are most useful when the response variable takes many values. They are much less informative when the response takes only a limited number of values, as in logistic regression.
Logistic regression
General process:
First, fit a model with all variables as predictors. Use the significance of the regression coefficients to select the variables that contribute significantly to the equation, and refit the model with only those variables. Then use the anova() function to compare the fit of the two nested models; for generalized linear models, use the chi-square test. If the chi-square value is not significant, the model with fewer variables fits as well as the model with all variables. (See pages 285-288 of the book for specific examples.)
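The process above can be sketched as follows. The built-in infert data stands in for the book's example, and dropping age from the reduced model is an assumption made purely for illustration; with real data, drop whichever terms are not significant.

```r
# Full model: all candidate predictors (infert ships with R).
fit.full <- glm(case ~ age + parity + spontaneous + induced,
                data = infert, family = binomial())
summary(fit.full)   # inspect the coefficient p-values

# Hypothetical reduced model: age is dropped here only as an illustration.
fit.reduced <- glm(case ~ parity + spontaneous + induced,
                   data = infert, family = binomial())

# Chi-square test comparing the two nested models.
cmp <- anova(fit.reduced, fit.full, test = "Chisq")
cmp
```

A non-significant p-value in the comparison table suggests that the smaller model fits about as well as the larger one.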
Interpreting model parameters:
In logistic regression, the response being modeled is the log odds of Y=1. A regression coefficient gives the change in the log odds of the response for a one-unit change in the predictor variable, holding the other predictor variables constant. Because log odds are difficult to interpret, the results are often exponentiated. Confidence intervals for the coefficients can be obtained with the confint() function and exponentiated in the same way:
exp(confint(fit.reduced))
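A minimal sketch of moving from the log-odds scale to the odds scale is shown below. The infert data and the two predictors are assumptions for illustration; fit.reduced here is refitted locally and is not the book's model.

```r
# Illustrative logistic model on the built-in infert data.
fit.reduced <- glm(case ~ spontaneous + induced,
                   data = infert, family = binomial())

coef(fit.reduced)                       # coefficients on the log-odds scale
odds.ratios <- exp(coef(fit.reduced))   # multiplicative change in the odds
ci <- exp(suppressMessages(confint(fit.reduced)))  # 95% CI on the odds scale

cbind(OR = odds.ratios, ci)
```

An exponentiated slope above 1 means the odds of Y=1 increase with the predictor; a value below 1 means they decrease.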
Evaluating the effect of predictor variables on the probability of an outcome:
Because thinking in terms of probabilities is more intuitive than thinking in terms of odds, you can first create an artificial dataset containing the predictor values of interest, and then apply the predict() function to that dataset to obtain the predicted probabilities for those values. (See pages 289-290 of the book for specific examples.)
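The idea can be sketched like this, again with the built-in infert data as a stand-in. The name testdata and the chosen predictor values are hypothetical.

```r
# Illustrative logistic model on the built-in infert data.
fit <- glm(case ~ spontaneous + induced, data = infert, family = binomial())

# Artificial data set: vary spontaneous over 0, 1, 2 and hold induced
# at its sample mean (both choices are assumptions for illustration).
testdata <- data.frame(spontaneous = c(0, 1, 2),
                       induced     = mean(infert$induced))

# type = "response" returns probabilities rather than log odds.
testdata$prob <- predict(fit, newdata = testdata, type = "response")
testdata
```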
Overdispersion:
The expected variance of data drawn from a binomial distribution is $\sigma^2 = n\pi(1-\pi)$, where n is the number of observations and $\pi$ is the probability of belonging to the Y=1 group.
Overdispersion occurs when the observed variance of the response variable is larger than the variance expected from a binomial distribution. Overdispersion can lead to distorted standard errors and inaccurate significance tests.
The glm() function can still be used to fit a logistic regression when overdispersion is present, but the binomial distribution must be replaced with the quasibinomial distribution.
One way to detect overdispersion is to compare the residual deviance of the binomial model with its residual degrees of freedom. If the ratio
$\phi = \dfrac{\text{residual deviance}}{\text{residual df}}$
is much larger than 1, overdispersion is likely.
Specific test: fit the model twice, first with family = "binomial" and then with family = "quasibinomial". Call the object returned by the first fit `fit` and the object returned by the second fit `fit.od`; then:
pchisq(summary(fit.od)$dispersion * fit$df.residual, fit$df.residual, lower = F)
The resulting p-value tests the null hypothesis H0: φ = 1 against the alternative hypothesis H1: φ ≠ 1. If p is small (for example, less than 0.05), H0 can be rejected.
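Put together, the check might look like the sketch below. The infert data and formula are assumptions for illustration; only the names fit and fit.od and the pchisq() call come from the text above.

```r
# Fit the same model twice: binomial, then quasibinomial.
fit    <- glm(case ~ spontaneous + induced, data = infert,
              family = binomial())
fit.od <- glm(case ~ spontaneous + induced, data = infert,
              family = quasibinomial())

# Ratio of residual deviance to residual degrees of freedom.
phi.ratio <- deviance(fit) / df.residual(fit)
phi.ratio

# Test of H0: phi = 1, using the dispersion estimated by the quasi fit.
p <- pchisq(summary(fit.od)$dispersion * fit$df.residual,
            fit$df.residual, lower.tail = FALSE)
p
```

A ratio near 1 together with a large p-value gives no evidence against the binomial variance assumption.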
Extensions of logistic regression:
Robust logistic regression: the glmRob() function in the robust package, useful when the data contain outliers and influential observations
Multinomial logistic regression: the mlogit() function in the mlogit package, for a response variable with more than two unordered categories
Ordinal logistic regression: the lrm() function in the rms package, for a response variable that is a set of ordered categories
Poisson regression:
Scope of application: predicting a count outcome variable from a set of continuous and/or categorical predictor variables.
The analysis process and the interpretation of model parameters are similar to those of logistic regression.
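A small sketch of fitting and interpreting a Poisson regression follows. The built-in warpbreaks data and the formula are assumptions chosen for illustration, not the book's example.

```r
# Count response: number of warp breaks per loom (built-in warpbreaks data).
fit <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson())

coef(fit)                   # coefficients on the log(count) scale
rate.ratios <- exp(coef(fit))
rate.ratios                 # multiplicative change in the expected count
```

As with odds ratios in logistic regression, an exponentiated coefficient is read as a multiplicative effect: here, the factor by which the expected count changes relative to the reference level, holding the other predictor constant.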
Overdispersion:
The variance of a Poisson distribution equals its mean. Overdispersion occurs in Poisson regression when the observed variance of the response variable is larger than the variance predicted by the Poisson distribution. Overdispersion is common in count data, so it deserves extra attention.
Possible causes:
- An important predictor variable has been omitted;
- The events being counted are not independent of one another;
- In longitudinal data analysis, the clustering inherent in repeated measurements can lead to overdispersion.
If overdispersion is present and is not accounted for in the model, the standard errors and confidence intervals may be too small and the significance tests too liberal (that is, you will find effects that do not really exist).
As with logistic regression, overdispersion is likely if the ratio of the residual deviance to the residual degrees of freedom is much larger than 1.
Test method: the qcc package
data(breslow.dat, package = "robust")
library(qcc)
qcc.overdispersion.test(breslow.dat$sumY, type = "poisson")
A significant p-value (less than 0.05) indicates the presence of overdispersion.
As with overdispersion in logistic regression, replace family = "poisson" with family = "quasipoisson". Note that the parameter estimates from the quasi-Poisson approach are identical to those from the Poisson approach, but the standard errors become much larger.
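The effect of switching families can be seen in a short sketch. The built-in warpbreaks data is an assumption standing in for the book's data; the object names are illustrative.

```r
# Same model under the Poisson and quasi-Poisson families.
fit    <- glm(breaks ~ wool + tension, data = warpbreaks,
              family = poisson())
fit.od <- glm(breaks ~ wool + tension, data = warpbreaks,
              family = quasipoisson())

# Residual deviance over residual df; values well above 1 suggest
# overdispersion.
ratio <- deviance(fit) / df.residual(fit)
ratio

# Estimates agree; only the standard errors differ.
se    <- summary(fit)$coefficients[, "Std. Error"]
se.od <- summary(fit.od)$coefficients[, "Std. Error"]
cbind(estimate = coef(fit), se.poisson = se, se.quasi = se.od)
```

The quasi-Poisson standard errors are the Poisson ones scaled by the square root of the estimated dispersion, which is why the tests become more conservative.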
Extensions of Poisson regression:
- Poisson regression with varying time periods:
The discussion of Poisson regression so far has assumed that the response is a count over a fixed-length time period, with the same length for every observation. To allow the time period to vary, treat the outcome variable as a rate. That is, the model
$\log_e(\lambda) = \beta_0 + \sum_{j=1}^{p} \beta_j X_j$
is modified to:
$\log_e\left(\dfrac{\lambda}{\text{time}}\right) = \beta_0 + \sum_{j=1}^{p} \beta_j X_j$
or, in equivalent form:
$\log_e(\lambda) = \log_e(\text{time}) + \beta_0 + \sum_{j=1}^{p} \beta_j X_j$
This is accomplished by passing the offset argument to the glm() function: offset = log(time).
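The offset can be illustrated on simulated data. Everything in the sketch below (sample size, true coefficients 0.5 and 0.3, variable names) is an assumption made up for the demonstration.

```r
set.seed(1)

# Simulated data: each observation is followed for a different time.
n    <- 1000
x    <- rnorm(n)
time <- runif(n, min = 1, max = 10)
y    <- rpois(n, lambda = time * exp(0.5 + 0.3 * x))  # rate = exp(0.5 + 0.3 x)
dat  <- data.frame(y, x, time)

# Offset supplied as an argument ...
fit.arg  <- glm(y ~ x, data = dat, family = poisson(), offset = log(time))
# ... or, equivalently, as a term in the formula.
fit.term <- glm(y ~ x + offset(log(time)), data = dat, family = poisson())

coef(fit.arg)   # estimates describe the log rate per unit of time
```

Both forms fix the coefficient of log(time) at 1, so the remaining coefficients model the event rate rather than the raw count.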
- Zero-inflated Poisson regression:
The number of zero counts in a dataset is often larger than the number predicted by the Poisson model. Zeros that arise because part of the population could never produce the counted event are called structural zeros. Zero-inflated Poisson regression fits two models at the same time and can be thought of as a combination of logistic regression and Poisson regression. The zeroinfl() function in the pscl package fits zero-inflated Poisson regression.
- Robust Poisson regression:
The glmRob() function in the robust package fits robust generalized linear models, including robust Poisson regression. It is useful when the data contain outliers and influential observations.