The simple linear regression model was described earlier; this article covers the multiple linear regression model.
Simple linear regression describes a linear relationship between a dependent variable and a single independent variable, whereas multiple linear regression describes a linear relationship between a dependent variable and several independent variables. Compared with simple linear regression, multiple linear regression is more practical, because in real life interactions among multiple factors are common and a dependent variable is usually influenced by more than one independent variable.
The main problems solved by multiple linear regression are:
1. Estimating the linear relationship between the independent variables and the dependent variable (estimating the regression equation)
2. Determining which independent variables have an impact on the dependent variable (influencing-factor analysis)
3. Determining which independent variable has the greatest impact on the dependent variable and which has the least (variable importance analysis)
4. Predicting the dependent variable from the independent variables, or while holding certain independent variables fixed (predictive analysis)
The basic form of the multiple linear regression model is

Y = β0 + β1X1 + β2X2 + ... + βkXk + μ

In the formula:
β0 (and its sample estimate b0) is the constant term (intercept).
βk (and its sample estimate bk) are the partial regression coefficients, which indicate how much Y changes when the corresponding independent variable changes by one unit while the other independent variables are held fixed.
μ (and the sample residual e) is the error term, that is, the part of the variation in Y that cannot be explained by the independent variables in the model.
===============================================
Partial regression coefficients
The partial regression coefficient is the main difference between multiple linear regression and simple linear regression: to investigate the effect of one independent variable on the dependent variable, the other independent variables must be assumed to remain unchanged.
Standardization of the partial regression coefficients:
The partial regression coefficients carry units; because the independent variables have different units (dimensions), their partial regression coefficients cannot be compared directly. To comprehensively evaluate the size of each independent variable's contribution to the dependent variable Y, the partial regression coefficients must be standardized. Standardized partial regression coefficients have no units, and the larger the coefficient, the greater the influence of that independent variable on Y.
The standardized partial regression coefficient is calculated as b'j = bj × (Sxj / Sy), where Sxj and Sy are the standard deviations of Xj and Y.
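For example (with made-up numbers), if b1 = 2.0, the standard deviation of X1 is 1.5 and the standard deviation of Y is 6.0, then the standardized coefficient is b'1 = 2.0 × 1.5 / 6.0 = 0.5.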
=====================================================
Applicable conditions of multiple linear regression
1. Linearity: there is a linear relationship between the dependent variable and each independent variable, which can be judged from a scatter-plot matrix.
2. No autocorrelation: the random errors μi and μj corresponding to any two observations xi and xj are independent and uncorrelated.
3. The random error follows a normal distribution with mean 0 and a constant variance.
4. For given values of x, the variance of the residuals is equal (a constant), i.e. homoscedasticity.
The above four points are similar to simple linear regression and need to be judged from residual plots. If they are not satisfied, corresponding adjustments are needed: if the linearity condition is not met, modify the model or use curve fitting; if conditions 2 or 3 are not met, transform the variables; if condition 4 is not met, do not use ordinary least squares to estimate the regression parameters.
The parameters of multiple linear regression are also estimated by least squares, but compared with simple linear regression, because there is more than one independent variable, the calculation is cumbersome and requires a computer; the main statistical software packages can give the result directly.
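As a minimal sketch of how such a fit is obtained in practice (assuming Python with numpy, pandas and statsmodels; the data and the variable names x1, x2 and y are made up purely for illustration):

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Made-up data: two independent variables and one dependent variable
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=50)

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2}))   # adds the intercept column
model = sm.OLS(y, X).fit()                                # ordinary least squares estimation
print(model.params)                                       # b0, b1, b2
print(model.summary())                                    # R-squared, adjusted R-squared, F-test, t-tests

The summary output contains the coefficient estimates together with the goodness-of-fit measures and significance tests discussed in the next section.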
===================================================
Tests of the multiple linear regression model
1. Goodness of Fit test
The goodness of fit of a multiple linear regression is also judged by calculating the coefficient of determination; the principle and calculation method are the same as for simple linear regression. Here the multiple correlation coefficient is the index of the correlation between the dependent variable and all of the independent variables in the multiple linear regression equation, and it is calculated as the square root of the coefficient of determination. However, the multiple correlation coefficient increases as the number of independent variables increases, so it is not appropriate to use it alone to judge the merits of the equation. The adjusted multiple correlation coefficient Rad is generally used instead; it is the square root of the adjusted coefficient of determination, calculated as R²adj = 1 - (1 - R²)(n - 1)/(n - m - 1), where n is the sample size and m is the number of independent variables.
The adjusted multiple correlation coefficient increases when a statistically significant independent variable enters the equation and decreases when a non-significant independent variable enters the equation, so Rad can effectively measure the merits of a multiple linear regression equation and can also be used as an indicator for variable screening.
Another indicator is the residual standard error, which measures the precision of the regression equation: the smaller the residual standard error, the closer the estimated values are to the observed values, and vice versa. Its formula is S = sqrt(SSE / (n - m - 1)), where SSE is the residual sum of squares.
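A small numeric sketch of these formulas (the counts and sums of squares below are made-up values, not real data):

import numpy as np

n, m = 50, 2           # sample size and number of independent variables (made up)
ss_total = 120.0       # total sum of squares (made up)
ss_resid = 30.0        # residual sum of squares SSE (made up)

r2 = 1 - ss_resid / ss_total                      # coefficient of determination
r2_adj = 1 - (1 - r2) * (n - 1) / (n - m - 1)     # adjusted R-squared
resid_se = np.sqrt(ss_resid / (n - m - 1))        # residual standard error
print(r2, r2_adj, resid_se)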
2. Test of the significance of the regression equation (F-test)
As with simple linear regression, the significance test of the multiple linear regression equation asks whether the linear regression relationship between the explanatory variables and the explained variable holds overall, that is, whether all of the regression coefficients in the equation are simultaneously 0 at the specified α level. In general, if the equation holds, at least one of the regression coefficients is not 0. The test is again built on the decomposition of the total sum of squares, and the F statistic is calculated analogously to simple linear regression: F = (SSR / m) / (SSE / (n - m - 1)), where SSR is the regression sum of squares and SSE is the residual sum of squares.
For the regression equation y = β0 + β1x1 + β2x2 + ... + βmxm + μ, the hypotheses are:
H0: β1 = β2 = ... = βm = 0
H1: β1, β2, ..., βm are not all 0
Calculate the value of the F statistic and look up the critical value Fα(m, n - m - 1) for significance level α, where n is the sample size and m is the number of independent variables.
If F > Fα, H0 is rejected: the regression coefficients are not all 0 and the regression equation is significant.
If F < Fα, H0 is accepted: the regression coefficients are considered to all be 0 and the regression model is not significant.
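A minimal numeric sketch of this decision rule (made-up sums of squares; scipy is assumed to be available):

from scipy import stats

n, m = 50, 2
ssr, sse = 90.0, 30.0                            # regression and residual sums of squares (made up)
F = (ssr / m) / (sse / (n - m - 1))              # F statistic
F_crit = stats.f.ppf(1 - 0.05, m, n - m - 1)     # critical value F_alpha(m, n-m-1) at alpha = 0.05
p_value = stats.f.sf(F, m, n - m - 1)
print(F, F_crit, p_value)                        # reject H0 if F > F_crit (equivalently p < alpha)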
3. Test of the significance of individual variables (t-test)
After the regression equation passes the significance test, we can conclude that at least one regression coefficient is not 0, i.e. at least one independent variable has a linear relationship with the dependent variable, but this does not mean that every independent variable does. Among the many independent variables, which ones have a linear relationship with the dependent variable? This requires a separate test of each regression coefficient, and independent variables that do not pass the test are removed from the model.
The test method is again to construct a t statistic for a t-test: ti = bi / Sbi
where Sbi is the standard error of the i-th partial regression coefficient and reflects the sampling variability of that regression coefficient.
For the regression equation y = β0 + β1x1 + β2x2 + ... + βmxm + μ, taking the regression coefficient β1 as an example, the hypotheses are:
H0: β1 = 0
H1: β1 ≠ 0
Based on the significance level α, look up the critical value tα/2(n - m - 1).
If |t| > tα/2(n - m - 1), H0 is rejected and the coefficient β1 is considered not equal to 0.
If |t| < tα/2(n - m - 1), H0 is accepted and the coefficient β1 is considered equal to 0.
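Similarly, a minimal numeric sketch of the t-test decision rule (made-up coefficient and standard error; scipy assumed):

from scipy import stats

n, m = 50, 2
b1, se_b1 = 2.0, 0.8                                  # estimated coefficient and its standard error (made up)
t_stat = b1 / se_b1                                   # t statistic for H0: beta1 = 0
t_crit = stats.t.ppf(1 - 0.05 / 2, n - m - 1)         # two-sided critical value t_{alpha/2}(n-m-1)
p_value = 2 * stats.t.sf(abs(t_stat), n - m - 1)
print(t_stat, t_crit, p_value)                        # reject H0 if |t| > t_crit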
====================================================
Screening of independent variables
Multiple linear regression involves multiple independent variables, and not all of them affect the dependent variable, so the problem of selecting independent variables arises: the independent variables that do affect the dependent variable are introduced into the regression model, and the rest are removed. After eliminating an independent variable that fails the test, the regression equation must be refitted on the remaining variables and tested again; if another independent variable fails the test, it is removed and the equation refitted, and so on, until all the independent variables in the model pass the test. This process is called variable screening.
There are two principles to follow in variable screening:
First, avoid omitting important variables as far as possible.
Second, keep the number of independent variables as small as possible so that the model stays parsimonious.
The commonly used variable selection methods are as follows:
1. Forward Selection method
Set an entry criterion (a P value). The regression equation starts with no independent variables. Each of the k candidate variables is fitted in a simple linear regression against the dependent variable, giving k models; among the variables that are statistically significant and meet the entry criterion, the one with the smallest P value (i.e. the largest contribution to the dependent variable), Xi, is entered into the model. With Xi in the model, each of the remaining k - 1 independent variables is added in turn and the resulting models are fitted, and again the variable with the smallest P value (largest contribution) is introduced. This is repeated until no remaining variable is statistically significant and meets the entry criterion. Note that a variable that does not meet the entry criterion at first may come to meet it as more variables enter the equation.
Limitations of the forward selection method:
If the entry criterion (P value) is set too small, important variables may fail to enter the model; if it is set too large, variables that entered early may later become statistically non-significant, yet the forward method cannot remove them.
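A rough sketch of the forward selection loop described above (assuming pandas and statsmodels, and a hypothetical DataFrame data whose columns include the dependent variable; all names are illustrative):

import pandas as pd
import statsmodels.api as sm

def forward_select(data, response, alpha_in=0.05):
    # Start with no independent variables; at each step add the candidate with
    # the smallest P value, as long as it meets the entry criterion alpha_in.
    remaining = [c for c in data.columns if c != response]
    selected = []
    while remaining:
        pvals = {}
        for cand in remaining:
            X = sm.add_constant(data[selected + [cand]])
            fit = sm.OLS(data[response], X).fit()
            pvals[cand] = fit.pvalues[cand]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha_in:        # no candidate meets the entry criterion
            break
        selected.append(best)
        remaining.remove(best)
    return selected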
2. Backward Selection method
Set a removal criterion (a P value). Fit a regression equation with all the independent variables and test it. Among the independent variables that are not statistically significant and meet the removal criterion, remove the one with the largest P value (i.e. the smallest contribution to the dependent variable). After each removal, refit and re-test the regression equation, and repeat until all independent variables remaining in the equation are statistically significant and none meets the removal criterion.
Limitations of the backward selection method:
If the removal criterion (P value) is set too small, a variable removed early cannot be reintroduced into the model even if it would contribute significantly to the dependent variable at a later stage; if the removal criterion is set too large, variables may not be removed effectively.
3. Stepwise regression method
The forward method only adds variables and the backward method only removes them, and both have limitations. The stepwise regression method combines the forward and backward methods and can be divided into forward stepwise regression and backward stepwise regression. Taking forward stepwise regression as an example:
Set an entry criterion and a removal criterion (P values). Starting, as in the forward method, with no independent variables in the equation, fit the k simple linear regression models of each candidate variable against the dependent variable; among the statistically significant variables that meet the entry criterion, introduce the variable Xi with the smallest P value (largest contribution to the dependent variable). With Xi in the model, fit the models obtained by adding each of the remaining k - 1 independent variables in turn, and introduce the variable Xj with the smallest P value (largest contribution). The model now contains the two independent variables Xi and Xj; at this point, as in the backward method, test this model and remove, according to the largest-P-value (smallest contribution) principle, any variable that is not statistically significant and meets the removal criterion; then return to the variable entry stage. Repeating in this way, statistically significant independent variables are introduced into the model until all the variables in the model are statistically significant and no variable outside the model meets the entry criterion.
With the stepwise regression method, each time an independent variable is introduced the whole model is re-tested and independent variables that no longer meet the requirements are removed, so variables can both enter and leave the model.
The limitation of the stepwise regression method is that entry and removal are based only on the preset P values, which sometimes departs from practical considerations.
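Extending the forward sketch above with a removal step gives a rough stepwise procedure (same hypothetical setup; a sketch of the idea, not a production implementation):

import statsmodels.api as sm

def stepwise_select(data, response, alpha_in=0.05, alpha_out=0.10):
    remaining = [c for c in data.columns if c != response]
    selected = []
    while True:
        changed = False
        # Entry step: add the candidate with the smallest P value, if it qualifies
        pvals = {c: sm.OLS(data[response], sm.add_constant(data[selected + [c]])).fit().pvalues[c]
                 for c in remaining}
        if pvals:
            best = min(pvals, key=pvals.get)
            if pvals[best] < alpha_in:
                selected.append(best)
                remaining.remove(best)
                changed = True
        # Removal step: drop the variable with the largest P value if it exceeds alpha_out
        if selected:
            fit = sm.OLS(data[response], sm.add_constant(data[selected])).fit()
            worst = fit.pvalues.drop("const").idxmax()
            if fit.pvalues[worst] > alpha_out:
                selected.remove(worst)
                remaining.append(worst)
                changed = True
        if not changed:
            return selected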
4. Comprehensive analysis method (best subset method)
As the name implies, the best model is chosen from among all possible combinations of variables. Assuming there are k variables and each variable has two states (kept or removed), there are 2^k - 1 candidate equations in total, from which the one with the smallest residual standard error is chosen. This method is the most accurate, but its disadvantage is that the amount of computation is too large.
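A brute-force sketch of the best subset idea (same hypothetical DataFrame; it enumerates all 2^k - 1 non-empty subsets and keeps the one with the smallest residual standard error):

from itertools import combinations
import numpy as np
import statsmodels.api as sm

def best_subset(data, response):
    candidates = [c for c in data.columns if c != response]
    best, best_s = None, np.inf
    for r in range(1, len(candidates) + 1):
        for subset in combinations(candidates, r):
            fit = sm.OLS(data[response], sm.add_constant(data[list(subset)])).fit()
            s = np.sqrt(fit.mse_resid)      # residual standard error of this candidate equation
            if s < best_s:
                best, best_s = subset, s
    return best, best_s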
5. Other methods
In addition to the above methods, there are the maximum R² increment method, the minimum R² increment method, the R² selection method, the adjusted R² selection method, and Mallow's Cp selection method.
In summary, there are many ways to select independent variables, but the optimal regression model finally chosen usually conforms to the following principles:
1. The regression model as a whole is statistically significant.
2. The hypothesis tests of the parameter estimates in the regression model are statistically significant.
3. The signs (positive or negative) of the parameters in the regression model agree with the substantive meaning of the variables.
4. The predicted values of the dependent variable calculated from the regression model make sense in the subject-matter context.
5. If several regression models perform comparably well, prefer the one with the smaller residual sum of squares and fewer variables.
==================================================================
Multicollinearity problem
A problem encountered in multiple linear regression is that there can be correlation among the independent variables. This makes the estimates of the regression coefficients unstable, reduces the accuracy of the predicted values, and may even keep important variables out of the model.
The causes of multicollinearity may be:
1. Unreasonable research design
2. Problems with data collection
3. Correlation among the independent variables themselves
4. Anomalies in the data
5. A small sample size with many variables
There are several ways to judge multicollinearity:
1. Correlation coefficient
Calculate the correlation coefficient for each pair of independent variables; if the correlation coefficients are below 0.7, there is generally no serious problem.
2. Tolerance
The closer the tolerance is to 0, the greater the collinearity.
3. Variance inflation factor (VIF)
The VIF is the reciprocal of the tolerance; the larger the VIF, the more severe the collinearity.
4. Eigenvalue
When an eigenvalue tends to 0, there is collinearity among the independent variables.
5. Condition index (CI)
The square root of the ratio of the largest eigenvalue to each of the remaining eigenvalues is called the condition index.
A condition index between 10 and 30 indicates moderate collinearity; a condition index greater than 30 indicates serious multicollinearity.
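A small sketch of these diagnostics (numpy, pandas and statsmodels assumed; the two deliberately correlated variables are made up):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.2, size=100)     # deliberately correlated with x1
X = pd.DataFrame({"x1": x1, "x2": x2})

print(X.corr())                                     # pairwise correlation coefficients
Xc = sm.add_constant(X)
vif = [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])]
print(vif)                                          # VIF for x1 and x2 (tolerance = 1 / VIF)

eigvals = np.linalg.eigvalsh(np.corrcoef(X.values, rowvar=False))
print(np.sqrt(eigvals.max() / eigvals))             # condition indices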
Solving the multicollinearity problem
1. Variable screening
Collinearity can be avoided to some extent through variable screening.
2. Using ridge regression or principal component regression analysis
Sometimes multicollinearity is found among the independent variables, but it is not appropriate to remove any of them; in this case it is not advisable to use ordinary least squares to estimate the model, and ridge regression or principal component regression can be used instead (both yield biased estimates). A short ridge regression sketch follows this list.
3. Increase the sample size
Increasing the sample size can improve estimation precision and alleviate multicollinearity to a certain extent.
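As mentioned in point 2 above, a minimal ridge regression sketch (scikit-learn assumed; the data and the penalty value alpha are made up):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=100)    # strongly collinear with x1
y = 3.0 + 1.5 * x1 + 0.5 * x2 + rng.normal(size=100)
X = np.column_stack([x1, x2])

ridge = Ridge(alpha=1.0)       # the L2 penalty shrinks and stabilizes the coefficients
ridge.fit(X, y)
print(ridge.intercept_, ridge.coef_)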
========================================================
Analysis steps for multiple linear regression
1. Investigate the correlation between variables
Multiple linear regression requires a linear relationship between the independent variables and the dependent variable, so the first step is to determine the relationships among the variables. A matrix scatter plot can be used for this. At the same time, the matrix scatter plot can also reveal outliers, which have a large impact on the parameter estimates of multiple linear regression and need to be found and handled.
2. Examine the data distribution
Investigate issues such as the normality of the data and the homogeneity of variance; if they are not satisfied, transform the data. Note that data transformation changes the relationships among the variables, so these relationships need to be re-examined afterwards.
3. Preliminary modeling
Build a preliminary model from the data, including variable screening.
4. Model diagnosis, residual analysis
After modeling, the model needs to be tested and calibrated to ensure that the best model is fitted, i.e. significance tests and residual analysis. Residual analysis mainly checks the linearity, independence, normality, and homogeneity of variance of the residuals. Independence is generally judged with the Durbin-Watson test for serial correlation of the residuals: the statistic takes values between 0 and 4; if the result is around 2, the residuals can be judged independent, while values close to 0 or 4 suggest the residuals may be correlated. Normality can be judged with a histogram of the residuals. Linearity and homogeneity of variance can be judged with residual plots: a residual plot takes an independent variable of the regression equation (or the fitted value of the dependent variable) as the horizontal axis and the residual as the vertical axis, plotting the residual of each observation as a point in the plane. If the residuals are evenly distributed in a horizontal band centered on 0, the requirements of linearity and homogeneity of variance can be considered basically satisfied.
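A small residual-diagnostics sketch along these lines (statsmodels and matplotlib assumed; the data are made up):

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 2))
y = 1.0 + X @ np.array([2.0, -0.5]) + rng.normal(scale=0.4, size=80)
fit = sm.OLS(y, sm.add_constant(X)).fit()

resid = fit.resid
print(durbin_watson(resid))              # close to 2 suggests independent residuals

plt.scatter(fit.fittedvalues, resid)     # residual plot: fitted values vs residuals
plt.axhline(0)                           # points should form an even horizontal band around 0
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()

plt.hist(resid)                          # histogram to check normality of residuals
plt.show()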