1 Multivariate linear regression model
1.1 The multivariate regression model and the regression equation
Multivariate regression model:
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon
Multivariate regression equation:
E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k
1.2 The estimated multivariate regression equation
Estimated multivariate regression equation:
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \dots + \hat{\beta}_k x_k
1.3 Least squares estimation of the parameters
The parameters of the regression equation are obtained by the least squares method, i.e. by minimizing the residual sum of squares:
Q = \sum (y_i - \hat{y}_i)^2 = \sum (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \dots - \hat{\beta}_k x_{ik})^2
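As a minimal sketch of this estimation, the following fits a two-predictor model by least squares with NumPy; the data are fabricated for illustration and nothing here comes from the text:

```python
import numpy as np

# Fabricated data: n observations of two predictors and a response.
rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 2))                    # columns are x1, x2
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)

# Prepend the intercept column and solve min_b ||y - Xb||^2.
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

residuals = y - X1 @ beta_hat
Q = np.sum(residuals ** 2)                     # the minimized residual sum of squares
print(beta_hat, Q)
```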
2 Goodness of fit of the regression equation
2.1 The multiple coefficient of determination
The multiple coefficient of determination is computed as in simple regression, as the proportion of the total sum of squares accounted for by the regression sum of squares. However, as the number of independent variables increases, the residual sum of squares SSE decreases and R^2 grows: adding an independent variable to the model makes R^2 larger even when that variable is not statistically significant. It is therefore proposed to adjust R^2 using the sample size n and the number of independent variables k, giving the adjusted multiple coefficient of determination:
R_a^2 = 1 - (1 - R^2)\frac{n-1}{n-k-1}
The square root of R^2 is called the multiple correlation coefficient (also the complex correlation coefficient); it measures the strength of the relationship between the dependent variable and the k independent variables.
2.2 The standard error of estimate
As in simple regression, the standard error of estimate in multivariate regression is an estimate of the standard deviation of the error term \varepsilon (its square estimates the variance \sigma^2). The formula is:
s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-k-1}} = \sqrt{\frac{SSE}{n-k-1}} = \sqrt{MSE}
Meaning: the average error made when predicting the dependent variable y from the independent variables x_1, x_2, \dots, x_k.
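A minimal sketch tying these goodness-of-fit quantities together; the observed and fitted values below are made up, and the helper name fit_statistics is hypothetical:

```python
import numpy as np

def fit_statistics(y, y_hat, k):
    """SSE, R^2, adjusted R^2 and the standard error of estimate
    for a model with k independent variables."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)             # residual sum of squares
    sst = np.sum((y - y.mean()) ** 2)          # total sum of squares
    r2 = 1.0 - sse / sst
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    s_e = np.sqrt(sse / (n - k - 1))           # sqrt(MSE)
    return sse, r2, r2_adj, s_e

# Toy observed and fitted values, purely illustrative.
y     = [3.1, 4.0, 5.2, 6.1, 7.0, 7.8]
y_hat = [3.0, 4.2, 5.0, 6.2, 6.9, 8.0]
print(fit_statistics(y, y_hat, k=2))
```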
3 Significance tests
3.1 Test of the linear relationship
This test checks the linear relationship between the dependent variable and the set of independent variables as a whole; among the k independent variables, it is significant as long as at least one independent variable is significantly related to the dependent variable. The steps are:
(1) State the hypotheses
H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0
H_1: \text{at least one of } \beta_1, \beta_2, \dots, \beta_k \text{ is not } 0
(2) Compute the test statistic
F = \frac{SSR/k}{SSE/(n-k-1)} \sim F(k, n-k-1)
(3) Make the statistical decision
If F > F_\alpha, reject the null hypothesis.
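A sketch of the F test under the assumption that fitted values are already available; the overall_f_test name and the toy numbers are mine, not the text's:

```python
import numpy as np
from scipy import stats

def overall_f_test(y, y_hat, k):
    """F statistic and p-value for the overall linear-relationship test."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)             # residual sum of squares
    ssr = np.sum((y_hat - y.mean()) ** 2)      # regression sum of squares
    f = (ssr / k) / (sse / (n - k - 1))
    p_value = stats.f.sf(f, k, n - k - 1)      # P(F(k, n-k-1) > f)
    return f, p_value

# Same toy fitted values as above, purely illustrative.
y     = [3.1, 4.0, 5.2, 6.1, 7.0, 7.8]
y_hat = [3.0, 4.2, 5.0, 6.2, 6.9, 8.0]
print(overall_f_test(y, y_hat, k=2))
```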
3.2 Tests and inference for the regression coefficients
The method is similar to simple regression:
(1) State the hypotheses. For any parameter \beta_i (i = 1, 2, \dots, k):
H_0: \beta_i = 0 \qquad H_1: \beta_i \neq 0
(2) Compute the test statistic
t_i = \frac{\hat{\beta}_i}{s_{\hat{\beta}_i}} \sim t(n-k-1)
(3) Make the statistical decision
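A sketch of the coefficient t tests, computing each s_{\hat{\beta}_i} from the diagonal of MSE \cdot (X'X)^{-1}; the data are fabricated for illustration:

```python
import numpy as np
from scipy import stats

# Fabricated design matrix with an intercept column; beta2 is truly 0.
rng = np.random.default_rng(1)
n, k = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(scale=0.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
mse = np.sum(resid ** 2) / (n - k - 1)

# Var(beta_hat) = MSE * (X'X)^{-1}; its diagonal gives s_{beta_i}^2.
se_beta = np.sqrt(mse * np.diag(np.linalg.inv(X.T @ X)))
t = beta_hat / se_beta
p = 2 * stats.t.sf(np.abs(t), n - k - 1)       # two-sided p-values
print(np.round(t, 2), np.round(p, 4))
```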
4 Multicollinearity
4.1 Multicollinearity and the problems it causes
When two or more of the independent variables in the regression model are correlated with each other, the regression model is said to exhibit multicollinearity.
Problems caused by multicollinearity:
(1) When the independent variables are highly correlated, the regression results can be confusing and may even make the analysis misleading.
(2) Multicollinearity can affect the signs of the parameter estimates.
4.2 Detecting multicollinearity
Multicollinearity may be present if any of the following holds (a quick numerical check is sketched after the list):
(1) There is significant correlation between independent variables in the model.
(2) The overall linear relationship test (F test) for the model is significant, yet the t tests of almost all individual regression coefficients \beta_i are not.
(3) The signs of the regression coefficients are the opposite of what is expected.
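A minimal check of symptom (1) via the pairwise correlation matrix of the independent variables; the data are fabricated so that x3 is nearly collinear with x1:

```python
import numpy as np

# Fabricated predictors: x3 is almost a linear function of x1.
rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.95 * x1 + rng.normal(scale=0.1, size=n)
X = np.column_stack([x1, x2, x3])

# Off-diagonal entries close to +-1 (here corr(x1, x3)) signal
# possible multicollinearity.
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))
```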
4.3 Handling the multicollinearity problem
Choose a method appropriate to the severity of the multicollinearity:
(1) Remove one or more of the correlated independent variables from the model, so that the retained independent variables are as uncorrelated as possible.
(2) If all the independent variables are to be kept in the model, then:
A. Avoid testing individual parameters \beta with t statistics.
B. Restrict inference about the value of y (estimation or prediction) to the range of the sample values of the independent variables.
5 Using the regression equation for prediction
Prediction from a multivariate regression equation relies mostly on statistical software.
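The mechanics reduce to plugging new x-values into the estimated equation, as in this sketch; the coefficients are hypothetical stand-ins for software output, and per point B above the new x-values should stay within the sample range:

```python
import numpy as np

# Hypothetical estimated coefficients b0, b1, b2 from a fitted model.
beta_hat = np.array([1.2, 2.0, -0.5])

# New observations: intercept, x1, x2 (kept within the sample range).
x_new = np.array([[1.0, 0.8, 1.5],
                  [1.0, 1.1, 0.3]])
y_pred = x_new @ beta_hat                      # point predictions of y
print(y_pred)
```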
6 Variable selection and stepwise regression
If the collected variables can be screened before the model is built, removing the unnecessary independent variables, the model is not only easier to build but also easier to apply and interpret.
6.1 The variable selection process
Variable selection is usually based on statistical significance tests: the criterion is whether introducing one or more independent variables into the regression model significantly reduces the residual sum of squares (SSE). Whether the reduction in SSE is significant is judged with an F statistic.
The main variable selection methods are forward selection, backward elimination, stepwise regression, and best-subset selection.
6.2 Forward selection
Forward selection starts from a model containing no independent variables and then adds variables using the following steps (a sketch follows the list).
(1) Fit a separate simple linear regression of the dependent variable y on each of the k independent variables, find the model with the largest F statistic, and introduce its independent variable x_i into the model.
(2) With x_i in the model, consider the k-1 two-variable linear regression models formed by adding each remaining variable, choose the best one, and repeat until none of the remaining independent variables is statistically significant.
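A sketch of forward selection based on the partial F statistic; the helper names sse_of_fit and forward_select are mine, and the significance threshold alpha is an assumed convention:

```python
import numpy as np
from scipy import stats

def sse_of_fit(X, y):
    """Residual sum of squares of an OLS fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.sum((y - X1 @ beta) ** 2)

def forward_select(X, y, alpha=0.05):
    """Add, at each step, the variable with the largest partial F;
    stop when the best candidate is no longer significant."""
    n, k = X.shape
    selected, remaining = [], list(range(k))
    sse_cur = np.sum((y - y.mean()) ** 2)      # SSE of the empty model
    while remaining:
        df = n - len(selected) - 2             # error df after adding one variable
        # Partial F for adding variable j to the current model.
        f_of = {j: (sse_cur - sse_of_fit(X[:, selected + [j]], y))
                   / (sse_of_fit(X[:, selected + [j]], y) / df)
                for j in remaining}
        j_best = max(f_of, key=f_of.get)
        if stats.f.sf(f_of[j_best], 1, df) > alpha:
            break                              # best candidate not significant: stop
        selected.append(j_best)
        remaining.remove(j_best)
        sse_cur = sse_of_fit(X[:, selected], y)
    return selected

# Fabricated example: only x0 and x2 truly matter.
rng = np.random.default_rng(3)
X = rng.normal(size=(80, 4))
y = 2.0 * X[:, 0] - X[:, 2] + rng.normal(scale=0.5, size=80)
print(forward_select(X, y))                    # expected: [0, 2] (in entry order)
```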
6.3 Backward elimination
Backward elimination works in the opposite direction to forward selection. The basic procedure is:
(1) Fit the linear regression model containing all k independent variables, then consider the models obtained by removing one independent variable at a time, and eliminate from the model the variable whose removal increases SSE the least.
(2) Refit with the remaining variables and repeat step (1), stopping when eliminating any single variable would increase SSE significantly.
6.4 Stepwise regression
Stepwise regression combines forward selection and backward elimination: it keeps adding variables as in forward selection, but after each addition reconsiders the variables added earlier for possible removal, stopping when adding a variable no longer reduces SSE significantly. A sketch follows.
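A stepwise sketch built on the same partial F idea as the forward-selection code above; the function names and the alpha_in/alpha_out thresholds are mine (with alpha_out > alpha_in so a just-added variable is not immediately removed):

```python
import numpy as np
from scipy import stats

def sse_of_fit(X, y):
    """Residual sum of squares of an OLS fit with an intercept
    (same helper as in the forward-selection sketch)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.sum((y - X1 @ beta) ** 2)

def partial_f(sse_small, sse_big, df):
    """Partial F statistic for adding or removing one variable."""
    return (sse_small - sse_big) / (sse_big / df)

def stepwise(X, y, alpha_in=0.05, alpha_out=0.10):
    n, k = X.shape
    selected = []
    while True:
        changed = False
        # Forward step: add the most significant remaining variable, if any.
        remaining = [j for j in range(k) if j not in selected]
        if remaining:
            sse_cur = (sse_of_fit(X[:, selected], y) if selected
                       else np.sum((y - y.mean()) ** 2))
            df = n - len(selected) - 2
            f_of = {j: partial_f(sse_cur, sse_of_fit(X[:, selected + [j]], y), df)
                    for j in remaining}
            j_best = max(f_of, key=f_of.get)
            if stats.f.sf(f_of[j_best], 1, df) < alpha_in:
                selected.append(j_best)
                changed = True
        # Backward step: drop any earlier variable that is no longer significant.
        for j in list(selected):
            rest = [m for m in selected if m != j]
            df = n - len(selected) - 1
            f = partial_f(sse_of_fit(X[:, rest], y), sse_of_fit(X[:, selected], y), df)
            if stats.f.sf(f, 1, df) > alpha_out:
                selected.remove(j)
                changed = True
        if not changed:
            return selected

# Same fabricated data as the forward-selection example.
rng = np.random.default_rng(3)
X = rng.normal(size=(80, 4))
y = 2.0 * X[:, 0] - X[:, 2] + rng.normal(scale=0.5, size=80)
print(stepwise(X, y))
```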