Regression uses a sample (known data) to produce a fitted equation for predicting unknown data.
Uses: prediction, and judging whether a proposed relationship is reasonable.
Difficulties: ① selecting the variables (in the multivariate case); ② avoiding multicollinearity; ③ examining the fitted equation and avoiding overfitting; ④ testing the validity of the model.
The relationship between the dependent and independent variables is either: ① a correlation (a non-deterministic relationship, e.g. the correlation between physics and chemistry scores), whose linear strength is measured by the correlation coefficient; or ② a functional relationship (a deterministic relationship).
How the correlation coefficient is computed: use the Pearson product-moment sample correlation coefficient.
Note: if the sample consists of paired ordinal (ranked) data, use the Spearman rank correlation coefficient instead (also called rank correlation).
In its formula, the raw values of each variable are replaced by their ranks (obtained by sorting from smallest to largest).
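The two coefficients just described can be sketched in pure Python; the sample data are made up for illustration, and this simple rank computation does not handle ties:

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(v):
    """Rank of each value, 1 = smallest (ties not handled in this sketch)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]        # monotone but nonlinear in x
print(pearson(x, y))         # below 1: the linear fit is imperfect
print(spearman(x, y))        # exactly 1: the rank orderings agree
```

The example shows why the rank version suits ordinal data: a monotone but nonlinear relationship gives Spearman correlation 1 while Pearson correlation falls short of 1.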
The application of least squares in linear regression
To judge how well a line fits, one might measure the perpendicular distance from each point to the line (the point-to-line distance formula from analytic geometry), but that involves a square root, which makes the extremum hard to find. Instead, use the vertical distance from the point to the line (the residual), and square it rather than take its absolute value.
The sum of the squared regression errors (residuals) is the residual sum of squares (RSS), the "least squares" criterion.
Choosing the coefficients that minimize RSS is called the method of least squares.
Setting the partial derivatives of RSS to zero yields a system of two equations in two unknowns; solving it gives the estimates of a and b.
Note: regression is good at interpolation but poor at extrapolation; when using a regression model for prediction, stay within the range of x values covered by the sample.
(1) Multivariate linear regression model
① the coefficient of determination R² (the degree to which the model explains the sample data);
② the test statistics for the regression coefficients (significance of the individual variables);
③ the test statistic for the goodness of fit of the linear regression equation (overall model fit);
④ for simple (one-variable) linear regression, the sample Pearson correlation coefficient (whose square equals R²).
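Point ④ can be checked numerically: in a one-variable fit, R² equals the square of the sample Pearson correlation coefficient. A minimal sketch with hypothetical data:

```python
from math import sqrt

def r_squared(x, y):
    """Coefficient of determination of the least-squares line:
    1 - RSS/TSS."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - my) ** 2 for yi in y)
    return 1 - rss / tss

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sqrt(
        sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y))

x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.8, 8.3, 9.8]
# The two quantities agree up to floating-point rounding.
print(r_squared(x, y), pearson(x, y) ** 2)
```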
(2) Multivariate linear regression model with virtual variables
Directly coding yellow, white, and black as the values 1, 2, 3 of a single variable is wrong, since it imposes an artificial ordering and spacing on the categories; dummy (0/1) variables must be used instead.
The dummy variables act here by adjusting the intercept.
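A sketch of dummy coding for the three categories above, using a hypothetical `dummy_encode` helper: one category serves as the baseline absorbed by the intercept, so three categories need only two 0/1 columns.

```python
def dummy_encode(values, categories):
    """One category (the first) is the baseline; each remaining category
    gets its own 0/1 indicator column. In a regression, these dummies
    shift the intercept for the corresponding group."""
    non_baseline = categories[1:]
    return [[1 if v == c else 0 for c in non_baseline] for v in values]

races = ["yellow", "white", "black", "white"]
codes = dummy_encode(races, ["yellow", "white", "black"])
print(codes)   # [[0, 0], [1, 0], [0, 1], [1, 0]]
```

Using all three indicator columns together with an intercept would itself create perfect collinearity (the "dummy variable trap"), which is why one category is dropped.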
(3) stepwise regression
Forward introduction: start from a one-variable regression and add one variable at a time, so that the criterion value moves toward its optimum;
Backward elimination: start from the regression on all variables and delete one variable at a time, likewise improving the criterion value;
Stepwise selection: combine forward introduction and backward elimination, considering both at every step.
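The forward-introduction idea can be sketched in pure Python, here using the residual sum of squares (RSS) as the selection criterion; the equation solver, the data, and the stopping tolerance are all illustrative assumptions, not part of the notes:

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def rss(columns, y):
    """Residual sum of squares of OLS on the given predictor columns
    (an intercept column of ones is always included)."""
    X = [[1.0] + [c[i] for c in columns] for i in range(len(y))]
    p = len(X[0])
    XtX = [[sum(r[a] * r[b] for r in X) for b in range(p)] for a in range(p)]
    Xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(p)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(bj * xj for bj, xj in zip(beta, r))) ** 2
               for r, yi in zip(X, y))

def forward_select(cols, y, tol=1e-6):
    """Greedy forward introduction: repeatedly add the variable that most
    reduces RSS; stop when the best improvement falls below tol."""
    chosen, best = [], rss([], y)
    while True:
        cand = []
        for j in range(len(cols)):
            if j not in chosen:
                cand.append((rss([cols[k] for k in chosen + [j]], y), j))
        if not cand:
            return chosen
        new, j = min(cand)
        if best - new < tol:
            return chosen
        chosen, best = chosen + [j], new

x0 = [1, 2, 3, 4, 5, 6, 7, 8]
x1 = [3, 1, 4, 1, 5, 9, 2, 6]                  # an unrelated noise variable
x2 = [1, 0, 1, 0, 1, 0, 1, 0]
y = [2 * a + 3 * b for a, b in zip(x0, x2)]    # truth uses x0 and x2 only
print(forward_select([x0, x1, x2], y))
```

On this synthetic data the procedure introduces variable 0 first (largest RSS drop), then variable 2, and then stops because adding the noise variable no longer improves RSS.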
(4) Regression diagnostics
① whether the sample satisfies the normality assumption; if it does not, hypothesis tests and interval predictions cannot be carried out, since many testing and prediction methods rest on the assumption of normality;
② whether outliers (caused, for example, by data-entry errors) make the model produce large errors;
③ whether the linear model itself is reasonable;
④ whether the errors satisfy the assumptions of independence, equal variance, and normality, i.e. the error does not change as Y changes and the error term is unaffected by Y;
⑤ whether multicollinearity exists, which drives the determinant of the matrix toward 0, so that its inverse blows up and the coefficients of the multivariate regression model become inflated.
The corresponding remedies:
① goodness-of-fit test, chi-square statistic;
② inspection of scatter plots, etc.;
③ checking whether the test statistics are reasonable;
④ checking whether the residual plots are reasonable;
⑤ stepwise regression as a way to resolve multicollinearity.
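Two algebraic identities make residual plots (remedy ④) easy to sanity-check: OLS residuals always sum to zero and are uncorrelated with the fitted values by construction, so a residual plot is really probing the remaining assumptions, namely independence, equal variance, and normality. A sketch with made-up data:

```python
def fit_line(x, y):
    """Closed-form least-squares fit of y ≈ a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

x = [1, 2, 3, 4, 5, 6]
y = [1.2, 2.1, 2.9, 4.2, 4.8, 6.1]
a, b = fit_line(x, y)
fitted = [a + b * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]

# Two properties that hold automatically for OLS residuals:
print(abs(sum(resid)) < 1e-9)                                 # sum is 0
print(abs(sum(r * f for r, f in zip(resid, fitted))) < 1e-9)  # ⟂ fitted
# What is NOT guaranteed by construction (independence, equal variance,
# normality) is exactly what the residual plot is inspected for.
```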
(5) Multicollinearity
If multicollinearity is present, at least one eigenvalue is approximately 0 (the variables are first centered and standardized, and the resulting matrix is recorded).
As a consequence, under multicollinearity the normal equations either cannot be solved or the solution is unstable.
The model becomes unstable (low robustness): a small change in the data produces a very large change in the results; for example, the fitted coefficients turn out to be huge (in the millions or tens of millions), and their signs frequently flip.
(Note: a matrix can be singular for two reasons: ① the number of variables exceeds the number of samples; ② multicollinearity.)
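The near-zero-eigenvalue symptom is easiest to see in the two-variable case: the correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + r and 1 - r and determinant 1 - r², so as two predictors become collinear (r → 1), one eigenvalue and the determinant both approach 0. A sketch with hypothetical, nearly proportional predictors:

```python
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sqrt(
        sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

x1 = [1, 2, 3, 4, 5]
x2 = [2.0, 4.1, 6.0, 8.1, 10.0]   # almost exactly 2 * x1
r = pearson(x1, x2)
# Correlation matrix [[1, r], [r, 1]]: eigenvalues 1 + r and 1 - r,
# determinant 1 - r**2.
print(1 - r)       # near 0: the small eigenvalue signals collinearity
print(1 - r * r)   # near 0: the determinant, so the inverse blows up
```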
Metrics for multicollinearity
How to identify which variables are involved in the collinearity
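One widely used metric (an addition here, not spelled out in the notes) is the variance inflation factor, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing variable x_j on the remaining predictors; values above roughly 10 are commonly read as problematic. With exactly two predictors, R_j² reduces to their squared Pearson correlation, which keeps the sketch short:

```python
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sqrt(
        sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def vif_two_predictors(x1, x2):
    """VIF_j = 1 / (1 - R_j^2); with exactly two predictors, R_j^2 is
    the squared Pearson correlation between them (same for both)."""
    r2 = pearson(x1, x2) ** 2
    return 1 / (1 - r2)

x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]   # nearly 2 * x1
print(vif_two_predictors(x1, x2))        # far above the rule-of-thumb 10
```

With more than two predictors, each R_j² requires a full auxiliary regression of x_j on the others; this pairwise shortcut is only valid in the two-predictor case.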
Linear regression algorithm