Linear regression algorithm

Source: Internet
Author: User

Regression uses a sample (known data) to produce a fitted equation that predicts unknown data.

Uses: prediction, and judging whether a supposed relationship is reasonable.

Difficulties: ① selecting the variables (in the multivariate case); ② avoiding multicollinearity; ③ examining the fitted equation and avoiding overfitting; ④ testing the reasonableness of the model.

The relationship between the dependent variable and the independent variables is one of two kinds: ① a correlation (a non-deterministic relationship, such as the correlation between physics and chemistry scores), whose strength of linear association is measured by a correlation coefficient; ② a functional (deterministic) relationship.

Computing the correlation coefficient: the Pearson sample product-moment correlation coefficient.
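A minimal pure-Python sketch of the Pearson product-moment coefficient for paired numeric samples; the physics/chemistry scores below are illustrative values, not data from the original notes:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length samples:
    r = sum((x - mx)(y - my)) / sqrt(sum((x - mx)^2) * sum((y - my)^2))."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Illustrative example: strongly correlated physics and chemistry scores.
physics = [70, 75, 80, 85, 90]
chemistry = [68, 72, 79, 86, 91]
r = pearson_r(physics, chemistry)  # close to 1: strong positive linear association
```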


Note: if the sample consists of paired ordinal (rank) data, use the Spearman rank correlation coefficient instead.


In its formula, the two symbols denote the ranks of the two samples (ordered from smallest to largest).
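A sketch of the Spearman coefficient via the closed form ρ = 1 − 6·Σd² / (n(n² − 1)), where d is the difference of the two ranks for each pair; this closed form assumes neither sample has ties (function names are illustrative):

```python
def rank_simple(values):
    """1-based ranks from smallest to largest; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: rho = 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_simple(x), rank_simple(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

Because only ranks enter the formula, any strictly increasing relationship (even a nonlinear one) yields ρ = 1.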

The application of least squares in linear regression

Judging the degree of linear fit: measuring fit by the perpendicular distance from each point to the line (the point-to-line distance formula of analytic geometry) involves a square root, which makes finding the extremum awkward. Instead we use the length of the vertical segment from each point to the line, i.e. the absolute value of the residual.

Summing the squares of these residuals gives the residual sum of squares (RSS).

Choosing the coefficients that minimize RSS is called the method of least squares.

Setting the two partial derivatives of RSS to zero and solving the resulting system of equations in the two unknowns gives the estimates of a and b.
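The system of two equations has a well-known closed-form solution, which a short sketch can implement directly (the sample values are illustrative):

```python
def fit_line(x, y):
    """Ordinary least squares for the line y = a + b*x.

    Setting the two partial derivatives of RSS(a, b) = sum((y - a - b*x)^2)
    to zero gives the normal equations, whose closed-form solution is
        b = sum((x - mx) * (y - my)) / sum((x - mx)^2),   a = my - b * mx
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

# Illustrative sample: y is roughly 2x, so the slope estimate lands near 2.
a, b = fit_line([1, 2, 3, 4], [2.1, 3.9, 6.0, 8.1])
```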

Note: regression is good at interpolation but poor at extrapolation; when using a regression model for prediction, stay within the range of x values covered by the sample.


(1) The multivariate linear regression model

① the coefficient of determination R² (the degree to which the model explains the sample data)


② test statistics for the regression coefficients (significance of individual variables)

③ test statistic for the goodness of fit of the linear regression equation (overall model fit)

④ for simple (one-variable) linear regression, the sample Pearson correlation coefficient
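Point ① can be sketched in a few lines for the one-variable case (where, per point ④, R² equals the squared sample Pearson coefficient); `a` and `b` below stand for an already-fitted intercept and slope:

```python
def r_squared(x, y, a, b):
    """Coefficient of determination of the fitted line y ~ a + b*x:
    R^2 = 1 - RSS/TSS, the share of the variation in y the model explains."""
    my = sum(y) / len(y)
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - my) ** 2 for yi in y)
    return 1 - rss / tss

# A perfect fit gives R^2 = 1; a useless fit gives R^2 near 0.
```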

(2) Multivariate linear regression with dummy variables


Directly coding yellow, white, and black as the values 1, 2, 3 of a single numeric variable is wrong; dummy (0/1) indicator variables should be used instead.

Here the dummy variables act by adjusting the intercept.
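A sketch of such an encoding for the three-level color variable, assuming yellow is taken as the baseline level (the level names and baseline choice are illustrative): a k-level categorical variable becomes k − 1 dummy columns, and each dummy shifts only the intercept of the regression.

```python
COLORS = ("yellow", "white", "black")  # illustrative 3-level categorical variable

def dummy_encode(colors, baseline="yellow"):
    """Encode the 3 levels with 2 dummy (0/1) columns; the baseline level
    is all zeros, and each dummy shifts the regression intercept."""
    levels = [c for c in COLORS if c != baseline]
    return [[1 if c == lvl else 0 for lvl in levels] for c in colors]

# yellow -> [0, 0] (baseline), white -> [1, 0], black -> [0, 1]
rows = dummy_encode(["yellow", "white", "black"])
```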

(3) Stepwise regression

Forward introduction: start from a one-variable regression and add variables one at a time so that the criterion value reaches its optimum;

Backward elimination: start from the regression on all variables and delete one variable at a time so that the criterion value reaches its optimum;

Stepwise filtering: combine forward introduction and backward elimination.
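A sketch of the forward introduction method using NumPy's least-squares solver, with the reduction in RSS as an illustrative criterion (real implementations typically use an F-test, AIC, or BIC instead):

```python
import numpy as np

def rss(cols, y):
    """Residual sum of squares of an OLS fit of y on an intercept plus
    the given list of column vectors."""
    A = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def forward_select(X, y, tol=1e-6):
    """Forward introduction: start from the intercept-only model and greedily
    add whichever variable reduces RSS the most, stopping when no candidate
    improves the criterion by more than tol."""
    chosen, remaining = [], list(range(X.shape[1]))
    best = rss([], y)
    while remaining:
        score, j = min((rss([X[:, k] for k in chosen + [j]], y), j)
                       for j in remaining)
        if best - score < tol:
            break
        chosen.append(j)
        remaining.remove(j)
        best = score
    return chosen
```

Backward elimination is symmetric: start from all columns and repeatedly drop the one whose removal worsens the criterion the least.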

(4) Regression diagnostics

① whether the sample conforms to the normality assumption; if it does not, hypothesis tests and interval predictions cannot be carried out, because many testing and prediction methods are based on the assumption of a normal distribution;

② whether there are outliers that cause the model to produce large errors, e.g. from data-entry mistakes;

③ whether the linear model itself is reasonable;

④ whether the errors satisfy the assumptions of independence, equal variance, and normality, i.e. the error does not change as Y changes and the error term is unaffected by Y;

⑤ whether multicollinearity is present, which causes the matrix determinant to approach 0, the matrix inverse to blow up, and the coefficients of the multivariate regression model to become very large.

The corresponding remedies:

① goodness-of-fit test, chi-square statistic;

② inspect scatter plots, etc.;

③ check whether the test statistics are reasonable;

④ check whether the residual plot is reasonable;

⑤ use stepwise regression to resolve multicollinearity.
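As a rough numeric complement to remedy ④ when no plot is at hand, one can inspect the residuals' mean and lag-1 autocorrelation; this is a sketch, not a formal test, and assumes an already-fitted intercept `a` and slope `b`:

```python
def residual_checks(x, y, a, b):
    """For the fitted line y ~ a + b*x, report the mean residual (should be
    ~0 when the line was fitted by least squares with an intercept) and the
    lag-1 autocorrelation of the residuals (values far from 0 hint that the
    independence assumption is violated)."""
    r = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    mean = sum(r) / len(r)
    num = sum(r[i] * r[i - 1] for i in range(1, len(r)))
    den = sum(ri * ri for ri in r)
    lag1 = num / den if den else 0.0
    return mean, lag1
```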

(5) Multicollinearity

If multicollinearity is present, at least one eigenvalue of the (centered and normalized) design matrix's cross-product is approximately 0.

Consequently, the normal equations either cannot be solved at all, or their solution is unstable.

This makes the model unstable (low robustness): a small change in the data produces a very large change in the result; for example, coefficients can become huge (in the millions or tens of millions), and their signs can flip frequently.

(Note: a matrix becomes singular for two reasons: ① more variables than samples; ② multicollinearity.)

Metrics for multicollinearity


How to find which variables are involved in the collinearity
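One commonly used diagnostic, sketched here with NumPy, is the variance inflation factor (VIF): regress each variable on all the others and compute VIF_j = 1/(1 − R_j²), conventionally flagging values above about 10. The sketch assumes no column is an exact linear combination of the others (otherwise 1 − R² is 0):

```python
import numpy as np

def vif(X):
    """Variance inflation factors: regress each column j on the others
    (plus an intercept) and report VIF_j = 1 / (1 - R_j^2).  A common
    rule of thumb flags VIF > 10 as serious multicollinearity."""
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        tss = float(np.sum((X[:, j] - X[:, j].mean()) ** 2))
        r2 = 1.0 - float(resid @ resid) / tss
        out.append(1.0 / (1.0 - r2))
    return out
```

Columns that share a near-exact linear relationship will all show large VIFs, which is also how one identifies which variables are involved.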

