Ridge Regression (statistical model)

Source: Internet
Author: User

Ridge regression is used to deal with the following two types of problems:

1. The number of observations is less than the number of variables

2. There is collinearity among the variables

When there is collinearity among the variables, the least squares regression coefficients are unstable and have very large variance, because the cross-product of the design matrix X with its transpose (X'X) is nearly singular and cannot be reliably inverted. Ridge regression solves this problem by introducing a penalty parameter lambda. In R, the function lm.ridge() in the MASS package does this conveniently. Its input matrix X is always n*p, regardless of whether the model contains a constant term.
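To illustrate the idea (this is a minimal NumPy sketch of the closed-form ridge estimate, not the MASS implementation; the data and names are made up):

```python
import numpy as np

def ridge_coef(X, y, lam):
    """Closed-form ridge estimate: solve (X'X + lam*I) beta = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Tiny example with two nearly collinear columns.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=50)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=50)

beta = ridge_coef(X, y, lam=1.0)
```

With lam = 0 this reduces to ordinary least squares, which is numerically unusable here because the two columns are almost identical; with lam > 0 the system is well conditioned and the two coefficients share the joint effect.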

When a constant term is included, the function centers y, using the mean of y as the centering factor, and centers and scales X, using each variable's mean and standard deviation as the factors. After this processing the means of X and y are 0, which forces the regression plane through the origin; that is, the constant term is 0. Therefore, although a constant term was specified, the coefficients given by lm.ridge()$coef contain no intercept. When using the model for prediction, the new X and y must be centered and scaled in the same way, using the factors computed from the training data, and then multiplied by the coefficients to obtain the predictions. Note that if you print the fitted lm.ridge object directly at the command line, a different set of coefficients is shown, and these do include a constant term. They differ from the coefficients in lm.ridge()$coef because they are expressed on the original (un-centered, un-scaled) data scale, so they can be used for prediction directly, without centering or scaling the data.
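The relation between the two coefficient scales can be sketched as follows (a NumPy analogue of the centering/scaling described above; the exact scaling conventions inside MASS may differ in detail, and the data are made up). The point is that the standardized coefficients plus the training factors, and the back-transformed coefficients plus an intercept, give identical predictions:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=2.0, size=(40, 2))
y = 3.0 + X @ np.array([1.5, -0.5]) + 0.1 * rng.normal(size=40)
lam = 0.5

# Center y, and center/scale X, with the *training* means and sds.
x_mean, x_sd = X.mean(axis=0), X.std(axis=0)
y_mean = y.mean()
Xs = (X - x_mean) / x_sd
ys = y - y_mean

# Coefficients on the standardized scale (analogue of fit$coef).
coef_scaled = np.linalg.solve(Xs.T @ Xs + lam * np.eye(2), Xs.T @ ys)

# Back-transform to the original scale: these include an intercept
# (analogue of the coefficients printed for the model object).
coef_orig = coef_scaled / x_sd
intercept = y_mean - x_mean @ coef_orig

# Prediction on new data: the two routes agree.
X_new = rng.normal(loc=5.0, scale=2.0, size=(5, 2))
pred_scaled = ((X_new - x_mean) / x_sd) @ coef_scaled + y_mean
pred_orig = intercept + X_new @ coef_orig
```

Forgetting to apply the training-time centering and scaling factors before multiplying by the standardized coefficients is the error the paragraph above warns against.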

When the specified model does not contain a constant term, the model assumes that the mean of each variable is 0, so X and y are not centered, because the fit is meant to pass through the origin. X is still scaled, and the scaling factor is again the standard deviation of each variable, computed under the assumption that its mean is 0. When making predictions, if the lm.ridge$coef coefficients are used, the new data must be scaled in the same way. If you use the coefficients printed for the model object directly, you simply multiply the raw data by them.

Choice of the ridge lambda: you can call select() on the fitted lm.ridge object for an automatic recommendation; the usual rule is to take the lambda with the smallest GCV (generalized cross-validation) score. The lambda range is values greater than 0.
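The GCV criterion can be sketched directly (a NumPy illustration of grid search over lambda, not the select() implementation; data, grid, and names are made up). GCV(lambda) is the mean residual sum of squares divided by (1 - edf/n)^2, where edf is the trace of the hat matrix:

```python
import numpy as np

def gcv(X, y, lam):
    """Generalized cross-validation score for one ridge penalty lam."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)  # hat matrix
    resid = y - H @ y
    edf = np.trace(H)                 # effective degrees of freedom
    return (resid @ resid / n) / (1.0 - edf / n) ** 2

rng = np.random.default_rng(2)
x1 = rng.normal(size=60)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=60)])
y = X @ np.array([1.0, 2.0]) + 0.5 * rng.normal(size=60)

grid = np.linspace(0.01, 5.0, 100)
scores = np.array([gcv(X, y, lam) for lam in grid])
best_lam = grid[scores.argmin()]
```

Larger lambda shrinks the coefficients and lowers edf; GCV balances that against the growing residuals, and the minimizer on the grid plays the role of the value select() reports.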

The principle of ridge regression

Ridge regression is a biased estimation regression method designed specifically for the analysis of collinear data. It is essentially an improved least squares estimation: by giving up the unbiasedness of least squares, at the cost of losing some information and reducing precision, it obtains regression coefficients that are more realistic and reliable. Its tolerance of ill-conditioned data is much stronger than that of least squares.

The principle of ridge regression is more involved. According to the Gauss-Markov theorem, multicollinearity does not affect the unbiasedness or minimum-variance property of the least squares estimator. However, although the least squares estimator has the smallest variance among all linear unbiased estimators, that variance is not necessarily small; in fact one can find a biased estimator that trades a small bias for precision much higher than that of the unbiased estimate. Ridge regression is based on this principle: a biased constant is introduced into the normal equations to obtain the regression estimator.
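The effect of the biased constant on the normal equations can be seen in their conditioning (an illustrative NumPy sketch with made-up data): adding lambda to the diagonal of X'X raises the smallest eigenvalue and drastically reduces the condition number, which is why the ridge coefficients are stable where the least squares ones are not.

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(size=30)
# Two almost identical columns -> X'X is nearly singular.
X = np.column_stack([x1, x1 + 1e-5 * rng.normal(size=30)])

XtX = X.T @ X
lam = 0.1
cond_ols = np.linalg.cond(XtX)                     # enormous for collinear data
cond_ridge = np.linalg.cond(XtX + lam * np.eye(2))  # modest after the ridge
```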

Disadvantage: the R-squared of a ridge regression equation is usually slightly lower than that of an ordinary regression analysis, but the significance of the regression coefficients is often markedly higher, so in studies involving collinearity and ill-conditioned data it has greater practical value.

Gauss-Markov theorem

In statistics, the Gauss-Markov theorem states that:

In a linear regression model in which the errors have mean zero, equal variances, and are uncorrelated, the best linear unbiased estimator (BLUE) of the regression coefficients is the least squares estimator: it has the minimum variance among all linear unbiased estimators.

More generally, the BLUE of any linear combination of the regression coefficients is its least squares estimate.

In this linear regression model, the errors need not be assumed normally distributed, nor independent (only the weaker condition of being uncorrelated is needed), nor identically distributed.

Specifically, suppose the model is y = Xb + e, where E(e) = 0 and Var(e) = sigma^2 * I. Then the least squares estimator b_hat = (X'X)^(-1) X'y has the minimum variance among all linear unbiased estimators of b.
