Machine Learning from Statistics (II): Some Thoughts on Multicollinearity


Start from an everyday observation: when setting up a computer, we rarely install more than one archive utility, and we certainly don't want some "system manager" suite bundled in uninvited. By contrast, we happily install several media players. Why is that? You might also ask a second question: the software is all free and the hard disk is big enough, so why not install everything? The second question makes the idea clear. Archive utilities and system managers all do roughly the same job, and all of them are free, so one is enough; it hardly matters which. Players, however, differ: some are offline, some stream from the network, and even two streaming players offer different content libraries, however similar the players themselves look.

In 1996, Tim C. Hesterberg, then still a student, asked the Stanford statistician Bradley Efron: "What is the most important problem in statistics?" He expected Efron to name the bootstrap, the method that had made Efron famous; instead, Efron's answer was variable selection.

If we think of each piece of software as a variable, deciding what to install becomes a variable selection problem: choose the right subset of software to maximize user satisfaction.

1. Starting from multiple linear regression

  Consider a user satisfaction assessment problem. Each piece of software has user ratings (0~5) and an expert score (0~5). User ratings are subjective: different users generally give the same software different scores. Expert scores weigh the product, the technology, and so on; they are more objective, so we treat a software's expert score as the same for all users. We model a user's satisfaction with a piece of software as the product of the user's rating and the expert score: the product combines user experience with intrinsic value, lets the two check each other, and better reflects real satisfaction. So, given data on users' overall satisfaction and their ratings of each software, can we estimate each software's expert score by regression? Can a regression model recover an objective ranking of the software?

  Assume there are n users and p pieces of software. Under the subjective/objective scoring assumptions above, let the column vector y = (y1, y2, ..., yn)^T denote each user's overall satisfaction; let X be an n×p matrix whose i-th row xi = (xi1, xi2, ..., xip) holds user i's ratings of the p pieces of software; and let β = (β1, β2, ..., βp)^T denote the expert scores. The satisfaction model is then y = Xβ + ε, where ε is the error term. We want to estimate β from a data set of user satisfaction and ratings; in theory, that would yield an objective measure of how good each piece of software is.

2. A simple simulation experiment

As in the previous article, we use least squares, now in its multiple linear regression form. The idea is unchanged: choose the estimate of the vector β so that, when each user's ratings are combined into a predicted satisfaction, the sum of squared errors between predicted and observed satisfaction over all users is as small as possible.
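Formally, least squares minimizes ||y − Xβ||^2 over β; when X^T X is invertible, the minimizer has the closed form (X^T X)^{-1} X^T y. Below is a minimal sketch of this formula on a toy data set of my own; lm() computes the same estimate internally (via a QR decomposition):

# Normal-equation least squares on toy data (names and numbers are mine).
X <- cbind(1, c(1, 2, 3, 4), c(2, 1, 4, 3))   # toy design: intercept, x1, x2
y <- c(3, 4, 8, 9)                            # toy responses
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y  # beta_hat = (X'X)^{-1} X'y
beta_hat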

Back to the first question in the opening paragraph, suppose we study only two archive utilities, x1 and x2. By the assumption there, users feel the two are barely different, so each user gives them very similar ratings. Experts, judging objectively, still see a gap: the two expert scores are β1 = 3.5 and β2 = 4.5. The true satisfaction model is therefore y = 3.5x1 + 4.5x2 + ε. Suppose there are 10 users, with some variation among them; the data are tabulated below:

User          1      2      3      4      5      6      7      8      9     10
x1 rating   1.7    2.0    2.3    2.5    2.7    3.3    3.8    4.3    4.6    4.9
x2 rating   1.7    2.2    2.4    2.6    2.9    3.1    4.0    4.1    4.8    5.0
εi          0.9   -0.5    0.5   -0.6    0.4    2.1    1.7    0.4   -1.7   -1.3
yi        14.50  16.40  19.35  19.85  22.90  27.60  33.00  33.90  36.00  38.35

  In R, we fit a multiple linear regression model to estimate β1 and β2:

x1 <- c(1.7, 2.0, 2.3, 2.5, 2.7, 3.3, 3.8, 4.3, 4.6, 4.9)
x2 <- c(1.7, 2.2, 2.4, 2.6, 2.9, 3.1, 4.0, 4.1, 4.8, 5.0)
eps <- c(0.9, -0.5, 0.5, -0.6, 0.4, 2.1, 1.7, 0.4, -1.7, -1.3)
y <- 3.5 * x1 + 4.5 * x2 + eps   # generate satisfaction from the true model
model <- lm(y ~ x1 + x2)         # ordinary least squares fit
summary(model)

Residuals:
    Min      1Q  Median      3Q     Max
-0.9534 -0.6940 -0.2868  0.4040  2.2507

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.7315     1.1518   1.503   0.1765
x1            7.0096     2.3947   2.927   0.0221 *
x2            0.5953     2.4084   0.247   0.8119
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.125 on 7 degrees of freedom
Multiple R-squared: 0.987,	Adjusted R-squared: 0.9833
F-statistic: 265.9 on 2 and 7 DF,  p-value: 2.501e-07

The true model is y = 3.5x1 + 4.5x2 + ε.

The fitted least squares model is y = 1.7315 + 7.0096x1 + 0.5953x2.

The two differ enormously, yet the model's adjusted R-squared is 0.9833: the model is badly distorted while the fit looks excellent. Note that our true model has no constant term, yet the fit produced one; with slightly different data, the coefficient β2 of x2 could even come out negative, and since the scores were set between 0 and 5, a negative estimate would make the falseness of the regression obvious. So why does this happen? The reason is simple: the data contain highly correlated variables (the correlation between x1 and x2 is as high as 0.987). The two variables are like two nearly parallel vectors; in other words, they are collinear. Put plainly: the two pieces of software are so similar that the data cannot tell which one contributes more to user satisfaction; splitting the credit 10:0, 5:5, or 0:10 between them makes almost no difference to the fit.
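This can be checked directly in R. Here is a minimal sketch of two standard diagnostics, the sample correlation and the variance inflation factor (the car package, which provides vif(), is an assumption here and must be installed separately):

cor(x1, x2)            # sample correlation of the two rating vectors, about 0.987
library(car)           # assumed installed; provides vif()
vif(lm(y ~ x1 + x2))   # VIFs far above the common rule-of-thumb threshold of 10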

As the output above shows, the standard error of the estimate of β1 reaches 2.3947, and that of β2 reaches 2.4084. Even though the least squares estimator is still unbiased, it is no longer efficient: the variance of the estimates is huge. In the previous article we said that least squares is the best among unbiased estimators. This hints at the remedy: sacrifice unbiasedness to some degree in exchange for efficiency, using biased methods such as ridge regression or principal component regression.
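For illustration, here is a minimal ridge regression sketch on the same simulated data, using lm.ridge() from the MASS package that ships with R; the grid of lambda values is my own choice, purely for demonstration:

library(MASS)                            # provides lm.ridge()
ridge <- lm.ridge(y ~ x1 + x2, lambda = seq(0, 10, by = 0.1))
select(ridge)                            # lambdas suggested by HKB, L-W and GCV
coef(lm.ridge(y ~ x1 + x2, lambda = 5))  # with collinear x1, x2, ridge pulls the two
                                         # coefficients toward each other, typically
                                         # much nearer the true (3.5, 4.5) than OLS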

3. Multicollinearity is ubiquitous

  In multiple regression problems on cross-sectional data in statistics or machine learning, X is an n×p matrix, and typically p ≫ n. This is high-dimensional data: many variables are thrown in because each may carry information, but the researcher does not know in advance which ones are useful. Among so many variables, some will be collinear with each other, and it is also easy for one variable to be nearly a linear combination of several others (a do-everything utility can be replaced by several specialized tools); this situation is called multicollinearity. We must perform variable selection, or spurious regression easily follows. A toy sketch of the extreme case is given below.
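In this sketch (my own construction, not from the article), one variable is an exact linear combination of two others, the design matrix is rank-deficient, and lm() returns NA for the redundant coefficient:

set.seed(2)
z1 <- rnorm(20)
z2 <- rnorm(20)
z3 <- z1 + z2                    # exactly collinear with z1 and z2 by construction
w <- 1 + 2 * z1 - z2 + rnorm(20, sd = 0.1)
coef(lm(w ~ z1 + z2 + z3))       # the z3 coefficient comes back NA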

  Professor Jianqing Fan, an editor of The Annals of Statistics and a tenured professor of statistics and financial engineering at Princeton University, ran the following simulation experiment in one of his papers:

Randomly generate n = 50 samples of i.i.d. variables Z1, ..., Zp. For p = 1000 and p = 10000, compute the distribution of the maximum absolute correlation between Z1 and the Zj, j ≥ 2 (left panel), and the distribution of the maximum absolute multiple correlation between Z1 and any 5 of the other variables (right panel). It is easy to see that, whether with 1,000 or 10,000 variables, no single randomly simulated variable is collinear with Z1: none is highly correlated with it, and multiplying the number of variables by ten barely increases the chance of a high correlation. However, a linear combination of just 5 of the other variables, out of 1,000 randomly simulated ones, is easily highly correlated with Z1; that is, multicollinearity appears. When the number of variables reaches 10,000, multicollinearity is even more likely, and the correlations are generally stronger.
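Here is a minimal sketch reproducing the spirit of this experiment (my own code, not Fan's; it assumes i.i.d. standard normal variables and replaces the exhaustive search over 5-variable subsets with a greedy stand-in, the 5 variables most correlated with Z1):

set.seed(1)
n <- 50; p <- 1000
Z <- matrix(rnorm(n * p), nrow = n)   # n samples of p i.i.d. N(0,1) variables
r <- cor(Z[, 1], Z[, -1])             # correlations of Z1 with Z2, ..., Zp
max(abs(r))                           # max marginal correlation: clearly below 1

idx <- order(abs(r), decreasing = TRUE)[1:5] + 1  # 5 variables most correlated with Z1
fit <- lm(Z[, 1] ~ Z[, idx])
sqrt(summary(fit)$r.squared)          # multiple correlation: much closer to 1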

Clearly, neither 1,000 nor 10,000 variables is a large number relative to real problems. If even randomly simulated, mutually independent variables exhibit multicollinearity virtually 100% of the time in high dimensions, how could real data, whose variables are not independent at all, do any better?

