A brief discussion on linear regression


Statistical learning is fun, but learning it is not easy. Textbooks, lecture notes, and papers are all flooded with lengthy formulas. Of course, this rigor is necessary for scientific literature, but for beginners it makes the entry threshold very high. I have recently been reading The Elements of Statistical Learning; these are some impressions from that reading, for reference only.

Chapter 3: Linear Regression

Suppose we know that X and Y satisfy a linear relationship. Given observed data for X and Y, how do we estimate the linear coefficients w?

This is really the problem of solving a linear system, Xw = y. The meaning of this equation is that w weights the column vectors of X to produce y. In theory, if y does not lie in the column space of X, the system has no solution, and this happens all the time because we live in a world of error. On the one hand, y is something we measure, and every measurement carries error; on the other hand, the relationship between X and Y may not actually be linear, yet we still want to fit a linear model. Although we cannot obtain a perfect solution, we can still move in that direction, so the goal becomes finding the w that minimizes the squared deviation between Xw and y. This is the origin of least squares.
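
A minimal sketch of this setup in Python (the data, noise level, and coefficients below are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 samples, 3 features, plus measurement noise.
n, p = 100, 3
X = rng.normal(size=(n, p))
w_true = np.array([2.0, -1.0, 0.5])          # assumed "true" coefficients
y = X @ w_true + 0.1 * rng.normal(size=n)    # noise pushes y out of the column space of X

# Least squares: find w minimizing ||Xw - y||^2.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)   # should be close to w_true
```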

The least-squares solution can be derived from two angles. One is algebraic: take the partial derivatives of the objective function and set them to zero. The other is geometric (I like intuitive things): project y onto the column space of X, then treat that projection as y and solve the now-consistent system.
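
A quick check that the two angles agree, reusing the X and y generated above (again, just a sketch):

```python
# Algebraic angle: normal equations, w = (X'X)^{-1} X'y.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Geometric angle: project y onto the column space of X,
# then solve the now-consistent system X w = y_proj.
P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection matrix onto col(X)
y_proj = P @ y
w_geom, *_ = np.linalg.lstsq(X, y_proj, rcond=None)

print(np.allclose(w_normal, w_geom))   # True: both angles give the same w
```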

After getting w, we want to examine it. The reason is that when we chose X, we may have accidentally included several correlated features, and this group of features might really only need one of them. So once we have w, we always wonder whether we can drop some components of X and reduce the complexity of the model; in terms of w, this means setting some of its components to zero. This is the subset selection approach to the linear regression model.

How do we choose this subset? The most intuitive approach is brute-force search: try every combination of features and pick the group that performs best. The complexity of this approach is exponential. An improved direction is the greedy algorithm, which comes in forward and backward versions. Forward greedy starts from the empty set and at each step adds to the feature pool the feature that most deserves to be added, where "deserves" means that, compared with the other features not yet added, including this feature reduces the error the most. Backward greedy starts from the full set and at each step removes from the feature pool the feature that most deserves to be removed; what "deserves" means here is analogous to the above. A sketch of the forward version follows.
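
Here is a small sketch of forward greedy selection, reusing the X and y above; the function name and the choice of residual sum of squares as the error measure are my own illustrative assumptions:

```python
# Forward greedy (stepwise) selection: at each step, add the feature
# that reduces the residual sum of squares the most.
def forward_selection(X, y, k):
    n, p = X.shape
    selected = []
    remaining = list(range(p))
    for _ in range(k):
        best_j, best_rss = None, np.inf
        for j in remaining:
            cols = selected + [j]
            w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ w) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

print(forward_selection(X, y, k=2))   # indices of the two most useful features
```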

Beyond subset selection come the famous ridge and lasso. Ridge can be understood from four angles. First, least squares may have no solution because X'X is not necessarily invertible, so we replace it with X'X + λI. Second, from the Bayesian point of view: put a Gaussian prior on w, and the ridge objective is then the posterior probability of w. Third, from the regularization point of view: it is equivalent to adding a regularization term to the objective function. The last is the most intuitive: through the SVD we obtain a set of orthogonal vectors for X, and the least-squares solution in effect first projects y onto this orthogonal space and then solves the system. Each orthogonal vector corresponds to a coefficient (a singular value of X) that represents the weight of that direction. Ridge shrinks the coefficients along all of the orthogonal directions, but directions whose original coefficients are large are shrunk only lightly, while directions whose original coefficients are small are shrunk more heavily.
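
A sketch of two of these views of ridge, reusing the X and y above with a made-up λ; the per-direction shrinkage factor d_j²/(d_j² + λ) is folded together here with the 1/d_j of plain least squares:

```python
lam = 1.0   # made-up regularization strength

# 1) Direct formula: w_ridge = (X'X + λI)^{-1} X'y.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# 2) SVD view: each orthogonal direction of X is shrunk by d_j^2 / (d_j^2 + λ),
#    so directions with small singular values are shrunk the most.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d / (d ** 2 + lam)            # = (1/d_j) * d_j^2/(d_j^2 + λ)
w_svd = Vt.T @ (shrink * (U.T @ y))

print(np.allclose(w_ridge, w_svd))     # True: same ridge solution either way
```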

Another way to deal with correlated input features is to first take linear combinations of the inputs to obtain intermediate vectors, and then run linear regression on those intermediate vectors. The principal component method is one way of choosing the intermediate vectors. Its core idea is similar to ridge; the difference is that ridge shrinks the principal components, compressing the least important components most strongly, while the principal component method simply discards the least important components.
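
A sketch of principal component regression, reusing the X and y above; the choice of k and the omission of centering are my own simplifications:

```python
# Principal component regression: keep only the k directions with the
# largest singular values and discard the rest, instead of shrinking them.
k = 2
U, d, Vt = np.linalg.svd(X, full_matrices=False)
Z = U[:, :k] * d[:k]                   # scores on the top-k principal directions
theta, *_ = np.linalg.lstsq(Z, y, rcond=None)
w_pcr = Vt[:k].T @ theta               # map back to the original feature space

print(w_pcr)
```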

In the end, the most commonly used methods are least squares and ridge. The figures in the original text are very good and well worth reading.
