Machine Learning--PCA Dimensionality Reduction and the Lasso Algorithm


1. PCA Dimensionality Reduction

What is dimensionality reduction good for?
Data are easier to handle and to work with in a low-dimensional space;
Relevant features, especially important ones, stand out clearly in the data, and with only two or three dimensions the data can be visualized directly;
Noise in the data is removed;
The computational cost of downstream algorithms is reduced.

Common dimensionality reduction algorithms include principal component analysis (Principal Component Analysis, PCA), factor analysis (Factor Analysis), and independent component analysis (Independent Component Analysis, ICA). PCA is currently the most widely used of these.

In PCA, the data are transformed from the original coordinate system into a new one, and the choice of the new coordinate system is determined by the data themselves. The first new axis is chosen along the direction of greatest variance in the original data; this is the most significant direction of the data (line B in the article's original figure). The second axis is orthogonal to the first and captures the next greatest variance (line C in the figure). This process is repeated as many times as there are features in the original data.

The features of the data expressed along these directions are called "principal components".

Principal component analysis (PCA) is the most commonly used linear dimensionality reduction method. Its goal is to map high-dimensional data into a low-dimensional space through a linear projection such that the variance of the data along the projected dimensions is as large as possible. In this way fewer dimensions are used while more of the original data's information is preserved.

Intuitively, if all points were mapped onto a single point, almost all information (such as the distances between points) would be lost; if instead the projected points are spread apart as much as possible, the data points remain scattered and more information is retained. It can be shown that PCA is the linear dimensionality reduction method that loses the least information about the raw data. (More precisely, its reconstruction stays closest to the original data; PCA does not, however, attempt to uncover the data's intrinsic structure.)
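To make the projection concrete, here is a minimal PCA sketch in NumPy (the function and variable names are mine, not from the article): center the data, take the eigenvectors of the covariance matrix, and project onto the top-k directions of greatest variance.

    import numpy as np

    def pca(X, k):
        """Reduce X (n_samples x n_features) to k dimensions."""
        X_centered = X - X.mean(axis=0)          # move the data to the origin
        cov = np.cov(X_centered, rowvar=False)   # feature covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices
        order = np.argsort(eigvals)[::-1]        # sort by variance, descending
        components = eigvecs[:, order[:k]]       # top-k principal directions
        return X_centered @ components           # projected coordinates

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                # synthetic data for illustration
    X_2d = pca(X, k=2)
    print(X_2d.shape)                            # (100, 2)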

2. The Lasso Algorithm

Reference: http://blog.csdn.net/slade_sha/article/details/53164905

Let's start with an example of curve fitting.

In the article's original figure, the red curve clearly overfits the data, while the green curve is a reasonable fit. To avoid overfitting, we can introduce regularization.

Regularization addresses the overfitting that can arise during curve fitting. As a measure of fit we use the root mean square error (also called the standard error): RMSE = √(∑ d_i² / n), where n is the number of measurements and d_i is the deviation of each measured value from the true value.
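As a quick sanity check of the formula, a tiny NumPy snippet (the measured and true values are made up for illustration):

    # RMSE as defined above: d_i is the deviation of each measured
    # value from the true value, n is the number of measurements.
    import numpy as np

    y_true = np.array([3.0, 5.0, 2.5, 7.0])   # true values (made up)
    y_meas = np.array([2.8, 5.4, 2.9, 7.1])   # measured values (made up)
    d = y_meas - y_true
    rmse = np.sqrt(np.sum(d**2) / len(d))
    print(rmse)                                # ≈ 0.30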

In the actual regression process, we need to take this error term into account.

The objective looks like that of simple linear regression; when regularization is introduced, a constraint is added to the optimization in the form of a penalty term: minimize ∑_i (y_i − x_iᵀβ)² + λ ∑_j |β_j|^q.

The penalty function can take several forms. The most commonly used are L1 and L2, corresponding to q = 1 and q = 2 in the expression above.

Let's discuss the two most commonly used cases, q = 1 and q = 2.

When q = 1, we get lasso regression. Why can the lasso control overfitting? During training there may be hundreds or even thousands of variables, and with so many variables explaining the dependent variable the model may over-explain the data. The q = 1 penalty limits the coefficients and can set some of them exactly to zero, so variables that are not particularly important are filtered out first. The classic picture shows the square (diamond-shaped) L1 constraint region meeting the elliptical error contours:

Unless, as a special case, a contour happens to be tangent to an edge of the square, the contours will first touch the constraint region at a vertex, where the coefficient on one axis is exactly zero. This is how the L1 penalty performs variable selection.

When q = 2, the constraint region is the blue circle in the original figure. Within this circular constraint the touching point can be any point on the circle, so coefficients are shrunk but generally not set exactly to zero. The q = 2 case is called ridge regression, and as the figure shows, ridge regression does not compress variables away.
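The geometric story above can be checked numerically. Below is a small scikit-learn sketch (the synthetic data and parameter values are my own choices; note that scikit-learn's alpha argument plays the role of the λ discussed later, not of this article's α):

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(42)
    X = rng.normal(size=(200, 20))
    true_coef = np.zeros(20)
    true_coef[:3] = [4.0, -2.0, 3.0]          # only 3 informative features
    y = X @ true_coef + rng.normal(scale=0.5, size=200)

    lasso = Lasso(alpha=0.1).fit(X, y)        # q = 1 penalty
    ridge = Ridge(alpha=0.1).fit(X, y)        # q = 2 penalty

    print("lasso zero coefficients:", np.sum(lasso.coef_ == 0))
    print("ridge zero coefficients:", np.sum(ridge.coef_ == 0))

With these settings the lasso typically sets most of the 17 uninformative coefficients exactly to zero, while ridge merely shrinks all coefficients toward zero without eliminating any.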

Lasso regression:

A key property of lasso regression is how broadly it applies when building generalized linear models: the dependent variable may be one-dimensional continuous, multidimensional continuous, a non-negative count, binary, or multinomial. Whether the dependent variable is continuous or discrete, the lasso can handle it, so its demands on the data are very low and its range of application is very wide. Beyond that, the lasso can both select variables and adjust the complexity of the model.

Variable selection here means that not all variables are forced into the model; rather, variables are put into the model selectively in order to obtain better performance. Complexity adjustment means controlling the model's complexity through a set of parameters, thereby avoiding overfitting. For linear models, complexity is directly related to the number of variables: the more variables, the more complex the model. More variables often yield a seemingly better fit at training time, but also bring the danger of overfitting.

The complexity of the lasso is controlled by λ: the larger λ is, the heavier the penalty on linear models with many variables, and the more likely we are to end up with a model that has fewer variables. A second parameter, α, controls how the model behaves on highly correlated data: α = 1 gives lasso regression and α = 0 gives ridge regression, matching the forms and purposes of the penalty functions above.
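As a sketch of how λ and α appear in an actual library: in scikit-learn's ElasticNet the mixing parameter α is called l1_ratio (l1_ratio=1 is pure lasso, l1_ratio=0 is pure ridge, as in R's glmnet), while λ is called alpha. The data below are synthetic and the parameter values are illustrative:

    import numpy as np
    from sklearn.linear_model import ElasticNet

    rng = np.random.default_rng(0)
    X = rng.normal(size=(150, 10))
    y = 2.0 * X[:, 0] + rng.normal(scale=0.3, size=150)  # one informative feature

    for l1_ratio in (1.0, 0.5, 0.1):          # lasso -> mixed -> nearly ridge
        model = ElasticNet(alpha=0.1, l1_ratio=l1_ratio).fit(X, y)
        n_zero = np.sum(model.coef_ == 0)
        print(f"l1_ratio={l1_ratio}: {n_zero} coefficients exactly zero")

As l1_ratio falls toward 0, fewer coefficients are driven exactly to zero, matching the lasso-versus-ridge contrast above.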
