Scikit-learn provides many classes for linear regression analysis. This article summarizes how to use them, focusing on the differences between these linear regression algorithm libraries and their respective usage scenarios.
The purpose of linear regression is to obtain a linear relationship between the output vector \(\mathbf{Y}\) and the input features \(\mathbf{X}\), that is, to find the linear regression coefficients \(\mathbf{\theta}\) such that \(\mathbf{Y = X\theta}\). Here the dimension of \(\mathbf{Y}\) is m x 1, the dimension of \(\mathbf{X}\) is m x n, and the dimension of \(\mathbf{\theta}\) is n x 1, where m is the number of samples and n is the number of sample features.
To obtain the linear regression coefficients \(\mathbf{\theta}\), we need to define a loss function, an optimization method for minimizing that loss function, and a method for validating the algorithm. Different loss functions, different optimization methods, and different validation methods give rise to different linear regression algorithms. The linear regression classes in scikit-learn can be distinguished from one another on these three points, and understanding these differences makes it clear in which scenarios each algorithm should be used.
1. LinearRegression
Loss Function:
The LinearRegression class is the most common form of linear regression, and its loss function is the simplest, as follows:
\(J(\mathbf{\theta}) = \frac{1}{2}(\mathbf{X\theta} - \mathbf{Y})^T(\mathbf{X\theta} - \mathbf{Y})\)
Optimization method of Loss Function:
For this loss function, there are generally two methods for minimizing it: gradient descent and least squares. The LinearRegression class in scikit-learn uses least squares. With least squares, the linear regression coefficients \(\mathbf{\theta}\) are:
\(\mathbf{\theta} = (\mathbf{X^TX})^{-1}\mathbf{X^TY}\)
Verification Method:
The LinearRegression class does not use validation methods such as cross-validation. We need to split the dataset into a training set and a test set ourselves, and then train and evaluate the model.
Usage scenarios:
In general, as long as we believe the data has a linear relationship, the LinearRegression class is our first choice. If the fit or the predictions turn out to be poor, we can then consider the other linear regression libraries. If you are just learning linear regression, it is recommended to start with this class.
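As a minimal sketch of this workflow (the synthetic data below is purely illustrative, generated with make_regression rather than taken from the article), we split the data ourselves and fit the LinearRegression class:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Illustrative synthetic data: 100 samples, 3 features (assumed for this sketch)
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

# LinearRegression has no built-in validation, so we split the data ourselves
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

reg = LinearRegression()
reg.fit(X_train, y_train)                        # least-squares fit of theta
print("coefficients:", reg.coef_)                # the fitted theta
print("R^2 on test set:", reg.score(X_test, y_test))
```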
2. Ridge
Loss Function:
The LinearRegression class of the first section does not consider overfitting. To address this, a regularization term can be added to the loss function. If the L2 norm is added, this is ridge regression, whose loss function is as follows:
\(J(\mathbf{\theta}) = \frac{1}{2}(\mathbf{X\theta} - \mathbf{Y})^T(\mathbf{X\theta} - \mathbf{Y}) + \frac{1}{2}\alpha||\theta||_2^2\)
where \(\alpha\) is a constant coefficient that needs to be tuned, and \(||\theta||_2\) is the L2 norm.
Ridge regression shrinks the regression coefficients without discarding any feature, so the model remains relatively stable and does not overfit.
Optimization method of Loss Function:
For this loss function, there are again two common minimization methods, gradient descent and least squares, and the Ridge class in scikit-learn uses least squares. With least squares, the regression coefficients \(\mathbf{\theta}\) are:
\(\mathbf{\theta} = (\mathbf{X^TX} + \alpha\mathbf{E})^{-1}\mathbf{X^TY}\)
where \(\mathbf{E}\) is the identity matrix.
Verification Method:
The Ridge class does not use validation methods such as cross-validation. We need to split the dataset into a training set and a test set, set the hyperparameter \(\alpha\) ourselves, and then train and evaluate the model.
Usage scenarios:
In general, if we believe the data has a linear relationship but the fit from the LinearRegression class is not particularly good and regularization is needed, we can consider the Ridge class. Its biggest drawback, however, is that we have to specify the hyperparameter \(\alpha\) ourselves each time and then judge how good that \(\alpha\) is, which is tedious. I generally use the RidgeCV class of the next section to run ridge regression, and do not recommend using the Ridge class directly unless you only want to learn about ridge regression.
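A minimal Ridge sketch on illustrative synthetic data; the value alpha=1.0 below is an arbitrary choice for demonstration and would normally need tuning:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Illustrative synthetic data (assumed for this sketch)
X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# alpha must be chosen by hand when using Ridge directly
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
print("coefficients:", ridge.coef_)
print("R^2 on test set:", ridge.score(X_test, y_test))
```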
3. RidgeCV
The loss function and its optimization method in the RidgeCV class are exactly the same as in the Ridge class; the difference is in the verification method.
Verification Method:
The RidgeCV class uses cross-validation over the hyperparameter \(\alpha\) to help us choose a suitable \(\alpha\). When initializing the RidgeCV class, we can pass a set of candidate \(\alpha\) values, whether 10 or even 100 of them, and the RidgeCV class will choose a suitable \(\alpha\) for us, relieving us of the trouble of screening \(\alpha\) by hand, round after round.
Usage scenarios:
In general, if we believe the data has a linear relationship but the fit from the LinearRegression class is not particularly good and regularization is needed, we can consider the RidgeCV class; there is no need to use the Ridge class at all. Why only "consider" the RidgeCV class? Because there are many regularized variants of linear regression and Ridge is just one of them, so another choice might be better. The RidgeCV class is inappropriate if the input features are very high-dimensional and the linear relationship is sparse; in that case one should mainly consider the Lasso regression family described in the following sections.
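A minimal RidgeCV sketch; the candidate alpha grid and the synthetic data below are only illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Illustrative synthetic data (assumed for this sketch)
X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

# Pass a grid of candidate alpha values; RidgeCV picks the best one via cross-validation
alphas = np.logspace(-3, 3, 13)          # e.g. 0.001 ... 1000
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X, y)
print("chosen alpha:", ridge_cv.alpha_)
```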
4. Lasso
Loss Function:
The L1-regularized version of linear regression is usually called Lasso regression. The difference from ridge regression is that the L1 regularization term, instead of the L2 term, is added to the loss function. L1 regularization also has a constant coefficient \(\alpha\) that adjusts the weight between the mean squared error and the regularization term. The loss function of Lasso regression is as follows:
\(J(\mathbf{\theta}) = \frac{1}{2m}(\mathbf{X\theta} - \mathbf{Y})^T(\mathbf{X\theta} - \mathbf{Y}) + \alpha||\theta||_1\)
where m is the number of samples, \(\alpha\) is a constant coefficient that needs to be tuned, and \(||\theta||_1\) is the L1 norm.
Lasso regression can shrink the coefficients of some features, and can even set some coefficients with small absolute values exactly to 0, which enhances the generalization ability of the model.
Optimization method of Loss Function:
There are two common methods for optimizing the Lasso loss function: coordinate descent and least angle regression. The Lasso class uses coordinate descent, while the LassoLars class introduced later uses least angle regression.
Verification Method:
Like the Ridge class, the Lasso class does not use validation methods such as cross-validation. We need to split the dataset into training and test sets, set the hyperparameter \(\alpha\) ourselves, and then train and evaluate the model.
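A minimal sketch of this manual workflow; alpha=0.1 and the synthetic sparse problem are illustrative assumptions only:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Illustrative sparse problem: only 5 of the 30 features are informative (assumed)
X, y = make_regression(n_samples=100, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

lasso = Lasso(alpha=0.1)                 # alpha must be set by hand
lasso.fit(X_train, y_train)
print("non-zero coefficients:", (lasso.coef_ != 0).sum())
print("R^2 on test set:", lasso.score(X_test, y_test))
```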
Usage scenarios:
In general, for high-dimensional feature data, especially when the linear relationship is sparse, we use Lasso regression. It is also the preferred choice when we want to find the main features among a large set of features. However, the Lasso class requires us to tune \(\alpha\) ourselves, so it is not the first choice for Lasso regression; the LassoCV class described in the next section is generally used instead.
5. LassoCV
The loss function and its optimization method in the LassoCV class are exactly the same as in the Lasso class; the difference is in the verification method.
Verification Method:
The LassoCV class uses cross-validation over the hyperparameter \(\alpha\) to help us choose a suitable \(\alpha\). When initializing the LassoCV class, we can pass a set of candidate \(\alpha\) values, whether 10 or even 100 of them, and the LassoCV class will choose a suitable \(\alpha\) for us, relieving us of the trouble of screening \(\alpha\) by hand.
Usage scenarios:
The LassoCV class is the first choice for Lasso regression. It is the class to use when we need to find the main features among a large number of high-dimensional features, and it also works well when the linear relationship is sparse.
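A minimal LassoCV sketch; the candidate alpha grid and the synthetic high-dimensional data are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Illustrative high-dimensional sparse problem (assumed for this sketch)
X, y = make_regression(n_samples=100, n_features=200, n_informative=10,
                       noise=5.0, random_state=0)

# Candidate alphas; LassoCV selects the best one by cross-validation
lasso_cv = LassoCV(alphas=np.logspace(-3, 1, 50), cv=5)
lasso_cv.fit(X, y)
print("chosen alpha:", lasso_cv.alpha_)
print("features kept:", (lasso_cv.coef_ != 0).sum())
```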
6. LassoLars
The loss function and the verification method of the LassoLars class are the same as those of the Lasso class; the difference is in the optimization method of the loss function.
Optimization method of Loss Function:
There are two common methods for optimizing the Lasso loss function: coordinate descent and least angle regression. The LassoLars class uses least angle regression, while the Lasso class mentioned earlier uses coordinate descent.
Usage scenarios:
The LassoLars class requires us to tune \(\alpha\) ourselves, so it is not the preferred choice for Lasso regression; the LassoLarsCV class described in the next section is generally used instead.
7. LassoLarsCV
The loss function and its optimization method in the LassoLarsCV class are exactly the same as in the LassoLars class; the difference is in the verification method.
Verification Method:
The LassoLarsCV class uses cross-validation over the hyperparameter \(\alpha\) to help us choose a suitable \(\alpha\). When initializing the LassoLarsCV class, we can pass a set of candidate \(\alpha\) values, whether 10 or even 100 of them, and the LassoLarsCV class will choose a suitable \(\alpha\) for us, relieving us of the trouble of screening \(\alpha\) by hand.
Usage scenarios:
The LassoLarsCV class is the second choice for Lasso regression; the first choice is the LassoCV class described earlier. So when is the LassoLarsCV class appropriate, in other words, when is least angle regression better than coordinate descent? Scenario one: if we want to explore many candidate values of the hyperparameter \(\alpha\), LassoLarsCV is better, because least angle regression exposes the full regularization path. Scenario two: if the number of samples is much smaller than the number of features, LassoLarsCV is better than LassoCV. In all other cases, LassoCV is the better choice.
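A minimal LassoLarsCV sketch for the "far more features than samples" scenario; the sizes below are illustrative assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsCV

# Illustrative case with far more features than samples (assumed)
X, y = make_regression(n_samples=40, n_features=500, n_informative=5,
                       noise=1.0, random_state=0)

# LassoLarsCV computes the LARS regularization path and cross-validates over it
lars_cv = LassoLarsCV(cv=5)
lars_cv.fit(X, y)
print("chosen alpha:", lars_cv.alpha_)
print("non-zero coefficients:", (lars_cv.coef_ != 0).sum())
```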
8. LassoLarsIC
The loss function and its optimization method in the LassoLarsIC class are exactly the same as in the LassoLarsCV class; the difference is in the verification method.
Verification Method:
The LassoLarsIC class does not use cross-validation for the hyperparameter \(\alpha\); instead it uses the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC). We do not need to specify candidate \(\alpha\) values; the LassoLarsIC class selects \(\alpha\) itself based on AIC or BIC. With the LassoLarsIC class we can find the hyperparameter \(\alpha\) in a single round, whereas with k-fold cross-validation we need k+1 rounds. By comparison, the LassoLarsIC class finds \(\alpha\) faster.
Usage scenarios:
As can be seen from the verification method, LassoLarsIC selects \(\alpha\) much faster than LassoLarsCV. Does that mean the LassoLarsIC class is necessarily better than the LassoLarsCV class? Not necessarily. Because the AIC and BIC criteria are used, our data must satisfy certain conditions before the LassoLarsIC class can be used. These criteria require a proper estimate of the degrees of freedom of the solution. The estimate is derived for large samples (an asymptotic result) and assumes the model is correct, i.e., that the data were actually generated by the hypothesized model. The criteria also tend to break down when the problem is badly conditioned (for example, when there are more features than samples). So we can only use LassoLarsIC when we know the data come from such a model and the number of samples is large enough. In practice, most of the data we encounter do not meet this requirement, and I have not actually used this seemingly attractive class.
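For completeness, a minimal LassoLarsIC sketch; the data below are an illustrative assumption with many samples relative to features, as the criteria expect:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsIC

# Illustrative data with many samples relative to features (assumed)
X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

# criterion can be 'aic' or 'bic'; alpha is chosen in a single fit, no cross-validation
lars_ic = LassoLarsIC(criterion="bic")
lars_ic.fit(X, y)
print("chosen alpha:", lars_ic.alpha_)
```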
9. ElasticNet
Loss Function:
ElasticNet can be regarded as a compromise between Lasso and Ridge. It is also a regularized linear regression, but its loss function uses neither pure L1 regularization nor pure L2 regularization; instead, a weight parameter \(\rho\) balances the proportions of the L1 and L2 terms, giving the following loss function:
\(J(\mathbf{\theta}) = \frac{1}{2m}(\mathbf{X\theta} - \mathbf{Y})^T(\mathbf{X\theta} - \mathbf{Y}) + \alpha\rho||\theta||_1 + \frac{\alpha(1-\rho)}{2}||\theta||_2^2\)
where \(\alpha\) is the regularization hyperparameter and \(\rho\) is the norm-weighting hyperparameter.
Optimization method of Loss Function:
There are two common methods for optimizing the ElasticNet loss function: coordinate descent and least angle regression. The ElasticNet class uses coordinate descent.
Verification Method:
Like the Lasso class, the ElasticNet class does not use validation methods such as cross-validation. We need to split the dataset into training and test sets, set the hyperparameters \(\alpha\) and \(\rho\) ourselves, and then train and evaluate the model.
Usage scenarios:
The ElasticNet class requires us to tune \(\alpha\) and \(\rho\) ourselves, so it is not the preferred choice for ElasticNet regression; the ElasticNetCV class described in the next section is generally used instead.
10. ElasticNetCV
The loss function and its optimization method in the ElasticNetCV class are exactly the same as in the ElasticNet class; the difference is in the verification method.
Verification Method:
The ElasticNetCV class uses cross-validation over the hyperparameters \(\alpha\) and \(\rho\) to help us choose suitable values. When initializing the ElasticNetCV class, we can pass sets of candidate \(\alpha\) and \(\rho\) values, whether 10 or even 100 of them, and the ElasticNetCV class will choose suitable \(\alpha\) and \(\rho\) for us, relieving us of the trouble of screening them by hand.
Usage scenarios:
The ElasticNetCV class is used when we find that Lasso regression is too aggressive (too many coefficients are shrunk to 0) while ridge regression regularizes too little (the coefficients decay too slowly). It is generally not recommended to apply ElasticNetCV to the data right away.
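A minimal ElasticNetCV sketch; note that in the scikit-learn API the mixing parameter \(\rho\) is called l1_ratio, and the candidate grids and data below are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

# Illustrative sparse problem (assumed for this sketch)
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=5.0, random_state=0)

# Cross-validate over both the regularization strength alpha and the L1/L2 mix l1_ratio (rho)
enet_cv = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0],
                       alphas=np.logspace(-3, 1, 30), cv=5)
enet_cv.fit(X, y)
print("chosen alpha:", enet_cv.alpha_)
print("chosen l1_ratio (rho):", enet_cv.l1_ratio_)
```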
11. OrthogonalMatchingPursuit
Loss Function:
The Orthogonal Matching Pursuit (OMP) algorithm differs from ordinary linear regression in that a constraint is added to the loss function that limits the maximum number of non-zero elements in the regression coefficients, giving the following loss function:
\(J(\mathbf{\theta}) = \frac{1}{2}(\mathbf{X\theta} - \mathbf{Y})^T(\mathbf{X\theta} - \mathbf{Y})\)
subject to \(||\theta||_0 \leq n_{\text{non-zero-coefs}}\), where \(||\theta||_0\) is the L0 norm of \(\theta\), i.e., the number of non-zero regression coefficients.
Optimization method of Loss Function:
The OrthogonalMatchingPursuit class uses a forward selection algorithm to optimize the loss function. It is a simplified version of the least angle regression algorithm: although its accuracy is not as good, it runs faster.
Verification Method:
Like the Lasso class, the OrthogonalMatchingPursuit class does not use validation methods such as cross-validation. We need to split the dataset into training and test sets, choose the constraint parameter \(n_{\text{non-zero-coefs}}\) ourselves, and then train and evaluate the model.
Usage scenarios:
The OrthogonalMatchingPursuit class requires us to choose \(n_{\text{non-zero-coefs}}\) ourselves, so it is not the preferred way to run orthogonal matching pursuit; the OrthogonalMatchingPursuitCV class described in the next section is generally used instead. However, if you have already decided on the value of \(n_{\text{non-zero-coefs}}\), using OrthogonalMatchingPursuit directly is more convenient.
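A minimal OrthogonalMatchingPursuit sketch, assuming (purely for illustration) that we already want at most 5 non-zero coefficients:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import OrthogonalMatchingPursuit

# Illustrative sparse problem (assumed for this sketch)
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

# The constraint on the number of non-zero coefficients is set by hand
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=5)
omp.fit(X, y)
print("non-zero coefficients:", (omp.coef_ != 0).sum())
```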
12. OrthogonalMatchingPursuitCV
The loss function and its optimization method in the OrthogonalMatchingPursuitCV class are exactly the same as in the OrthogonalMatchingPursuit class; the difference is in the verification method.
Verification Method:
The OrthogonalMatchingPursuitCV class uses S-fold cross-validation to select the best \(n_{\text{non-zero-coefs}}\), using the minimum mean squared error (MSE) as the criterion.
Usage scenarios:
The OrthogonalMatchingPursuitCV class is usually used for feature selection when the regression coefficients are sparse, in which respect it is similar to LassoCV. However, because its loss function is optimized with a forward selection algorithm, its accuracy is relatively low, and it is generally not particularly recommended; LassoCV is usually sufficient. Consider OrthogonalMatchingPursuitCV only if you care about the exact number of non-zero regression coefficients.
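A minimal OrthogonalMatchingPursuitCV sketch on illustrative synthetic data (the problem sizes are assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import OrthogonalMatchingPursuitCV

# Illustrative sparse problem (assumed for this sketch)
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

# Cross-validation picks the number of non-zero coefficients by minimum MSE
omp_cv = OrthogonalMatchingPursuitCV(cv=5)
omp_cv.fit(X, y)
print("chosen number of non-zero coefficients:", omp_cv.n_nonzero_coefs_)
```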
13. MultiTaskLasso
The classes from this section through section 16 all have a "MultiTask" prefix. This does not refer to multi-threading in programming; it refers to multiple linear regression models that share the same sample features but have different regression coefficients and outputs. The specific linear regression model is \(\mathbf{Y = XW}\), where X is an m x n matrix, W is an n x k matrix, and Y is an m x k matrix. m is the number of samples, n is the number of sample features, and k is the number of regression models. The "multitask" here really means that the k linear regression models are fitted jointly.
Loss Function:
Because several linear regression models are fitted together, the loss function differs considerably from the previous ones:
\(J(\mathbf{W}) = \frac{1}{2m}||\mathbf{XW} - \mathbf{Y}||_{Fro}^2 + \alpha||\mathbf{W}||_{21}\)
where \(||\mathbf{XW} - \mathbf{Y}||_{Fro}\) is the Frobenius norm of \(\mathbf{XW} - \mathbf{Y}\), and \(||\mathbf{W}||_{21}\) is the sum of the square roots of the sums of squares taken over the rows of W, i.e., the sum of the L2 norms of the rows of W (each row groups the coefficients of one feature across the k tasks).
Optimization method of Loss Function:
The MultiTaskLasso class uses coordinate descent to optimize the loss function.
Verification Method:
Like the Lasso class, the MultiTaskLasso class does not use validation methods such as cross-validation. We need to split the dataset into training and test sets, set the hyperparameter \(\alpha\) ourselves, and then train and evaluate the model.
Usage scenarios:
The MultiTaskLasso class requires us to tune \(\alpha\) ourselves, so it is not the preferred choice for joint regression with shared features; the MultiTaskLassoCV class described in the next section is commonly used instead.
14. MultiTaskLassoCV
The loss function and its optimization method in the MultiTaskLassoCV class are exactly the same as in the MultiTaskLasso class; the difference is in the verification method.
Verification Method:
The MultiTaskLassoCV class uses cross-validation over the hyperparameter \(\alpha\) to help us choose a suitable \(\alpha\). When initializing the MultiTaskLassoCV class, we can pass a set of candidate \(\alpha\) values, whether 10 or even 100 of them, and the MultiTaskLassoCV class will choose a suitable \(\alpha\) for us.
Usage scenarios:
MultiTaskLassoCV is preferred when multiple regression models need to share sample features and be fitted together. It ensures that the selected features are shared by all the models: there is no case where one model selects a feature while another does not.
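A minimal MultiTaskLassoCV sketch; here Y is two-dimensional (one column per task), and the synthetic multi-output problem is an illustrative assumption:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import MultiTaskLassoCV

# Illustrative multi-output problem: k = 3 regression tasks sharing the same features (assumed)
X, Y = make_regression(n_samples=100, n_features=30, n_targets=3,
                       n_informative=5, noise=1.0, random_state=0)

mtl_cv = MultiTaskLassoCV(cv=5)
mtl_cv.fit(X, Y)                         # Y has shape (n_samples, n_tasks)
print("chosen alpha:", mtl_cv.alpha_)
# Each feature is either kept for all k tasks or dropped for all of them
print("coef_ shape:", mtl_cv.coef_.shape)
```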
15. MultiTaskElasticNet
Loss Function:
The model of the MultiTaskElasticNet class is the same as that of the MultiTaskLasso class; only the loss function differs. The loss function is as follows:
\(J(\mathbf{W}) = \frac{1}{2m}||\mathbf{XW} - \mathbf{Y}||_{Fro}^2 + \alpha\rho||\mathbf{W}||_{21} + \frac{\alpha(1-\rho)}{2}||\mathbf{W}||_{Fro}^2\)
where \(||\mathbf{XW} - \mathbf{Y}||_{Fro}\) is the Frobenius norm of \(\mathbf{XW} - \mathbf{Y}\), and \(||\mathbf{W}||_{21}\) is, as before, the sum of the L2 norms of the rows of W.
Optimization method of Loss Function:
The MultiTaskElasticNet class uses coordinate descent to optimize the loss function.
Verification Method:
Like the Lasso class, the MultiTaskElasticNet class does not use validation methods such as cross-validation. We need to split the dataset into training and test sets, set the hyperparameters \(\alpha\) and \(\rho\) ourselves, and then train and evaluate the model.
Usage scenarios:
The MultiTaskElasticNet class requires us to tune \(\alpha\) and \(\rho\) ourselves, so it is not the preferred choice for joint regression with shared features; if MultiTaskElasticNet is needed, the MultiTaskElasticNetCV class described in the next section is generally used instead.
16. MultiTaskElasticNetCV
The loss function and its optimization method in the MultiTaskElasticNetCV class are exactly the same as in the MultiTaskElasticNet class; the difference is in the verification method.
Verification Method:
The MultiTaskElasticNetCV class uses cross-validation over the hyperparameters \(\alpha\) and \(\rho\) to help us choose suitable values. When initializing the MultiTaskElasticNetCV class, we can pass sets of candidate \(\alpha\) and \(\rho\) values, whether 10 or even 100 of them, and the MultiTaskElasticNetCV class will choose suitable \(\alpha\) and \(\rho\) for us, relieving us of the trouble of screening them by hand.
Usage scenarios:
MultiTaskElasticNetCV is one of the two main options when multiple regression models need to share sample features and be fitted together, the preferred one being MultiTaskLassoCV. If we find that the regression coefficients decay too quickly with MultiTaskLassoCV, we can consider using MultiTaskElasticNetCV instead.
17. BayesianRidge
Sections 17 and 18 cover Bayesian regression models. The Bayesian regression model assumes that the prior, the likelihood, and the posterior are all normal distributions. The model output Y is assumed to follow a normal distribution with mean \(X\theta\), and the regularization parameter \(\alpha\) is treated as a random variable that needs to be estimated from the data. The prior distribution of the regression coefficients \(\theta\) is a spherical normal distribution with hyperparameter \(\lambda\). We need to estimate the hyperparameters \(\alpha\) and \(\lambda\), as well as the regression coefficients \(\theta\), by maximizing the marginal likelihood function.
Here the loss function, the negative of the log marginal likelihood, is not discussed further, but its form is very similar to the ridge regression loss function, hence the name BayesianRidge.
Usage scenarios:
If our data contain many missing values or contradictory, ill-conditioned samples, we can consider the BayesianRidge class; it is very robust to ill-conditioned data and does not need cross-validation to select hyperparameters. However, the inference process of maximizing the marginal likelihood is rather time-consuming, so it is generally not recommended.
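A minimal BayesianRidge sketch on illustrative synthetic data; both hyperparameters are estimated from the data rather than set by hand:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import BayesianRidge

# Illustrative noisy data (assumed for this sketch)
X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

br = BayesianRidge()                     # alpha and lambda are estimated from the data
br.fit(X, y)
print("estimated alpha:", br.alpha_)     # noise precision
print("estimated lambda:", br.lambda_)   # precision of the coefficient prior
print("coefficients:", br.coef_)
```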
18. ARDRegression
ARDRegression is very similar to BayesianRidge; the only difference lies in the prior distribution assumed for the regression coefficients \(\theta\). BayesianRidge assumes that the prior of \(\theta\) is a spherical normal distribution, whereas ARDRegression drops the spherical Gaussian assumption of BayesianRidge and uses an axis-aligned elliptical Gaussian distribution instead. The corresponding hyperparameter \(\lambda\) therefore has n dimensions, one per feature, which may all differ, whereas the \(\lambda\) for the spherical distribution in BayesianRidge above is a single scalar.
ARDRegression also estimates the hyperparameter \(\alpha\), the vector of hyperparameters \(\lambda\), and the regression coefficients \(\theta\) by maximizing the marginal likelihood function.
Usage scenarios:
If our data contain many missing values or contradictory, ill-conditioned samples, we can consider the BayesianRidge class, and if the fit is not good, we can try ARDRegression instead. Because ARDRegression's prior assumption on the regression coefficients is less restrictive than BayesianRidge's, it can sometimes produce better posterior results than BayesianRidge.
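A minimal ARDRegression sketch; the synthetic sparse problem is an illustrative assumption:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ARDRegression

# Illustrative sparse problem (assumed for this sketch)
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

ard = ARDRegression()
ard.fit(X, y)
# lambda_ is now a vector: one precision per feature, instead of a single scalar
print("lambda_ shape:", ard.lambda_.shape)
print("coefficients:", ard.coef_)
```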
The above is a summary of linear regression in scikit-learn; I hope it is helpful.