Today, let's talk about linear regression. Yes, as the oldest model in the data science community, linear regression is practically a compulsory course for every data scientist. But setting aside the mountain of model analysis and hypothesis testing, do you really know how to use linear regression? Not necessarily!
The "old-fashioned" linear regression
Today, deep learning has become the new darling of data science. Even ten years ago, algorithms such as SVM and boosting could already beat linear regression on accuracy.
Why do we need linear regression?
On the one hand, the relationships that linear regression can model go far beyond linear ones. "Linear" in linear regression refers to linearity in the coefficients; through nonlinear transformations of the features and the extensions of the generalized linear model, the functional relation between the output and the features can be highly nonlinear. On the other hand, and more importantly, the interpretability of linear models gives them an irreplaceable position in fields such as physics, economics, and business science.
So how do you implement linear regression in Python?
Because of the widespread popularity of the machine learning library scikit-learn, a common approach is to call linear_model from that library to fit the data. While this offers the extra conveniences of a machine learning pipeline, such as data normalization, regularization of model coefficients, and passing the linear model on to a downstream model, it is usually not the quickest and simplest option when a data analyst just needs to determine the regression coefficients (and some basic associated statistics) quickly and easily.
Next, I'll introduce some faster and simpler methods, though they differ in how much information they provide and in how flexible the modeling is.
8 ways to implement linear regression
Method One: scipy.polyfit() or numpy.polyfit()
This is the most basic least-squares polynomial fitting function: it accepts a dataset and a polynomial of any degree (specified by the user) and returns a set of coefficients that minimizes the squared error. The official documentation describes the function in detail. For simple linear regression, you can choose degree 1. If you want to fit a higher-degree model, you can construct polynomial features from the linear feature data and fit the model on those.
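As an illustration, here is a minimal sketch using numpy.polyfit; the synthetic data and seed below are hypothetical, purely for demonstration:

```python
import numpy as np

# Hypothetical synthetic data: y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Degree 1 = simple linear regression; coefficients come highest power first.
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)
```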
Method Two: stats.linregress()
This is a highly specialized linear regression function that lives in SciPy's statistics module. Its flexibility is quite limited, though, because it only optimizes a least-squares regression between two sets of measurements; it cannot be used for generalized linear models or multivariate regression. But because of this specialization, it is one of the fastest methods for simple linear regression. In addition to the fitted coefficient and intercept, it returns basic statistics such as the correlation coefficient (whose square gives R²) and the standard error.
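A minimal sketch, again on hypothetical synthetic data:

```python
import numpy as np
from scipy import stats

# Hypothetical data: y = 3x - 2 plus noise.
rng = np.random.default_rng(1)
x = np.arange(20, dtype=float)
y = 3.0 * x - 2.0 + rng.normal(size=x.size)

result = stats.linregress(x, y)
print(result.slope, result.intercept)
print(result.rvalue ** 2)  # R-squared
print(result.stderr)       # standard error of the slope estimate
```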
Method Three: optimize.curve_fit()
This is consistent with the polyfit approach but more general in nature. This powerful function comes from the scipy.optimize module and can fit any user-defined function to a dataset via least-squares minimization.
For simple linear regression, you can just write a linear function m*x + c and call this estimator on it. It goes without saying that it also works for multivariate regression; it returns an array of function parameters that minimizes the least-squares metric, along with a covariance matrix.
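A minimal sketch, assuming a hypothetical user-defined linear function and synthetic data:

```python
import numpy as np
from scipy.optimize import curve_fit

# User-defined model: a simple line m*x + c.
def linear(x, m, c):
    return m * x + c

# Hypothetical data: y = 1.5x + 0.7 plus noise.
rng = np.random.default_rng(2)
x = np.linspace(0, 5, 30)
y = 1.5 * x + 0.7 + rng.normal(scale=0.2, size=x.size)

params, cov = curve_fit(linear, x, y)
print(params)                 # fitted [m, c]
print(np.sqrt(np.diag(cov)))  # approximate standard errors of the parameters
```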
Method Four: numpy.linalg.lstsq()
This is the basic method of computing the least-squares solution of a system of linear equations via matrix factorization, from NumPy's simple linear algebra module. It solves the equation Ax = b by finding the vector x that minimizes the Euclidean 2-norm ||b - Ax||².
The equation may have infinitely many solutions, a unique solution, or no solution. If A is square and of full rank, then x is (up to rounding) the "exact" solution of the equation.
You can use this method to run a univariate or multivariate linear regression and get back the computed coefficients and residuals. One small trick: before calling the function, you must append a column of 1s to the X data in order to compute the intercept term. This turns out to be one of the faster ways to solve linear regression problems.
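A minimal sketch of this approach on synthetic data, including the column-of-1s trick:

```python
import numpy as np

# Hypothetical data: y = -0.8x + 4 plus noise.
rng = np.random.default_rng(3)
x = np.linspace(0, 10, 40)
y = -0.8 * x + 4.0 + rng.normal(scale=0.3, size=x.size)

# Append a column of 1s so the second coefficient is the intercept.
X = np.column_stack([x, np.ones_like(x)])
coeffs, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
slope, intercept = coeffs
print(slope, intercept)
```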
Method Five: statsmodels.OLS()
Statsmodels is a small Python package that provides classes and functions for estimating many different statistical models, as well as for statistical tests and statistical data exploration. Each estimator comes with a corresponding comprehensive list of results, which are validated against existing statistical packages to ensure the correctness of the output.
For linear regression, the OLS (ordinary least squares) function in this package can be used to obtain complete statistical information about the estimation process.
One trick to keep in mind: you have to manually add a constant column to the data X in order to compute the intercept; otherwise, only the coefficients are reported by default. The complete summary of an OLS model is as rich as the output of statistical languages such as R or Julia.
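A minimal sketch, assuming hypothetical synthetic data; sm.add_constant supplies the constant column:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: y = 2.5x + 1 plus noise.
rng = np.random.default_rng(4)
x = np.linspace(0, 10, 40)
y = 2.5 * x + 1.0 + rng.normal(size=x.size)

X = sm.add_constant(x)    # prepend a column of 1s for the intercept
model = sm.OLS(y, X).fit()
print(model.params)       # [intercept, slope]
print(model.summary())    # the full statistical summary
```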
Methods Six and Seven: using the matrix inverse to solve for the analytic solution
For a well-conditioned linear regression problem (where, at a minimum, the number of data points exceeds the number of features), a simple closed-form matrix solution exists for the coefficients that minimizes the least squares. It is given by the following formula: w = (XᵀX)⁻¹Xᵀy.
Here are two options:
(a) use simple matrix multiplication together with the matrix inverse;
(b) first compute the Moore-Penrose generalized pseudoinverse of X, then take its dot product with y. Because this second route involves a singular value decomposition (SVD), it is slower, but it works well even on datasets that are not well-conditioned.
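A minimal sketch of both routes on hypothetical data:

```python
import numpy as np

# Hypothetical data: y = 0.5x + 2 plus noise.
rng = np.random.default_rng(5)
x = np.linspace(0, 10, 40)
y = 0.5 * x + 2.0 + rng.normal(scale=0.4, size=x.size)
X = np.column_stack([np.ones_like(x), x])  # design matrix with an intercept column

# (a) normal equations via the explicit matrix inverse
w_inv = np.linalg.inv(X.T @ X) @ (X.T @ y)

# (b) Moore-Penrose pseudoinverse (SVD-based, more robust when X is ill-conditioned)
w_pinv = np.linalg.pinv(X) @ y

print(w_inv, w_pinv)  # both give [intercept, slope]
```

In practice, the pseudoinverse route is usually preferred over an explicit inverse for numerical stability, at the cost of some speed.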
Method Eight: sklearn.linear_model.LinearRegression()
This is the typical method used by most machine learning engineers and data scientists. Of course, for real-world problems it may be superseded by cross-validated and regularized algorithms such as lasso regression and ridge regression, but the core of those advanced variants is still this same model.
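A minimal sketch with scikit-learn on hypothetical synthetic data; note that X must be 2-D with shape (n_samples, n_features):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: y = 1.2x - 0.5 plus noise.
rng = np.random.default_rng(6)
x = np.linspace(0, 10, 40)
y = 1.2 * x - 0.5 + rng.normal(scale=0.3, size=x.size)

# reshape(-1, 1) turns the 1-D x into a single-feature 2-D array.
model = LinearRegression().fit(x.reshape(-1, 1), y)
print(model.coef_[0], model.intercept_)
```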
An efficiency contest between the eight methods
As a data scientist, you should always look for accurate yet fast methods or functions to get your data modeling done. If a method is inherently slow, it will become an execution bottleneck on large datasets.
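As a rough illustration, here is one way such a comparison might be set up with timeit; the dataset size and repetition count below are arbitrary assumptions, not the benchmark behind the original comparison:

```python
import timeit

# Shared setup: one hypothetical dataset reused by every timed statement.
setup = """
import numpy as np
x = np.linspace(0, 10, 10_000)
y = 2.0 * x + 1.0 + np.random.default_rng(0).normal(size=x.size)
X = np.column_stack([np.ones_like(x), x])
"""

for label, stmt in [
    ("polyfit", "np.polyfit(x, y, 1)"),
    ("lstsq",   "np.linalg.lstsq(X, y, rcond=None)"),
    ("pinv",    "np.linalg.pinv(X) @ y"),
]:
    t = timeit.timeit(stmt, setup=setup, number=100)
    print(f"{label}: {t / 100:.6f} s per call")
```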
The simple matrix-inverse solutions turn out to be faster.
As data scientists, we must always explore multiple solutions for analyzing and modeling the same task, and choose the one that best fits the specific problem.