Classical Machine Learning Algorithms and Their Python Implementation: Linear Regression


(i) Understanding regression

Regression is one of the most powerful tools in statistics. Supervised learning algorithms in machine learning are divided into classification algorithms and regression algorithms, according to whether the class label is discrete or continuous. As the name implies, classification algorithms predict discrete labels; KNN, decision trees, naive Bayes, AdaBoost, SVM and logistic regression are all classification algorithms. Regression algorithms predict continuous values: for numerical samples, regression predicts a value for a given input. This can be seen as an extension of classification, because it predicts continuous data rather than just discrete class labels.

The goal of regression is to build a regression equation that predicts the target value, and the core of a regression method is finding the regression coefficients of that equation. Prediction itself is then very simple: multiply each regression coefficient by the corresponding input value and add the products together to obtain the predicted value.
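
As a minimal illustration (the coefficients and input values below are made up, not taken from the article), the prediction is just a dot product between the coefficient vector and the feature vector:

import numpy as np

# Hypothetical regression coefficients [theta0, theta1, theta2] and one input
# sample [1, x1, x2]; the leading 1 pairs with the intercept theta0.
theta = np.array([2.0, 0.5, -1.5])
x = np.array([1.0, 3.0, 2.0])

y_hat = np.dot(theta, x)  # 2.0 + 0.5*3.0 - 1.5*2.0 = 0.5
print(y_hat)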

1, definition of regression

The simplest definition of regression: given a point set D, fit the points with a function so that the error between the point set and the fitted function is minimized. If the fitted curve is a straight line, this is called linear regression; if the curve is quadratic, it is called quadratic regression.

2, multivariate linear regression

Assuming that the relationship between the predicted value and the sample features is linear, the task of regression analysis is to estimate the function h from the observed samples X and y, i.e., to find an approximate functional relationship between the variables. It is defined as:

hθ(x) = θ0·x0 + θ1·x1 + θ2·x2 + ... + θn·xn

where n = number of features;

xj = the value of the j-th feature of a training sample, i.e., the j-th component of its feature vector.

For convenience, let x0 = 1. The multivariate linear regression model can then be written as:

hθ(x) = θᵀx, where θ and x are both (n+1, 1) column vectors.

Note: "multivariate" and "higher-degree" are two different concepts. "Multivariate" means the equation contains multiple variables, while the degree refers to the highest power to which those variables are raised. The multivariate linear equation assumes that the predicted value y is a linear function of all of the sample's feature values.

3, Generalized linear regression

A generalized linear model has the form:

h(x) = Σ_j wj·φj(x) = wᵀφ(x)

where wj is a coefficient and w is the vector of coefficients; it controls how much each basis function φj(x) contributes to the regression function. φ(x) can be replaced by different functions; such a model is called a generalized linear model, and φ(x) = x recovers the multivariate linear regression model.
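
As a brief sketch of this idea (the quadratic basis and toy data here are assumptions for illustration, not part of the original learning package), a basis φ(x) = [1, x, x²] can be fitted by ordinary least squares:

import numpy as np

# A generalized linear model: y is linear in the basis functions phi_j(x),
# not necessarily in x itself.  Here phi(x) = [1, x, x^2] (a quadratic basis),
# but any basis could be substituted.
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + 0.05 * np.random.randn(20)  # toy data

Phi = np.column_stack([np.ones_like(x), x, x ** 2])  # design matrix of phi_j(x)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)          # least-squares fit of w
print(w)  # approximately [1, 2, -3]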

(ii) Solving linear regression

When people speak of regression they usually mean linear regression, so this article describes how to solve the multivariate linear regression equation. Suppose we have samples with a continuous-valued label y and features x = {x1, x2, ..., xn}; regression means solving for the regression coefficients θ = θ0, θ1, ..., θn. So, given X and y, how do we find θ? The best regression coefficients are found by minimizing the sum of squared errors, where the error is the difference between the predicted y value and the true y value. Simply summing the errors would let positive and negative errors cancel out, so the squared error (least squares) is used instead. The sum of squared errors can be written as:

Σ_i (y(i) − x(i)ᵀθ)²   (summing over the m training samples)

For the statistical rationale behind minimizing the squared error, see the "Deep linear regression" section of "The concept learning of linear regression, logistic regression and various regression".
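
As a quick, self-contained illustration (toy data and a guessed θ, purely hypothetical), the quantity being minimized can be computed directly:

import numpy as np

# Sum of squared errors for a candidate theta on a toy dataset
# (a column of ones is added for the intercept term theta0).
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.1, 1.9, 3.2])
theta = np.array([0.0, 1.0])  # a guess to be evaluated

errors = y - X @ theta
sse = np.sum(errors ** 2)     # the quantity regression tries to minimize
print(sse)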

Mathematically, the problem becomes finding the θ that minimizes this quantity, and the available methods include gradient descent and the normal equation. Gradient descent requires pre-selecting a step size α, iterating many times, and scaling the features to a common range (a sketch of such scaling follows below), so it is somewhat more involved. The normal equation needs no iteration and no feature scaling, which makes it simple and convenient, but it must compute the transpose and inverse related to X, which is computationally expensive and becomes very slow when the number of features is large. It is therefore suitable when the number of features is below about 100,000, while gradient descent is preferred when the number of features exceeds 100,000. In addition, when XᵀX is not invertible, the ridge regression algorithm can be used.
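
A small sketch of the feature scaling mentioned above (standardizing to zero mean and unit standard deviation is one common choice; the helper name is made up for illustration):

import numpy as np

# Bring every feature to a comparable range before running gradient descent.
def scale_features(X):
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0          # guard against constant columns
    return (X - mean) / std, mean, std

X_scaled, mean, std = scale_features(np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]]))
print(X_scaled)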

The following is a summary of several commonly used algorithms.

1, Gradient descent method (Gradient descent)

Based on the squared error, the loss function (cost function) of the linear regression model is defined as:

J(θ) = (1 / 2m) · Σ_i (hθ(x(i)) − y(i))²

(the 1/2 factor is only there to make the derivative cleaner)

As a function of the regression coefficients θ, the loss function of linear regression is bowl-shaped and has a single minimum. The solution process is similar to that of logistic regression; the difference is that the model function hθ(x) is different. For the details of the gradient descent solution, refer to "Machine learning classical algorithms and Python implementation: logistic regression (LR) classifier".
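
The following is a minimal batch gradient descent sketch for this cost function (the learning rate, iteration count and toy data are assumptions for illustration; this is not the learning package's own implementation):

import numpy as np

# Batch gradient descent for linear regression with the 1/(2m) squared-error cost.
def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        grad = X.T @ (X @ theta - y) / m   # gradient of J(theta)
        theta -= alpha * grad
    return theta

X = np.column_stack([np.ones(5), np.arange(5.0)])   # toy data: y is roughly 1 + 2x
y = np.array([1.0, 3.1, 4.9, 7.2, 9.0])
print(gradient_descent(X, y, alpha=0.1, num_iters=5000))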

2, Normal equation (ordinary least squares)

The normal equation method is also called ordinary least squares (OLS). Its characteristic is: given the matrix X, if the inverse of XᵀX exists, the regression coefficients can be obtained directly in closed form. The underlying theory is also simple: since we are minimizing the sum of squared errors, taking the derivative with respect to θ and setting it to zero yields the regression coefficients:

θ = (XᵀX)⁻¹ Xᵀ y

Here the matrix X is an (m, n+1) matrix (m is the number of samples, n is the number of features per sample) and y is an (m, 1) column vector.

The formula involves (XᵀX)⁻¹, i.e., the matrix must be inverted, so the equation applies only when that inverse exists. The inverse may not exist, however; the "Ridge regression" section below discusses how to handle that case.
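
A minimal sketch of the normal-equation solution (toy data; np.linalg.solve is used instead of an explicit inverse, which is a standard numerical choice rather than something prescribed by the article):

import numpy as np

# Ordinary least squares via the normal equation: theta = (X^T X)^-1 X^T y.
def ols(X, y):
    XtX = X.T @ X
    if np.linalg.det(XtX) == 0.0:
        raise ValueError("X^T X is singular; consider ridge regression")
    return np.linalg.solve(XtX, X.T @ y)

X = np.column_stack([np.ones(5), np.arange(5.0)])
y = np.array([1.0, 3.1, 4.9, 7.2, 9.0])
print(ols(X, y))   # roughly [1, 2]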

3, local weighted linear regression

One problem with linear regression is that it may underfit, because it seeks the unbiased estimate with minimum mean squared error. Obviously, an underfit model cannot achieve the best prediction. Some methods therefore allow a little bias into the estimate in order to reduce the mean squared error of the prediction. One such method is locally weighted linear regression (LWLR). In this algorithm, each point near the point to be predicted is given a certain weight, and the solution becomes:

θ = (XᵀWX)⁻¹ XᵀW y, where W is an (m, m) matrix and m is the number of samples.

LWLR uses a "kernel" (similar to the kernels in support vector machines) to give nearby points higher weight. The type of kernel can be chosen freely; the most commonly used one is the Gaussian kernel, whose weights are:

w(i, i) = exp(−‖x(i) − x‖² / (2k²)), where k is a parameter that needs to be tuned.
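
A minimal sketch of LWLR for a single query point, using the Gaussian kernel above (the toy data and the choice k = 0.5 are assumptions for illustration):

import numpy as np

# Locally weighted linear regression: refit the coefficients for each query
# point, weighting training samples by their distance to the query.
def lwlr(query, X, y, k=1.0):
    m = X.shape[0]
    W = np.eye(m)
    for i in range(m):
        diff = query - X[i]
        W[i, i] = np.exp(-(diff @ diff) / (2.0 * k ** 2))
    XtWX = X.T @ W @ X
    if np.linalg.det(XtWX) == 0.0:
        raise ValueError("X^T W X is singular")
    theta = np.linalg.solve(XtWX, X.T @ W @ y)
    return query @ theta

X = np.column_stack([np.ones(5), np.arange(5.0)])
y = np.array([1.0, 3.1, 4.9, 7.2, 9.0])
print(lwlr(np.array([1.0, 2.5]), X, y, k=0.5))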

Local weighted linear regression also has a drawback: it increases the amount of computation, because the entire training set must be used every time a point is predicted, instead of computing the regression coefficients once and then reusing the resulting regression equation. For that reason the algorithm is not recommended here.

4, Ridge regression (ridge regression) and reduction method

When the number of samples is smaller than the number of features, the inverse of XᵀX cannot be computed. Even when there are more samples than features, the inverse may still fail to exist because some features are highly correlated. Ridge regression can be considered in these cases, because it still yields regression parameters when XᵀX cannot be inverted. Put simply, ridge regression corrects XᵀX to XᵀX + λI (where I is the identity matrix, with ones on the diagonal and zeros elsewhere) so that the matrix becomes non-singular and can be inverted. The formula for the regression coefficients then becomes:

θ = (XᵀX + λI)⁻¹ Xᵀ y

To use ridge regression and shrinkage techniques, the features must first be standardized so that every feature has the same scale and therefore the same influence.

How should λ be set? By repeating the training process with different values of λ and choosing the λ that minimizes the prediction error. The best value can be obtained by cross-validation, i.e., the λ that minimizes the sum of squared errors on the test data.
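
A minimal ridge regression sketch under these conventions (features standardized, the target centered, and λ = 0.1 chosen arbitrarily; in practice λ would be selected by cross-validation as described above, and this is not the learning package's own code):

import numpy as np

# Ridge regression: theta = (X^T X + lam*I)^-1 X^T y on standardized features.
def ridge(X, y, lam=0.1):
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=20)

X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize the features
y_ctr = y - y.mean()                           # center the target
print(ridge(X_std, y_ctr, lam=0.1))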

Ridge regression was originally developed to handle the case of more features than samples; it is now also used to add deliberate bias to the estimate in order to obtain a better estimate. In fact, the formula above is obtained by adding a penalty term to the minimum squared-error cost: to prevent overfitting (an overly complex model), a penalty on each coefficient is added to the loss function. This is the regularization of linear regression (see "Coursera public course notes: Stanford University machine learning, lesson seven, Regularization"):

J(θ) = (1 / 2m) · [ Σ_i (hθ(x(i)) − y(i))² + λ · Σ_{j=1..n} θj² ]

Note: θ0 is the constant term and x0 = 1 is fixed, so θ0 needs no penalty factor; accordingly, the first diagonal element of I in the ridge regression formula is 0.
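
A short sketch of this regularized cost (a hypothetical helper, matching the note that θ0 is not penalized):

import numpy as np

# Regularized squared-error cost; the penalty skips theta[0] (the intercept).
def ridge_cost(theta, X, y, lam):
    m = len(y)
    residual = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)   # theta0 is not penalized
    return (residual @ residual + penalty) / (2.0 * m)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.1, 1.9, 3.2])
print(ridge_cost(np.array([0.0, 1.0]), X, y, lam=0.5))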

By introducing λ, the sum of all the squared coefficients is constrained, and imposing this penalty shrinks the unimportant parameters; in statistics this technique is called shrinkage. Shrinkage removes unimportant parameters and therefore helps us understand the data better. Moreover, compared with plain linear regression, shrinkage can achieve better prediction. Shrinkage can also be viewed as a trade-off between the bias of the fitted model (the gap between predicted and true values) and its variance (the gap between models fit to different data): bias increases while variance decreases. The bias-variance trade-off is an important concept that helps us understand existing models and improve them to obtain better ones. Ridge regression is one shrinkage method; it is equivalent to constraining the size of the regression coefficients. Another good shrinkage method is the lasso. The lasso is difficult to solve directly, but an approximate result can be obtained with a stepwise linear regression method that is simple to compute. There are other shrinkage methods as well, such as the lasso, LAR, PCA regression, and subset selection. Like ridge regression, these methods can not only improve prediction accuracy but also make the regression coefficients more interpretable.

5, Regression model performance metrics

A regression equation computed on a data set is not necessarily optimal. The quality of the regression model can be measured by the correlation between the predicted values yHat and the original values y. The correlation coefficient ranges from -1 to 1; the closer it is to 1, the better the regression model performs.
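
For example (toy numbers), NumPy's corrcoef gives this measure directly:

import numpy as np

# Correlation between predictions and true values as a quality measure.
y_true = np.array([1.0, 3.1, 4.9, 7.2, 9.0])
y_hat = np.array([1.1, 3.0, 5.0, 7.0, 9.1])
print(np.corrcoef(y_hat, y_true)[0, 1])   # close to 1 indicates a good fit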

Linear regression assumes that the relationship between the value label and the features is linear, but sometimes the relationship in the data is more complex and a linear model fits it poorly. In that case, polynomial curve regression (higher-degree polynomial fitting) or other regression models, such as regression trees, need to be introduced.

(iii) Python implementations of linear regression

The linear regression learning package implements the ordinary least squares and ridge regression algorithms. Because the gradient method is almost identical to the one used for logistic regression, and no samples with more than 10,000 features were available to test its speed, it is not implemented. To support multiple solution methods and make it easy to add others, the LinearRegress object uses a dict to store the relevant parameters (the solution method is the key, and a list of the regression coefficients and other related parameters is the value). For example, for the ridge regression algorithm the lrDict entry has key='Ridge' and value=[ws, lamba, xmean, var, ymean]. Because the ridge regression model must feature-scale the samples during both training and prediction, xmean, var and ymean need to be stored. The attributes of the LinearRegress object are shown in its __init__ function:

import inspect
from numpy import mat

class LinearRegress(object):
    def __init__(self, lrDict=None, **args):
        '''currently support OLS, Ridge, LWLR'''
        # Record the name of the variable this object was assigned to.
        obj_list = inspect.stack()[1][-2]
        self.__name__ = obj_list[0].split('=')[0].strip()
        if not lrDict:
            self.lrDict = {}
        else:
            self.lrDict = lrDict
        # Convert the stored parameters to numpy matrices.
        if 'OLS' in self.lrDict:
            self.lrDict['OLS'] = mat(self.lrDict['OLS'])
        if 'Ridge' in self.lrDict:
            self.lrDict['Ridge'][0] = mat(self.lrDict['Ridge'][0])  # ws
            self.lrDict['Ridge'][2] = mat(self.lrDict['Ridge'][2])  # xmean
            self.lrDict['Ridge'][3] = mat(self.lrDict['Ridge'][3])  # var
            self.lrDict['Ridge'][4] = mat(self.lrDict['Ridge'][4])  # ymean

The Python learning package for the linear regression model is:

Machine learning Linear regression-linear regression

(iv) Application

Linear regression can be used to build a predictive model that predicts a value from a combination of features, for example predicting house prices or vegetable prices, provided that the relationship between the predicted value and the feature combination is linear.

Reference:

Linear regression and logistic regression

Getting started with machine learning: linear regression and gradient descent

The concept learning of linear regression, logistic regression and various regression

Coursera public course notes: Stanford University machine learning, lesson seven, "Regularization"

A survey of common algorithms for feature selection

The author of this article is Adan, from: Classical Machine Learning Algorithms and Their Python Implementation: Linear Regression. Please credit the source when reprinting.

