Machine learning classic algorithms explained with Python implementation -- linear regression (Linear Regression) algorithm


(i) Understanding regression

Regression is one of the most powerful tools in statistics.

Supervised learning algorithms in machine learning are divided into classification algorithms and regression algorithms; the distinction is in fact defined by whether the distribution of the class label is discrete or continuous.

As the names imply, classification algorithms are used for predicting discrete distributions: KNN, decision trees, naive Bayes, AdaBoost, SVM, and logistic regression are all classification algorithms. Regression algorithms are used for predicting continuous distributions: given samples with numeric targets, regression can predict a value for a new input. This is an upgrade over classification methods, since it can predict continuous data rather than just discrete class labels.

The goal of regression is to establish a regression equation that predicts the target value, and solving a regression problem means finding the regression coefficients of that equation. Prediction is then quite simple: multiply the regression coefficients by the input values and add everything up to obtain the predicted value.

1. Definition of regression

The simplest definition of regression is: given a point set D, fit the point set with a function so that the error between the point set and the fitted function is minimized. If the fitted curve is a straight line, this is called linear regression; if the fitted curve is a quadratic curve, it is called quadratic regression.
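
As a quick illustration of this definition, here is a minimal sketch using NumPy's polyfit on a made-up point set; fitting with a first-degree polynomial corresponds to linear regression, and fitting with a second-degree polynomial corresponds to quadratic regression:

    import numpy as np

    # A made-up point set D: y is roughly 2*x + 1 plus a little noise
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.1, 2.9, 5.2, 6.8, 9.1, 11.0])

    # Degree-1 fit -> linear regression (a straight line)
    linear_coeffs = np.polyfit(x, y, deg=1)

    # Degree-2 fit -> quadratic regression (a parabola)
    quadratic_coeffs = np.polyfit(x, y, deg=2)

    print("linear fit   :", linear_coeffs)     # [slope, intercept]
    print("quadratic fit:", quadratic_coeffs)  # [a, b, c] for a*x^2 + b*x + c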

2. Multivariate linear regression

Assuming that the functional relationship between the predicted value and the sample features is linear, the task of regression analysis is to estimate, from the observed sample values x and y, the function h that approximates the functional relationship between the variables. It is defined as:

    h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n
where n is the number of features, and x_j is the value of the j-th feature of each training sample, i.e. the j-th component of the feature vector.

For convenience, let x_0 = 1. Then multivariate linear regression can be written as:

    h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j = \theta^T x

where θ and x are (n+1, 1) column vectors.
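
A minimal sketch of this hypothesis function (the numbers are made up for illustration): once x_0 = 1 is prepended to the feature vector, the prediction is simply the dot product of θ and x:

    import numpy as np

    def predict(theta, x):
        """h_theta(x) = theta^T x, where x already includes x0 = 1."""
        return float(np.dot(theta, x))

    # Hypothetical example with n = 2 features, so theta and x have n+1 = 3 entries
    theta = np.array([0.5, 2.0, -1.0])   # [theta0, theta1, theta2]
    x = np.array([1.0, 3.0, 4.0])        # [x0 = 1, x1, x2]
    print(predict(theta, x))             # 0.5 + 2.0*3.0 - 1.0*4.0 = 2.5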

Note: "multivariate" and "degree" are two different concepts. "Multivariate" means the equation has multiple variables, while "degree" refers to the highest power to which the variables are raised. A multivariate linear regression assumes that the predicted value y and all the sample feature values satisfy a multivariate first-degree (linear) equation.

3. Generalized linear regression

Using a generalized linear function:

    h(x) = \sum_{j} w_j \phi_j(x) = w^T \phi(x)

Here w_j are the coefficients and w is the coefficient vector; they determine how much each φ_j(x) contributes to the regression function in its dimension. φ(x) can be replaced by different functions; a model of this form is called a generalized linear model, and when φ(x) = x it is exactly the multivariate linear regression model.
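
A minimal sketch of this idea (the basis functions and data below are illustrative, not taken from the original package): swapping φ(x) changes the shape of the model while the fit remains linear in the coefficients w:

    import numpy as np

    def design_matrix(x, phi):
        """Build the design matrix whose rows are phi(x_i) for a 1-D input vector x."""
        return np.array([phi(xi) for xi in x])

    # phi(x) = [1, x] gives ordinary linear regression;
    # a polynomial phi gives polynomial regression, but the model is still linear in w.
    phi_linear = lambda xi: [1.0, xi]
    phi_poly = lambda xi: [1.0, xi, xi ** 2]

    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([1.0, 2.1, 5.2, 9.8])

    for phi in (phi_linear, phi_poly):
        X = design_matrix(x, phi)
        # Least-squares solution of y ~ X w
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        print(w)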

(ii) Solving linear regression

When people speak of regression, they usually mean linear regression, so this article focuses on solving the multivariate linear regression equation. Suppose we have samples with a continuous-valued label y and features x = {x_1, x_2, ..., x_n}; regression then means solving for the regression coefficients θ = (θ_0, θ_1, ..., θ_n). So, given x and y, how do we find θ? In regression, the usual way to obtain the optimal regression coefficients is to minimize the sum of squared errors.

The error here is the difference between the predicted y value and the true y value. Simply accumulating the errors would let positive and negative errors cancel each other out, so the squared error (least squares) is used instead.

The squared error can be written as:

    \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2
As for why the sum of squared errors is minimized, the statistical rationale can be found in the "Deep into linear regression" section of "The concept learning of linear regression, logistic regression and various regression".

Mathematically, the solution process is transformed into finding the set of θ values that minimizes this sum, and the available solution methods include gradient descent, the normal equation, and so on. Gradient descent has the following characteristics: a step size α must be chosen in advance, multiple iterations are required, and the features must be scaled (brought to the same scale range), so it is relatively complicated. Another solution that requires no iteration is the normal equation, which is simple and convenient and needs no feature scaling. However, the normal equation method has to compute the transpose and the inverse involving X, which is computationally expensive, so it becomes slow when the number of features is large; it is only suitable when the number of features is below about 100,000, and the gradient method should be used when the number of features exceeds 100,000. In addition, when X^T X is not invertible, there is the ridge regression algorithm.

Here's a summary of some of the algorithms that are often used.

1. Gradient Descent method (Gradient descent)

Based on the squared error, the loss function (cost function) of the linear regression model is defined as:

    J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

(the 1/2 coefficient is only there to simplify the derivative).

The loss function of linear regression, viewed as a function of the regression coefficients θ, is bowl-shaped with a single minimum point. The solution process of linear regression is similar to that of logistic regression; the difference is the learning model function h_θ(x). For the detailed solution process of the gradient method, see "Machine learning classic algorithms explained with Python implementation -- logistic regression (LR) classifier".
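
A minimal batch gradient descent sketch for the cost function above (the step size, iteration count, and sample data are arbitrary illustrative choices, not the original article's code):

    import numpy as np

    def gradient_descent(X, y, alpha=0.01, num_iters=1000):
        """Batch gradient descent for linear regression.

        X : (m, n+1) array whose first column is all ones (x0 = 1)
        y : (m,) target vector
        Returns the fitted coefficient vector theta of shape (n+1,).
        """
        m, n_plus_1 = X.shape
        theta = np.zeros(n_plus_1)
        for _ in range(num_iters):
            error = X.dot(theta) - y          # h_theta(x) - y for every sample
            gradient = X.T.dot(error) / m     # partial derivatives of J(theta)
            theta -= alpha * gradient         # step toward the single minimum of the bowl
        return theta

    # Tiny made-up example where y is roughly 1 + 2*x
    X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([1.0, 3.1, 4.9, 7.0])
    print(gradient_descent(X, y, alpha=0.1, num_iters=5000))   # close to [1.0, 2.0]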

2. Normal equation (ordinary least squares)

The normal equation algorithm is also called ordinary least squares. Its characteristic is that, for a given matrix X, assuming the inverse of X^T X exists, the coefficients can be solved for directly with this method.

Its derivation is also very simple: since the goal is to minimize the sum of squared errors, set the derivative to 0 and solve for the regression coefficients:

    \theta = (X^T X)^{-1} X^T y

Here the matrix X is an (m, n+1) matrix (m is the number of samples, n the number of features per sample), and y is an (m, 1) column vector.

The formula above involves (X^T X)^{-1}, i.e. a matrix inversion, so this equation only applies when the inverse matrix exists. However, the inverse may not exist; the "Ridge regression" section below discusses how to handle that case.
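
A minimal sketch of the normal equation solution (the data is made up; as stated above, it assumes X^T X is invertible):

    import numpy as np

    def normal_equation(X, y):
        """Ordinary least squares: theta = (X^T X)^{-1} X^T y."""
        xtx = X.T.dot(X)
        if np.linalg.det(xtx) == 0.0:
            raise ValueError("X^T X is singular; consider ridge regression instead")
        return np.linalg.inv(xtx).dot(X.T).dot(y)

    X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([1.0, 3.1, 4.9, 7.0])
    print(normal_equation(X, y))   # roughly [1.0, 2.0]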

3. Locally weighted linear regression

One problem with linear regression is that it may under-fit, because it seeks the unbiased estimate with minimum mean squared error, and an under-fit model obviously cannot achieve the best prediction results. Some methods therefore allow some bias into the estimate in order to reduce the mean squared error of the prediction. One such method is locally weighted linear regression (LWLR). In this algorithm, each point near the query point is given a certain weight, and the formula becomes:

    \theta = (X^T W X)^{-1} X^T W y

where W is an (m, m) matrix and m is the number of samples.

LWLR uses a "kernel" (similar to a kernel in a support vector machine) to give higher weights to nearby points.

The kernel type can be chosen freely; the most commonly used kernel is the Gaussian kernel, whose corresponding weights are:

    w(i, i) = \exp\left( -\frac{\lVert x^{(i)} - x \rVert^2}{2 k^2} \right)

where k needs to be tuned.
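
A minimal LWLR sketch built from the two formulas above (the function and variable names are illustrative; k is the Gaussian kernel bandwidth that must be tuned):

    import numpy as np

    def lwlr(query_point, X, y, k=1.0):
        """Locally weighted linear regression prediction for a single query point.

        query_point : (n+1,) feature vector of the point to predict (first entry 1.0)
        X           : (m, n+1) training matrix;  y : (m,) training targets
        """
        m = X.shape[0]
        W = np.eye(m)
        for i in range(m):
            diff = query_point - X[i]
            W[i, i] = np.exp(-diff.dot(diff) / (2.0 * k ** 2))   # Gaussian kernel weight
        xtwx = X.T.dot(W).dot(X)
        if np.linalg.det(xtwx) == 0.0:
            raise ValueError("X^T W X is singular; cannot invert")
        theta = np.linalg.inv(xtwx).dot(X.T).dot(W).dot(y)
        return query_point.dot(theta)

    X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([1.0, 3.1, 4.9, 7.0])
    print(lwlr(np.array([1.0, 1.5]), X, y, k=0.5))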

Locally weighted linear regression also has a problem: it increases the amount of computation, because making a prediction at each point requires using the entire data set, rather than computing the regression coefficients once and then simply plugging new inputs into the resulting regression equation. For this reason the algorithm is not recommended.

4. Ridge regression and shrinkage methods

When the number of samples in the data is smaller than the number of features, the inverse of X^T X cannot be computed directly. Even when the number of samples exceeds the number of features, the inverse of X^T X may still be impossible to compute directly, because some features may be highly correlated. In that case, ridge regression can be considered: even though X^T X cannot be inverted, ridge regression still guarantees that regression parameters can be obtained. Simply put, ridge regression corrects the matrix X^T X into X^T X + λI (where I is the identity matrix, with 1s on the diagonal and 0s elsewhere) so that the matrix becomes non-singular and can be inverted. In this case, the formula for the regression coefficients becomes:

    \theta = (X^T X + \lambda I)^{-1} X^T y

In order to use ridge regression and the shrinkage technique, the features must first be standardized so that every feature has the same value scale and therefore the same influence.

How should the value of λ be set? By repeatedly training with different values of λ and finally choosing the λ that minimizes the prediction error. The best value can be obtained through cross-validation: pick the λ that minimizes the sum of squared errors on the test data.

Ridge regression was first used to handle the case of more features than samples, and is now also used to add bias to the estimate in order to obtain better predictions. In fact, the formula above is obtained by introducing a penalty factor for each feature into the sum-of-squared-errors formula, in order to prevent over-fitting (an overly complex model); a penalty term for each feature is added to the loss function. This is the regularization of linear regression (see the Coursera public course notes: Stanford University machine learning lesson 7, "Regularization"):

    J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]
Note: θ0 corresponds to the constant term (x0 = 1 is fixed), so θ0 does not need a penalty factor; accordingly, the first diagonal element of I in the ridge regression formula should be set to 0.
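
A minimal ridge regression sketch following the formula and the note above (it assumes the features have already been standardized and that X carries a leading column of ones, whose corresponding diagonal entry of I is zeroed so that θ0 is not penalized; the λ-selection helper is a crude stand-in for cross-validation, and all names are illustrative):

    import numpy as np

    def ridge_regression(X, y, lam=0.2):
        """Ridge regression: theta = (X^T X + lambda * I)^{-1} X^T y."""
        n_plus_1 = X.shape[1]
        I = np.eye(n_plus_1)
        I[0, 0] = 0.0                              # do not penalize the constant term theta0
        denom = X.T.dot(X) + lam * I
        if np.linalg.det(denom) == 0.0:
            raise ValueError("matrix is still singular; try a larger lambda")
        return np.linalg.inv(denom).dot(X.T).dot(y)

    def pick_lambda(X_train, y_train, X_test, y_test, candidates=(0.01, 0.1, 1.0, 10.0)):
        """Return the candidate lambda with the smallest squared error on held-out data."""
        scored = []
        for lam in candidates:
            theta = ridge_regression(X_train, y_train, lam)
            err = np.sum((X_test.dot(theta) - y_test) ** 2)
            scored.append((err, lam))
        return min(scored)[1]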

In other words, by introducing the penalty term λ to constrain the sum of squared errors, unimportant parameters are reduced; in statistics this technique is also called shrinkage. Shrinkage can eliminate unimportant parameters and therefore leads to a better understanding of the data. In addition, compared with plain linear regression, shrinkage can achieve better prediction results. Shrinkage can also be seen as a compromise, in fitting a data model, between bias (the difference between predictions and true values) and variance (the difference between the predictions of different models): some bias is added in exchange for a reduction in variance.

The bias-variance tradeoff is an important concept; it can help us understand existing models and improve them in order to obtain better models. Ridge regression is one kind of shrinkage method; it is equivalent to constraining the size of the regression coefficients.

Another very good shrinkage method is the lasso. The lasso is difficult to solve directly, but an approximate result can be obtained with the simpler forward stagewise (stepwise) linear regression method. Other shrinkage methods, such as the lasso, LAR, PCA regression, and subset selection, can, like ridge regression, both improve prediction accuracy and make the regression coefficients easier to interpret.

5. Regression model performance metrics

The regression equation computed on a data set is not necessarily optimal. The quality of a regression equation can be measured by the correlation between the predicted values yHat and the original values y.

The correlation coefficient lies between -1 and 1; the closer it is to 1, the better the performance of the regression model.
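
A minimal sketch of this metric (the helper name is illustrative), using NumPy's corrcoef, which returns a 2x2 correlation matrix whose off-diagonal entry is the correlation between the two vectors:

    import numpy as np

    def regression_score(y_hat, y):
        """Correlation between predictions y_hat and true values y (closer to 1 is better)."""
        return np.corrcoef(y_hat, y)[0, 1]

    y = np.array([1.0, 3.1, 4.9, 7.0])
    y_hat = np.array([1.1, 3.0, 5.0, 6.8])
    print(regression_score(y_hat, y))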

Linear regression assumes that the relationship between the value label and the feature values is linear, but sometimes the relationships in the data are more complicated and a linear model fits them poorly. In that case polynomial curve regression (multivariate polynomial fitting) or other regression models, such as a regression tree, need to be introduced.

(iii) Python implementation of linear regression

Ordinary least squares and ridge regression are implemented in this linear regression learning package; since the gradient method is almost the same as in logistic regression, and there was no sample with more than 10,000 features available to test its running speed, it is not implemented. In order to support multiple solution methods, and to make it easy to extend with other solvers, the LinearRegress object uses a dict to store the relevant parameters (the solution method is the key, and a list of the regression coefficients and other related parameters is the value).

For example, for the ridge regression algorithm, the lrDict entry has key='Ridge' and value=[ws, lambda, xMean, var, yMean]. Because ridge regression requires feature scaling of the samples during both training and prediction, xMean, var, and yMean need to be stored. The attributes of the LinearRegress object can be seen from its __init__ function:

Source code:

    import inspect
    from numpy import mat

    class LinearRegress(object):
        def __init__(self, lrDict=None, **args):
            '''Currently supports OLS, Ridge, LWLR.'''
            # Record the name of the variable this object is bound to in the calling code
            obj_list = inspect.stack()[1][-2]
            self.__name__ = obj_list[0].split('=')[0].strip()
            self.lrDict = lrDict if lrDict else {}
            # Convert the stored parameters to numpy matrices
            if 'OLS' in self.lrDict:
                self.lrDict['OLS'] = mat(self.lrDict['OLS'])
            if 'Ridge' in self.lrDict:
                self.lrDict['Ridge'][0] = mat(self.lrDict['Ridge'][0])   # ws
                self.lrDict['Ridge'][2] = mat(self.lrDict['Ridge'][2])   # xMean
                self.lrDict['Ridge'][3] = mat(self.lrDict['Ridge'][3])   # var
                self.lrDict['Ridge'][4] = mat(self.lrDict['Ridge'][4])   # yMean

The Python learning package for the linear regression model is:

Machine learning Linear regression-linear regression

(iv) Application and model tuning

To predict a value from a combination of features (for example, predicting house prices or vegetable prices), where the relationship between the predicted value and the feature combination is linear, linear regression can be used to build the prediction model. After a model has been built with a machine learning algorithm, it needs to be continuously tuned and revised in use. For linear regression, the best model strikes a balance between prediction bias and model variance (high bias means under-fitting, high variance means over-fitting). Methods for tuning and correcting a linear regression model include:

- Get more training samples -- addresses high variance

- Try a smaller set of features -- addresses high variance

- Try obtaining additional features -- addresses high bias

- Try adding polynomial or combined features -- addresses high bias

- Try decreasing λ -- addresses high bias

- Try increasing λ -- addresses high variance

For details, see "Stanford University machine learning lesson 10: Advice for applying machine learning".

References:

Linear regression and logistic regression

Getting Started with machine learning: linear regression and gradient descent

The concept learning of linear regression, logistic regression and various regression

Coursera public course notes: Stanford University machine learning lesson 7, "Regularization"

Review of algorithms frequently used in feature selection

The author of this article is Adan. Source: Machine learning classic algorithms explained with Python implementation -- linear regression (Linear Regression) algorithm.

Please indicate the source when reprinting.

