Machine Learning in Action -- Regression


Machine learning problems are commonly divided into classification problems and regression problems.
Regression is used to predict continuous values; classification, by contrast, predicts discrete categories.

As for why this kind of problem is called "regression", the name is essentially a historical convention and does not carry much meaning on its own.
Similarly, logistic regression is so named even though it clearly solves a classification problem and has nothing to do with "logic".

The simplest form of regression is linear regression: fitting the data points with a straight line.

We usually use the squared error as the objective function, which is known as Ordinary Least Squares (OLS). For more background, see Andrew Ng's lecture notes.

Gradient descent can be used to solve this problem, but it is even simpler to note that the problem has a closed-form solution that can be obtained directly.

The objective function can be expressed as

J(w) = \sum_{i=1}^{m} \left( y_i - x_i^T w \right)^2 = (y - Xw)^T (y - Xw)

Taking the derivative with respect to w gives

\frac{\partial J}{\partial w} = -2 X^T (y - Xw)

Setting the derivative to 0, we can solve for w:

\hat{w} = (X^T X)^{-1} X^T y

Source code,

from numpy import *

def standRegres(xArr, yArr):
    xMat = mat(xArr); yMat = mat(yArr).T
    xTx = xMat.T * xMat
    if linalg.det(xTx) == 0.0:     # a determinant of 0 means the matrix is singular and has no inverse
        print("This matrix is singular, cannot do inverse")
        return
    ws = xTx.I * (xMat.T * yMat)   # closed-form solution: (X^T X)^-1 X^T y
    return ws
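As a quick check, the function above can be run on a small synthetic data set. The snippet below is only an illustrative sketch; the data and variable names are mine, not from the original post:

from numpy import *
from numpy.random import normal

# synthetic data: y = 3.0 + 1.7*x plus a little noise; the first column of X is the constant term
xArr = [[1.0, x1] for x1 in arange(0.0, 1.0, 0.01)]
yArr = [3.0 + 1.7 * row[1] + normal(0, 0.1) for row in xArr]

ws = standRegres(xArr, yArr)
print(ws)                      # the estimate should be close to [[3.0], [1.7]]
yHat = mat(xArr) * ws          # fitted values on the training points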

Linear regression is a typical high-bias, low-variance model, since it is about the simplest model you can use.
As a result, it is prone to underfitting.

 

Locally Weighted Linear Regression

Linear regression gives the unbiased estimate with minimum mean squared error. Locally weighted regression can be viewed as introducing some bias into the estimate in order to reduce the mean squared error of the predictions.

In fact, local weighting amounts to a choice over the training set (only part of the samples effectively contributes, which is where the bias comes from): the training samples closest to the current query point are the ones used for the linear regression. For more details on the algorithm, see Andrew Ng's lecture notes.

How are the samples chosen? A weight w^{(i)} is attached to each training sample.

The basic idea is that the closer a training sample is to the query point, the larger its weight. This is usually expressed with a Gaussian kernel:

w^{(i)} = \exp\left( -\frac{\left\| x^{(i)} - x \right\|^2}{2k^2} \right)

where k controls how wide a range of training samples receives a significant weight.

Source code,

def lwlr(testPoint, xArr, yArr, k=1.0):
    xMat = mat(xArr); yMat = mat(yArr).T
    m = shape(xMat)[0]
    weights = mat(eye(m))                      # weights initialized as an identity matrix
    for j in range(m):
        diffMat = testPoint - xMat[j, :]       # difference between the query point and each sample
        weights[j, j] = exp(diffMat * diffMat.T / (-2.0 * k ** 2))   # Gaussian kernel weight for each sample
    xTx = xMat.T * (weights * xMat)
    if linalg.det(xTx) == 0.0:
        print("This matrix is singular, cannot do inverse")
        return
    ws = xTx.I * (xMat.T * (weights * yMat))
    return testPoint * ws
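To predict for a whole set of points, lwlr is wrapped in a small driver loop; this is a sketch in the style of the book's lwlrTest, reconstructed here rather than copied:

def lwlrTest(testArr, xArr, yArr, k=1.0):
    # run locally weighted regression once for every query point in testArr
    m = shape(testArr)[0]
    yHat = zeros(m)
    for i in range(m):
        yHat[i] = lwlr(testArr[i], xArr, yArr, k)   # every prediction re-weights the full training set
    return yHat

# e.g. yHat = lwlrTest(xArr, xArr, yArr, k=0.01)  -- a small k fits the training data very closely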

The only difference from plain linear regression is the extra weight computation. Note that the weight matrix is diagonal: each sample's weight sits on the diagonal, and every other entry is 0.

Local weighting can indeed solve the underfitting problem; how closely it fits depends on the value of k.

However, if k is chosen too small, it causes overfitting instead.

The drawback of this algorithm is that it is non-parametric: the complete training set has to be kept around, and the entire training set is traversed for every single prediction.

 

Ridge Regression

As described above, local weighting introduces bias by effectively selecting a subset of the training samples, which solves the underfitting problem.

Next, let's look at another kind of problem.

When solving linear regression we need to compute

\hat{w} = (X^T X)^{-1} X^T y

In some cases, however, X^T X is not invertible; we have seen this problem before with matrix factorization.

For example, this happens when the number of samples is smaller than the number of features, or when X is not of full rank because some columns are linearly dependent, such as x_1 = 2 x_2.
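A tiny illustration of this (my own example, not from the original post): with two perfectly collinear columns, the determinant of X^T X is 0 and the inverse does not exist.

from numpy import *

# the second column is exactly twice the first, so X is not of full rank
xMat = mat([[1.0, 2.0],
            [2.0, 4.0],
            [3.0, 6.0]])
print(linalg.det(xMat.T * xMat))   # prints 0.0 -- the normal equations cannot be solved by inversion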

The solution to this problem is to shrink the features, that is, to effectively select a subset of the features, which again introduces some bias.

One such method is ridge regression.
The objective function of ridge regression adds a penalty term:

J(w) = \sum_{i=1}^{m} \left( y_i - x_i^T w \right)^2 + \lambda \sum_{j=1}^{n} w_j^2

The first part is just the expanded (y - Xw)^T (y - Xw) from ordinary linear regression.
The second part is the key. To minimize the whole expression, the penalty term on its own would ideally be 0, i.e. all parameters would be 0. Adding the penalty therefore pushes the parameters of unimportant features towards 0, which reduces the effective number of features.
\lambda controls the model complexity: the larger \lambda is, the stronger the shrinkage, i.e. the more the parameters are driven towards 0.

The same idea can also be written as a constraint on the parameters:

\sum_{j=1}^{n} w_j^2 \le t

Solving ridge regression gives the closed form

\hat{w} = (X^T X + \lambda I)^{-1} X^T y

Source code,

def ridgeRegres(xMat, yMat, lam=0.2):
    xTx = xMat.T * xMat
    denom = xTx + eye(shape(xMat)[1]) * lam    # add the penalty term lambda * I
    if linalg.det(denom) == 0.0:
        print("This matrix is singular, cannot do inverse")
        return
    ws = denom.I * (xMat.T * yMat)             # (X^T X + lambda*I)^-1 X^T y
    return ws

Note that shrinkage methods, just like PCA and factor analysis, require all features to be normalized first; otherwise features on different scales cannot be compared fairly.

The driver code below shows how to call the ridge regression function and how the normalization is done.

def ridgeTest(xArr, yArr):
    xMat = mat(xArr); yMat = mat(yArr).T
    yMean = mean(yMat, 0)
    yMat = yMat - yMean                        # center the targets
    xMeans = mean(xMat, 0)
    xVar = var(xMat, 0)
    xMat = (xMat - xMeans) / xVar              # features have different scales, so divide by the variance
    numTestPts = 30
    wMat = zeros((numTestPts, shape(xMat)[1]))
    for i in range(numTestPts):
        ws = ridgeRegres(xMat, yMat, exp(i - 10))   # lambda varies exponentially
        wMat[i, :] = ws.T
    return wMat

Thirty different values of lambda are tried here, varying exponentially, so you can see how both extremely small and extremely large values of lambda affect the result.

On the left of the resulting plot, where lambda is very small, the coefficients undergo almost no shrinkage and are essentially the same as the ordinary linear regression values.
On the right, where lambda is very large, all coefficients are shrunk to 0.
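To see this, the 30 weight vectors returned by ridgeTest can be plotted. A minimal matplotlib sketch (the plotting code is mine; xArr and yArr stand for your training data):

import matplotlib.pyplot as plt

wMat = ridgeTest(xArr, yArr)       # one row of coefficients per lambda = exp(i - 10), i = 0..29
plt.plot(wMat)                     # one curve per feature coefficient
plt.xlabel('i  (lambda = exp(i - 10))')
plt.ylabel('regression coefficient')
plt.show()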

So cross-validation is needed to find an appropriate lambda value somewhere in between.
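One possible way to do this is repeated random train/validation splits over the same lambda grid used in ridgeTest. The sketch below is my own and only illustrates the idea; it assumes xArr contains real-valued features with no constant column and reuses ridgeTest from above.

from numpy import *
from numpy.random import shuffle

def ridgeCV(xArr, yArr, numSplits=10):
    # pick the lambda (indexed as exp(i - 10)) with the lowest accumulated validation error
    xMat = mat(xArr); yMat = mat(yArr).T
    m = shape(xMat)[0]
    indexList = list(range(m))
    numLam = 30
    errors = zeros(numLam)
    for split in range(numSplits):
        shuffle(indexList)                         # random 90/10 train/validation split
        cut = int(m * 0.9)
        trainIdx, testIdx = indexList[:cut], indexList[cut:]
        xTrain, yTrain = xMat[trainIdx], yMat[trainIdx]
        xTest, yTest = xMat[testIdx], yMat[testIdx]
        xMean = mean(xTrain, 0); xVar = var(xTrain, 0)
        yMean = mean(yTrain, 0)
        wMat = ridgeTest(xTrain, yTrain.T.A)       # one weight vector per lambda
        for i in range(numLam):
            xTestNorm = (xTest - xMean) / xVar     # normalize with the training statistics
            yHat = xTestNorm * mat(wMat[i, :]).T + yMean
            errors[i] += sum(power(yHat.A - yTest.A, 2))
    bestIndex = argmin(errors)
    return exp(bestIndex - 10)                     # the lambda with the lowest validation error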

 

Lasso

Lasso is similar to ridge regression: it replaces the L2 penalty of ridge regression with an L1 penalty, where L1 and L2 refer to the absolute value and the square, respectively.

So the lasso constraint can be expressed as

\sum_{j=1}^{n} |w_j| \le t

You may ask: how big a difference can it really make to go from squares to absolute values?

The answer is that when t is small enough, the lasso constraint tends to drive some feature parameters exactly to 0, not just close to 0, so the shrinkage works better.
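If you just want to see this effect without solving the lasso yourself, scikit-learn's Lasso (not part of the original post) makes it visible: with a sufficiently strong penalty, several coefficients come out as exactly 0.0.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
# only features 0 and 3 actually matter
y = 2.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(scale=0.1, size=100)

model = Lasso(alpha=0.5).fit(X, y)
print(model.coef_)   # most entries are exactly 0.0; ridge would only make them small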

The squared constraint is smooth and leads to a closed-form solution, whereas the absolute-value constraint is not differentiable at 0, which makes the computation much harder (my personal understanding).
So lasso is hard to solve directly.

Instead, an approximate method is introduced.

Forward stagewise Regression

This is a greedy algorithm: at each step one parameter is adjusted by a small amount, and the change is kept if it reduces the error; by iterating again and again, a set of parameters close to the optimum is found.

def stageWise(xArr, yArr, eps=0.01, numIt=100):
    xMat = mat(xArr); yMat = mat(yArr).T
    yMean = mean(yMat, 0)
    yMat = yMat - yMean
    xMat = regularize(xMat)                    # standardize the features
    m, n = shape(xMat)
    returnMat = zeros((numIt, n))              # record the weights at every iteration
    ws = zeros((n, 1)); wsTest = ws.copy(); wsMax = ws.copy()   # weights initialized to 0
    for i in range(numIt):                     # number of iterations
        print(ws.T)
        lowestError = inf                      # lowest error initialized to infinity
        for j in range(n):                     # for each feature
            for sign in [-1, 1]:               # try decreasing and increasing the weight
                wsTest = ws.copy()
                wsTest[j] += eps * sign        # change the weight by a small step
                yTest = xMat * wsTest          # compute the predictions
                rssE = rssError(yMat.A, yTest.A)   # squared prediction error
                if rssE < lowestError:
                    lowestError = rssE
                    wsMax = wsTest
        ws = wsMax.copy()
        returnMat[i, :] = ws.T
    return returnMat
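The code above relies on two small helpers, regularize and rssError. Minimal versions consistent with how they are used here (reconstructed, so treat them as a sketch):

def rssError(yArr, yHatArr):
    # residual sum of squared errors between the targets and the predictions
    return ((yArr - yHatArr) ** 2).sum()

def regularize(xMat):
    # standardize every feature to zero mean and unit variance
    inMat = xMat.copy()
    inMeans = mean(inMat, 0)
    inVar = var(inMat, 0)
    return (inMat - inMeans) / inVar

# e.g. weightHistory = stageWise(xArr, yArr, eps=0.005, numIt=1000)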

Different values of eps (the step size) can be tried to find a more suitable one.

With enough iterations, this method produces results close to those of the lasso.

Another big advantage of this method is that it helps you understand the current model and easily spot unimportant features.

In the iterations shown, the parameters of the second and seventh features stay at 0 throughout, indicating that these two features contribute nothing to reducing the error.
