"Machine Learning Basics" Support vector regression

Tags: svm

Introduction

This section describes support vector regression (SVR). Earlier, in kernel logistic regression, we used the representer theorem to put logistic regression into kernel form; here we continue along that line and see how the regression problem is combined with the kernel form.

Kernel Ridge Regression

The representer theorem from the last section tells us that when we deal with a linear model with L2 regularization, the optimal solution is a linear combination of the data $z_n$. We can therefore turn such a linear model into a kernel form.

Now that we know the form of the optimal solution for the L2-regularized linear regression model, we can substitute it into the original ridge regression problem and find the best $w$ by solving for the best $\beta$.

We substitute the linear combination $w = \sum_{n=1}^{N}\beta_n z_n$ into the objective, obtain the following formula, and then convert it into a matrix-multiplication form.
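As a sketch of this substitution (using $K$ for the $N\times N$ kernel matrix with entries $K_{nm}=K(x_n,x_m)=z_n^{\mathsf T}z_m$ and $E_{\text{aug}}$ for the regularized error, in the usual notation for this material):

$$
\begin{aligned}
E_{\text{aug}}(\beta)
&= \frac{\lambda}{N}\sum_{n=1}^{N}\sum_{m=1}^{N}\beta_n\beta_m K(x_n,x_m)
 + \frac{1}{N}\sum_{n=1}^{N}\Big(y_n-\sum_{m=1}^{N}\beta_m K(x_n,x_m)\Big)^{2}\\
&= \frac{\lambda}{N}\,\beta^{\mathsf T}K\beta + \frac{1}{N}\,\lVert y - K\beta\rVert^{2}.
\end{aligned}
$$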

The Solution of Kernel Ridge Regression

Above we obtained the loss function of kernel ridge regression. It is an unconstrained optimization problem, so we take the gradient of the loss function $E_{\text{aug}}(\beta)$:

Setting $\nabla E_{\text{aug}}(\beta) = 0$ requires $(\lambda I + K)\beta - y = 0$, which gives the solution $\beta = (\lambda I + K)^{-1}y$. Is $(\lambda I + K)$ necessarily invertible? Since the kernel matrix $K$ is positive semi-definite and $\lambda > 0$, the inverse of $(\lambda I + K)$ must exist.

The time complexity here is roughly $O(N^3)$, and note that most elements of $(\lambda I + K)$ are nonzero, so inverting this large, dense matrix is an expensive computation.
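As a minimal NumPy sketch of this closed-form solution (the function names, the RBF kernel choice, and the toy data are illustrative, not from the original post):

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """Gram matrix of the RBF kernel K(x, x') = exp(-gamma * ||x - x'||^2)."""
    sq_dists = (np.sum(X1 ** 2, axis=1)[:, None]
                + np.sum(X2 ** 2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return np.exp(-gamma * sq_dists)

def fit_kernel_ridge(X, y, lam=1.0, gamma=1.0):
    """Solve (lam*I + K) beta = y; the solve itself costs about O(N^3)."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(lam * np.eye(len(X)) + K, y)

def predict_kernel_ridge(beta, X_train, X_test, gamma=1.0):
    """g(x) = sum_n beta_n * K(x_n, x)."""
    return rbf_kernel(X_test, X_train, gamma) @ beta

# Toy usage: fit a noisy sine curve.
X = np.linspace(0, 3, 50).reshape(-1, 1)
y = np.sin(2 * X).ravel() + 0.1 * np.random.randn(50)
beta = fit_kernel_ridge(X, y, lam=0.1, gamma=2.0)
y_hat = predict_kernel_ridge(beta, X, X, gamma=2.0)
```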

Comparison of Linear Regression and Kernel Ridge Regression


The disadvantage of linear ridge regression is that it is more limited; but in terms of computational complexity, if the number of data points $N$ is much larger than the dimension $d$, linear regression is more efficient.
Kernel ridge regression, because it uses the kernel trick, is more flexible and better suited to fitting complex targets. However, its computational cost depends on the amount of data, so it is not appropriate when the data set is large.

Therefore, the difference between the linear and kernel methods is a tradeoff between computational efficiency and the flexibility needed for complex problems.
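As a rough summary of this tradeoff, using the standard closed forms (the constants and the exact scaling of $\lambda$ depend on the formulation):

$$
\begin{aligned}
\text{linear ridge regression:}\quad & w = (\lambda I_{d} + X^{\mathsf T}X)^{-1}X^{\mathsf T}y
  &&\Rightarrow\ \text{roughly } O(d^{3} + d^{2}N),\\
\text{kernel ridge regression:}\quad & \beta = (\lambda I_{N} + K)^{-1}y
  &&\Rightarrow\ \text{roughly } O(N^{3}).
\end{aligned}
$$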

Support Vector Regression

In an earlier introduction we used linear regression for classification; kernel ridge regression can be used for classification in the same way.
Kernel ridge regression used for classification is called the least-squares SVM (LSSVM).

Comparing the decision boundaries of SVM and LSSVM, we find that although the boundaries in the two plots look similar, the numbers of support vectors are very different: in LSSVM almost every data point is a support vector. Why is that?
The $\beta$ computed in kernel ridge regression is mostly nonzero, so the number of support vectors is large, which costs a lot of time at prediction.
In the standard SVM the coefficients $\alpha$ are sparse, whereas here $\beta$ is dense.
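A small sketch of this contrast, assuming scikit-learn is available (LSSVM is approximated here by kernel ridge regression on $\pm 1$ labels; the data set and parameters are illustrative):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
y_pm = 2 * y - 1  # +/-1 labels for the least-squares style classifier

# LSSVM-style classifier: kernel ridge regression on +/-1 labels, classify by sign.
lssvm = KernelRidge(alpha=0.1, kernel="rbf", gamma=1.0).fit(X, y_pm)
dense_sv = int(np.sum(np.abs(lssvm.dual_coef_) > 1e-6))  # beta_n that are (numerically) nonzero

# Standard soft-margin SVM with the same kernel.
svm = SVC(C=1.0, kernel="rbf", gamma=1.0).fit(X, y)
sparse_sv = len(svm.support_)

print(f"kernel ridge nonzero beta:    {dense_sv} / {len(X)}")
print(f"standard SVM support vectors: {sparse_sv} / {len(X)}")
```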

Tube Regression

We now assume there is a neutral zone, the tube: when data falls inside this region we do not count it toward the error function; only the distance from data outside the region to the tube is counted as error.

The loss function is defined as follows:
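In its standard form, the tube ($\varepsilon$-insensitive) error referred to here is:

$$
\text{err}(y, s) \;=\; \max\bigl(0,\ |s - y| - \varepsilon\bigr),
\qquad s = w^{\mathsf T}z + b,\quad \varepsilon > 0,
$$

so the error is zero when the prediction lies within distance $\varepsilon$ of $y$ (inside the tube) and grows linearly once it leaves the tube.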

Comparing the tube regression and squared-error curves, we can see that when the error is small the two measures are similar, and when the error is large the squared error grows much faster, which shows that the squared error is more susceptible to noise.

Next, we use L2-regularized tube regression to obtain a sparse $\beta$.

L2-Regularized Tube Regression


Since tube regression with a regularizer contains the max function, which is not differentiable, we imitate the solution technique used in SVM.

Mirroring the SVM formulation, we pull the constant term $b$ out of $w$ and obtain the problem to be transformed:

Recall that in SVM we introduced a new variable $\xi_n$ to record the penalty of violating data; similarly, we use $\xi_n$ to replace the max and put it into the objective function:

Since there is still an absolute-value operation, which is not differentiable, we split it into two cases and introduce two $\xi$ variables, so that the constraints become linear and the problem can be solved with quadratic programming:
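Written out in the usual form (with $\xi_n^{\vee}$ and $\xi_n^{\wedge}$ the slack variables for the lower and upper sides of the tube), the resulting quadratic program is, as a sketch:

$$
\begin{aligned}
\min_{b,\,w,\,\xi^{\vee},\,\xi^{\wedge}}\quad
 & \tfrac{1}{2}\,w^{\mathsf T}w + C\sum_{n=1}^{N}\bigl(\xi_n^{\vee}+\xi_n^{\wedge}\bigr)\\
\text{s.t.}\quad
 & -\varepsilon - \xi_n^{\vee} \;\le\; y_n - \bigl(w^{\mathsf T}z_n + b\bigr) \;\le\; \varepsilon + \xi_n^{\wedge},\\
 & \xi_n^{\vee} \ge 0,\quad \xi_n^{\wedge} \ge 0,\qquad n = 1,\dots,N.
\end{aligned}
$$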

The above gives the standard support vector regression, which we call the primal SVR problem.

Compared with SVM, there is an additional adjustable parameter $\varepsilon$.

Dual Problem of Support Vector Regression

Here we use Lagrange multipliers to turn the constrained optimization problem into its dual, with two multipliers $\alpha^{\vee}$ and $\alpha^{\wedge}$ corresponding to the two $\xi$ variables:

Next, we write the objective function and constraints as a Lagrangian, take its derivatives, and substitute using the KKT conditions. This is very similar to the earlier SVM derivation, so only a few important results are given:
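As a sketch of those results (the sign convention for $\beta_n$ depends on which constraint each multiplier is attached to):

$$
\begin{aligned}
\frac{\partial\mathcal{L}}{\partial w} = 0 \;&\Rightarrow\; w = \sum_{n=1}^{N}\bigl(\alpha_n^{\wedge}-\alpha_n^{\vee}\bigr)z_n
  \quad\text{so}\quad \beta_n = \alpha_n^{\wedge}-\alpha_n^{\vee},\\
\frac{\partial\mathcal{L}}{\partial b} = 0 \;&\Rightarrow\; \sum_{n=1}^{N}\bigl(\alpha_n^{\wedge}-\alpha_n^{\vee}\bigr) = 0,
  \qquad 0 \le \alpha_n^{\wedge} \le C,\quad 0 \le \alpha_n^{\vee} \le C.
\end{aligned}
$$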

We compare the dual SVM with the dual SVR to show that the two dual problems follow the same pattern.

After solving the above, we want the final result to be sparse, so let us see when $\beta_n$ is 0.
When a data point lies strictly inside the tube, the two error variables $\xi_n^{\vee}$ and $\xi_n^{\wedge}$ are 0; complementary slackness then forces the two corresponding $\alpha$ to be 0, so the resulting $\beta_n$ is 0. This guarantees the sparsity of the support vectors: only the data points outside the tube, or exactly on its boundary, are support vectors.
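A small sketch that checks this sparsity empirically, assuming scikit-learn is available (the data set and parameters are illustrative):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 3, 80)).reshape(-1, 1)
y = np.sin(2 * X).ravel() + 0.1 * rng.standard_normal(80)

# epsilon is the half-width of the tube; a wider tube leaves fewer support vectors.
model = SVR(kernel="rbf", C=1.0, epsilon=0.15, gamma=2.0).fit(X, y)

residuals = np.abs(y - model.predict(X))
print("support vectors:", len(model.support_), "of", len(X))
# Up to numerical tolerance, the support vectors are the points on or outside the tube.
print("points with |y - g(x)| >= epsilon:", int(np.sum(residuals >= model.epsilon - 1e-8)))
```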

Review: Linear Models

Among the linear models introduced at the start of this series, we covered the PLA algorithm, linear regression (called ridge regression when a regularization term is added) and logistic regression, which correspond to three different error functions. In more recent posts we introduced the linear soft-margin SVM, which also solves a linear problem, this time via quadratic programming. For regression problems we further introduced the support vector regression model, which is solved with the tube error and quadratic programming.

Kernel Models

Among kernel models we introduced SVM and SVR, both of which use quadratic programming to solve the simplified dual problem. We also introduced how to turn linear regression into a kernel method, using the representer theorem to derive its kernel form; kernel logistic regression follows a broadly similar approach. We also used the probabilistic SVM, which first runs an SVM and then fine-tunes with logistic regression.

Among these models, PLA and linear SVR are seldom used, because they are less effective than the other three linear models. Kernel ridge regression and kernel logistic regression are also not commonly used, because their coefficients are mostly nonzero, so prediction would cost a lot of unnecessary computation.

For reprints, please credit the author Jason Ding and the source.
GitCafe blog homepage (http://jasonding1354.gitcafe.io/)
GitHub blog homepage (http://jasonding1354.github.io/)
CSDN blog (http://blog.csdn.net/jasonding1354)
Jianshu homepage (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)
Search "jasonding1354" on Baidu to find my blog homepage

"Machine Learning Basics" Support vector regression

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.