A First Look at the SVM (Support Vector Machine): the Lagrangian and the Dual Algorithm


In many places the SVM is explained in an obscure way that is hard to follow. I recently came across a well-written blog post; combining it with my own understanding, I re-organize the key points here.

http://blog.csdn.net/zouxy09/article/details/17291543


1. Introduction

SVM is a classifier. The purpose of classification is to learn a classification function or classification model (a classifier) that maps data items to one of a set of given categories, so that it can be used to predict the category of unseen data.

When used for classification, the support vector machine is a binary (two-class) classification model. In other words, given a sample set containing both positive and negative examples, the goal of the SVM is to find a hyperplane that splits the samples so that the positive and negative examples lie on opposite sides of it. And it is not just any split: the principle is to make the margin between the positive and negative examples as large as possible. The goal of learning is therefore to find a separating hyperplane w^T x + b = 0 in feature space, determined by the normal vector w and the intercept b. The hyperplane divides the feature space into two parts, one positive and one negative; the side the normal vector points to is the positive class and the other side is the negative class. Two points are worth noting here. First, the labels are (+1, -1) rather than the (1, 0) used in logistic regression; there is no deep reason for this, it simply makes the later calculations more convenient. Second, there is no restriction on the dimension of the hyperplane, as long as it is one dimension lower than the data.

Let's start with a simple example:

(In the figure from the original post, several lines separate the two classes; the black line is clearly the best choice.) This is easy to see by eye in two dimensions, but once the dimension grows it is no longer so easy to judge, which is why we need a mathematical model.

2. Linearly separable SVM and hard-margin maximization

The SVM tries to find a hyperplane that splits the samples, with the positive and negative examples on opposite sides of it. But it is not a careless, perfunctory split: it does its best to make the margin between the positive and negative examples as large as possible. This yields a more trustworthy classification and better predictive power on new, unseen samples.


Our goal is to find a hyperplane such that the points closest to it are as far away from it as possible. That is, we do not require all points to be far from the hyperplane; we care about the hyperplane that maximizes the distance to its nearest points. Those nearest points are the support vectors.

Suppose we have N training samples {(x1, y1), (x2, y2), ..., (xN, yN)}, where each xi is a d-dimensional vector and yi ∈ {+1, -1} is the label of the sample, indicating which of the two classes it belongs to.

We want to use these samples to learn a linear classifier (hyperplane): f(x) = sgn(w^T x + b); that is, when w^T x + b is greater than 0 we output +1, and when it is less than 0 we output -1. sgn() is the sign function. g(x) = w^T x + b = 0 is the separating hyperplane we are looking for (shown in the figure of the original post). So what exactly do we want to do? We want this hyperplane to separate the two classes as widely as possible, i.e. its distance to the nearest samples of the two classes should be equal and as large as possible. To make this precise, we pick two planes parallel to and equidistant from the separating hyperplane: H1: w^T x + b = +1 and H2: w^T x + b = -1.
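As a minimal illustration (my own sketch, not from the original post), the decision function can be written in a few lines of NumPy; the weight vector w and intercept b below are made-up placeholder values, not learned ones:

    import numpy as np

    def svm_decision(x, w, b):
        """Linear SVM decision function: f(x) = sgn(w^T x + b)."""
        return 1 if np.dot(w, x) + b > 0 else -1

    # Hypothetical hyperplane x1 - x2 = 0, chosen only for illustration.
    w = np.array([1.0, -1.0])   # normal vector
    b = 0.0                     # intercept
    print(svm_decision(np.array([2.0, 1.0]), w, b))   # +1, point lies on the positive side
    print(svm_decision(np.array([0.0, 3.0]), w, b))   # -1, point lies on the negative side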

We then need two conditions: (1) there are no sample points between the two planes; (2) the distance between the two planes is as large as possible. (For any H1 and H2, we can rescale the coefficient vector w and the intercept b so that the right-hand sides of the H1 and H2 equations become +1 and -1 respectively.)

First look at condition (2). We want to maximize this distance, and there will be some samples lying exactly on the two planes; these are called support vectors (we will see their importance later). So what is this distance? Recall how to compute the distance between two parallel lines: for ax + by = c1 and ax + by = c2 the distance is |c2 - c1| / sqrt(a^2 + b^2). Applying this to H1: w1*x1 + w2*x2 + b = +1 and H2: w1*x1 + w2*x2 + b = -1, the distance between H1 and H2 is |(+1) - (-1)| / sqrt(w1^2 + w2^2) = 2 / ||w||. To maximize the margin = 2 / ||w|| we should therefore minimize ||w||, which looks simple enough. At the same time we must also satisfy condition (1), namely that no data points lie between H1 and H2:

That is, any positive sample (yi = +1) should lie on or beyond H1, i.e. w^T xi + b >= +1; any negative sample (yi = -1) should lie on or beyond H2, i.e. w^T xi + b <= -1. These two constraints can be combined into a single one: yi (w^T xi + b) >= 1. (This also explains why the labels are chosen as +1 and -1.)

So our problem becomes:
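(The formula image from the original post is not reproduced here; the standard hard-margin form it presumably shows is:)

    max_{w,b}   2 / ||w||
    s.t.        y_i (w^T x_i + b) >= 1,   i = 1, ..., N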

The first formula says we maximize the distance from the support vectors to the hyperplane; the second says no sample points may fall inside the region between the +1 and -1 support planes.

Since maximizing 2/||w|| is equivalent to minimizing (1/2)||w||^2, this becomes the following constrained optimization problem:
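(Again reconstructing the missing formula image:)

    min_{w,b}   (1/2) ||w||^2
    s.t.        y_i (w^T x_i + b) >= 1,   i = 1, ..., N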

This is a convex quadratic programming (QP) problem. What does "convex" mean? A convex set is a set of points such that, for any two points in the set, every point on the line segment joining them is still inside the set; this is where the very visual name "convex" comes from. For a convex problem (here the objective is a convex function and the constraints are affine, i.e. linear functions of the form ax + b), a local optimum is also a global optimum; this is not true for non-convex problems. "Quadratic" refers to the fact that the objective function is a quadratic function of the variables.

Well, since it is a convex QP problem, the optimal solution can be obtained with an off-the-shelf QP (quadratic programming) solver, and our problem would be solved at this point. However, although this really is a standard QP problem, it also has special structure: by applying the Lagrange duality transformation and turning it into an optimization problem over the dual variables, we can find a more efficient solution method, often much more efficient than handing the primal directly to a general-purpose QP package. So, besides the conventional QP route, we can apply Lagrange duality and obtain the optimal solution by solving the dual problem. This is the dual algorithm of the support vector machine in the linearly separable case. Its advantages are that the dual problem is often easier to solve, and that it naturally allows kernel functions to be introduced, which generalizes the method to nonlinear classification problems.

Simply put, one role of the kernel is to map data that is not linearly separable in a low-dimensional space into a higher-dimensional space where it becomes linearly separable.
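As a small sketch of this idea (my own illustration, not from the original post): the degree-2 polynomial kernel k(x, z) = (x^T z)^2 on 2-D inputs equals an ordinary inner product after the explicit feature map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), so the kernel gives us the high-dimensional inner product without ever building the high-dimensional vectors:

    import numpy as np

    def poly2_kernel(x, z):
        """Degree-2 polynomial kernel: k(x, z) = (x^T z)^2."""
        return np.dot(x, z) ** 2

    def phi(x):
        """Explicit feature map of the same kernel (2-D input -> 3-D feature space)."""
        return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

    x = np.array([1.0, 2.0])
    z = np.array([3.0, 4.0])
    print(poly2_kernel(x, z))       # 121.0, computed directly in 2-D
    print(np.dot(phi(x), phi(z)))   # 121.0, the same value via the explicit 3-D features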


3. The dual optimization problem

3.1 The dual problem

In constrained optimization, Lagrange duality is often used to convert the original (primal) problem into a dual problem, and the solution of the primal problem is then obtained by solving the dual. The principle and its derivation are explained very well in reference [3], which readers can consult; here we only show how the dual problem is obtained in practice. Suppose our optimization problem is:

min f(x)

s.t. h_i(x) = 0,   i = 1, 2, ..., n

This is an optimization problem with equality constraints. We introduce Lagrange multipliers, one per constraint, and obtain the Lagrangian function:

L(x, α) = f(x) + α_1 h_1(x) + α_2 h_2(x) + ... + α_n h_n(x)

(The Lagrangian folds the constraints into a single expression, which makes the later manipulation easier.)

Then we find the extremum of the Lagrangian with respect to x: differentiate with respect to x and set the derivative to 0. This expresses x as a function of α; substituting it back into the Lagrangian gives:

max W(α) = L(x(α), α)

At this point the equality-constrained optimization problem has become an optimization problem in the single variable α (a vector if there are multiple constraints), which is easy to solve: again set the derivative to 0 and solve for α. Note the terminology: the original problem is called the primal problem, and the converted form is called the dual problem. Note also that the primal is a minimization, while after the conversion the dual becomes a maximization. The same manipulation also works for inequality constraints. In short, by attaching a Lagrange multiplier to each constraint we fold the constraints into the objective function, which makes the optimization problem easier to handle. (There is a lot more that is interesting here; see other references for details.)
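As a tiny worked example of this recipe (my own, not from the original post), take minimizing f(x) = x^2 subject to the single constraint x - 1 = 0:

    L(x, α) = x^2 + α (x - 1)
    ∂L/∂x = 2x + α = 0            =>   x(α) = -α/2
    W(α) = L(x(α), α) = α^2/4 + α(-α/2 - 1) = -α^2/4 - α
    dW/dα = -α/2 - 1 = 0          =>   α* = -2,   x* = -α*/2 = 1

So the dual recovers the constrained minimizer x* = 1, and W(α*) = f(x*) = 1.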

3.2 The dual problem of the SVM optimization

For the SVM, as mentioned earlier, the primal problem has the following form:
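(Reconstructing the missing formula image; this is the hard-margin problem from Section 2:)

    min_{w,b}   (1/2) ||w||^2
    s.t.        y_i (w^T x_i + b) >= 1,   i = 1, ..., N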

Introducing a Lagrange multiplier α_i >= 0 for each constraint in the same way, we get the following Lagrangian:
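(The formula image is missing; the Lagrangian it presumably shows is:)

    L(w, b, α) = (1/2) ||w||^2 - Σ_i α_i [ y_i (w^T x_i + b) - 1 ]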

(The original post remarks here that the "+b" in its formula image looks like a typo.)

Then we take the extrema of L(w, b, α) with respect to w and b, i.e. we require the gradients of L(w, b, α) with respect to w and b to be 0: ∂L/∂w = 0 and ∂L/∂b = 0, while also keeping α_i >= 0. Solving the equations where the derivatives are 0 gives:
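(Reconstructing the missing formulas; setting the two gradients to zero gives:)

    ∂L/∂w = 0   =>   w = Σ_i α_i y_i x_i
    ∂L/∂b = 0   =>   Σ_i α_i y_i = 0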

Substituting these back into the Lagrangian, the problem becomes:
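(The missing formula image presumably shows the standard dual problem:)

    max_α   Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i^T x_j)
    s.t.    α_i >= 0,   i = 1, ..., N
            Σ_i α_i y_i = 0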

This is the dual problem (if we know α we know w; conversely, if we know w we can also recover α). The problem has now become a maximization over the dual variable α alone (no w or b, only α). Once the optimal α* is obtained, w* follows from the formula above and b* can be computed from any support vector (a point with y_i (w^T x_i + b) = 1), which gives us the separating hyperplane and the classification decision function; in other words, the SVM is trained. A new sample x can then be classified as follows:
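(The missing formula is the decision function:)

    f(x) = sgn( Σ_i α_i y_i (x_i^T x) + b* )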

In fact, most of the α_i here are 0. The formula says that classifying x still amounts to a linear operation involving w and b followed by checking whether the result is positive or negative; but with the α_i we no longer need w explicitly, we only need the inner products between the new sample and the training samples. Moreover, only the support vectors have α_i ≠ 0; for all other samples α_i = 0, so in practice only the support vectors take part in the sum.
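A minimal sketch of prediction in this dual form (my own code, with made-up numbers for the multipliers and intercept), showing that samples with α_i = 0 drop out of the sum:

    import numpy as np

    def dual_predict(x_new, X, y, alpha, b):
        """f(x) = sgn( sum_i alpha_i * y_i * <x_i, x> + b ); zero-alpha samples contribute nothing."""
        scores = alpha * y * (X @ x_new)   # one term per training sample
        return np.sign(scores.sum() + b)

    # Toy training set: two "support vectors" (alpha > 0) and one non-support vector (alpha = 0).
    X = np.array([[1.0, 1.0], [2.0, 2.0], [5.0, 5.0]])
    y = np.array([+1.0, -1.0, -1.0])
    alpha = np.array([2.0, 2.0, 0.0])   # hypothetical multipliers, not from a real solve
    b = 3.0                             # hypothetical intercept
    print(dual_predict(np.array([0.5, 0.5]), X, y, alpha, b))   # prints 1.0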

4. Slack variables and soft-margin maximization

The discussion so far assumed that the samples are distributed nicely and are linearly separable, in which case a near-perfect hyperplane can be found to separate the two classes. But what if we run into either of the following situations? In the left figure, one negative sample is not very sociable and has wandered over to the positive side. If we apply the strict classification rule above, we get the red boundary shown on the left, which does not look right at all: one misbehaving point has bent the whole boundary. Then there is the case in the right figure: a positive point and a negative point have each run over to the other class's doorstep, and now no straight line can separate them. What should we do? Should we really give in to these few unruly outliers? Is it worth letting their imperfection ruin our otherwise perfect separating plane? Yet they do have to be taken into account somehow, so how can we compromise?

We call such data points that deviate from their normal position outliers. They may be noise collected along with the training samples, or a labeling mistake by whoever annotated the data, marking a positive sample as negative. In general, if we simply ignored such an outlier, the original separating hyperplane would still be fine; but because of its presence, the separating hyperplane gets squeezed and bent, and the margin shrinks accordingly. More seriously, as in the right figure, if such outliers are present we may not be able to construct any linearly separating hyperplane at all.

To deal with this situation, we allow data points to deviate from the hyperplane to a certain extent. That is, we allow some points to lie between H1 and H2, i.e. their functional margin to the separating plane is allowed to be less than 1. In the figure of the original post:


ξ_i: the slack variable introduced for each sample, which allows that sample a certain amount of deviation from the margin requirement of the hyperplane.

C: the penalty factor. If a sample uses some slack, a corresponding penalty term is added to the objective, so the deviation is penalized while we minimize ||w||.

In particular, the original constraint becomes:
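(The missing formula: the relaxed constraints are:)

    y_i (w^T x_i + b) >= 1 - ξ_i,   ξ_i >= 0,   i = 1, ..., N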

At this point we add a penalty term to the objective function, and the new model (this is what is known as the soft margin) becomes:
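(Reconstructing the missing formula image:)

    min_{w,b,ξ}   (1/2) ||w||^2 + C Σ_i ξ_i
    s.t.          y_i (w^T x_i + b) >= 1 - ξ_i,   ξ_i >= 0,   i = 1, ..., N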

After introducing the non-negative slack variables ξ_i, some sample points are allowed to have a functional margin less than 1, i.e. to lie inside the maximum-margin region, or even a negative functional margin, i.e. to lie on the wrong side of the hyperplane. Having relaxed the constraints, we adjust the objective to penalize these outliers: the second term of the new objective says that the more (and the farther) points violate the margin, the larger the objective becomes, while we are trying to minimize it. Here C is the weight given to outliers: the larger C is, the more the outliers affect the objective, i.e. the less willing we are to tolerate them, and the smaller the margin will tend to be. The objective thus controls the number and extent of outliers, while most sample points still satisfy the original constraints.

After going through the same derivation as before, our dual optimization problem becomes:
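(The missing formula is the soft-margin dual, which differs from the hard-margin dual only in the upper bound on α:)

    max_α   Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i^T x_j)
    s.t.    0 <= α_i <= C,   i = 1, ..., N
            Σ_i α_i y_i = 0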

At this point we find that the parameter ξ_i has disappeared; the only difference from the previous dual problem is the extra constraint α_i <= C. It should also be noted that the formula for computing b changes; the changed result is described together with the SMO algorithm.
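As a hedged sketch (not part of the original post), the soft-margin dual above is a small quadratic program that a generic QP solver can handle. Here is one way it might be set up with the cvxopt package; the package, the 1e-6 tolerance, and the toy data are all assumptions of this sketch:

    import numpy as np
    from cvxopt import matrix, solvers   # assumes cvxopt is installed

    def fit_soft_margin_dual(X, y, C=1.0):
        """Solve: min_a 0.5*a^T Q a - 1^T a  s.t.  0 <= a_i <= C,  y^T a = 0,
        where Q_ij = y_i * y_j * <x_i, x_j>, then recover w and b."""
        n = X.shape[0]
        M = y[:, None] * X
        P = matrix(M @ M.T)                                 # Q_ij = y_i y_j x_i^T x_j
        q = matrix(-np.ones(n))
        G = matrix(np.vstack([-np.eye(n), np.eye(n)]))      # encodes -a <= 0 and a <= C
        h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
        A = matrix(y.reshape(1, -1))
        b = matrix(0.0)
        solvers.options['show_progress'] = False
        alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
        w = ((alpha * y)[:, None] * X).sum(axis=0)          # w = sum_i a_i y_i x_i
        on_margin = (alpha > 1e-6) & (alpha < C - 1e-6)     # margin support vectors
        b_star = np.mean(y[on_margin] - X[on_margin] @ w)   # from y_i (w^T x_i + b) = 1
        return w, b_star, alpha

    # Toy, linearly separable data (made up for illustration).
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    w, b_star, alpha = fit_soft_margin_dual(X, y, C=10.0)
    print(np.sign(X @ w + b_star))   # should reproduce y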



