Reposted from: http://my.oschina.net/wangguolongnk/blog/111349
1. What is the purpose of support vector machines?
For classification, given a sample set containing both positive and negative sample points, the goal of a support vector machine is to find a hyperplane that separates the positive examples from the negative ones. Moreover, it does not pick just any separating hyperplane: the principle is to make the margin between the positive and negative examples as large as possible.
What is a hyperplane? Simply put, a hyperplane is the generalization of a straight line in the plane to higher-dimensional spaces. In three-dimensional space, a hyperplane is an ordinary plane. In higher dimensions we can only express it by a formula and lack an intuitive picture. In general, a hyperplane in n-dimensional space is (n-1)-dimensional.
The equation of the hyperplane is w^T x + b = 0. Here w is an adjustable coefficient (weight) vector and b is the bias. Note our notational convention: all vectors are column vectors, so in the first term the vector w must be transposed to form the inner product.
Now consider a sample set {x_i, d_i}, where x_i is the input feature vector and d_i is the class label of the sample. We stipulate that d_i = +1 when x_i belongs to the first class and d_i = -1 when x_i belongs to the second class.
Linear separability then means that some hyperplane can completely separate the two classes of samples. Expressed in formulas:

w^T x_i + b > 0  when d_i = +1
w^T x_i + b < 0  when d_i = -1
You may now ask: what if the samples are not linearly separable? That case will be dealt with later. Here we first discuss the linearly separable case and then extend the results to the non-separable case.
Given a linearly separable sample set, many separating hyperplanes exist; we want to adjust w0 and b0 so that the margin between the positive and negative samples is as large as possible, which gives the optimal hyperplane. In practice, we maximize the distance from the hyperplane to the point nearest to it; in other words, we want the hyperplane to stay as far as possible from its closest point. For the optimal hyperplane, the distance to the nearest positive sample equals the distance to the nearest negative sample. Is this a coincidence?
It is not. Suppose we have found a hyperplane whose distance to the nearest positive sample is larger than its distance to the nearest negative sample, so the point closest to the hyperplane is a negative sample. Bearing our goal in mind, we can still shift the hyperplane to enlarge the distance to the nearest negative point, even at the cost of reducing the distance to the nearest positive point. The final result of such adjustment must be that the nearest points on both sides are equidistant from the hyperplane.
To visualize the margin between the positive and negative samples, we can define two hyperplanes H1 and H2 (shown as dashed lines in the figure) on the two sides of the separating hyperplane, passing through the positive and negative sample points closest to it (circled in the figure). From the analysis above, H1 and H2 are equidistant from the separating hyperplane.
The points lying on H1 and H2 are called the support vectors. The margin between the positive and negative samples can be defined as the distance between H1 and H2, which is the sum of the distances from the nearest positive point and the nearest negative point to the separating hyperplane.
As the figure shows, the support vectors play the key role in determining the position of the separating hyperplane; they emerge once the position of the hyperplane has been optimized, while the sample points behind the support vectors are not critical to the classification. Why? Because even if all sample points other than the support vectors were deleted and the optimal separating hyperplane were found again, its position would be the same as before. To sum up:
The support vectors contain all the information needed to determine the separating hyperplane!
2. Representing the distance from a sample point to the hyperplane
How do we express the distance from a point to the hyperplane?
First let us see what the coefficient vector w0 means geometrically. It turns out that w0 is the normal vector of the hyperplane!
So any sample point x can be decomposed as

x = x_p + r * (w0 / ||w0||),

where x_p is the projection of x onto the hyperplane and r is the geometric distance (geometric margin) from x to the hyperplane.

Define

g(x) = w0^T x + b0.

Since x_p lies on the hyperplane, g(x_p) = 0, and therefore

g(x) = w0^T (x_p + r * w0/||w0||) + b0 = r ||w0||,  i.e.  r = g(x) / ||w0||.

Now we can see that g(x) measures the distance from the sample point x to the hyperplane: with ||w0|| held constant, the absolute value of g(x) reflects the size of the geometric margin r. We give g(x) a name: the functional margin. Note that both the geometric margin r and the functional margin g(x) are signed; positive and negative values correspond to the two sides of the hyperplane.
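The relation r = g(x) / ||w0|| can be made concrete with a minimal sketch; the hyperplane coefficients and the sample point below are made-up values for illustration:

```python
import math

def functional_margin(w, b, x):
    """Functional margin g(x) = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def geometric_margin(w, b, x):
    """Signed geometric distance r = g(x) / ||w||."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return functional_margin(w, b, x) / norm_w

# Made-up hyperplane 3*x1 + 4*x2 - 5 = 0, so ||w|| = 5
w, b = [3.0, 4.0], -5.0
x = [3.0, 4.0]
print(functional_margin(w, b, x))  # 3*3 + 4*4 - 5 = 20.0
print(geometric_margin(w, b, x))   # 20 / 5 = 4.0
```

The sign of both margins flips for a point on the other side of the hyperplane, matching the remark above.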
3. Maximizing the margin
Now that we know how the functional margin and the geometric margin are expressed, we want to maximize the distance from the support vectors to the separating hyperplane; but at the outset we do not know which vectors are the support vectors.
Our goal is to maximize the geometric margin r from the support vectors to the separating hyperplane, rather than the functional margin g(x). Why? Because the coefficients of the hyperplane equation can be scaled up or down by a common factor without changing the hyperplane itself. Thus ||w0|| is not fixed, and this affects the size of the functional margin g(x).
So what we need to maximize is the geometric margin r, which is equivalent to fixing ||w0|| and then maximizing the functional margin g(x). In practice, however, the usual approach is the opposite: fix the functional margin at absolute value 1 and then minimize ||w0||. That is, we set the absolute value of the functional margin of the support vectors with respect to the separating hyperplane to 1, and then minimize ||w0||.
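The scaling argument can be checked directly; with made-up toy numbers, multiplying w and b by the same constant c scales the functional margin by c but leaves the geometric margin unchanged:

```python
import math

def g(w, b, x):                       # functional margin
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def r(w, b, x):                       # geometric margin
    return g(w, b, x) / math.sqrt(sum(wi * wi for wi in w))

w, b, x = [3.0, 4.0], -5.0, [2.0, 2.0]
c = 10.0                              # rescale the hyperplane equation
w2, b2 = [c * wi for wi in w], c * b

print(g(w, b, x), g(w2, b2, x))       # functional margin scales: 9.0 vs 90.0
print(r(w, b, x), r(w2, b2, x))       # geometric margin unchanged: 1.8 vs 1.8
```

This is exactly why we are free to normalize the functional margin of the support vectors to 1.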
4. Formal presentation
Now we can state the problem formally. We need to minimize ||w0||, i.e. minimize the Euclidean norm of the hyperplane's weight vector w0. But are there any constraints? Remember the last sentence of the previous section?
"That is, we set the absolute value of the functional margin of the support vectors with respect to the separating hyperplane to 1, and then minimize ||w0||."
So minimizing ||w0|| does come with constraints. How do we express them? Since the g(x) of a support vector is set to +1 or -1 (depending on which side of the separating hyperplane it lies, i.e. whether it is a positive or negative sample), it follows that g(x) >= +1 for all positive samples and g(x) <= -1 for all negative samples.
Recall the definition of g(x):

g(x) = w^T x + b.

We can then write down the constraints:

w^T x_i + b >= +1  when d_i = +1
w^T x_i + b <= -1  when d_i = -1

Now we can write the problem more concisely:

Objective function: minimize (1/2) ||w||^2

Constraints: d_i (w^T x_i + b) >= 1, i = 1, 2, ..., n

The factor 1/2 is added for computational convenience; n is the number of sample points.
Our first task is now complete: we have transformed the problem of finding the optimal separating hyperplane into an optimization problem with a set of inequality constraints. This optimization problem is called the primal problem. We do not solve it directly; instead we convert it into a dual problem. How to do so is the subject of the next few sections.
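As a sanity check of the primal formulation, here is a sketch on a made-up 2-D toy set: the candidate hyperplane below satisfies every constraint d_i (w^T x_i + b) >= 1, with equality exactly at the points nearest the hyperplane (the would-be support vectors):

```python
def constraint_value(w, b, x, d):
    """Left-hand side of the primal constraint: d * (w^T x + b)."""
    return d * (sum(wi * xi for wi, xi in zip(w, x)) + b)

# Made-up linearly separable toy set: (point, label)
samples = [(( 2.0,  2.0), +1), (( 3.0,  3.0), +1),
           (( 0.0,  0.0), -1), ((-1.0, -1.0), -1)]

w, b = (0.5, 0.5), -1.0               # a candidate separating hyperplane
objective = 0.5 * sum(wi * wi for wi in w)

for x, d in samples:
    print(x, d, constraint_value(w, b, x, d))  # every value is >= 1

print(objective)                       # 0.5 * ||w||^2 = 0.25
```

The constraints that hold with equality (value exactly 1) belong to the nearest points on each side, which foreshadows the role of the support vectors in the dual.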
Optimality conditions for minimization with equality constraints
Solving the support vector machine means solving the optimization problem of the previous section, which is material from a course on optimization.
Recall the previous section: our goal is to find the minimum of a function under a number of constraints. In the primal problem of the previous section the constraints are inequalities. In this section we first consider a simpler question, the optimization problem that contains only equality constraints:
minimize f(x)
subject to h_i(x) = 0, i = 1, ..., m        (1)

Here f(x) is called the objective function, and the h_i(x) = 0 are a series of equality constraints. Recall: when no constraints exist, how do we look for the optimal point? The condition for x* to be optimal is:

∇f(x*) = 0.

If the function f(x) is convex, this condition is also sufficient.
A brief aside on notation: if f(x) is a real-valued function and x is an n-dimensional vector, then the derivative of f with respect to the vector x is defined as:

∇f(x) = (∂f/∂x_1, ∂f/∂x_2, ..., ∂f/∂x_n)^T.
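The gradient defined above can be approximated numerically by central differences, which is a handy way to check a hand-derived gradient; the example function below is chosen purely for illustration:

```python
def num_grad(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        xp = list(x); xp[i] += h
        xm = list(x); xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad

# f(x) = x1^2 + 3*x2, so the exact gradient is (2*x1, 3)
f = lambda x: x[0] ** 2 + 3 * x[1]
print(num_grad(f, [2.0, 5.0]))  # approximately [4.0, 3.0]
```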
Back to the current problem. When we look for the optimum under constraints, the constraints reduce the search space but complicate the problem. To make the problem easier to handle, the approach is to merge the objective function and the constraints into a new function, the Lagrangian function, and use it to find the optimum.
To visualize this, consider an objective function of three variables with a single equality constraint:

minimize f(x)
subject to h(x) = 0,  x = (x1, x2, x3)        (2)

Geometrically, problem (2) asks for the minimum of f over the surface Ω defined by h(x) = 0. Suppose the optimal solution of problem (2) is x*. Now draw on the surface Ω a smooth curve l through the point x*, parameterized as x = x(t) (because the curve l lies on the surface Ω, h(x(t)) = 0 holds automatically).
Let t* be the parameter value corresponding to the optimal point, x(t*) = x*. Because x* is optimal on the surface Ω, it is also optimal on the curve l, so t* minimizes the one-variable function f(x(t)), and the derivative there is 0. By the chain rule:

∇f(x*)^T x'(t*) = 0.

This equation shows that at the point x*, the gradient vector ∇f(x*) is perpendicular to the tangent of the curve l at x*. Because the curve l is arbitrary, ∇f(x*) is perpendicular to the surface Ω.
Recall from multivariable calculus that ∇h(x*) points in the normal direction of the surface Ω, so ∇f(x*) and ∇h(x*) must be collinear; hence there exists a constant μ* such that

∇f(x*) + μ* ∇h(x*) = 0.

We can write this more compactly. If we construct the two-argument function

L(x, μ) = f(x) + μ h(x),

the conclusion above can be stated as: there exists a constant μ* such that ∇x L(x*, μ*) = 0.
The constructed function L is called the Lagrangian function, and μ is called the Lagrange multiplier.
For an introduction to Lagrange functions with only equality constraints, see also the worked two-variable example on Wikipedia.
The above analyzed a special case with a single constraint. In the general equality-constrained case, problem (1), we construct the Lagrangian in the same way; the expression just differs slightly because it includes multiple equality constraints:

L(x, μ) = f(x) + Σ_i μ_i h_i(x).

In other words, each equality constraint corresponds to one Lagrange multiplier μ_i. Then the condition for x* to be optimal is that there exist corresponding Lagrange multipliers μ* such that the following two formulas hold:

∇x L(x*, μ*) = 0        (3)
∇μ L(x*, μ*) = 0        (4)

(Formula (4) is in fact just the constraints of the original problem (1) restated: ∇μ L = 0 is exactly h_i(x*) = 0 for every i.)
These two equations are necessary conditions for optimality; if the functions involved are convex, they are also sufficient.
Our goal is now achieved: the objective function and the whole set of equality constraints have been merged into a single function (the Lagrangian), so that solving the two equations (3) and (4) yields the optimum. The advantage is self-evident. In the next section we discuss the optimization problem with inequality constraints.
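A tiny worked example, assuming the toy problem minimize f(x, y) = x^2 + y^2 subject to x + y = 1: the conditions (3) and (4) for L(x, y, μ) = x^2 + y^2 + μ(x + y - 1) read 2x + μ = 0, 2y + μ = 0, x + y = 1, giving x* = y* = 1/2 and μ* = -1. The sketch below verifies these stationarity conditions numerically:

```python
def lagrangian_grad(x, y, mu):
    """Gradient of L(x, y, mu) = x^2 + y^2 + mu*(x + y - 1)
    with respect to (x, y, mu)."""
    return (2 * x + mu,        # dL/dx
            2 * y + mu,        # dL/dy
            x + y - 1)         # dL/dmu (recovers the constraint)

x_star, y_star, mu_star = 0.5, 0.5, -1.0
print(lagrangian_grad(x_star, y_star, mu_star))  # (0.0, 0.0, 0.0)
```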
Searching for a lower bound on the optimal value
We now introduce the optimization problem with inequality constraints; the standard form is as follows:

minimize f(x)
subject to g_i(x) <= 0, i = 1, ..., k
           h_j(x) = 0, j = 1, ..., m        (1)

f(x) is the objective function; the g_i(x) <= 0 and the h_j(x) = 0 that follow are a series of inequality constraints and equality constraints respectively.
We first define several concepts:
Feasible point (feasible solution): any point x that satisfies all the constraints.
Feasible region: the set of all feasible points, denoted R. Written formally:

R = { x | g_i(x) <= 0, i = 1, ..., k; h_j(x) = 0, j = 1, ..., m }.

Optimal point (optimal solution): the point x* that lies in the feasible region and minimizes the objective function.
Optimal value: if such an x* is found, p* = f(x*) is the optimal value.
After defining these concepts, we'll go on to the next section.
Similar to the equality-constrained case of the previous section, we define the Lagrangian function as follows:

L(x, λ, μ) = f(x) + Σ_i λ_i g_i(x) + Σ_j μ_j h_j(x).

What is the difference from the Lagrangian of the previous section? The inequality constraints contribute the corresponding terms λ_i g_i(x), and hence a second series of Lagrange multipliers λ_i. What must be emphasized is that every λ_i is required to be greater than or equal to 0 (that is, every multiplier attached to an inequality constraint must be nonnegative; we abbreviate this as λ ≥ 0, meaning λ_i ≥ 0 for each i). Why this is required will become clear shortly.
Next we define an important function, the Lagrange dual function, as follows:

q(λ, μ) = inf_x L(x, λ, μ).        (2)

That is, the Lagrange dual function is the infimum of the Lagrangian regarded as a function of x. What is the point of taking this infimum?
Let us state the conclusion first; it is very important and is the whole point of this section:
The dual function yields a lower bound on the optimal value p* of the original problem (1); that is, for arbitrary λ ≥ 0 and arbitrary μ:

q(λ, μ) <= p*.        (3)

How do we prove (3)?
The proof is very concise. Suppose x* is an optimal solution of the original problem (1), i.e. f(x*) = p*. Then

q(λ, μ) = inf_x L(x, λ, μ) <= L(x*, λ, μ) = f(x*) + Σ_i λ_i g_i(x*) + Σ_j μ_j h_j(x*) <= f(x*) = p*.

The last step uses the fact that x* lies in the feasible region R, so h_j(x*) = 0 and g_i(x*) <= 0; hence Σ_i λ_i g_i(x*) <= 0 provided λ ≥ 0. This is exactly why we imposed that requirement at the outset.
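The lower-bound property can be seen on a one-dimensional toy problem of my choosing: minimize f(x) = x^2 subject to g(x) = 1 - x <= 0, so p* = 1 at x* = 1. Here L(x, λ) = x^2 + λ(1 - x), and minimizing over x (attained at x = λ/2) gives the dual function q(λ) = λ - λ^2/4:

```python
def q(lam):
    """Dual function of: minimize x^2 subject to 1 - x <= 0.
    inf_x [x^2 + lam*(1 - x)] is attained at x = lam/2."""
    return lam - lam ** 2 / 4.0

p_star = 1.0                              # optimal value, attained at x = 1
for lam in [0.0, 0.5, 1.0, 2.0, 3.0, 10.0]:
    print(lam, q(lam), q(lam) <= p_star)  # q(lam) <= p* for every lam >= 0
```

Every nonnegative λ gives a valid lower bound, some tighter than others, which is exactly what inequality (3) says.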
How should we understand inequality (3)? Here are two intuitive explanations.
Explanation One: the linear-approximation interpretation
We first rewrite problem (1) in a more compact, unconstrained form. Define the indicator function

I_-(u) = 0 if u <= 0,  +∞ if u > 0.

We also define another indicator function:

I_0(u) = 0 if u = 0,  +∞ if u ≠ 0.

With the help of these two indicator functions, we can now rewrite problem (1) in an unconstrained form:

minimize f(x) + Σ_i I_-(g_i(x)) + Σ_j I_0(h_j(x)).        (4)

Is this optimization problem (4) equivalent to problem (1)? We can regard the last two sums in (4) as penalty functions on any x that violates the constraints: the term Σ_i I_-(g_i(x)) imposes an "infinite" penalty on an x that violates an inequality constraint, and Σ_j I_0(h_j(x)) imposes an infinite penalty on an x that violates an equality constraint. Optimizing the unconstrained objective in (4) is therefore the same thing as optimizing the objective of (1) under its constraints. That is, problems (1) and (4) are equivalent, except that in (4) the constraints have been fused into the objective function.
Now look back at (2), the Lagrange dual function, which is also an optimization problem. Compare the function it minimizes with the function minimized in (4), written side by side:

Objective function in (2): f(x) + Σ_i λ_i g_i(x) + Σ_j μ_j h_j(x)
Objective function in (4): f(x) + Σ_i I_-(g_i(x)) + Σ_j I_0(h_j(x))

The difference between the two objectives is the penalty. In (4) the penalty is infinite: once a constraint is violated, an infinite penalty is applied. In (2) the penalty is linear: it varies linearly with g_i(x) and h_j(x). The two objectives are therefore very different, and using (2) to approximate (4) is very inaccurate. Nevertheless, for any x, any λ ≥ 0 and any μ we have λ_i g_i(x) <= I_-(g_i(x)) and μ_j h_j(x) <= I_0(h_j(x)), so:

f(x) + Σ_i λ_i g_i(x) + Σ_j μ_j h_j(x) <= f(x) + Σ_i I_-(g_i(x)) + Σ_j I_0(h_j(x))

(here we use the restriction that λ ≥ 0).
So at every point x, the objective value in (2) is no greater than the objective value in (4); hence the optimal value found in (2) is certainly no greater than the optimal value found in (4). Since (1) and (4) are the same problem, inequality (3) is established.
Explanation Two: exchanging the order of max and min
We first observe that:

sup_{λ≥0, μ} L(x, λ, μ) = f(x) if x is feasible, and +∞ otherwise.

Why? If x satisfies the constraints, i.e. g_i(x) <= 0 for all i and h_j(x) = 0 for all j, how can we make L larger by adjusting λ and μ? The best we can do is set every λ_i to 0 (recall λ_i can only be >= 0), eliminating the nonpositive terms λ_i g_i(x); the terms μ_j h_j(x) are 0 no matter how μ changes. So when x is in the feasible region, the supremum equals f(x). What if x violates a constraint? When taking the sup, we need only drive the multiplier of a violated term to +∞ (choosing its sign appropriately in the case of an equality constraint) and set the other multipliers to 0, and the whole expression becomes infinite.
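This sup behavior can be checked numerically on a toy constraint of my choosing, f(x) = x^2 with g(x) = 1 - x <= 0: for a feasible x the supremum of f(x) + λ g(x) over λ >= 0 is attained at λ = 0 and equals f(x), while for an infeasible x the expression grows without bound as λ increases:

```python
def penalized(x, lam):
    """f(x) + lam * g(x) with f(x) = x^2 and g(x) = 1 - x."""
    return x ** 2 + lam * (1.0 - x)

lams = [0.0, 1.0, 10.0, 100.0, 1000.0]

x_feasible = 2.0                      # g(2) = -1 <= 0
print(max(penalized(x_feasible, lam) for lam in lams))   # sup at lam = 0: f(2) = 4.0

x_infeasible = 0.0                    # g(0) = 1 > 0
print([penalized(x_infeasible, lam) for lam in lams])    # grows without bound with lam
```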
So the constrained optimization in problem (1) is the same thing as the unconstrained optimization

p* = inf_x sup_{λ≥0, μ} L(x, λ, μ).

Now swap the order of the inf and sup operators. Clearly:

sup_{λ≥0, μ} inf_x L(x, λ, μ) <= inf_x sup_{λ≥0, μ} L(x, λ, μ) = p*.

Rewriting the left-hand side with formula (2),

q(λ, μ) = inf_x L(x, λ, μ),        (2)

we can read off the conclusion, namely that for λ ≥ 0, (3) holds:

q(λ, μ) <= p*.        (3)
It has been a long journey, but we have now settled the question of where inequality (3) comes from.
To summarize, inequality (3) says in words:
If we regard the Lagrangian as a function of x and take its infimum (note: the infimum is taken over the entire domain, not just the feasible region; x is unconstrained when taking it), the result is a lower bound on the optimal value of the original optimization problem (1).
As for what we do with this result, that is the subject of the next section.
Duality problem
Recall the original problem from the previous section:

minimize f(x)
subject to g_i(x) <= 0, i = 1, ..., k
           h_j(x) = 0, j = 1, ..., m        (1)

We defined the Lagrange dual function:

q(λ, μ) = inf_x L(x, λ, μ),

and then proved q(λ, μ) <= p* for all λ ≥ 0, where p* is the optimal value of the original problem.
That means we have found a lower bound on the optimal value of the original problem. Having found one lower bound, we naturally want the best one. What is the best lower bound? Clearly, the largest of all lower bounds. So we want to maximize q(λ, μ), remembering of course the restriction λ ≥ 0. Writing out the function to optimize and its constraint formally:

maximize q(λ, μ)
subject to λ ≥ 0        (2)

In correspondence with the original problem (1), we call problem (2) the Lagrange dual problem. Clearly, the optimal value d* of the dual problem is the best lower bound on p* that we can obtain this way, the closest one to p* among all such lower bounds, and their relationship is:

d* <= p*        (3)

We call this inequality the weak duality property.
This naturally leads to an important concept, the duality gap, defined as the difference p* - d* between the optimal value of the original problem and the best (largest) lower bound obtained from the Lagrange dual function. From inequality (3), the duality gap is always greater than or equal to 0.
Is it possible that in some cases the duality gap vanishes, i.e. the optimal value of the dual problem equals the optimal value of the original problem?
We now state the Slater condition:
Slater condition: there exists an x satisfying

g_i(x) < 0, i = 1, ..., k;  h_j(x) = 0, j = 1, ..., m.

That is, the Slater condition requires some point at which every inequality constraint holds strictly: "<" rather than "<=".
It can be proved that for a convex optimization problem (see Wikipedia on convex optimization), if the Slater condition is satisfied, then:

d* = p*.

This property is called strong duality.
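For the convex toy problem used earlier (minimize x^2 subject to 1 - x <= 0, which satisfies the Slater condition, e.g. at x = 2), the duality gap does vanish: maximizing q(λ) = λ - λ^2/4 over λ >= 0 gives λ* = 2 and d* = 1 = p*. A coarse grid-search sketch:

```python
def q(lam):
    """Dual function of: minimize x^2 subject to 1 - x <= 0."""
    return lam - lam ** 2 / 4.0

# Coarse numerical maximization of q over lam >= 0 (grid search sketch)
grid = [i / 1000.0 for i in range(10001)]   # lam in [0, 10]
d_star = max(q(lam) for lam in grid)
lam_star = max(grid, key=q)

p_star = 1.0                                # primal optimum at x* = 1
print(lam_star, d_star, p_star - d_star)    # 2.0, 1.0, gap = 0.0
```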
The next question is: if the duality gap vanishes, what interesting phenomena occur?
Suppose the duality gap vanishes, i.e. the dual problem has optimal points λ*, μ* whose corresponding optimal value equals p*. Recall the chain we proved in the previous section:

q(λ*, μ*) = inf_x L(x, λ*, μ*) <= L(x*, λ*, μ*) <= f(x*) = p*        (4)

When the duality gap vanishes, all the inequality signs in the middle become equalities:

q(λ*, μ*) = inf_x L(x, λ*, μ*) = L(x*, λ*, μ*) = f(x*) = p*        (5)

Note that both λ and μ in (5) carry asterisks, indicating that they are the optimal points of the dual problem. Two important equalities have appeared in (5): equality 1, inf_x L(x, λ*, μ*) = L(x*, λ*, μ*), and equality 2, L(x*, λ*, μ*) = f(x*).
What conclusions can we draw?
1. First look at equality 1:

inf_x L(x, λ*, μ*) = L(x*, λ*, μ*).

It shows that the optimal point x* of the original problem is exactly the point at which L(x, λ*, μ*) attains its minimum over x.
2. Now look at equality 2:

L(x*, λ*, μ*) = f(x*).

It says the following:

Σ_i λ*_i g_i(x*) + Σ_j μ*_j h_j(x*) = 0, and since every h_j(x*) = 0, this reduces to Σ_i λ*_i g_i(x*) = 0.

Since we have required every λ_i >= 0, and g_i(x*) <= 0 at the feasible point x*, every term λ*_i g_i(x*) in this sum is nonpositive. A sum of nonpositive terms can equal zero only if each term is zero, so we conclude:

λ*_i g_i(x*) = 0, i = 1, ..., k        (6)

Equation (6) is called the complementary (slackness) condition. We can put it another way:

λ*_i > 0  ⇒  g_i(x*) = 0,

or write it in the equivalent contrapositive form:

g_i(x*) < 0  ⇒  λ*_i = 0.

In other words, whenever one of the two is nonzero, the other must be 0!
The complementary condition is highly significant. It shows that when g_i(x*) < 0, the point x* lies in the interior of the region allowed by that constraint; the inequality constraint is inactive, and λ*_i = 0. Conversely, when λ*_i > 0, x* must be a boundary point of that constraint (g_i(x*) = 0). In short: only active constraints have nonzero dual variables. And this is of great significance for support vector machines. Recall the final conclusion of the first section: finding the maximum-margin hyperplane of a support vector machine reduces to the optimization problem:

Objective function: minimize (1/2) ||w||^2

Constraints: d_i (w^T x_i + b) >= 1, i = 1, 2, ..., n

So which inequality constraints have nonzero dual variables? Clearly, only a constraint satisfied with equality, d_i (w^T x_i + b) = 1, can have a nonzero dual variable. What does that mean? It means the sample point x_i of that constraint is a support vector! In other words:

Only the support vectors correspond to nonzero Lagrange multipliers!
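A hand-solvable illustration, assuming the made-up 1-D training set x = -1 (d = -1), x = +1 (d = +1), x = +3 (d = +1): the optimal separating "hyperplane" is w = 1, b = 0; the first two points satisfy their constraints with equality (they are the support vectors), and the multipliers α = (1/2, 1/2, 0), worked out by hand from the standard stationarity conditions w = Σ α_i d_i x_i and Σ α_i d_i = 0, satisfy complementary slackness:

```python
# Made-up 1-D training set: (x_i, d_i)
samples = [(-1.0, -1), (1.0, +1), (3.0, +1)]
alphas  = [0.5, 0.5, 0.0]          # dual optimum, worked out by hand
w, b = 1.0, 0.0                    # optimal separating "hyperplane" in 1-D

# Stationarity: w = sum(alpha_i * d_i * x_i) and sum(alpha_i * d_i) = 0
print(sum(a * d * x for (x, d), a in zip(samples, alphas)))  # 1.0 == w
print(sum(a * d for (x, d), a in zip(samples, alphas)))      # 0.0

# Complementary slackness: alpha_i * (d_i*(w*x_i + b) - 1) == 0 for every i
for (x, d), a in zip(samples, alphas):
    slack = d * (w * x + b) - 1.0
    print(x, a, slack, a * slack)
```

The point at x = 3 has positive slack and therefore a zero multiplier; removing it would not move the hyperplane, exactly as claimed in the first section.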