The detailed derivation process and annotations of support vector machine (SVM)
Reposted from: http://my.oschina.net/wangguolongnk/blog/111353
The principle of the support vector machine is simple: it rests on VC-dimension theory and structural risk minimization. Yet many of the related papers are vague on the details; even "A Tutorial on Support Vector Machines for Pattern Recognition" passes over the dual transformation of the Lagrangian constrained-extremum problem in a single stroke, which leaves many readers confused. Below I give a detailed derivation of the SVM in the linearly separable case.
As shown in the figure, suppose we have a set of positive and negative training samples $(x_i, y_i)$, $i = 1, \dots, n$, with labels $y_i \in \{+1, -1\}$. Suppose there is a hyperplane H: $w \cdot x + b = 0$ that separates these samples correctly, and two hyperplanes H1 and H2 parallel to H:

H1: $w \cdot x + b = +1$,  H2: $w \cdot x + b = -1$.
The positive and negative samples closest to H fall exactly on H1 and H2 respectively; these samples are the support vectors. All other training samples lie outside the band between H1 and H2, i.e. the following constraints are satisfied:

$w \cdot x_i + b \ge +1$ for $y_i = +1$,
$w \cdot x_i + b \le -1$ for $y_i = -1$.
These two cases can be written as a single formula:

$y_i (w \cdot x_i + b) - 1 \ge 0, \quad i = 1, \dots, n$    (1)
The distance between H1 and H2 is

$d(H_1, H_2) = \dfrac{2}{\|w\|}.$
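One way to see this (a short derivation added here; it is not spelled out in the original) uses the standard distance from a point to a hyperplane:

% The distance from a point x_0 to the hyperplane w . x + b = 0 is
% |w . x_0 + b| / ||w||. A point on H1 satisfies w . x + b = +1 and a point on
% H2 satisfies w . x + b = -1, so each lies at distance 1/||w|| from H, on
% opposite sides, and the gap between H1 and H2 is
\[
d(H_1, H_2) = \frac{1}{\|w\|} + \frac{1}{\|w\|} = \frac{2}{\|w\|}.
\]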
The task of the SVM is to find a hyperplane H that separates the samples into two classes without error while making the distance between H1 and H2 as large as possible. Maximizing the margin $2/\|w\|$ is equivalent to minimizing $\frac{1}{2}\|w\|^2$, so we can construct the following constrained extremum problem:
$\min_{w,\,b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) - 1 \ge 0, \ i = 1, \dots, n$    (2)
A constrained extremum problem with inequality constraints can be solved by the Lagrange method: multiply each constraint by a non-negative Lagrange multiplier and subtract the result from the objective function. The Lagrangian obtained in this way is:
$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right]$    (3)
where

$\alpha_i \ge 0, \quad i = 1, \dots, n.$    (4)
The optimization problem we are dealing with then becomes:
$\min_{w,\,b} \ \max_{\alpha \ge 0} \ L(w, b, \alpha)$    (5)
Formula (5) is the Lagrangian expression of the constrained extremum problem with inequality constraints. Many articles say little about this step of the transformation, or get it slightly wrong, which then confuses the reader in the later parts of the derivation. Here I will work through it step by step.
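Before continuing, here is a standard argument (added here for completeness; it is not spelled out in the original) for why (2) and (5) describe the same problem. Fix w and b and look at the inner maximization over $\alpha$:

\[
\max_{\alpha \ge 0} L(w, b, \alpha) =
\begin{cases}
\frac{1}{2}\|w\|^2, & \text{if } y_i (w \cdot x_i + b) - 1 \ge 0 \text{ for all } i,\\
+\infty, & \text{otherwise.}
\end{cases}
\]
% If every constraint holds, each term alpha_i [ y_i (w . x_i + b) - 1 ] is
% non-negative, so the maximum over alpha >= 0 is attained at alpha = 0 and
% equals (1/2)||w||^2. If some constraint is violated, the corresponding
% alpha_i can grow without bound and L tends to +infinity. Minimizing this
% over (w, b) therefore rules out infeasible points and reproduces (2).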
Formula (5) is a convex programming problem. Its meaning is: first take the partial derivative of L with respect to $\alpha$ and set it to 0 to eliminate $\alpha$, then minimize L over w and b. Solving (5) directly in this way is difficult, because eliminating the Lagrange multipliers and simplifying the resulting equations gets us nowhere on this problem. Fortunately, the problem can be handled through Lagrange duality, for which we make an equivalent transformation of (5):

$\min_{w,\,b} \ \max_{\alpha \ge 0} \ L(w, b, \alpha) \;=\; \max_{\alpha \ge 0} \ \min_{w,\,b} \ L(w, b, \alpha)$
The above is the dual transformation: it turns the convex programming problem into its dual problem:
$\max_{\alpha \ge 0} \ \min_{w,\,b} \ L(w, b, \alpha)$    (6)
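A note added here on why swapping min and max is legitimate (the original takes this step as given). In general only weak duality holds,

\[
\max_{\alpha \ge 0} \ \min_{w,\,b} \ L(w, b, \alpha) \;\le\; \min_{w,\,b} \ \max_{\alpha \ge 0} \ L(w, b, \alpha),
\]
% but here the objective (1/2)||w||^2 is convex, the constraints are affine in
% (w, b), and a separating hyperplane exists by assumption, so strong duality
% holds: the inequality becomes an equality and solving (6) gives the same
% optimum as (5).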
The meaning of (6) is: the original convex programming problem is converted into one where we first take the partial derivatives of L with respect to w and b and set them to 0 to eliminate w and b, and then maximize L over $\alpha$. We now solve formula (6), for which we first compute the partial derivatives with respect to w and b. From formula (3):
$\dfrac{\partial L}{\partial w} = w - \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \dfrac{\partial L}{\partial b} = -\sum_{i=1}^{n} \alpha_i y_i$    (7)
For L to attain its minimum over w and b, both partial derivatives in (7) must be 0, which gives:
$w = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0$    (8)
Substituting (8) back into (3), we get:
$L = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j)$    (9)
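The intermediate algebra, which is usually skipped, is spelled out here for clarity; it follows from expanding (3) with $w = \sum_i \alpha_i y_i x_i$ and using $\sum_i \alpha_i y_i = 0$:

\[
\begin{aligned}
L &= \tfrac{1}{2}\, w \cdot w \;-\; \sum_{i} \alpha_i y_i \, (w \cdot x_i) \;-\; b \sum_{i} \alpha_i y_i \;+\; \sum_{i} \alpha_i \\
  &= \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
     \;-\; \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
     \;-\; 0 \;+\; \sum_{i} \alpha_i \\
  &= \sum_{i} \alpha_i \;-\; \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j).
\end{aligned}
\]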
Substituting (9) into (6) then gives:
$\max_{\alpha \ge 0} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j)$    (10)
Taking the second equation of (8) into account as a constraint, our dual problem becomes:
$\max_{\alpha} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j)$
$\text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad \alpha_i \ge 0, \ i = 1, \dots, n$    (11)
This quadratic programming problem can be solved directly by numerical methods.
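As a minimal sketch (not part of the original article), here is one way to solve the dual problem (11) numerically with SciPy's general-purpose SLSQP solver on a tiny hand-made 2-D dataset; the data and variable names are illustrative only, and in practice a dedicated QP or SMO solver would be used.

import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data: two samples per class (illustrative only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Q[i, j] = y_i * y_j * (x_i . x_j), the matrix of the quadratic term in (11).
Q = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(alpha):
    # Negative of the dual objective in (11); minimizing it maximizes the dual.
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

constraints = {"type": "eq", "fun": lambda a: a @ y}   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * n                             # alpha_i >= 0

res = minimize(neg_dual, x0=np.zeros(n), bounds=bounds, constraints=constraints)
alpha = res.x
print("alpha =", np.round(alpha, 4))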
One point to note: transforming the constrained extremum problem (2) into the convex programming problem (5) implies an additional constraint, namely:
$\alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right] = 0, \quad i = 1, \dots, n$    (12)
This constraint is derived as follows. If (2) and (5) are equivalent, then at the solution we must have

$\max_{\alpha \ge 0} L(w, b, \alpha) = \frac{1}{2}\|w\|^2.$
Substituting (3) into this equality gives

$\max_{\alpha \ge 0} \left\{ \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right] \right\} = \frac{1}{2}\|w\|^2.$
Simplifying, the extra term must vanish at the maximizing $\alpha$:

$\sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right] = 0$    (13)
Also, because of constraints (1) and (4), every term in this sum satisfies

$\alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right] \ge 0, \quad i = 1, \dots, n.$
Therefore, for (13) to hold, every term must be 0, i.e. $\alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right] = 0$ for each i, which is exactly constraint (12). The implication of this constraint is that if a sample is a support vector, its Lagrange coefficient is non-zero; if a sample is not a support vector, its Lagrange coefficient must be 0. The majority of the Lagrange coefficients are therefore 0.
Once we have solved (11) for all the Lagrange coefficients, we can compute the normal vector w of the optimal separating hyperplane H through (8). The threshold b can then be computed from any support vector using constraint (12). With that we have found the optimal H, H1 and H2, and this is the trained SVM.
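Continuing the numerical sketch above (again illustrative, not from the original article), the recovery of w and b from the solved $\alpha$ looks like this:

sv = alpha > 1e-6                   # support vectors: the alpha_i that are (numerically) non-zero
w = (alpha * y) @ X                 # w = sum_i alpha_i y_i x_i, formula (8)
b = np.mean(y[sv] - X[sv] @ w)      # from (12): y_s (w . x_s + b) = 1  =>  b = y_s - w . x_s
print("w =", w, " b =", b)
print("y_i (w . x_i + b) =", y * (X @ w + b))   # support vectors should give exactly 1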