The optimal margin classifier problem
Outline of this lecture:
1. Optimal margin classifier
2. Primal optimization problem & dual optimization problem (KKT conditions)
3. The dual problem of the SVM
4. Kernel methods (next lecture)
Review:
Notation changes for the support vector machine:
Labels: y ∈ {-1, +1}
The hypothesis h likewise outputs values in {-1, +1}.
g(z) = 1 if z >= 0, and g(z) = -1 if z < 0
h_{w,b}(x) = g(w^T x + b), where b plays the role of the original θ_0 and w corresponds to the remaining parameters θ_1, ..., θ_n, so w is n-dimensional. Writing the intercept b out separately makes the derivation of the support vector machine more convenient.
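As an illustration only, here is a minimal Python sketch of this hypothesis; the function names and example values are hypothetical and not part of the original notes:

```python
import numpy as np

def g(z):
    """g(z) = 1 if z >= 0, else -1."""
    return 1 if z >= 0 else -1

def h(w, b, x):
    """Hypothesis h_{w,b}(x) = g(w^T x + b), with the intercept b kept separate from w."""
    return g(np.dot(w, x) + b)

# Hypothetical example values, not taken from the notes.
print(h(np.array([1.0, -2.0]), 0.5, np.array([3.0, 1.0])))  # prints 1
```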
Functional margin:
The functional margin of a hyperplane (w, b) with respect to a single training example (x^{(i)}, y^{(i)}) is defined as:
γ̂^{(i)} = y^{(i)} (w^T x^{(i)} + b)
The parameters (w, b) define a classifier, i.e. a linear decision boundary.
If y^{(i)} = 1, then to obtain a large functional margin we need w^T x^{(i)} + b >> 0;
If y^{(i)} = -1, then to obtain a large functional margin we need w^T x^{(i)} + b << 0.
If y^{(i)} (w^T x^{(i)} + b) > 0, the example is classified correctly.
The functional margin of a hyperplane (w, b) with respect to the entire training set is defined as:
γ̂ = min_i γ̂^{(i)}
That is, the functional margin with respect to the whole training set is the worst case over the functional margins of the individual examples (and, as noted above, the farther the decision boundary is from the examples, the better).
Geometric margin:
The geometric margin is defined as:
γ^{(i)} = y^{(i)} ( (w / ||w||)^T x^{(i)} + b / ||w|| )
This definition is similar to the functional margin; the difference is that the vector w is normalized. Again, we want the geometric margin to be as large as possible.
Conclusion: if ||w|| = 1, the functional margin equals the geometric margin. More generally, the geometric margin equals the functional margin divided by ||w||.
The geometric margin of a hyperplane (w, b) with respect to the entire training set is defined as:
γ = min_i γ^{(i)}
As with the functional margin, we take the smallest geometric margin over the training examples.
Property: w and b can be rescaled arbitrarily, because scaling (w, b) does not change the hyperplane w^T x + b = 0. This property will be very convenient in the discussion that follows.
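A minimal Python sketch of both margin definitions, on hypothetical data (the helper names and numbers are assumptions added for illustration):

```python
import numpy as np

def functional_margins(w, b, X, y):
    """gamma_hat_i = y_i * (w^T x_i + b) for every training example."""
    return y * (X @ w + b)

def geometric_margins(w, b, X, y):
    """gamma_i = y_i * ((w/||w||)^T x_i + b/||w||) = functional margin / ||w||."""
    return functional_margins(w, b, X, y) / np.linalg.norm(w)

# Hypothetical toy data and parameters, not taken from the notes.
X = np.array([[2.0, 2.0], [-1.0, -3.0]])
y = np.array([1, -1])
w, b = np.array([1.0, 1.0]), 0.0

print(functional_margins(w, b, X, y).min())  # functional margin of the training set
print(geometric_margins(w, b, X, y).min())   # geometric margin of the training set
```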
1. Optimal margin classifier
The optimal margin classifier can be regarded as the predecessor of the support vector machine. It is a learning algorithm that chooses w and b to maximize the geometric margin. The optimal margin classifier is the solution of the following optimization problem:
max_{γ, w, b} γ
s.t. y^{(i)} (w^T x^{(i)} + b) >= γ for all i, ||w|| = 1
That is, choose γ, w, b to maximize γ subject to the condition that every training example has a geometric margin of at least γ. In other words, find a hyperplane that separates the positive and negative examples while making the distance from the hyperplane to the nearest positive and negative examples as large as possible.
Because w and b can be scaled arbitrarily, we impose the constraint ||w|| = 1, which makes the functional margin equal to the geometric margin. But this constraint is a very nasty non-convex one: it forces w to lie on the surface of a sphere. If we want a convex optimization problem, we need the guarantee that a local search method such as gradient descent cannot get stuck at a merely local optimum, and a non-convex constraint set cannot provide that guarantee, so the optimization problem must be reformulated.
To address this, we pose the following optimization problem instead:
max_{γ̂, w, b} γ̂ / ||w||
s.t. y^{(i)} (w^T x^{(i)} + b) >= γ̂ for all i
Here the functional margin divided by ||w|| is maximized, and this quantity is exactly the geometric margin, so this is just another way of writing the previous problem. The goal is still to maximize the geometric margin while ensuring that every functional margin is at least γ̂; at the optimum, γ̂ is bounded above by (and equals) the smallest functional margin over the training set.
Now add a scaling constraint on w and b: set γ̂ = 1, i.e. fix the functional margin to 1, so that the minimum value of y^{(i)}(w^T x^{(i)} + b) is 1. This is just an implicit rescaling of (w, b). Adding this constraint to the second optimization problem yields the final optimal margin classifier:
min_{w, b} (1/2) ||w||^2
s.t. y^{(i)} (w^T x^{(i)} + b) >= 1 for all i
This is a convex optimization problem with no spurious local optima, so the global optimum can be found by standard descent methods such as gradient descent.
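As a sketch only, the quadratic program above could be solved numerically with a generic convex solver; the snippet below assumes the cvxpy library and a hypothetical toy data set, neither of which appears in the original notes:

```python
import numpy as np
import cvxpy as cp

# Hypothetical, linearly separable toy data (an assumption for illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(X.shape[1])
b = cp.Variable()

# min (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1 for every training example
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
```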
2. Primal optimization problem & dual optimization problem (KKT conditions)
Primal problem
Suppose we want to solve a problem of the following form:
min_w f(w)
s.t. h_i(w) = 0, i = 1, ..., l
That is, minimize the function f(w) subject to the equality constraints h_i(w) = 0; the h_i can be written collectively as a vector equation h(w) = 0.
We construct the Lagrangian:
L(w, β) = f(w) + Σ_i β_i h_i(w)
That is, the original objective function plus a linear combination of the constraint functions, where the parameters β_i are called Lagrange multipliers.
Setting the partial derivatives to zero,
∂L/∂w_i = 0, ∂L/∂β_i = 0,
and solving these equations yields candidate solutions; we can then check whether a candidate solution is indeed a minimum.
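A tiny worked example of this procedure (the function and constraint are hypothetical, chosen only to illustrate the method): minimize f(w) = w_1^2 + w_2^2 subject to h(w) = w_1 + w_2 - 1 = 0.

```latex
L(w,\beta) = w_1^2 + w_2^2 + \beta\,(w_1 + w_2 - 1)
% Setting the partial derivatives to zero:
\frac{\partial L}{\partial w_1} = 2w_1 + \beta = 0, \quad
\frac{\partial L}{\partial w_2} = 2w_2 + \beta = 0, \quad
\frac{\partial L}{\partial \beta} = w_1 + w_2 - 1 = 0
% From the first two equations w_1 = w_2 = -\beta/2; substituting into the third gives
\beta = -1, \qquad w_1 = w_2 = \tfrac{1}{2}
```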
The general form of the Lagrange multiplier method, also known as the primal problem, is:
min_w f(w)
s.t. g_i(w) <= 0, i = 1, ..., k
     h_i(w) = 0, i = 1, ..., l
The (generalized) Lagrangian is now:
L(w, α, β) = f(w) + Σ_i α_i g_i(w) + Σ_i β_i h_i(w)
Here α and β are the Lagrange multipliers. Define:
θ_P(w) = max_{α, β : α_i >= 0} L(w, α, β)
The subscript P stands for the primal problem.
If w violates a constraint, i.e. g_i(w) > 0 or h_i(w) ≠ 0, then θ_P(w) = +∞.
The reasoning: if g_i(w) > 0, then letting the corresponding α_i go to +∞ makes θ_P(w) infinitely large; similarly, if h_i(w) ≠ 0, choosing β_i → +∞ (when h_i(w) > 0) or β_i → -∞ (when h_i(w) < 0) also makes θ_P(w) infinitely large.
Conversely, if w satisfies all the constraints, then θ_P(w) = f(w). Therefore:
θ_P(w) = f(w) if w satisfies the primal constraints, and +∞ otherwise.
Hence the constrained minimum of f(w) is the same as min_w θ_P(w) = min_w max_{α, β : α_i >= 0} L(w, α, β); its optimal value is defined as p*.
Dual problem
Define:
θ_D(α, β) = min_w L(w, α, β)
Maximizing this gives the dual optimization problem, whose optimal value is defined as d*:
d* = max_{α, β : α_i >= 0} θ_D(α, β) = max_{α, β : α_i >= 0} min_w L(w, α, β)
Comparing the formulas, the dual problem simply swaps the order of the min and max in the primal problem.
In general we have (weak duality):
d* = max_{α, β : α_i >= 0} min_w L(w, α, β) <= min_w max_{α, β : α_i >= 0} L(w, α, β) = p*
The primal and dual problems have the same optimal value (d* = p*) under the following conditions:
Suppose f and the g_i are convex (the Hessian of a convex function is positive semi-definite, H ⪰ 0; intuitively, a bowl-shaped function opening upward).
Suppose each h_i is affine, i.e. h_i(w) = a_i^T w + b_i.
Suppose the inequality constraints g_i are strictly feasible, i.e. there exists some w such that g_i(w) < 0 for all i.
Under these conditions there exist w*, α*, β* such that w* is a solution of the primal problem, (α*, β*) is a solution of the dual problem, and p* = d* = L(w*, α*, β*).
In addition, w*, α*, β* must satisfy the KKT conditions:
∂L/∂w_i (w*, α*, β*) = 0, i = 1, ..., n
∂L/∂β_i (w*, α*, β*) = 0, i = 1, ..., l
α_i* g_i(w*) = 0, i = 1, ..., k
g_i(w*) <= 0, i = 1, ..., k
α_i* >= 0, i = 1, ..., k
The condition α_i* g_i(w*) = 0 is known as the KKT dual complementarity condition.
Implication of the KKT complementarity condition:
If α_i* > 0 then g_i(w*) = 0; and in general, for most problems, α_i* > 0 ⟺ g_i(w*) = 0.
When g_i(w*) = 0, the constraint g_i is said to be an active constraint.
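A tiny hypothetical example of this complementarity behaviour (not from the notes): minimize f(w) = w^2 subject to g(w) = 1 - w <= 0.

```latex
L(w,\alpha) = w^2 + \alpha\,(1 - w), \qquad \alpha \ge 0
% Stationarity: \partial L/\partial w = 2w - \alpha = 0 \;\Rightarrow\; w = \alpha/2
% Complementarity: \alpha\,(1 - w) = 0
% \alpha = 0 would give w = 0, which violates g(0) = 1 > 0; hence \alpha > 0 and
1 - w^* = 0 \;\Rightarrow\; w^* = 1, \quad \alpha^* = 2
% The constraint g(w^*) = 0 is active and \alpha^* > 0, matching the rule above.
```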
3. The dual problem of the SVM
Optimal margin classifier:
min_{w, b} (1/2) ||w||^2
s.t. y^{(i)} (w^T x^{(i)} + b) >= 1 for all i
Rewrite the constraints as:
g_i(w, b) = 1 - y^{(i)} (w^T x^{(i)} + b) <= 0
By the KKT complementarity condition, α_i > 0 implies g_i(w, b) = 0, i.e. the constraint is active.
g_i(w, b) = 0 ⟺ y^{(i)} (w^T x^{(i)} + b) = 1, i.e. the functional margin is exactly 1.
Consider the following example:
In the figure, the circles and crosses are the positive and negative examples, and the solid line is the decision boundary determined by (w, b); the examples closest to it lie on the lines where the functional margin equals 1. Three of the examples have functional margin exactly 1, and the functional margins of all the other examples are greater than 1.
By the KKT conditions, the Lagrange multiplier of an example whose functional margin is 1 is generally nonzero, i.e. α_i > 0. An example with functional margin 1 is called a support vector. The number of support vectors is typically very small, so most of the α_i are 0; conversely, if α_i = 0, the corresponding example is not a support vector.
Construct the Lagrangian for the optimal margin optimization problem:
L(w, b, α) = (1/2) ||w||^2 - Σ_i α_i [ y^{(i)} (w^T x^{(i)} + b) - 1 ]
Since this problem has only inequality constraints, there are no β multipliers.
To minimize the Lagrangian, which is a function of w and b, set the partial derivatives with respect to w and b to zero.
This gives:
∇_w L = w - Σ_i α_i y^{(i)} x^{(i)} = 0, so w = Σ_i α_i y^{(i)} x^{(i)}
So w is a linear combination of the input feature vectors. Taking the partial derivative with respect to b:
∂L/∂b = Σ_i α_i y^{(i)} = 0
Substituting this expression for w back into the Lagrangian (using ||w||^2 = w^T w) and simplifying gives:
L(w, b, α) = Σ_i α_i - (1/2) Σ_{i,j} y^{(i)} y^{(j)} α_i α_j ⟨x^{(i)}, x^{(j)}⟩ - b Σ_i α_i y^{(i)}
By the result of the partial derivative with respect to b, the last term is zero, which leaves:
W(α) = Σ_i α_i - (1/2) Σ_{i,j} y^{(i)} y^{(j)} α_i α_j ⟨x^{(i)}, x^{(j)}⟩
Writing this expression as W(α), the dual problem is:
max_α W(α)
s.t. α_i >= 0, i = 1, ..., m
     Σ_i α_i y^{(i)} = 0
The whole optimization is now over α through W(α); b no longer appears in it and is treated as a separate parameter to be recovered afterwards.
To solve the dual problem, we find the optimal parameters α*. Given α*, we can recover w* from the formula above, w = Σ_i α_i y^{(i)} x^{(i)}. With α* and w* in hand, b* is also easy to find: w* determines the orientation (slope) of the hyperplane, so substituting α* and w* back into the primal problem and using the optimal margin gives b* as follows:
b* = - ( max_{i : y^{(i)} = -1} w*^T x^{(i)} + min_{i : y^{(i)} = 1} w*^T x^{(i)} ) / 2
The intuitive interpretation of this formula: find the worst-case examples (the closest positive and the closest negative examples) and, based on their positions, place the hyperplane halfway between them.
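A minimal sketch of this whole recipe, again assuming cvxpy and hypothetical toy data (the quadratic term of W(α) is rewritten as a squared norm so the solver accepts it; none of this code is from the original notes):

```python
import numpy as np
import cvxpy as cp

# Hypothetical, linearly separable toy data (an assumption for illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

alpha = cp.Variable(m)

# W(alpha) = sum_i alpha_i - 1/2 sum_{i,j} y_i y_j alpha_i alpha_j <x_i, x_j>;
# the quadratic term equals (1/2) || sum_i alpha_i y_i x_i ||^2.
W = cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(cp.Maximize(W), constraints).solve()

a = alpha.value
w = X.T @ (a * y)                 # w = sum_i alpha_i y_i x_i
sv = a > 1e-6                     # support vectors: alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)    # from y_i (w^T x_i + b) = 1 on the support vectors
print("alpha =", a, "\nw =", w, "\nb =", b)
```

On separable data, the (w, b) recovered this way should agree, up to numerical tolerance, with the primal solution sketched earlier.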
4. Kernel methods
The entire algorithm above can be expressed in terms of inner products:
w^T x + b = Σ_i α_i y^{(i)} ⟨x^{(i)}, x⟩ + b
Here ⟨x^{(i)}, x⟩ = x^{(i)T} x is the inner product between the new input x and the training example x^{(i)}.
In SVM feature spaces, the dimension of the training examples may be very high, yet the kernel method can compute these inner products efficiently, although only for certain specific feature spaces.
Examining the whole SVM computation, none of the steps needs to work with x^{(i)} directly; every result can be obtained by computing inner products of feature vectors, and this is why kernel methods are introduced.
Another property of the algorithm: since the examples with functional margin 1 are only a small fraction of the training set, most α_i = 0, so computing w requires only a small amount of work.
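A short hypothetical sketch of prediction written purely in terms of inner products with the support vectors (the function name and arguments are assumptions for illustration):

```python
import numpy as np

def svm_predict(x, X_sv, y_sv, alpha_sv, b):
    """Classify x using only inner products with the support vectors:
    w^T x + b = sum_i alpha_i y_i <x_i, x> + b."""
    score = np.sum(alpha_sv * y_sv * (X_sv @ x)) + b
    return 1 if score >= 0 else -1
```

The new input x enters only through the inner products ⟨x^{(i)}, x⟩, which is exactly the property the kernel method exploits.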
Machine Learning (Stanford): Lecture Notes 7 - The Optimal Margin Classifier Problem