[Note] A study of the SMO algorithm in support vector machines (SVM), Part 1: Theoretical summary


1. Preface

I have recently been reviewing support vector machines (SVM) again. Personally, I feel that SVM can be divided into three parts:

1. SVM theory itself: including the maximum-margin hyperplane (maximum margin classifier), Lagrange duality, support vectors, the introduction of kernel functions, soft-margin optimization with slack variables (for outliers), sequential minimal optimization (SMO), and so on.

2. Kernel methods: in fact, kernel methods developed independently of SVM, and they are applied in many other algorithms as well.

3. Optimization theory: here mainly sequential minimal optimization (SMO) is introduced; the development of optimization theory is likewise independent of SVM.

2. SVM Theory Foundation

The theoretical basis of SVM can be found in the summary in the previous blog post: SVM Summary of Support Vector Machines.

A brief summary of the components of support vector machines (SVM):

1. Maximum Margin Classifier

2. Lagrange duality

3. Support Vector

4. Kernel

5. Outliers

6. Sequential Minimal optimization

Personally, I think SMO can be divided into two parts:

(1) How to select the working set at each iteration, that is, which two Lagrange multipliers to optimize.

(2) How to update the selected working set (the two Lagrange multipliers) in each iteration.

3. Initial version of SMO (Platt, 1998)

  

SMO solves the following convex quadratic programming problem:

$$\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \ \sum_{i=1}^{n}\alpha_i y_i = 0$$

Here $C$ is a very important parameter: it essentially trades off the empirical risk against the confidence (structural) risk. The larger $C$ is, the larger the confidence risk and the smaller the empirical risk, and all the Lagrange multipliers are confined to a box whose side length is $C$. The existence of SMO makes it unnecessary to resort to expensive third-party solvers for this convex quadratic programming problem. SMO has since been improved considerably; this section first introduces its original form and ideas.

SMO was proposed by John C. Platt of Microsoft Research in "Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines". Its basic idea is to push the chunking method proposed by Vapnik in 1982 to the extreme: the original problem is decomposed into a series of smallest-possible convex quadratic programming subproblems, and solving them yields the solution of the original problem. Each iteration optimizes a working set of only 2 points: the SMO algorithm picks two Lagrange multipliers, fixes all the other multipliers, and finds the optimal values of these two multipliers, repeating until the stopping condition is reached.

(1) KKT conditions

SMO is based on the KKT conditions of the C-SVC, namely the following:

$$\alpha_i = 0 \;\Rightarrow\; y_i f(x_i) \ge 1, \qquad 0 < \alpha_i < C \;\Rightarrow\; y_i f(x_i) = 1, \qquad \alpha_i = C \;\Rightarrow\; y_i f(x_i) \le 1$$

where $f(x) = \sum_j \alpha_j y_j K(x_j, x) + b$.

In fact, the above conditions are the KKT complementarity conditions; as in the article SVM Learning – Soft Margin Optimization, a similar conclusion holds:

$$\alpha_i\big(y_i f(x_i) - 1 + \xi_i\big) = 0, \qquad (C - \alpha_i)\,\xi_i = 0$$

The information that can be read off from these formulas is: when $\alpha_i = C$, the slack variable $\xi_i$ may be nonzero, and if $\xi_i > 1$ the corresponding sample point is misclassified; when $\alpha_i = 0$, the slack variable is zero and the corresponding sample point is an interior point, i.e., it is classified correctly and lies away from the maximum-margin hyperplane; when $0 < \alpha_i < C$, the corresponding sample point is a support vector lying exactly on the margin.
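To make these three cases concrete, here is a small sketch (not from the original post) of how the C-SVC KKT conditions could be checked within a tolerance; the array names and the helper `kkt_violations` are illustrative assumptions.

```python
import numpy as np

def kkt_violations(alpha, y, f, C, tol=1e-3):
    """Return a boolean mask marking samples that violate the C-SVC KKT
    conditions by more than `tol`.

    alpha : (n,) Lagrange multipliers
    y     : (n,) labels in {-1, +1}
    f     : (n,) decision values f(x_i) = sum_j alpha_j y_j K(x_j, x_i) + b
    """
    yf = y * f                                        # functional margin y_i f(x_i)
    viol = np.zeros_like(alpha, dtype=bool)
    # alpha_i = 0      should imply  y_i f(x_i) >= 1
    viol |= (alpha <= 0) & (yf < 1 - tol)
    # 0 < alpha_i < C  should imply  y_i f(x_i) == 1
    viol |= (alpha > 0) & (alpha < C) & (np.abs(yf - 1) > tol)
    # alpha_i = C      should imply  y_i f(x_i) <= 1
    viol |= (alpha >= C) & (yf > 1 + tol)
    return viol
```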

(2) Stopping conditions for the convex optimization problem

For convex optimization problems, an appropriate stopping condition is always needed in the implementation to end the optimization process. The stopping condition can be:

1. Monitor the growth rate of the objective function and stop training when it falls below some tolerance. This condition is the most straightforward and simple, but its effect is not good;

2. Monitor the KKT conditions of the primal problem. For convex optimization they are necessary and sufficient conditions for optimality, but because the KKT conditions themselves are rather strict, a tolerance also has to be set, i.e., training can end once every sample satisfies the KKT conditions within that tolerance;

3. Monitor the feasibility gap, i.e., the gap between the primal objective value and the dual objective value; for a convex quadratic program this gap is zero at the optimum. Taking the 1-norm soft margin as an example, the difference between the primal objective and the dual objective is computed, a ratio based on this gap is defined, and the ratio falling below a tolerable value is used as the stopping condition.

(3) The idea of SMO

Following the decomposition idea, the size of the "chunking" working set is fixed at 2: each iteration optimizes only a minimal subset of two points, for which the analytic solution can be obtained directly. The algorithm flow:

(4) The analytic solution of the subproblem containing only two Lagrange multipliers

For convenience of description, define the following notation: let $\alpha_1, \alpha_2$ be the two selected multipliers, $K_{ij} = K(x_i, x_j)$, and let $E_i = f(x_i) - y_i$ denote the prediction error on sample $i$. With all other multipliers fixed, the objective function becomes a function of $\alpha_1$ and $\alpha_2$ only.

Note the first (equality) constraint $\sum_i \alpha_i y_i = 0$: the contribution of the fixed multipliers can be treated as a constant, so $\alpha_1 y_1 + \alpha_2 y_2 = \zeta$ for some constant $\zeta$ whose value we do not care about. Multiplying both sides of this equation by $y_1$ expresses $\alpha_1$ in terms of $\alpha_2$. Substituting this back into the objective yields an extremum problem in the single variable $\alpha_2$.

This problem is simple: take the derivative with respect to $\alpha_2$ and set it to zero. Substituting in $E_i = f(x_i) - y_i$ for the error term (note that even when a sample is classified correctly, $|E_i|$ can be large) and $\eta = K_{11} + K_{22} - 2K_{12}$, where $K(\cdot,\cdot)$ is the kernel induced by the mapping from the input space to the feature space and can be viewed as a measure of the similarity between two samples (in other words, once a kernel function is chosen, a notion of similarity on the input space has been defined), we finally obtain the iterative formula:

$$\alpha_2^{new} = \alpha_2^{old} + \frac{y_2\,(E_1 - E_2)}{\eta}$$

Note the second constraint, the box $0 \le \alpha_i \le C$: the new $\alpha_2$ must also fall into this box. Taking both constraints into account, more intuitively:

the case $y_1 \ne y_2$ (opposite signs): $\alpha_1 - \alpha_2$ is constant;

the case $y_1 = y_2$ (same sign): $\alpha_1 + \alpha_2$ is constant.

The two multipliers are therefore located inside a box with side length $C$ and on the corresponding line, so the following bounds hold for $\alpha_2$:

$$y_1 \ne y_2:\quad L = \max(0,\ \alpha_2^{old} - \alpha_1^{old}), \qquad H = \min(C,\ C + \alpha_2^{old} - \alpha_1^{old})$$

$$y_1 = y_2:\quad L = \max(0,\ \alpha_1^{old} + \alpha_2^{old} - C), \qquad H = \min(C,\ \alpha_1^{old} + \alpha_2^{old})$$

Clipping the unconstrained solution to $[L, H]$ gives $\alpha_2^{new,clipped}$. And because $\alpha_1 y_1 + \alpha_2 y_2$ remains constant, eliminating gives:

$$\alpha_1^{new} = \alpha_1^{old} + y_1 y_2\,(\alpha_2^{old} - \alpha_2^{new,clipped})$$
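Putting the update and the clipping together, a minimal sketch of one pair update (illustration only, not Platt's original pseudocode; `K11`, `K22`, `K12` are kernel values and `E1`, `E2` the errors defined above):

```python
def smo_step(a1, a2, y1, y2, E1, E2, K11, K22, K12, C):
    """One analytic SMO update of the pair (alpha_1, alpha_2).

    Returns the clipped new values, or the old values when no progress
    can be made from this pair.
    """
    # Bounds L, H from the box 0 <= alpha <= C and the equality constraint.
    if y1 != y2:
        L, H = max(0.0, a2 - a1), min(C, C + a2 - a1)
    else:
        L, H = max(0.0, a1 + a2 - C), min(C, a1 + a2)
    if L >= H:
        return a1, a2

    eta = K11 + K22 - 2.0 * K12       # curvature along the constraint line
    if eta <= 0:
        return a1, a2                 # boundary case, discussed in section (6) below

    a2_new = a2 + y2 * (E1 - E2) / eta
    a2_new = min(max(a2_new, L), H)               # clip into [L, H]
    a1_new = a1 + y1 * y2 * (a2 - a2_new)         # keep the equality constraint
    return a1_new, a2_new
```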

(5) Heuristic selection method

Depending on the chosen stopping condition, one can determine which selection of points contributes most to the convergence of the algorithm. For example, with the method of monitoring the feasibility gap, one of the most straightforward choices is to first optimize those points that most violate the KKT conditions, where "violating the KKT conditions" refers to:

From stopping condition 3 above, it is known that the points that contribute most to the feasibility gap are those with

where

Points with a high value of this quantity make the feasibility gap larger and should be optimized first, for the following reasons:

1. When the KKT condition is satisfied:

When the KKT condition is violated:

it can be seen that the feasibility gap becomes larger because the KKT condition is violated;

2. When the KKT condition is satisfied:

When the KKT condition is violated:

(in either of the two possible sub-cases)

it can be seen that the feasibility gap becomes larger because the KKT condition is violated;

3. When the KKT condition is satisfied:

When the KKT condition is violated:

it can be seen that the feasibility gap becomes larger because the KKT condition is violated.

The heuristic selection for SMO has two strategies:

Heuristic Selection 1:

The outermost loop: first, select a multiplier that violates the KKT conditions from all samples as the first multiplier, select the second multiplier with "Heuristic Selection 2", and optimize the pair; then repeatedly select a multiplier that violates the KKT conditions from the non-bound samples only ($0 < \alpha < C$) as the first multiplier, again choosing the partner with "Heuristic Selection 2" and optimizing the pair (restricting attention to non-bound samples raises the chances of finding a point that violates the KKT conditions); finally, when no non-bound sample violates the KKT conditions, sweep over all samples again. This continues until no multiplier changes in a full pass over all samples or some other stopping condition is met.
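A rough sketch of this outer loop (the name `examine_example` is a placeholder, not from the original post; it stands for the inner step that applies Heuristic Selection 2 and attempts the pair update):

```python
def smo_outer_loop(examine_example, n_samples, alpha, C):
    """Sketch of the outer loop of Heuristic Selection 1 (Platt-style).

    `examine_example(i)` is assumed to pick a partner for sample i,
    attempt the joint update, and return 1 if progress was made.
    """
    num_changed = 0
    examine_all = True
    while num_changed > 0 or examine_all:
        num_changed = 0
        if examine_all:
            # full sweep over every sample
            for i in range(n_samples):
                num_changed += examine_example(i)
        else:
            # sweep only over non-bound multipliers (0 < alpha_i < C),
            # where KKT violations are most likely to appear
            for i in range(n_samples):
                if 0 < alpha[i] < C:
                    num_changed += examine_example(i)
        if examine_all:
            examine_all = False      # after a full sweep, switch to non-bound sweeps
        elif num_changed == 0:
            examine_all = True       # no progress on non-bound points: sweep everything again
    return alpha
```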

Heuristic Selection 2:

The selection criterion for the inner loop follows from the update formula above: to speed up convergence, the second multiplier should be chosen so that the step taken in this iteration is as large as possible. Since evaluating the true step size for every candidate is expensive, the usual approximation is simply to make $|E_1 - E_2|$ as large as possible.

The second multiplier is determined as follows:

1. First search among the non-bound multipliers ($0 < \alpha < C$) for the sample that maximizes $|E_1 - E_2|$;
2. If no progress can be made in step 1, search the non-bound samples starting from a random position;
3. If no progress can be made in step 2, search the entire sample set starting from a random position (including both bound and non-bound multipliers).
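A small sketch of this selection (illustrative only; the full fallback cascade over random starting positions is only hinted at):

```python
import numpy as np

def select_second(i, E, alpha, C, rng=None):
    """Heuristic Selection 2 (sketch): given the first index i, prefer the
    non-bound sample j maximizing |E_i - E_j|; otherwise fall back to a
    random candidate."""
    if rng is None:
        rng = np.random.default_rng()
    non_bound = np.where((alpha > 0) & (alpha < C))[0]
    non_bound = non_bound[non_bound != i]
    if non_bound.size > 0:
        # step 1: maximize |E_i - E_j| over the non-bound multipliers
        return int(non_bound[np.argmax(np.abs(E[i] - E[non_bound]))])
    # steps 2-3: pick a random other sample as the starting point
    candidates = np.delete(np.arange(alpha.size), i)
    return int(rng.choice(candidates))
```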

(6) Notes on optimizing the two multipliers

From the update formula it can be seen that the second derivative of this single-variable quadratic objective with respect to $\alpha_2$ is $-\eta$, so:

if the second derivative is negative ($\eta > 0$), the quadratic opens downward and the multiplier can be updated with the iterative formula above; if $\eta \le 0$, the quadratic opens upward (or is flat) and the objective can only attain its extremum on the boundary of the feasible interval. In other words, for SMO to be able to handle any value of $\eta$, the following cases are distinguished:

1. when the objective value at $\alpha_2 = L$ exceeds the objective value at $\alpha_2 = H$ by more than the precision $\varepsilon$: take $\alpha_2^{new} = L$;

2. when the objective value at $\alpha_2 = H$ exceeds the objective value at $\alpha_2 = L$ by more than $\varepsilon$: take $\alpha_2^{new} = H$;

3. otherwise, $\alpha_2$ is left unchanged.

That is, the multiplier is set to each of the two endpoints $L$ and $H$ and the objective value is evaluated in both cases; the multiplier moves to whichever endpoint gives the larger objective value, and if the difference between the two objective values is within the specified precision, this optimization step makes no progress.

In addition, each iteration needs to compute the model output $f(x_i)$ to obtain $E_i$, so the threshold $b$ must also be updated so that the new multipliers satisfy the KKT conditions. Considering that at least one of the two multipliers should end up strictly inside the bounds (in $(0, C)$), the corresponding sample must satisfy $y_i f(x_i) = 1$, and from the iteration we obtain:

1. If $\alpha_1^{new}$ lies strictly inside the bounds, then $y_1 f(x_1) = 1$ must hold. And because $f(x_1)$ can be written in terms of the updated multipliers, substituting, multiplying both sides by $y_1$ and moving terms gives:

$$b_1^{new} = b^{old} - E_1 - y_1(\alpha_1^{new} - \alpha_1^{old})K_{11} - y_2(\alpha_2^{new} - \alpha_2^{old})K_{12}$$

2. If $\alpha_2^{new}$ lies strictly inside the bounds, then similarly:

$$b_2^{new} = b^{old} - E_2 - y_1(\alpha_1^{new} - \alpha_1^{old})K_{12} - y_2(\alpha_2^{new} - \alpha_2^{old})K_{22}$$

3. If both new multipliers lie strictly inside the bounds, then the values from case 1 and case 2 are equal and either one can be taken;

4. If both new multipliers are at the bounds, then any value between the values from case 1 and case 2 is acceptable (commonly their average is used).
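A sketch of the threshold update just described (assuming the convention $f(x) = \sum_j \alpha_j y_j K(x_j, x) + b$ used above; not Platt's original code):

```python
def update_b(b, E1, E2, y1, y2, a1, a2, a1_new, a2_new, K11, K12, K22, C):
    """Update the threshold b after a successful pair update,
    following the case analysis above."""
    b1 = b - E1 - y1 * (a1_new - a1) * K11 - y2 * (a2_new - a2) * K12
    b2 = b - E2 - y1 * (a1_new - a1) * K12 - y2 * (a2_new - a2) * K22
    if 0 < a1_new < C:          # alpha_1 strictly inside the bounds: b1 is valid
        return b1
    if 0 < a2_new < C:          # alpha_2 strictly inside the bounds: b2 is valid
        return b2
    return (b1 + b2) / 2.0      # both at bounds: any value in between works
```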

(7) Improving the speed of SMO

In terms of implementation, the standard SMO has several places where speed can be improved:

1. Use caching wherever possible, e.g., cache the kernel matrix to reduce repeated computation, at the cost of higher space complexity;

2. If the kernel of the SVM is a linear kernel, the weight vector $w$ can be updated directly. After all, recomputing $f(x_i)$ from all support vectors each time is expensive, so the old multiplier information can be used to update $w$ incrementally as follows (see the sketch after this list):

$$w^{new} = w^{old} + y_1(\alpha_1^{new} - \alpha_1^{old})\,x_1 + y_2(\alpha_2^{new} - \alpha_2^{old})\,x_2$$

Examples of this nature can be found in SVM Learning – Coordinate Descent Method.

3. Pay attention to the parts that can be parallelized and improve them with parallel methods. For example, using MPI the samples can be split into several parts; when searching for the multiplier that maximizes the selection criterion, each node first finds its local maximum and then a global maximum is taken over the nodes. Similarly, if the stopping condition monitors the duality gap, the local feasibility gap can be computed on each node and then accumulated on the master node to obtain the global feasibility gap.
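As referenced in point 2 above, a minimal sketch of the incremental weight-vector update for a linear kernel (`x1`, `x2` are the two samples as NumPy arrays; names are illustrative):

```python
import numpy as np

def update_w(w, x1, x2, y1, y2, a1_old, a2_old, a1_new, a2_new):
    """Incremental weight update for a linear kernel: only the two samples
    whose multipliers changed contribute to the change in w."""
    return w + y1 * (a1_new - a1_old) * np.asarray(x1) \
             + y2 * (a2_new - a2_old) * np.asarray(x2)
```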

There are many references on improving the standard SMO. For example, using the "maximal violating pair" to heuristically select the multipliers is a very effective method, as is the approach based on "second order information". In my opinion, the ideal algorithm should both greatly improve the convergence rate of the algorithm itself and allow a high degree of parallelism.

4. An updated version of SMO (Fan, 2005)

As mentioned earlier, SMO can be divided into two parts:

(1) How to select the working set at each iteration, that is, which two Lagrange multipliers to optimize.

(2) How to update the selected working set (the two Lagrange multipliers) in each iteration.

How to choose the working set is an important part of the SMO algorithm, because different choices lead to different training speeds.

Rong-En Fan et al.'s 2005 paper "Working Set Selection Using Second Order Information for Training Support Vector Machines" describes several different working set selection methods used at each iteration.

First, restate the objective function that SMO needs to optimize (written in the equivalent minimization form used in the paper):

$$\min_{\alpha}\ f(\alpha) = \frac{1}{2}\alpha^{T} Q \alpha - e^{T}\alpha \quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \ y^{T}\alpha = 0 \qquad (1)$$

where $Q_{ij} = y_i y_j K(x_i, x_j)$ and $e$ is the vector of all ones.

(1) Algorithm 1 (SMO-type decomposition method)

(2) WSS 1 (Working set selection via the "maximal violating pair")

This working set selection was introduced by Keerthi et al. in 2001 and was used in the version of LIBSVM released in 2001.

The working set selection can be derived from the KKT conditions of problem (1): if a vector $\alpha$ is a solution of (1), then there must exist a real number $b$ and two non-negative vectors such that the following holds:

where $\nabla f(\alpha)$ is the gradient of the objective function.

The above conditions can be rewritten as:

Further, we have

Define

$$I_{up}(\alpha) = \{t \mid \alpha_t < C,\ y_t = 1\ \text{or}\ \alpha_t > 0,\ y_t = -1\},$$

$$I_{low}(\alpha) = \{t \mid \alpha_t < C,\ y_t = -1\ \text{or}\ \alpha_t > 0,\ y_t = 1\},$$

$$m(\alpha) = \max_{t\in I_{up}(\alpha)} -y_t\nabla f(\alpha)_t, \qquad M(\alpha) = \min_{t\in I_{low}(\alpha)} -y_t\nabla f(\alpha)_t.$$

A pair $\{i, j\}$ with $i \in I_{up}(\alpha)$, $j \in I_{low}(\alpha)$ and $-y_i\nabla f(\alpha)_i > -y_j\nabla f(\alpha)_j$ is called a "violating pair". The condition for $\alpha$ to be an optimal solution of the objective function is then:

$$m(\alpha) \le M(\alpha) \qquad (6)$$

From the definition of "violating pair" above, it can be seen that the $\{i, j\}$ pair that most violates condition (6) is the best choice of working set: it is exactly these maximally violating $\{i, j\}$ pairs that we need to update, and as they are driven to satisfy (6), the objective function is gradually driven toward its optimum. The precise theorem is as follows:

Interestingly, the $\{i, j\}$ pair obtained by selecting the maximal violation of the KKT conditions coincides with the $\{i, j\}$ pair obtained from a "first order approximation of the objective function". That is, the $\{i, j\}$ pair obtained through WSS 1 satisfies:

By definition, the objective function in (8a) is the first-order approximation whose optimal solution is sought:

where the constraints follow from the feasibility conditions of problem (1); and since (8a) is a linear function, the additional bound constraint is needed to keep the objective value from becoming infinitely small.

At first glance, (7) seems to require traversing all pairs of Lagrange multipliers to find the optimal $\{i, j\}$ pair; however, WSS 1 can find the optimum in linear time. The proof is as follows:

Proof
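Whether or not one follows the proof in detail, WSS 1 itself is straightforward to implement in linear time. A sketch using the $I_{up}$/$I_{low}$ sets defined above (illustrative only, not LIBSVM's actual code):

```python
import numpy as np

def wss1(alpha, y, grad, C, tol=1e-3):
    """WSS 1 (maximal violating pair), as a sketch.

    grad is the gradient of f(alpha) = 0.5 * a'Qa - e'a.  Returns (i, j), or
    (-1, -1) once the stopping condition m(alpha) <= M(alpha) + tol holds.
    """
    score = -y * grad                    # -y_t * grad(f)_t for every sample
    # index sets I_up and I_low defined above
    up = np.where(((y == 1) & (alpha < C)) | ((y == -1) & (alpha > 0)))[0]
    low = np.where(((y == -1) & (alpha < C)) | ((y == 1) & (alpha > 0)))[0]

    i = up[np.argmax(score[up])]         # argmax over I_up  -> m(alpha)
    j = low[np.argmin(score[low])]       # argmin over I_low -> M(alpha)
    if score[i] - score[j] <= tol:       # m(alpha) - M(alpha) <= tol: stop
        return -1, -1
    return int(i), int(j)
```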

(3) A new working set selection

The above uses the first-order approximation of the objective function as a surrogate for the optimization, so we can go a step further and use the second-order approximation of the objective function as the surrogate:

(4) WSS 2 (Working set selection using second order information)

The following theorem shows that the optimization problem in (11) can be solved efficiently according to WSS 2:
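In implementation terms, WSS 2 keeps the same first index $i$ as WSS 1 and picks $j$ from the second-order criterion. A sketch (illustrative only, not LIBSVM's code; the `tau` guard anticipates the non-positive definite case of WSS 3 below):

```python
import numpy as np

def wss2(alpha, y, grad, K, C, tol=1e-3, tau=1e-12):
    """WSS 2 (second order information), as a sketch.  K is the kernel matrix.

    i is chosen as in WSS 1; j minimizes -b_it**2 / a_it over eligible t, with
    a_it = K_ii + K_tt - 2*K_it and b_it = m(alpha) - (-y_t grad_t) > 0.
    """
    score = -y * grad
    up = np.where(((y == 1) & (alpha < C)) | ((y == -1) & (alpha > 0)))[0]
    low = np.where(((y == -1) & (alpha < C)) | ((y == 1) & (alpha > 0)))[0]

    i = up[np.argmax(score[up])]                      # same first index as WSS 1
    if score[i] - np.min(score[low]) <= tol:
        return -1, -1                                 # stopping condition reached

    cand = low[score[low] < score[i]]                 # eligible t in I_low
    a = K[i, i] + K[cand, cand] - 2.0 * K[i, cand]    # a_it for every candidate
    a = np.where(a > 0, a, tau)                       # guard non-positive curvature
    b = score[i] - score[cand]                        # b_it > 0 by construction
    j = cand[np.argmin(-(b * b) / a)]                 # maximize the predicted decrease
    return int(i), int(j)
```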

(5) Non-positive definite kernel matrices

The previous approach does not cover the case where the kernel matrix is not positive definite; for this case, Chen et al. gave a solution in 2006:

(6) WSS 3 (Working set selection using second order information: any symmetric K)

Then, using WSS 3 to select the working set for the SMO-type decomposition method, the steps are:

(7) Algorithm 2 (an SMO-type decomposition method using WSS 3)
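The precise statement of Algorithm 2 is in the paper; as a very rough illustration (not the paper's pseudocode), the pieces sketched above can be combined into a decomposition loop like this:

```python
import numpy as np

def smo_train(K, y, C, tol=1e-3, tau=1e-12, max_iter=100000):
    """Very rough sketch of an SMO-type decomposition loop that plugs the
    working-set selection (wss2 above) into the pair update (smo_step above).
    It recomputes the decision values each iteration for clarity; a real
    implementation maintains the gradient incrementally and follows the
    paper's exact sub-problem for the eta <= 0 case."""
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(max_iter):
        f = (alpha * y) @ K              # decision values without the threshold b
        E = f - y                        # b cancels in E1 - E2, so it can be omitted here
        grad = y * f - 1.0               # gradient of 0.5 a'Qa - e'a
        i, j = wss2(alpha, y, grad, K, C, tol, tau)
        if i == -1:
            break                        # m(alpha) - M(alpha) <= tol: stop
        a_i, a_j = smo_step(alpha[i], alpha[j], y[i], y[j], E[i], E[j],
                            K[i, i], K[j, j], K[i, j], C)
        if a_i == alpha[i] and a_j == alpha[j]:
            break                        # degenerate pair, no progress (simplification)
        alpha[i], alpha[j] = a_i, a_j
    return alpha
```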

