I. SMO Algorithm Principle
The SMO algorithm resembles earlier SVM training improvements in that it breaks the full quadratic programming problem into many small, easily handled subproblems. What sets SMO apart is that it decomposes the problem down to the smallest possible scale: each optimization step works on the multipliers of only two samples and solves that subproblem analytically. We will see that this distinctive approach brings a series of advantages.
For an SVM, at least two samples (that is, their corresponding Lagrange multipliers) must be optimized at a time, because the equality constraint makes it impossible to optimize a single variable on its own.
The biggest benefit of this "minimum optimization" is that every smallest-scale subproblem can be solved analytically, completely avoiding iterative numerical optimization inside each step.
Of course, such a minimum optimization does not guarantee that the result is the final value of the optimized Lagrange multipliers, but it does move the objective function one step toward its minimum. We then perform the minimum optimization on other multipliers, until all multipliers satisfy the KKT conditions; at that point the objective function is minimized and the algorithm terminates.
The SMO algorithm therefore has two problems to solve: how to solve the two-variable optimization problem, and how to decide which pair of Lagrange multipliers to optimize first.
II. Optimizing Two Lagrange Multipliers (Subroutine takeStep)
Suppose the two multipliers currently being optimized, α1 and α2, correspond to the first and second samples. With all other multipliers held fixed, the constraints restrict α1 and α2 to a line segment inside the square [0, C] × [0, C]:
Finding the extreme value of the objective function on this line segment is a one-dimensional extremum problem. We can express α1 in terms of α2 and find the unconstrained extremum with respect to α2. If the objective function is strictly convex, the minimum lies either at this extreme point (when the extreme point falls inside the feasible range) or at an endpoint of the range (when it falls outside). Once α2 is determined, α1 is determined as well. Therefore we first find the upper and lower limits of the interval over which α2 may vary, and then find the minimizer of α2 within that interval.
From Figure 1 it is easy to see that the upper and lower limits of α2 are:
L = max(0, α2 − α1), H = min(C, C + α2 − α1), if y1 ≠ y2;
L = max(0, α2 + α1 − C), H = min(C, α2 + α1), if y1 = y2.
If we let S = y1y2 indicate whether the two samples have the same label, the two cases can be written in one formula:
L = max(0, α2 + Sα1 − ½(S + 1)C), H = min(C, α2 + Sα1 − ½(S − 1)C)
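As a concrete illustration, here is a minimal Python sketch of this bound computation (the function name and the variable names alpha1, alpha2, y1, y2, C are assumptions made for this sketch, not identifiers from Platt's code):

```python
def compute_bounds(alpha1, alpha2, y1, y2, C):
    """Lower and upper bounds L, H that confine alpha2 to the feasible segment."""
    if y1 != y2:                      # S = y1*y2 = -1
        L = max(0.0, alpha2 - alpha1)
        H = min(C, C + alpha2 - alpha1)
    else:                             # S = y1*y2 = +1
        L = max(0.0, alpha2 + alpha1 - C)
        H = min(C, alpha2 + alpha1)
    return L, H
```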
The equality constraint relating α1 and α2 during this optimization is:
α1 + Sα2 = α°1 + Sα°2 = d
Let us now derive the formula for the minimizer α2. Since only α1 and α2 are treated as variables, the objective function can be written as
Wolfe(α1, α2) = ½K11α1² + ½K22α2² + SK12α1α2 + y1α1v1 + y2α2v2 − α1 − α2 + constant
where Kij = K(xi, xj), and vi = y3α°3K3i + … + ylα°lKli = ui + b° − y1α°1K1i − y2α°2K2i.
Here the superscript ° denotes the value of a Lagrange multiplier (or of the threshold b) before the current optimization step.
Expressing α1 as d − Sα2 and substituting into the objective function gives:
Wolfe(α2) = ½K11(d − Sα2)² + ½K22α2² + SK12(d − Sα2)α2 + y1(d − Sα2)v1 − d + Sα2 + y2α2v2 − α2 + constant
Differentiating with respect to α2 and setting the derivative to zero:
dWolfe(α2)/dα2 = −SK11(d − Sα2) + K22α2 − K12α2 + SK12(d − Sα2) − y2v1 + S + y2v2 − 1 = 0
If the objective function is strictly convex, that is, if the second derivative K11 + K22 − 2K12 > 0, then the stationary point must be the minimum, and the unconstrained extremum is
α2 = [S(K11 − K12)d + y2(v1 − v2) + 1 − S] / (K11 + K22 − 2K12)
Substituting the relations between d, v and α°, u, we obtain the unconstrained minimizer of α2 expressed in terms of α°2, u1 and u2:
α2 = [α°2(K11 + K22 − 2K12) + y2(u1 − u2 + y2 − y1)] / (K11 + K22 − 2K12)
Writing η = K11 + K22 − 2K12 for the second derivative of the objective along the line, and Ei = ui − yi for the "error" on the i-th training sample, this formula becomes
α2 = α°2 + y2(E1 − E2)/η
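A minimal Python sketch of this update (the kernel values K11, K22, K12 and the errors E1, E2 are assumed to have been computed beforehand; names are illustrative):

```python
def unconstrained_alpha2(alpha2_old, y2, E1, E2, K11, K22, K12):
    """Unconstrained minimizer of the objective along the constraint line."""
    eta = K11 + K22 - 2.0 * K12      # second derivative of the objective
    if eta <= 0.0:
        return None, eta             # degenerate case, handled by the endpoint test below
    return alpha2_old + y2 * (E1 - E2) / eta, eta
```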
Unless the kernel K fails the Mercer condition (that is, it is not a valid kernel), η will never be negative, but η = 0 can occur. In that case we evaluate the objective function at the two endpoints of the line segment and move the Lagrange multiplier to the endpoint with the smaller objective value:
f1 = y1(E1 + b) − α1K(x1, x1) − Sα2K(x1, x2)
f2 = y2(E2 + b) − Sα1K(x1, x2) − α2K(x2, x2)
L1 = α1 + S(α2 − L)
H1 = α1 + S(α2 − H)
WolfeL = L1f1 + Lf2 + ½L1²K(x1, x1) + ½L²K(x2, x2) + SLL1K(x1, x2)
WolfeH = H1f1 + Hf2 + ½H1²K(x1, x1) + ½H²K(x2, x2) + SHH1K(x1, x2)
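The following Python sketch evaluates the objective at the two endpoints for the η = 0 case, transcribing the formulas above (the argument names are assumptions of this sketch):

```python
def objective_at_endpoints(alpha1, alpha2, y1, y2, E1, E2, b, S, L, H, K11, K22, K12):
    """Objective values at the two ends of the feasible segment (used when eta <= 0)."""
    f1 = y1 * (E1 + b) - alpha1 * K11 - S * alpha2 * K12
    f2 = y2 * (E2 + b) - S * alpha1 * K12 - alpha2 * K22
    L1 = alpha1 + S * (alpha2 - L)   # value of alpha1 when alpha2 sits at L
    H1 = alpha1 + S * (alpha2 - H)   # value of alpha1 when alpha2 sits at H
    wolfe_L = L1 * f1 + L * f2 + 0.5 * L1**2 * K11 + 0.5 * L**2 * K22 + S * L * L1 * K12
    wolfe_H = H1 * f1 + H * f2 + 0.5 * H1**2 * K11 + 0.5 * H**2 * K22 + S * H * H1 * K12
    return wolfe_L, wolfe_H
```

α2 is then moved to L if WolfeL is smaller (by more than a small tolerance), to H if WolfeH is smaller, and left unchanged otherwise, as described next.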
If the two endpoints yield the same objective value, the objective is the same along the entire segment (when η = 0 it is linear along the line), and no correction of α2 is needed.
Once the unconstrained extremum of α2 is known, it is clipped to the interval [L, H]: the final α2* equals H if the extremum exceeds H, L if it falls below L, and the extremum itself otherwise.
Finally, α1 is determined by the equality constraint:
α1* = α1 + S(α2 − α2*)
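Putting the last two steps together, a minimal sketch of the clipping and of recovering α1 (names are illustrative):

```python
def clip_and_update(alpha1, alpha2, alpha2_unclipped, S, L, H):
    """Clip the unconstrained alpha2 to [L, H] and recover alpha1 from the
    equality constraint alpha1 + S*alpha2 = constant."""
    alpha2_star = min(max(alpha2_unclipped, L), H)
    alpha1_star = alpha1 + S * (alpha2 - alpha2_star)
    return alpha1_star, alpha2_star
```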
III. Heuristics for Selecting the Lagrange Multipliers to Optimize
In fact, even without any selection heuristic, simply taking all combinations of αi and αj in order and optimizing them would keep decreasing the objective function, until no pair αi, αj could be optimized any further and the objective converged to its minimum. We adopt a heuristic only to make the algorithm converge faster.
The heuristic first selects the multiplier α2 that most likely needs optimization, and then, for that α2, selects the partner α1 that is most likely to allow a large correction step. Accordingly, the program uses two levels of loops:
The inner loop (subroutine examineExample) takes a sample that violates the KKT conditions and selects another sample to pair with it for optimization (that is, to optimize their Lagrange multipliers jointly). The selection aims to maximize the optimization step size for the pair. For a given first sample the step size is |(E1 − E2)/η|, but because evaluating kernel functions is expensive, we use |E1 − E2| alone as a rough estimate of the possible step, i.e. we take the sample that maximizes |E1 − E2| as the second sample. Note that this step-size estimate is crude: the selected pair sometimes cannot be adjusted at all (for example, when η = 0 the quadratic form of the minimum optimization problem is only positive semi-definite). In that case we traverse all non-boundary samples (samples whose Lagrange multipliers are not at the bounds 0 or C) looking for an α1 that can be paired with this α2 for optimization; if no such sample is found among the non-boundary samples, we traverse all samples. Both traversals start from a random position, so that the algorithm is not always biased toward the samples at the front of the training set. In extreme degenerate cases no α1 can make progress when paired with this α2, and we then give up on the first sample.
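A rough Python sketch of this second-choice heuristic and of the random-start fallback sweep (the error cache is assumed here to be a dict mapping non-boundary sample indices to their cached errors; names are illustrative, not Platt's identifiers):

```python
import random

def second_choice(i2, E2, error_cache):
    """Pick the partner that maximizes |E1 - E2| as a rough estimate of the step size."""
    best_i1, best_gap = -1, 0.0
    for i1, E1 in error_cache.items():   # errors are cached for non-boundary samples
        if i1 == i2:
            continue
        if abs(E1 - E2) > best_gap:
            best_i1, best_gap = i1, abs(E1 - E2)
    return best_i1                       # -1 means no cached candidate was found

def sweep_from_random_start(indices):
    """Return the candidate indices starting from a random position, as the
    fallback sweeps over non-boundary samples (and then all samples) do."""
    if not indices:
        return []
    start = random.randrange(len(indices))
    return indices[start:] + indices[:start]
```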
The outer loop (main program SMO) traverses either the non-boundary samples or all samples. It preferentially traverses the non-boundary samples, because they are far more likely to need adjustment, while boundary samples usually stay on the boundary and need no further change (one can imagine that most samples are clearly not support vectors, and once their Lagrange multipliers reach zero they never need adjusting again). The loop repeatedly sweeps over the non-boundary samples, selecting those that violate the KKT conditions for adjustment, until all non-boundary samples satisfy the KKT conditions. When a sweep adjusts no non-boundary sample, all samples are traversed to check whether the entire set satisfies the KKT conditions. If that full-set check optimizes any samples further, the non-boundary samples must be swept again. In this way the outer loop keeps switching between "traverse all samples" and "traverse non-boundary samples" until the entire training set satisfies the KKT conditions.
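A sketch of the outer loop's alternation between full passes and non-boundary passes (examine_example stands in for the inner routine and is assumed to return 1 when it managed to optimize a pair for the given sample; names are illustrative):

```python
def smo_outer_loop(n_samples, alphas, C, examine_example):
    """Alternate between sweeping all samples and sweeping non-boundary samples
    until a full sweep over the whole training set changes nothing."""
    examine_all = True
    while True:
        if examine_all:
            indices = range(n_samples)
        else:
            indices = [i for i in range(n_samples) if 0.0 < alphas[i] < C]
        num_changed = sum(examine_example(i) for i in indices)
        if examine_all:
            if num_changed == 0:
                break                    # whole set satisfies the KKT conditions
            examine_all = False          # switch to non-boundary sweeps
        elif num_changed == 0:
            examine_all = True           # non-boundary set is clean; re-check all
    return alphas
```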
The KKT conditions mentioned above are checked only to within a certain tolerance. For example, the output ui of a positive non-boundary sample need only lie within a tolerance of 1; a typical tolerance is 0.001. If the conditions were required to hold very precisely, the algorithm could not converge quickly.
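A minimal sketch of such a tolerance-based KKT check for one sample, using Ei = ui − yi so that yi·Ei = yi·ui − 1 (tol = 0.001 as suggested above; names are illustrative):

```python
def violates_kkt(alpha_i, y_i, E_i, C, tol=1e-3):
    """True if the i-th sample violates its KKT condition by more than tol."""
    r_i = y_i * E_i                  # r_i = y_i*u_i - 1
    return (r_i < -tol and alpha_i < C) or (r_i > tol and alpha_i > 0.0)
```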
IV. Updating after Each Minimum Optimization
After each minimum optimization the error cache must be updated, so that the KKT test on other samples uses the corrected classifier, and so that the step size can be estimated when selecting the second sample of the next optimization pair.
To update the error cache, the threshold b must be updated first. We can correct it directly from the information of the two samples just optimized, starting from the old threshold b°, without recomputing b from all support vectors. If the newly optimized α1* is not at a bound, b is given by:
b1 = E1 + y1(α1* − α°1)K(x1, x1) + y2(α2* − α°2)K(x1, x2) + b°
If the newly optimized α2* is not at a bound, b is given by:
b2 = E2 + y1(α1* − α°1)K(x1, x2) + y2(α2* − α°2)K(x2, x2) + b°
When neither α1* nor α2* is at a bound, b1 and b2 are equal. When both multipliers are at bounds, b1, b2, and every value between them satisfy the KKT conditions; in this case the SMO algorithm takes the midpoint of b1 and b2 as the safest choice.
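Under the same naming assumptions as the earlier sketches (a1_old, a2_old for α°1, α°2 and a1_new, a2_new for α1*, α2*), the threshold update might be written as:

```python
def update_threshold(b_old, E1, E2, y1, y2,
                     a1_old, a2_old, a1_new, a2_new,
                     K11, K12, K22, C):
    """New threshold b computed from the two freshly optimized multipliers."""
    b1 = E1 + y1 * (a1_new - a1_old) * K11 + y2 * (a2_new - a2_old) * K12 + b_old
    b2 = E2 + y1 * (a1_new - a1_old) * K12 + y2 * (a2_new - a2_old) * K22 + b_old
    if 0.0 < a1_new < C:
        return b1                    # alpha1* not at a bound: b1 is exact
    if 0.0 < a2_new < C:
        return b2                    # alpha2* not at a bound: b2 is exact
    return 0.5 * (b1 + b2)           # both at bounds: any value in between is valid
```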
In the nonlinear case, computing an error requires all support vectors and their Lagrange multipliers: uj = Σi yiαi*K(xi, xj) − b*, Ej = uj − yj.
In the linear case, the normal vector w of the separating hyperplane is updated first, and then the output is computed as uj = w·xj − b and the error as Ej = uj − yj. As with the threshold, updating the normal vector does not require visiting all support vectors; the old normal vector only needs a simple correction:
w* = w + y1(α1* − α°1)x1 + y2(α2* − α°2)x2
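A sketch of this incremental update with NumPy arrays (names are illustrative):

```python
import numpy as np

def update_weight_vector(w, x1, x2, y1, y2, a1_old, a2_old, a1_new, a2_new):
    """Incremental correction of the normal vector w after one minimum optimization."""
    return (np.asarray(w)
            + y1 * (a1_new - a1_old) * np.asarray(x1)
            + y2 * (a2_new - a2_old) * np.asarray(x2))
```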
Most of this updating is done by simple, non-iterative computations, so the SMO algorithm, which requires many minimum optimizations, does not spend too much time on the update after each one. We can also see, however, that updating the errors in the nonlinear case requires evaluating the kernel against every support vector, and a kernel evaluation is more expensive than a dot product; updating the nonlinear errors therefore becomes the speed bottleneck of the algorithm.
V. Algorithm Flowchart
VI. Logic of Platt's Code
VII. Source Code Implementation
Unfinished, to be continued...