Support Vector Machine (Part 2)

The previous section introduced the optimal margin classifier and briefly explained the meaning of support vectors. This section builds on that and covers the support vector machine model and its optimization method, SMO.

The primal optimization problem of the optimal margin classifier is:
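
In the standard notation (m training examples x^(i) with labels y^(i) ∈ {−1, 1}):

```latex
\min_{w,b}\ \frac{1}{2}\lVert w\rVert^{2}
\quad\text{s.t.}\quad y^{(i)}\bigl(w^{T}x^{(i)}+b\bigr)\ge 1,\qquad i=1,\dots,m
```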

To solve this model, we derive its dual optimization problem:
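
Written entirely in terms of inner products between training examples, the dual is:

```latex
\max_{\alpha}\ W(\alpha)=\sum_{i=1}^{m}\alpha_{i}
-\frac{1}{2}\sum_{i,j=1}^{m}y^{(i)}y^{(j)}\alpha_{i}\alpha_{j}\bigl\langle x^{(i)},x^{(j)}\bigr\rangle
\quad\text{s.t.}\quad \alpha_{i}\ge 0,\quad \sum_{i=1}^{m}\alpha_{i}y^{(i)}=0
```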

Recall the hypothesis h_{w,b}(x) = g(wᵀx + b), where g outputs 1 when its argument is non-negative and −1 otherwise. Substituting the optimal w, expressed through the dual variables, the hypothesis becomes:
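
Since the optimal w is a weighted sum of training examples, predictions depend on a new input x only through inner products:

```latex
w=\sum_{i=1}^{m}\alpha_{i}y^{(i)}x^{(i)}
\qquad\Longrightarrow\qquad
h_{w,b}(x)=g\Bigl(\sum_{i=1}^{m}\alpha_{i}y^{(i)}\bigl\langle x^{(i)},x\bigr\rangle+b\Bigr)
```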

This is where the important concept of the kernel function comes in; it is essential to the SVM optimization method.

At the same time, solving the model runs into the problem of outliers, so the model has to be revised; this leads to the concept of the soft margin.

This section outlines:

    • Kernel function
    • Soft margin
    • SMO algorithm

First, the kernel function

1. Feature Mapping

The original attributes of a problem usually need to be processed before being fed into the learning algorithm, producing the input features. Suppose φ is the feature mapping from the original attributes to the input features:
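
For example, a single scalar attribute x might be mapped to polynomial features:

```latex
\phi(x)=\begin{bmatrix}x\\ x^{2}\\ x^{3}\end{bmatrix}
```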

2. Kernel functions

(1) Basic definition

Returning to the earlier problem: if we replace the inner product of the original attributes ⟨x, z⟩ everywhere with the inner product of the mapped features ⟨φ(x), φ(z)⟩, we arrive at the definition of the kernel function (kernel):
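
That is, for all inputs x and z:

```latex
K(x,z)=\phi(x)^{T}\phi(z)
```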

Given φ(x) and φ(z), computing their inner product immediately yields K(x,z). The interesting case is the reverse: even when φ(x) is a very high-dimensional vector that is expensive to compute explicitly, K(x,z) itself may be cheap to evaluate, and this lets a support vector machine learn in the high-dimensional feature space without ever forming φ(x). The following example makes this concrete.
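
Take x, z ∈ ℝⁿ and consider the kernel K(x,z) = (xᵀz)². Expanding the square:

```latex
K(x,z)=\bigl(x^{T}z\bigr)^{2}
=\Bigl(\sum_{i=1}^{n}x_{i}z_{i}\Bigr)\Bigl(\sum_{j=1}^{n}x_{j}z_{j}\Bigr)
=\sum_{i=1}^{n}\sum_{j=1}^{n}(x_{i}x_{j})(z_{i}z_{j})
```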

The final expression is exactly an inner product of mapped feature vectors, so it satisfies the definition of a kernel.

The corresponding feature map φ(x) can be read off directly from this expansion.
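
For n = 3, it consists of all nine pairwise products:

```latex
\phi(x)=\bigl(
x_{1}x_{1},\ x_{1}x_{2},\ x_{1}x_{3},\
x_{2}x_{1},\ x_{2}x_{2},\ x_{2}x_{3},\
x_{3}x_{1},\ x_{3}x_{2},\ x_{3}x_{3}
\bigr)^{T}
```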

Computing φ(x) takes O(n²) time, while computing K(x,z) takes only O(n).

(2) General form

Consider another kernel of a related form, together with its corresponding feature map φ(x).
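
Expanding as before (the feature map is again shown for n = 3; c is a constant):

```latex
K(x,z)=\bigl(x^{T}z+c\bigr)^{2}
=\sum_{i,j=1}^{n}(x_{i}x_{j})(z_{i}z_{j})
+\sum_{i=1}^{n}\bigl(\sqrt{2c}\,x_{i}\bigr)\bigl(\sqrt{2c}\,z_{i}\bigr)+c^{2}

\phi(x)=\bigl(
x_{1}x_{1},\ x_{1}x_{2},\ x_{1}x_{3},\
x_{2}x_{1},\ x_{2}x_{2},\ x_{2}x_{3},\
x_{3}x_{1},\ x_{3}x_{2},\ x_{3}x_{3},\
\sqrt{2c}\,x_{1},\ \sqrt{2c}\,x_{2},\ \sqrt{2c}\,x_{3},\ c
\bigr)^{T}
```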

The parameter c controls the relative weighting between the first-order terms xᵢ and the second-order terms xᵢxⱼ.

More generally, the polynomial kernel is:
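
For any degree d ≥ 1 and constant c ≥ 0:

```latex
K(x,z)=\bigl(x^{T}z+c\bigr)^{d}
```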

The corresponding feature map φ(x) is (n+d choose d)-dimensional; computing φ(x) explicitly would take O(nᵈ) time, while computing K(x,z) still takes only O(n), so there is no need to explicitly represent feature vectors in this very high-dimensional feature space.

(3) The Gaussian kernel

So how do we construct a kernel function?

Consider constructing a kernel with a different intuition. Since φ(x) and φ(z) are two vectors, we would like K(x,z) to be large when x and z are close together and small when they are far apart; in other words, K(x,z) measures the similarity between φ(x) and φ(z), or between x and z. One kernel with this behaviour takes the following form:
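
With σ a bandwidth parameter:

```latex
K(x,z)=\exp\!\Bigl(-\frac{\lVert x-z\rVert^{2}}{2\sigma^{2}}\Bigr)
```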

When x and z are very close, K(x,z) is near 1; when they are far apart, K(x,z) is near 0. This kernel is called the Gaussian kernel and is one of the kernels most commonly used with support vector machines.

(4) Validity of kernels

Next, how can we tell whether there exists a feature map φ such that K(x,z) = φ(x)ᵀφ(z) for all x and z, i.e., whether a kernel is valid?

Suppose K is a valid kernel, and consider any m points x^(1), ..., x^(m). The m×m kernel matrix (Kernel matrix) is defined by K_ij = K(x^(i), x^(j)). If the kernel is valid, this matrix must be symmetric, since the inner product of two inputs does not depend on their order.
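
Moreover, writing φ_k(x) for the k-th coordinate of the vector φ(x), for any vector z:

```latex
z^{T}Kz=\sum_{i}\sum_{j}z_{i}K_{ij}z_{j}
=\sum_{i}\sum_{j}z_{i}\,\phi\bigl(x^{(i)}\bigr)^{T}\phi\bigl(x^{(j)}\bigr)\,z_{j}
=\sum_{k}\Bigl(\sum_{i}z_{i}\,\phi_{k}\bigl(x^{(i)}\bigr)\Bigr)^{2}\ \ge\ 0
```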

Therefore K is positive semi-definite. In other words, if K is a valid kernel, then its kernel matrix is symmetric positive semi-definite for any finite set of points. In fact, this condition is not only necessary but also sufficient; this is the content of Mercer's theorem.

A linear classifier cannot separate data that are not linearly separable in the original space; with a kernel, the SVM produces a nonlinear decision boundary while training remains a convex optimization problem. Many other algorithms can also be written purely in terms of inner products, so the same trick applies: replace the inner product with a kernel and implicitly map the features into a very high-dimensional (even infinite-dimensional) space, solving problems that cannot be solved in the low-dimensional space.
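
As a quick illustration of this effect (using scikit-learn purely for convenience; the article itself does not depend on any particular library):

```python
# Two concentric circles are not linearly separable in the original 2-D space,
# but an SVM with a Gaussian (RBF) kernel separates them easily.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.05, factor=0.4, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print("linear kernel training accuracy:", linear_svm.score(X, y))
print("RBF kernel training accuracy:   ", rbf_svm.score(X, y))
```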

Second, the L1-norm soft-margin SVM

The SVM as described so far assumes the data set is linearly separable, but that cannot be guaranteed in all cases, and even when it holds the resulting separating hyperplane may not be satisfactory. When there is a distant outlier, the hyperplane can swing dramatically to accommodate it, leaving a much smaller margin.

To allow for data that are not linearly separable and to reduce the influence of outliers, the optimal margin classifier is modified by adding L1 regularization:
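
With slack variables ξᵢ ≥ 0 and a penalty weight C, the modified primal is:

```latex
\min_{w,b,\xi}\ \frac{1}{2}\lVert w\rVert^{2}+C\sum_{i=1}^{m}\xi_{i}
\quad\text{s.t.}\quad y^{(i)}\bigl(w^{T}x^{(i)}+b\bigr)\ge 1-\xi_{i},\quad \xi_{i}\ge 0,\quad i=1,\dots,m
```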

An example is now allowed to have functional margin as small as 1 − ξᵢ, at the cost of a penalty Cξᵢ in the objective; the parameter C controls the relative weight between keeping ‖w‖² small (which makes the margin large) and keeping most examples' functional margins at least 1.

The Lagrangian function is:
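
Introducing multipliers αᵢ ≥ 0 for the margin constraints and rᵢ ≥ 0 for the constraints ξᵢ ≥ 0:

```latex
\mathcal{L}(w,b,\xi,\alpha,r)=\frac{1}{2}w^{T}w+C\sum_{i=1}^{m}\xi_{i}
-\sum_{i=1}^{m}\alpha_{i}\Bigl[y^{(i)}\bigl(w^{T}x^{(i)}+b\bigr)-1+\xi_{i}\Bigr]
-\sum_{i=1}^{m}r_{i}\xi_{i}
```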

Its dual form is:
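
After setting the derivatives with respect to w, b, and ξ to zero and substituting back, the dual becomes:

```latex
\max_{\alpha}\ W(\alpha)=\sum_{i=1}^{m}\alpha_{i}
-\frac{1}{2}\sum_{i,j=1}^{m}y^{(i)}y^{(j)}\alpha_{i}\alpha_{j}\bigl\langle x^{(i)},x^{(j)}\bigr\rangle
\quad\text{s.t.}\quad 0\le\alpha_{i}\le C,\quad \sum_{i=1}^{m}\alpha_{i}y^{(i)}=0
```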

Note that the only change relative to the hard-margin dual is the allowed range of the αᵢ (from αᵢ ≥ 0 to 0 ≤ αᵢ ≤ C), and that the solution must satisfy the KKT dual-complementarity conditions.
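
In their standard form, these conditions read (they are also what SMO later checks, to within tol, to test for convergence):

```latex
\alpha_{i}=0 \;\Rightarrow\; y^{(i)}\bigl(w^{T}x^{(i)}+b\bigr)\ge 1,\qquad
\alpha_{i}=C \;\Rightarrow\; y^{(i)}\bigl(w^{T}x^{(i)}+b\bigr)\le 1,\qquad
0<\alpha_{i}<C \;\Rightarrow\; y^{(i)}\bigl(w^{T}x^{(i)}+b\bigr)=1
```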

The L1-norm soft-margin SVM can thus handle data that are not linearly separable: when outliers are present, it simply gives up on separating them exactly. The remaining question is how to actually solve this dual problem, which is addressed below.

Third, the sequential minimal optimization (SMO) algorithm

1. Coordinate ascent method (coordinate ascent)

Consider solving an unconstrained optimization problem:
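
Here W is simply some function of the parameters α₁, ..., α_m; for the moment, ignore its connection to the SVM dual:

```latex
\max_{\alpha}\ W(\alpha_{1},\alpha_{2},\dots,\alpha_{m})
```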

Recall the optimization methods covered earlier, such as gradient ascent and Newton's method; here we introduce a new one, the coordinate ascent method.

Fix all parameters except some αᵢ, maximize W with respect to that αᵢ alone, then move on to the next parameter, cycling through them so that W(α) keeps increasing. The usual order of selection is α₁, α₂, ..., α_m, α₁, α₂, .... In a two-parameter example, each step of coordinate ascent moves to the maximum of W(α) along a direction parallel to one of the coordinate axes.

When the inner single-variable maximization can be performed efficiently (with all other parameters fixed), coordinate ascent can be quite an efficient algorithm, even though it usually needs more iterations than Newton's method.
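
A minimal sketch of the idea (the per-coordinate maximization here is a crude grid search, chosen only for illustration; in practice it would be an analytic or closed-form 1-D optimization):

```python
import numpy as np

def coordinate_ascent(W, alpha, num_passes=50):
    """Maximize W(alpha) by repeatedly optimizing one coordinate at a time."""
    alpha = np.asarray(alpha, dtype=float).copy()
    for _ in range(num_passes):
        for i in range(len(alpha)):
            # Crude 1-D search over coordinate i, all other coordinates fixed.
            candidates = alpha[i] + np.linspace(-1.0, 1.0, 201)
            values = []
            for c in candidates:
                trial = alpha.copy()
                trial[i] = c
                values.append(W(trial))
            alpha[i] = candidates[int(np.argmax(values))]
    return alpha

# Toy objective with its maximum at alpha = (1, -2).
W = lambda a: -(a[0] - 1.0) ** 2 - (a[1] + 2.0) ** 2
print(coordinate_ascent(W, [0.0, 0.0]))  # approaches [ 1. -2.]
```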

2. SMO (sequential minimal optimization)

Solving the L1-norm soft-margin SVM amounts to finding the values of αᵢ that satisfy the two constraints while maximizing the objective W(α). Coordinate ascent cannot be applied directly to this problem: fixing α₂, ..., α_m and maximizing over α₁ alone is not feasible.

Because of the KKT constraint Σ αᵢ y^(i) = 0, once the other values are fixed, α₁ is completely determined, so W(α) cannot be improved by varying a single variable. The fix is simple: update two of the αᵢ at the same time. The basic steps are:

    • Use a heuristic to select two variables αᵢ and αⱼ to update;
    • Holding the other α values fixed, re-optimize W(α) with respect to αᵢ and αⱼ;
    • Repeat the above steps until the algorithm converges to the global maximum of W(α);
    • Convergence is judged by checking whether the KKT conditions are satisfied to within a convergence tolerance parameter tol (typically 0.01 to 0.001).

SMO is an efficient algorithm mainly because the update of αᵢ and αⱼ can be computed very quickly.

The update of α₁ and α₂ is derived as follows (taking i = 1, j = 2, with α₃, ..., α_m held fixed). From the constraints we know that α₁y^(1) + α₂y^(2) must equal a constant, so α₁ can be written as a function of α₂. The box constraints 0 ≤ α₁, α₂ ≤ C then restrict α₂ to a segment of this line, giving a lower bound L and an upper bound H such that α₂ ∈ [L, H] ⊆ [0, C]. The relationship between α₁ and α₂ reduces the number of free parameters to one:
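
Writing ζ for the constant determined by the fixed variables:

```latex
\alpha_{1}y^{(1)}+\alpha_{2}y^{(2)}=-\sum_{i=3}^{m}\alpha_{i}y^{(i)}=\zeta
\qquad\Longrightarrow\qquad
\alpha_{1}=\bigl(\zeta-\alpha_{2}y^{(2)}\bigr)\,y^{(1)},
\qquad L\le\alpha_{2}\le H
```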

Substituting this expression for α₁, W(α) becomes a quadratic function of α₂ alone, of the form aα₂² + bα₂ + c for some coefficients a, b, c. Maximizing this quadratic and then clipping the result to [L, H] gives the new value of α₂, and the new α₁ follows from the line constraint above.
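
If α₂^new denotes the unconstrained maximizer of the quadratic, the clipped update is:

```latex
\alpha_{2}^{\text{new,clipped}}=
\begin{cases}
H & \text{if } \alpha_{2}^{\text{new}}>H\\
\alpha_{2}^{\text{new}} & \text{if } L\le\alpha_{2}^{\text{new}}\le H\\
L & \text{if } \alpha_{2}^{\text{new}}<L
\end{cases}
```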

For the details of how the heuristic chooses the variables αᵢ and αⱼ, and how the parameter b is updated as SMO runs, refer to John Platt's paper.

References:

Platt, J. C. (1998). Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Microsoft Research Technical Report MSR-TR-98-14.
