Machine Learning: Support Vector Machine (SVM)


Brief introduction:

Support Vector Machine (SVM) is a supervised learning model for binary classification. Its basic form is a linear classifier that maximizes the margin in the feature space. The difference from the perceptron is that the perceptron merely finds some hyperplane that separates the data correctly, whereas SVM looks for the hyperplane with the largest margin. A perceptron therefore has infinitely many valid hyperplanes, while the SVM hyperplane is unique. In addition, after introducing kernel functions, SVM can handle nonlinear problems.

SVM can be divided into the following three forms depending on the data:

1. Linearly separable support vector machine, also called the hard-margin support vector machine: when the data are linearly separable, a linear classifier is learned by maximizing the hard margin.

2. Linear support vector machine, also called the soft-margin support vector machine: when the data are approximately linearly separable, a slack variable is introduced and a linear classifier is learned by maximizing the soft margin.

3. Nonlinear support vector machine: when the data are not linearly separable, a kernel function is introduced to map the data to a high-dimensional space, and a nonlinear support vector machine is learned.

Linearly Separable Support Vector Machine

Consider a binary classification problem. When a hyperplane in the data's feature space can split the positive and negative samples, with the positive class on one side and the negative class on the other, we say the data are linearly separable, and the equation of this separating hyperplane is w·x + b = 0. For linearly separable data there are infinitely many hyperplanes that can separate it (compare the perceptron), and we want to find the best one: a hyperplane that not only divides the training data well but also generalizes better. In Figure 1, line B is clearly the best dividing line, so we choose the hyperplane with the largest margin as the optimal hyperplane.


Figure 1

Solving w·x + b = 0 means finding w and b; together with the corresponding classification decision function f(x) = sign(w·x + b), this is called the linearly separable support vector machine.

Based on the point-to-line distance formula:


Figure 2

Here A plays the role of the vector w and C plays the role of b, and because the value of w·x + b on the hyperplane is 0, the point-to-hyperplane distance for the support vector machine can be written as:


Figure 3

Because the two margin boundaries satisfy w·x + b = +1 and w·x + b = -1, the distance γ between the two boundaries equals twice the distance r from a boundary to the hyperplane, so Figure 3 can also be written as:


Figure 4
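The images for Figures 2 through 4 are not reproduced here. Based on the surrounding text, they presumably show the standard distance and margin formulas, which in the usual notation are:

d = \frac{|Ax_0 + By_0 + C|}{\sqrt{A^2 + B^2}} \qquad (\text{point-to-line distance})

r = \frac{|w \cdot x + b|}{\|w\|} \qquad (\text{point-to-hyperplane distance})

\gamma = 2r = \frac{2}{\|w\|} \qquad (\text{distance between the boundaries } w \cdot x + b = \pm 1)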

This is the geometric margin. Figure 5 shows the various concepts of the support vector machine:


Figure 5

When a point belongs to the positive class, y = +1 and w·x + b >= +1; when a point belongs to the negative class, y = -1 and w·x + b <= -1. Therefore y(w·x + b) is always greater than or equal to 1, and y(w·x + b) is called the functional margin.
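Written out explicitly (standard definitions, not taken from the missing figure), the functional margin of a sample is \hat{\gamma}_i = y_i (w \cdot x_i + b), and the corresponding geometric margin is \gamma_i = \dfrac{y_i (w \cdot x_i + b)}{\|w\|}.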

To find the maximum-margin hyperplane, we look for parameters w and b that maximize γ subject to the constraint that y(w·x + b) >= 1 for every sample. That is:


Figure 6
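The image for Figure 6 is missing; the margin-maximization problem it refers to is conventionally written as

\max_{w, b} \ \frac{2}{\|w\|} \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1, \quad i = 1, 2, \dots, N.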

Obviously, maximizing the margin 2/||w|| only requires minimizing ||w||, which is equivalent to minimizing (1/2)||w||², so the problem can be rewritten as:


Figure 7
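Figure 7 is likewise unavailable; the rewritten problem is the standard primal form

\min_{w, b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1, \quad i = 1, 2, \dots, N.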

The objective function and the constraint functions are continuously differentiable convex functions; moreover, the objective is quadratic and the constraints are affine, so this constrained problem is a convex quadratic programming problem.

A common way to solve such a convex quadratic programming problem is to introduce Lagrange multipliers and obtain the solution of the primal problem by solving its dual problem. This is the dual algorithm of the linearly separable support vector machine.

First, the Lagrangian function is defined by introducing a Lagrange multiplier α_i >= 0, i = 1, 2, ..., N for each inequality constraint. The Lagrangian function is:


Figure 8
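In standard notation (reconstructed, since the image for Figure 8 is missing), the Lagrangian is

L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \bigl[ y_i (w \cdot x_i + b) - 1 \bigr], \qquad \alpha_i \ge 0.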

In this way, the constrained extremum problem is transformed into an unconstrained one, and then, following Lagrange duality, the dual problem of the primal problem is solved:


Figure 9
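The image for Figure 9 is missing; the dual problem it presents is normally written as

\min_{\alpha} \ \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_{i=1}^{N} \alpha_i \quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \quad \alpha_i \ge 0,

with the primal solution recovered from the optimal \alpha^* as w^* = \sum_{i} \alpha_i^* y_i x_i and b^* = y_j - \sum_{i} \alpha_i^* y_i (x_i \cdot x_j) for any index j with \alpha_j^* > 0.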

The above is the whole process of solving the linearly separable support vector machine. After solving for w and b, the separating hyperplane is:


Figure 10

and the classification decision function is:


Figure 11
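Reconstructing Figures 10 and 11 in the usual notation, the separating hyperplane is

w^* \cdot x + b^* = 0

and the classification decision function is

f(x) = \operatorname{sign}(w^* \cdot x + b^*) = \operatorname{sign}\Bigl( \sum_{i=1}^{N} \alpha_i^* y_i (x_i \cdot x) + b^* \Bigr).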

As can be seen from the expressions for w* and b* in Figure 9, w* and b* depend only on the sample points (x_i, y_i) whose α_i > 0; these sample points are called support vectors.

Linear Support Vector Machine

When the data are only approximately linearly separable, that is, there are noise points in the data, a slack variable ξ is introduced so that the functional margin plus the slack variable is greater than or equal to 1. The constraint therefore becomes:


Figure 12
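In symbols (the image for Figure 12 is missing), the relaxed constraint is

y_i (w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, 2, \dots, N.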

Each slack variable ξ_i incurs a cost, so the objective function, i.e., the cost function, becomes:


Figure 13
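The objective in Figure 13 is conventionally written as

\min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i.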

Here C > 0 is called the penalty parameter; it is a hyperparameter that we must tune manually. The larger C is, the heavier the penalty for misclassification and the narrower the margin of the support vector machine; the smaller C is, the lighter the penalty for misclassification and the wider the margin. Geometrically, ξ_i measures how far a data point falls short of being correctly classified with the required margin.
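To see how C behaves in practice, here is a minimal scikit-learn sketch (the toy data and the particular C values are my own choices for illustration):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy 2-D data with some overlap, so the soft margin matters.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C)   # linear soft-margin SVM
    clf.fit(X, y)
    margin_width = 2.0 / np.linalg.norm(clf.coef_[0])  # margin width = 2 / ||w||
    print(f"C={C:>6}: margin width = {margin_width:.3f}, "
          f"support vectors = {len(clf.support_vectors_)}")

Larger C penalizes margin violations more heavily and typically yields a narrower margin with fewer support vectors; smaller C does the opposite.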

So where does the slack variable ξ come from? Because the data are only approximately separable and contain noise points, the misclassified points have to be counted when computing the cost function. These points have functional margins y(w·x + b) < 1, so the cost function can be written with the 0/1 loss: when a point's functional margin minus 1 is less than 0 (a misclassified or margin-violating point), it contributes to the cost; when the functional margin minus 1 is greater than or equal to 0 (a correctly classified point), it does not:


Figure 14
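Figure 14 is not shown; with a 0/1 loss \ell_{0/1}(z) that equals 1 when z < 0 and 0 otherwise, the cost function described here can be written as

\min_{w, b} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \ell_{0/1}\bigl( y_i (w \cdot x_i + b) - 1 \bigr).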

But the 0/1 loss has poor mathematical properties: it is non-convex and discontinuous. So it is replaced by a surrogate loss, the hinge loss max(0, 1 - z), and the cost function, i.e., the objective function, becomes:


Figure 15
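The hinge-loss version (Figure 15) is conventionally written as

\min_{w, b} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \max\bigl( 0, \ 1 - y_i (w \cdot x_i + b) \bigr).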

Replacing the max term with ξ, i.e., requiring ξ >= 1 - y(w·x + b) and ξ >= 0, the constrained objective (cost) function takes the following form:


Figure 16
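Reconstructed in standard form, the constrained problem of Figure 16 is

\min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0,

which is exactly the soft-margin problem stated earlier.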

Therefore, the loss function of the SVM can also be regarded as the hinge loss plus an L2 regularization term (||w||²). The above is the constrained optimization objective of the linear support vector machine; w and b are solved in the same way as in the linearly separable case, by introducing Lagrange multipliers, and this is not repeated here. The conditions that the solution must satisfy in that derivation are called the KKT conditions.
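To make the hinge-loss view concrete, here is a short NumPy sketch of the regularized hinge objective (the function name and arguments are mine, chosen for illustration):

import numpy as np

def svm_hinge_objective(w, b, X, y, C=1.0):
    """Soft-margin SVM objective: 0.5*||w||^2 + C * sum of hinge losses.

    X has shape (n_samples, n_features); y holds labels in {-1, +1}.
    """
    margins = y * (X @ w + b)                # functional margins y_i (w . x_i + b)
    hinge = np.maximum(0.0, 1.0 - margins)   # hinge loss max(0, 1 - margin)
    return 0.5 * np.dot(w, w) + C * hinge.sum()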

Nonlinear support vector machines

When the samples are not linearly separable, that is, no hyperplane in the current data space (the current dimension) can split the data, we need to map the data from the current dimension to a higher dimension so that they become linearly separable. The kernel function K(x, z) lets us compute inner products of the mapped data directly, without constructing the high-dimensional mapping explicitly.

Why does SVM use the Lagrange multiplier (dual) method to solve its constrained optimization problem? One reason is that the dual is easier to solve; the other is that it makes it convenient to introduce the kernel function K(x, z).

By introducing the kernel function, the objective function of the dual problem becomes:


Figure 17
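In the usual notation (the image for Figure 17 is missing), the kernelized dual is

\min_{\alpha} \ \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{N} \alpha_i \quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C.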

Finally, after solving for w* and b*, the classification decision function becomes:


Figure 18
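Reconstructed, the decision function of Figure 18 is

f(x) = \operatorname{sign}\Bigl( \sum_{i=1}^{N} \alpha_i^* y_i K(x_i, x) + b^* \Bigr).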

By introducing the kernel function, the nonlinear support vector machine can be solved with the same method as the linear support vector machine. The mapping is learned implicitly: we do not need to know how the feature mapping is computed or to what dimension the data are mapped, but we do have to choose the kernel function manually. Commonly used kernel functions include:

Polynomial kernel:


Figure 19
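The polynomial kernel (Figure 19) is commonly written as

K(x, z) = (x \cdot z + 1)^p,

where p is the degree of the polynomial.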

Gaussian kernel (radial basis function kernel):


Figure 20
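The Gaussian kernel (Figure 20) is commonly written as

K(x, z) = \exp\Bigl( -\frac{\|x - z\|^2}{2\sigma^2} \Bigr),

where \sigma is a bandwidth hyperparameter.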

There are also the linear kernel, the sigmoid kernel, and other kernel functions. A kernel is usually selected using prior knowledge or cross-validation; if there is no prior knowledge, the Gaussian kernel is generally chosen. Why the Gaussian kernel? Because it can map the data to an infinite-dimensional space.
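For reference, both kernels above are easy to write directly in NumPy (function names and default hyperparameters are my own):

import numpy as np

def polynomial_kernel(x, z, p=3):
    """Polynomial kernel K(x, z) = (x . z + 1)^p."""
    return (np.dot(x, z) + 1.0) ** p

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel K(x, z) = exp(-||x - z||^2 / (2*sigma^2))."""
    diff = x - z
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

In scikit-learn these correspond to SVC(kernel="poly") and SVC(kernel="rbf"), where the gamma parameter plays the role of 1/(2\sigma^2).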

Sequential Minimal Optimization (SMO)

This method is simply an efficient way to solve for the SVM parameters and is not central here (^-^), so I have not studied it in detail; I will update this article after I find time to read more about it.

Pending Update:

Reference books:

"Statistical Learning Methods", Hangyuan Li

"Machine Learning", Zhou Zhihua

Small Elephant College machine learning course (SHAMBO)
