Understanding Support Vector Machines

The support vector machine (SVM) is a binary classification model, though it can also be extended to multi-class classification. Thanks to margin maximization and the kernel trick, it can flexibly handle both linear and nonlinear classification problems.
Learning an SVM amounts to solving a convex quadratic programming problem, and the learning algorithm is an optimization algorithm for convex quadratic programs.
Depending on whether the training data are linearly separable, support vector machines divide into the linearly separable SVM based on hard-margin maximization, the linear SVM based on soft-margin maximization, and the nonlinear SVM based on the kernel trick together with soft-margin maximization. The three increase in complexity in that order.
1. Linearly separable support vector machine based on hard-margin maximization
We know that learning methods such as the perceptron and decision trees do not distinguish between the input space and the feature space of the model; the two spaces are the same. For the support vector machine they differ: the input space is a Euclidean space or a discrete set, while the feature space is a Hilbert space. A Hilbert space can be regarded as a generalization of Euclidean space; its dimension can be arbitrary, even infinite, and it retains the completeness property of Euclidean space. These properties are exactly what the SVM needs in order to perform nonlinear feature-space mappings.
We start from the simplest case, the linearly separable support vector machine. Anyone who has studied the perceptron knows that it trains a separating hyperplane that splits the plane (or space) so that linearly separable points fall on either side.
Its hyperplane equation is $w \cdot x + b = 0$;
the classification decision function is $f(x) = \mathrm{sign}(w \cdot x + b)$.
The linearly separable SVM likewise classifies a data set by finding a separating hyperplane. The difference lies in the learning strategy: the perceptron minimizes the total distance of misclassified points to the hyperplane, whereas the linearly separable SVM is based on hard-margin maximization.
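As a minimal sketch of this decision function (the weight vector and bias below are made-up numbers, not learned parameters):

```python
import numpy as np

def decision_function(w, b, x):
    """Classify x by the sign of w . x + b, the separating hyperplane."""
    return np.sign(np.dot(w, x) + b)

# Hypothetical hyperplane parameters and a sample point, purely for illustration.
w = np.array([2.0, -1.0])
b = 0.5
print(decision_function(w, b, np.array([1.0, 1.0])))  # -> 1.0
```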

What is hard-margin maximization?
We know that once the parameters $w, b$ are fixed, the separating hyperplane is fixed, and the distance from every point on either side of it to the hyperplane can be computed.

Among these distances there must be a minimum one. In fact, for a linearly separable data set there are many pairs $(w, b)$ that separate the two classes, each with its own minimum distance. Which $(w, b)$ gives the best classifier? The farther the two classes of points lie from the separating surface, the better the classification, so we want the pair whose minimum distance is the largest. To measure this value, the concepts of functional margin and geometric margin are introduced.
For a given hyperplane, $|w \cdot x + b|$ can be understood as a relative measure of the distance from a point $x$ to the hyperplane,
and whether the sign of $w \cdot x + b$ agrees with the sign of the class label $y$ indicates whether the classification is correct.
Functional margin: for a given training data set $T$ and hyperplane $(w, b)$, the functional margin of the hyperplane $(w, b)$ with respect to a sample point $(x_i, y_i)$ is defined as

$$\hat{\gamma}_i = y_i (w \cdot x_i + b).$$

The functional margin with respect to the data set $T$ is defined as the minimum of the functional margins over all sample points:

$$\hat{\gamma} = \min_{i=1,\dots,N} \hat{\gamma}_i.$$
However, the functional margin has a problem: if we scale $w, b$ proportionally (say, doubling both), the functional margin doubles while the hyperplane itself does not change. To fix this, the geometric margin is defined:
for a given training data set $T$ and hyperplane $(w, b)$, the geometric margin of the hyperplane with respect to a sample point $(x_i, y_i)$ is defined as

$$\gamma_i = y_i \left( \frac{w}{\|w\|} \cdot x_i + \frac{b}{\|w\|} \right).$$

The geometric margin with respect to the data set $T$ is defined as the minimum of the geometric margins over all sample points:

$$\gamma = \min_{i=1,\dots,N} \gamma_i.$$
Here $\|w\|$ is the L2 norm of $w$. Dividing by the norm of the normal vector constrains the distance, ensuring that the geometric margin from a point to the hyperplane stays constant when $w, b$ are scaled proportionally.
From the definitions of the functional margin and the geometric margin we can see:

$$\gamma_i = \frac{\hat{\gamma}_i}{\|w\|}, \qquad \gamma = \frac{\hat{\gamma}}{\|w\|}.$$
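A small numerical sketch of these two margins (the points, labels, and hyperplane parameters are invented for illustration):

```python
import numpy as np

# Toy data: two labeled points and an assumed hyperplane (w, b).
X = np.array([[3.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, -1.0])
w = np.array([1.0, 1.0])
b = -3.0

functional = y * (X @ w + b)                  # functional margins y_i (w . x_i + b)
geometric = functional / np.linalg.norm(w)    # geometric margins, scaled by ||w||
print(functional.min(), geometric.min())

# Doubling (w, b) doubles the functional margin but leaves the geometric one unchanged.
f2 = y * (X @ (2 * w) + 2 * b)
print(f2.min(), (f2 / np.linalg.norm(2 * w)).min())
```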
Here we return to hard-margin maximization. To make the classifier as good as possible, we require this minimum margin to be as large as possible; that is, we seek the values of $w, b$ from the following objective function and constraints:

$$\max_{w,b} \ \gamma \qquad \text{s.t.} \quad y_i \left( \frac{w}{\|w\|} \cdot x_i + \frac{b}{\|w\|} \right) \ge \gamma, \quad i = 1, \dots, N,$$

that is,

$$\max_{w,b} \ \frac{\hat{\gamma}}{\|w\|} \qquad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge \hat{\gamma}, \quad i = 1, \dots, N.$$
We know that the functional margin $\hat{\gamma}$ scales with $w, b$: if $w$ and $b$ are both doubled, the functional margin also doubles, so a proportional change of $w, b$ leaves both the objective function and the constraints unchanged. In other words, the value of $\hat{\gamma}$ does not affect the optimization problem at all. For convenience in the calculations that follow, we therefore take $\hat{\gamma} = 1$, and the objective function can be written as:

$$\max_{w,b} \ \frac{1}{\|w\|} \qquad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \dots, N.$$
Since maximizing $\frac{1}{\|w\|}$ is equivalent to minimizing $\|w\|$, which in turn is equivalent to minimizing $\frac{1}{2}\|w\|^2$, the problem can be rewritten as:

$$\min_{w,b} \ \frac{1}{2}\|w\|^2 \qquad \text{s.t.} \quad y_i (w \cdot x_i + b) - 1 \ge 0, \quad i = 1, \dots, N.$$

The point of this max-to-min equivalence is to turn the objective into a convex quadratic programming problem, so that the KKT conditions can be satisfied for the dual problem later. The coefficient $\frac{1}{2}$ is added so that the factor of 2 cancels when differentiating, which makes the computation cleaner.
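This primal problem can be handed to any convex QP solver. Here is a minimal sketch using the cvxpy library on an invented, linearly separable three-point set (the data and the solver choice are assumptions for illustration, not part of the derivation):

```python
import cvxpy as cp
import numpy as np

# Invented linearly separable toy data.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# min (1/2)||w||^2  s.t.  y_i (w . x_i + b) >= 1
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
problem.solve()

print(w.value, b.value)  # optimal hyperplane parameters
```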
The problem now is how to find this optimum under inequality constraints. We can use the method of Lagrange multipliers. In general, for a problem $\min f(x)$ subject to $g_i(x) \le 0$ and $h_j(x) = 0$, the Lagrangian function is defined as:

$$L(x, \alpha, \beta) = f(x) + \sum_i \alpha_i g_i(x) + \sum_j \beta_j h_j(x), \qquad \alpha_i \ge 0.$$
From this it follows that, because $h(x) = 0$, $g(x) \le 0$ and $\alpha_i \ge 0$, the Lagrangian $L(x, \alpha, \beta)$ is at most $f(x)$ whenever the constraints hold, and $\max_{\alpha, \beta} L(x, \alpha, \beta) = f(x)$.
Introducing a Lagrange multiplier $\alpha_i \ge 0$ for each inequality constraint and adding the constraints to the objective through the multipliers, the Lagrangian for our problem is:

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \big[ y_i (w \cdot x_i + b) - 1 \big].$$
So the problem becomes:

$$\min_{w,b} \ \max_{\alpha \ge 0} \ L(w, b, \alpha).$$
Generally speaking, this form is not easy to solve directly, so we turn to its dual problem instead:

$$\max_{\alpha \ge 0} \ \min_{w,b} \ L(w, b, \alpha).$$
The original problem has the same optimal solution as the dual problem provided the original problem satisfies the KKT conditions. The KKT conditions are:
1. the partial derivatives of $L(w, b, \alpha)$ with respect to the primal variables are zero;
2. $h(x) = 0$ and $g(x) \le 0$, i.e. the constraints hold;
3. $\alpha_i \, g_i(x) = 0$ with $\alpha_i \ge 0$ (complementary slackness).
This is where the convex quadratic programming problem we constructed earlier becomes useful: for such a problem strong duality can be proved, so the optimal value of the dual problem equals the optimal value of the original problem.
After converting to the dual problem, the solution proceeds as follows:
1. Solve $\min_{w,b} L(w, b, \alpha)$.

Setting the partial derivatives of the Lagrangian $L(w, b, \alpha)$ with respect to $w$ and $b$ to zero gives

$$\nabla_w L = w - \sum_{i=1}^{N} \alpha_i y_i x_i = 0 \quad\Rightarrow\quad w = \sum_{i=1}^{N} \alpha_i y_i x_i,$$

$$\nabla_b L = -\sum_{i=1}^{N} \alpha_i y_i = 0 \quad\Rightarrow\quad \sum_{i=1}^{N} \alpha_i y_i = 0.$$

Substituting these back into the Lagrangian yields

$$\min_{w,b} L(w, b, \alpha) = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^{N} \alpha_i.$$
2. Maximize the result over $\alpha$, namely:

$$\max_{\alpha} \ -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^{N} \alpha_i \qquad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \quad \alpha_i \ge 0, \ i = 1, \dots, N.$$
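For illustration only, this dual can also be handed to a generic constrained optimizer rather than the dedicated SMO algorithm discussed next; a sketch with SciPy's SLSQP on the same invented three-point data (solver choice and data are assumptions):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
G = (y[:, None] * X) @ (y[:, None] * X).T  # G_ij = y_i y_j (x_i . x_j)

# SciPy minimizes, so negate the dual objective: (1/2) a^T G a - sum(a).
dual = lambda a: 0.5 * a @ G @ a - a.sum()

res = minimize(dual, x0=np.zeros(len(y)), method="SLSQP",
               bounds=[(0.0, None)] * len(y),                       # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y})  # sum alpha_i y_i = 0
alpha = res.x
w = (alpha * y) @ X  # recover w* = sum_i alpha_i y_i x_i
print(alpha, w)
```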
This problem can be solved with the SMO algorithm, giving the solution $\alpha^* = (\alpha_1^*, \alpha_2^*, \dots, \alpha_N^*)$.
We know $\alpha_i \ge 0$, and a proof by contradiction shows that at least one $\alpha_j^* > 0$: if all $\alpha_i^*$ were zero, the expression above would give $w = 0$, which is obviously not a solution of the original problem. For such a $j$ with $\alpha_j^* > 0$, the KKT condition $\alpha_i \, g_i = 0$ forces $g_j = 0$, namely:

$$y_j (w^* \cdot x_j + b^*) - 1 = 0.$$
Noting that $y_j^2 = 1$, multiply through by $y_j$, substitute $w^* = \sum_i \alpha_i^* y_i x_i$, and solve for $b^*$:

$$b^* = y_j - \sum_{i=1}^{N} \alpha_i^* y_i (x_i \cdot x_j).$$
This lets us write down the separating hyperplane:

$$\sum_{i=1}^{N} \alpha_i^* y_i (x \cdot x_i) + b^* = 0.$$
The classification decision function is:

$$f(x) = \mathrm{sign}\left( \sum_{i=1}^{N} \alpha_i^* y_i (x \cdot x_i) + b^* \right).$$
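This whole pipeline (dual solution, support vectors, hyperplane, decision function) is what off-the-shelf libraries implement. A minimal usage sketch with scikit-learn on the same invented data, where a very large C approximates the hard margin:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ hard margin

print(clf.support_vectors_)       # the support vectors
print(clf.dual_coef_)             # alpha_i* y_i for the support vectors
print(clf.coef_, clf.intercept_)  # w* and b*
print(clf.predict([[3.0, 2.0]]))  # sign of the decision function
```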
2. Linear support vector machine based on soft-margin maximization
The linearly separable SVM above assumes the ideal situation in which the training samples are linearly separable. When noise points or outliers in the training sample make it linearly inseparable, the following linear support vector machine is needed.
In the linearly separable SVM we set the functional margin to 1. If a noise point or outlier has a functional margin in $(0, 1)$, it violates the constraints of the problem, and the data are linearly inseparable. To solve this, a slack variable $\xi_i \ge 0$ is introduced for each sample so that the functional margin plus the slack variable is at least 1; the constraint becomes:

$$y_i (w \cdot x_i + b) \ge 1 - \xi_i.$$
At the same time, since the constraints have been relaxed, the objective function must also change; it becomes:

$$\min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i.$$
Here $C > 0$, called the penalty coefficient, is usually determined by the application; a larger $C$ increases the penalty for misclassification. The minimized objective carries two meanings: make $\frac{1}{2}\|w\|^2$ as small as possible, i.e. the margin as large as possible, and keep the misclassified points as few as possible; $C$ is the coefficient that balances the two.
This setting is known as soft-margin maximization. The problem can be stated as:

$$\min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i \qquad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, N.$$
We again construct the Lagrangian and convert to the dual problem:

$$\min_{\alpha} \ \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_{i=1}^{N} \alpha_i \qquad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \ i = 1, \dots, N.$$

The only difference from the hard-margin dual is that each $\alpha_i$ is now confined to the box $[0, C]$.
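A quick sketch of the role of $C$ with scikit-learn on noisy, overlapping data (the data generator and parameter values are my assumptions):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping blobs, so the data is not linearly separable.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Smaller C tolerates more margin violations, hence more support vectors.
    print(C, len(clf.support_))
```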
Support Vectors
We have been talking about support vector machines all along, so what exactly is a support vector? The SVM classifies a data set by maximizing the margin between the classes. In the hard-margin case, picture the red line as the separating hyperplane and the pink and blue lines as the boundaries of the two classes at the maximum margin from it; the points lying on the pink and blue lines determine that maximum margin, are the key points for classification, and are called support vectors.

In the soft-margin case the picture is looser. For an instance point $x_i$ with slack variable $\xi_i$, the distance from the point to its margin boundary is $\frac{\xi_i}{\|w\|}$.
The support vectors are again the points that are key to classification: points on the margin boundary (the red points), points between the margin boundary and the separating hyperplane (the green points), and misclassified points (the blue points).
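These categories can be read off a trained model through the slack values $\xi_i = \max(0, 1 - y_i f(x_i))$. A sketch under the same invented-data assumptions as above (labels are mapped to $\pm 1$ to match the formula):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

y_pm = np.where(y == 1, 1.0, -1.0)                           # labels as +/-1
xi = np.maximum(0.0, 1.0 - y_pm * clf.decision_function(X))  # slack variables

print("inside the margin:", ((xi > 0) & (xi <= 1)).sum())    # 0 < xi <= 1
print("misclassified:", (xi > 1).sum())                      # xi > 1
```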

3. Nonlinear support vector machine based on soft margin and the kernel trick
The data sets discussed so far are linearly separable or approximately so, but real-world data are often linearly inseparable.
In that case a nonlinear transformation is needed to map the input space into a high-dimensional feature space, turning the nonlinear problem into a linear one, which a linear classifier can then solve. For example:
take a nonlinear separating curve in the plane such as $w_1 x_1^2 + w_2 x_1 x_2 + w_3 x_2^2 + b = 0$ and substitute the new coordinates

$$z = (x_1^2, \ \sqrt{2}\, x_1 x_2, \ x_2^2),$$

under which the curve becomes a plane $\tilde{w} \cdot z + b = 0$ in the new coordinates. Thus the problem is transformed into a linear separation problem in three-dimensional space.
The kernel function does exactly this kind of thing: through a nonlinear transformation it maps the input space into a Hilbert space, so that a hypersurface model in the input space corresponds to a hyperplane model in the feature space. The kernel function is defined as follows:
let $\chi$ be the input space (a Euclidean space or a discrete set) and $\mathcal{H}$ the feature space (a Hilbert space). If there exists a mapping
$\varphi(x) : \chi \to \mathcal{H}$ such that for all $x, z \in \chi$ the function $\kappa(x, z) = \varphi(x) \cdot \varphi(z)$, then $\kappa(x, z)$ is called a kernel function and $\varphi(x)$ the mapping function. For example:
suppose the input space is the two-dimensional Euclidean space and the kernel function is $\kappa(x, z) = (x \cdot z)^2$.
The feature space can be taken to be the three-dimensional Euclidean space. Writing the inputs as $x = (x_1, x_2)$ and $z = (z_1, z_2)$,
take the mapping function

$$\varphi(x) = (x_1^2, \ \sqrt{2}\, x_1 x_2, \ x_2^2)^T,$$

which satisfies $\kappa(x, z) = \varphi(x) \cdot \varphi(z) = (x \cdot z)^2$.
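A two-line numeric check of this identity, with arbitrary sample vectors:

```python
import numpy as np

phi = lambda v: np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x, z = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(np.dot(x, z) ** 2)       # kernel value (x . z)^2 -> 121.0
print(np.dot(phi(x), phi(z)))  # phi(x) . phi(z)        -> 121.0
```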
From the linearly separable SVM described earlier, we know that its classification function depends on the inner product of the input $x$ with the sample points. When the data are linearly inseparable, we use the kernel function to map the input-space inner product $(x \cdot z)$ of sample points to the feature-space inner product $\varphi(x) \cdot \varphi(z)$, while still using the linear classifier defined above. In this way an effective and simple nonlinear classifier is obtained. The classification decision function of the nonlinear support vector machine is:

$$f(x) = \mathrm{sign}\left( \sum_{i=1}^{N} \alpha_i^* y_i \, \kappa(x, x_i) + b^* \right).$$
Here $x$ is the input and $(x_i, y_i)$ are the sample points; $\alpha_i$ can be solved for with SMO, and $b$, which is a function of $\alpha_i$, $x_i$ and $y_i$, can then be obtained as well, so the classification result follows.
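In practice the nonlinear SVM is a one-line change: pick a kernel. A sketch on concentric-circle data (the make_circles generator and the degree-2 polynomial kernel, which corresponds to the $(x \cdot z)^2$ example above, are my choices for illustration):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: linearly inseparable in the input space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

clf = SVC(kernel="poly", degree=2).fit(X, y)
print(clf.score(X, y))  # near 1.0: separable in the mapped feature space
```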
With this, the general idea of the support vector machine is basically complete: find a separating hyperplane and build the classification function from it; when the problem is nonlinear, find a way to convert it into a linear one and classify with a linear classifier. Solving the problem means obtaining the parameters of the separating hyperplane; when they are not easy to solve for directly, convert to the dual problem, from which they can finally be found.
Well, this piece on support vector machines has grown long enough, so let's stop here. The next section introduces kernel functions and the SMO algorithm.
