Support Vector Machine (SVM)

1.1 SVM Concept

The Support Vector Machine (SVM) is a standalone (non-ensemble) classification algorithm with a clear geometric interpretation and high accuracy. It originates from Vapnik and Chervonenkis's early work on statistical learning theory (1971); the first SVM paper was published by Boser, Guyon, and Vapnik in 1992. The idea is intuitive, but the details are quite involved, touching on convex analysis, kernel functions, neural networks, and other advanced topics. In plain terms, the SVM is a binary classification model whose basic form is the linear classifier with the largest margin in feature space; that is, the learning strategy of the support vector machine is to maximize the margin, which can ultimately be cast as a convex quadratic programming problem.

The idea is simple. In the linearly separable case, the problem becomes a convex optimization problem that can be simplified with the Lagrange multiplier method and then solved with existing algorithms. In the more complex, linearly inseparable case, a mapping function projects the samples into a higher-dimensional space in which they become linearly separable, and kernel functions are used to reduce the computational cost of working in that high-dimensional space.

Problem: the optimal separating hyperplane (decision boundary)

Maximum Margin Hyperplane (MMH)

Here we consider a two-class classification problem. A data point is denoted by x, an n-dimensional vector; the T in wᵀ denotes the transpose; and the class label is denoted by y, which takes the value 1 or -1, representing the two classes. The learning goal of a linear classifier is to find a separating hyperplane in the n-dimensional data space, whose equation can be expressed as

    wᵀx + b = 0

1.2 Where the 1 / -1 classification labels come from: logistic regression

The purpose of logistic regression is to learn a 0/1 classification model from the features, taking a linear combination of the attributes as its argument. Since this linear combination ranges from negative infinity to positive infinity, a logistic function (sigmoid function) is used to map it into (0, 1), and the mapped value is interpreted as the probability that the label is y = 1. Formally, the hypothesis function is h(x) = g(θᵀx), where x is an n-dimensional feature vector and g is the logistic function, whose graph maps (-∞, +∞) smoothly onto (0, 1). Treating h(x) as the probability that the feature belongs to class y = 1, when we want to decide which class a new feature vector belongs to we only need h(x): if it is greater than 0.5 the example is class y = 1, otherwise class y = 0. Since h(x) depends only on θᵀx, h(x) > 0.5 exactly when θᵀx > 0; g(z) is only used for the mapping, and the real class decision is still made by θᵀx. When θᵀx ≫ 0, h(x) ≈ 1, and when θᵀx ≪ 0, h(x) ≈ 0. So the goal of the model is simply to make θᵀx far greater than 0 for the y = 1 examples in the training data and far less than 0 for the y = 0 examples; logistic regression learns θ so that this holds, emphasizing that it should hold on all training instances.

1.3 Formal representation

The labels used from here on are y = -1 and y = 1, replacing the y = 0 and y = 1 used in logistic regression, and the parameters θ are replaced by w and b. Previously we had θᵀx = θ₀ + θ₁x₁ + … + θₙxₙ (with x₀ = 1); now θ₀ is replaced by b and the remaining terms by wᵀx, so that θᵀx = wᵀx + b. In other words, apart from changing the label y = 0 to y = -1, this is no different from the formal representation of logistic regression. We then explicitly define the hypothesis function h(x) = g(wᵀx + b). As mentioned above, we only need to care about whether wᵀx + b is positive or negative, not about the exact value of g(z), so we simplify g(z) to map directly to y = -1 and y = 1. The mapping relationship is:

    g(z) = 1 if z ≥ 0, and g(z) = -1 otherwise.

This explains why the labels in linear classification are usually written as 1 or -1. Note: the section above is based on Jerrylead's notes from the Stanford machine learning course.
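A minimal sketch of this decision rule, using made-up parameters θ (not learned from data); it only illustrates that thresholding the sigmoid at 0.5 is the same as thresholding θᵀx at 0:

```python
# Sketch: the logistic (sigmoid) decision rule. Thresholding g(theta^T x) at 0.5
# is equivalent to thresholding theta^T x at 0. The parameter values here are
# hypothetical, chosen only for illustration.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.2, 2.0])         # hypothetical learned parameters
x = np.array([1.0, 0.3, 0.8])              # x_0 = 1 plays the role of the intercept

score = theta @ x                          # theta^T x
prob = sigmoid(score)                      # estimated P(y = 1 | x)
label_by_prob = 1 if prob > 0.5 else 0
label_by_sign = 1 if score > 0 else 0
print(prob, label_by_prob, label_by_sign)  # the two labels always agree
```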

2.1 Linear classification with the SVM

The linear classification performed by the SVM looks for the largest gap, i.e. the decision boundary that lies farthest from the nearest training points on either side.

Let x1 and x2 be points on the upper and the lower support boundary, respectively. Subtracting the two boundary equations and applying the definition of the inner product yields the width of the margin, so, putting everything together, maximizing the margin becomes a convex optimization problem. Convex optimization problems of this kind can be solved with the Lagrange multiplier method.
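Written out explicitly, the resulting convex optimization problem (the standard hard-margin formulation, assuming m training points (x_i, y_i) with y_i ∈ {-1, +1}) is:

    min over w, b of  (1/2)‖w‖²
    subject to  y_i (wᵀx_i + b) ≥ 1,  i = 1, …, m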

2.2 Lagrange Multiplier method

Background: Geometric interpretation of Lagrange multiplier method

Here f(x, y) is the objective function (drawn as its contour lines) and g(x, y) = c is the constraint. When the gradient of the objective function and the gradient of the constraint function are parallel to each other (pointing in the same or opposite directions), the objective function attains its optimum under that constraint. However, the plain Lagrange multiplier method only applies to equality constraints, so solving our convex optimization problem, which has inequality constraints, requires the KKT conditions.
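In symbols, for the equality-constrained case described here, one introduces a multiplier λ and looks for stationary points of the Lagrangian:

    L(x, y, λ) = f(x, y) + λ (g(x, y) − c)

Setting the gradient with respect to (x, y) to zero gives ∇f = −λ∇g, i.e. the two gradients are parallel at the optimum.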

2.3 KKT conditions

The KKT conditions apply to optimization problems of the following standard form:
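A standard way to write this form (using h_j for the equality constraints and g_i for the inequality constraints, matching the description below) is:

    min over x of  f(x)
    subject to  h_j(x) = 0,  j = 1, …, p
                g_i(x) ≤ 0,  i = 1, …, q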

where f(x) is the function to be minimized, the h_j(x) are equality constraints, the g_i(x) are inequality constraints, and p and q are the numbers of equality and inequality constraints, respectively. At the same time, we need to understand the following:

    • The meaning of the KKT conditions: they are necessary conditions for a solution of a nonlinear programming problem to be optimal (given a constraint qualification), and for convex problems they are also sufficient.

What exactly are the so-called Karush-Kuhn-Tucker (KKT) conditions? They state that a minimum point x* of the optimization problem in the standard form above must satisfy the following conditions:
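Written out (with λ_j denoting the multipliers for the equality constraints and μ_i those for the inequality constraints; this is the standard statement the text refers to):

    Stationarity:            ∇f(x*) + Σ_j λ_j ∇h_j(x*) + Σ_i μ_i ∇g_i(x*) = 0
    Primal feasibility:      h_j(x*) = 0,  g_i(x*) ≤ 0
    Dual feasibility:        μ_i ≥ 0
    Complementary slackness: μ_i g_i(x*) = 0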

It can be shown that our problem satisfies the KKT conditions (it satisfies Slater's condition, and f and the g_i are differentiable, i.e. the Lagrangian L can be differentiated with respect to w and b), so we now turn to solving the second, dual problem: under these conditions the original problem has been transformed into a dual problem. Solving this dual learning problem takes three steps: first minimize L(w, b, α) with respect to w and b, then maximize over α, and finally solve for the dual multipliers with the SMO algorithm.

2.4 Simplification for dual problems
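For reference, the standard Lagrangian of the hard-margin problem and the gradient conditions that the next sentence refers to (a reconstruction consistent with the formulation above) are:

    L(w, b, α) = (1/2)‖w‖² − Σ_i α_i [ y_i (wᵀx_i + b) − 1 ],  with α_i ≥ 0

    ∂L/∂w = 0  ⇒  w = Σ_i α_i y_i x_i
    ∂L/∂b = 0  ⇒  Σ_i α_i y_i = 0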

Substituting the above gradient results back into the Lagrangian function gives:


At this point the simplified Lagrangian yields the dual problem:
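In the usual notation, the resulting dual problem and the recovered classifier are:

    max over α of  Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j ⟨x_i, x_j⟩
    subject to  α_i ≥ 0,  Σ_i α_i y_i = 0

with w = Σ_i α_i y_i x_i, b computed from any support vector, and classification function f(x) = wᵀx + b = Σ_i α_i y_i ⟨x_i, x⟩ + b.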

This is clearly more concise than the previous convex optimization problem and can be solved with a variety of convex optimization algorithms; since only the support vectors participate in the calculation, the computational scale is much smaller than one might imagine. The Lagrangian now contains only one kind of variable, the multipliers α_i: solving the dual optimization problem gives α, from which w and b can be found, and the classification function is then easily obtained.

The unknowns in the dual formulation involve only the Lagrange multipliers, whereas the unknowns in the original problem also include the geometric parameters of the decision boundary, which is far more to solve for. The values of the undetermined multipliers are non-negative and are nonzero only at the "support vectors", so the final expression for the classification function is simpler than one might expect (although it is not known in advance which sample points will turn out to be "support vectors").

2.5 SMO algorithm

How the values of the Lagrange multipliers are actually computed may still be unclear. In practice the solution uses a fast learning algorithm, the Sequential Minimal Optimization (SMO) algorithm, which is briefly introduced here. SMO was proposed by Microsoft's John C. Platt in 1998 and is among the fastest quadratic programming optimization algorithms for this problem, performing especially well for linear SVMs and sparse data. The basic idea is to update only two multipliers at a time and reach the final solution by iterating. The calculation flow can be sketched as follows.
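A compact sketch of the idea in Python (the simplified variant that picks the second multiplier at random instead of using Platt's full selection heuristics; the linear kernel and the function name are choices made here, not taken from the source):

```python
# Simplified SMO sketch: data X (m x n), labels y in {-1, +1}, linear kernel.
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=10):
    m, _ = X.shape
    alpha, b = np.zeros(m), 0.0
    K = X @ X.T                                  # linear kernel matrix

    def f(i):                                    # decision value for sample i
        return np.sum(alpha * y * K[:, i]) + b

    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(m):
            E_i = f(i) - y[i]
            # pick an alpha_i that violates the KKT conditions
            if (y[i] * E_i < -tol and alpha[i] < C) or (y[i] * E_i > tol and alpha[i] > 0):
                j = np.random.choice([k for k in range(m) if k != i])
                E_j = f(j) - y[j]
                a_i, a_j = alpha[i], alpha[j]
                # bounds keeping 0 <= alpha_j <= C on the equality-constraint line
                if y[i] != y[j]:
                    L, H = max(0, a_j - a_i), min(C, C + a_j - a_i)
                else:
                    L, H = max(0, a_i + a_j - C), min(C, a_i + a_j)
                if L == H:
                    continue
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if eta >= 0:
                    continue
                alpha[j] = np.clip(a_j - y[j] * (E_i - E_j) / eta, L, H)
                if abs(alpha[j] - a_j) < 1e-5:
                    continue
                alpha[i] = a_i + y[i] * y[j] * (a_j - alpha[j])
                # update the threshold b
                b1 = b - E_i - y[i] * (alpha[i] - a_i) * K[i, i] - y[j] * (alpha[j] - a_j) * K[i, j]
                b2 = b - E_j - y[i] * (alpha[i] - a_i) * K[i, j] - y[j] * (alpha[j] - a_j) * K[j, j]
                b = b1 if 0 < alpha[i] < C else (b2 if 0 < alpha[j] < C else (b1 + b2) / 2)
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    w = (alpha * y) @ X                          # recover w in the linear case
    return alpha, w, b
```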

The specific iterative procedure and further details are given in the Jerrylead reference: http://www.cnblogs.com/jerrylead/archive/2011/03/18/1988419.html

3.1 The linearly inseparable case

Now let us discuss the linearly inseparable case, because the assumption of linear separability is too restrictive. The figure referred to here shows a typical linearly inseparable classification problem: we cannot use a straight line to split it into two regions each containing points of only one color. For a classifier in this situation there are two approaches. One is to separate the classes completely with a curve; a curve is the nonlinear case, which is related to the kernel functions discussed later.

3.1.1 Adding slack variables and a penalty function

The other approach is to keep using a straight line but not insist on a perfect split: we tolerate the misclassified points, while adding a penalty function so that the fewer and less severe the misclassifications, the better. In fact, in many cases it is not best to make the classifier perfect on the training data, because some of the training data are noise, perhaps mislabeled when the class labels were assigned by hand. If we learn these wrong points during training, the model will inevitably make mistakes the next time it encounters such cases (if a teacher taught you some wrong facts and you still believed them, you would inevitably get them wrong in the exam). Learning this "noise" during training is over-fitting, a big taboo in machine learning: we would rather learn a little less than learn something wrong. Back to the topic: how do we separate linearly inseparable points with a straight line?

We can add a small penalty for the misclassified points: the penalty for a misclassified point is the distance from that point to its correct region. In the figure, the blue and red lines are the support-vector boundaries, the green line is the decision function, and the purple segments represent the distances from the misclassified points to their corresponding decision boundary. We therefore add a penalty function to the original objective and take the constraint as:
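In symbols, a standard form of this penalized objective and constraint (using ε_i for the slack of point i, r for the number of points, and C for the penalty coefficient discussed below) is:

    min over w, b, ε of  (1/2)‖w‖² + C Σ_{i=1..r} ε_i
    subject to  y_i (wᵀx_i + b) ≥ 1 − ε_i,  ε_i ≥ 0,  i = 1, …, r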

The blue part of the formula is the penalty added on top of the linearly separable problem. When x_i is on the wrong side, ε_i > 0; if C is very large, then to minimize the objective the slack ε_i must be as small as possible, i.e. the misclassified point must lie as close as possible to the boundary of its correct side. When x_i is on the right side, ε_i = 0. Here r is the total number of points and C is a user-specified coefficient indicating how heavily misclassified points are penalized. When C is very large, fewer points are misclassified, but over-fitting may be more serious; when C is very small, many points may be misclassified, and the resulting model may not be accurate. How to choose C is therefore something of an art; in most cases it is tuned by experience and trial.

Next, in the same way as before, we solve the Lagrangian dual and obtain the dual expression of this new primal problem:

The blue part is what differs from the dual expression of the linearly separable case. In the dual obtained for the linearly inseparable case, the only difference is that the range of each α_i changes from [0, +∞) to [0, C]; the added penalty ε does not add any complexity to the dual problem.
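For reference, the standard form of this dual (the changed constraint being 0 ≤ α_i ≤ C) is:

    max over α of  Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j ⟨x_i, x_j⟩
    subject to  0 ≤ α_i ≤ C,  Σ_i α_i y_i = 0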

3.1.2 Kernel function

In discussing the inseparable case we just mentioned that some nonlinear method can produce a curve that perfectly separates the two classes; the kernel function described next is such a method.

We can map the data from the original space into a higher-dimensional space and then separate it with a hyperplane in that high-dimensional linear space. Here is an example of how raising the dimensionality of the space helps us classify (example and images from Pluskid's article on kernel functions):

The figure shows a typical linearly inseparable case.

But when we map these two ellipse-like sets of points into a higher-dimensional space with a suitable mapping function, each point in the plane is mapped to a point in three-dimensional space (z1, z2, z3), and after rotating the mapped coordinates we obtain a linearly separable set of points.
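A quick sketch of this idea, assuming the common quadratic mapping (z1, z2, z3) = (x1², √2·x1·x2, x2²) (the source does not show its exact mapping function, so this choice is an assumption):

```python
# Sketch: concentric (ellipse-like) 2-D data become linearly separable after a
# quadratic mapping to 3-D. The mapping here is an assumed example, not
# necessarily the one used in Pluskid's original figures.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.concatenate([rng.normal(1.0, 0.1, 100),    # inner ring, class -1
                    rng.normal(2.0, 0.1, 100)])   # outer ring, class +1
X = np.c_[r * np.cos(theta), r * np.sin(theta)]
y = np.r_[-np.ones(100), np.ones(100)]

def phi(X):
    # (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.c_[X[:, 0] ** 2, np.sqrt(2) * X[:, 0] * X[:, 1], X[:, 1] ** 2]

Z = phi(X)
# In the mapped space z1 + z3 = x1^2 + x2^2 = r^2, so the plane z1 + z3 = 1.5^2
# separates the two rings:
pred = np.where(Z[:, 0] + Z[:, 2] > 1.5 ** 2, 1, -1)
print("separable in 3-D:", np.all(pred == y))
```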

Here is a more philosophical example: there are no two identical objects in the world, and for any two objects we can distinguish them by adding dimensions. Take two books: in two dimensions (color, content) they may be the same, but we can add the author as a dimension, then the page count, the owner, the place of purchase, the notes written inside, and so on. When the number of dimensions grows to infinity, it is certain that any two objects can be separated.

Recall the dual problem expression we just obtained:

The troublesome part, marked in red in the formula, is that the inner product of the original samples is replaced by the inner product of the mapped sample vectors: the original features are n-dimensional, we map them to roughly n² dimensions and then compute the inner product there, so the time required grows from O(n) to O(n²). This creates the curse of dimensionality.

This is where the kernel function is needed, after the dimensional catastrophe appears. With the mapping to the high-dimensional space on the left and the kernel function K(x, z) on the right, we find that computing only the square of the inner product of the original samples x and z (time complexity O(n)) is equivalent to computing the inner product of the mapped high-dimensional samples. A function that computes the inner product of two vectors in the implicitly mapped space is called a kernel function (Kernel Function).

Example 1:

Example 2:
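As a concrete illustration of this equivalence (a standard example; whether it matches the Example 1 and Example 2 referred to above is an assumption), the degree-2 polynomial kernel K(x, z) = (xᵀz)² computes exactly the inner product under the mapping φ(x) = (x1², √2·x1·x2, x2²), without ever forming φ explicitly:

```python
# Check that the degree-2 polynomial kernel equals the inner product of the
# explicit quadratic mapping phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def K(x, z):
    return float(np.dot(x, z)) ** 2          # O(n) work, no explicit mapping

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(K(x, z), np.dot(phi(x), phi(z)))       # both print 1.0
```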

The kernel function simplifies the inner product computation in the mapped space, and conveniently the quantities computed in the SVM all appear in inner-product form. Rewriting the classification function from the low-dimensional space accordingly, it becomes:

and the dual optimization problem over α is expressed as:
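In the usual notation, the kernelized classification function and dual that these sentences describe are:

    f(x) = Σ_i α_i y_i K(x_i, x) + b

    max over α of  Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j)
    subject to  0 ≤ α_i ≤ C,  Σ_i α_i y_i = 0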

This solves the computational problem of the dual optimization: the computation in the high-dimensional space is avoided, yet the result is the same. Of course, because our example here is very simple, the corresponding kernel function could be constructed directly; for an arbitrary mapping, constructing the corresponding kernel function can be very difficult. It is therefore necessary to be able to verify whether a kernel function constructed for a high-dimensional mapping is valid.

4.1 Determining whether a kernel function is valid

Question: given a function K, can we use K in place of the explicit high-dimensional computation? In other words, can we find a mapping φ such that K(x, z) = φ(x)ᵀφ(z) for all x and z?

For example, given a particular function K, can we conclude that it is a valid kernel function?

The following answers this question. Suppose we are given m training samples, each corresponding to a feature vector x_i. Then we can take any two of them and plug them into K, computing K_ij = K(x_i, x_j); i can range from 1 to m and j from 1 to m, so we obtain an m×m kernel matrix (Kernel matrix). For convenience, we use K to denote both the kernel matrix and the kernel function.

If we assume that K is a valid kernel function, then by the definition of the kernel function, K_ij = K(x_i, x_j) = φ(x_i)ᵀφ(x_j) = φ(x_j)ᵀφ(x_i) = K(x_j, x_i) = K_ji.

As can be seen, the kernel matrix K must be symmetric. Let us now draw a stronger conclusion. First use φ_k(x) to denote the k-th coordinate of the mapping function φ(x). Then for any vector z,
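Expanding this (a reconstruction of the standard derivation that the next sentence refers to):

    zᵀKz = Σ_i Σ_j z_i K_ij z_j
         = Σ_i Σ_j z_i φ(x_i)ᵀφ(x_j) z_j
         = Σ_i Σ_j Σ_k z_i φ_k(x_i) φ_k(x_j) z_j
         = Σ_k ( Σ_i z_i φ_k(x_i) )²  ≥ 0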

The final step uses the same rearrangement as the earlier calculation. From this formula we can see that if K is a valid kernel function (i.e. K(x_i, x_j) and φ(x_i)ᵀφ(x_j) are equivalent), then the kernel matrix K obtained on the training set must be positive semi-definite (K ⪰ 0).

So we obtain a necessary condition for a valid kernel function:

K is a valid kernel function ==> the kernel matrix K is symmetric positive semi-definite.

Fortunately, this condition is also sufficient, as expressed by Mercer's theorem.

Mercer theorem:

Let K be a function on ℝⁿ × ℝⁿ (that is, it maps two n-dimensional vectors to a real number). Then K is a valid kernel function (also called a Mercer kernel function) if and only if, for any finite training sample {x_1, …, x_m}, its corresponding kernel matrix is symmetric positive semi-definite.

Mercer's theorem shows that to prove K is a valid kernel function we do not need to find the mapping φ explicitly; we only need to evaluate K on each pair of points in the training set and then determine whether the resulting matrix K is positive semi-definite (for example, by checking that all of its eigenvalues are non-negative, or that all of its principal minors are non-negative).
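A small sketch of this numerical check, assuming a Gaussian (RBF) kernel as the candidate K (the kernel choice and sample data here are illustrative):

```python
# Sketch: check the Mercer condition numerically for a candidate kernel by
# building the kernel matrix on a sample and testing symmetry and positive
# semi-definiteness via its eigenvalues.
import numpy as np

def gaussian_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))                    # 20 sample points in R^3

K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

symmetric = np.allclose(K, K.T)
eigvals = np.linalg.eigvalsh(K)                 # eigenvalues of the symmetric matrix
psd = np.all(eigvals >= -1e-10)                 # tolerate tiny numerical negatives
print("symmetric:", symmetric, "positive semi-definite:", psd)
```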

The above is intended only as study notes on support vector machines.

References: Jerrylead's posts on the cnblogs Blog Park, CSDN blog posts, and the 炼数成金 (Dataguru) machine learning material on support vector machines.
