Leftnoteasy's blog: Understanding SVM (Part I)

Copyright:

This article was published by leftnoteasy at http://leftnoteasy.cnblogs.com. It may be reproduced in whole or in part, but please credit the source. If there is any problem, please contact wheeleast@gmail.com

Preface:

I haven't updated my blog for a long time; it has been two months since the last post. One major reason is that I didn't know what to write -_-. I recently read some material on SVM (Support Vector Machine) and found it very interesting, so today I plan to write an article about it.

There are many papers and books on SVM. To quote Qiangge: "SVM is an algorithm that truly puts applied mathematicians to use." The mathematics in SVM is hard for most ordinary people to understand fully, so to help them understand it, the mathematics has to be explained in simple language. Understanding this mathematics also helps when learning other things. I am one of the vast majority of ordinary people: to understand SVM I read a lot of material, and here I share what I learned.

In fact, there is plenty of Chinese-language material on SVM, but everyone understands it a little differently, so I decided to write my own. Some overlap is inevitable, but I still hope to write something different from others. I will not go into much mathematics in this article (many articles already have) and will try to give simple conclusions instead. As the series title says, Algorithms in Machine Learning (previously called Mathematics in Machine Learning), the content of this series is more application-oriented. For more detailed mathematical explanations, see the references.

1. Linear classifier:

First, consider a very, very simple classification problem (a linearly separable one). We need to use a straight line to separate the black points from the white points. Obviously, the line in the figure is one of the lines we are looking for (there are infinitely many such lines).

Suppose we label the black points -1 and the white points +1, and write the line as f(x) = w·x + b, where x and w are vectors. This form is equivalent to f(x) = w1x1 + w2x2 + ... + wnxn + b. When x has dimension 2, f(x) = 0 describes a straight line in two-dimensional space; when x has dimension 3, f(x) = 0 describes a plane in three-dimensional space; when x has dimension n > 3, it describes an (n-1)-dimensional hyperplane in n-dimensional space. This is fairly basic material; if it is unclear, you may need to review calculus and linear algebra.

As just mentioned, we set the black and white points to -1 and +1 respectively. So when a new point x needs to be classified, we can predict its class with sgn(f(x)), where sgn is the sign function: when f(x) > 0, sgn(f(x)) = +1, and when f(x) < 0, sgn(f(x)) = -1.
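As a minimal sketch of this prediction rule: the weights w, the bias b and the sample points below are made-up values for illustration, not taken from the figure, and returning 0 exactly on the boundary is a choice made here because the text leaves that case undefined.

```python
def f(w, x, b):
    """Linear decision function f(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def sgn(value):
    """Sign function: +1 for positive, -1 for negative, 0 exactly on the boundary."""
    if value > 0:
        return 1
    if value < 0:
        return -1
    return 0


# Hypothetical separating line x1 + x2 - 3 = 0 in two dimensions.
w = [1.0, 1.0]
b = -3.0

white_like = [2.0, 2.0]  # f = 1, so the predicted class is +1 (white)
black_like = [0.0, 1.0]  # f = -2, so the predicted class is -1 (black)

print(sgn(f(w, white_like, b)))
print(sgn(f(w, black_like, b)))
```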

But how do we obtain an optimal separating line f(x)? The figure shows several possible f(x):

An intuitive idea is that the best line is the one that is farthest from the sample points nearest to it. That sentence is a bit of a mouthful, so here are some figures to illustrate:

Method 1:

Method 2:

Which of the two methods is better? Intuitively, the larger the gap, the better: the further apart the points of the two classes are, the better. It is like judging whether a person is a man or a woman. We rarely get it wrong because the gap between the two categories is large, and that gap lets us classify more accurately. In SVM this idea is called the maximum margin, and it is one of SVM's theoretical foundations. There are many reasons to choose the function that maximizes the gap as the separating plane. For example, from a probabilistic point of view, it maximizes the confidence of the least confident point (which sounds like a mouthful); from a practical point of view, it simply works very well. I will not go into it here; take it as a given :)

The points circled in red and blue are the so-called support vectors.

The figure illustrates the gap between classes mentioned earlier. The classifier boundary is f(x); the red and blue lines (the plus plane and the minus plane) are the planes on which the support vectors lie, and the gap between the red and blue lines is the gap between the classes that we want to maximize.

The formula for the margin M is given here (it follows easily from high-school analytic geometry, or see Moore's slides referenced later):
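For reference, the usual geometric argument for the width of the gap can be written out as follows. This is the standard derivation, not a copy of the original formula, and the points x^+ and x^- and the scalar \lambda are names introduced here for illustration.

```latex
% x^- lies on the minus plane; x^+ is the point on the plus plane closest to it,
% so x^+ - x^- is parallel to the normal vector w.
\begin{aligned}
w \cdot x^{-} + b = -1, \quad w \cdot x^{+} + b &= +1, \quad x^{+} = x^{-} + \lambda w \\
w \cdot (x^{-} + \lambda w) + b = 1 \;\Longrightarrow\; -1 + \lambda \lVert w \rVert^{2} = 1 \;&\Longrightarrow\; \lambda = \frac{2}{\lVert w \rVert^{2}} \\
M = \lVert x^{+} - x^{-} \rVert = \lambda \lVert w \rVert &= \frac{2}{\lVert w \rVert}
\end{aligned}
```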

In addition, the support vectors lie on the lines w·x + b = 1 and w·x + b = -1. If we multiply by the class y that the point belongs to (remember, y is either +1 or -1), the condition for a support vector becomes y(w·x + b) = 1, which expresses the support vectors more simply.
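To make the condition y(w·x + b) = 1 concrete, here is a small sketch that picks out the support vectors of a given separating line and computes the gap width 2/||w||. The values of w and b and the four sample points are hypothetical, chosen so that the numbers come out cleanly.

```python
import math


def decision(w, b, x):
    """f(x) = w . x + b"""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def margin_width(w):
    """Width of the gap between the plus plane and the minus plane: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


def support_vectors(w, b, points, labels, tol=1e-9):
    """Points that satisfy y * (w . x + b) = 1 up to a small tolerance."""
    return [
        x for x, y in zip(points, labels)
        if abs(y * decision(w, b, x) - 1.0) <= tol
    ]


# Hypothetical separating line and training points.
w = [0.5, 0.5]
b = 0.0
points = [[1.0, 1.0], [-1.0, -1.0], [3.0, 2.0], [-2.0, -4.0]]
labels = [1, -1, 1, -1]

print(support_vectors(w, b, points, labels))  # the two points closest to the line
print(margin_width(w))                        # distance between the two margin planes
```

The two points farther from the line satisfy y(w·x + b) > 1, so they play no role in defining the gap.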

Once the support vectors are determined, the separating function is determined too; the two problems are equivalent. Another benefit of finding the support vectors is that the points lying behind them no longer need to take part in the computation. This will be explained in more detail later.

At the end of this section, we provide the expressions for optimization:
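Reconstructed from the explanation that follows (so treat it as an assumption about the intended formula rather than a quotation), the objective in its two equivalent forms is:

```latex
\max_{w,\,b} \; M = \frac{2}{\lVert w \rVert}
\quad\Longleftrightarrow\quad
\min_{w,\,b} \; \frac{1}{2} \lVert w \rVert^{2}
```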

||w|| is the 2-norm of w, the same quantity as the denominator of the expression for M above. Since M = 2/||w||, maximizing M is equivalent to minimizing ||w||. And because squaring is monotonic for non-negative values, we can square ||w|| and put the coefficient 1/2 in front. Readers familiar with this will see right away that the form is chosen to make differentiation convenient.

This formula comes with constraints. Written out in full, it is (the primal problem):
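In the standard hard-margin form, which is what the description above implies, the complete problem reads as follows. Here R is assumed to denote the number of training points, and (x_i, y_i) the i-th point and its label.

```latex
\begin{aligned}
\min_{w,\,b} \quad & \frac{1}{2} \lVert w \rVert^{2} \\
\text{s.t.} \quad & y_i \,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, R
\end{aligned}
```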

s.t. means "subject to", that is, under the constraints that follow. The abbreviation appears very often in SVM papers. This is a constrained quadratic programming (QP) problem, and it is a convex problem. Convex means that there is no local optimum other than the global one. Picture a funnel: no matter where in the funnel we first place a ball, it will eventually roll down to the outlet at the bottom, which is the global optimum. The constraints after s.t. can be seen as a convex polyhedron, and what we need to do is find the optimum inside it. I will not go into these issues here, because a whole book would not be enough to cover them. If you have questions, see Wikipedia.

2. Converting to the dual problem and solving it:

This optimization problem can be solved with the method of Lagrange multipliers, using the theory of the KKT conditions. Here we directly write down the Lagrangian objective function of the problem:
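For the hard-margin problem, the standard Lagrangian takes the form below. The multipliers \alpha_i, one per constraint, and the point count R are notation assumed here.

```latex
L(w, b, \alpha) = \frac{1}{2} \lVert w \rVert^{2}
  - \sum_{i=1}^{R} \alpha_i \bigl[ y_i \,(w \cdot x_i + b) - 1 \bigr],
\qquad \alpha_i \ge 0
```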

Solving this requires some knowledge of Lagrangian duality (pluskid also has an article dedicated to this topic) and involves a fair amount of formula derivation. If you are not interested, you can skip ahead to the conclusion marked by the blue formula. This section mainly draws on pluskid's article.

First, minimize L with respect to w and b by setting the partial derivatives of L with respect to w and b to zero, which yields the following relations for the primal problem:
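Assuming the standard hard-margin Lagrangian L(w, b, \alpha) = \frac{1}{2}\lVert w \rVert^2 - \sum_i \alpha_i [y_i (w \cdot x_i + b) - 1], the two conditions work out to:

```latex
\begin{aligned}
\frac{\partial L}{\partial w} = w - \sum_{i=1}^{R} \alpha_i y_i x_i = 0
  \quad &\Longrightarrow \quad w = \sum_{i=1}^{R} \alpha_i y_i x_i \\
\frac{\partial L}{\partial b} = - \sum_{i=1}^{R} \alpha_i y_i = 0
  \quad &\Longrightarrow \quad \sum_{i=1}^{R} \alpha_i y_i = 0
\end{aligned}
```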

Substituting these two relations back into L(w, b, α) gives the expression of the dual problem.

The new problem, together with its constraints, is (the dual problem):
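Written in the standard form, which is what results from substituting w = \sum_i \alpha_i y_i x_i and \sum_i \alpha_i y_i = 0 into the hard-margin Lagrangian, the dual problem is the following (R again denotes the number of training points, an assumed symbol):

```latex
\begin{aligned}
\max_{\alpha} \quad & \sum_{i=1}^{R} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{R} \sum_{j=1}^{R} \alpha_i \alpha_j \, y_i y_j \, (x_i \cdot x_j) \\
\text{s.t.} \quad & \alpha_i \ge 0, \qquad i = 1, \dots, R \\
& \sum_{i=1}^{R} \alpha_i y_i = 0
\end{aligned}
```

Note that the training points enter only through the inner products x_i · x_j.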

This is the formula we need to optimize. So far, we have obtained the optimization formula for the linearly separable problem.

There are many ways to solve this formula, such as SMO. Personally, I think solving such a constrained convex optimization problem is fairly independent from deriving it, so this article does not cover how to solve it at all. If I have time later, I may write a separate article about it :).
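To get a feel for the dual without a real solver, here is a toy sketch for the smallest possible case: one positive and one negative training point. It is not SMO. With only two points, the constraint that the α_i y_i sum to zero forces both multipliers to the same value a, so the standard hard-margin dual objective collapses to the one-variable function W(a) = 2a - 0.5 · a² · ||x_pos - x_neg||², which plain gradient ascent can maximize. The function name, step size and data are all hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve_two_point_dual(x_pos, x_neg, steps=200, lr=0.05):
    """Maximize the hard-margin dual for one +1 point and one -1 point.

    Both multipliers equal a, and W(a) = 2a - 0.5 * a^2 * ||x_pos - x_neg||^2.
    Gradient ascent converges only if lr * ||x_pos - x_neg||^2 < 2.
    """
    diff = [p - n for p, n in zip(x_pos, x_neg)]
    d2 = dot(diff, diff)
    a = 0.0
    for _ in range(steps):
        grad = 2.0 - a * d2            # dW/da
        a = max(0.0, a + lr * grad)    # keep the multiplier non-negative
    w = [a * d for d in diff]          # w = sum_i alpha_i * y_i * x_i
    b = 1.0 - dot(w, x_pos)            # from y * (w . x + b) = 1 at x_pos
    return a, w, b


alpha, w, b = solve_two_point_dual([1.0, 1.0], [-1.0, -1.0])
print(alpha, w, b)
```

For this data the maximum can also be found by hand: W(a) = 2a - 4a² peaks at a = 1/4, giving w = (0.5, 0.5) and b = 0, and the gap 2/||w|| equals the distance between the two points.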

3. The linearly inseparable case (soft margin):

Next, let's talk about the linearly inseparable case, because the assumption of linear separability is far too restrictive:

The figure shows a typical linearly inseparable classification problem: no straight line can divide it into two regions that each contain points of only one color.

There are two ways to build a classifier in this case. One is to use a curve to separate the points completely; a curve is a nonlinear boundary, and this is related to the kernel functions discussed later:

The other method is to keep using a straight line but without requiring it to separate the points perfectly. That is, we tolerate the misclassified points, but add a penalty function so that the misclassification is as reasonable as possible. In fact, in many cases a classification function that looks more perfect during training is not better, because some of the training data is inherently noisy: a label may have been wrong when it was assigned by hand. If the model learns these erroneous points during training, it will inevitably make mistakes the next time it meets similar cases (if a teacher gets a point wrong in a lecture and you take it as true, you will inevitably get it wrong in the exam). This process of learning the "noise" is overfitting, which is a major taboo in machine learning: we would rather learn a little less than learn something wrong. Back to the topic, how do we use a straight line to separate points that are not linearly separable:

We can add a penalty for the misclassified points. For a misclassified point, the penalty is the distance from that point to its correct position:

In the figure, the blue and red lines are the boundaries on which the support vectors lie, the green line is the decision function, and the purple lines show the distance from each misclassified point to its corresponding decision surface. In this way, we can add a penalty term to the original objective, together with the following constraints:
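In the standard soft-margin form, using the symbols ε (the penalty of each point), R (the number of points) and C that the text uses, the problem reads as follows. This is a reconstruction consistent with the description, not a copy of the original formula.

```latex
\begin{aligned}
\min_{w,\,b,\,\varepsilon} \quad & \frac{1}{2} \lVert w \rVert^{2} + C \sum_{i=1}^{R} \varepsilon_i \\
\text{s.t.} \quad & y_i \,(w \cdot x_i + b) \ge 1 - \varepsilon_i, \qquad i = 1, \dots, R \\
& \varepsilon_i \ge 0, \qquad i = 1, \dots, R
\end{aligned}
```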

In the formula, the blue part is the penalty term added on top of the linearly separable problem. When x_i is on the correct side, ε_i = 0. R is the total number of points, and C is a user-specified coefficient that controls how heavily misclassified points are penalized. When C is very large, fewer points are misclassified, but overfitting may be severe; when C is very small, many points may be misclassified, and the resulting model may not be very accurate either. So choosing C takes some skill, but in most cases it is chosen empirically.
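The role of C can be seen in a few lines of code. The sketch below assumes the standard soft-margin objective 0.5·||w||² + C·Σ ε_i, with each ε_i taken as the smallest value that satisfies the constraints, max(0, 1 - y_i(w·x_i + b)). Under that definition a point inside the gap also gets a positive ε even if it is on the correct side of the line. The line, the points and the deliberately mislabeled third point are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def slack(w, b, x, y):
    """Smallest penalty eps with y * (w . x + b) >= 1 - eps and eps >= 0."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))


def soft_margin_objective(w, b, points, labels, C):
    """0.5 * ||w||^2 + C * (sum of the penalties of all points)."""
    penalty = sum(slack(w, b, x, y) for x, y in zip(points, labels))
    return 0.5 * dot(w, w) + C * penalty


# Hypothetical line and data; the third point is labeled +1 but lies on the -1 side.
w = [0.5, 0.5]
b = 0.0
points = [[1.0, 1.0], [-1.0, -1.0], [-0.5, -0.5]]
labels = [1, -1, 1]

print([slack(w, b, x, y) for x, y in zip(points, labels)])
print(soft_margin_objective(w, b, points, labels, C=1.0))
print(soft_margin_objective(w, b, points, labels, C=10.0))
```

With a larger C the same noisy point costs ten times as much, so an optimizer would bend the line toward it, which is the overfitting risk described above.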

The next step is the same as before: solve the Lagrangian dual problem to obtain the dual expression of the primal problem:
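In the standard form, the soft-margin dual is identical to the separable one except for an upper bound on each multiplier (R is the number of training points; this is a reconstruction consistent with the explanation that follows):

```latex
\begin{aligned}
\max_{\alpha} \quad & \sum_{i=1}^{R} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{R} \sum_{j=1}^{R} \alpha_i \alpha_j \, y_i y_j \, (x_i \cdot x_j) \\
\text{s.t.} \quad & 0 \le \alpha_i \le C, \qquad i = 1, \dots, R \\
& \sum_{i=1}^{R} \alpha_i y_i = 0
\end{aligned}
```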

The blue part is where the dual problem of the linearly inseparable case differs from that of the linearly separable case: the range of α changes from [0, +∞) to [0, C]. The added penalty ε brings no extra complexity to the dual problem.

4. Kernel functions:

Earlier, when discussing the inseparable case, I mentioned that with some nonlinear methods we can obtain a curve that separates the two classes perfectly. The kernel function discussed next is one such method.

We can map the data from the original linear space into a higher-dimensional space and then separate it with a hyperplane in that higher-dimensional linear space. Here is an example of how raising the dimension of the space helps us classify (the example and figures come from pluskid's article on kernel functions):

The figure shows a typical linearly inseparable case.
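The same idea can be shown with an even smaller example than the one in the figure. The sketch below is a made-up one-dimensional dataset, not pluskid's example: points far from zero are labeled +1 and points near zero are labeled -1, so no single threshold on x separates them. After the explicit mapping phi(x) = (x, x²) into two dimensions, the line x² - 1 = 0 separates them perfectly. Note that this uses an explicit feature map rather than a kernel function; the mapping, the weights and the data are all assumptions for illustration.

```python
def separable_by_threshold(xs, ys):
    """True if a 1-D linear rule sgn(w*x + b) can classify every point correctly.

    Such a rule changes sign at most once along the x axis, so the labels,
    read in order of increasing x, may change at most once.
    Assumes all x values are distinct.
    """
    ordered = [y for _, y in sorted(zip(xs, ys))]
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return changes <= 1


def lift(x):
    """Hypothetical feature map phi(x) = (x, x^2) from 1-D into 2-D."""
    return [x, x * x]


def predict(w, b, z):
    """Linear classifier in the lifted space."""
    value = sum(wi * zi for wi, zi in zip(w, z)) + b
    return 1 if value > 0 else -1


xs = [-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0]
ys = [1, 1, -1, -1, -1, 1, 1]

# Hyperplane in the lifted space: 0 * x + 1 * x^2 - 1 = 0.
w, b = [0.0, 1.0], -1.0
predictions = [predict(w, b, lift(x)) for x in xs]

print(separable_by_threshold(xs, ys))  # no threshold works in the original space
print(predictions == ys)               # the lifted data is separated by a line
```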
