Machine Learning (3): Support Vector Machines (1)

Tags: SVM, RBF kernel

Summary:

This article gives a brief introduction to support vector machines and then looks in more detail at the linearly separable support vector classifier, the linear (soft-margin) support vector classifier, and kernel functions.

I have recently been reading the book Machine Learning in Action, because I really want to dig deeper into machine learning algorithms and also want to learn Python, so on a friend's recommendation I chose this book. Today's topic is the support vector machine (SVM), a tool that is famous in both pattern recognition and machine learning.

The support vector machine (SVM) is a tool that solves machine learning problems with optimization methods. Proposed by V. Vapnik and others, it has made great progress in both theory and algorithms in recent years, and has become a powerful means of overcoming the "curse of dimensionality" and over-fitting.

1. Separating data with the maximum margin

We know that the purpose of classification is to learn a classifier that maps the data in a dataset to given classes, so that the categories of unknown data can be predicted.

For a two-dimensional dataset, the boundary that separates the data is a line; for three-dimensional data it is a plane, and in higher dimensions it is a hyperplane. The decision boundaries of a classifier are collectively called separating hyperplanes: all data on one side of the hyperplane belong to one category, and all data on the other side belong to the other category.

So for classification it is very important to construct the classifier reasonably: how should it be constructed so that the classification result is as credible as possible? An example illustrates the point:

For the two classes of points in the left-hand figure, if you were to draw a line to separate them, where would you draw it? Many lines would work, so which one is best? A practical analogy helps: suppose the two classes of points are two residential areas, and a road is to be built between them. How should it be routed? Most people would choose the red line, because it balances the distance to both sides; it is a compromise that is the fairest to every point.

The example shows that we look for a hyperplane that separates the different categories of samples, and for the classification to be reliable we want the margin to be as large as possible (the margin is the distance from the points to the separating surface). Then, even if some points are mislabeled or the classifier is trained on limited data, the classifier stays as robust as possible. In the example the hyperplane lies in a two-dimensional plane and can be picked by eye; in three or more dimensions the eye is powerless, but we can still find the hyperplane with a computer by building the corresponding mathematical model and solving it.

This idea of searching for a separating hyperplane is the idea of the SVM, which we now study.

2. Support Vector Machine

The SVM used for classification is essentially a two-class classification model. It is a supervised learning method whose aim is to find, in a sample set containing both positive and negative examples, a hyperplane that splits the positive and negative examples and makes the margin between them as large as possible. This makes the classification result more credible and gives better predictive ability on new, unseen samples. To maximize the margin between the categories we do not need to consider all the points; we only need the points closest to the separating hyperplane to be as far from it as possible. These closest points are the support vectors.

Here's how SVM works:

First, we are given n training samples {(x1, y1), (x2, y2), ..., (xn, yn)}, where each xi is a d-dimensional vector (each sample has d attributes) and yi ∈ {-1, +1} is its category. We look for a real-valued function g(x) and infer the label y of any sample x with the classification function f(x) = sgn(g(x)).

(1) Hard-margin hyperplane

A linearly separable SVM uses the n samples above to learn a linear classifier, that is, a hyperplane with classifier f(x) = sgn(w·x + b). Linear separability means that when w·x + b > 0 we have f(x) = +1, when w·x + b < 0 we have f(x) = -1, and w·x + b = 0 is the hyperplane we are looking for; in this case it is a hard-margin hyperplane. So we want to find this hyperplane, and from the previous analysis it must separate the samples into the two classes while keeping the nearest points of the two classes as far from it as possible. Let us put this into equations:

As shown, we look for the hyperplane that separates these two classes with the largest gap. Making the distance between the two categories as large as possible can be converted into maximizing the distance from the nearest points of each category (the support vectors) to the separating hyperplane.

First, find two hyperplanes that are parallel to the separating hyperplane and at equal distances from it: w·x + b = -1 and w·x + b = +1, and require that no sample points lie between them. The problem then becomes maximizing the distance between these two hyperplanes. In the two-dimensional case that distance is d = |1 + 1| / sqrt(w1^2 + w2^2) = 2/||w||, so maximizing 2/||w|| can be converted into minimizing ||w||. Finally, the requirement that no sample point lies between the two hyperplanes gives the constraints: any positive sample (yi = +1) must lie on the far side of w·x + b = +1, i.e. w·xi + b >= +1, and likewise any negative sample (yi = -1) must lie on the far side of w·x + b = -1, i.e. w·xi + b <= -1. The two can be merged into yi(w·xi + b) >= 1.

So the problem of finding the best hyperplane can be transformed into a quadratic programming problem:

$$
\min_{w,\,b}\; \frac{1}{2}\|w\|^2 \qquad \text{s.t.}\;\; y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \dots, n
$$

The problem has a convex objective function and linear constraints, so the Lagrange function can be introduced (with one multiplier αi ≥ 0 per constraint):
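$$
L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right], \qquad \alpha_i \ge 0
$$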

Then, following the Wolfe dual, the partial derivatives of the Lagrangian with respect to the primal variables are set to 0:
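$$
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0
$$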

Substituting these back, the Lagrangian is transformed into the Lagrange dual problem of the original problem:
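$$
\max_{\alpha}\; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
\qquad \text{s.t.}\;\; \sum_{i=1}^{n} \alpha_i y_i = 0,\;\; \alpha_i \ge 0,\;\; i = 1, \dots, n
$$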

Solving this dual problem gives the optimal multipliers α* = (α1*, ..., αn*), from which w* is computed:
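$$
w^* = \sum_{i=1}^{n} \alpha_i^* y_i x_i
$$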

From the KKT complementary slackness conditions we have:
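$$
\alpha_i^* \left[ y_i (w^* \cdot x_i + b^*) - 1 \right] = 0, \qquad i = 1, \dots, n
$$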

Only when xi is a support vector is the corresponding αi* positive; otherwise it is 0. Selecting any positive component αj* of α*, b* can be computed:
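$$
b^* = y_j - \sum_{i=1}^{n} \alpha_i^* y_i (x_i \cdot x_j), \qquad \text{for any } j \text{ with } \alpha_j^* > 0
$$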

Therefore the separating hyperplane (w*·x) + b* = 0 can be constructed, and the decision function is obtained:
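$$
f(x) = \operatorname{sgn}\!\left( (w^* \cdot x) + b^* \right)
$$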


Substituting w* = Σ αi* yi xi, the classification function is then obtained:
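$$
f(x) = \operatorname{sgn}\!\left( \sum_{i=1}^{n} \alpha_i^* y_i (x_i \cdot x) + b^* \right)
$$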


This function classifies samples of unknown category. By the KKT conditions, only when xi is a support vector is the corresponding αi* positive; otherwise it is 0. So to evaluate it we only need the inner products between the new sample and the support vectors.
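To make that last point concrete, here is a minimal sketch of the prediction step, assuming the dual solution is already available; the names alphas, sv_x, sv_y, and b are illustrative, not from the book:

```python
import numpy as np

def svm_predict(x, alphas, sv_x, sv_y, b):
    """Hard-margin SVM prediction f(x) = sgn(sum_i alpha_i * y_i * <x_i, x> + b).

    alphas : (m,) positive multipliers of the m support vectors
    sv_x   : (m, d) support vectors
    sv_y   : (m,) their labels in {-1, +1}
    b      : intercept b*
    """
    # Only inner products between x and the support vectors are needed.
    g = np.sum(alphas * sv_y * (sv_x @ x)) + b
    return 1 if g >= 0 else -1
```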

(2) Soft-margin hyperplane

The analysis above assumed the sample points are linearly separable and looked for a hard-margin hyperplane: we first found two classification boundaries and assumed all sample points lie outside them. Reality is not always so accommodating; the following situation is bound to come up:

In this picture a few points of the positive and negative classes have strayed into the other side's territory, so no straight line can separate them. How do we compromise? For data points that deviate somewhat from the hyperplane we can still use a separating hyperplane, but the margin must be "softened", constructing a soft-margin hyperplane. In short, sample points are allowed between the two classification boundaries, and such points are called boundary support vectors. The resulting classifier is the linear support vector classifier, as shown:

The softening method is to introduce slack variables:
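$$
\xi_i \ge 0, \qquad i = 1, \dots, n
$$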

Thus the constraints of the original problem become:
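$$
y_i (w \cdot x_i + b) \ge 1 - \xi_i, \qquad i = 1, \dots, n
$$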

The slack variables allow some sample points to appear in the other class's region; with a large enough slack variable a sample point always satisfies the constraint above, so the slack must also be kept from growing too large. For this reason we readjust the objective function, introduce the penalty factor C, and penalize the outliers; the quadratic programming problem then becomes:
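$$
\min_{w,\,b,\,\xi}\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
\qquad \text{s.t.}\;\; y_i (w \cdot x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0,\;\; i = 1, \dots, n
$$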

where C > 0.

The corresponding Lagrangian function is:
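$$
L(w, b, \xi, \alpha, \mu) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
- \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 + \xi_i \right]
- \sum_{i=1}^{n} \mu_i \xi_i
$$

Here the αi ≥ 0 are the multipliers for the classification constraints and the μi ≥ 0 are the multipliers for the constraints ξi ≥ 0.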

The dual problem of the original problem is:
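$$
\max_{\alpha}\; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
\qquad \text{s.t.}\;\; \sum_{i=1}^{n} \alpha_i y_i = 0,\;\; 0 \le \alpha_i \le C,\;\; i = 1, \dots, n
$$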

We find that, compared with the linearly separable model, the only extra constraint is αi <= C; w* and b* are computed in the same way:
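$$
w^* = \sum_{i=1}^{n} \alpha_i^* y_i x_i, \qquad
b^* = y_j - \sum_{i=1}^{n} \alpha_i^* y_i (x_i \cdot x_j) \quad \text{for any } j \text{ with } 0 < \alpha_j^* < C
$$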


The classification function is:
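$$
f(x) = \operatorname{sgn}\!\left( \sum_{i=1}^{n} \alpha_i^* y_i (x_i \cdot x) + b^* \right)
$$

which is identical in form to the hard-margin case.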

When C is infinitely large, this reduces to the linearly separable (hard-margin) case.

(3) Kernel functions

The above is the linear support vector classifier, which tolerates a certain number of outliers. If the sample points really cannot be separated by a line at all, they must be handled with a kernel function.

T. M. Cover's theorem on the separability of patterns says that a complex pattern-classification problem cast nonlinearly into a high-dimensional space is more likely to be linearly separable than in a low-dimensional space. The kernel method embeds the original data into a new feature space (a Hilbert space) through a nonlinear feature map, looks for a linear pattern of the data in that feature space, and then uses a suitable kernel function to compute the required inner products directly from the inputs. Thanks to the dual formulation, all the information the algorithm needs appears as inner products of data points in the feature space; when the feature-space dimension is very high, computing the map explicitly would hurt efficiency, so computing the inner product as a direct function of the input features is more efficient and reduces the time complexity of the algorithm.

The commonly used kernel functions are:
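Linear kernel: K(x, z) = x · z

Polynomial kernel: K(x, z) = (x · z + 1)^d

Gaussian (RBF) kernel: K(x, z) = exp(−||x − z||^2 / (2σ^2))

Sigmoid kernel: K(x, z) = tanh(κ(x · z) + θ)

(Here z denotes a second input, and d, σ, κ, θ are kernel parameters.)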

Based on the above analysis, for sample points that are not linearly separable the problem is transformed into finding a separating hyperplane w·Φ(x) + b = 0 in the Hilbert space, where Φ is the nonlinear feature map; this is correspondingly converted into the quadratic programming problem:
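$$
\max_{\alpha}\; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
\qquad \text{s.t.}\;\; \sum_{i=1}^{n} \alpha_i y_i = 0,\;\; 0 \le \alpha_i \le C,\;\; i = 1, \dots, n
$$

where K(xi, xj) = Φ(xi) · Φ(xj).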

where the kernel function K satisfies Mercer's condition:
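K must be symmetric, and for any finite set of points x1, ..., xm the Gram matrix [K(xi, xj)] of size m × m must be positive semi-definite.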

Using the RBF (Gaussian) kernel, the Lagrange dual problem becomes:
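$$
\max_{\alpha}\; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j
\exp\!\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right)
\qquad \text{s.t.}\;\; \sum_{i=1}^{n} \alpha_i y_i = 0,\;\; 0 \le \alpha_i \le C
$$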


Solving it and computing b* as before, the classification function is obtained:
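$$
f(x) = \operatorname{sgn}\!\left( \sum_{i=1}^{n} \alpha_i^* y_i K(x_i, x) + b^* \right)
$$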

Then linearly inseparable problems can be classified accordingly. Because most of the αi* are 0, we only need to compute the kernel function between the new sample and a small number of training samples (the support vectors), sum, and take the sign to classify the new sample. Different kernel functions measure the similarity between samples differently and therefore classify the samples differently.
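As a quick practical illustration (using scikit-learn's SVC rather than the from-scratch SMO code in Machine Learning in Action; the dataset and parameter values are only an example), an RBF-kernel SVM handles a linearly inseparable problem like this:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: no straight line can separate these classes.
X, y = make_circles(n_samples=200, noise=0.1, factor=0.3, random_state=42)

# RBF-kernel SVM: C is the penalty factor, gamma plays the role of 1/(2*sigma^2).
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

print("support vectors per class:", clf.n_support_)
print("training accuracy:", clf.score(X, y))
print("prediction for a new point:", clf.predict([[0.0, 0.1]]))
```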

At this point the basics of support vector machines are nearly complete; some details will be studied further later.
