Stanford CS229 Machine Learning Course Notes (5): Support Vector Machines (SVM)


Many people consider the SVM to be the best off-the-shelf algorithm for supervised learning. I tried to learn it around this time last year, but faced with the long formulas and an awkward Chinese translation, I eventually gave up. A year later, after watching Andrew explain the SVM, I finally have a fairly complete picture of it. The overall line of reasoning is:

1. Introduce the notion of the margin and redefine the notation;
2. Define the functional margin and the geometric margin;
3. Derive the maximum margin classifier and convert its optimization problem into a convex one;
4. Bring in the necessary background on Lagrange duality;
5. Use Lagrange duality to derive the dual of the maximum margin classifier, i.e. the SVM optimization problem, in which only inner products appear;
6. From those inner products, derive the importance of kernel functions -- kernels greatly shorten the computation when the feature vectors are very high-dimensional;
7. Use regularization to handle linearly inseparable data and outliers;
8. Introduce the SMO algorithm, which solves the SVM efficiently.

Maximum margin classifier

1. Assumptions, model, and notation

For now, assume that all of the data is linearly separable (this assumption will be removed later), and let's first get an intuitive feel for the classifier:

Although points A, B, and C are all classified into the same class, we are most confident about the classification of point A, because it lies farthest from the separating line (that is, it has the largest margin). The most fundamental question for this classifier is therefore how to compute the margin. Before introducing the margin, let's look at the model and its notation:
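h_{w,b}(x) = g(w^T x + b), \qquad g(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ -1 & \text{otherwise} \end{cases}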

You can see that it resembles the logistic regression model, but g(z) is changed from the logistic function to a piecewise function. In addition, the output labels are changed from {1, 0} to {1, -1}, and θ^T x is changed to w^T x + b; these changes are made so that the formulas derived later come out more elegantly.

2. Functional margin

From the model of the maximum margin classifier we know: w^T x + b > 0 ==> y = 1; w^T x + b < 0 ==> y = -1; and w^T x + b = 0 is the separating hyperplane.
When w^T x + b > 0, the larger w^T x + b is, the more confident we are in classifying the point as 1; when w^T x + b < 0, the smaller w^T x + b is, the more confident we are in classifying it as -1. Therefore, the functional margin of each sample is defined as:
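\hat{\gamma}^{(i)} = y^{(i)} \left( w^T x^{(i)} + b \right)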

Thus, when a sample is correctly classified, its functional margin is always greater than 0, and the larger the value, the more confident the classification. Note that we can scale the parameters w and b simultaneously by any common factor: the classification result is unaffected, but the value of the functional margin changes.
The functional margin of the entire training set is defined as:
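\hat{\gamma} = \min_{i = 1, \dots, m} \hat{\gamma}^{(i)}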

3. Geometric margin


The geometric margin is the perpendicular distance from a data point to the separating hyperplane. The derivation is omitted here; one key step is that the normal vector of the separating hyperplane w^T x + b = 0 is w (worth proving for yourself). The geometric margin of each sample is:
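\gamma^{(i)} = y^{(i)} \left( \left( \frac{w}{\|w\|} \right)^T x^{(i)} + \frac{b}{\|w\|} \right)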

From this formula we immediately get the relationship between the geometric margin and the functional margin:
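\gamma^{(i)} = \frac{\hat{\gamma}^{(i)}}{\|w\|}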

Unlike the functional margin, the geometric margin does not change even when w and b are both scaled by the same factor. This point is important and will be used in the derivation that follows.
The geometric margin of the entire training set is:
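\gamma = \min_{i = 1, \dots, m} \gamma^{(i)}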

4. Optimizing the maximum margin classifier

Our goal is to find the separating hyperplane that maximizes the geometric margin. Written as an optimization problem, this is:
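\max_{\hat{\gamma}, w, b} \ \frac{\hat{\gamma}}{\|w\|} \qquad \text{s.t.} \quad y^{(i)} (w^T x^{(i)} + b) \ge \hat{\gamma}, \quad i = 1, \dots, m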

Because the ||w|| term makes this problem non-convex and hard to solve, we need some transformations. First, using the earlier conclusion that scaling w and b simultaneously does not change the geometric margin, we can fix the functional margin to be 1.

The optimization objective then becomes maximizing 1/||w||, which is equivalent to minimizing ||w||^2, so the problem is converted into a convex optimization problem:
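\min_{w, b} \ \frac{1}{2} \|w\|^2 \qquad \text{s.t.} \quad y^{(i)} (w^T x^{(i)} + b) \ge 1, \quad i = 1, \dots, m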

This can be solved with off-the-shelf quadratic programming (QP) software.
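For illustration only (not from the original notes), here is a minimal sketch of feeding this QP to an off-the-shelf solver. It assumes the cvxopt package, stacks w and b into one variable vector, and uses a made-up toy dataset:

    # Minimal sketch: solve the hard-margin primal QP  min 1/2 ||w||^2
    # subject to y_i (w^T x_i + b) >= 1, using cvxopt (assumed installed).
    import numpy as np
    from cvxopt import matrix, solvers

    # Toy, linearly separable data (illustrative only)
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    m, n = X.shape

    # Optimization variable z = [w_1, ..., w_n, b]
    P = np.zeros((n + 1, n + 1))
    P[:n, :n] = np.eye(n)          # quadratic term acts on w only
    q = np.zeros((n + 1, 1))

    # Constraints -y_i (w^T x_i + b) <= -1, written as G z <= h
    G = -y[:, None] * np.hstack([X, np.ones((m, 1))])
    h = -np.ones((m, 1))

    solvers.options['show_progress'] = False
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    w, b = z[:n], z[n]
    print("w =", w, "b =", b)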

Lagrange duality

This part is mainly material from optimization theory. The core idea is that the original (primal) optimization problem can be transformed into its dual problem via Lagrange duality (the point being that the dual can be more efficient to solve). I can only barely follow the derivation, so I will simply list the formulas:

1. The original optimization problem
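\min_{w} \ f(w) \qquad \text{s.t.} \quad g_i(w) \le 0, \ i = 1, \dots, k; \qquad h_i(w) = 0, \ i = 1, \dots, l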


This problem corresponds to the generalized Lagrangian:
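\mathcal{L}(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w)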

Rewriting the primal problem in an equivalent form:
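\theta_{\mathcal{P}}(w) = \max_{\alpha, \beta : \alpha_i \ge 0} \mathcal{L}(w, \alpha, \beta), \qquad p^* = \min_{w} \theta_{\mathcal{P}}(w) = \min_{w} \max_{\alpha, \beta : \alpha_i \ge 0} \mathcal{L}(w, \alpha, \beta)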

2. The dual optimization problem


Compared with the primal optimization problem, the dual problem simply swaps the order of max and min, and we have:
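\theta_{\mathcal{D}}(\alpha, \beta) = \min_{w} \mathcal{L}(w, \alpha, \beta), \qquad d^* = \max_{\alpha, \beta : \alpha_i \ge 0} \min_{w} \mathcal{L}(w, \alpha, \beta) \ \le \ \min_{w} \max_{\alpha, \beta : \alpha_i \ge 0} \mathcal{L}(w, \alpha, \beta) = p^*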

3. KKT conditions

For d* = p* to hold, so that the primal problem can be solved by solving the dual problem, the following assumptions must be satisfied:
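f and the g_i are convex, the h_i are affine, and the constraints g_i are strictly feasible (there exists some w with g_i(w) < 0 for all i).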

Under these assumptions, there must exist w*, α*, β* such that w* is a solution of the primal problem and α*, β* are a solution of the dual problem. In addition, several useful relations hold, known as the Karush-Kuhn-Tucker (KKT) conditions:
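\frac{\partial}{\partial w_i} \mathcal{L}(w^*, \alpha^*, \beta^*) = 0, \quad i = 1, \dots, n

\frac{\partial}{\partial \beta_i} \mathcal{L}(w^*, \alpha^*, \beta^*) = 0, \quad i = 1, \dots, l

\alpha_i^* g_i(w^*) = 0, \quad i = 1, \dots, k

g_i(w^*) \le 0, \quad i = 1, \dots, k

\alpha_i^* \ge 0, \quad i = 1, \dots, k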

Here α_i* g_i(w*) = 0 is known as the KKT dual complementarity condition. It implies that α_i > 0 is possible only when g_i(w) = 0, which is why an SVM has very few support vectors; it is also used in the convergence check of the SMO algorithm.

SVM

Let's go back to the maximum margin classifier discussed earlier:

Using Lagrange duality to derive the dual of the maximum margin classifier's optimization problem gives us the SVM. Let's work through it.
First, the constraints are rewritten in the standard form required by the generalized Lagrangian:
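g_i(w) = -y^{(i)} (w^T x^{(i)} + b) + 1 \le 0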

Recall the KKT dual complementarity condition: α_i > 0 is possible only when g_i(w) = 0, and g_i(w) = 0 corresponds exactly to the points closest to the separating hyperplane (review the derivation of the maximum margin classifier's optimization problem), that is, the three points that the dashed lines pass through:

These three points are the support vectors of the SVM.
Now construct the Lagrangian for the current optimization problem:
\mathcal{L}(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)} (w^T x^{(i)} + b) - 1 \right] \qquad (1)
Next, to solve the dual problem, we first fix α and minimize L(w, b, α) with respect to w and b (this gives θ_D), by setting the partial derivatives to 0:
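\frac{\partial}{\partial w} \mathcal{L}(w, b, \alpha) = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} \qquad (2)

\frac{\partial}{\partial b} \mathcal{L}(w, b, \alpha) = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y^{(i)} = 0 \qquad (3)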

Substituting (2) and (3) into (1), we get:
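\mathcal{L}(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle

(the term -b \sum_i \alpha_i y^{(i)} that appears during the substitution vanishes because of (3))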

Finally, the optimization problem for SVM is:
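\max_{\alpha} \ W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle

\text{s.t.} \quad \alpha_i \ge 0, \ i = 1, \dots, m; \qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0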

After obtaining α, we can recover the model parameters w and b from it:
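w^* = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}, \qquad b^* = -\frac{\max_{i : y^{(i)} = -1} {w^*}^T x^{(i)} + \min_{i : y^{(i)} = 1} {w^*}^T x^{(i)}}{2}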

The model's output can also be computed directly from α as a linear combination of inner products:
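w^T x + b = \sum_{i=1}^{m} \alpha_i y^{(i)} \langle x^{(i)}, x \rangle + b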

So the question is: why go to the trouble of converting the already clear maximum margin classification problem into the SVM dual via Lagrange duality? One big reason is the kernel function, which comes next.

Kernels

Sometimes we want to map low-dimensional feature vectors into a higher-dimensional space to increase the accuracy of the classifier, for example by adding quadratic and cubic terms to the model:
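For a one-dimensional input x, the feature mapping would then be:

\phi(x) = \begin{bmatrix} x \\ x^2 \\ x^3 \end{bmatrix}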

Since the algorithm can be expressed entirely in terms of inner products <x, z> of feature vectors, we only need to replace the original inner product with the inner product of the mapped features, <φ(x), φ(z)>. Given a feature mapping φ, we define the kernel as:
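K(x, z) = \phi(x)^T \phi(z)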

When the dimension of the mapped feature space is very high (even infinite), the kernel function can greatly reduce the amount of computation. For example:
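K(x, z) = (x^T z)^2 = \left( \sum_{i=1}^{n} x_i z_i \right) \left( \sum_{j=1}^{n} x_j z_j \right) = \sum_{i=1}^{n} \sum_{j=1}^{n} (x_i x_j)(z_i z_j)

so K(x, z) = \phi(x)^T \phi(z), where \phi(x) is the n^2-dimensional vector whose entries are all the products x_i x_j.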

From this we can see that computing <φ(x), φ(z)> directly takes O(n^2) time, while computing K(x, z) takes only O(n) time; when the dimension is very high, this advantage becomes very significant! Next, let's look at a few common kernels:
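For example, the polynomial kernel

K(x, z) = (x^T z + c)^d

corresponds to a feature space containing all monomials of x_1, ..., x_n up to degree d, yet still costs only O(n) to evaluate.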

The feature map φ can even map into an infinite-dimensional space, as with the Gaussian kernel:
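K(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right)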

Of course, not every function you write down can serve as a kernel. The condition for a valid kernel is:
for any finite sample set {x^(1), ..., x^(m)} (m < ∞), the corresponding kernel matrix, defined by K_ij = K(x^(i), x^(j)), must be symmetric positive semi-definite.
In addition, kernels are not specific to the SVM: as long as the objective function can be expressed entirely in terms of inner products, a kernel can be used to efficiently compute classification results in a high-dimensional feature space (Andrew mentioned in lecture that logistic regression can also be written in this form). And kernels are powerful; for example, Andrew mentioned two applications, handwritten digit recognition and protein classification, where applying the Gaussian kernel or (x^T z + c)^d in an SVM achieves results comparable to artificial neural networks.

Kernels also have another interpretation: data that is linearly inseparable in the low-dimensional space can become separable by a linear hyperplane after being mapped to a high-dimensional space by the kernel, which in effect produces a nonlinear decision boundary. But sometimes there are outliers, which we handle by adding a penalty term.

L1 norm soft margin SVM

So far we have treated the dataset as linearly separable, which is not necessarily true in practice. In addition, when there are outliers in the dataset, the performance of the classifier can be severely affected, as shown in the figure:

To solve these two problems, we adjust the optimization problem to:
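\min_{w, b, \xi} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i

\text{s.t.} \quad y^{(i)} (w^T x^{(i)} + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, m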

Note: when ξ_i > 1, the corresponding sample is allowed to be misclassified, so we add the ξ_i terms to the objective function as a penalty.
Using Lagrange duality again, the dual problem becomes:
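\max_{\alpha} \ W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle

\text{s.t.} \quad 0 \le \alpha_i \le C, \ i = 1, \dots, m; \qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0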

Surprisingly, after adding the L1 regularization term, the only change in the dual problem is the extra constraint α_i ≤ C. Note that the computation of b* needs to change (see Platt's paper).
The KKT dual complementarity conditions become:
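\alpha_i = 0 \;\Rightarrow\; y^{(i)} (w^T x^{(i)} + b) \ge 1

\alpha_i = C \;\Rightarrow\; y^{(i)} (w^T x^{(i)} + b) \le 1

0 < \alpha_i < C \;\Rightarrow\; y^{(i)} (w^T x^{(i)} + b) = 1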

Now all that remains is to solve this dual problem.

SMO algorithm

First, let's understand the coordinate ascent algorithm:

Its core idea is to maximize the function along one coordinate at a time. It may therefore need more iterations, but the computation inside each iteration is relatively simple.
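For illustration only (not from the original notes), here is a tiny sketch of coordinate ascent on a made-up concave quadratic objective; the function and its closed-form coordinate updates are purely illustrative:

    # Minimal sketch of coordinate ascent: maximize the concave quadratic
    # W(a1, a2) = -(a1 - 1)^2 - (a2 - 2)^2 + a1 * a2
    # by optimizing one coordinate at a time while holding the other fixed.
    def coordinate_ascent(n_iters=50):
        a1, a2 = 0.0, 0.0
        for _ in range(n_iters):
            # argmax over a1 with a2 fixed: dW/da1 = -2(a1 - 1) + a2 = 0
            a1 = 1.0 + a2 / 2.0
            # argmax over a2 with a1 fixed: dW/da2 = -2(a2 - 2) + a1 = 0
            a2 = 2.0 + a1 / 2.0
        return a1, a2

    print(coordinate_ascent())  # converges to the global maximum (8/3, 10/3)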
Now consider our dual problem:

For this problem, we cannot use coordinate ascent directly:
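The equality constraint \sum_{i=1}^{m} \alpha_i y^{(i)} = 0 means that if we fix all but one of the α's, the remaining one is completely determined, e.g. \alpha_1 y^{(1)} = -\sum_{i=2}^{m} \alpha_i y^{(i)}, so no single coordinate can be updated on its own without violating the constraint.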

So instead, we update two coordinates at a time:
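Repeat until convergence:
1. Select some pair α_i and α_j to update next (using a heuristic that tries to pick the pair that allows the biggest progress towards the global maximum);
2. Re-optimize W(α) with respect to α_i and α_j, holding all the other α_k (k ≠ i, j) fixed.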

Convergence is judged by whether the KKT conditions are satisfied to within a tolerance parameter. (See Platt, "Fast Training of Support Vector Machines Using Sequential Minimal Optimization".)
At each step the maximization is carried out as follows:
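\alpha_1 y^{(1)} + \alpha_2 y^{(2)} = \zeta \;\Rightarrow\; \alpha_1 = (\zeta - \alpha_2 y^{(2)}) y^{(1)}

so W(\alpha_1, \alpha_2, \dots, \alpha_m) = W((\zeta - \alpha_2 y^{(2)}) y^{(1)}, \alpha_2, \dots, \alpha_m) becomes a function of α_2 alone.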

We find that W is a quadratic function of α_2. Finding the maximum of a quadratic by setting its derivative to zero is very simple, and after taking the upper and lower bounds [L, H] into account, the new value of α_2 is:
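\alpha_2^{\text{new, clipped}} = \begin{cases} H & \text{if } \alpha_2^{\text{new}} > H \\ \alpha_2^{\text{new}} & \text{if } L \le \alpha_2^{\text{new}} \le H \\ L & \text{if } \alpha_2^{\text{new}} < L \end{cases}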

The optimal value of α_1 can then be computed from α_2.
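\alpha_1^{\text{new}} = (\zeta - \alpha_2^{\text{new, clipped}} \, y^{(2)}) \, y^{(1)}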
