Support Vector Machine (SVM)

Introduction

A support vector machine (SVM) is the maximum-margin linear classifier defined in feature space; for nonlinear data, the kernel trick is used to turn it into what is essentially a nonlinear classifier. This article is divided into two parts: 1) the maximum-margin interpretation of the separating hyperplane, which can be converted into a convex quadratic programming problem (including the SMO solution algorithm); 2) the hinge loss interpretation, in which empirical risk minimization with the hinge loss yields a loss function whose minimum gives the classifier. The two interpretations are discussed in the next two sections.

1. Maximum-Margin Interpretation

1.1 Linear separability

Given a set of N data points $\left \{(x_i,y_i) \right \}_{i=1}^N$, where $x_i \in \mathbb{R}^n$ is a sample point and $y_i \in \{+1,-1\}$ is its class label, each pair $(x_i,y_i)$ forms a training sample. Linearly separable means that there exists a hyperplane $w \cdot x + b = 0$ that separates the two classes, so that the data points on one side of the hyperplane all have label $y = +1$ and those on the other side all have label $y = -1$.

Note that when the training data is linearly separable, there are infinitely many hyperplanes that separate the two classes correctly. The idea of SVM is to pick the optimal separating hyperplane by maximizing the geometric margin, and this hyperplane is unique.

1.2 Maximizing the geometric margin

In general, the distance of a training sample from the separating hyperplane indicates how confident the classification of that sample is: in the usual illustration, a point close to the hyperplane is classified with much less certainty than a point far away from it.

This confidence can be expressed by the distance from the sample to the hyperplane, which is just the ordinary geometric distance from a point to a plane. The geometric distance of a sample point $x_i$ to the separating hyperplane $w \cdot x + b = 0$ is:

\[\gamma_i = \frac{|w \cdot x_i + b|}{\|w\|}\]

Here the numerator is an absolute value and the denominator is the $L_2$ norm, $\|w\| = \sqrt{w_1^2 + w_2^2 + \cdots + w_n^2}$. Observe further that for a positive training example $w \cdot x_i + b > 0$ and the label is $y_i = +1$; conversely, for a negative example $w \cdot x_i + b < 0$ and $y_i = -1$. The absolute value can therefore be dropped, giving the usual expression for the geometric margin:

\[\gamma_i = y_i \left(\frac{w}{\|w\|} \cdot x_i + \frac{b}{\|w\|}\right)\]

The geometric margin of the entire training set is the smallest geometric margin over all N training samples:

\[\gamma = \min_{i} \gamma_i\]
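
As a concrete illustration (the code, the toy data and the hyperplane $(w, b)$ are my own assumptions, not from the original article), a few lines of NumPy compute the per-sample geometric margins and the margin of the data set:

```python
import numpy as np

# Toy data: rows of X are samples x_i, y holds labels in {+1, -1}.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])

# An assumed separating hyperplane w . x + b = 0 (for illustration only).
w = np.array([0.5, 0.5])
b = -2.0

functional_margins = y * (X @ w + b)                  # y_i (w . x_i + b)
geometric_margins = functional_margins / np.linalg.norm(w)
gamma = geometric_margins.min()                       # margin of the whole data set

print(geometric_margins, gamma)
```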

Now that we have the geometric margin of the training data, the core idea of SVM (find the separating hyperplane with maximum margin) leads to the following constrained optimization problem:

\begin{aligned}
&\max_{w,b} \ \gamma \\
&\text{s.t.} \ \ y_i\left(\frac{w}{\|w\|} \cdot x_i + \frac{b}{\|w\|}\right) \ge \gamma, \ \ i = 1,\dots,N
\end{aligned}

The constraint says that the geometric margin of every sample in the training set is at least the minimum geometric margin $\gamma$. At this point it is convenient to introduce a related concept, the functional margin:

\[\widehat{\gamma}_i = y_i \left(w \cdot x_i + b\right)\]

The functional margin of the entire data set is:

\[\widehat{\gamma} = \min_{i=1,\dots,N} \widehat{\gamma}_i\]

Its relationship to the geometric margin is given by the following two formulas:

\[\gamma_i = \frac{\widehat{\gamma}_i}{\|w\|}, \qquad \gamma = \frac{\widehat{\gamma}}{\|w\|}\]

Using the functional margin, the SVM optimization problem can be rewritten as:

\begin{aligned}
&\max_{w,b} \ \frac{\widehat{\gamma}}{\|w\|} \\
&\text{s.t.} \ \ y_i\left(w \cdot x_i + b\right) \ge \widehat{\gamma}, \ \ i = 1,\dots,N
\end{aligned}

The functional margin is introduced because it can be scaled freely. For the separating hyperplane $w \cdot x + b = 0$, scaling both sides by $\lambda$ gives $\lambda w \cdot x + \lambda b = 0$, which is the same plane. For a point $x_i$, the functional margin formula

\[\widehat{\gamma}_i = y_i \left(w \cdot x_i + b\right)\]

Then you get:

\[\lambda \widehat{\gamma}_i = y_i \left(\lambda w \cdot x_i + \lambda b\right)\]

That is, the functional margin is also scaled by $\lambda$, while the hyperplane and the geometric margin do not change at all. The scaling therefore has no effect on the optimization, because the objective is the geometric margin: rescaling the hyperplane parameters cannot affect the final result. We are thus free to fix the functional margin of the data set to $\widehat{\gamma} = 1$. In the end only the following remains to be maximized:

\begin{aligned}
&\max_{w,b} \ \frac{1}{\|w\|} \\
&\text{s.t.} \ \ y_i\left(w \cdot x_i + b\right) \ge 1, \ \ i = 1,\dots,N
\end{aligned}

Changing the maximization into a minimization gives a convex quadratic programming problem with inequality constraints:

\begin{aligned}
&\min_{w,b} \ \frac{1}{2}\|w\|^2 \\
&\text{s.t.} \ \ y_i\left(w \cdot x_i + b\right) \ge 1, \ \ i = 1,\dots,N
\end{aligned}
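
To make this primal problem concrete, here is a minimal sketch (mine, not from the original article) that hands the quadratic program directly to the cvxpy modeling library; the toy data `X`, `y` are an assumed linearly separable set:

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
n_samples, n_features = X.shape

w = cp.Variable(n_features)
b = cp.Variable()

# min (1/2)||w||^2   s.t.   y_i (w . x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(objective, constraints)
problem.solve()

print("w* =", w.value, "b* =", b.value)
```

For this particular toy set the solver returns approximately $w^* = (0.5, 0.5)$ and $b^* = -2$; the constraints that hold with equality identify the support vectors.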

This gives the learning algorithm of the SVM in the linearly separable case, Algorithm 1.1:

Input: a linearly separable data set $\left \{(x_i,y_i) \right \}_{i=1}^N$.

(1) Construct the constrained optimization problem:

\begin{aligned}
&\min_{w,b} \ \frac{1}{2}\|w\|^2 \\
&\text{s.t.} \ \ y_i\left(w \cdot x_i + b\right) \ge 1, \ \ i = 1,\dots,N
\end{aligned}

(2) Solve it to obtain $w^*, b^*$; the separating hyperplane is $w^* \cdot x + b^* = 0$.

(3) For a new observation $x$, determine its class $y$ according to $f(x) = \operatorname{sign}(w^* \cdot x + b^*)$.

One of the key notions in SVM is the support vector. Which points are the support vectors?

According to the constraints in Algorithm 1.1, every point in the data set satisfies $y_i (w \cdot x_i + b) - 1 \ge 0$; a point $x_i$ is a support vector when the constraint holds with equality, i.e. when $y_i (w \cdot x_i + b) = 1$. Support vectors also satisfy the following properties (see the sketch after this list):

    • Their distance to the separating hyperplane is $1/\|w\|$, because we fixed the functional margin $\widehat{\gamma}$ to 1 (which, as shown above, does not affect the result);
    • Support vectors lie on one of the two lines $w \cdot x + b = \pm 1$, which one depending on their class;
    • Only the support vectors determine the hyperplane; that is, the SVM solution is decided by a small number of "important" support vector samples, which also makes SVM quite sensitive to outliers.
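
As an illustration (the code is mine, not the original author's), scikit-learn's `SVC` with a linear kernel and a very large `C` approximates the hard-margin SVM described above and exposes the support vectors directly:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])

# A very large C approximates the hard-margin SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("support vectors:\n", clf.support_vectors_)
# Support vectors satisfy y_i (w . x_i + b) = 1 (up to numerical tolerance).
print("functional margins of SVs:", y[clf.support_] * (X[clf.support_] @ w + b))
```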

1.3 Conversion to the dual problem

This section explains how to solve the quadratic programming problem with inequality constraints in Algorithm 1.1, which is a convex optimization problem. Two background references are useful here; once they are understood, the rest of the SVM derivation is straightforward, and many conclusions in this section come from them:

    • The Lagrange multiplier method and the KKT conditions in constrained optimization
    • Lagrange duality

The dual problem and the KKT conditions are really important here; if you are not familiar with them, read the two articles above first. The advantages of introducing the dual in SVM are:

    • The dual problem has good properties:
    1. The dual problem is always a convex optimization problem.
    2. The dual problem gives a lower bound on the primal problem, and when certain conditions are satisfied the two are equivalent.
    • The kernel function can be introduced naturally.

Now return to the earlier optimization objective, i.e. the primal problem:

\begin{aligned}
&\min_{w,b} \ \frac{1}{2}\|w\|^2 \\
&\text{s.t.} \ \ -y_i\left(w \cdot x_i + b\right) + 1 \le 0, \ \ i = 1,\dots,N
\end{aligned}

Here the inequality constraint is written in the standard form $g(x) \le 0$. First construct the Lagrangian:

\[L(w,b,a) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^N a_i \left(y_i (w \cdot x_i + b) - 1\right)\]

This Lagrangian is quite interesting. First, the multipliers must satisfy $a_i \ge 0$; second, by complementary slackness (which is where the support vectors come from) we always have:

\[\sum_i a_i \left(y_i (w \cdot x_i + b) - 1\right) = 0\]

The points with $a_i > 0$ are exactly the support vectors. The problem is then solved through the dual: first the primal problem is rewritten in terms of the Lagrangian, then it is converted into the dual problem (the purpose of doing so is explained in the referenced articles). The primal problem is constructed as follows:

\[\max_{a_i \ge 0} L(w,b,a) = \frac{1}{2}\|w\|^2\]

because when the constraints are satisfied, the second term of the Lagrangian is at most 0 and its maximum over $a_i \ge 0$ is 0 (while if a constraint is violated the maximum is $+\infty$). The primal problem of the SVM and its optimal value $p^*$ are therefore:

\[p^* = \min_{w,b} \max_{a_i \ge 0} L(w,b,a)\]

The dual form and its optimal value $d^*$ are:

\[d^* = \max_{a_i \ge 0} \min_{w,b} L(w,b,a)\]

By duality theory, the dual problem is a lower bound of the primal problem, i.e. $d^* \le p^*$. The primal problem here satisfies the Slater condition, which means there exist $w^*, b^*, a^*$ that are solutions of the primal and dual problems respectively, with:

\[p^* = d^* = L(w^*, b^*, a^*)\]

Because the Slater condition holds (i.e. strong duality holds), every pair of primal and dual optimal solutions satisfies the KKT conditions, namely:

\begin{aligned}
&\nabla_w L(w^*,b^*,a^*) = w^* - \sum_{i=1}^N a_i^* y_i x_i = 0 \\
&\nabla_b L(w^*,b^*,a^*) = -\sum_{i=1}^N a_i^* y_i = 0 \\
&a_i^* \left(y_i (w^* \cdot x_i + b^*) - 1\right) = 0 \\
&y_i (w^* \cdot x_i + b^*) - 1 \ge 0 \\
&a_i^* \ge 0, \ \ i = 1,\dots,N
\end{aligned}

From these conditions we obtain (taking any index $j$ with $a_j^* > 0$):

\begin{aligned}
w^* &= \sum_{i=1}^N a_i^* y_i x_i \\
b^* &= y_j - \sum_{i=1}^N a_i^* y_i (x_i \cdot x_j)
\end{aligned}

Finally, to clarify the line of reasoning: first construct the dual problem; since the primal problem satisfies the Slater condition, strong duality holds and the primal and dual problems share an optimal solution; strong duality also implies that every pair of primal and dual optimal solutions satisfies the KKT conditions. We therefore start from the dual problem, which has the max-min form, and look at the inner min part first:

1) min part: treat $a$ as fixed and solve for the minimum over $w, b$ by setting the partial derivatives to zero:

\begin{aligned}
\frac{\partial L(w,b,a)}{\partial w} &= 0 \ \Rightarrow \ w - \sum_{i=1}^N a_i y_i x_i = 0 \\
\frac{\partial L(w,b,a)}{\partial b} &= 0 \ \Rightarrow \ -\sum_{i=1}^N a_i y_i = 0
\end{aligned}

Note that the second equation produces the constraint $\sum_i a_i y_i = 0$. Substituting these results back into $L(w,b,a)$ gives $\min_{w,b} L(w,b,a) = -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N a_i a_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^N a_i$.

2) max part: with the min part solved, we now maximize over $a$; the optimization problem becomes:

\begin{aligned}
&\max_{a_i \ge 0} \ -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N a_i a_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^N a_i \\
&\text{s.t.} \ \ \sum_{i=1}^N a_i y_i = 0
\end{aligned}
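
For illustration (again an assumed sketch, not original content), the dual can also be handed to a generic convex solver such as cvxpy. Note that $\sum_i\sum_j a_i a_j y_i y_j (x_i \cdot x_j) = \left\|\sum_i a_i y_i x_i\right\|^2$, which keeps the maximization objective concave by construction:

```python
import cvxpy as cp
import numpy as np

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
N = len(y)

a = cp.Variable(N)
Xy = X * y[:, None]                      # rows are y_i * x_i

# max  sum_i a_i - 1/2 || sum_i a_i y_i x_i ||^2
objective = cp.Maximize(cp.sum(a) - 0.5 * cp.sum_squares(Xy.T @ a))
constraints = [a >= 0, y @ a == 0]
cp.Problem(objective, constraints).solve()

a_star = a.value
print("a* =", a_star)                    # nonzero entries mark the support vectors
```

For the same toy data the solution is approximately $a^* = (1/4, 0, 1/4)$: the first and third points are support vectors, the second is not.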

And that is it: once the dual problem is solved for $a^*$, the KKT conditions give $w^*$ and $b^*$, which recovers the optimal solution of the primal problem.

Combining the two parts above, we get the final learning algorithm of the linearly separable support vector machine, Algorithm 1.2:

Input: a linearly separable data set $\left \{(x_i,y_i) \right \}_{i=1}^N$.

(1) Construct the constrained optimization problem:

\begin{aligned}
&\max_{a_i \ge 0} \ -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N a_i a_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^N a_i \\
&\text{s.t.} \ \ \sum_{i=1}^N a_i y_i = 0
\end{aligned}

(2) Solve it to obtain $a^* = (a^*_1, a^*_2, \dots, a^*_N)^T$, generally with the SMO algorithm.

(3) Compute $w^*, b^*$ from $a^*$: first pick a component $a^*_j > 0$ and its support vector $(x_j, y_j)$, then

\begin{aligned}
w^* &= \sum_{i=1}^N a_i^* y_i x_i \\
b^* &= y_j - \sum_{i=1}^N a_i^* y_i (x_i \cdot x_j)
\end{aligned}

(4) This gives the separating hyperplane $w^* \cdot x + b^* = 0$; for a new observation $x$, determine its class $y$ according to $f(x) = \operatorname{sign}(w^* \cdot x + b^*)$.
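
Continuing the illustrative toy example (my own sketch, not from the original article), steps (3) and (4) are a few lines of NumPy once the dual solution $a^*$ is available; for these three points the dual optimum is $a^* = (1/4, 0, 1/4)$:

```python
import numpy as np

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
# Dual solution for this toy data (from SMO or the solver sketch above).
a_star = np.array([0.25, 0.0, 0.25])

w_star = (a_star * y) @ X                 # step (3): w* = sum_i a_i* y_i x_i
j = int(np.argmax(a_star > 1e-8))         # pick any support vector index j
b_star = y[j] - np.sum(a_star * y * (X @ X[j]))

def f(x):
    """Step (4): decision rule f(x) = sign(w* . x + b*)."""
    return np.sign(w_star @ x + b_star)

print(w_star, b_star, f(np.array([0.0, 0.0])))   # w* = (0.5, 0.5), b* = -2
```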

The definition of a support vector can now be stated explicitly. By the KKT condition $a_i^* \left(y_i (w^* \cdot x_i + b^*) - 1\right) = 0$, the sample points $(x_i, y_i)$ of the training set with $a_i^* > 0$ must satisfy $y_i (w^* \cdot x_i + b^*) - 1 = 0$; these inputs $(x_i, y_i)$ are the support vectors, and they necessarily lie on the margin boundary. For non-support vectors, $y_i (w^* \cdot x_i + b^*) - 1 > 0$ and therefore $a_i^* = 0$. At this point the two-class SVM for the linearly separable case is fully derived, but one problem remains: when the data contains some outliers and becomes linearly separable only after those outliers are removed, we need a way of handling the outliers, which is introduced next.

1.4 Handling outliers

Consider a data set that is not linearly separable (as in the left panel of the original figure), or one that is linearly separable but whose separating hyperplane gets squeezed by the presence of outliers (as in the right panel).

(Figure: non-linearly separable data on the left; a separating hyperplane squeezed by outliers on the right.)

This means that the functional margin of some sample points cannot satisfy the requirement of being at least 1. To solve this, introduce a slack variable $\xi_i \ge 0$ for each sample, so that the functional margin plus the slack is at least 1, i.e. $y_i(w \cdot x_i + b) \ge 1 - \xi_i$. The slack obviously cannot be arbitrarily large, because then every sample point would trivially satisfy the requirement, so the optimization objective is modified to penalize it:

\begin{aligned}
&\min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^N \xi_i \\
&\text{s.t.} \ \ y_i\left(w \cdot x_i + b\right) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \ i = 1,\dots,N
\end{aligned}

where $C > 0$ is a penalty parameter that trades off a large margin against the total amount of slack.

This optimization problem is solved in the same way as for the linearly separable data set. First construct the Lagrangian:

\[L(w,b,\xi,a,\mu) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^N \xi_i - \sum_{i=1}^N a_i\left(y_i(w \cdot x_i + b) - 1 + \xi_i\right) - \sum_{i=1}^N \mu_i \xi_i\]

The Lagrangian now has one more set of multipliers than before; clearly both sets must be non-negative, $a_i \ge 0$ and $\mu_i \ge 0$.

Therefore the optimization objective again becomes a min-max problem; applying Lagrange duality and the KKT conditions gives the max-min dual problem, whose solution is again split into two parts. First the min part, setting the partial derivatives to zero:

\begin{aligned}
\frac{\partial L}{\partial w} = 0 &\ \Rightarrow \ w = \sum_{i=1}^N a_i y_i x_i \\
\frac{\partial L}{\partial b} = 0 &\ \Rightarrow \ \sum_{i=1}^N a_i y_i = 0 \\
\frac{\partial L}{\partial \xi_i} = 0 &\ \Rightarrow \ C - a_i - \mu_i = 0
\end{aligned}

Substituting these results back, the min part becomes:

\[\min_{w,b,\xi} L(w,b,\xi,a,\mu) = -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N a_i a_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^N a_i\]

Next is the max part:

\begin{aligned}
&\max_{a} \ -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N a_i a_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^N a_i \\
&\text{s.t.} \ \ \sum_{i=1}^N a_i y_i = 0, \ \ 0 \le a_i \le C, \ \ i = 1,\dots,N
\end{aligned}

(here the box constraint $0 \le a_i \le C$ comes from eliminating $\mu_i$ through $C - a_i - \mu_i = 0$ together with $a_i \ge 0$ and $\mu_i \ge 0$).

At this point we can summarize the learning algorithm of the linear SVM with outliers, Algorithm 1.3:

Input: a data set $\left \{(x_i,y_i) \right \}_{i=1}^N$ that is linearly separable except for the presence of some outliers.

(1) Construct the constrained optimization problem:

\begin{aligned}
&\max_{a} \ -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N a_i a_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^N a_i \\
&\text{s.t.} \ \ \sum_{i=1}^N a_i y_i = 0, \ \ 0 \le a_i \le C, \ \ i = 1,\dots,N
\end{aligned}

(2) Solve it to obtain $a^* = (a^*_1, a^*_2, \dots, a^*_N)^T$, generally with the SMO algorithm.

(3) From $a^*$ compute $w^* = \sum_{i=1}^N a_i^* y_i x_i$; then choose a component with $0 < a^*_j < C$ and its support vector $(x_j, y_j)$ and compute $b^* = y_j - \sum_{i=1}^N a_i^* y_i (x_i \cdot x_j)$.

(4) This gives the separating hyperplane $w^* \cdot x + b^* = 0$; for a new observation $x$, determine its class $y = +1$ or $y = -1$ according to $f(x) = \operatorname{sign}(w^* \cdot x + b^*)$.
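
As a hedged illustration (not from the original article), the penalty parameter $C$ above corresponds to the `C` argument of scikit-learn's `SVC`; a smaller `C` tolerates more margin violations from outliers:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two separable blobs plus one outlier sitting inside the other class.
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2], [[1.8, 1.8]]])
y = np.array([1] * 20 + [-1] * 20 + [-1])

for C in (0.1, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {len(clf.support_)} support vectors")
```

Typically the small-`C` model keeps a wider margin that is barely influenced by the single outlier, while the large-`C` model lets the outlier squeeze the margin, matching the discussion above.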

In the usual figure, the solid line is the separating hyperplane and the dashed lines are the margin boundaries; the functional distance between a margin boundary and the separating hyperplane is 1 and the geometric distance is $1/\|w\|$. In the linearly non-separable case, the samples that need a positive slack $\xi_i$ are among the support vectors, and the situation is more complicated than in the separable case: the geometric distance from such a point $x_i$ to its margin boundary is $\xi_i/\|w\|$. This follows from functional distances: the functional distance from the point to the separating hyperplane is $1 - \xi_i$ and the functional distance between the hyperplane and the margin boundary is 1, so the functional distance from the point to the margin boundary is $\xi_i$; dividing by $\|w\|$ gives the geometric distance.

It can be seen that the support vectors in this case may lie on the margin boundary, between the margin boundary and the separating hyperplane, or on the wrong side of the separating hyperplane.

First consider the following two complementary-slackness conditions from the KKT system, together with $C - a_i^* - \mu_i^* = 0$:

\begin{aligned}
&a_i^* \left(y_i (w^* \cdot x_i + b^*) - 1 + \xi_i^*\right) = 0 \\
&\mu_i^* \xi_i^* = 0
\end{aligned}

Note the constraints: if $a_i^* < C$ then $\mu_i^* > 0$ and therefore $\xi_i^* = 0$; the samples with $a_i^* > 0$ are the support vectors. They fall into the following cases:

If $0 < a_i^* < C$, then $\xi_i^* = 0$ and the support vector falls exactly on the margin boundary.

According to the first condition, $y_i (w^* \cdot x_i + b^*) = 1$, i.e. the functional distance from the point to the separating hyperplane is 1.

If $a_i^* = C$ and $0 < \xi_i^* < 1$, the point is classified correctly and falls between the margin boundary and the separating hyperplane.

According to the first condition, $y_i (w^* \cdot x_i + b^*) = 1 - \xi_i^*$, so the functional distance from the point to the separating hyperplane lies in $(0, 1)$.

If $a_i^* = C$ and $\xi_i^* = 1$, the point lies on the separating hyperplane.

In that case $y_i (w^* \cdot x_i + b^*) = 1 - \xi_i^* = 0$, i.e. the functional distance from the point to the separating hyperplane is 0.

If $a_i^* = C$ and $\xi_i^* > 1$, the point lies on the wrong side of the separating hyperplane.

In that case $y_i (w^* \cdot x_i + b^*) = 1 - \xi_i^* < 0$, i.e. the functional distance is negative and the point is misclassified.

To sum up, combining these cases with the KKT conditions gives the relationship between the multipliers and the support vectors:

\begin{aligned}
a_i^* = 0 &\ \Rightarrow \ y_i (w^* \cdot x_i + b^*) \ge 1 \\
0 < a_i^* < C &\ \Rightarrow \ y_i (w^* \cdot x_i + b^*) = 1 \\
a_i^* = C &\ \Rightarrow \ y_i (w^* \cdot x_i + b^*) \le 1
\end{aligned}

At this point the problem of data that is linearly separable up to outliers is solved. However, there is one more situation to handle: the data set may be completely linearly non-separable. This is where the kernel method needs to be introduced. The kernel method is not exclusive to SVM; it can be applied to a whole range of machine learning problems. The next section covers the application of kernel methods in SVM.

1.5 The kernel method

Given a data set $\left \{(x_i,y_i) \right \}_{i=1}^N$ whose sample points are completely non-linearly separable (as in the original figure, only a hypersurface can separate the two classes), the data is said to be nonlinearly separable. Nonlinear problems are hard to solve directly, while converting them into linear problems is comparatively simple (Sections 1.1-1.4 already give the solution), so a mapping has to be introduced: the nonlinearly separable sample points of the original space are mapped into a new space in which they become linearly separable. Given such a map from the original space to a new space, the transformed sample points are linearly separable, and the linear SVM above can be applied directly to them.

$\phi$ denotes such a mapping; in general the chosen feature space is very high-dimensional. A typical example of such a mapping is $\phi: (x_1, x_2) \mapsto (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, which already takes a two-dimensional input into a three-dimensional feature space.

The following is the formal definition of a kernel function: let $\mathcal{X}$ be the input space and $\mathcal{H}$ the feature space. If there exists a mapping $\phi: \mathcal{X} \rightarrow \mathcal{H}$ such that for all $x, z \in \mathcal{X}$ the function $K$ satisfies $K(x,z) = \phi(x) \cdot \phi(z)$, then $\phi$ is called the mapping function from the input space to the feature space, and $K(x,z)$ is the kernel function.

The usual trick with kernel functions is to avoid computing the mapping function: the feature space is usually high-dimensional, even infinite-dimensional, so computing $\phi$ explicitly is not easy, while computing the kernel function is relatively simple. The mapping is not unique: different feature spaces can be chosen, and even within the same feature space different mappings can be used. After mapping, the samples are typically linearly separable with outliers, so consider the corresponding optimization objective:

\begin{aligned}
&\max_{a} \ -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N a_i a_j y_i y_j \left(\phi(x_i) \cdot \phi(x_j)\right) + \sum_{i=1}^N a_i \\
&\text{s.t.} \ \ \sum_{i=1}^N a_i y_i = 0, \ \ 0 \le a_i \le C, \ \ i = 1,\dots,N
\end{aligned}

Since the samples enter the objective only through inner products, it suffices to replace the inner product in the objective with a kernel function that is easy to compute, without ever defining the mapping function explicitly; this is the kernel trick. The optimization objective for the linearly non-separable data set therefore becomes:

\begin{aligned}
&\max_{a} \ -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N a_i a_j y_i y_j K(x_i, x_j) + \sum_{i=1}^N a_i \\
&\text{s.t.} \ \ \sum_{i=1}^N a_i y_i = 0, \ \ 0 \le a_i \le C, \ \ i = 1,\dots,N
\end{aligned}

Now, even though the linearly non-separable data is implicitly mapped into a new feature space by the mapping function, the SVM is still learned linearly in that new feature space. When the mapping function is nonlinear, the resulting SVM is a nonlinear model. In other words, given a kernel function, a nonlinear problem is solved by solving a linear classification problem; the advantage of the kernel trick is that there is no need to define the feature space or the mapping function explicitly, it is enough to choose a suitable kernel function. In short, the kernel function removes the explicit high-dimensional transformation: the inner product of two high-dimensional vectors is computed by plugging the low-dimensional inputs directly into the kernel function.
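
A small hedged sketch (mine, not from the original article) makes the kernel trick tangible: for the degree-2 polynomial kernel $K(x,z) = (x \cdot z)^2$ on 2-D inputs, the explicit feature map is $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, and the kernel value equals the inner product in the mapped space:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D input x."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (x . z)^2."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Both give the same number, but the kernel never forms phi(x) explicitly.
print(phi(x) @ phi(z))        # inner product in feature space
print(poly2_kernel(x, z))     # kernel evaluated in the input space
```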

How do we choose a kernel function, and which functions can serve as valid kernel functions? The answer is given by Mercer's theorem: if $K$ is a map on $\mathbb{R}^n \times \mathbb{R}^n \rightarrow \mathbb{R}$ (i.e. it maps two n-dimensional vectors to a real number), then $K$ is a valid kernel function (also called a Mercer kernel) if and only if the kernel matrix built from the training samples is symmetric positive semi-definite.

Let us first explain what a positive definite matrix is:

Let $A$ be an $n$-th order square matrix. If for every non-zero vector $x$ we have $x^T A x > 0$, where $x^T$ denotes the transpose of $x$, then $A$ is called a positive definite matrix.

A positive definite matrix has the following properties:

1) A positive definite matrix is necessarily non-singular. (Definition of a singular matrix: an $n$-th order matrix $A$ is singular if its determinant is zero, i.e. $|A| = 0$.)

2) Every principal submatrix of a positive definite matrix is also a positive definite matrix.

3) If $A$ is an $n$-th order positive definite matrix, then $A$ is an $n$-th order invertible matrix.

Given N training samples $x_1, \dots, x_N$, any two of them, $x_i$ and $x_j$, can be plugged into the kernel function to compute $K_{ij} = K(x_i, x_j)$. These values can be arranged into an $N \times N$ matrix $K$ (the kernel, or Gram, matrix); $K(\cdot,\cdot)$ is a valid kernel function as long as this matrix is symmetric positive semi-definite.

Symmetry is clearly satisfied:

\[K_{ij} = \phi(x_i) \cdot \phi(x_j) = \phi(x_j) \cdot \phi(x_i) = K_{ji}\]

Positive semi-definiteness is verified as follows: for any non-zero vector $z$,

\[z^T K z = \sum_{i=1}^N\sum_{j=1}^N z_i z_j \,\phi(x_i) \cdot \phi(x_j) = \left\|\sum_{i=1}^N z_i \phi(x_i)\right\|^2 \ge 0\]

The last expression is obviously non-negative. The above is the method for checking the validity of a kernel function; finally, here are some commonly used kernel functions.
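
As an illustrative check (assumed code, not from the article), one can build the Gram matrix of a candidate kernel on some samples and verify symmetry and positive semi-definiteness numerically through its eigenvalues:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(20, 3)

# Gram matrix of the Gaussian kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2)).
sigma = 1.0
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))

print("symmetric:", np.allclose(K, K.T))
print("min eigenvalue:", np.linalg.eigvalsh(K).min())   # >= 0 up to rounding error
```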

1) Linear kernel

\[K(x, z) = x \cdot z\]

The linear kernel does not actually introduce any nonlinearity and treats the samples as linearly separable.

2) Polynomial kernel

\[K(x, z) = (x \cdot z + 1)^p\]

This kernel function corresponds to a polynomial classifier of degree $p$; the additional parameter that has to be tuned is the degree $p$.

3) Gaussian kernel

\[K(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)\]

This kernel function corresponds to a Gaussian radial basis function classifier and can even map the feature space to infinitely many dimensions; the additional parameter to tune is $\sigma$. If $\sigma$ is chosen very large, the weights on the high-order features actually decay very fast, so the model is (approximately) equivalent to working in a low-dimensional subspace; conversely, if $\sigma$ is chosen very small, arbitrary data can be mapped so that it becomes linearly separable, which is not necessarily a good thing, since it may bring a very serious overfitting problem. In general, though, by tuning the parameter the Gaussian kernel is quite flexible and is one of the most widely used kernel functions.

Finally, here is the learning algorithm of the nonlinear support vector machine, Algorithm 1.4:

Input: a data set $\left \{(x_i,y_i) \right \}_{i=1}^N$ that is completely non-linearly separable, i.e. it becomes a linearly separable data set (possibly with outliers) only after being mapped to a high-dimensional space.

(1) Construct the constrained optimization problem:

\begin{aligned}
&\max_{a} \ -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N a_i a_j y_i y_j K(x_i, x_j) + \sum_{i=1}^N a_i \\
&\text{s.t.} \ \ \sum_{i=1}^N a_i y_i = 0, \ \ 0 \le a_i \le C, \ \ i = 1,\dots,N
\end{aligned}

(2) Solve it to obtain $a^* = (a^*_1, a^*_2, \dots, a^*_N)^T$, generally with the SMO algorithm.

(3) From $a^*$, choose a component with $0 < a^*_j < C$ and its support vector $(x_j, y_j)$, and compute $b^* = y_j - \sum_{i=1}^N a_i^* y_i K(x_i, x_j)$.

(4) This gives the decision function $f(x) = \operatorname{sign}\left(\sum_{i=1}^N a_i^* y_i K(x_i, x) + b^*\right)$.

(5) For a new observation $x$, determine its category $y = +1$ or $y = -1$ by the sign of $f(x)$.
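
As a hedged end-to-end illustration (not from the article), scikit-learn's `SVC` with an RBF kernel follows this algorithm; on a data set that is not linearly separable in the input space (concentric circles) it learns a nonlinear boundary:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the input space.
X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gaussian (RBF) kernel; gamma corresponds to 1 / (2 sigma^2) in the notation above.
clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```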

2. Hinge Loss Interpretation

Classical logistic regression uses the negative log loss: for a positive sample the loss is $-\log h(x)$, where $h(x)$ is the predicted probability.

Its graph is the red curve in the original figure (the green line there is the squared loss and the blue line the hinge loss): the closer $h(x)$ is to 1, the smaller the value and hence the smaller the loss.

Conversely, for a negative sample the loss is $-\log(1 - h(x))$, whose graph is the mirror image of the red line (not drawn); there, the closer $h(x)$ is to 0, the smaller the loss.

2.1 The two-class problem

Given a data set $\left \{(x_i,y_i) \right \}_{i=1}^N$ with $y_i \in \{+1,-1\}$, we want to train a linear classifier, i.e. find the optimal separating hyperplane that divides the samples into positive and negative; only the optimal parameters $w, b$ need to be learned. First define the linear mapping $f(x) = w \cdot x + b$. By the principle of empirical risk minimization, the two-class hinge loss of each training sample is $\max\left(0,\, 1 - y_i f(x_i)\right)$, so the SVM can obtain the optimal separating hyperplane directly by minimizing the regularized loss function:

\[\min_{w,b} \ \sum_{i=1}^N \max\left(0,\, 1 - y_i (w \cdot x_i + b)\right) + \lambda \|w\|^2\]
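
As a hedged sketch of this loss-function view (my own illustration, not the author's code), the regularized hinge loss can be minimized by simple subgradient descent; `lam`, the learning rate and the toy data are assumptions:

```python
import numpy as np

def fit_hinge_svm(X, y, lam=0.01, lr=0.01, epochs=200):
    """Minimize sum_i max(0, 1 - y_i (w.x_i + b)) + lam ||w||^2 by subgradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                       # samples with nonzero hinge loss
        grad_w = -(y[active, None] * X[active]).sum(axis=0) + 2 * lam * w
        grad_b = -y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
w, b = fit_hinge_svm(X, y)
print(w, b, np.sign(X @ w + b))
```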

2.2 The multi-class problem

Given a data set $\left \{(x_i,y_i) \right \}_{i=1}^N$ where each $y_i$ is one of $K$ classes, we now want a $K$-class linear classifier. The parameters to optimize become $W$ and $b$, where $W$ is a $K \times n$ matrix and $b$ is a $K$-dimensional vector, and the mapping is $f(x) = W x + b$. Now $f(x_i)$ is a score vector whose $j$-th component $f(x_i)_j$ is the score of the classifier for class $j$ on sample $x_i$. The label of a sample can be written as a one-hot vector: if $x_i$ belongs to class $j$, the $j$-th component is 1 and all the other components are 0; for example, in a 5-class problem a sample of class 3 has label vector $(0,0,1,0,0)$. Writing $f(x_i)_{y_i}$ for the score of the true class, the hinge loss of a single sample in the multi-class case can be expressed as $\sum_{j \ne y_i} \max\left(0,\, f(x_i)_j - f(x_i)_{y_i} + 1\right)$, so the loss function of the $K$-class linear SVM is:

\[\min_{W,b} \ \sum_{i=1}^N \sum_{j \ne y_i} \max\left(0,\, f(x_i)_j - f(x_i)_{y_i} + 1\right) + \lambda \|W\|^2\]
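
A short hedged sketch (assumed code, not from the original) of the multi-class hinge loss above, vectorized over a batch of samples; the shapes of `W`, `b` and the margin of 1 follow the formulation in the text:

```python
import numpy as np

def multiclass_hinge_loss(W, b, X, y):
    """sum_i sum_{j != y_i} max(0, s_j - s_{y_i} + 1), with scores s = W x + b."""
    scores = X @ W.T + b                            # shape (N, K)
    correct = scores[np.arange(len(y)), y]          # score of the true class, shape (N,)
    margins = np.maximum(0.0, scores - correct[:, None] + 1.0)
    margins[np.arange(len(y)), y] = 0.0             # drop the j == y_i terms
    return margins.sum()

rng = np.random.RandomState(0)
X = rng.randn(4, 3)                                 # 4 samples, 3 features
y = np.array([0, 2, 1, 2])                          # class indices among K = 3 classes
W, b = rng.randn(3, 3), np.zeros(3)
print(multiclass_hinge_loss(W, b, X, y))
```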
