The problem formalization of SVM

Source: Internet
Author: User
Tags: svm

The problem formalization of SVM

The duality problem of SVM

The kernel function of SVM

How SVM handles linearly non-separable data

Before SVM: convex optimization and the duality problem

SVM covers a wide range of material, which I intend to describe in five articles. "The problem formalization of SVM" gives the problem description and the basic model; "The duality problem of SVM" transforms that model into its dual problem and solves it; "The kernel function of SVM" describes how kernel functions let SVM map features to a high-dimensional space; "How SVM handles linearly non-separable data" describes SVM's treatment of data that no hyperplane can separate; finally, "Before SVM: convex optimization and the duality problem" is independent of SVM itself, but covers the optimization foundations that solving SVM relies on. Note that the first three articles assume the data are linearly separable, while the fourth handles the linearly non-separable case. This does not mean the fourth article is divorced from the first three; on the contrary, the formal description of SVM in the non-separable case differs little from the separable case, so the solution method is the same.

First, the linear classifier

Given $m$ training samples ${\left\{ x^{i}, y^{i} \right\}}_{i=1\cdots m}$, with $x^{i}\in \mathbb{R}^{n}$ and $y^{i}\in \{-1,1\}$, a linear classifier seeks a hyperplane $w^{T}x+b=0$ that separates the two classes of samples, so that $w^{T}x^{i}+b>0$ for samples with class $y^{i}=1$, and $w^{T}x^{i}+b<0$ for samples with class $y^{i}=-1$. Such a hyperplane $w^{T}x+b=0$ is a "perfect" classifier. This "perfect" classifier is of course very demanding, because it requires the data to be linearly separable; until the linearly non-separable case is raised explicitly, we assume the data are as "perfect" as it demands.
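To make the decision rule concrete, here is a minimal sketch in Python; the weight vector `w`, bias `b`, and sample point are made-up values, purely for illustration:

```python
import numpy as np

def predict(w, b, x):
    """Classify x by the sign of the linear function w^T x + b."""
    return 1 if w @ x + b > 0 else -1

# Hypothetical hyperplane and sample, purely for illustration.
w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.0, 1.0])
print(predict(w, b, x))  # w @ x + b = 1.5 > 0, so the predicted class is 1
```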

Second, the functional margin and the geometric margin

Before continuing with the classifier, we first introduce two concepts: the functional margin and the geometric margin.

For a linear function $h(x) = w^{T}x+b$, define the functional margin of the $i$-th sample as: \[\hat{\gamma}^{i} = y^{i}(w^{T}x^{i}+b)\]

The meaning of the functional margin is not particularly intuitive, but it is easy to see one of its properties: the functional margin scales proportionally with $(w,b)$. That is, if we let $(w', b') = k(w, b)$, then $\hat{\gamma}'^{i} = y^{i}(w'^{T}x^{i}+b') = k\hat{\gamma}^{i}$.

In addition, the functional margin of a set of $m$ samples is defined as $\hat{\gamma}=\underset{i}{\mathop{\min}}\,\hat{\gamma}^{i}$, the smallest functional margin over all samples.
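A small sketch makes both the definition and the scaling property concrete; the toy data below are made up for illustration:

```python
import numpy as np

def functional_margins(w, b, X, y):
    """Per-sample functional margins: hat_gamma^i = y^i * (w^T x^i + b)."""
    return y * (X @ w + b)

# Hypothetical linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1])
w, b = np.array([1.0, 1.0]), -1.0

margins = functional_margins(w, b, X, y)
print(margins)        # [3. 3. 4.]
print(margins.min())  # 3.0 -- the functional margin hat_gamma of the whole set

# Scaling (w, b) by k scales every functional margin by k.
k = 5.0
print(functional_margins(k * w, k * b, X, y))  # [15. 15. 20.] = k * margins
```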

The meaning of the geometric margin is much clearer: it is the geometric distance from a sample to the plane $w^{T}x+b=0$.

Let point A have coordinates given by the vector $x_{a}$, and let its projection onto the plane $w^{T}x+b=0$ be point B, with coordinates $x_{b}$. The distance from point A to the plane $w^{T}x+b=0$ is the length of the vector $\overrightarrow{BA}$. The unit normal vector of the plane is $\frac{w}{\left\| w \right\|}$, so $\overrightarrow{BA}$ can be written as $\gamma^{a}\frac{w}{\left\| w \right\|}$, where $\gamma^{a}$ is the length of $\overrightarrow{BA}$, which is also the geometric distance from point A to the plane. From the relationship between the vectors $x_{a}$, $x_{b}$, and $\overrightarrow{BA}$, we obtain $x_{b} = x_{a} - \overrightarrow{BA} = x_{a} - \gamma^{a}\frac{w}{\left\| w \right\|}$. Since $x_{b}$ lies on the plane $w^{T}x+b=0$, we have $w^{T}x_{b}+b=0$; combining the two equations gives the geometric distance from point A to the plane: $\gamma^{a} = \frac{1}{\left\| w \right\|}(w^{T}x_{a}+b)$.

Taking into account that samples on the two sides of the plane give results of opposite sign, the geometric margin of sample $i$ is defined as: \[\gamma^{i} = \frac{1}{\left\| w \right\|} y^{i}(w^{T}x^{i}+b)\]

Obviously, the geometric margin and the functional margin satisfy: \[\gamma^{i} = \frac{\hat{\gamma}^{i}}{\left\| w \right\|}\]

Similarly, the geometric margin of a set of $m$ samples with respect to $w^{T}x+b=0$ is defined as \[\gamma = \underset{i}{\mathop{\min}}\,\gamma^{i}\]
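Numerically, the geometric margin behaves exactly as the formulas above suggest; in particular, unlike the functional margin, it does not change when $(w,b)$ is rescaled. A small sketch with the same made-up toy data as before:

```python
import numpy as np

def geometric_margins(w, b, X, y):
    """Per-sample geometric margins: gamma^i = y^i * (w^T x^i + b) / ||w||."""
    return y * (X @ w + b) / np.linalg.norm(w)

# Hypothetical linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1])
w, b = np.array([1.0, 1.0]), -1.0

print(geometric_margins(w, b, X, y).min())          # gamma of the set, about 2.121
print(geometric_margins(5 * w, 5 * b, X, y).min())  # unchanged by rescaling
```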

Third, the maximum-margin classifier

If the data are linearly separable, then there is more than one pair $(w,b)$ whose hyperplane $w^{T}x+b=0$ classifies the data correctly, so a criterion is needed to choose an optimal one.

The maximum-margin classifier uses the geometric margin as this criterion: among all hyperplanes that classify the samples correctly, it selects the one whose geometric margin over the whole sample set, $\gamma = \underset{i}{\mathop{\min}}\,\gamma^{i}$, is largest. In other words, while classifying everything correctly, it keeps the nearest samples as far away from the separating hyperplane as possible. This ensures better generalization of the classifier: if the hyperplane passes close to one class of samples, the classifier is likely to misclassify new data near those samples. Intuitively, such a hyperplane is roughly parallel to the "boundary" between the two classes and passes through the middle of the gap between them.

Now we formalize this idea. First, the classifier must correctly classify all the training data with geometric margin at least $\gamma$; the goal is then to maximize $\gamma$. This can be written as the following optimization problem: \[\begin{aligned} & \underset{w,b}{\mathop{\max}}\,\ \gamma \\ & \text{s.t.}\quad \frac{1}{\left\| w \right\|} y^{i}(w^{T}x^{i}+b) \ge \gamma,\quad i=1,\cdots,m \end{aligned}\]

Using the relationship $\gamma = \frac{\hat{\gamma}}{\left\| w \right\|}$ between the geometric margin and the functional margin, the optimization problem above can be rewritten as: \[\begin{aligned} & \underset{w,b}{\mathop{\max}}\,\ \frac{\hat{\gamma}}{\left\| w \right\|} \\ & \text{s.t.}\quad y^{i}(w^{T}x^{i}+b) \ge \hat{\gamma},\quad i=1,\cdots,m \end{aligned}\]

Now look at the objective $\underset{w,b}{\mathop{\max}}\,\frac{\hat{\gamma}}{\left\| w \right\|}$, where the optimization variables are $(w,b)$. As noted earlier, the functional margin $\hat{\gamma}$ scales proportionally with $(w,b)$, so rescaling $(w,b)$ by any constant rescales $\hat{\gamma}$ by the same constant and does not change the value of $\frac{\hat{\gamma}}{\left\| w \right\|}$. Because of this relationship, we can simplify the problem by fixing one of $\hat{\gamma}$ and $(w,b)$ and optimizing over the other (my own understanding here is still somewhat vague). In practice, fixing $\hat{\gamma}$ makes the problem easier, so set $\hat{\gamma}=1$ (any other constant would also do). Maximizing $\frac{\hat{\gamma}}{\left\| w \right\|} = \frac{1}{\left\| w \right\|}$ is then equivalent to minimizing $\left\| w \right\|$, which in turn is equivalent to minimizing $\frac{1}{2}\left\| w \right\|^{2}$ (a conversion that is convenient for later derivations). The optimization problem thus becomes: \[\begin{aligned} & \underset{w,b}{\mathop{\min}}\,\ \frac{1}{2}\left\| w \right\|^{2} \\ & \text{s.t.}\quad y^{i}(w^{T}x^{i}+b) \ge 1,\quad i=1,\cdots,m \end{aligned}\]
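Since this final formulation is a convex quadratic program, it can be handed to an off-the-shelf solver as a sanity check. Below is a minimal sketch using cvxpy (assuming it is installed); the toy data are made up and linearly separable:

```python
import numpy as np
import cvxpy as cp

# Hypothetical linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(X.shape[1])
b = cp.Variable()

# min (1/2)||w||^2  s.t.  y^i (w^T x^i + b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
# Samples where y^i (w^T x^i + b) equals 1 at the optimum are the support vectors.
```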

The foundation of SVM, the maximum-margin classifier, has now been formalized as the optimization problem above. It is not hard to see that this is a convex optimization problem, so it can be solved conveniently. However, we will not solve it directly; instead we solve its dual problem, mainly because (as I understand it) the dual has a nice form from which the kernel function arises naturally. The dual problem will be covered in the next article.
