Machine Learning Public Lesson Note (7): Support Vector Machine


Support Vector Machines (SVM)

Consider logistic regression: for $y=1$ data we want $h_\theta(x) \approx 1$, which corresponds to $\theta^T x \gg 0$; for $y=0$ data we want $h_\theta(x) \approx 0$, which corresponds to $\theta^T x \ll 0$. The cost of a single data point is $$-\left[y\log(h_\theta(x)) + (1-y)\log(1-h_\theta(x))\right]$$ When $y=1$ the cost is $\text{cost}_1(z) = -\log\left(\frac{1}{1+e^{-z}}\right)$, and when $y=0$ the cost is $\text{cost}_0(z) = -\log\left(1-\frac{1}{1+e^{-z}}\right)$, as shown in Figure 1.

Figure 1: How the cost of an individual data point varies with $z$ for $y=1$ and $y=0$

The regularized logistic regression objective is $$\min\limits_\theta \frac{1}{m}\left[\sum\limits_{i=1}^{m}y^{(i)}\left(-\log h_\theta(x^{(i)})\right) + (1-y^{(i)})\left(-\log(1-h_\theta(x^{(i)}))\right)\right] + \frac{\lambda}{2m}\sum\limits_{j=1}^{n}\theta_{j}^2$$ Removing the factor $\frac{1}{m}$ and changing the form $A+\lambda B$ into $CA+B$, the SVM objective becomes $$\min\limits_\theta C\left[\sum\limits_{i=1}^{m}y^{(i)}\,\text{cost}_1(\theta^T x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^T x^{(i)})\right] + \frac{1}{2}\sum\limits_{j=1}^{n}\theta_{j}^2$$
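The sketch below is a minimal NumPy illustration of this $CA+B$ objective (not from the original notes). It assumes the piecewise-linear surrogates $\text{cost}_1(z)=\max(0,1-z)$ and $\text{cost}_0(z)=\max(0,1+z)$ sketched in Figure 1, and the function name `svm_cost` is only illustrative.

```python
import numpy as np

def svm_cost(theta, X, y, C):
    """SVM objective C*A + B for labels y in {0, 1}.

    cost_1(z) = max(0, 1 - z) and cost_0(z) = max(0, 1 + z) are the
    hinge-style surrogates that replace the two logistic losses.
    X is assumed to carry a bias column x_0 = 1.
    """
    z = X @ theta                               # theta^T x for every example
    cost1 = np.maximum(0, 1 - z)                # penalized when y = 1 and z < 1
    cost0 = np.maximum(0, 1 + z)                # penalized when y = 0 and z > -1
    A = np.sum(y * cost1 + (1 - y) * cost0)     # data-fitting term
    B = 0.5 * np.sum(theta[1:] ** 2)            # regularization, skipping theta_0
    return C * A + B
```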

Maximum Margin (Large Margin Intuition)

For $y=1$ data we want $\theta^T x \geq 1$, not just $\geq 0$; for $y=0$ data we want $\theta^T x \leq -1$, not just $< 0$. When $C$ is very large, minimizing $CA+B$ forces $A \approx 0$, so the SVM maximum-margin problem becomes \begin{align*}&\min\limits_\theta\frac{1}{2}\sum\limits_{j=1}^{n}\theta_j^{2}\\ &\text{s.t.}\quad\begin{cases}\theta^{T}x^{(i)}\geq 1 & \text{if } y^{(i)}=1\\\theta^{T}x^{(i)}\leq -1 & \text{if } y^{(i)}=0\end{cases}\end{align*}
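When the data are linearly separable, this constrained problem is a small quadratic program. The sketch below (not from the original notes) solves it for a hypothetical toy dataset using cvxpy as one possible solver; a bias feature $x_0 = 1$ is prepended so that $\theta_0$ acts as the intercept and is excluded from the regularizer.

```python
import numpy as np
import cvxpy as cp  # any QP solver would do; cvxpy is just one choice

# Toy, linearly separable 2-D data with a bias feature x_0 = 1 prepended.
X = np.array([[1.0, 2.0, 2.0], [1.0, 3.0, 3.0],   # y = 1 examples
              [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])  # y = 0 examples
y = np.array([1, 1, 0, 0])

theta = cp.Variable(X.shape[1])
scores = X @ theta  # theta^T x^(i) for every example
constraints = [scores[i] >= 1 if y[i] == 1 else scores[i] <= -1
               for i in range(len(y))]
# Minimize (1/2) * sum_{j>=1} theta_j^2 subject to the margin constraints.
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(theta[1:])), constraints)
prob.solve()
print(theta.value)  # parameters of the maximum-margin hyperplane
```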

Figure 2: The SVM's optimal separating hyperplane and the maximum margin (the distance between the two dashed lines). The support vectors are the points lying on the two dashed lines (in the feature space each point can be viewed as a vector from the origin, hence the name).

The parameter $C$ controls the tolerance for misclassification: the larger $C$ is, the less misclassification is tolerated and the more prone the model is to overfitting (see Figure 3).


Figure 3: The effect of the parameter $C$ on the classification boundary. The larger $C$ is, the less misclassification is tolerated; conversely, a smaller $C$ accepts a small number of misclassified points.

Kernel

When the data are not linearly separable in the low-dimensional space, they can be mapped into a higher-dimensional space by adding higher-order polynomial features, so that a hyperplane in the high-dimensional space can separate them (Figure 4). The kernel addresses the problem of how to choose suitable high-dimensional features: for any low-dimensional data point $x$, define its similarity to a pre-selected landmark $l^{(i)}$ in the low-dimensional space as $$f_i = \text{similarity}(x, l^{(i)}) = \exp\left(-\frac{\|x-l^{(i)}\|^2}{2\sigma^2}\right)$$ With $k$ landmarks we obtain $k$ new features $f_i$, which map the point $x$ from the low-dimensional space to a point in the high-dimensional ($k$-dimensional) space. If $x$ is very close to $l^{(i)}$, i.e. $x \approx l^{(i)}$, then $f_i \approx 1$; conversely, if $x$ is far from $l^{(i)}$, then $f_i \approx 0$. The remaining question is how to choose the landmarks. One approach is to take all $m$ training points as landmarks, so that for a data point $x^{(i)}$ with $n$ features, computing the similarities yields an $m$-dimensional feature vector $f^{(i)}$, as in the sketch below.
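A minimal NumPy sketch of this feature mapping (not from the original notes); the function name and the toy landmarks are illustrative.

```python
import numpy as np

def gaussian_features(x, landmarks, sigma):
    """Map x to similarity features f_i = exp(-||x - l^(i)||^2 / (2 sigma^2))."""
    diffs = landmarks - x                      # (m, n) differences to each landmark
    sq_dists = np.sum(diffs ** 2, axis=1)      # squared distances ||x - l^(i)||^2
    return np.exp(-sq_dists / (2 * sigma ** 2))

# Using all training points as landmarks maps each x to an m-dimensional f.
X_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
f = gaussian_features(np.array([1.0, 2.1]), X_train, sigma=1.0)
print(f)  # f[0] is close to 1 (x is near the first landmark), the others near 0
```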


Figure 4: Using a kernel function to project low-dimensional, linearly non-separable data into a high-dimensional space where it becomes linearly separable.

SVM with kernels

Hypothesis: given a data point $x$, compute its new features $f$. Predict $y=1$ when $\theta^T f \geq 0$; otherwise predict $y=0$.

Training: $$\min\limits_\theta C\left[\sum\limits_{i=1}^{m}y^{(i)}\,\text{cost}_1(\theta^T f^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^T f^{(i)})\right] + \frac{1}{2}\sum\limits_{j=1}^{n}\theta_{j}^2$$

Effect of the parameter $C$ ($\approx\frac{1}{\lambda}$):

    • Large $C$: low bias, high variance
    • Small $C$: high bias, low variance

Effect of the parameter $\sigma^2$:

    • Large $\sigma^2$: high bias, low variance ($f_i$ varies more smoothly)
    • Small $\sigma^2$: low bias, high variance ($f_i$ varies less smoothly)

Support Vector Machine Practice

In practice you do not need to implement an SVM yourself; instead you call a mature library such as LIBLINEAR or LIBSVM. You only need to specify the parameter $C$ and the kernel function.

Linear kernel: do not specify a kernel, i.e. "no kernel", also known as a "linear kernel" (suitable when the number of features $n$ is large and the number of examples $m$ is small).

Gaussian kernel: $f_i=\exp\left(-\frac{\|x-l^{(i)}\|^2}{2\sigma^2}\right)$, where $l^{(i)}=x^{(i)}$; you need to specify the parameter $\sigma^2$ (suitable when $n$ is small and $m$ is large). Note that the data need feature scaling before the Gaussian kernel is used, as in the sketch below.
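A minimal scikit-learn sketch (not from the original notes) of fitting an SVM with a Gaussian (RBF) kernel after feature scaling; the toy data are hypothetical, and `gamma` in scikit-learn's RBF kernel plays the role of $\frac{1}{2\sigma^2}$.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC  # scikit-learn's SVC wraps LIBSVM

# Hypothetical toy data; real code would load its own X, y.
X = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([0, 0, 1, 1])

# Feature scaling before using the Gaussian kernel, as noted above.
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# gamma corresponds to 1 / (2 * sigma^2) in the similarity formula.
sigma_sq = 1.0
clf = SVC(C=1.0, kernel="rbf", gamma=1.0 / (2.0 * sigma_sq))
clf.fit(X_scaled, y)
print(clf.predict(scaler.transform([[4.5, 4.5]])))  # expected: [1]
```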

Other common kernels include:

    • Polynomial kernel: $k(x, l) = (\alpha x^T l + c)^{d}$, where the adjustable parameters are the slope $\alpha$, the constant $c$, and the polynomial degree $d$
    • String kernel: operates directly on strings without first converting them to numeric vectors; for the exact formula see wikipedia: string kernel
    • Chi-square kernel: $k(x, y) = 1 - \sum\limits_{k=1}^{n}\frac{(x_k - y_k)^2}{\frac{1}{2}(x_k + y_k)}$
    • Histogram intersection kernel: $k(x, y) = \sum\limits_{k=1}^{n}\min(x_k, y_k)$

Multi-class classification: using the one-vs-all method, $K$ classes require training $K$ SVMs, as in the sketch below.
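A minimal sketch of one-vs-all built on a binary SVM library (not from the original notes); the helper names are illustrative, and scikit-learn's `SVC` stands in for any LIBSVM-style binary classifier.

```python
import numpy as np
from sklearn.svm import SVC  # LIBSVM-backed binary SVM

def train_one_vs_all(X, y, n_classes, C=1.0, gamma=0.5):
    """Train one binary SVM per class: class k versus all the others."""
    models = []
    for k in range(n_classes):
        clf = SVC(C=C, kernel="rbf", gamma=gamma)
        clf.fit(X, (y == k).astype(int))      # relabel: 1 if class k, else 0
        models.append(clf)
    return models

def predict_one_vs_all(models, X):
    """Pick the class whose SVM returns the largest decision value theta^T f."""
    scores = np.column_stack([m.decision_function(X) for m in models])
    return np.argmax(scores, axis=1)
```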

Logistic regression and SVM

When $n$ is large relative to $m$ ($n \geq m$, e.g. $n$ = 10,000, $m$ = 1,000), use logistic regression or an SVM with a linear kernel.

When $n$ is small and $m$ is moderate (e.g. $n$ = 10 to 1,000, $m$ = 10 to 100,000), use an SVM with a Gaussian kernel.

When $n$ is small and $m$ is large (e.g. $n$ = 1 to 1,000, $m$ = 500,000), add new features, then use logistic regression or an SVM with a linear kernel.

A neural network works well across all of these $n$, $m$ regimes; its drawback is that training is slow.

References

[1] Andrew Ng. Coursera public course, week 7.

[2] Kernel Functions for Machine Learning Applications. http://crsouza.com/2010/03/kernel-functions-for-machine-learning-applications/#chisquare

[3] Wikipedia: String kernel. https://en.wikipedia.org/wiki/String_kernel

[4] Hofmann T, Schölkopf B, Smola A J. Kernel methods in machine learning [J]. The Annals of Statistics, 2008: 1171-1220.
