[Technical Translation] A Concise Tutorial on Support Vector Machines and Parameter Tuning in Python and R

Source: Internet
Author: User
Tags: svm

Original: Simple Tutorial on SVM and Parameter Tuning in Python and R

Introduction

Data classification is an important task in machine learning, and the support vector machine (SVM) is widely used for pattern classification and nonlinear regression problems. The SVM was originally proposed by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963. Since then, SVMs in various forms have been used successfully to solve a variety of real-world problems, such as text classification, image classification, bioinformatics (protein classification, cancer classification), handwritten character recognition, and so on.

Content

1. What is a support vector machine (SVM)?

2. How does the support vector machine work?

3. Derivation of support vector machine

4. Advantages and disadvantages of support vector machines

5. Support vector machine implementation in Python and R

What is a support vector machine?

A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression problems. It uses a technique called the kernel trick to transform the data and, based on these transformations, finds an optimal boundary among the possible solutions.

In short, to separate the data according to its labels, SVM performs some fairly complex transformations of the data. In this article we will only discuss the SVM classification algorithm.

How does a support vector machine work?

The most important thing in understanding how SVM works is to understand the optimal separating hyperplane: the hyperplane that maximizes the margin of the training data. Let us work toward this goal step by step with some example data.

What is a separating hyperplane?

In the picture above, the data can be separated. For example, we can draw a line through the middle of the data such that all the green points lie on one side and all the red points lie on the other.

So the question arises: this is clearly a line, so why call it a hyperplane?

In the diagram above we considered only the simplest case, where the data lies in a 2-dimensional plane. However, SVM also works in general n-dimensional spaces, and in higher-dimensional spaces the hyperplane is the generalization of a plane.

For example:

For 1-dimensional data, a point represents the hyperplane.

For 2-dimensional data, a line represents the hyperplane.

For 3-dimensional data, a plane represents the hyperplane.

For higher-dimensional data, it is simply called a hyperplane.

As mentioned earlier, the goal of SVM is to find the optimal separating hyperplane. So what kind of hyperplane is optimal? The fact that a hyperplane separates the data does not necessarily mean that it is optimal.

Let's look at a few pictures to understand the problem.

1. Multiple hyperplanes

There are multiple hyperplanes here, but which of them is a separating hyperplane? It is easy to see that line B is the hyperplane that separates the two classes well.

2. Multiple separating hyperplanes

There may also be multiple separating hyperplanes. How do we find the optimal one? Intuitively, if we choose a hyperplane that passes very close to the points of one class, its generalization ability will probably not be very good. So we should look for the hyperplane that is as far as possible from the points of every class.

In the figure, the optimal hyperplane is hyperplane B.

Therefore, maximizing the distance between the hyperplane and the nearest point of each class yields the optimal separating hyperplane. This distance is called the margin.

The goal of SVM is to find this optimal hyperplane, because it not only classifies the existing data correctly but also generalizes to unseen data. The optimal hyperplane is the one with the maximum margin.

Mathematical derivation

Now that we have a rough idea of the basic concepts behind the algorithm, let's look at the mathematical derivation of SVM.

I assume you have a basic understanding of vectors, vector algebra, and orthogonal projections. These concepts are described in this article: Linear Algebra in Machine Learning.

Hyperplane equation

You probably know that a line can be written as $y = mx + c$, where $m$ is the slope of the line and $c$ is its y-intercept.

The more general equation of a hyperplane is as follows:

$$w^T x = 0$$

Here $x$ and $w$ are vectors, and $w^T x$ denotes their dot product. $w$ is usually called the weight vector.

The line above can be rewritten as $y - mx - c = 0$. In this case:

$$w = \begin{pmatrix} -c \\ -m \\ 1 \end{pmatrix}, \qquad x = \begin{pmatrix} 1 \\ x \\ y \end{pmatrix}$$

$$w^T x = 0$$

The equation of the line and $w^T x = 0$ are just two different expressions of the same thing. So why do we use $w^T x = 0$? Simply because this form is easier to work with for higher-dimensional data. The vector $w$ is normal (perpendicular) to the hyperplane, which is useful when we compute the distance from a point to the hyperplane.
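To make this equivalence concrete, here is a small check with NumPy (a sketch of my own, not part of the original tutorial): any point on the line $y = mx + c$ satisfies $w^T x = 0$ for the vectors defined above.

    import numpy as np

    m, c = 2.0, 1.0                      # slope and intercept of the line y = m*x + c
    w = np.array([-c, -m, 1.0])          # weight vector from the derivation above

    x0 = 3.0                             # pick any x; the matching y is m*x0 + c
    point = np.array([1.0, x0, m * x0 + c])   # augmented vector (1, x, y)

    print(np.dot(w, point))              # prints 0.0: the point satisfies w^T x = 0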

Understanding constraints

The training data in our classification problem is $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\} \in \mathbb{R}^n \times \{-1, 1\}$. This means the training data consists of $n$-dimensional vectors $x_i$ with labels $y_i$: $y_i = 1$ means the feature vector $x_i$ belongs to class 1, and $y_i = -1$ means it belongs to class -1.

In a classification problem, we try to learn a function $y = f(x): \mathbb{R}^n \rightarrow \{-1, 1\}$ from the training data, and then use this function to predict the class of unknown data.

There are infinitely many possible functions $f(x)$, so we must impose restrictions to narrow them down. In the case of SVM, $f(x)$ must be determined by a hyperplane $w^T x = 0$, which can also be written as $\vec{w} \cdot \vec{x} + b = 0$, with $\vec{w} \in \mathbb{R}^n$ and $b \in \mathbb{R}$.

This hyperplane divides the input space into two parts: one containing class 1 and the other containing class -1.
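As an illustration (a minimal sketch of my own, not from the original article), the resulting classifier simply reports which side of the hyperplane a point falls on, i.e. $f(x) = \mathrm{sign}(\vec{w} \cdot \vec{x} + b)$:

    import numpy as np

    def predict(w, b, X):
        # classify each row of X as +1 or -1, depending on which side of
        # the hyperplane w . x + b = 0 it lies on
        return np.sign(X @ w + b)

    # a hypothetical hyperplane in 2 dimensions: x1 + x2 - 1 = 0
    w = np.array([1.0, 1.0])
    b = -1.0

    X = np.array([[2.0, 2.0],      # above the line -> +1
                  [-1.0, -1.0]])   # below the line -> -1
    print(predict(w, b, X))        # [ 1. -1.]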

In what follows, we will consider 2-dimensional vectors. Let $H_0$ be a separating hyperplane for the data that satisfies the following condition:

$$\vec{w} \cdot \vec{x} + b = 0$$

Along with $H_0$, we can select two other hyperplanes $H_1$ and $H_2$ that also separate the data and satisfy the following equations:

$$\vec{w} \cdot \vec{x} + b = -\sigma \qquad \text{and} \qquad \vec{w} \cdot \vec{x} + b = \sigma$$

so that $H_0$ is equidistant from $H_1$ and $H_2$.

The variable $\sigma$ is not determined, so we can set $\sigma = 1$ to simplify the problem:

$$\vec{w} \cdot \vec{x} + b = -1 \qquad \text{and} \qquad \vec{w} \cdot \vec{x} + b = 1$$

Next, we want to make sure there are no data points between these two hyperplanes. So we choose hyperplanes that satisfy the following constraint: for each vector $x_i$,

either $\vec{w} \cdot \vec{x_i} + b \leq -1$ for $x_i$ belonging to class -1,

or $\vec{w} \cdot \vec{x_i} + b \geq 1$ for $x_i$ belonging to class 1.

The above restrictions can be combined into a single constraint: $y_i(\vec{w} \cdot \vec{x_i} + b) \geq 1$ for all $i \in [1, n]$.

For brevity we omit the derivation of the margin and just state the result. The margin, denoted $M$, is:

$$M = \frac{2}{\|\vec{w}\|}$$

The only variable in this expression is $\vec{w}$, so maximizing the margin is equivalent to minimizing $\|\vec{w}\|$, and the optimization problem becomes:

$$\min \frac{\|\vec{w}\|}{2}$$

$$\text{s.t. } \; y_i(\vec{w} \cdot \vec{x_i} + b) \geq 1, \quad \forall\, i = 1, \dots, n$$
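This is a convex quadratic problem, so for a tiny linearly separable dataset it can be solved directly with a general-purpose convex solver. The sketch below is my own illustration (it assumes the cvxpy package, which the article does not mention); it sets up exactly the problem above and prints the resulting margin $2/\|\vec{w}\|$:

    import numpy as np
    import cvxpy as cp

    # a tiny, linearly separable toy dataset
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    w = cp.Variable(2)
    b = cp.Variable()

    # minimize ||w|| / 2  subject to  y_i * (w . x_i + b) >= 1
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    problem = cp.Problem(cp.Minimize(cp.norm(w, 2) / 2), constraints)
    problem.solve()

    print("w =", w.value, "b =", b.value)
    print("margin =", 2 / np.linalg.norm(w.value))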

This formulation works when the data is perfectly linearly separable. Sometimes, however, the data is not linearly separable, or it is perturbed by noise; even when it is separable, noise can make the hyperplane we find sub-optimal.

For this problem we introduce slack variables, which allow some points to fall inside the margin (or on the wrong side of it), at the cost of a penalty for those points.

In this case, the algorithm tries to keep the slack variables as close to zero as possible. It minimizes not the number of misclassifications, but the total distance by which points violate the margin.

The constraints now become: $y_i(\vec{w} \cdot \vec{x_i} + b) \geq 1 - \varsigma_i, \quad \forall\, 1 \leq i \leq n, \; \varsigma_i \geq 0$

The optimization goal becomes:

$$\min \frac{\|\vec{w}\|}{2} + C \sum_{i=1}^{n} \varsigma_i$$

The parameter $C$ is a regularization parameter that controls the trade-off between the size of the slack variables and the size of the margin:

A small $C$ makes it easier to ignore points that cross the margin, which widens the margin.

A large $C$ makes it harder to ignore points that cross the margin, which narrows the margin.

For $C = \infty$, every constraint is enforced (the hard-margin case).
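As an illustration of this trade-off (a sketch of my own, not from the original article), the snippet below fits a linear SVM with a small and a large $C$ on overlapping data and compares the number of support vectors; a smaller $C$ typically yields a wider margin and therefore more support vectors:

    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    # two slightly overlapping blobs, so some points violate the margin
    X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

    for C in (0.01, 100.0):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        print(f"C={C}: support vectors per class = {clf.n_support_}")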

For data in a 2-dimensional plane, the best classifier is a straight line; for 3-dimensional space, it is a plane. But it is not always possible to achieve a good classification with a straight line or a plane; sometimes we need a nonlinear region to separate the classes. SVM handles this nonlinear classification problem with kernel functions, which implicitly map the data into a different space where the classes can be separated by a linear hyperplane. This approach is called the kernel trick (or kernel method).

Assuming $\phi$ is the mapping associated with the kernel, which takes $x_i$ to $\phi(x_i)$, the constraints become:

$$y_i(\vec{w} \cdot \phi(\vec{x_i}) + b) \geq 1 - \varsigma_i, \quad \forall\, 1 \leq i \leq n, \; \varsigma_i \geq 0$$

The optimization goal keeps the same form as in the soft-margin case above, with the data replaced by its image under $\phi$.

We will not go into how this optimization problem is solved here; the most common approach is convex optimization.
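To see the kernel trick in practice, here is a small sketch (my own illustration, not from the original article) comparing a linear kernel and an RBF kernel on data that is not linearly separable; the RBF kernel should score noticeably better:

    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # two interleaving half-moons: not separable by a straight line
    X, y = make_moons(n_samples=300, noise=0.15, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for kernel in ("linear", "rbf"):
        clf = SVC(kernel=kernel, C=1.0).fit(X_train, y_train)
        print(f"{kernel} kernel accuracy: {clf.score(X_test, y_test):.2f}")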

Advantages and disadvantages of SVM

For a specific data set, each classification algorithm has its own advantages and disadvantages. The advantages of SVM are as follows:

    • The convexity of the optimization objective guarantees that an optimal solution can be found, and that it is a global optimum rather than a local one.
    • SVM works for both linearly separable and non-linearly separable data (using the kernel method), provided a suitable penalty parameter C can be found.
    • SVM is effective for both low-dimensional and high-dimensional data. It works efficiently even in high dimensions because the model is determined only by the support vectors, not by the whole dataset; points that are not support vectors have no influence on the model and could even be removed (see the sketch after this list).
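The following sketch (my own hypothetical example, not code from the article) illustrates that last point: refitting on the support vectors alone should reproduce essentially the same model.

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    X, y = make_blobs(n_samples=500, centers=2, cluster_std=1.5, random_state=1)

    clf = SVC(kernel="linear", C=1.0).fit(X, y)
    print("support vectors:", len(clf.support_), "out of", len(X), "points")

    # refit using only the support vectors: predictions should be (nearly) identical
    X_sv, y_sv = X[clf.support_], y[clf.support_]
    clf_sv = SVC(kernel="linear", C=1.0).fit(X_sv, y_sv)
    print("agreement:", np.mean(clf.predict(X) == clf_sv.predict(X)))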

The disadvantages of SVM are as follows:

    • SVM is not well suited to training on large datasets, because training takes a long time and has high computational requirements.
    • SVM is less effective on noisy data where the classes overlap.

SVM in Python and R

The most common library for machine learning algorithms in Python is scikit-learn, and its SVM classifier is svm.SVC():

sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='auto')

The parameters are as follows:

C: the regularization parameter.

kernel: the kernel used by the algorithm; one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed', or a user-defined function. The default is 'rbf'.

degree: the degree of the 'poly' polynomial kernel; the default is 3. Other kernels ignore this parameter.

gamma: the kernel coefficient for 'rbf', 'poly' and 'sigmoid'; if gamma is 'auto', it defaults to 1/n_features.

There are many other parameters not covered in this article; for more detail you can look here.
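A minimal usage sketch (my own example, using the Iris dataset bundled with scikit-learn; it is not the code from the original article):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # load a small built-in dataset and split it
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # fit an SVM classifier with the default RBF kernel
    clf = SVC(C=1.0, kernel="rbf", gamma="auto")
    clf.fit(X_train, y_train)

    print("test accuracy:", clf.score(X_test, y_test))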

We can tune the SVM by changing the parameters C, gamma and the kernel function; the parameter-tuning utility in scikit-learn is GridSearchCV():

sklearn.model_selection.GridSearchCV(estimator, param_grid)

The parameters are as follows:

estimator: the estimator object to tune, for example svm.SVC().

param_grid: a dictionary (or list of dictionaries) mapping parameter names to the candidate values to try.

When tuning, the parameters we optimize here are C and gamma, and the optimal values are chosen from the candidates we supply. In the sketch below we give just a few values; we could provide a wider range, but the search would take longer to run.
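The original code snippet is not reproduced in this translation, so here is a small stand-in sketch (my own, with hypothetical candidate values) showing how such a grid search is typically set up:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    # candidate values for C and gamma (illustrative, not from the article)
    param_grid = {
        "C": [0.1, 1, 10, 100],
        "gamma": [0.001, 0.01, 0.1, 1],
    }

    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X, y)

    print("best parameters:", search.best_params_)
    print("best cross-validation score:", search.best_score_)

GridSearchCV refits the estimator for every combination of candidate values and keeps the one with the best cross-validation score.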
