Summary of Machine Learning Algorithms (I) -- Support Vector Machine

Source: Internet
Author: User
Tags: svm

After three months of self-studying machine learning I have been exposed to a variety of algorithms, but often without really understanding why they work. I therefore want to review and summarize what I have learned; this series of articles will not contain much algorithm derivation.

We know that an early classification model, the perceptron (1957), is a linear binary classifier, and it is the basis of later neural networks and support vector machines. The SVM (Support Vector Machine) is also fundamentally a binary classifier, but it has evolved to handle multi-class and non-linear problems, and it can also be applied to regression. Before deep learning became popular, it was arguably the best classification algorithm, and it still has many applications today, especially on small sample sets.

1. Perceptron Model

The perceptron is a linear binary classifier and can only handle linearly separable problems. It tries to find a hyperplane that separates the data set: in two-dimensional space this hyperplane is a straight line, and in three-dimensional space it is a plane. The perceptron's classification model is:

$$f(x) = \operatorname{sign}(w \cdot x + b)$$

where sign is the indicator function: $f(x) = +1$ when $w \cdot x + b > 0$, and $f(x) = -1$ when $w \cdot x + b < 0$; the perceptron's separating hyperplane is $w \cdot x + b = 0$.

Combining the two cases above into the single condition $y_i(w \cdot x_i + b) > 0$: sample points that satisfy this inequality are correctly classified, and points that violate it are misclassified. Our goal is to find a set of parameters $w, b$ that separates the positive and negative points in the training set.

Next we define our loss function (a loss function measures the degree of error). We could take the number of misclassified samples as the loss, but that is not a continuous function of the parameters $w, b$, so it is hard to optimize. We know that for a misclassified point, $-y_i(w \cdot x_i + b) > 0$, so instead we minimize the total distance from all misclassified points to the hyperplane (note: the perceptron loss is computed only over the misclassified points, not the entire training set):

$$-\frac{1}{\|w\|} \sum_{x_i \in M} y_i (w \cdot x_i + b)$$

where $M$ is the set of misclassified samples. Scaling $w, b$ by a constant does not change the hyperplane, only the value of $\|w\|$, so fixing $\|w\| = 1$ does not affect the result. The final perceptron loss function is:

$$L(w, b) = -\sum_{x_i \in M} y_i (w \cdot x_i + b)$$
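To make this loss and the classic stochastic-gradient update that minimizes it concrete, here is a minimal NumPy sketch of the perceptron; the toy data, learning rate, and iteration cap are illustrative assumptions, not from the original article:

```python
import numpy as np

# Toy linearly separable data; labels must be +1 / -1 (illustrative assumption).
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])

w = np.zeros(X.shape[1])
b = 0.0
eta = 1.0  # learning rate

# Pick misclassified points (y_i * (w.x_i + b) <= 0) and update
# w <- w + eta * y_i * x_i, b <- b + eta * y_i, i.e. gradient steps on the
# perceptron loss, which sums over misclassified points only.
for _ in range(100):
    errors = 0
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:
            w += eta * yi * xi
            b += eta * yi
            errors += 1
    if errors == 0:  # every point classified correctly: a separating hyperplane found
        break

print("w =", w, "b =", b)
```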

2. Support Vector Machine

In the perceptron, our goal is simply to separate the training set: any hyperplane that separates the samples meets the requirement, and there are many such hyperplanes. The support vector machine is similar in spirit to the perceptron, but it is more demanding. In classification, points far from the hyperplane are safe, while the points that are easily misclassified are those very close to the hyperplane. The idea of the support vector machine is to focus on these nearby points: in one sentence, while classifying correctly, maximize the margin between the hyperplane and the points closest to it.

Building on the perceptron above, we can express our goal as:

$$\max_{w,b} \ \gamma \qquad \text{s.t.} \quad \frac{y_i (w \cdot x_i + b)}{\|w\|} \ge \gamma, \quad i = 1, 2, \dots, N$$

Here $\gamma$ is the geometric margin from the point nearest the hyperplane to the hyperplane. Replacing the geometric margin with the functional margin, this can be rewritten as:

$$\max_{w,b} \ \frac{\hat{\gamma}}{\|w\|} \qquad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge \hat{\gamma}, \quad i = 1, 2, \dots, N$$

$\hat{\gamma}$ denotes the functional margin; its value scales along with $w$ and $b$ without affecting the final result, so we can set $\hat{\gamma} = 1$. Maximizing $1/\|w\|$ is then equivalent to minimizing $\frac{1}{2}\|w\|^2$, so our final problem can be expressed as:

$$\min_{w,b} \ \frac{1}{2} \|w\|^2 \qquad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1, \quad i = 1, 2, \dots, N$$

This leads to the first highlight of the support vector machine: margin maximization. Maximizing the margin makes the classification more confident, and the maximum-margin separating hyperplane exists and is unique.
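As a sanity check on this formulation, the hard-margin primal problem above can be handed directly to a generic convex solver. The sketch below uses cvxpy and a tiny toy data set; both the library choice and the data are illustrative assumptions, not something the original article uses:

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data with labels in {+1, -1} (illustrative assumption).
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# Primal hard-margin SVM: minimize 1/2 ||w||^2  s.t.  y_i (w . x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w* =", w.value, "b* =", b.value)
print("maximum geometric margin =", 1.0 / np.linalg.norm(w.value))
```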


In the problem above, $\frac{1}{2}\|w\|^2$ is a convex function and the constraint inequalities are affine functions, so this is a convex quadratic programming problem. According to convex optimization theory, we can use the Lagrangian to convert the constrained problem into an unconstrained one; the optimization function can be expressed as:

$$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right]$$

where the $\alpha_i$ are Lagrange multipliers, $\alpha_i \ge 0$, $i = 1, 2, \dots, N$.

According to Lagrangian duality, the primal problem $\min_{w,b} \max_{\alpha} L(w, b, \alpha)$ can be transformed into its dual problem (when strong duality holds, as it does here, the optimal solution of the dual problem yields the optimal solution of the primal problem, and the dual is generally easier to solve), which is the max-min problem:

$$\max_{\alpha} \ \min_{w, b} \ L(w, b, \alpha)$$

First, for the inner minimization, take the derivatives of $L$ with respect to $w$ and $b$ and set them to zero, which gives:

$$w = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0$$

Substituting this solution back into the Lagrangian yields the following optimization problem (the original maximization over $\alpha$ is converted, by negating the objective, into a minimization):

$$\min_{\alpha} \ \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_{i=1}^{N} \alpha_i \qquad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \quad \alpha_i \ge 0$$

So we only need to solve for $\alpha$ to obtain $w$ and $b$ (a common algorithm for solving for $\alpha$ is SMO; see https://www.cnblogs.com/pinard/p/6111471.html). Assuming the final solution is $\alpha^*$, then $w$ and $b$ can be expressed as:

$$w^* = \sum_{i=1}^{N} \alpha_i^* y_i x_i, \qquad b^* = y_j - \sum_{i=1}^{N} \alpha_i^* y_i (x_i \cdot x_j)$$

where $j$ is the index of any sample with $\alpha_j^* > 0$.

Introducing the KKT conditions:

$$\alpha_i^* \ge 0, \qquad y_i (w^* \cdot x_i + b^*) - 1 \ge 0, \qquad \alpha_i^* \left[ y_i (w^* \cdot x_i + b^*) - 1 \right] = 0$$

From the KKT conditions we can see that when $y_i (w^* \cdot x_i + b^*) - 1 > 0$, then $\alpha_i^* = 0$; and when $\alpha_i^* > 0$, then $y_i (w^* \cdot x_i + b^*) - 1 = 0$.

Combining this with the expressions for $w$ and $b$ above leads to the second highlight of the support vector machine: the parameters $w, b$ depend only on the samples that satisfy $y_i (w^* \cdot x_i + b^*) - 1 = 0$. These sample points are the ones closest to the maximum-margin hyperplane, and we call them support vectors. This is why support vector machines often perform well on small sample sets. (Note also that the number of $\alpha$ variables equals the number of training samples, so a large training set means a large number of parameters to solve for; this is why SVMs are slower than other common machine learning algorithms on large training sets.)
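To see the support vectors and the relation $w^* = \sum_i \alpha_i^* y_i x_i$ in practice, here is a short sketch using scikit-learn's SVC, whose underlying solver (libsvm) uses an SMO-type algorithm; the data set and the large C value (to approximate the hard margin) are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs; a very large C approximates the hard-margin SVM.
X, y = make_blobs(n_samples=40, centers=2, random_state=0)
y = np.where(y == 0, -1, 1)  # labels in {-1, +1}

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# Only a handful of training points end up as support vectors.
print("support vectors:", clf.support_vectors_.shape[0], "out of", len(X))

# dual_coef_ holds alpha_i * y_i for the support vectors, so
# w = sum_i alpha_i y_i x_i can be reconstructed and compared with coef_.
w = clf.dual_coef_ @ clf.support_vectors_
print("reconstructed w:", w.ravel())
print("sklearn coef_  :", clf.coef_.ravel())
```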

3. Soft Margin Maximization

The training set usually contains some outliers, and these outliers can make the training set linearly inseparable even though the remaining samples are linearly separable once the outliers are removed. The hard-margin maximization described above cannot handle this case. Linear inseparability here means that some sample points cannot satisfy the constraint that the functional margin be greater than or equal to 1, so we introduce a slack variable $\xi_i \ge 0$ for each sample $(x_i, y_i)$, and the constraint becomes:

$$y_i (w \cdot x_i + b) \ge 1 - \xi_i$$

The objective function adds a penalty term on the slack variables, with penalty parameter $C > 0$, and becomes:

$$\min_{w, b, \xi} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i$$

So the complete primal problem can be written as:

$$\min_{w, b, \xi} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i \qquad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \dots, N$$

Using the same approach as before, we use the Lagrangian to convert the constrained problem into an unconstrained one and transform the primal into the dual max-min problem, which gives the final result:

$$\min_{\alpha} \ \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_{i=1}^{N} \alpha_i \qquad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C$$

The only difference from the result in Section 2 is that the range of $\alpha$ now has an upper bound of $C$. The description of the support vectors is also more complicated under soft-margin maximization: a support vector can lie on the margin boundary, between the margin boundary and the separating hyperplane, or on the misclassified side of the separating hyperplane.
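The penalty parameter $C$ controls the trade-off between a wide margin and margin violations: a small $C$ tolerates more violations (and yields more support vectors), while a large $C$ approaches the hard margin. A small sketch of this effect, with an assumed noisy toy data set:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping blobs, so some points inevitably violate the margin.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=3.0, random_state=1)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:<7} support vectors: {clf.support_vectors_.shape[0]}")
# The support-vector count typically shrinks as C grows, since fewer points
# are allowed inside or beyond the margin.
```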


4. Hinge Loss Function

The hinge loss function (so called because its graph looks like an open hinge) has the following expression:

$$L\big(y (w \cdot x + b)\big) = \big[ 1 - y (w \cdot x + b) \big]_+, \qquad \text{where } [z]_+ = \max(0, z)$$

Therefore, the optimization problem above can equivalently be described as:

$$\min_{w, b} \ \sum_{i=1}^{N} \big[ 1 - y_i (w \cdot x_i + b) \big]_+ + \lambda \|w\|^2$$

The first term of this objective can be understood as follows: when a sample is correctly classified and its functional margin is greater than 1, i.e. $y_i (w \cdot x_i + b) \ge 1$, the loss is 0; when $y_i (w \cdot x_i + b) < 1$, the loss is $1 - y_i (w \cdot x_i + b)$. Note that even a correctly classified sample still contributes to the loss when its margin is less than 1; this reflects the strictness of the support vector machine.
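To make this concrete, here is a minimal NumPy sketch of the regularized hinge-loss objective above; the toy data, the chosen hyperplane, and the value of $\lambda$ are illustrative assumptions:

```python
import numpy as np

def svm_objective(w, b, X, y, lam):
    """Regularized hinge loss: sum_i [1 - y_i (w.x_i + b)]_+ + lam * ||w||^2."""
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)  # zero only when the margin is >= 1
    return hinge.sum() + lam * np.dot(w, w)

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])

# This hyperplane classifies every point correctly (all margins are positive),
# but the margins are below 1, so the hinge term is still positive.
print(svm_objective(np.array([0.25, 0.25]), -1.0, X, y, lam=0.1))
```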

Below is a comparison of the hinge loss function with some other common loss functions:

[Figure: comparison of the hinge loss with other loss functions.]
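As a rough stand-in for the figure, the following matplotlib sketch plots several common losses as functions of the margin $y(w \cdot x + b)$; the particular set of losses shown is an illustrative choice, not taken from the original figure:

```python
import numpy as np
import matplotlib.pyplot as plt

m = np.linspace(-3, 3, 400)  # margin y * (w.x + b)

plt.plot(m, (m <= 0).astype(float), label="0-1 loss")
plt.plot(m, np.maximum(0.0, 1.0 - m), label="hinge loss")
plt.plot(m, np.log2(1.0 + np.exp(-m)), label="logistic loss")
plt.plot(m, np.square(np.maximum(0.0, 1.0 - m)), label="squared hinge")
plt.xlabel("margin y(w.x + b)")
plt.ylabel("loss")
plt.legend()
plt.show()
```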

5. Linearly Inseparable Problems

The soft-margin maximization above can only handle inseparability caused by outliers; it is powerless when the data set itself is non-linear. According to the relevant theory, a problem that is linearly inseparable in a low-dimensional space generally becomes linearly separable after being mapped to a higher-dimensional space. We can apply this idea to the support vector machine: introduce a function $\phi(x)$ that maps the sample set from the current dimension to a higher dimension. Looking back at our previous optimization function:

$$\min_{\alpha} \ \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_{i=1}^{N} \alpha_i$$

All we need to do is replace the inner product $x_i \cdot x_j$ in the optimization function with $\phi(x_i) \cdot \phi(x_j)$ to solve our non-linear problem. But this introduces a new problem: the data dimension increases, so the cost of computing the inner product increases as well, and when the mapped dimension is very high, or even infinite, the computation becomes prohibitively expensive. How do we deal with this? This is where the kernel function comes in.

We know that even after mapping to the high-dimensional space, the inner product $\phi(x_i) \cdot \phi(x_j)$ is still just a scalar. So is there a function that can compute it directly in the original space, i.e. $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$? When such a function exists (Mercer's theorem gives the conditions), we call it a kernel function.

This is the third highlight of the support vector machine: the kernel function solves the non-linear classification problem without the need to explicitly map the samples into the high-dimensional space.
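A concrete way to see this: for the degree-2 polynomial kernel $K(x, z) = (x \cdot z)^2$ on $\mathbb{R}^2$, the explicit feature map $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$ gives exactly the same inner product, but the kernel never builds the mapped vectors. This specific kernel and map are a standard textbook illustration, used here as an assumed example:

```python
import numpy as np

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel on R^2."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

def poly2_kernel(x, z):
    """K(x, z) = (x . z)^2, computed without leaving the original 2-D space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

print(np.dot(phi(x), phi(z)))  # inner product in the mapped 3-D space -> 121.0
print(poly2_kernel(x, z))      # same value, computed in the original space -> 121.0
```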

Kernel functions solve our problem, but of course not just any function can serve as a kernel; the functions proven to be valid kernels are not many, and only a few are in common use. Next we introduce the common kernel functions:

1) Linear kernel function

The linear kernel function is easy to understand and can only handle linear problems; its expression is:

$$K(x, z) = x \cdot z$$

With it, linear SVM and non-linear SVM can be handled in a unified framework, switching between them simply by choosing the kernel function.

2) Polynomial kernel function

$$K(x, z) = (a \, x \cdot z + c)^d$$

The values of $a$, $c$ and $d$ need to be set by the user.

3) Radial basis kernel function

The radial basis kernel function is also called the Gaussian kernel function:

$$K(x, z) = \exp\left( -\frac{\|x - z\|^2}{2 \sigma^2} \right)$$

It has fewer parameters; only $\sigma$ needs to be set.

4) Sigmoid kernel function

$$K(x, z) = \tanh(a \, x \cdot z + r)$$

The parameters $a$ and $r$ need to be tuned; tanh is the hyperbolic tangent function, which is also commonly used as an activation function in neural networks.

From the Taylor expansion we know that higher-order functions can be represented by polynomial functions; both the radial basis kernel and the sigmoid kernel are higher-order functions, so they can be expressed in terms of polynomials. In other words, the radial basis and sigmoid kernels can represent higher-order polynomials. Therefore, when selecting a kernel function, the radial basis kernel and the sigmoid kernel are usually preferable to the polynomial kernel, because the order is matched automatically without us having to specify the degree of a polynomial kernel, and the polynomial kernel has more parameters to tune; in practice the radial basis kernel is usually chosen.
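In practice the choice of kernel is usually settled empirically, for example by cross-validation. A short scikit-learn sketch comparing the common kernels on an assumed non-linear toy data set (the data set, C, and the other settings are illustrative choices):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two interleaving half-moons: linearly inseparable in the original 2-D space.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel, C=1.0, gamma="scale")
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{kernel:<8} mean CV accuracy: {score:.3f}")
# The RBF kernel typically scores highest here, consistent with the advice above.
```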

6. Summary

In general, SVM was essentially the best classification algorithm before ensemble methods and neural networks became popular, and even today it still occupies an important position.

The main advantages of SVM are:

1) The introduction of margin maximization gives high classification accuracy.

2) It can classify accurately even when the sample size is small, and it generalizes well.

3) The introduction of kernel functions makes it easy to handle non-linear problems.

4) It can handle classification and regression with high-dimensional features, and performs well even when the feature dimension is larger than the number of samples.

The main disadvantages of SVM are:

1) When the sample size is very large, both the computation of the kernel inner products and the solution of the Lagrange multipliers $\alpha$ scale with the number of samples, making the model expensive to train.

2) There is usually no clear guidance for choosing a kernel function, so it can be difficult to select a suitable one, and kernels such as the polynomial kernel have many parameters to tune.

3) SVM is sensitive to missing data (although many algorithms are sensitive to missing values; handle them in feature engineering, or use a tree-based model instead).
