Visual Machine Learning Reading Notes: The SVM Method

Source: Internet
Author: User
Tags: svm

SVM is a supervised statistical learning method that minimizes empirical error while maximizing the geometric margin. It is also known as the maximum-margin classifier and can be used for both classification and regression analysis.

1. Basic Principles

SVM is a machine learning procedure that searches a high-dimensional space for a separating hyperplane that divides the sample points of different classes so that the margin between the classes is as large as possible. This hyperplane is the maximum-margin hyperplane, and the corresponding classifier is called the maximum-margin classifier. For a two-class problem, the spatial picture of SVM can be described as follows.

Assume the data samples are $x_1, x_2, \dots, x_n$. The separating hyperplane can be written as $w^T x - b = 0$, where $x$ is a point on the separating hyperplane, $w$ is a vector perpendicular to the hyperplane, and $b$ is an offset that allows the hyperplane not to pass through the origin, which makes it more flexible.

To obtain the maximum margin between the two classes, we need the support vectors in the training sample and the two hyperplanes that are parallel to the separating hyperplane and pass through the support vectors closest to it. These two hyperplanes can be expressed as:

$w^T x - b = 1$

$w^T x - b = -1$

Here $w$ is the normal vector of the separating hyperplane; its length is not yet fixed. The constants $1$ and $-1$ are chosen purely for computational convenience; any other pair of constants would do, as long as they are the negatives of each other.

If the given training samples are linearly separable, two such parallel hyperplanes can be found with the largest possible spacing and with no training samples between them. The distance between them is $2/\|w\|$, so minimizing $\|w\|$ maximizes the spacing between the two hyperplanes.

To keep all training sample points outside the band between these two parallel hyperplanes, we require that every training point $x_1, x_2, \dots, x_n$ satisfy one of the following conditions:

$w^T x_i - b \ge 1$

$w^T x_i - b \le -1$

With labels $y_i \in \{+1, -1\}$ for the two classes, these two conditions can be written compactly as $y_i (w^T x_i - b) \ge 1$.
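As a quick illustration of this geometry, the short sketch below evaluates the decision value $w^T x - b$ and checks the margin constraint for a few labeled points. The values of $w$, $b$, and the sample points are made-up numbers for illustration, not learned quantities.

```python
import numpy as np

# Hypothetical hyperplane parameters, chosen only to illustrate the formulas above.
w = np.array([1.0, -1.0])   # normal vector of the separating hyperplane
b = 0.0                     # offset

def decision(x):
    """Signed value of w^T x - b; the sign gives the predicted class."""
    return np.dot(w, x) - b

points = np.array([[2.0, -1.0], [-1.5, 2.0], [0.4, -0.7]])
labels = np.array([1, -1, 1])

for x_i, y_i in zip(points, labels):
    margin = y_i * decision(x_i)          # functional margin of the point
    print(x_i, "functional margin:", margin, "satisfies y_i(w^T x_i - b) >= 1:", margin >= 1)
```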

2. Algorithm Improvements

The objective function and constraints of the problem described above are as follows:

$\max \; \dfrac{1}{\|w\|}$

$\text{s.t.}\quad y_i (w^T x_i + b) \ge 1, \quad i = 1, \dots, n$

Maximizing $1/\|w\|$ is equivalent to minimizing $\frac{1}{2}\|w\|^2$, so the problem above can be rewritten as the following constrained optimization problem:

$\min \; \dfrac{1}{2}\|w\|^2$

$\text{s.t.}\quad y_i (w^T x_i + b) \ge 1, \quad i = 1, \dots, n$

Cast in this standard convex optimization form, the SVM problem can be solved directly. Because the objective function is quadratic and the constraints are linear, this is a convex quadratic programming (QP) problem. Although it is a standard QP problem, it also has special structure: by applying the Lagrange dual transformation and working with the dual variables, we obtain a more efficient way to solve it.
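Since the primal problem is a small convex QP, it can in principle be handed to any general-purpose constrained solver. The sketch below is one minimal, hypothetical way to do so with SciPy's SLSQP solver on toy, linearly separable data; the data, the solver choice, and the starting point are illustrative assumptions, not part of the original notes.

```python
import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable data (an assumption for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2.0, size=(20, 2)), rng.normal(loc=+2.0, size=(20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
d = X.shape[1]

def objective(z):
    w = z[:d]
    return 0.5 * np.dot(w, w)            # (1/2)||w||^2; the bias z[-1] is not penalized

def margin_constraints(z):
    w, b = z[:d], z[-1]
    return y * (X @ w + b) - 1.0         # SLSQP 'ineq' constraints must be >= 0

res = minimize(objective,
               x0=np.zeros(d + 1),
               method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
w_opt, b_opt = res.x[:d], res.x[-1]
print("w =", w_opt, "b =", b_opt, "margin width =", 2.0 / np.linalg.norm(w_opt))
```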

By attaching a Lagrange multiplier to each constraint, that is, by introducing the multipliers $\alpha_i$, the constraints can be folded into the objective through the Lagrangian function:

$L(w, b, \alpha) = \dfrac{1}{2}\|w\|^2 - \sum_{i} \alpha_i \left( y_i (w^T x_i + b) - 1 \right)$

Let $\theta(w) = \max_{\alpha_i \ge 0} L(w, b, \alpha)$.

If some constraint is not satisfied, e.g. $y_i (w^T x_i + b) < 1$, then $\theta(w) = \infty$, since the corresponding $\alpha_i$ can be made arbitrarily large.

When all constraints are satisfied, $\theta(w) = \frac{1}{2}\|w\|^2$, which is exactly the quantity we originally wanted to minimize.

Minimizing $\frac{1}{2}\|w\|^2$ subject to the constraints is therefore equivalent to minimizing $\theta(w)$: whenever a constraint is violated, $\theta(w)$ equals infinity, which cannot be the minimum. The objective function thus becomes

$\min_{w,b} \; \theta(w) = \min_{w,b} \; \max_{\alpha_i \ge 0} L(w, b, \alpha)$

Let $p^*$ denote the optimal value of this problem; it is equal to the optimal value of the initial problem. Swapping the positions of the min and the max gives

$\max_{\alpha_i \ge 0} \; \min_{w,b} L(w, b, \alpha)$

After the swap, the problem is no longer equivalent to the original one. Let $q^*$ denote the optimal value of this new problem; then $q^* \le p^*$, because the largest of the minima is never larger than the smallest of the maxima. The optimal value $q^*$ of the second problem therefore provides a lower bound on the optimal value $p^*$ of the first problem; the two are equal when certain conditions hold, in which case the first problem can be solved indirectly by solving the second.

Therefore we can first minimize $L$ with respect to $w$ and $b$, and then maximize the result with respect to $\alpha$. Transforming the min-max primal problem with value $p^*$ into the max-min dual problem with value $q^*$ is worthwhile because $q^*$ approximates (and under the right conditions equals) $p^*$, and the problem is easier to solve after it is converted to its dual.

This transformation typically requires the KKT conditions to hold. A general optimization problem can be expressed in the following standard form:

$\min \; f(x)$

$\text{s.t.}\quad h_j(x) = 0, \quad j = 1, \dots, p$

$\qquad\;\; g_k(x) \le 0, \quad k = 1, \dots, q$

$\qquad\;\; x \in X \subset \mathbb{R}^n$

where $f(x)$ is the function to be minimized, $h_j(x)$ are the equality constraints, $g_k(x)$ are the inequality constraints, and $p$ and $q$ are the numbers of equality and inequality constraints, respectively.

Suppose $X \subset \mathbb{R}^n$ is a convex set and $f: X \to \mathbb{R}$ is a convex function. Convex optimization seeks a point $x^* \in X$ such that $f(x^*) \le f(x)$ for every $x \in X$.

The KKT conditions are the necessary (and, for convex problems such as this one, sufficient) optimality conditions for a nonlinear programming problem. They state that a minimizer $x^*$ of the standard-form problem above must satisfy the following conditions:

$h_j(x^*) = 0, \; j = 1, \dots, p; \qquad g_k(x^*) \le 0, \; k = 1, \dots, q$

together with multipliers $\lambda_j$ and $\mu_k \ge 0$ satisfying $\mu_k \, g_k(x^*) = 0$.
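For reference, the full set of KKT conditions for the standard form above, including the stationarity condition that is usually stated alongside the conditions just listed, can be written as follows.

```latex
\begin{aligned}
&\text{stationarity:} && \nabla f(x^*) + \sum_{j=1}^{p} \lambda_j \nabla h_j(x^*) + \sum_{k=1}^{q} \mu_k \nabla g_k(x^*) = 0 \\
&\text{primal feasibility:} && h_j(x^*) = 0, \quad g_k(x^*) \le 0 \\
&\text{dual feasibility:} && \mu_k \ge 0 \\
&\text{complementary slackness:} && \mu_k \, g_k(x^*) = 0
\end{aligned}
```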

It can be checked that the problem here satisfies the KKT conditions, so it can be converted into, and solved as, the second problem. In other words, under these conditions the original problem has been transformed into its dual problem.

Solving this dual problem proceeds in three steps: first minimize $L(w, b, \alpha)$ with respect to $w$ and $b$, then maximize the result with respect to $\alpha$, and finally use the SMO algorithm to solve for the dual variables.
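The first two steps can be done in closed form. Setting the partial derivatives of $L(w, b, \alpha)$ with respect to $w$ and $b$ to zero and substituting back yields the dual problem; the standard computation is spelled out below.

```latex
\begin{aligned}
\frac{\partial L}{\partial w} = 0 \;&\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0, \\
\text{so that}\quad
\max_{\alpha}\;& \sum_{i=1}^{n} \alpha_i
 - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \\
\text{s.t.}\quad & \sum_{i=1}^{n} \alpha_i y_i = 0, \qquad \alpha_i \ge 0, \quad i = 1, \dots, n.
\end{aligned}
```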

Before turning to the linearly non-separable case, we first restate the classification rule given by the separating hyperplane. To classify a data point $x$, we substitute it into $f(x) = w^T x + b$ and assign the class according to the sign of $f(x)$.

From the derivation above we know that $w = \sum_i \alpha_i y_i x_i$.

The classification function can therefore be expressed as: $f(x) = \left( \sum_i \alpha_i y_i x_i \right)^{T} x + b = \sum_i \alpha_i y_i \langle x_i, x \rangle + b$

Observe that predicting the label of a new point $x$ only requires computing its inner products with the training data points. This is important: it is the basic premise for the nonlinear generalization via kernel functions. Moreover, the coefficients $\alpha_i$ of all non-support vectors are zero, so the inner products for a new point actually only need to be computed against the small number of support vectors, not against all training data.
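A minimal sketch of this prediction rule is shown below. The training points, multipliers $\alpha_i$, and bias $b$ are hypothetical values chosen for illustration (in practice they come from solving the dual problem); the point is simply that only the entries with $\alpha_i > 0$, the support vectors, contribute.

```python
import numpy as np

# Hypothetical "trained" quantities, for illustration only.
X_train = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, -0.5]])
y_train = np.array([1.0, 1.0, -1.0, -1.0])
alpha   = np.array([0.6, 0.0, 0.6, 0.0])   # zero entries correspond to non-support vectors
b       = 0.1

def f(x):
    # f(x) = sum_i alpha_i * y_i * <x_i, x> + b; only support vectors contribute.
    sv = alpha > 0
    return np.sum(alpha[sv] * y_train[sv] * (X_train[sv] @ x)) + b

x_new = np.array([0.5, 1.0])
print("f(x) =", f(x_new), "-> predicted class:", int(np.sign(f(x_new))))
```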

So far we have assumed the data are linearly separable, in which case a separating hyperplane exists that splits the two classes completely. To handle nonlinear data, the linear SVM can be generalized with the kernel method: a mapping $\phi(\cdot)$ sends the raw data into a high-dimensional space, where linear separation becomes much more likely. Even so, some cases remain difficult, not because the data are intrinsically nonlinear but because they are noisy. Data points that deviate from their normal positions are called outliers. In the original SVM model, outliers can have a large impact, because the hyperplane itself is determined by a small number of support vectors; if an outlier happens to be among those support vectors, the effect is substantial.
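The kernel idea can be sketched as follows: rather than computing the mapping $\phi$ explicitly, a kernel function returns the inner product of two points in the feature space, which is all the dual problem and the classification function need. The Gaussian (RBF) kernel and the $\gamma$ parameter below are standard choices used purely for illustration.

```python
import numpy as np

def linear_kernel(x1, x2):
    # <x1, x2>: recovers the ordinary linear SVM.
    return np.dot(x1, x2)

def rbf_kernel(x1, x2, gamma=0.5):
    # exp(-gamma * ||x1 - x2||^2): inner product in an implicit high-dimensional
    # feature space, without ever computing phi explicitly.
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def gram_matrix(X, kernel):
    # K[i, j] = kernel(x_i, x_j); this matrix is all the dual problem needs.
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K
```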

In this setting, the coefficient associated with a point lying outside the margin is 0; for a point on the margin boundary it lies in $[0, 1/l]$, where $l$ is the number of training samples (the data set size); and for outliers and points inside the margin it equals $1/l$. To allow such deviations, the original constraint becomes:

$y_i (w^T x_i + b) \ge 1 - \xi_i, \quad i = 1, \dots, n$

Here $\xi_i \ge 0$ is called a slack variable and corresponds to the amount by which the data point $x_i$ is allowed to deviate. If the $\xi_i$ were allowed to be arbitrarily large, any hyperplane would satisfy the constraints, so a term is appended to the original objective so that the $\xi_i$ are also kept as small as possible, namely:

$\min \; \dfrac{1}{2}\|w\|^2 + C \sum_{i} \xi_i$

where $C$ is a parameter that controls the trade-off between the two terms in the objective (finding the hyperplane with the largest margin versus keeping the deviations of the data points small).

The $\xi_i$ are variables to be optimized, while $C$ is a constant fixed in advance. The complete problem is therefore:

$\min \; \dfrac{1}{2}\|w\|^2 + C \sum_{i} \xi_i$


$\text{s.t.}\quad y_i (w^T x_i + b) \ge 1 - \xi_i, \quad i = 1, \dots, n$

$\qquad\;\; \xi_i \ge 0, \quad i = 1, \dots, n$
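As an aside, this soft-margin problem is equivalent to the unconstrained problem of minimizing $\frac{1}{2}\|w\|^2 + C\sum_i \max(0,\, 1 - y_i(w^T x_i + b))$, since at the optimum each $\xi_i$ equals the hinge loss of point $i$. The sketch below trains a linear soft-margin SVM by sub-gradient descent on that form; the learning rate and epoch count are arbitrary illustrative choices, and this is an alternative to the dual/SMO route that the notes follow.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Sub-gradient descent on (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1.0                               # points whose slack xi_i would be positive
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```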

Adding the new constraints to the objective via Lagrange multipliers (with multipliers $\alpha_i$ for the margin constraints and $\mu_i$ for $\xi_i \ge 0$) gives a new Lagrangian function:

$L(w, b, \xi, \alpha, \mu) = \dfrac{1}{2}\|w\|^2 + C \sum_{i} \xi_i - \sum_{i} \alpha_i \left( y_i (w^T x_i + b) - 1 + \xi_i \right) - \sum_{i} \mu_i \xi_i$

The analysis is the same as before: after converting to the dual problem, first minimize $L$ with respect to $w$, $b$, and $\xi$, which gives

$w = \sum_{i} \alpha_i y_i x_i, \qquad \sum_{i} \alpha_i y_i = 0, \qquad C - \alpha_i - \mu_i = 0.$

Substituting $w$ back into $L$ reduces it to the same dual objective as before, namely

$\max_{\alpha} \; \sum_{i} \alpha_i - \dfrac{1}{2}\sum_{i}\sum_{j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$

Since $C - \alpha_i - \mu_i = 0$ and the Lagrange multipliers must satisfy $\mu_i \ge 0$, it follows that $\alpha_i \le C$.

So the whole dual problem can be expressed as:

$\max_{\alpha} \; \sum_{i} \alpha_i - \dfrac{1}{2}\sum_{i}\sum_{j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$

$\text{s.t.}\quad 0 \le \alpha_i \le C, \quad i = 1, \dots, n$

$\qquad\;\; \sum_{i} \alpha_i y_i = 0$

Comparing with the hard-margin dual, the only difference is that the dual variables $\alpha_i$ now have an upper bound $C$. The nonlinear kernel form is obtained in exactly the same way: replacing $\langle x_i, x_j \rangle$ with $K(x_i, x_j)$ yields a support vector machine that handles both linear and nonlinear data and can tolerate noise and outliers.
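To make this final dual concrete, the sketch below hands it to a general-purpose QP solver (cvxopt is used here as one possible choice) together with an RBF kernel. The helper names, the $\gamma$ value, and the way the bias is recovered are illustrative assumptions; dedicated solvers such as SMO are normally used instead of a generic QP solver.

```python
import numpy as np
from cvxopt import matrix, solvers

def rbf_kernel_matrix(X, gamma=0.5):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

def fit_dual_svm(X, y, C=1.0, gamma=0.5):
    """Solve max sum(alpha) - (1/2) alpha^T (yy^T * K) alpha
    subject to 0 <= alpha_i <= C and sum_i alpha_i y_i = 0."""
    n = X.shape[0]
    K = rbf_kernel_matrix(X, gamma)
    P = matrix(np.outer(y, y) * K)                    # quadratic term of the (minimization form of the) dual
    q = matrix(-np.ones(n))                           # linear term (sign flipped: max -> min)
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))    # encodes 0 <= alpha_i <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1).astype(float))        # equality constraint y^T alpha = 0
    b = matrix(0.0)
    alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).flatten()
    # Recover the bias from a free support vector (0 < alpha_k < C).
    k = int(np.argmax((alpha > 1e-6) & (alpha < C - 1e-6)))
    bias = y[k] - np.sum(alpha * y * K[:, k])
    return alpha, bias
```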

3. Experiments
