The support vector machine (SVM) was first proposed by Cortes and Vapnik in 1995. It has unique advantages in small-sample, non-linear, and high-dimensional pattern recognition, and it can also be applied to other machine learning problems such as function fitting.
I. Mathematics
1.1 Two-Dimensional Space
A typical application of SVM is classification. It solves this kind of problem: some things can be classified, but we cannot clearly state the rule by which to classify them. For example, suppose the triangles are class C1 and the circles are class C2; this much is known. Now a square appears. Does the square belong to C1 or C2? It is not clear. The SVM algorithm helps you settle this question.
In a two-dimensional space (here each sample has two attributes), SVM draws a line g(x) = 0 between C1 and C2: points above the line belong to class C1, points below it belong to class C2. Now, when the square comes along, we have a rule to follow.
A few more words about g(x) = 0. The x in g(x) is not an abscissa but a vector, and w is not a slope as in analytic geometry but also a vector; w·x is an inner product, so g(x) = w·x + b. For example, the straight line y = -x - b of analytic geometry, rewritten in vector notation, becomes w·x + b = 0 with w = (1, 1) and x = (x, y).
For points in the C1 class: g(x) > 0; for points in the C2 class: g(x) < 0.
Suppose we use y to represent the class label: +1 represents class C1, and -1 represents class C2.
So for all training samples we have y_i · g(x_i) > 0; any g(x) = 0 satisfying this correctly cuts all training samples, and such a line is usable.
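As a minimal numeric sketch of this separability condition (the toy points and the candidate line below are invented for illustration):

```python
import numpy as np

# Toy samples: each x_i is a 2-D point, y_i is +1 (class C1) or -1 (class C2).
X = np.array([[2.0, 3.0], [3.0, 3.5], [0.0, 0.5], [1.0, 0.0]])
y = np.array([+1, +1, -1, -1])

def g(w, b, x):
    """g(x) = w . x + b"""
    return np.dot(w, x) + b

def separates(w, b, X, y):
    """The line g(x) = 0 correctly cuts all samples iff y_i * g(x_i) > 0 for every i."""
    return all(yi * g(w, b, xi) > 0 for xi, yi in zip(X, y))

print(separates(np.array([1.0, 1.0]), -3.0, X, y))  # True: this line works
```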
But merely usable is not enough: there are far too many lines g(x) = 0 meeting this condition, and to pursue perfection we want the best one. What counts as optimal? Intuition tells us that the line g(x) = 0 that leans toward neither C1 nor C2 should be the best. Exactly: the technical name for this notion is the classification margin (in the original figure, the length of the red segment).
In a two-dimensional space, finding the classification margin reduces to finding the distance from a point to a line, which can be written as |g(x)| / ||w||. To simplify the calculation, the whole two-dimensional space is normalized (scaled up or down proportionally) so that all samples satisfy |g(x)| >= 1; that is, the training samples of C1 and C2 closest to g(x) = 0 have |g(x)| = 1. The classification margin is then 2 / ||w||. The larger the margin the better, so the smaller ||w|| the better.
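A short sketch of the normalization and the resulting margin, reusing the toy line from the previous sketch (values are illustrative):

```python
import numpy as np

w, b = np.array([1.0, 1.0]), -3.0
X = np.array([[2.0, 3.0], [3.0, 3.5], [0.0, 0.5], [1.0, 0.0]])

# Distance from each sample to the line g(x) = 0 is |g(x)| / ||w||.
print(np.abs(X @ w + b) / np.linalg.norm(w))

# Rescale (w, b) so the nearest samples satisfy |g(x)| = 1 ...
scale = np.min(np.abs(X @ w + b))
w, b = w / scale, b / scale

# ... then the classification margin is 2 / ||w||.
print(2.0 / np.linalg.norm(w))
```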
1.2 Multi-Dimensional Space
Now we have abstracted a mathematical problem in the two-dimensional space: find the g(x) = 0 that satisfies the following conditions:

minimize (1/2)||w||^2, subject to y_i (w·x_i + b) >= 1 for all i.
That is, find the w that attains the minimum while meeting the constraints. In a two-dimensional space, w can be thought of as the slope: once the samples are fixed and the slope is determined, b can also be determined, and the whole line g(x) = w·x + b = 0 is fixed.
So far we have only discussed two-dimensional space, but we are pleasantly surprised to find that the conclusions extend easily to multi-dimensional space. For example:
In a multi-dimensional space we can still express the cutting plane (hyperplane) as g(x) = w·x + b = 0.
The distance from a point to the surface in a multi-dimensional space can still be expressed as r = g(x) / ||w||. Write x = x_p + r · w/||w||, where x_p is the projection of x onto the surface and r is the distance from x to the surface. A simple derivation is as follows:

The vector w is perpendicular to the plane, so substituting the preceding formula into g and simplifying gives g(x) = w·(x_p + r · w/||w||) + b = g(x_p) + r·||w|| = r·||w||, since g(x_p) = 0. Therefore the distance from the point x to the plane is r = g(x) / ||w||, the same as in the two-dimensional space.
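The derivation can be checked numerically; the hyperplane and point in this sketch are arbitrary:

```python
import numpy as np

w = np.array([1.0, 2.0, 2.0])   # normal vector of the hyperplane w . x + b = 0
b = -3.0
x = np.array([4.0, 1.0, 2.0])   # an arbitrary point off the plane

r = (np.dot(w, x) + b) / np.linalg.norm(w)   # claimed signed distance g(x) / ||w||

# Cross-check: split x as x = x_p + r * w/||w|| and verify both parts.
x_p = x - r * w / np.linalg.norm(w)
print(np.dot(w, x_p) + b)          # ~0, so x_p indeed lies on the plane
print(np.linalg.norm(x - x_p), r)  # both 7/3: the gap really is r
```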
Now we extend SVM from two-dimensional space to multi-dimensional space, that is, find the g(x) = 0 that satisfies:

minimize (1/2)||w||^2, subject to y_i (w·x_i + b) >= 1, i = 1, ..., n.
1.3 Lagrange Multipliers
This is a typical constrained extremum problem. The objective function is quadratic and the constraints are linear: a quadratic programming problem. The general method for solving a quadratic program is to introduce Lagrange multipliers and construct the Lagrangian (in theory some additional mathematical conditions must hold for the Lagrange method to apply; we skip them here).
The detailed solution process is as follows:
1. Construct the Lagrangian:

L(w, b, α) = (1/2)||w||^2 - Σ_i α_i [y_i (w·x_i + b) - 1], with α_i >= 0,

where w and b are the unknowns.
2. Take the partial derivatives with respect to w and b and set them to 0:

∂L/∂w = 0 and ∂L/∂b = 0, that is, w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0.
3. Substitute these back into the Lagrangian to obtain the Lagrangian dual problem, converting the problem into solving:

maximize W(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i·x_j), subject to α_i >= 0 and Σ_i α_i y_i = 0.
4. Finally, the problem reduces to finding the α that solves this dual problem; w is then recovered from w = Σ_i α_i y_i x_i, and b from the condition y_i (w·x_i + b) = 1 on the closest samples.
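For illustration, this dual can be handed to a general-purpose optimizer; the sketch below uses scipy's SLSQP on the toy data from earlier (this is not how libsvm solves it, just a way to watch α, w, and b emerge):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 3.0], [3.0, 3.5], [0.0, 0.5], [1.0, 0.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
Q = (y[:, None] * y[None, :]) * (X @ X.T)    # Q_ij = y_i y_j (x_i . x_j)

# Dual: maximize sum(a) - 1/2 a^T Q a  s.t.  a >= 0, sum(a_i y_i) = 0,
# written below as a minimization.
res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(),
               x0=np.zeros(len(y)),
               jac=lambda a: Q @ a - 1.0,
               bounds=[(0.0, None)] * len(y),
               constraints=[{"type": "eq", "fun": lambda a: a @ y}],
               method="SLSQP")
alpha = res.x

w = (alpha * y) @ X                          # w = sum_i a_i y_i x_i
sv = alpha > 1e-6                            # support vectors have a_i > 0
b = np.mean(y[sv] - X[sv] @ w)               # from y_i (w . x_i + b) = 1
print(w, b)
```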
1.4 Non-Linearity
Well, now let us sort out the SVM classification logic: find a cutting surface (line) in the space that separates the sample points; the optimality criterion for the cutting surface (line) is to maximize the classification margin, which is computed from the distance between points and the plane (straight line). The question is: are all cutting surfaces flat and all cutting lines straight? Apparently not.
For example, suppose the feature is the area x of a house, where x is a real number, and y is the price of the house. If, from the distribution of the sample points, we see that x and y follow a cubic curve, then we want to approximate these sample points with a cubic polynomial of x.
In the two-dimensional space this is non-linear, so the previous reasoning does not apply; the distance from a point to a curve? I would not know how to compute it. However, if x is mapped to a three-dimensional space, say x → (x, x^2, x^3), the problem becomes linear. That is to say, a non-linear line (surface) in a low-dimensional space can become linear when mapped into a high-dimensional space. Therefore we make a small correction to the problem. Denoting the mapping by φ, the problem we face becomes:

maximize W(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j φ(x_i)·φ(x_j), subject to α_i >= 0 and Σ_i α_i y_i = 0.
A kernel function K(x_i, x_j) = φ(x_i)·φ(x_j) is introduced here to handle the non-linearity of the sample space: it computes the inner product in the high-dimensional space without ever constructing the mapping explicitly.
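A quick sketch of why this works, using the polynomial kernel K(u, v) = (u·v)^2 and its explicit 3-D feature map (both chosen purely for illustration):

```python
import numpy as np

def phi(x):
    """Explicit map to a 3-D feature space: (x1^2, sqrt(2) x1 x2, x2^2)."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def K(u, v):
    """Kernel that computes phi(u) . phi(v) without leaving the 2-D space."""
    return np.dot(u, v) ** 2

u, v = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.dot(phi(u), phi(v)), K(u, v))   # identical values (16.0 and 16.0)
```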
1.5 Slack Variables
The above is a complete derivation, but experience shows that if these conditions are handed to a computer to solve, there is basically no solution, because the conditions are too harsh. In fact, the most common situation is that the samples contain noise (the red points in the original figure); if the classifier has no tolerance for this noise, it is very likely that no solution exists.
Slack variables ξ_i are introduced to solve this. The original problem is corrected as follows:

minimize (1/2)||w||^2 + C Σ_i ξ_i, subject to y_i (w·x_i + b) >= 1 - ξ_i and ξ_i >= 0.
Introduce Lagrange multipliers according to the Lagrangian method:

L(w, b, ξ, α, μ) = (1/2)||w||^2 + C Σ_i ξ_i - Σ_i α_i [y_i (w·x_i + b) - 1 + ξ_i] - Σ_i μ_i ξ_i, with α_i >= 0 and μ_i >= 0.
Setting the derivatives of the above to zero gives:

∂L/∂w = 0, ∂L/∂b = 0, ∂L/∂ξ_i = 0, that is, w = Σ_i α_i y_i x_i, Σ_i α_i y_i = 0, and C - α_i - μ_i = 0.
Substituting back gives the Lagrangian dual problem:

maximize W(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j), subject to 0 <= α_i <= C and Σ_i α_i y_i = 0.
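Compared with the hard-margin dual solved in the SLSQP sketch of section 1.3, only the bounds on α change; hypothetically, the modification would be:

```python
import numpy as np

y = np.array([+1.0, +1.0, -1.0, -1.0])  # same toy labels as before
C = 1.0
# Hard margin used bounds=[(0.0, None)]*len(y); the soft-margin dual
# merely replaces them with the box constraint 0 <= alpha_i <= C:
bounds = [(0.0, C)] * len(y)
```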
In addition, when the objective function attains its extremum, each constraint must either be active or have a zero multiplier (the KKT conditions), that is:

α_i [y_i (w·x_i + b) - 1 + ξ_i] = 0 and μ_i ξ_i = 0.
From the above formulas we can draw the following conclusions:

When α_i = C: μ_i = 0, so ξ_i need not be zero; the distance from the point to the cutting surface can be smaller than 1/||w||, and the point is a margin-violating (possibly misclassified) point.

When α_i = 0: ξ_i is zero and y_i g(x_i) >= 1: the point is at or beyond the margin and is a correctly classified point.

When 0 < α_i < C: ξ_i is zero and y_i g(x_i) = 1: the point is a support vector.
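These three cases translate directly into code; a sketch (the tolerance and the function name are my own):

```python
import numpy as np

def categorize(alpha, C, tol=1e-8):
    """Split samples into the three cases above by their multiplier alpha_i."""
    at_bound = alpha >= C - tol            # alpha_i = C: margin violators
    free     = (alpha > tol) & ~at_bound   # 0 < alpha_i < C: support vectors on the margin
    off      = alpha <= tol                # alpha_i = 0: safely classified points
    return at_bound, free, off
```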
To state this in mathematical language: let f(α) = (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j) - Σ_i α_i be the dual objective (to be minimized). Its partial derivative with respect to α_i is:

∇f(α)_i = Σ_j α_j y_j y_i K(x_i, x_j) - 1.

Define the index sets:

I_up = { i | α_i < C, y_i = +1 } ∪ { i | α_i > 0, y_i = -1 },
I_low = { i | α_i < C, y_i = -1 } ∪ { i | α_i > 0, y_i = +1 }.

The KKT conditions can then be expressed as:

max over i in I_up of -y_i ∇f(α)_i <= min over j in I_low of -y_j ∇f(α)_j,

that is, no value of -y_i ∇f(α)_i over I_up exceeds any value of -y_j ∇f(α)_j over I_low. Here b is dropped as an intermediate quantity, because b can be derived afterwards.
II. Algorithm
When the number of samples is large (several thousand), the memory required by a naive SVM solver is unacceptable for computers. Currently there are two kinds of solutions to this problem: chunking algorithms and decomposition algorithms. libsvm uses SMO (Sequential Minimal Optimization), a decomposition method that selects only two samples to update in each iteration. The basic process is as follows:
There are two important sub-algorithms: one is the selection, the other is the update.
2.1 Selection Algorithm
Select the two multipliers α_i and α_j that most seriously violate the KKT conditions. The selection involves two loops:
Outer loop: traverse the non-boundary samples (0 < α_i < C) first, because non-boundary samples are more likely to need adjustment, while boundary samples often stay on the boundary because they cannot be adjusted further. During the traversal, find the sample with the largest KKT violation (this sample is the one most likely to fail the conditions).
Inner loop: for the sample i selected in the outer loop, locate the sample j such that

|E_i - E_j|

is maximized, where E_k = g(x_k) - y_k is the prediction error. This quantity appears in the update formula of section 2.2: choosing j this way makes the step taken for the selected pair of multipliers as large as possible.
If during the selection every sample is found to satisfy the KKT conditions, the algorithm ends.
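A sketch of this maximal-violating-pair selection in the gradient form of section 1.5 (the index-set formulation follows the libsvm papers; variable names here are assumptions):

```python
import numpy as np

def select_working_set(alpha, y, grad, C, eps=1e-3):
    """Pick the pair (i, j) most seriously violating the KKT conditions.

    grad[i] is the partial derivative of the dual objective at alpha
    (both index sets are assumed non-empty for this sketch).
    """
    up  = ((y > 0) & (alpha < C)) | ((y < 0) & (alpha > 0))   # I_up
    low = ((y > 0) & (alpha > 0)) | ((y < 0) & (alpha < C))   # I_low
    viol = -y * grad
    i = np.where(up)[0][np.argmax(viol[up])]    # worst offender from above
    j = np.where(low)[0][np.argmin(viol[low])]  # worst offender from below
    if viol[i] - viol[j] < eps:
        return None                             # KKT satisfied within eps: stop
    return i, j
```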
2.2 Update Algorithm
Because SMO selects only two samples each time, the equality constraint Σ_i α_i y_i = 0 reduces to a linear constraint on the pair:

y_1 α_1 + y_2 α_2 = constant.
(In the original figure, this constraint is drawn as a line segment inside the box [0, C] × [0, C].)
The feasible range for the new α_2 is [L, H], where:

if y_1 ≠ y_2: L = max(0, α_2 - α_1), H = min(C, C + α_2 - α_1);
if y_1 = y_2: L = max(0, α_1 + α_2 - C), H = min(C, α_1 + α_2).
Substituting the constraint into the objective yields a quadratic function of a single variable; taking its extremum gives:

α_2_new = α_2 + y_2 (E_1 - E_2) / η, where η = K(x_1, x_1) + K(x_2, x_2) - 2 K(x_1, x_2).
Finally, clip α_2_new into [L, H] and set α_1_new = α_1 + y_1 y_2 (α_2 - α_2_new).
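Putting the bounds and the update formula together, one SMO step might look like this sketch (η can be non-positive for degenerate pairs; real implementations such as libsvm handle that case specially, here we simply skip it):

```python
import numpy as np

def smo_step(i, j, alpha, y, E, K, C):
    """One SMO update of the pair (alpha_i, alpha_j).

    E[k] = g(x_k) - y_k is the prediction error and K the kernel matrix.
    """
    if y[i] != y[j]:                          # bounds of the feasible segment [L, H]
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]   # curvature along the constraint line
    if eta <= 0:
        return alpha[i], alpha[j]             # degenerate pair: skip (simplification)
    a_j = np.clip(alpha[j] + y[j] * (E[i] - E[j]) / eta, L, H)
    a_i = alpha[i] + y[i] * y[j] * (alpha[j] - a_j)   # keeps y_i a_i + y_j a_j constant
    return a_i, a_j
```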
2.3 Others
As mentioned above, SVM uses a huge amount of memory; another drawback is the computation speed. As the data grows, the computational workload grows and the speed obviously drops. A good remedy is therefore to gradually remove inactive data from the computation. Practice shows that once a multiplier reaches the boundary (α_i = 0 or α_i = C) during training, its value rarely changes afterwards. As training progresses, the samples still being computed become fewer and fewer, and SVM finally returns the support vectors (α_i > 0).
libsvm implements this shrinking by checking the α values within active_size; if a value reaches the boundary, the corresponding sample is removed (marked inactive) and swapped to the end of the array, and active_size is gradually reduced.
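The bookkeeping can be sketched as a swap-to-end on an index array (a simplification of what libsvm actually does):

```python
def shrink(k, idx, active_size):
    """Deactivate the sample at position k by swapping it past the active block.

    idx maps positions to sample indices; positions >= active_size are frozen.
    A simplified version of libsvm's swap-to-end bookkeeping.
    """
    active_size -= 1
    idx[k], idx[active_size] = idx[active_size], idx[k]
    return active_size
```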
The basic formula for computing b is:

b = y_i - Σ_j α_j y_j K(x_j, x_i), taking x_i to be a support vector.
Theoretically, the value of b obtained this way is not unique. When the program reaches the optimum, it suffices to plug any standard support vector (0 < α_i < C) into the formula above, and the resulting b is acceptable. In practice there are several ways to evaluate b; libsvm evaluates b over the support vectors for y = +1 and y = -1 and then takes the average.
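A simplified sketch of this averaging over free support vectors (libsvm's actual computation is more involved; names here are assumptions):

```python
import numpy as np

def compute_b(alpha, y, K, C, tol=1e-8):
    """Average b = y_i - sum_j alpha_j y_j K(x_j, x_i) over free support vectors."""
    free = np.where((alpha > tol) & (alpha < C - tol))[0]
    return np.mean([y[i] - np.dot(alpha * y, K[:, i]) for i in free])
```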