Mathematics and Algorithms in SVM

Source: Internet
Author: User
Tags: svm

The support vector machine (SVM) was first proposed by Cortes and Vapnik in 1995. It has unique advantages in small-sample, non-linear, and high-dimensional pattern recognition, and it can also be applied to other machine learning problems such as function fitting.

I. Mathematics

1.1 Two-Dimensional Space

A typical application of SVM is classification. It solves the following kind of problem: some things clearly fall into classes, but we cannot state the rule that separates them. For example, suppose the triangles in a figure belong to class C1 and the circles belong to class C2; this much is known. Now a square appears: does it belong to C1 or C2? That is not clear. The SVM algorithm settles this question.

In two-dimensional space (each sample then has two attributes), SVM draws a line g(x) = 0 between C1 and C2: points above the line belong to class C1, and points below it belong to class C2. Once this line exists, the new square has a rule to follow.

A few more words about g(x) = 0. The x in g(x) is not an abscissa but a vector, and w is not a slope as in analytic geometry but also a vector; w·x is their dot product, so g(x) = w·x + b. For example, the straight line y = -x - b of analytic geometry becomes, in vector notation, w·x + b = 0 with w = (1, 1) and x = (x, y).

For points in class C1: g(x) > 0; for points in class C2: g(x) < 0.

If we use y to represent the class label, +1 represents class C1 and -1 represents class C2.

So for all training samples we have y_i · g(x_i) > 0, and any line g(x) = 0 satisfying this condition correctly separates all the training samples; any such line could, in principle, be used.

"Could be used" is not enough, because there are far too many lines g(x) = 0 that satisfy this condition. What we pursue is the best line. What makes a line optimal? Intuition tells us that the line g(x) = 0 should be biased toward neither C1 nor C2; such a line should be the best. Exactly; the formal name for this idea is the classification interval (margin), and in the figure the length of the red line is the classification interval.

In two-dimensional space, finding the classification interval reduces to finding the distance from a point to a line, which can be expressed as |g(x)| / ||w||. For simpler calculation, the whole space is normalized (scaled proportionally) so that all samples satisfy |g(x)| >= 1; that is, the C1 and C2 training samples closest to g(x) = 0 have |g(x)| = 1. The classification interval is then 2 / ||w||: the larger the interval the better, so the smaller ||w|| the better.
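As a concrete check of the distance formula, here is a short sketch (the line and points are chosen purely for illustration):

```python
import numpy as np

# Hypothetical separating line g(x) = w.x + b with w = (1, 1), b = -3,
# i.e. the line x1 + x2 - 3 = 0.
w = np.array([1.0, 1.0])
b = -3.0

def g(x):
    """Evaluate g(x) = w.x + b for a sample x."""
    return np.dot(w, x) + b

def distance(x):
    """Geometric distance from point x to the line g(x) = 0: |g(x)| / ||w||."""
    return abs(g(x)) / np.linalg.norm(w)

x = np.array([3.0, 2.0])   # g(x) = 3 + 2 - 3 = 2 > 0, so this point is class C1
print(g(x))                # 2.0
print(distance(x))         # 2 / sqrt(2) ~ 1.414
```

Scaling w and b by the same factor leaves the line and the distance unchanged, which is why the normalization |g(x)| = 1 at the closest samples is always possible.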

1.2 Multi-Dimensional Space

Now we have abstracted a mathematical problem in two-dimensional space: find the g(x) = 0 that satisfies

min (1/2) ||w||^2  subject to  y_i (w·x_i + b) >= 1, i = 1, ..., n.

That is, find the w that attains the minimum while satisfying the constraints. In two-dimensional space w can be thought of as the slope: once the samples are fixed and w is determined, b is determined as well, and the whole line g(x) = w·x + b = 0 is determined.

So far we have only discussed two-dimensional space, but we are pleasantly surprised to find that the conclusions in two-dimensional space extend easily to multi-dimensional space. For example:

In a multi-dimensional space we can still express the separating surface (hyperplane) as g(x) = w·x + b = 0.

The distance from a point to a hyperplane in multi-dimensional space can still be expressed as |g(x)| / ||w||. Write a point x as x = x_p + r (w / ||w||), where x_p is the projection of x onto the surface and r is the distance from x to the surface. A simple derivation: the vector w is perpendicular to the plane, so substituting into g and simplifying gives g(x) = w·x_p + b + r ||w|| = r ||w||, since g(x_p) = 0. Therefore the distance from x to the plane is r = g(x) / ||w||, the same as in two-dimensional space.

Now we extend SVM from two-dimensional space to multi-dimensional space; that is, we seek the g(x) = 0 that satisfies:

min (1/2) ||w||^2  subject to  y_i (w·x_i + b) >= 1, i = 1, ..., n.

1.3 Lagrange Multipliers

This is a typical constrained extreme-value problem: the objective function is quadratic and the constraints are linear, i.e. a quadratic programming problem. The general method for solving such a problem is to introduce Lagrange multipliers and construct the Lagrangian (in theory some additional mathematical conditions must hold for the Lagrange method to apply; we skip them here).

The specific steps are as follows:

1. Construct the Lagrangian

L(w, b, α) = (1/2) ||w||^2 - Σ_i α_i [y_i (w·x_i + b) - 1],  with α_i >= 0,

where w and b are unknown.

2. Take the partial derivatives with respect to w and b, and set them to 0:

∂L/∂w = 0 and ∂L/∂b = 0, that is, w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0.

3. Substitute these back into the Lagrangian to obtain the Lagrangian dual problem, converting the problem into solving:

max over α of  Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i·x_j)  subject to α_i >= 0 and Σ_i α_i y_i = 0.

4. Finally, convert the problem into finding the α that satisfies this system; w is then recovered from w = Σ_i α_i y_i x_i, and b from any sample with α_i > 0 via b = y_i - w·x_i.
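For intuition, here is a sketch that solves this dual numerically on a toy two-point dataset with a linear kernel, using scipy's general-purpose constrained optimizer (the data and variable names are illustrative, not part of the derivation above):

```python
import numpy as np
from scipy.optimize import minimize

# Toy data (chosen for illustration): one point per class.
X = np.array([[0.0, 0.0], [2.0, 2.0]])
y = np.array([-1.0, 1.0])
K = X @ X.T                      # Gram matrix of dot products x_i . x_j

def neg_dual(a):
    """Negated dual objective: -(sum(a) - 1/2 * sum_ij a_i a_j y_i y_j K_ij)."""
    return -(a.sum() - 0.5 * (a * y) @ K @ (a * y))

cons = {"type": "eq", "fun": lambda a: a @ y}   # constraint: sum_i a_i y_i = 0
res = minimize(neg_dual, x0=np.zeros(2), bounds=[(0, None)] * 2, constraints=cons)
alpha = res.x
w = (alpha * y) @ X              # recover w = sum_i a_i y_i x_i
sv = alpha > 1e-6                # support vectors have alpha_i > 0
b = y[sv][0] - w @ X[sv][0]      # b from any support vector: y_i - w . x_i
print(alpha, w, b)               # alpha ~ [0.25, 0.25], w ~ [0.5, 0.5], b ~ -1
```

The resulting hyperplane x1 + x2 = 2 (after scaling) sits exactly midway between the two points, as the maximum-margin argument predicts.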

1.4 Linearization

Well, now let's sort out the SVM classification logic: find a separating surface (line) in the space that separates the sample points, where the optimality condition for the surface (line) is to maximize the classification interval, and the interval is computed from the distance between points and the plane (line). The problem is: are all separating surfaces flat, and all separating lines straight? Apparently not.

For example, let the feature be the area x of a house, where x is a real number, and let y be the price of the house. Suppose the distribution of the sample points shows that x and y follow a cubic curve; then we want to use a cubic polynomial in x to fit these sample points.

In two-dimensional space this relationship is non-linear, so we cannot use the previous reasoning: what is the distance from a point to a curve? We do not know how to compute it. However, if x is mapped into a three-dimensional space via φ(x) = (x, x^2, x^3), the relationship becomes linear. That is, a non-linear curve (surface) in a low-dimensional space can become linear when mapped to a high-dimensional space. So we make a small correction to the problem: we solve the same optimization with x_i replaced by φ(x_i), so the dual now involves the dot products φ(x_i)·φ(x_j).

A kernel function K(x_i, x_j) = φ(x_i)·φ(x_j) is introduced here to linearize the sample space: it computes the high-dimensional dot product directly in the original space, without ever constructing φ explicitly.
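A small sketch makes the kernel trick concrete: for the degree-2 polynomial kernel K(x, z) = (x·z)^2 on 2-D inputs (a standard textbook example, not one from this article), an explicit 3-D mapping φ gives exactly the same value.

```python
import numpy as np

def phi(x):
    """Explicit map to 3-D feature space whose dot product equals (x.z)^2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def K(x, z):
    """Polynomial kernel of degree 2, computed in the original 2-D space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
print(K(x, z))                 # (1*3 + 2*4)^2 = 121
print(np.dot(phi(x), phi(z)))  # same value, via the explicit mapping
```

The kernel evaluates in the 2-D space what would otherwise require building the 3-D vectors; for high-degree kernels or the RBF kernel the explicit φ is huge or infinite-dimensional, which is why the trick matters.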

1.5 Slack Variables

The above is a fairly complete derivation, but experience shows that if these conditions are handed to a computer to solve, there is essentially no solution, because the conditions are too harsh. In practice the most common situation involves noise points (the red points in the figure) that fall on the wrong side during classification; with no tolerance for noise, there may well be no classifying solution at all.

Slack variables ξ_i are introduced to solve this problem. The original problem is corrected as follows:

min (1/2) ||w||^2 + C Σ_i ξ_i  subject to  y_i (w·x_i + b) >= 1 - ξ_i,  ξ_i >= 0.

Following the Lagrangian method, multipliers α_i >= 0 and μ_i >= 0 are introduced:

L(w, b, ξ, α, μ) = (1/2) ||w||^2 + C Σ_i ξ_i - Σ_i α_i [y_i (w·x_i + b) - 1 + ξ_i] - Σ_i μ_i ξ_i.

Setting the partial derivatives with respect to w, b, and ξ_i to zero gives:

w = Σ_i α_i y_i x_i,  Σ_i α_i y_i = 0,  and  C - α_i - μ_i = 0.

Substituting these back yields the Lagrangian dual problem:

max over α of  Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i·x_j)  subject to 0 <= α_i <= C and Σ_i α_i y_i = 0.

The only change from the separable case is the upper bound C on each α_i.

In addition, when the objective function attains its extreme value, each active constraint must be at its boundary (the KKT conditions), that is:

α_i [y_i (w·x_i + b) - 1 + ξ_i] = 0  and  μ_i ξ_i = 0.

From the above, the following conclusions can be drawn:

When α_i = C: μ_i = 0, so ξ_i need not be zero; the distance from the point to the separating surface is smaller than 1/||w||, so it is a margin-violating (possibly misclassified) point.

When α_i = 0: ξ_i = 0 and y_i g(x_i) >= 1; the point is far from the separating surface, a correctly classified point.

When 0 < α_i < C: ξ_i = 0 and y_i g(x_i) = 1; this point is a support vector.

Stated in mathematical language:

Let g(x) = Σ_j α_j y_j K(x_j, x) + b. The KKT conditions can be expressed as:

α_i = 0  implies  y_i g(x_i) >= 1;
0 < α_i < C  implies  y_i g(x_i) = 1;
α_i = C  implies  y_i g(x_i) <= 1.

Equivalently, the partial derivative of the dual objective with respect to α_i is y_i (g(x_i) - b) - 1, so the conditions can be checked without b: at the optimum, every value of -y_i times this derivative over the samples whose α_i may still decrease is greater than or equal to every value over the samples whose α_i may still increase. Here b is ignored as an intermediate quantity, because b can be derived afterwards from any support vector.
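The three cases above can be captured in a small illustrative helper (a sketch for clarity, not libsvm code; the function name and labels are made up):

```python
def kkt_category(alpha, C, tol=1e-6):
    """Classify a training sample by its multiplier alpha, following the
    three KKT cases discussed above (tol absorbs floating-point error)."""
    if alpha < tol:            # alpha = 0: y*g(x) >= 1, correct and outside the margin
        return "outside margin"
    if alpha > C - tol:        # alpha = C: y*g(x) <= 1, inside the margin or misclassified
        return "inside margin / misclassified"
    return "support vector"    # 0 < alpha < C: y*g(x) = 1, exactly on the margin

print(kkt_category(0.0, C=1.0))   # outside margin
print(kkt_category(0.4, C=1.0))   # support vector
print(kkt_category(1.0, C=1.0))   # inside margin / misclassified
```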

II. Algorithm

When the number of samples is large (several thousand), the memory SVM requires is more than a computer can bear. Currently there are two kinds of solution to this problem: chunking algorithms and decomposition algorithms. Libsvm uses SMO (Sequential Minimal Optimization), a decomposition algorithm that selects only two samples to optimize in each training step. The basic process is as follows:

There are two important sub-algorithms: one is the selection of the two multipliers, and the other is their update.

2.1 Selection Algorithm

Select the two multipliers α_1 and α_2 that most seriously violate the KKT conditions, using two loops:

Outer loop: traverse the non-boundary samples first, because non-boundary samples are more likely to need adjustment, while boundary samples often stay on the boundary and need no further change. During the traversal, pick the sample with the largest KKT violation (this sample is the one most likely to fail the conditions).

Inner loop: for the sample selected in the outer loop, find a second sample such that |E_1 - E_2| is maximized, where E_i = g(x_i) - y_i is the prediction error on sample i. This quantity appears in the update formula below; maximizing it selects the pair of multipliers whose update moves the solution the most.

If the KKT conditions are already satisfied during selection, the algorithm terminates.

2.2 Update Algorithm

Since SMO selects only two samples at a time, the equality constraint Σ_i α_i y_i = 0 reduces to a linear constraint on the pair: α_1 y_1 + α_2 y_2 = constant.

Represented graphically, this constraint is a line segment inside the box [0, C] x [0, C], so the feasible range of α_2 is [L, H], where:

if y_1 ≠ y_2: L = max(0, α_2 - α_1), H = min(C, C + α_2 - α_1);
if y_1 = y_2: L = max(0, α_1 + α_2 - C), H = min(C, α_1 + α_2).

Substituting the constraint into the dual objective gives a quadratic equation in the single variable α_2; taking its extreme value yields the unclipped update:

α_2_new = α_2 + y_2 (E_1 - E_2) / η,  where η = K_11 + K_22 - 2 K_12.

Finally, α_2_new is clipped to [L, H], and α_1 is updated as α_1_new = α_1 + y_1 y_2 (α_2 - α_2_new), which keeps α_1 y_1 + α_2 y_2 constant.
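The two-variable update above can be sketched as a single function (variable names assumed; a minimal sketch of the step, not libsvm's implementation, and it omits the degenerate case η <= 0 that a full implementation must handle):

```python
def smo_update(a1, a2, y1, y2, E1, E2, K11, K12, K22, C):
    """One SMO two-variable step: clip alpha2 to its feasible segment [L, H],
    then adjust alpha1 so that a1*y1 + a2*y2 stays constant."""
    if y1 != y2:                           # bounds from a1*y1 + a2*y2 = const
        L, H = max(0.0, a2 - a1), min(C, C + a2 - a1)
    else:
        L, H = max(0.0, a1 + a2 - C), min(C, a1 + a2)
    eta = K11 + K22 - 2.0 * K12            # curvature of the 1-D objective
    a2_new = a2 + y2 * (E1 - E2) / eta     # unconstrained optimum
    a2_new = min(H, max(L, a2_new))        # clip to the feasible segment
    a1_new = a1 + y1 * y2 * (a2 - a2_new)  # restore the equality constraint
    return a1_new, a2_new

print(smo_update(0.2, 0.3, 1, -1, 0.5, -0.5, 2.0, 1.0, 2.0, 1.0))  # ~ (0.0, 0.1)
```

In this example the unclipped optimum -0.2 falls below L = 0.1, so α_2 is clipped to the boundary and α_1 follows from the constraint.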

2.3 Others

As mentioned above, SVM's memory use is huge; another weakness is computing speed. Because the data size is large, the computation workload is large and speed drops noticeably. A good remedy is to gradually remove data that no longer participates in the computation. It can be shown that once a multiplier reaches the boundary (α_i = 0 or α_i = C) during training, its value tends not to change again. As training progresses, fewer and fewer samples take part in the computation, and the final SVM result depends only on the support vectors, the samples with α_i > 0.

Libsvm uses a shrinking policy: during computation it checks the α values within active_size; if a value has reached the boundary, the corresponding sample is removed (marked inactive) and moved to the end of the array, so active_size gradually shrinks.

The basic calculation formula for b is:

b = y_j - Σ_i α_i y_i K(x_i, x_j),  for any free support vector x_j (0 < α_j < C).

Theoretically, b is not pinned down until the optimum is reached: once the program attains the optimum, bringing any standard support vector (0 < α_j < C) into the formula above yields an acceptable value of b. There are many ways to estimate b in practice; in libsvm, b is computed over the support vectors of both classes y = +1 and y = -1, and the average is taken.
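The averaging idea can be sketched as follows (toy numbers and function name assumed; this mirrors the idea described above, not libsvm's exact code):

```python
import numpy as np

def intercept(alpha, y, K, C, tol=1e-6):
    """Average b = y_j - sum_i alpha_i y_i K(x_i, x_j) over all free support
    vectors (0 < alpha_j < C); tol guards against floating-point boundary noise."""
    free = (alpha > tol) & (alpha < C - tol)
    b_vals = y[free] - (alpha * y) @ K[:, free]
    return b_vals.mean()

# Toy problem: points (0,0) with y=-1 and (2,2) with y=+1, linear kernel.
# alpha = [0.25, 0.25] is the optimal multiplier vector for this problem.
X = np.array([[0.0, 0.0], [2.0, 2.0]])
y = np.array([-1.0, 1.0])
alpha = np.array([0.25, 0.25])
K = X @ X.T
print(intercept(alpha, y, K, C=1.0))   # -1.0
```

Averaging over both classes damps the numerical noise that any single support vector's estimate of b would carry.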
