A personal understanding of SVM

Source: Internet
Author: User
Tags: svm, rbf, kernel

I used to think SVM was powerful and mysterious. The principle, as I now understand it, is not difficult; but the saying that "a master defines a thing with mathematics and describes it with physics" is fully borne out in the mathematical side of SVM, where least squares, gradient descent, Lagrange multipliers, the dual problem, and the rest can leave your head spinning. It was only after hearing the lecture at Pui Yuen that the whole mathematical derivation, its ins and outs, became clear to me.

1. Why should we study linear classification?

First of all, why do we keep asking whether a data set is linearly separable or not; can't we simply separate it nonlinearly? In a sense we do: SVM simply maps the originally linearly inseparable data points into a new space and separates them there as linearly separable data. Viewed back in the original data space, the separation is still nonlinear. But why not perform the nonlinear separation directly in the original data space, rather than move to a new space and separate linearly? First, nonlinear separation is far more complex than linear separation. A linear separator is just a straight line, a plane, and so on, arguably the simplest possible form of curve, while the nonlinear cases are endless: in two-dimensional space alone there are curves, polylines, hyperbolas, conics, wavy lines, and irregular curves of every kind, and there is no way to treat them uniformly. Even if we managed to build a nonlinear classifier for one specific problem, it would not extend well to other situations; handing every new problem to a mathematician to build a custom curve model is too troublesome, and no one has that much time and energy. Linear classification is therefore used, first, because it is simple and its properties can be studied thoroughly, and second, because it generalizes strongly: once the linear case is understood, all the other problems are covered and no further models need to be built. So although SVM adds the extra step of mapping the raw data into a new space, which looks like more work, and although finding that new space is not easy at first sight, on the whole, once mastered, it saves far more effort than the alternatives.

2. What is the idea of SVM?

2.1 Hard margin support vector machine

One of the most critical ideas in SVM is the introduction and definition of the concept of the "margin". The concept itself is simple: taking two-dimensional space as an example, it is the distance between a point and the classifying line. Suppose the line is y = wx + b. The best classifying line is the one that maximizes the margin, that is, the distance from the nearest positive points and the nearest negative points to the line. In this way, the original problem is transformed into a constrained optimization problem, which can be solved directly. This is called hard margin maximization, and the resulting model is called the hard margin support vector machine.
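
Written out, that constrained optimization problem takes the standard textbook form below (a sketch: the x_i are the training points, y_i ∈ {-1, +1} are their class labels, and w and b are the parameters of the separating line; maximizing the margin turns out to be equivalent to minimizing ||w||):

    \min_{w,\,b} \; \frac{1}{2}\lVert w\rVert^2
    \quad \text{subject to} \quad y_i (w^\top x_i + b) \ge 1, \quad i = 1, \dots, N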

2.2 Soft margin support vector machine

But a new problem arises: in practical applications, the data we get are rarely perfectly linear; there may be individual noise points that fall among the other class. If those particular noise points were removed, the rest could easily be divided linearly. We do not know, however, which points in the data set are noise, and if we apply the previous method, no linear separation exists. Is there no way out? Suppose the line y = x + 1 divides the points into two classes, and each class contains a few of the other's noise points; to the human eye the data can still be divided into two classes, because the brain tolerates a certain amount of error. It still uses the line y = x + 1 and obtains the best classification at the smallest cost in errors. In the same way, we introduce the concept of error into SVM and call it the "slack variable". Adding slack variables means adding their error to the original distance objective, so the final optimization objective has two parts: the distance (margin) term and the slack-variable error term. The two parts are not equally important; the balance depends on the specific problem, so we multiply the slack error term by a weight parameter C, and by adjusting C we reconcile the two. If we can tolerate noise, we make C small, lowering the weight of the error term so that it matters less; conversely, if we need a model that is very strict about noise, we make C large, raising the weight so that errors matter more. By adjusting the parameter C, the model can be controlled. This is called soft margin maximization, and the resulting SVM is called a soft margin support vector machine.
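
With slack variables added, the standard form becomes the following (again a sketch; ξ_i is the slack, the "error", allowed to point i, and C is the weight parameter described above, so that a larger C punishes errors more severely):

    \min_{w,\,b,\,\xi} \; \frac{1}{2}\lVert w\rVert^2 + C \sum_{i=1}^{N} \xi_i
    \quad \text{subject to} \quad y_i (w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0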

2.3 Nonlinear support vector machine

The hard margin and soft margin support vector machines above both address data sets that are linear or approximately linear. But what if there is so much noise that the data become truly linearly inseparable? The most common example: in a two-dimensional Cartesian plane, draw a circle of radius 1 centered at the origin (0, 0). The points inside the circle and the points outside it certainly cannot be separated linearly in the two-dimensional space. However, anyone who remembers junior high school geometry knows that the points inside (and on) the circle satisfy x^2 + y^2 ≤ 1, while the points outside satisfy x^2 + y^2 > 1. Suppose we add a third dimension, z = x^2 + y^2. Then, in the three-dimensional space, we can decide whether a point lies inside or outside the circle simply by whether z is greater than 1. In this way, data that are linearly inseparable in two-dimensional space can easily be divided in three-dimensional space. This is the nonlinear support vector machine.
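
Here is a minimal sketch of that trick in code (Python with NumPy and scikit-learn, my own choice of tools rather than anything the original text specifies): label points by whether they fall inside the unit circle, add the third feature z = x^2 + y^2, and fit a linear SVM in the lifted three-dimensional space.

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)

    # Sample 2-D points; label +1 inside the unit circle, -1 outside.
    X = rng.uniform(-2, 2, size=(400, 2))
    y = np.where(X[:, 0]**2 + X[:, 1]**2 <= 1, 1, -1)

    # No straight line separates these classes in the plane, but after
    # adding z = x^2 + y^2 the plane z = 1 separates them in 3-D.
    z = (X[:, 0]**2 + X[:, 1]**2).reshape(-1, 1)
    X_lifted = np.hstack([X, z])

    clf = LinearSVC(C=1.0).fit(X_lifted, y)
    print("training accuracy in the lifted space:", clf.score(X_lifted, y))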

This is the really important idea in SVM: for data that are not linearly separable in n-dimensional space, a space of n+1 or more dimensions offers a greater chance of linear separability (not a guarantee at n+1 dimensions; the higher the dimension, the more likely linear separability becomes, though it is never fully assured). Therefore, for linearly inseparable data, we can map it into a new space in which it is linearly separable and then solve it with the hard margin or soft margin support vector machine just described. The original problem thus becomes: how do we map the raw data so that it is linearly separable in the new space? In the example above, the mapping could be found by inspecting the equation of the circle, but real data are certainly not so obliging; if the pattern could be seen by eye, there would be no need for the machine to run SVM at all.

In practice, it is very difficult to find a suitable mapping function for a real problem. Fortunately, in the computation we only ever need the inner product of two vectors in the new mapped space, and we never need to know the mapping function itself. This is not easy to accept at first; some will ask, if we do not know the mapping function, how can we know the result of the inner product in the new space? The answer is that we can, and this is where the concept of the kernel function comes in. A kernel function is a function of the following kind. Again taking two-dimensional space as an example, suppose that for vectors x and y the mapping into the new space is φ, so they correspond to φ(x) and φ(y), whose inner product is <φ(x), φ(y)>. We define kernel(x, y) = <φ(x), φ(y)> = k. As you can see, kernel(x, y) is a function of x and y alone, and it has nothing to do with φ! What a fine property! We no longer have to care what mapping φ actually performs; we only need to compute kernel(x, y) to obtain the inner product in the high-dimensional space, which plugs straight into the support vector machine computation described earlier. Mom no longer has to worry about my studies!
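
A small numerical sanity check of that claim (my own illustration; it uses the degree-2 homogeneous polynomial kernel, one of the few kernels whose explicit map φ is easy to write down):

    import numpy as np

    def phi(v):
        # Explicit feature map for the degree-2 polynomial kernel in 2-D:
        # phi(v) = (v1^2, sqrt(2)*v1*v2, v2^2)
        return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

    def kernel(x, y):
        # The same inner product, computed without ever building phi.
        return np.dot(x, y) ** 2

    x = np.array([1.0, 2.0])
    y = np.array([3.0, 0.5])

    print(np.dot(phi(x), phi(y)))  # inner product in the mapped space: 16.0
    print(kernel(x, y))            # same value from the original space: 16.0

Both lines print 16.0: the kernel delivers the high-dimensional inner product without ever constructing the mapping.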

After obtaining this delightful function, we should calm down and ask: where does this kernel function come from? How is it derived? And can it really solve every problem of mapping to a high-dimensional space?

Let me try to answer, as far as my understanding goes. Kernel functions are not easy to find; they are generally derived or pieced together by mathematicians. The known ones include the polynomial kernel, the Gaussian kernel, the string kernel, and so on. The Gaussian kernel is also called the radial basis function (RBF) kernel, and it is the most commonly used kernel function.
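
For concreteness, the standard definitions of two of these kernels (textbook forms; d, c, and γ are tunable parameters that the text above does not discuss):

    K_{\mathrm{poly}}(x, y) = (x^\top y + c)^d,
    \qquad
    K_{\mathrm{RBF}}(x, y) = \exp\!\left(-\gamma \lVert x - y \rVert^2\right)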

The RBF kernel can extend the dimension to an infinite-dimensional space, so in theory it can meet every mapping need. Why infinite-dimensional? I did not quite understand this at first. Later, a teacher explained that the RBF kernel corresponds to a Taylor series expansion: in a Taylor series, a function is decomposed into an infinite sum of terms, and each term can be regarded as one dimension, so the original function can be viewed as mapped into an infinite-dimensional space. Thus, in practical applications, RBF is the relatively best first choice. Of course, if you have studied your problem, another kernel may perform better on it; but RBF is the kernel that works well on the widest range of problems when the problem itself is not yet understood, and its range of use is accordingly the widest.
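
The usual way to make that argument concrete is the following expansion (a sketch for the one-dimensional case with the kernel width fixed at 1):

    \exp\!\left(-(x - y)^2\right)
    = \exp(-x^2)\, \exp(-y^2)\, \exp(2xy)
    = \exp(-x^2)\, \exp(-y^2) \sum_{n=0}^{\infty} \frac{(2xy)^n}{n!}

Each term (2xy)^n / n! splits into a function of x times a function of y, so the implicit feature map contributes one coordinate for every n = 0, 1, 2, ...; that is, infinitely many dimensions.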

In this way, even linearly inseparable data can be mapped by the RBF kernel into a high-dimensional, or indeed infinite-dimensional, space, and the problem can then be solved by the margin maximization and slack variable machinery described above. Of course, the actual solving involves some mathematical techniques that simplify the computation; for example, Lagrange multipliers are used to transform the original problem into its dual problem, which is easier to compute. These techniques were not needed in the experiment, and the mathematics is somewhat hard going, so I will not dwell on them here.
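
Putting it all together in code (a minimal sketch with scikit-learn, my choice of library: kernel="rbf" selects the RBF kernel, C weights the slack-variable error, and gamma controls the kernel width):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)

    # The circle data again: linearly inseparable in the original 2-D space.
    X = rng.uniform(-2, 2, size=(400, 2))
    y = np.where(X[:, 0]**2 + X[:, 1]**2 <= 1, 1, -1)

    # Soft margin SVM with the RBF kernel; no explicit mapping is built.
    clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
    print("training accuracy:", clf.score(X, y))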
