1. Linear classifier:
First, consider a very simple classification problem (a linearly separable one): we need a straight line that separates the black points from the white points. The line in the figure is clearly one of the lines we are after (there are infinitely many such lines).
Suppose we label each black point -1 and each white point +1, and write the line as f(x) = w·x + b, where x and w are vectors. This form is equivalent to f(x) = w1x1 + w2x2 + ... + wnxn + b. When x has dimension 2, f(x) = 0 is a straight line in two-dimensional space. When x has dimension 3, it is a plane in three-dimensional space. When x has dimension n > 3, it is an (n-1)-dimensional hyperplane in n-dimensional space. This is fairly basic material; if it is unclear, you may need to review calculus and linear algebra.
As just described, we label the black and white points -1 and +1 respectively. So when a new point x needs to be assigned to a category, we can predict with sgn(f(x)). sgn is the sign function: when f(x) > 0, sgn(f(x)) = +1; when f(x) < 0, sgn(f(x)) = -1.
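The prediction rule above can be sketched in a few lines of Python. This is a minimal illustration, not a trained model: the values of `w` and `b` are hypothetical, and assigning points that fall exactly on the line (f(x) = 0) to -1 is a convention chosen here, since the text leaves that case open.

```python
import numpy as np

def f(x, w, b):
    """Linear decision function f(x) = w . x + b."""
    return np.dot(w, x) + b

def predict(x, w, b):
    """Return +1 (white) if f(x) > 0, otherwise -1 (black)."""
    # Assumption: f(x) == 0 is mapped to -1; the text does not define this case.
    return 1 if f(x, w, b) > 0 else -1

# Hypothetical 2D line: x1 + x2 - 1 = 0
w = np.array([1.0, 1.0])
b = -1.0

print(predict(np.array([2.0, 2.0]), w, b))  # f = 3, prints 1
print(predict(np.array([0.0, 0.0]), w, b))  # f = -1, prints -1
```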
But how do we obtain an optimal separating line f(x)? The figure shows several possible choices of f(x).
An intuitive criterion is that the line should be as far as possible from the sample points nearest to it. That sentence is a bit of a mouthful, so here are some figures to illustrate:
Method 1:
Method 2:
Which of the two is better? Intuitively, the larger the gap, the better the points of the two categories are separated. When we judge whether a person is a man or a woman, we rarely get it wrong, because the gap between the two categories is large; a large gap lets us classify data more accurately. In SVM this is called the maximum margin, and it is one of the theoretical foundations of SVM. There are many reasons to choose the function that maximizes the margin as the separating plane. For example, from a probabilistic perspective, it maximizes the confidence of the point with the lowest confidence (which is quite a mouthful); from a practical perspective, it works very well. I will not go into it here; just take it as a given :)
The points circled in red and blue are the so-called support vectors.
The figure illustrates the gap between categories mentioned earlier. The classifier boundary is f(x); the red and blue lines (the plus plane and the minus plane) are the planes on which the support vectors lie, and the gap between the red and blue lines is the margin between the categories that we want to maximize.
The formula for the margin M is M = 2 / ||w|| (it is easy to derive with high-school analytic geometry, or see Moore's PPT referenced later).
In addition, the support vectors lie on the lines w·x + b = 1 and w·x + b = -1. If we multiply by the class y that the point belongs to (remember, y is either +1 or -1), the expression for a support vector becomes y(w·x + b) = 1, which describes the support vectors more compactly.
Once the support vectors are determined, the separating function is determined; the two problems are equivalent. Another benefit of finding the support vectors is that the points lying behind them no longer need to take part in the computation. This is explained in more detail later.
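As a small numerical check of the margin and support-vector conditions, the sketch below uses hypothetical toy data and an assumed line (w, b) that has already been scaled so that the closest points satisfy y(w·x + b) = 1. It computes the margin as 2 / ||w|| and flags the points that lie exactly on the plus or minus plane.

```python
import numpy as np

# Hypothetical toy data: two points per class, labels in {-1, +1}
X = np.array([[0.0, 0.0], [0.0, -1.0], [2.0, 2.0], [3.0, 2.0]])
y = np.array([-1, -1, 1, 1])

# Assumed separating line, scaled so the closest points give y * f(x) = 1
w = np.array([0.5, 0.5])
b = -1.0

scores = y * (X @ w + b)            # y_i * (w . x_i + b) for every point
margin = 2.0 / np.linalg.norm(w)    # M = 2 / ||w||
support = np.isclose(scores, 1.0)   # True where the point lies on a margin plane

print(scores)
print(margin)
print(support)
```

Only the first and third points satisfy the equality; the other two lie behind the margin planes, so moving them slightly would not change the line.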
At the end of this section, we give the expression to optimize: minimize (1/2)||w||^2.
||w|| denotes the 2-norm of w. It is the denominator in the expression for M above, M = 2 / ||w||, so maximizing M is equivalent to minimizing ||w||. Because squaring is monotonic for non-negative values, we can square ||w|| and put a coefficient of 1/2 in front; readers familiar with this will see that it is done to make taking derivatives more convenient.
This formula comes with constraints. The complete form is written as follows (primal problem):
s.t. means "subject to", that is, under the constraints that follow. The term appears very often in SVM papers. This is a constrained quadratic programming (QP) problem, and it is convex. In a convex problem there is no local optimum that is not also the global optimum. Picture a funnel: no matter where in the funnel we release a ball, it ends up at the bottom, which is the global optimum. The constraints after s.t. can be seen as a convex polyhedron, and what we need to do is find the optimum within it. I will not expand on these questions here, because a whole book would not cover them. If you have questions, see Wikipedia.
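For reference, the primal problem described above can be written in standard notation as follows. The formula image from the original is not available, so this is a reconstruction from the surrounding text; R, the number of training points, and the index i are notation chosen here.

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w\rVert^{2}
\quad \text{s.t.}\quad y_i\,(w\cdot x_i + b) \ge 1,\qquad i = 1,\dots,R
```

The objective is quadratic in w and every constraint is linear in (w, b), which is what makes this a convex QP.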
2. Converting to the dual problem and solving the optimization:
This optimization problem can be solved with the method of Lagrange multipliers, using the theory of the KKT conditions. Here we directly write out the Lagrangian objective function:
Solving this requires some knowledge of Lagrangian duality (pluskid also has an article dedicated to this topic), and it involves a certain amount of derivation. If you are not interested, you can skip ahead to the conclusion given by the formula marked in blue. This section mainly follows pluskid's article.
First, minimize L with respect to w and b: setting the partial derivatives of L with respect to w and b to 0 gives the following expressions from the primal problem:
Substituting these two expressions back into L(w, b, α) gives the expression of the dual problem:
The new problem, together with its constraints, is (dual problem):
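The steps above can be written out as follows. This is the standard textbook derivation, reconstructed because the original formula images are not available; α denotes the vector of Lagrange multipliers (the third argument of L), and R, the number of training points, is notation chosen here.

```latex
\begin{align*}
L(w, b, \alpha) &= \frac{1}{2}\lVert w\rVert^{2}
  - \sum_{i=1}^{R} \alpha_i \bigl[\, y_i (w\cdot x_i + b) - 1 \,\bigr],
  \qquad \alpha_i \ge 0 \\[4pt]
\frac{\partial L}{\partial w} = 0 \ &\Rightarrow\ w = \sum_{i=1}^{R} \alpha_i y_i x_i,
  \qquad
\frac{\partial L}{\partial b} = 0 \ \Rightarrow\ \sum_{i=1}^{R} \alpha_i y_i = 0 \\[4pt]
\max_{\alpha}\ & \sum_{i=1}^{R} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{R}\sum_{j=1}^{R} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j)
  \quad \text{s.t.}\quad \alpha_i \ge 0,\quad \sum_{i=1}^{R} \alpha_i y_i = 0
\end{align*}
```

Note that the training points enter the dual only through the inner products x_i · x_j.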
This is the formula we need to optimize. At this point we have obtained the optimization formula for the linearly separable problem.
There are many ways to solve this formula, such as SMO. I personally think that solving such a constrained convex optimization problem is fairly independent of deriving it, so this article does not cover how to solve it at all. If I have time later, I may write a follow-up article about it :).
3. The linearly inseparable case (soft margin):
Next, let's talk about the linearly inseparable case, because the assumption of linear separability is too restrictive. Consider this figure:
It is a typical linearly inseparable classification problem. We cannot use a straight line to divide it into two regions that each contain points of only one color.
There are two ways to build a classifier in this case. One is to use a curve to separate the points completely. A curve is non-linear, and this approach is related to the kernel functions discussed later:
The other method is still to use a straight line, but without requiring perfect separation. That is, we tolerate the misclassified points, but add a penalty function so that the misclassification is kept as reasonable as possible. In fact, in many cases it is not true that the more perfectly the classification function fits the training data, the better, because some of the training data is inherently noisy; a label may have been wrong when it was added by hand. If we learn these erroneous points during training, the model will inevitably make mistakes the next time it meets similar cases (if a teacher gets a point wrong in a lecture and you believe it, you will inevitably get it wrong in the exam). Learning this "noise" is overfitting, which is a major taboo in machine learning: we would rather learn a little less than learn wrong knowledge. Back to the topic: how do we use a straight line to separate points that are not linearly separable?
We can add a small penalty for each misclassified point. The penalty for a misclassified point is the distance from that point to its correct position:
In the figure, the blue and red lines are the boundaries on which the support vectors lie, the green line is the decision function, and the purple lines indicate the distance from each misclassified point to its corresponding decision surface. In this way, we can add a penalty term to the original objective, with the following constraints:
In the formula, the blue part is the penalty term added on top of the linearly separable problem. When x_i is on the correct side, ε_i = 0. R is the total number of points, and C is a coefficient specified by the user that sets how heavily misclassified points are penalized. When C is very large, fewer points are misclassified, but overfitting may be severe. When C is very small, many points may be misclassified, and the resulting model may not be accurate. Choosing C is therefore a subject in its own right, but in most cases it is chosen by experience.
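To make the effect of C concrete, here is a minimal sketch that evaluates a soft-margin objective for a fixed line. It assumes the standard slack definition ε_i = max(0, 1 - y_i(w·x_i + b)), which is zero for points on the correct side of their margin plane; the data, `w`, `b`, and the function name `soft_margin_objective` are hypothetical.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slacks) for 1/2 ||w||^2 + C * sum(eps_i)."""
    # Assumed slack: eps_i = max(0, 1 - y_i (w . x_i + b)),
    # which is 0 when x_i is on the correct side of its margin plane.
    eps = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * eps.sum(), eps

# Hypothetical data: the third point carries label -1 but sits on the +1 side
X = np.array([[0.0, 0.0], [2.0, 2.0], [1.5, 1.5], [3.0, 2.0]])
y = np.array([-1, 1, -1, 1])
w = np.array([0.5, 0.5])
b = -1.0

obj_small_C, eps = soft_margin_objective(w, b, X, y, C=0.1)
obj_large_C, _ = soft_margin_objective(w, b, X, y, C=10.0)

print(eps)          # only the misplaced point has a nonzero slack
print(obj_small_C)  # the penalty barely matters
print(obj_large_C)  # the penalty dominates the objective
```

With a large C the single misplaced point dominates the objective, so an optimizer would bend the line to accommodate it; with a small C it would mostly be ignored.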
Next, as before, we form the Lagrangian and derive the dual of the primal problem:
The blue part is where this differs from the dual of the linearly separable problem. In the dual obtained for the linearly inseparable case, the only difference is that the range of α changes from [0, +∞) to [0, C]. The added penalty ε brings no extra complexity to the dual problem.
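For reference, the soft-margin primal and its dual can be written in standard notation as below. The original formula images are not available, so this is a reconstruction consistent with the symbols the text names (ε, R, C, α).

```latex
\begin{align*}
\text{primal:}\quad & \min_{w,\,b,\,\varepsilon}\ \frac{1}{2}\lVert w\rVert^{2} + C \sum_{i=1}^{R} \varepsilon_i
  \quad \text{s.t.}\quad y_i (w\cdot x_i + b) \ge 1 - \varepsilon_i,\quad \varepsilon_i \ge 0 \\[4pt]
\text{dual:}\quad & \max_{\alpha}\ \sum_{i=1}^{R} \alpha_i
  - \frac{1}{2}\sum_{i=1}^{R}\sum_{j=1}^{R} \alpha_i \alpha_j\, y_i y_j\,(x_i\cdot x_j)
  \quad \text{s.t.}\quad 0 \le \alpha_i \le C,\quad \sum_{i=1}^{R} \alpha_i y_i = 0
\end{align*}
```

The slack variables ε_i do not appear in the dual at all; their only trace is the upper bound C on each α_i.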
4. Kernel functions:
Earlier, when discussing the inseparable case, I mentioned that with non-linear methods we can obtain curves that separate the two categories perfectly, for example with the kernel functions discussed next.
We can map the points from the original linear space into a higher-dimensional space, and then use a hyperplane to separate them in that high-dimensional linear space. Here is an example of how raising the dimension helps classification (the example and the image come from the kernel function section of pluskid's blog):
It is a typical linearly inseparable case.
However, suppose we map these two roughly elliptical sets of points into a higher-dimensional space, with the mapping function:
With this function, points in the plane are mapped into a three-dimensional space (z1, z2, z3). After rotating the mapped coordinates, we get a linearly separable point set.
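The idea can be tried on toy data. The sketch below is not the mapping from the original figure (which is not recoverable here); it assumes a hypothetical quadratic feature map `phi` and two concentric circles instead of ellipses. No straight line separates the circles in the plane, but a plane does after the mapping.

```python
import numpy as np

def phi(X):
    """Hypothetical feature map (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, np.sqrt(2) * x1 * x2, x2**2])

# Toy data: class -1 on a circle of radius 1, class +1 on a circle of radius 2
angles = np.linspace(0.0, 2.0 * np.pi, 20, endpoint=False)
inner = np.column_stack([np.cos(angles), np.sin(angles)])
outer = 2.0 * inner
X = np.vstack([inner, outer])
y = np.concatenate([-np.ones(20), np.ones(20)])

# In the mapped space, z1 + z3 = x1^2 + x2^2 is the squared radius,
# so the plane z1 + z3 = 2.5 lies between the two classes.
w = np.array([1.0, 0.0, 1.0])
b = -2.5
pred = np.sign(phi(X) @ w + b)

print(np.array_equal(pred, y))
```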
To give a more philosophical example: no two objects in the world are completely identical. For any two objects, we can keep adding dimensions until they eventually differ. Take two books that are the same along two dimensions (color and content). We can add an author dimension; if that is not enough, we can add page count, owner, place of purchase, the notes written inside, and so on. As the number of dimensions goes to infinity, any two objects can be separated.
Recall the dual problem expression:
We can transform the red part (the inner product of x_i and x_j), replacing it with a kernel function K(x_i, x_j):
What this expression does is map the linear space into a high-dimensional space. There are many choices of K(x_i, x_j); below are two typical ones:
The first kernel is called the polynomial kernel, and the second is called the Gaussian kernel. The Gaussian kernel even maps the original space into an infinite-dimensional space. Kernel functions also have some nice properties; for example, they add little extra computation compared with the linear case. For a given problem, different kernel functions may give different results, and you generally have to experiment to find a good one.
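The two kernels can be sketched as follows. The exact forms used in the original are not shown, so the definitions below are common textbook forms and should be read as assumptions: the parameters `degree`, `c`, and `sigma` and the helper `phi` are illustrative. The check at the end shows why a kernel adds little work: the degree-2 polynomial kernel computed in 2D gives the same number as an inner product in the mapped 3D space.

```python
import numpy as np

def polynomial_kernel(x, z, degree=2, c=0.0):
    """Assumed form: K(x, z) = (x . z + c)^degree."""
    return (np.dot(x, z) + c) ** degree

def gaussian_kernel(x, z, sigma=1.0):
    """Assumed form: K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    diff = x - z
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma**2))

def phi(x):
    """Explicit 2D feature map matching polynomial_kernel with degree=2, c=0."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

k_implicit = polynomial_kernel(x, z)   # computed in the original 2D space
k_explicit = np.dot(phi(x), phi(z))    # computed in the mapped 3D space

print(k_implicit, k_explicit)          # the two values agree
print(gaussian_kernel(x, z))
```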
5. Some other problems:
1) How to perform multiclass classification:
The classification discussed above is binary. For N classes there are two main approaches: one is 1 vs (N-1), the other is 1 vs 1. In the former, we need to train N classifiers; the i-th classifier decides whether a point belongs to class i or to the complement of class i (the other N-1 classes).
In the latter, we need to train N * (N-1) / 2 classifiers. Classifier (i, j) decides whether a point belongs to class i or class j.
This approach is not specific to SVM; it is also widely used with many other classifiers. According to Professor Lin (the author of libsvm), 1 vs 1 works better than 1 vs (N-1).
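The bookkeeping for the two schemes can be sketched without any actual SVM training. In the snippet below, the function names are hypothetical, and combining the 1 vs 1 classifiers by majority vote is an assumption (a common choice; the text does not say how the pairwise results are combined).

```python
from collections import Counter
from itertools import combinations

def one_vs_rest_tasks(classes):
    """One binary task per class: class i versus all the other classes."""
    return [(c, [o for o in classes if o != c]) for c in classes]

def one_vs_one_tasks(classes):
    """One binary task per unordered pair of classes (i, j)."""
    return list(combinations(classes, 2))

def one_vs_one_vote(pairwise_winners):
    """Assumed combination rule: the class that wins the most pairwise contests."""
    return Counter(pairwise_winners).most_common(1)[0][0]

classes = ["a", "b", "c", "d"]       # N = 4
ovr = one_vs_rest_tasks(classes)     # N tasks
ovo = one_vs_one_tasks(classes)      # N * (N - 1) / 2 tasks

# Hypothetical outputs of the 6 pairwise classifiers for a single point
winners = ["a", "c", "a", "c", "b", "c"]
label = one_vs_one_vote(winners)

print(len(ovr), len(ovo), label)
```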
2) Does SVM overfit?
SVM guards against overfitting in two ways. One is adjusting C in the penalty function mentioned earlier. The other can be read off the formula: does min ||w||^2 look familiar? We have seen this expression in least squares regression. It makes the function smoother, so SVM is not prone to overfitting.
References:
The main references come from four places: Wikipedia (hyperlinks are given in the article), pluskid's blog posts about SVM, Andrew Moore's PPT (many figures in this article are taken or adapted from it), and PRML.
Support Vector Machine (SVM) Basics