[Reprint: please indicate the source] http://www.cnblogs.com/jerrylead

1. Introduction
SVM is arguably the best off-the-shelf supervised learning algorithm. When I first came into contact with SVM during last year's summer vacation, my teacher asked me to write a report on statistical learning theory. At the time I found a very popular introductory tutorial online and gained a general understanding of some of the related concepts. The Stanford materials this time let me re-learn SVM. Many orthodox lectures start from VC dimension theory and the structural risk minimization principle and then derive SVM; other materials start directly from the separating hyperplane. This document instead introduces SVM on the basis of the logistic regression described in the previous sections, which not only reveals the relationship between the models but also makes the transition feel more natural.
2. Review Logistic Regression
Logistic regression learns a 0/1 classification model from the features. The model uses a linear combination of the features as its argument; since this argument ranges from negative infinity to positive infinity, it is mapped into (0, 1) by the logistic function (also called the sigmoid function), and the mapped value is interpreted as the probability that y = 1.
The formal representation is the hypothesis function

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}},$$

where x is an n-dimensional feature vector and g is the logistic function

$$g(z) = \frac{1}{1 + e^{-z}}.$$
Its graph looks like this:

[Figure: the S-shaped curve of the logistic function g(z)]

We can see that it maps (-∞, +∞) onto (0, 1).
The hypothesis function gives the probability that the feature belongs to class y = 1:

$$P(y = 1 \mid x; \theta) = h_\theta(x), \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x).$$
When deciding which class a new feature x belongs to, we only need to compute h_θ(x): if it is greater than 0.5 the class is y = 1, and otherwise y = 0.
Looking at h_θ(x) again, we find that it depends only on θᵀx: if θᵀx > 0 then h_θ(x) > 0.5. The function g(z) is merely a mapping; the real decision about the class is still made by θᵀx. Furthermore, when θᵀx ≫ 0 we have h_θ(x) ≈ 1, and when θᵀx ≪ 0 we have h_θ(x) ≈ 0. If we work with θᵀx directly, the goal of the model is simply that θᵀx ≫ 0 for the features with y = 1 in the training data and θᵀx ≪ 0 for the features with y = 0. Logistic regression learns θ so that the features of the positive examples score far greater than 0 and those of the negative examples far less than 0, and it emphasizes achieving this goal on all training instances.
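To make this concrete, here is a minimal Python/NumPy sketch (with made-up numbers, not taken from the original notes) of the logistic hypothesis; it illustrates that the predicted class depends only on the sign of θᵀx, since g(θᵀx) > 0.5 exactly when θᵀx > 0:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z}), mapping (-inf, inf) to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h_theta(theta, x):
    """Logistic regression hypothesis: estimated probability that y = 1."""
    return sigmoid(theta @ x)

# Toy example with made-up numbers (purely illustrative).
theta = np.array([2.0, -1.0, 0.5])
x = np.array([1.0, 0.3, 2.0])   # x[0] = 1 plays the role of the intercept term

score = theta @ x                # theta^T x
prob = h_theta(theta, x)         # g(theta^T x)

# The predicted class depends only on the sign of theta^T x:
# g(theta^T x) > 0.5  <=>  theta^T x > 0.
print(score > 0, prob > 0.5)     # both print the same boolean
```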
The graphical representation is as follows:
The middle line is the decision boundary θᵀx = 0; logistic regression emphasizes that all points should be as far from it as possible, and the learned result is that line. Consider the three points A, B and C in the figure. We can be certain that A belongs to the × class, we are not very sure about C, and B we can still determine fairly well. From this we can conclude that we should care more about the points close to the separating line and make them as far from it as possible, rather than seeking the optimum over all points. Doing the latter would mean letting some points move closer to the line in exchange for other points moving farther away. I think this is the difference between the idea of SVM and that of logistic regression: one is a local concern (it does not care about points already determined to be far away), the other is a global consideration (a point that is already far away may be pushed even farther by adjusting the line). This is my personal understanding.
3. Formal Representation
The result labels we use this time are y = −1 and y = 1, replacing the y = 0 and y = 1 used in logistic regression. At the same time we replace θ with w and b. Previously we had θᵀx = θ₀ + θ₁x₁ + ⋯ + θₙxₙ, where x₀ = 1 was assumed. Now we replace θ₀ with b, and the rest θ₁x₁ + ⋯ + θₙxₙ with wᵀx (that is, w = [θ₁, …, θₙ]ᵀ), so that

$$\theta^T x = w^T x + b, \qquad h_{w,b}(x) = g(w^T x + b).$$

In other words, apart from y being changed from y = 0 to y = −1, which is only a difference in labeling, there is no difference from the formal representation of logistic regression. Let us now make the hypothesis function explicit.
As mentioned in the previous section, we only need to care whether the score (now wᵀx + b) is positive or negative, not about the value of g(z). Therefore we simplify g(z) here and map it directly to y = −1 and y = 1. The mapping is

$$g(z) = \begin{cases} 1, & z \ge 0, \\ -1, & z < 0. \end{cases}$$
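As a small illustrative sketch (the values are made up), the simplified hypothesis just thresholds the score wᵀx + b at zero:

```python
import numpy as np

def g(z):
    """Simplified g: map the score directly to the class labels -1 and +1."""
    return 1 if z >= 0 else -1

def h_wb(w, b, x):
    """SVM hypothesis h_{w,b}(x) = g(w^T x + b)."""
    return g(np.dot(w, x) + b)

# Illustrative values only.
w, b = np.array([1.0, -2.0]), 0.5
print(h_wb(w, b, np.array([3.0, 1.0])))   # w^T x + b = 1.5  ->  +1
print(h_wb(w, b, np.array([0.0, 1.0])))   # w^T x + b = -1.5 ->  -1
```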
4. Functional Margin and Geometric Margin
Given a training sample (x^(i), y^(i)), where x is the feature, y is the result label, and i denotes the i-th sample, we define the functional margin as

$$\hat{\gamma}^{(i)} = y^{(i)} \left( w^T x^{(i)} + b \right).$$
Notice that when y^(i) = 1, our definition of g(z) requires wᵀx^(i) + b ≥ 0, so the functional margin is actually |wᵀx^(i) + b|, and likewise when y^(i) = −1. To make the functional margin large (that is, to be more confident about whether the example is positive or negative), wᵀx^(i) + b should be a large positive number when y^(i) = 1 and a large negative number when y^(i) = −1. The functional margin therefore expresses how confidently we judge the feature to be a positive or a negative example.
Next consider w and b. If we scale w and b simultaneously, for example multiplying both by a factor of 2, the functional margin of every point also doubles. This should not affect the problem we are solving, because what we actually solve for is the hyperplane wᵀx + b = 0, and scaling w and b together has no effect on it. So, to constrain w and b, we may need to add a normalization condition; after all, the goal is to determine one unique w and b, not a whole family of linearly related vectors. We will come back to this normalization later.
The functional margin just defined is for a single sample. We now define the functional margin over the whole training set as

$$\hat{\gamma} = \min_{i = 1, \ldots, m} \hat{\gamma}^{(i)}.$$
Put plainly, it is the functional margin of the training sample that is classified, as positive or negative, with the least confidence.
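Here is a small numeric sketch (Python, with a made-up toy data set) of the per-sample functional margins and the global one; it also shows that scaling (w, b) scales every functional margin even though the decision boundary is unchanged:

```python
import numpy as np

def functional_margins(w, b, X, y):
    """Per-sample functional margins  gamma_hat_i = y_i * (w^T x_i + b)."""
    return y * (X @ w + b)

# Tiny illustrative training set (labels must be +1 / -1).
X = np.array([[2.0, 3.0], [1.0, 1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1])
w, b = np.array([1.0, 1.0]), -1.0

gammas = functional_margins(w, b, X, y)
print(gammas)            # per-sample functional margins
print(gammas.min())      # global functional margin = the smallest one

# Scaling (w, b) by 2 doubles every functional margin,
# even though the decision boundary w^T x + b = 0 is unchanged.
print(functional_margins(2 * w, 2 * b, X, y))
```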
Next we define the geometric margin. First look at the figure (a separating hyperplane, a point A, and its projection B onto the hyperplane).
Suppose we are given the separating hyperplane wᵀx + b = 0 on which point B lies. For any other point, say A, let γ^(i) denote its distance to this hyperplane, and let B be the projection of A onto the hyperplane. We know that the vector BA points in the direction of w (the normal, or gradient, of the separating hyperplane), whose unit vector is w / ||w||. Point A is x^(i), so point B is x = x^(i) − γ^(i) · w / ||w|| (using junior-high-school geometry). Since B lies on the hyperplane, it satisfies wᵀx + b = 0, that is,

$$w^T \left( x^{(i)} - \gamma^{(i)} \frac{w}{\|w\|} \right) + b = 0.$$

Solving for γ^(i) gives

$$\gamma^{(i)} = \frac{w^T x^{(i)} + b}{\|w\|} = \left( \frac{w}{\|w\|} \right)^T x^{(i)} + \frac{b}{\|w\|}.$$

This is in fact just the distance from the point to the hyperplane. In another, more elegant way (which also covers negative examples):

$$\gamma^{(i)} = y^{(i)} \left( \left( \frac{w}{\|w\|} \right)^T x^{(i)} + \frac{b}{\|w\|} \right).$$
When ||w|| = 1, isn't this exactly the functional margin? Yes: the geometric margin is the normalized version of the functional margin mentioned earlier. Why are they the same? Because the functional margin was defined by us, and the definition already carries a geometric-margin flavor. Likewise, if we scale w and b simultaneously, ||w|| scales by the same factor, so the geometric margin is unaffected. Similarly, we define the global geometric margin as

$$\gamma = \min_{i = 1, \ldots, m} \gamma^{(i)}.$$
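A matching sketch for the geometric margin (same made-up data as above) shows that it is the functional margin divided by ||w|| and that it is invariant to scaling (w, b):

```python
import numpy as np

def geometric_margins(w, b, X, y):
    """Per-sample geometric margins: the functional margin divided by ||w||."""
    return y * (X @ w + b) / np.linalg.norm(w)

X = np.array([[2.0, 3.0], [1.0, 1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1])
w, b = np.array([1.0, 1.0]), -1.0

print(geometric_margins(w, b, X, y).min())          # global geometric margin
# Unlike the functional margin, scaling (w, b) leaves it unchanged:
print(geometric_margins(10 * w, 10 * b, X, y).min())
```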
5. Optimal Margin Classifier
Recall that our goal, stated above, is to find a hyperplane such that the points close to it have as large a gap from it as possible. That is, we do not require all points to stay far from the hyperplane; we only care that the hyperplane maximizes the distance to the points closest to it. Pictorially, think of the figure above as a sheet of paper: we are looking for a fold line such that, after folding along it, the points nearest the crease are farther from it than they would be for any other crease. Formally:

$$\max_{\gamma, w, b} \ \gamma \quad \text{s.t.} \quad y^{(i)}\left(w^T x^{(i)} + b\right) \ge \gamma, \ i = 1, \ldots, m, \qquad \|w\| = 1.$$

Here the constraint ||w|| = 1 normalizes w so that wᵀx + b is the geometric margin.
At this point the model is defined. Once w and b are obtained, we can classify a new feature x using h_{w,b}(x); this is called the optimal margin classifier. The next question is how to solve for w and b.
Because this is not a convex problem, we first transform it. Using the relationship between the geometric margin and the functional margin, γ = γ̂ / ||w||, we rewrite the formula above as

$$\max_{\hat{\gamma}, w, b} \ \frac{\hat{\gamma}}{\|w\|} \quad \text{s.t.} \quad y^{(i)}\left(w^T x^{(i)} + b\right) \ge \hat{\gamma}, \ i = 1, \ldots, m.$$
The maximum we compute at this point is still the geometric margin, only now w is no longer constrained by ||w|| = 1. However, the objective function is still not convex and cannot be fed directly into optimization software, so we must rewrite it once more. As mentioned above, scaling w and b simultaneously has no effect on the result, yet what we ultimately want are the definite values of w and b, not a family of multiples of them. We therefore impose a restriction to guarantee that the solution is unique; for convenience we take γ̂ = 1. The meaning of this choice is that the global functional margin is defined to be 1, i.e., the distance from the points closest to the hyperplane is 1/||w||. Since maximizing 1/||w|| is equivalent to minimizing ½||w||², the rewritten problem becomes

$$\min_{w, b} \ \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y^{(i)}\left(w^T x^{(i)} + b\right) \ge 1, \ i = 1, \ldots, m.$$
This form is much nicer: it has only linear constraints, and the objective is a quadratic function of the variables, so it is a typical quadratic programming (QP) problem that can be handed to standard optimization software.
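For example, here is a sketch (an assumption of mine, not the approach the notes will take later) that hands this quadratic program to a general-purpose constrained optimizer, SciPy's SLSQP, on a tiny made-up, linearly separable data set; the optimization variable stacks w and b as v = [w, b]:

```python
import numpy as np
from scipy.optimize import minimize

# Linearly separable toy data (made up); labels are +1 / -1.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = X.shape[1]

def objective(v):
    # v = [w_1, ..., w_n, b]; minimize (1/2) ||w||^2.
    w = v[:n]
    return 0.5 * np.dot(w, w)

def margin_constraints(v):
    # Require y_i (w^T x_i + b) - 1 >= 0 for every training example.
    w, b = v[:n], v[n]
    return y * (X @ w + b) - 1.0

res = minimize(objective,
               x0=np.zeros(n + 1),
               constraints=[{"type": "ineq", "fun": margin_constraints}],
               method="SLSQP")

w, b = res.x[:n], res.x[n]
print("w =", w, "b =", b)
print("margins:", y * (X @ w + b))   # all >= 1 (up to solver tolerance)
```

Since the closest points end up with functional margin 1, the geometric margin of the trained classifier is simply 1/||w||.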
You will notice that although these notes do not, like other lecture notes, begin by drawing the classification hyperplane and marking the margin on a figure so intuitively, every step of the derivation is well justified, and the objective function and constraints are obtained through a smooth flow of ideas.
Next we will introduce how to solve this problem by hand, which leads to an even better solution method.