Support Vector Machine SVM (Part 1)

(For reprints, please cite the source: http://www.cnblogs.com/jerrylead)

1 Introduction
The support vector machine is arguably the best off-the-shelf supervised learning algorithm. My first contact with SVMs was last summer, when our teacher asked us to hand in a report on "Statistical Learning Theory"; I found a popular introductory tutorial online, but at the time only picked up a rough idea of the related concepts. The learning materials offered by Stanford have now let me re-learn the SVM. Most orthodox treatments I have seen start from VC-dimension theory and the structural risk minimization principle and then arrive at the SVM, while other materials jump straight to separating hyperplanes. These notes instead derive the SVM from the logistic regression of the previous sections, which not only reveals the relationship between the two models but also makes the transition feel more natural.
2 Re-examining logistic regression
The purpose of logistic regression is to learn a 0/1 classification model from the features. This model takes a linear combination of the attributes as its argument, and since that argument ranges from negative infinity to positive infinity, the logistic (sigmoid) function is used to map it into (0,1); the mapped value is interpreted as the probability of belonging to the class y = 1.
Formally, the hypothesis function is

$h_\theta(x) = g(\theta^T x) = \dfrac{1}{1 + e^{-\theta^T x}}$

where x is the n-dimensional feature vector and g is the logistic function $g(z) = \dfrac{1}{1 + e^{-z}}$.
The graph of g(z) is the familiar S-shaped sigmoid curve; as it shows, the whole real line $(-\infty, +\infty)$ is mapped into (0,1).
The hypothesis $h_\theta(x)$ is the probability that the feature vector x belongs to the class y = 1, i.e. $P(y = 1 \mid x; \theta) = h_\theta(x)$ and $P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$.
When we want to decide which class a new feature vector belongs to, we only need to compute $h_\theta(x)$: if it is greater than 0.5 the example is classified as y = 1, otherwise as y = 0.
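As a concrete illustration, here is a minimal sketch (not part of the original notes) of the logistic regression hypothesis and its 0.5-threshold decision rule; the parameter vector theta and the sample x below are made-up values used purely for demonstration.

```python
import numpy as np

def sigmoid(z):
    # logistic function g(z) = 1 / (1 + e^{-z}), maps (-inf, +inf) into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def h_theta(theta, x):
    # hypothesis: probability that x belongs to class y = 1
    return sigmoid(np.dot(theta, x))

# made-up parameters and sample, purely for illustration
theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, 0.8])       # x[0] = 1 is the intercept term

p = h_theta(theta, x)
prediction = 1 if p > 0.5 else 0    # 0.5-threshold decision rule
print(p, prediction)
```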
Looking at $h_\theta(x)$ again, we find that it depends only on $\theta^T x$: if $\theta^T x > 0$ then $h_\theta(x) > 0.5$, so g(z) merely performs the mapping, and the real decision power lies with $\theta^T x$. Moreover, when $\theta^T x \gg 0$, $h_\theta(x) \approx 1$, and when $\theta^T x \ll 0$, $h_\theta(x) \approx 0$. If we start from $\theta^T x$ alone, the goal we want the model to achieve is simply that training examples with y = 1 have $\theta^T x \gg 0$, while examples with y = 0 have $\theta^T x \ll 0$. Logistic regression learns $\theta$ so that the features of positive examples are far greater than 0 and those of negative examples are far less than 0, and it emphasizes achieving this goal on all training instances.
The graphical representation is as follows:
The middle line is $\theta^T x = 0$; logistic regression emphasizes that all points should be as far as possible from this middle line, and the learned result is exactly this line. Consider the three points A, B and C above. We can be sure that A belongs to the × class, whereas for C we are not so sure, and B can still be determined. From this we can conclude that we should care more about the points near the middle dividing line and make them as far from it as possible, rather than optimizing over all points; doing the latter would pull some points closer to the middle line in exchange for pushing other points, which are already far away, even farther. I think this is where the idea of the support vector machine differs from logistic regression: one (the SVM) considers the local picture (it does not care about points that are already confidently far away), the other (logistic regression) considers the global picture (points that are already far away may still shift the middle line so as to get even farther). This is my personal intuitive understanding.
3 Formal representations
The result labels we use this time are y = -1 and y = 1, replacing the y = 0 and y = 1 used in logistic regression, and we replace $\theta$ with w and b. Previously we had $\theta^T x = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$, where $x_0 = 1$ was assumed. Now we replace $\theta_0$ with b and replace $\theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$ with $w_1 x_1 + w_2 x_2 + \dots + w_n x_n$ (i.e. $w^T x$). In this way we have $\theta^T x = w^T x + b$, and further $h_\theta(x) = g(\theta^T x) = g(w^T x + b)$. In other words, apart from y changing from y = 0 to y = -1, this is only a change of notation and is no different from the formal representation of logistic regression. Let us also state the hypothesis function explicitly:

$h_{w,b}(x) = g(w^T x + b)$
The previous section mentioned that we only need to consider the sign of $\theta^T x$ rather than g(z) itself, so here we simplify g(z) and map it directly to y = -1 and y = 1. The mapping relationship is:

$g(z) = 1$ if $z \ge 0$, and $g(z) = -1$ otherwise.
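As a small illustration (with made-up w, b and x, not taken from these notes), the SVM-style hypothesis simply reports the sign of $w^T x + b$:

```python
import numpy as np

def h_wb(w, b, x):
    # SVM-style hypothesis: map w^T x + b directly to the label -1 or +1
    return 1 if np.dot(w, x) + b >= 0 else -1

# made-up parameters and samples, purely for illustration
w = np.array([1.0, -2.0])
b = 0.5
print(h_wb(w, b, np.array([3.0, 1.0])))   # -> 1
print(h_wb(w, b, np.array([0.0, 2.0])))   # -> -1
```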
4 Functional margin and geometric margin
Given a training sample $(x^{(i)}, y^{(i)})$, where x is the feature vector, y is the result label, and i denotes the i-th sample, we define its functional margin as:

$\hat{\gamma}^{(i)} = y^{(i)} (w^T x^{(i)} + b)$
As one would expect, when $y^{(i)} = 1$, by our definition of g(z) we have $w^T x^{(i)} + b \ge 0$, and the value of $\hat{\gamma}^{(i)}$ is actually $|w^T x^{(i)} + b|$; the opposite case is analogous. To make the functional margin large (i.e. to have greater confidence in deciding whether the example is positive or negative), $w^T x^{(i)} + b$ should be a large positive number when $y^{(i)} = 1$, and a large negative number otherwise. Therefore the functional margin represents how confidently we consider the example to be positive or negative.
Now consider w and b: if we scale both w and b, for instance by a factor of 2, then the functional margin of every point is also doubled, yet this should have no effect on the problem we are solving, because what we solve for is the hyperplane $w^T x + b = 0$, and scaling w and b together leaves it unchanged. Thus, in order to pin down w and b, we may need to add a normalization condition; after all, the goal is to determine a unique w and b, not a whole family of linearly related vectors. This normalization will be considered later.
The functional margin we have just defined is for a single sample; we now define the functional margin over the whole training set as

$\hat{\gamma} = \min_{i = 1, \dots, m} \hat{\gamma}^{(i)}$

To put it plainly, it is the smallest functional margin obtained when classifying the positive and negative examples of the training set, i.e. the margin of the least confidently classified sample.
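Here is a minimal sketch of computing the per-sample and global functional margins, using made-up w, b and toy data (none of it from the original notes):

```python
import numpy as np

def functional_margins(w, b, X, y):
    # per-sample functional margin: gamma_hat_i = y_i * (w^T x_i + b)
    return y * (X @ w + b)

# toy data: rows of X are samples, labels are +1 / -1
X = np.array([[2.0, 2.0], [3.0, 2.5], [0.0, 0.5], [1.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([1.0, 1.0]), -3.0     # made-up hyperplane parameters

gammas = functional_margins(w, b, X, y)
print(gammas)            # functional margin of each sample
print(gammas.min())      # global functional margin: the least confident sample

# scaling w and b together scales every functional margin by the same factor
print(functional_margins(2 * w, 2 * b, X, y))
```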
Next we define the geometric margin. First look at the diagram, which shows the separating hyperplane, a point A, and its projection B onto the hyperplane:
Suppose we have the separating hyperplane $w^T x + b = 0$ on which point B lies. For any other point, such as A, let $\gamma^{(i)}$ denote its distance to the hyperplane, and let B be the projection of A onto the hyperplane. We know that the vector BA points in the direction of w (the gradient of the hyperplane, i.e. its normal), whose unit vector is $w / \|w\|$. Point A is $x^{(i)}$, so point B is $x = x^{(i)} - \gamma^{(i)} \cdot w / \|w\|$ (elementary geometry). Substituting into $w^T x + b = 0$ gives

$w^T \left( x^{(i)} - \gamma^{(i)} \frac{w}{\|w\|} \right) + b = 0$

and solving further yields

$\gamma^{(i)} = \frac{w^T x^{(i)} + b}{\|w\|} = \left( \frac{w}{\|w\|} \right)^T x^{(i)} + \frac{b}{\|w\|}$

This $\gamma^{(i)}$ is in fact the distance from the point to the hyperplane. There is a more elegant way of writing it (which also covers points on the negative side):

$\gamma^{(i)} = y^{(i)} \left( \left( \frac{w}{\|w\|} \right)^T x^{(i)} + \frac{b}{\|w\|} \right)$
When $\|w\| = 1$, isn't this exactly the functional margin? Yes: the normalized functional margin mentioned earlier is the geometric margin. Why are they the same? Because the functional margin was defined by us, and the flavor of the geometric margin was already there in its definition. Likewise, if we scale w and b together, $\|w\|$ is scaled by the same factor, so the geometric margin is unaffected. We also define the global geometric margin

$\gamma = \min_{i = 1, \dots, m} \gamma^{(i)}$
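Continuing the sketch above, the geometric margin is just the functional margin divided by $\|w\|$, and it is unchanged when w and b are scaled together (again with made-up parameters and toy data):

```python
import numpy as np

def geometric_margins(w, b, X, y):
    # per-sample geometric margin: gamma_i = y_i * (w^T x_i + b) / ||w||
    return y * (X @ w + b) / np.linalg.norm(w)

X = np.array([[2.0, 2.0], [3.0, 2.5], [0.0, 0.5], [1.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([1.0, 1.0]), -3.0

print(geometric_margins(w, b, X, y).min())            # global geometric margin
print(geometric_margins(2 * w, 2 * b, X, y).min())    # identical: scaling has no effect
```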
5 Optimal margin classifier
Recall that our goal, mentioned earlier, is to look for a hyperplane such that the points close to it have as large a gap as possible. That is, we do not require all points to be far from the hyperplane; we only care that the hyperplane we find maximizes the distance to the points nearest to it. Pictorially, think of the figure above as a sheet of paper: we want to find a fold line such that, after folding along it, the distance from the nearest points to the fold is larger than for any other fold line. Formally:

$\max_{\gamma, w, b} \ \gamma \quad \text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \ge \gamma, \ i = 1, \dots, m, \quad \|w\| = 1$

Here $\|w\| = 1$ is used to constrain w, so that the functional margin $y^{(i)}(w^T x^{(i)} + b)$ equals the geometric margin.
At this point we have defined the model. If w and b are obtained, then we can classify any feature vector x; this is called the optimal margin classifier. The question that follows is how to solve for w and b.
Since the constraint $\|w\| = 1$ is not convex, we would like to transform the problem first. Using the relationship between the geometric margin and the functional margin, $\gamma = \hat{\gamma} / \|w\|$, we rewrite the problem above as:

$\max_{\hat{\gamma}, w, b} \ \frac{\hat{\gamma}}{\|w\|} \quad \text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \ge \hat{\gamma}, \ i = 1, \dots, m$
What we are maximizing is in fact still the geometric margin, only now w is no longer constrained by $\|w\| = 1$. However, the objective function is still not convex and cannot be fed directly into optimization software, so we rewrite it once more. We said earlier that scaling w and b together has no effect on the result, but what we ultimately want is a determined pair of values for w and b, not a family of multiples, so we need to impose a restriction to make the solution unique. For simplicity we take $\hat{\gamma} = 1$. The meaning is that the global functional margin is defined to be 1, i.e. the distance from the hyperplane to its closest points is defined to be $1 / \|w\|$. Since maximizing $1 / \|w\|$ is equivalent to minimizing $\frac{1}{2} \|w\|^2$, the rewritten problem is:

$\min_{w, b} \ \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \ge 1, \ i = 1, \dots, m$
Now the constraints are linear and the objective is a quadratic function of the variables, so this is a typical quadratic programming (QP) problem that can be handed to off-the-shelf optimization software.
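To make the "hand it to optimization software" step concrete, here is a minimal sketch (not from the original notes) that feeds exactly this QP to the cvxopt solver; the toy data is made up and assumed to be linearly separable:

```python
# Hard-margin SVM primal:  min_{w,b} (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1
import numpy as np
from cvxopt import matrix, solvers

solvers.options['show_progress'] = False

# toy linearly separable data: labels must be +1 / -1
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],    # positive class
              [0.0, 0.5], [0.5, 0.0], [1.0, 0.5]])   # negative class
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
m, n = X.shape

# variable z = [w_1, ..., w_n, b]; objective (1/2) z^T P z with P = diag(1, ..., 1, 0)
P = np.zeros((n + 1, n + 1))
P[:n, :n] = np.eye(n)
P[n, n] = 1e-8                       # tiny regularizer so the solver sees a positive definite matrix
q = np.zeros(n + 1)

# constraints y_i (w^T x_i + b) >= 1 rewritten as G z <= h
G = -y[:, None] * np.hstack([X, np.ones((m, 1))])
h = -np.ones(m)

sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
z = np.array(sol['x']).ravel()
w, b = z[:n], z[n]
print("w =", w, "b =", b)
print("geometric margin =", 1.0 / np.linalg.norm(w))
```

The variable vector stacks w and b; each constraint $y^{(i)}(w^T x^{(i)} + b) \ge 1$ becomes one row of $Gz \le h$, and the recovered $1 / \|w\|$ is the geometric margin of the trained classifier.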
Up to this point, although these notes do not, like other handouts, first draw a nice picture of the separating hyperplane with the margins marked so intuitively, every step of the derivation is reasonable, and the objective function and constraints are obtained by following the flow of ideas.
What comes next is a method for solving this problem by hand, and a better way of solving it.
"Reprint" Support Vector machine SVM (i.)