Machine Learning Algorithms and Python Practice (3): Advanced Support Vector Machine (SVM)
http://blog.csdn.net/zouxy09
This series, Machine Learning Algorithms and Python Practice, is based mainly on the book Machine Learning in Action. Since I want to learn Python and, at the same time, understand some commonly used machine learning algorithms, I decided to implement a few of them in Python. The book happens to match this goal exactly, so I follow its progression as I learn.
In this part we review the support vector machine systematically and then implement it in Python. Because there is a lot of material, it is split into three posts: the first introduces the basics of SVM, the second (this one) goes further and straightens out the whole SVM knowledge chain, and the third presents the Python implementation. There are many excellent blog posts on SVM; see the references and recommended reading listed in this article. The goal of this post is to tie the whole SVM knowledge chain together, so it does not go into the details of the derivations. Very good derivations are available online and in many books, which you can consult for more depth.
Contents
I. Introduction
II. Linearly separable SVM and hard margin maximization
III. The dual optimization problem
3.1 The dual problem
3.2 The dual problem of the SVM optimization
IV. Slack variables and soft margin maximization
V. Kernel functions
VI. Multi-class classification with SVM
6.1 The "one-against-all" method
6.2 The "one-against-one" method
VII. Analysis of the KKT conditions
VIII. The SMO algorithm for implementing SVM
8.1 The coordinate descent algorithm
8.2 The principle of the SMO algorithm
8.3 Python implementation of the SMO algorithm
IX. References and recommended reading
V. Kernel functions
Suppose the actual distribution of our samples looks like the figure on the left. This is not the case discussed above, where a few stubborn outliers make an essentially linear problem look non-separable; here the data are genuinely not linearly separable. The distribution of the samples themselves is simply like this. If we forced a linear decision boundary through it with slack variables, the resulting classification surface would clearly be very poor. What then? SVM is effective for linearly separable data, so what is the best way to handle this case? It is time for the kernel trick.
As shown on the right, if we can transform our original sample points into another feature space in which they are linearly separable, then the SVM described above works without trouble. In other words, for non-separable data we now need to do two things:
1) First use a nonlinear mapping φ(x) to transform all the original data x into another feature space, in which the samples become linearly separable;
2) Then apply the SVM in that feature space to learn the classifier.
The second job needs no further comment; it is exactly the same as before. But who does the first, heavy job? How do we know which transformation maps our data into a linearly separable arrangement? The data dimensions are so high that we cannot see them. Moreover, will this transformation make the optimization in the second step more complicated and more expensive to compute? For the first question there is the well-known Cover's theorem: a complex pattern-classification problem cast nonlinearly into a high-dimensional space is more likely to be linearly separable than in a low-dimensional space. That sounds easy: find a mapping that sends all samples into a higher-dimensional space. Unfortunately, it is hard to find this mapping function explicitly. However, the support vector machine does not look for or compute this complicated nonlinear transformation directly; instead it realizes the transformation indirectly through a clever detour: the kernel function, which achieves the goal without adding much computational burden, the best of both worlds. Look back at the SVM optimization problem above:
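In the usual notation (with α_i the Lagrange multipliers), the soft-margin dual problem from the previous post has the form:

$$\max_{\alpha}\;\sum_{i=1}^{n}\alpha_i-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j \langle x_i, x_j\rangle \qquad \text{s.t.}\;\; \sum_{i=1}^{n}\alpha_i y_i = 0,\;\; 0\le\alpha_i\le C$$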
Notice that the samples x appear only through the inner product of two samples i and j.
The classification decision function likewise needs only inner products of two samples. In other words, both training the SVM and using the SVM rely on inner products of samples, and only on inner products. So if we can find a way to compute the value of the inner product of two samples after they are mapped into the high-dimensional space, we are done. The kernel function accomplishes exactly this mission:
K(x_i, x_j) = φ(x_i)^T φ(x_j)
That is, the inner product φ(x_i)^T φ(x_j) of the high-dimensional images of the two samples x_i and x_j is computed by a kernel function K(x_i, x_j), without our ever knowing the transformation φ(x) itself. And the kernel function is very simple to compute; a commonly used one is the radial basis function (RBF) kernel: K(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²)).
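For concreteness, here is a minimal NumPy sketch of the RBF kernel and the pairwise kernel matrix (the function names and the default σ are my own choices, not from the original post):

```python
import numpy as np

def rbf_kernel(xi, xj, sigma=1.0):
    """RBF (Gaussian) kernel between two sample vectors.

    Equals the inner product phi(xi)^T phi(xj) in an (infinite-dimensional)
    feature space, but is computed entirely in the original space.
    """
    diff = xi - xj
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def kernel_matrix(X, sigma=1.0):
    """Pairwise kernel matrix K[i, j] = K(x_i, x_j) for X of shape (n_samples, n_features)."""
    n = X.shape[0]
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = rbf_kernel(X[i], X[j], sigma)
    return K
```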
At this point, the dual of our optimization problem becomes:
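In the same notation, with the kernel K replacing the inner product:

$$\max_{\alpha}\;\sum_{i=1}^{n}\alpha_i-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(x_i, x_j) \qquad \text{s.t.}\;\; \sum_{i=1}^{n}\alpha_i y_i = 0,\;\; 0\le\alpha_i\le C$$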
The only difference from the previous optimization problem is that the inner product between samples is replaced by the kernel function; the optimization procedure itself is unchanged. The decision function becomes:
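In the standard notation this reads:

$$f(x)=\operatorname{sgn}\!\left(\sum_{i=1}^{n}\alpha_i y_i K(x_i, x)+b\right)$$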
That is, the new sample x is compared with all our training samples through the kernel function. Note that since the Lagrange multipliers α of most samples are 0, we only need to compute the kernel between the new sample and a small number of training samples (the support vectors), take the weighted sum, and read off the sign to classify the new sample x. The decision process of the support vector machine can therefore be viewed as a similarity comparison: the input sample is compared with a series of template samples, where the templates are the support vectors determined during training and the similarity measure is the kernel function. The comparisons with the support vectors are then weighted and summed, with weights given by the coefficient α of each support vector multiplied by its class label, both obtained during training. Finally the decision is made according to the size of the weighted sum. Using different kernel functions amounts to using different similarity measures.
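A minimal sketch of this decision process, assuming the support vectors, their labels, their coefficients α and the bias b have already been obtained from training (for example by the SMO implementation in the next part); the RBF kernel is computed inline and all names are my own:

```python
import numpy as np

def svm_predict(x_new, support_vectors, sv_labels, sv_alphas, b, sigma=1.0):
    """Classify one new sample by comparing it (via the RBF kernel) with the
    support vectors only, then taking the sign of the weighted sum."""
    total = 0.0
    for sv, y, alpha in zip(support_vectors, sv_labels, sv_alphas):
        diff = x_new - sv
        k = np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))  # K(x_new, sv)
        total += alpha * y * k                                # weight = alpha * label
    return np.sign(total + b)
```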
From the computational point of view, no matter how high the dimension of the space produced by the transformation φ(x), even infinite-dimensional (as for the RBF kernel), the linear support vector machine in that space can be solved in the original space through the kernel function, so explicit computation in the high-dimensional space is avoided. And the cost of evaluating the kernel function is essentially no greater than that of computing the inner product of the original samples.
At this point one cannot help but marvel. Why does it "happen" that wherever SVM needs the data, the vectors always appear in the form of inner products? Why does it "happen" that there exist kernel functions that simplify the inner product computation in the mapped space? Why does it "happen" that most samples contribute 0 to the decision boundary? ... Whether we should thank the heavens or the vast community of researchers, we mortals get to glimpse such exquisite and incomparable mathematical beauty!
With this, the material on support vector machines proper is finished. To summarize: the basic idea of the support vector machine is to first transform the input space into a high-dimensional space by a nonlinear transformation, and then find the maximum-margin classification surface in this new space as the optimal classifier, where the nonlinear transformation is realized by defining an appropriate inner-product kernel function. SVM is in fact built on statistical learning theory and the principle of structural risk minimization, which demands two things: 1) the two classes can be separated (minimum empirical risk); and 2) the margin is maximized (minimum upper bound on the risk). That is, within the subset of functions that keep the empirical risk minimal, it also minimizes the bound on the true risk, not just the empirical risk.
VI. Multi-class classification with SVM
SVM is inherently a two-class classifier: it only answers whether a sample belongs to the positive class or the negative class. In practice, however, the problems we need to solve are usually multi-class. So how do we build a multi-class classifier out of two-class classifiers?
6.1 The "one-against-all" method
The one-against-all method is the easier one to think of: solve one two-class classification problem at a time. For example, with our 5 categories, the first time we label the samples of category 1 as positive and the samples of categories 2, 3, 4, 5 as negative, and train a two-class classifier that tells whether a sample is or is not of class 1. The second time we label the samples of category 2 as positive and the samples of categories 1, 3, 4, 5 as negative to get another classifier, and so on, until we have 5 such two-class classifiers (always as many as there are categories). When a sample needs to be classified, we take it to each classifier and ask: does it belong to you? Whichever classifier nods and says yes determines the category. The benefit of this approach is that each optimization problem is small and classification is fast (only 5 classifiers need to be called to get the result). But two rather embarrassing situations can occur: after asking all classifiers, either every classifier claims the sample belongs to its class, or every classifier says the sample does not belong to its class. The former is called classification overlap, the latter the unclassifiable case. Classification overlap is manageable: picking any one of the claimants is not too outrageous, or, better, compare the sample's distance to each hyperplane and award it to the farthest one. The unclassifiable case is genuinely awkward; the sample can only be assigned to a 6th, "none of the above" category. Worse still, although the original categories may have similar numbers of samples, the "rest" side is always several times larger than the positive class (because it is the union of all the other classes), which artificially creates the "data set skew" problem described in the previous section.
As shown on the left: the red classifier separates red from the other two colors, the green classifier separates green from the other two, and the blue classifier separates blue from the other two.
The class of a point is then determined by its distances to the three decision boundaries, because the farther a point lies from a classification surface, the more credible the classification. This distance is signed, as shown below:
For example, in the left figure the star is assigned to the green class, while in the right figure the star is assigned to the brown class.
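To make the one-against-all scheme just described concrete, here is a minimal NumPy sketch; the binary trainer `train_binary_svm` and the model's `decision_value` method are placeholders for any two-class SVM implementation (for example the SMO trainer in the next part), not part of the original post:

```python
import numpy as np

def train_one_vs_rest(X, y, classes, train_binary_svm):
    """One-against-all: train one binary SVM per class, treating that class
    as positive and all remaining classes as negative."""
    models = {}
    for c in classes:
        labels = np.where(y == c, 1, -1)      # class c vs. the rest
        models[c] = train_binary_svm(X, labels)
    return models

def predict_one_vs_rest(x, models):
    """Pick the class whose classifier reports the largest signed distance;
    this also resolves the 'classification overlap' case mentioned above."""
    return max(models, key=lambda c: models[c].decision_value(x))
```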
6.2 The "one-against-one" method
The one-against-one method also takes the samples of one class as positive each time, but the negative samples now come from just one other class at a time (a "one-on-one duel", hence the name), which avoids the skew problem. The procedure is to train a number of such classifiers: the first one answers "class 1 or class 2", the second answers "class 1 or class 3", the third answers "class 1 or class 4", and so on. You can quickly see that with 5 classes there are 5 × 4 / 2 = 10 such classifiers (in general, with k classes the number of pairwise classifiers is k(k−1)/2). Although there are many classifiers, the total time spent in the training phase (i.e. computing the classification surfaces of all these classifiers) is much less than with the one-against-all approach. When a sample is actually classified, it is given to all the classifiers: the first votes "1" or "2", the second votes "1" or "3", and so on; each casts its vote and the votes are tallied at the end; if class "1" gets the most votes, the sample is assigned to class 1. This approach obviously still has classification overlap, but there is no unclassifiable case, since it is impossible for every class to receive 0 votes. As in the figure on the right, for the purple block in the middle each class gets 1 vote, so we do not know which class to choose and can only throw it into one of them (or again measure the point's distance to the three decision boundaries, since the larger the distance to a classification surface, the more credible the classification); whether we throw it right or wrong is a matter of luck.
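A corresponding sketch of the one-against-one scheme with pairwise training and majority voting; as before, `train_binary_svm` and `decision_value` are placeholders for any two-class SVM:

```python
import numpy as np
from itertools import combinations

def train_one_vs_one(X, y, classes, train_binary_svm):
    """One-against-one: train k*(k-1)/2 classifiers, one per pair of classes."""
    models = {}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)                 # keep only the two classes
        labels = np.where(y[mask] == a, 1, -1)     # class a vs. class b
        models[(a, b)] = train_binary_svm(X[mask], labels)
    return models

def predict_one_vs_one(x, models):
    """Each pairwise classifier votes for one of its two classes; the class
    with the most votes wins (ties would still need a tie-break rule)."""
    votes = {}
    for (a, b), model in models.items():
        winner = a if model.decision_value(x) >= 0 else b
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)
```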
VII. Analysis of the KKT conditions
For the KKT conditions, please refer to references [13][14]. Suppose the optimal solution of our optimization problem is α_i*, β_i*, ξ_i*, w* and b*. This optimal solution needs to satisfy the KKT conditions:
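In the usual soft-margin notation (with α_i* the multipliers of the margin constraints and β_i* those of the slack constraints), these are:

$$\begin{aligned} &w^{*}=\sum_{i=1}^{n}\alpha_i^{*}y_i x_i,\qquad \sum_{i=1}^{n}\alpha_i^{*}y_i=0,\qquad C-\alpha_i^{*}-\beta_i^{*}=0,\\ &\alpha_i^{*}\big(y_i(w^{*T}x_i+b^{*})-1+\xi_i^{*}\big)=0,\qquad \beta_i^{*}\,\xi_i^{*}=0,\\ &y_i(w^{*T}x_i+b^{*})-1+\xi_i^{*}\ge 0,\qquad \alpha_i^{*}\ge 0,\;\beta_i^{*}\ge 0,\;\xi_i^{*}\ge 0. \end{aligned}$$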
In addition, both β_i* and ξ_i* must be greater than or equal to 0, and α_i* must lie between 0 and C. This can be discussed in three separate cases:
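Working through the standard case analysis (using C − α_i* − β_i* = 0 and β_i* ξ_i* = 0 from above):
1) If α_i* = 0, then β_i* = C > 0, so ξ_i* = 0 and feasibility gives y_i(w*^T x_i + b*) ≥ 1.
2) If 0 < α_i* < C, then β_i* > 0 forces ξ_i* = 0, and α_i* > 0 forces y_i(w*^T x_i + b*) = 1.
3) If α_i* = C, then β_i* = 0, so ξ_i* may be positive, and y_i(w*^T x_i + b*) = 1 − ξ_i* ≤ 1.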
Altogether, the KKT conditions become:
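In the same notation, the three conditions referenced below are:

$$\alpha_i^{*}=0 \;\Rightarrow\; y_i(w^{*T}x_i+b^{*})\ge 1,\qquad \alpha_i^{*}=C \;\Rightarrow\; y_i(w^{*T}x_i+b^{*})\le 1,\qquad 0<\alpha_i^{*}<C \;\Rightarrow\; y_i(w^{*T}x_i+b^{*})=1.$$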
The first formula says that if α_i* = 0, the sample falls outside the two margin lines. The second says that if α_i* = C, the sample may fall between the two margin lines or exactly on them, depending on whether the corresponding slack variable equals 0 or is greater than 0. The third says that if 0 < α_i* < C, the sample must fall exactly on a margin line (this point is important: b is computed from exactly these points on the margin, because on the margin w^T x + b = 1 or −1 holds as an equation, while elsewhere we only have inequalities, from which b cannot be solved). A visualization of these cases looks like this:
The KKT conditions show that the samples whose α_i* is not 0, the support vectors, may fall on the margin lines or between them. The KKT conditions are very important; in SMO, one of the algorithms implementing SVM, we will see their key application.
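As a small preview of how SMO uses these conditions, here is a minimal sketch of a KKT-violation check that an SMO implementation might use to pick which α to update; the function name, tolerance value and calling convention are my own, and `f_xi` stands for the current SVM output on x_i:

```python
def violates_kkt(alpha_i, y_i, f_xi, C, tol=1e-3):
    """Check whether sample i violates the KKT conditions above.

    With r = y_i * f(x_i) - 1, the conditions require r >= 0 when alpha_i == 0,
    r == 0 when 0 < alpha_i < C, and r <= 0 when alpha_i == C.
    SMO selects alphas that violate these conditions for updating.
    """
    r = y_i * f_xi - 1.0
    if alpha_i < C and r < -tol:   # should have r >= 0, but it is negative
        return True
    if alpha_i > 0 and r > tol:    # should have r <= 0, but it is positive
        return True
    return False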