SVM Theory and Experiment (10): Multi-Class Classification - Xu Haihui


Taught by Dr. Xu Haihui.


From the figures shown so far, it is clear that SVM is inherently a two-class (binary) classifier: it only answers whether a sample belongs to the positive class or the negative class. In practice, however, the problems we need to solve are usually multi-class (with a few exceptions, such as spam filtering, which only needs to decide whether a message is or is not spam), for example text categorization or digit recognition. How to build a multi-class classifier out of binary classifiers is therefore a question worth studying.


Take text categorization as an example. There are several ready-made approaches. One of them is the "once and for all" approach: consider all the samples at the same time, solve a single multi-objective optimization problem, and obtain all the categories in one shot, as in the figure below:






Multiple hyperplanes divide the space into multiple regions, each region corresponding to one category. Given an article, you simply check which region it falls into to know its class.


It looks beautiful, doesn't it? Unfortunately, this algorithm still exists mostly on paper: solving everything in one shot is computationally far too expensive to be practical.


Taking a small step back, we arrive at the so-called "one-versus-rest" approach: solve a series of two-class problems, one at a time. Suppose we have 5 categories. In the first round we take the samples of class 1 as the positive set and lump the samples of classes 2, 3, 4, and 5 together as the negative set; this yields a binary classifier that tells us whether or not an article belongs to class 1. In the second round we take class 2 as positive and classes 1, 3, 4, 5 combined as negative, and obtain a second classifier. In this way we get 5 binary classifiers (always as many as there are categories). When an article needs to be classified, we ask each classifier in turn: "Does it belong to you?" Whichever classifier says yes determines the article's category. The advantage is that each optimization problem is small and classification is fast (only 5 classifier evaluations are needed). But two awkward situations can arise: several classifiers may all claim the article, or none of them may claim it. The former is called classification overlap, the latter the unclassifiable case. Classification overlap is not too bad: picking any one of the claimed classes is not outrageous, or we can compare the article's distance to each hyperplane and award it to the one it lies farthest from on the positive side. The unclassifiable case is genuinely awkward; the article can only be assigned to a made-up 6th category... Worse still, even when the number of samples per class is similar, the "rest" set is always several times larger than the positive set (it is the union of all the other classes), which artificially creates the "dataset skew" problem discussed in the previous section. A sketch of this procedure follows below.
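To make the procedure concrete, here is a minimal sketch of one-versus-rest training and prediction, assuming scikit-learn's binary SVC is available and the training data X, y are NumPy arrays; the linear kernel and all function names are illustrative choices, not the author's implementation. Overlap (and even the unclassifiable case) is resolved by comparing the signed distances to the hyperplanes, as described above.

```python
# Illustrative one-versus-rest sketch (not the author's code).
import numpy as np
from sklearn.svm import SVC

def train_one_vs_rest(X, y):
    """Train one binary SVM per class: class k versus all the other classes."""
    classifiers = {}
    for k in np.unique(y):
        clf = SVC(kernel="linear")
        clf.fit(X, (y == k).astype(int))   # positives: class k; negatives: everything else
        classifiers[k] = clf
    return classifiers

def predict_one_vs_rest(classifiers, x):
    """Resolve overlap (or the unclassifiable case) by the signed distance to each hyperplane."""
    x = np.asarray(x).reshape(1, -1)
    scores = {k: clf.decision_function(x)[0] for k, clf in classifiers.items()}
    return max(scores, key=scores.get)     # the class whose hyperplane claims x most strongly
```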


So we step back once more. We still solve binary classification problems, and each time we still take the samples of one class as the positive set, but now the negative set contains only one other class (this is the "one-versus-one" method), which avoids the skew. The process is to train a classifier for every pair: the first answers "class 1 or class 2", the second answers "class 1 or class 3", the third answers "class 1 or class 4", and so on. You can immediately work out that with 5 classes there are 5 × 4 / 2 = 10 classifiers (in general, with K classes, the number of pairwise classifiers is K(K-1)/2). Although there are more classifiers, the total time spent in the training stage (when the classifiers are computed) is much less than with the one-versus-rest approach. When we actually classify an article, we show it to all the classifiers: the first votes for "1" or "2", the second for "1" or "3", and so on; each casts its vote, and if class "1" receives the most votes, the article is assigned to class 1. Classification overlap can still occur, but the unclassifiable case cannot, since it is impossible for every class to receive zero votes. Looks good enough? Now think about how many classifiers we must invoke to classify one article: 10 when there are 5 classes, but if there are 1000 classes the number of invocations rises to about 500,000 (on the order of the square of the number of classes). What then?
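Again purely as an illustration, here is a minimal sketch of one-versus-one training and majority voting under the same assumptions (scikit-learn, NumPy arrays X and y, hypothetical function names). Note that with K = 1000 classes, K(K-1)/2 = 499,500 pairwise classifiers would all be invoked for a single prediction.

```python
# Illustrative one-versus-one sketch with majority voting.
from collections import Counter
from itertools import combinations

import numpy as np
from sklearn.svm import SVC

def train_one_vs_one(X, y):
    """Train K(K-1)/2 pairwise classifiers, one per unordered pair of classes."""
    classifiers = {}
    for a, b in combinations(np.unique(y), 2):
        mask = (y == a) | (y == b)
        clf = SVC(kernel="linear")
        clf.fit(X[mask], y[mask])          # only the samples of classes a and b
        classifiers[(a, b)] = clf
    return classifiers

def predict_one_vs_one(classifiers, x):
    """Every pairwise classifier casts one vote; the most-voted class wins."""
    x = np.asarray(x).reshape(1, -1)
    votes = Counter(clf.predict(x)[0] for clf in classifiers.values())
    return votes.most_common(1)[0][0]
```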


It seems we have to step back yet again, this time on the classification stage. We still train exactly as in the one-versus-one approach, but before classifying an article we arrange the classifiers into a directed acyclic graph (which is why this method is also called DAG SVM), as shown below.






So when classifying, we first ask the "1 vs 5" classifier (meaning it can answer "class 1 or class 5"). If it answers 5, we go left and ask the "2 vs 5" classifier; if that one also says "5", we keep going left, and so on, until we reach a leaf and obtain the classification result. Not bad: we only ever invoke 4 classifiers (with K classes, only K-1), classification is very fast, and there is neither classification overlap nor the unclassifiable case. What is the downside? If the first classifier makes a mistake (the article clearly belongs to class 1, but it says 5), none of the later classifiers can correct the error (the label "1" simply never appears again further down the graph); in fact, every subsequent level of the graph suffers from this kind of error accumulation. A sketch of the decision procedure follows below.
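Here is a minimal sketch of the DAG decision procedure, assuming the pairwise classifiers come from the one-versus-one sketch above; each step eliminates one candidate class, so exactly K-1 classifiers are invoked per article.

```python
# Illustrative DAG-SVM style decision, built on the pairwise classifiers above.
import numpy as np

def predict_dag(classifiers, x, classes):
    """Repeatedly compare the first and last surviving classes, eliminating the loser."""
    candidates = list(classes)             # e.g. [1, 2, 3, 4, 5]
    x = np.asarray(x).reshape(1, -1)
    while len(candidates) > 1:
        a, b = candidates[0], candidates[-1]
        key = (a, b) if (a, b) in classifiers else (b, a)
        winner = classifiers[key].predict(x)[0]
        if winner == a:
            candidates.pop()               # class b is eliminated
        else:
            candidates.pop(0)              # class a is eliminated
    return candidates[0]                   # K-1 comparisons in total
```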


However, do not be scared off by the DAG method: error accumulation exists in the one-versus-rest and one-versus-one methods as well. What makes the DAG method better is that its accumulated error has an upper bound, large or small, that can actually be established; there is a theoretical proof of this. With one-versus-rest and one-versus-one, on the other hand, even though the generalization error bound of each individual binary classifier is known, nobody knows what the error bound becomes once multiple classes are combined, which means that even an accuracy of 0 is theoretically possible. That is rather more depressing.


There are also ways of choosing the DAG's root node (that is, which classifier is asked first) that improve the overall performance. We always want the root to make as few mistakes as possible, so ideally the two classes involved in the first comparison should be so different from each other that they are very unlikely to be confused. Alternatively, we can always put at the root the pairwise classifier with the highest accuracy on its two classes. Or we can have each pairwise classifier output not only a class label but also something like a "confidence": when a classifier is not very sure of its answer, we do not follow only the branch it chose, but also explore the neighbouring path, and so on.


Using SVM for classification actually involves two completely different processes, training and classification, so their complexities cannot be discussed as one. What we discuss here is mainly the complexity of the training stage, that is, the complexity of solving the quadratic programming problem. Approaches to this problem fall broadly into two camps: analytic solutions and numerical solutions.


An analytic solution is the theoretical solution; it takes the form of an expression, so it is exact. As long as a problem has a solution at all (a problem with no solution is not worth discussing anyway), its analytic solution is guaranteed to exist. Of course, whether it exists is one thing; whether it can be computed within a tolerable amount of time is quite another. For SVM, the time complexity of computing the analytic solution can reach O(N_SV^3), where N_SV is the number of support vectors. There is no fixed proportion, but the number of support vectors is related to the size of the training set.


A numerical solution is a solution that can actually be used: it is just numbers, one per variable, and is usually an approximate solution. The process of finding a numerical solution looks very much like trial and error: start from some value and try it; if the resulting solution does not yet meet a certain criterion (called the stopping condition, i.e., the point at which the solution is considered accurate enough and no further computation is needed), move on to the next value. Of course, the next value is not chosen at random; there are rules to follow. Some algorithms try only one value at a time, others try several; the ways of choosing the next value (or group of values), the stopping conditions, and the final accuracy all differ, so the complexity of finding the solution cannot be discussed apart from the specific algorithm.


Take a concrete algorithm, the Bunch-Kaufman training algorithm, whose typical time complexity lies between O(N_SV^3 + L·N_SV^2 + d·L·N_SV) and O(d·L^2), where N_SV is the number of support vectors, L is the number of training samples, and d is the dimensionality of each sample (the original dimensionality, before mapping to the high-dimensional space). The complexity varies because it depends not only on the size of the input (the number of samples and the dimensionality) but also on the final solution (that is, on the support vectors): if there are few support vectors, the process is much faster; if there are many, approaching the number of samples, you get the very bad O(d·L^2) behaviour (with 10,000 samples of 1,000 dimensions each, you might as well not bother, it will not finish; and this kind of input is entirely normal for text categorization).
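As a rough illustration of why the worst case is hopeless for inputs of this size (treating the big-O bound as if it were a literal operation count, which it is not):

```python
# Back-of-the-envelope count for the worst case O(d * L^2), illustrative only.
d = 1_000          # dimensionality of each sample
L = 10_000         # number of training samples
print(f"worst case ~ {d * L * L:.1e} operations")   # ~1.0e+11
```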


Looking back, you can now see why the one-versus-one approach, despite the larger number of binary classifiers to train, actually takes less total time than the one-versus-rest approach: one-versus-rest brings all the samples into every training run (only the split into positive and negative changes each time), so it is naturally much slower.
