Introduction to SVM (10): Using SVM for Multiclass Classification


From the SVM figures we have seen so far, it is clear that SVM is fundamentally a binary classifier: it only answers questions of the positive-or-negative kind. In reality, however, the problems we need to solve are usually multiclass problems (with a few exceptions, such as spam filtering, where we only need to decide whether a message is or is not spam), for example text classification or digit recognition. How to build a multiclass classifier out of binary classifiers is a question worth studying.

Taking text classification as an example, there are many ready-made methods. One is to consider all samples at once and solve a single optimization problem with a multiclass objective, obtaining all the separating hyperplanes in one shot, as shown in the following figure:

The multiple hyperplanes divide the space into several regions, and each region corresponds to one category. To classify an article, we simply check which region it falls into.

Looks pretty, right? Unfortunately, this algorithm has remained largely on paper, because solving everything in one shot is too computationally expensive to be practical.
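As an aside, a linear variant of this all-at-once idea (the Crammer-Singer multiclass SVM formulation) is available in scikit-learn's LinearSVC, though the option has been deprecated in the newest releases. A minimal sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic 5-class stand-in data (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)

# One joint optimization produces all class boundaries at once.
clf = LinearSVC(multi_class="crammer_singer").fit(X, y)
print(clf.coef_.shape)  # (5, 20): one weight vector per class
```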

Stepping back a bit, we arrive at the so-called "one-vs-rest" method: each time, we still solve a binary problem. Say we have five classes. The first time, we take the samples of class 1 as positive samples and the samples of classes 2, 3, 4, and 5 as negative samples; this gives a binary classifier that can tell whether an article belongs to class 1 or not. The second time, we take the samples of class 2 as positive and the samples of classes 1, 3, 4, and 5 as negative to obtain another classifier. In this way we get five binary classifiers (always the same as the number of classes). When an article needs to be classified, we present it to each classifier in turn and ask: does it belong to your class? If exactly one classifier says yes, the article's category is determined. The advantage of this method is that each optimization problem is relatively small and classification is very fast (only five classifier calls are needed to get the result). But two very embarrassing situations can arise: we take an article, ask around the circle, and either several classifiers all claim it as their own, or every classifier says it is not theirs. The former is called classification overlap; the latter, unclassifiable. Classification overlap is not too hard to resolve: pick any one of the claimed results, or decide by the distance from the article to each hyperplane. The unclassifiable case is genuinely awkward; we can only assign the article to a fictitious 6th class ... Worse still, even when the classes have roughly equal numbers of samples, the negative class in each binary problem is always several times larger than the positive class (because it is the union of all classes other than the positive one), which artificially creates the "dataset skew" discussed in the previous section.
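A minimal sketch of the one-vs-rest scheme, using scikit-learn's OneVsRestClassifier with a linear SVM; the synthetic dataset is an assumption for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Synthetic 5-class stand-in for a vectorized text corpus.
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)

# One binary SVM per class: class k is positive, the other four are negative.
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)  # 5 classifiers for 5 classes

# Overlap and "no class claims it" are resolved internally by picking the
# class with the largest decision-function value (distance to the hyperplane).
print(ovr.predict(X[:3]))
```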

So we take yet another step back. We still solve binary problems, and we still take one class as the positive class each time, but the negative class is now just one other class (this is called the one-vs-one method), which avoids the skew. The process is therefore to train classifiers such that the first answers only "class 1 or class 2?", the second answers only "class 1 or class 3?", the third answers "class 1 or class 4?", and so on. You can immediately work out that with five classes there are 5 × 4 / 2 = 10 such classifiers (in general, with K classes the number of binary classifiers is K(K-1)/2). Although the number of classifiers is larger, the total time spent in the training phase (that is, computing the classification hyperplanes of all these classifiers) is actually much less than with the one-vs-rest method. At classification time, we give the article to all the classifiers: the first votes either "1" or "2", the second votes either "1" or "3", and so on; each casts its own vote, and finally we count the votes. If class "1" gets the most votes, the article belongs to class 1. This method can obviously still produce classification overlap, but it never leaves an article unclassifiable, because the vote counts cannot all be zero. Looks good? Think about how many classifiers one classification calls: 10 when there are 5 classes, but with 1000 classes the number of classifier calls rises to about 500,000 (quadratic in the number of classes). How can that be good?
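A minimal one-vs-one sketch: scikit-learn's SVC trains all K(K-1)/2 pairwise classifiers internally and predicts by majority vote (synthetic data again, as an assumption for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)

# With 5 classes, SVC fits 5*4/2 = 10 pairwise classifiers under the hood.
ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)

print(ovo.decision_function(X[:1]).shape)  # (1, 10): one score per class pair
print(ovo.predict(X[:1]))                  # the vote-count winner
```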

It seems we must retreat one more step. We train exactly as in the one-vs-one method, but before classifying an article we first organize the classifiers as in the figure below (as you can see, this is a directed acyclic graph, so the method is also called DAG SVM).

With this arrangement, we first ask the classifier "1 vs 5" (meaning it answers either "class 1" or "class 5"). If it answers 5, we go left and ask the classifier "2 vs 5"; if it again answers "5", we continue left, and so on until we reach a final classification. What is the benefit? In classifying one article we only call four classifiers (with K classes, only K-1 calls), classification is fast, and there is no classification overlap and no unclassifiable case! What is the drawback? If a classifier near the top gives a wrong answer (say an article that clearly belongs to class 1 gets labeled "5"), then no classifier further down can correct the mistake (the classifiers on that branch no longer have the label "1" among their options); in effect, the error accumulates downward through every layer below.
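scikit-learn has no built-in DAG SVM, so here is a hypothetical sketch of the traversal (all helper names are mine, not from the article): keep a list of candidate classes, let each pairwise classifier eliminate one candidate, and stop after K-1 comparisons.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC


def train_pairwise(X, y):
    """Train one binary SVM per class pair, exactly as in one-vs-one."""
    classifiers = {}
    for a, b in combinations(np.unique(y), 2):
        mask = (y == a) | (y == b)
        classifiers[(a, b)] = SVC(kernel="linear").fit(X[mask], y[mask])
    return classifiers


def dag_predict(classifiers, x, classes):
    """Walk the DAG: each pairwise test eliminates one candidate class."""
    candidates = sorted(classes)
    while len(candidates) > 1:
        a, b = candidates[0], candidates[-1]
        winner = classifiers[(a, b)].predict(x.reshape(1, -1))[0]
        if winner == a:
            candidates.pop()      # class b is eliminated
        else:
            candidates.pop(0)     # class a is eliminated
    return candidates[0]          # K-1 classifier calls in total


X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)
clfs = train_pairwise(X, y)
print(dag_predict(clfs, X[0], np.unique(y)))
```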

However, do not be scared off by the error accumulation of the DAG method; errors exist in the one-vs-rest and one-vs-one methods too. The advantage of the DAG method is that an upper bound on the accumulated error, whether large or small, always exists and has been proved theoretically. With the one-vs-rest and one-vs-one methods, by contrast, even though the generalization error bound of each individual binary classifier is known, the upper bound on the error after combining them for multiclass classification is unknown; this means the accuracy could in principle even drop to 0, which is depressing.

In addition, there are techniques for choosing the root node of the DAG (that is, the first classifier to participate in the classification) that improve the overall effect. We always want the root node to make as few mistakes as possible, so one suggestion is to let the first classifier compare the two categories that differ most, making a wrong split unlikely. Alternatively, we can always place the pairwise classifier with the highest accuracy at the root; or we can let each binary classifier output not only a class label but also something like a "confidence level", and when a classifier is not confident in its own result, we follow not only its chosen branch but also the adjacent one, and so on.
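The article gives no formula for that "confidence level"; one plausible realization, sketched here hypothetically, is to use the magnitude of a binary SVM's decision function as the confidence and flag small margins:

```python
def pairwise_vote(clf, x, threshold=0.2):
    """Vote of one fitted binary SVC, plus a crude confidence flag.

    The 0.2 margin threshold is an arbitrary illustrative choice; when
    `confident` is False, a DAG traversal could also explore the branch
    next to the chosen one.
    """
    score = float(clf.decision_function(x.reshape(1, -1))[0])
    winner = clf.classes_[1] if score > 0 else clf.classes_[0]
    return winner, abs(score) >= threshold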

Tip: the computational complexity of SVM

When SVM is used for classification, training and classification are two completely different processes, so we cannot discuss "the complexity" in general terms. What we discuss here is the complexity of the training stage, that is, the complexity of solving the quadratic programming problem. Solutions to this problem fall into two broad camps: analytical solutions and numerical solutions.

An analytical solution is a theoretical solution, given in the form of an expression, and is therefore exact. As long as a problem has a solution at all (a problem with no solution would have no business being here, haha), its analytical solution must exist. Of course, existing is one thing; being computable, or computable within an acceptable amount of time, is another. For SVM, the time complexity of obtaining the analytical solution can in the worst case reach O(n_SV^3), where n_SV is the number of support vectors. Although there is no fixed proportion, the number of support vectors is also related to the size of the training set.

A numerical solution is a usable solution: it is a number, and it is usually only approximate. The process of obtaining a numerical solution resembles exhaustive search: start with a trial value; while the solution does not yet satisfy a certain condition (called the stopping condition: once it is satisfied, the solution is deemed accurate enough and no further computation is needed), try the next value. Of course, the next value is not chosen at random; there are rules to follow. Some algorithms try one value at a time, some try several; the methods for finding the next value (or values) differ, the stopping conditions differ, and the accuracy of the final solution differs accordingly. Clearly, the complexity of numerical solutions cannot be discussed apart from the specific algorithm.
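In library implementations this stopping condition appears as a tolerance parameter. A minimal sketch with scikit-learn's SVC (iteration counts will vary with the data and library version):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# A looser tolerance satisfies the stopping condition sooner, giving a
# rougher approximate solution; a tighter one iterates closer to the optimum.
rough = SVC(tol=1e-1).fit(X, y)
fine = SVC(tol=1e-6).fit(X, y)

# n_iter_ is available in scikit-learn 1.1 and later.
print(rough.n_iter_, fine.n_iter_)
```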

For one specific algorithm, the Bunch-Kaufman training algorithm, the typical time complexity lies between O(n_SV^3 + L*n_SV^2 + d*L*n_SV) and O(d*L^2), where n_SV is the number of support vectors, L is the number of samples in the training set, and d is the dimensionality of each sample (the original dimensionality, before mapping to the high-dimensional space). The complexity varies because it depends not only on the size of the input (not just the number of samples, but also the dimensionality) but also on the final solution of the problem (that is, on the support vectors): if there are few support vectors, the process is much faster; if there are many, close to the number of samples, it approaches the O(d*L^2) end, which is a very bad result (10,000 samples of 1,000 dimensions each — work it out for yourself — and yet this input size is perfectly normal for text classification).

Now you can see why the total training time of the one-vs-one method is actually less than that of the one-vs-rest method, even though it trains many more classifiers: each training run in the one-vs-rest method takes all the samples into account (only the split into positive and negative classes changes each time), so it is naturally much slower. A back-of-the-envelope calculation makes this concrete (see the sketch below).
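This sketch assumes, for illustration only, that training cost grows quadratically in the number of training samples, as in the O(d*L^2) regime above:

```python
K, L = 5, 10_000                       # classes, total training samples

# One-vs-rest: K training runs, each over all L samples.
one_vs_rest = K * L ** 2

# One-vs-one: K(K-1)/2 runs, each over roughly 2L/K samples.
one_vs_one = (K * (K - 1) // 2) * (2 * L / K) ** 2

print(f"{one_vs_rest:.2e}")  # 5.00e+08
print(f"{one_vs_one:.2e}")   # 1.60e+08, about 2*L^2*(K-1)/K: nearly independent of K
```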
