Getting started with text classification (6)

Last Update:2018-12-07 Source: Internet

Author: User

Tags svm


SVM algorithms
The Support Vector Machine (SVM) was first proposed by Cortes and Vapnik in 1995. It has distinctive advantages for pattern recognition problems that involve small samples, non-linearity, and high dimensionality, and it can also be applied to function fitting and other machine learning problems [10].
The SVM method is built on the VC dimension theory of statistical learning and on the structural risk minimization principle. Given limited sample information, it seeks the best compromise between model complexity (that is, the learning accuracy on the specific training samples) and learning ability (that is, the ability to identify arbitrary samples without error), in order to obtain the best generalization ability [14].
The SVM method has a solid theoretical foundation. The essence of SVM training is solving a quadratic programming problem, that is, an optimization problem whose objective function is quadratic and whose constraints are linear. The solution obtained is a global optimum, which is an advantage over many other statistical learning techniques. The SVM classifier performs well in text classification and is one of the best classifiers available. By using a kernel function to map the original sample space into a high-dimensional space, it can handle samples that are not linearly separable in the original space.
The disadvantage is that there is little guidance for choosing a kernel function, so it is difficult to select the best kernel for a specific problem. In addition, SVM training speed is strongly affected by the size of the training set, which leads to high computing overhead. To address training speed, researchers have proposed many improvements, including the chunking method, the Osuna algorithm, the SMO algorithm, and interactive SVM [14].
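The idea of mapping samples into a higher-dimensional space can be illustrated with a small sketch. It is not an SVM: an explicit feature map (the hypothetical `feature_map`, which appends the product of the two coordinates) stands in for a kernel, and a simple perceptron stands in for the quadratic programming solver. The XOR-style data set is an assumption chosen only because it is the classic example of points that no straight line can separate.

```python
# Sketch only: an explicit feature map stands in for a kernel, and a
# perceptron stands in for the SVM quadratic-programming solver.

def train_perceptron(points, labels, max_epochs=100):
    """Return (weights, bias) if a separating hyperplane is found, else None."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified, or exactly on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: the data is separated
            return w, b
    return None  # no separating hyperplane found within max_epochs


def feature_map(x):
    """Hypothetical map from 2-D to 3-D: append the product of the coordinates."""
    x1, x2 = x
    return (x1, x2, x1 * x2)


# XOR-style labels: +1 when the coordinates differ, -1 when they agree.
xor_points = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
xor_labels = [-1, 1, 1, -1]

original_result = train_perceptron(xor_points, xor_labels)
mapped_result = train_perceptron([feature_map(p) for p in xor_points], xor_labels)

print("separable in original space:", original_result is not None)
print("separable in mapped space:", mapped_result is not None)
```

A real kernel SVM never computes the mapped vectors explicitly; the kernel function supplies the inner products in the high-dimensional space directly, which is what keeps the approach practical when that space is very large.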
The advantages of the SVM classifier are good generality, high classification accuracy, and fast classification whose speed does not depend on the number of training samples. It outperforms KNN and Naive Bayes in both precision and recall.
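Why classification can be fast is easiest to see for a linear SVM, which is an assumption of the sketch below: once training is done, the classifier is just a weight per term plus a bias, so scoring a document costs one pass over that document's terms regardless of how many training documents there were. The function names, the weights, and the sample documents are all hypothetical, invented for illustration. For a kernel SVM the cost instead grows with the number of support vectors.

```python
# Sketch only: a trained linear SVM reduced to its decision function.
# The weights and bias below are made-up values, not the output of real training.

def linear_svm_score(doc_vector, weights, bias):
    """Dot product of a sparse document vector with the weight vector, plus bias."""
    return sum(value * weights.get(term, 0.0)
               for term, value in doc_vector.items()) + bias


def classify(doc_vector, weights, bias):
    """Return +1 or -1 according to the side of the hyperplane the document is on."""
    return 1 if linear_svm_score(doc_vector, weights, bias) >= 0 else -1


# Hypothetical model: positive weights lean "sports", negative lean "finance".
weights = {"goal": 1.2, "match": 0.8, "stock": -1.1, "market": -0.9}
bias = -0.1

sports_doc = {"goal": 2.0, "match": 1.0, "market": 1.0}
finance_doc = {"stock": 1.0, "market": 2.0}

print(classify(sports_doc, weights, bias))
print(classify(finance_doc, weights, bias))
```

Terms that the model has never seen simply contribute nothing to the score, so the work per document is bounded by the number of distinct terms in that document.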
Compared with other algorithms, the theoretical basis of SVM is complex, but it has broad application prospects. I plan to write a series of articles discussing SVM in detail, so stay tuned!

Having introduced several representative algorithms, we can now compare their strengths and weaknesses using several sets of experimental results from China and abroad.
For Chinese corpora, reference [6] used the benchmark corpus provided by the Natural Language Processing Laboratory of Fudan University to test several classification algorithms based on the word-level vector space model of text. This benchmark corpus is divided into 20 categories, with a total of 9804 training documents and 9833 test documents. After uniform word segmentation and stop-word removal, the performance indicators of each classification method are as follows.

The F1 measure is an indicator that combines precision and recall. F1 is large only when both values are large, so it is a more representative indicator than precision or recall alone.
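As a concrete illustration of that property, here is a minimal sketch that computes precision, recall, and F1 (their harmonic mean) from raw counts. The function name and the counts in the two examples are assumptions made up for illustration; they are not taken from the experiments cited in this article.

```python
# Sketch only: F1 as the harmonic mean of precision and recall.

def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute (precision, recall, F1) from counts; empty denominators give 0.0."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Balanced case: precision and recall are both high, so F1 is high.
balanced = precision_recall_f1(true_pos=90, false_pos=10, false_neg=10)

# Lopsided case: precision is just as high, but recall is low, so F1 is low.
lopsided = precision_recall_f1(true_pos=9, false_pos=1, false_neg=81)

for name, (p, r, f1) in (("balanced", balanced), ("lopsided", lopsided)):
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Because the harmonic mean is dominated by the smaller of the two values, a classifier cannot achieve a high F1 by maximizing precision at the expense of recall, or the reverse.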
The comparison shows that SVM and KNN are much better than Naive Bayes (they also outperform Rocchio, which is now rarely included in evaluations).
For English corpora, the "ModApte" split of Reuters-21578 is a commonly used test set on which many researchers have run experiments. Sebastiani summarized the results in [23]; those for the algorithms discussed here are as follows:

Classification algorithm	F1 measure on Reuters-21578 "ModApte"
Rocchio	0.776
Naive Bayes	0.795
KNN	0.823
SVM	0.864

Judged by the F1 measure, KNN comes fairly close to SVM. However, F1 only reflects classification effectiveness (that is, how accurately documents are classified) and does not consider efficiency (that is, how fast classification runs). Overall, SVM is an algorithm with both good effectiveness and good efficiency.

As mentioned earlier, the final product of the training phase is a set of classifiers. In the classification phase, these classifiers are simply applied to new documents, so there is little more to say about it.
The next chapter lists and briefly explains the concepts that have appeared so far and introduces some that will be used later. After that, we will discuss the types of classification problems, the similarities and differences between Chinese and English text classification, and an overview and comparison of several feature extraction algorithms.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
