Getting started with text classification (6)

Source: Internet
Author: User
Tags: svm

SVM algorithms
The Support Vector Machine (SVM) was first proposed by Cortes and Vapnik in 1995. It has many distinctive advantages for small-sample, nonlinear, and high-dimensional pattern recognition problems, and it can also be applied to function fitting and other machine learning problems [10].
The SVM method is built on statistical learning theory, specifically VC-dimension theory and the structural risk minimization principle. Given limited sample information, it seeks the best compromise between model complexity (that is, the learning accuracy on the specific training samples) and learning ability (that is, the ability to identify arbitrary samples without error), in order to obtain the best generalization ability [14].
The SVM method has a solid theoretical foundation. In essence, SVM training solves a quadratic programming problem, that is, an optimization problem whose objective function is quadratic and whose constraints are linear. The solution obtained is a global optimum, which is an advantage over many other statistical learning techniques. The SVM classifier performs well in text classification and is one of the best classifiers for the task. By using a kernel function to map the original sample space into a high-dimensional space, it can handle samples that are not linearly separable in the original space. Its disadvantage is that there is little guidance for choosing a kernel function, so it is difficult to select the best kernel for a specific problem. In addition, SVM training speed is strongly affected by the size of the training set, which leads to high computational overhead. To address this, researchers have proposed many improvements, including the chunking method, the Osuna algorithm, the SMO algorithm, and interactive SVM [14].
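The kernel idea described above can be illustrated with a small sketch. It is not taken from the cited references; the degree-2 polynomial kernel, the feature map phi, and the XOR-style toy points are assumptions chosen only for illustration. The sketch shows that points which no straight line separates in 2-D become separable after mapping, and that the kernel value equals the dot product in the mapped space without computing the mapping.

```python
import math

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: K(x, z) = (x . z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit feature map matching poly2_kernel: 2-D input -> 3-D space."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

# XOR-style toy data (illustrative): no straight line in the original
# 2-D space separates the +1 points from the -1 points.
points = [((1, 1), +1), ((-1, -1), +1), ((1, -1), -1), ((-1, 1), -1)]

# In the mapped space, the sign of the middle coordinate alone
# (sqrt(2) * x1 * x2) separates the two classes.
separable = all((phi(x)[1] > 0) == (label > 0) for x, label in points)

# The kernel gives the mapped-space dot product directly.
x, z = (1.0, 2.0), (3.0, -1.0)
kernel_matches = math.isclose(poly2_kernel(x, z), dot(phi(x), phi(z)))

print(separable, kernel_matches)
```

A real SVM would find the separating hyperplane by solving the quadratic program; this sketch only demonstrates why the mapping makes a linear separator possible.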
The advantages of the SVM classifier are good generality, high classification accuracy, and fast classification speed that does not depend on the number of training samples. It is superior to KNN and Naive Bayes in both precision and recall.
Compared with other algorithms, the theoretical basis of SVM is complex, but it has broad application prospects. I plan to write a series of articles discussing SVM in detail, so stay tuned!
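As a rough illustration of why classification can be fast, here is a minimal sketch of the decision step, assuming a linear kernel. In that case a trained SVM reduces to a weight vector and a bias, so classifying a document costs one dot product over its features no matter how many training samples were used. The function name and the weight values are hypothetical and do not come from any real trained model. With a nonlinear kernel the cost instead grows with the number of support vectors.

```python
def linear_svm_predict(weights, bias, features):
    """Return +1 or -1 from the sign of w . x + b (linear kernel assumed)."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score >= 0 else -1

# Hypothetical weights for a 3-term vocabulary; purely illustrative.
w = [0.8, -0.5, 0.1]
b = -0.2

# Feature vector of a new document (for example, term weights).
doc = [1.0, 0.0, 2.0]

label = linear_svm_predict(w, b, doc)
print(label)
```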

Having introduced several representative algorithms, we can now compare their strengths and weaknesses using several sets of experimental data from Chinese and international studies.
For Chinese corpora, reference [6] uses the benchmark corpus provided by the Natural Language Processing Laboratory of Fudan University to test several classification algorithms based on the word vector space text model. This benchmark corpus is divided into 20 categories, with a total of 9804 training documents and 9833 test documents. After unified word segmentation and stop-word removal, the performance indicators of each classification method are as follows.

The F1 measure is an indicator that combines precision and recall. It is large only when both values are large, so it is more representative than precision or recall alone.
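For reference, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below uses made-up precision and recall values (not figures from the cited experiments) to show how a lopsided pair is penalized.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative inputs only: a balanced pair versus a lopsided pair.
balanced = f1_score(0.8, 0.8)     # both values high
lopsided = f1_score(0.95, 0.10)   # high precision, very low recall

print(round(balanced, 3), round(lopsided, 3))
```

The lopsided case stays close to the smaller of the two values, which is why F1 cannot be pushed up by improving precision or recall alone.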
The comparison shows that SVM and KNN are clearly better than Naive Bayes (and they are also better than Rocchio, which is now rarely included in evaluations).
For English corpora, the "ModApte" split of Reuters-21578 is a commonly used test set on which many people have run experiments. Sebastiani summarizes these in [23]; the results of the relevant algorithms are as follows:

Classification algorithm    F1 measure on Reuters-21578 "ModApte"
Rocchio                     0.776
Naive Bayes                 0.795
KNN                         0.823
SVM                         0.864

According to the F1 measure, KNN comes fairly close to SVM, but F1 only reflects classification effectiveness (that is, how accurate the classification is) and does not consider performance (that is, how fast the classification is). Overall, SVM is an algorithm that does well on both effectiveness and performance.

As mentioned earlier, the final product of the training phase is the classifier. The classification phase simply uses this classifier to classify new documents, so there is little more to say about it.
The next chapter lists and briefly explains the concepts that have appeared so far and introduces some concepts that will be used later. After that, we will discuss the categories of classification problems, the similarities and differences between Chinese and English classification, and an overview and comparison of several feature extraction algorithms.
