Comparison and summary of text classification algorithms

This article compares and summarizes several commonly used text classification algorithms, outlining their strengths and weaknesses to provide a basis for choosing among them.

I. Rocchio Algorithm

The Rocchio algorithm is arguably the first and most intuitive solution that comes to mind for the text categorization problem. The basic idea is to average the vectors of all sample documents in a category (for example, average the count of the word "basketball" across all "sports" documents, then do the same for "referee", and so on). This produces a new vector, figuratively called the "centroid", which is the most representative vector for the category. When a new document needs to be classified, it is compared with each centroid to decide which class it belongs to. A slightly improved Rocchio algorithm considers not only the documents that belong to the category (called positive samples) but also the documents that do not (called negative samples), and computes a centroid that is as close as possible to the positive samples while staying far from the negative samples.
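A minimal sketch of this improved Rocchio scheme in Python, assuming documents have already been turned into term-count (or TF-IDF) vectors; the positive/negative weights (16 and 4 here) are illustrative choices, not fixed by the algorithm:

    import numpy as np

    def rocchio_centroids(X, y, beta=16.0, gamma=4.0):
        # One centroid per class: pulled toward the class's own (positive)
        # samples and pushed away from all other (negative) samples.
        centroids = {}
        for c in np.unique(y):
            pos, neg = X[y == c], X[y != c]
            centroids[c] = beta * pos.mean(axis=0) - gamma * neg.mean(axis=0)
        return centroids

    def rocchio_predict(x, centroids):
        # Assign x to the class whose centroid is most similar (cosine).
        def cos(a, b):
            return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        return max(centroids, key=lambda c: cos(x, centroids[c]))

    # Toy term-count matrix; columns: ["basketball", "referee", "election"]
    X = np.array([[3.0, 1.0, 0.0],
                  [2.0, 2.0, 0.0],
                  [0.0, 0.0, 4.0]])
    y = np.array(["sports", "sports", "politics"])
    print(rocchio_predict(np.array([1.0, 1.0, 0.0]), rocchio_centroids(X, y)))  # -> sports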

Its advantages are that it is easy to implement and that computation (both training and classification) is particularly cheap, so it is commonly used as a benchmark for measuring the performance of classification systems; practical systems, however, rarely use this algorithm to solve real classification problems.

II. Naive Bayes
Advantages:
1. The naive Bayes model originates in classical mathematical theory, has a solid mathematical foundation, and delivers stable classification performance.
2. The NBC (naive Bayes classifier) model requires estimating only a few parameters, is not very sensitive to missing data, and the algorithm is relatively simple.
Disadvantages:
1. In theory, the NBC model has the smallest error rate of any classification method. In practice this is not always true, because the NBC model assumes the attributes are mutually independent, an assumption that often fails in real applications (one remedy is to first cluster strongly correlated attributes together); this hurts the model's classification accuracy. When the number of attributes is large or the correlations between attributes are strong, the NBC model is less efficient than a decision tree model; it performs best when attribute correlations are small.
2. It requires the prior probabilities to be known.
3. Because the classification decision is derived from a posterior probability computed from the prior and the data, the decision carries an inherent error rate.
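As a concrete illustration, here is a minimal naive Bayes text classifier built with scikit-learn; the tiny corpus and labels are made up for the example:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    docs = ["the referee stopped the basketball game",
            "the election results were announced today",
            "the team scored in the final minute",
            "parliament passed the new budget bill"]
    labels = ["sports", "politics", "sports", "politics"]

    # Word counts as features; MultinomialNB estimates per-class word probabilities.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(docs, labels)
    print(model.predict(["the coach praised the basketball team"]))  # -> ['sports']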

III. KNN Algorithm (k-Nearest Neighbors)
Advantages:
1. Simple and effective.
2. The cost of retraining is low (changes to the category system and the training set are common in web environments and e-commerce applications).
3. Computation time and space scale linearly with the size of the training set (which in some cases is not too large).
4. Because the KNN method determines the category mainly from a limited number of nearby samples, rather than from a discriminant over class regions, it is better suited than other methods to sample sets whose class regions intersect or overlap heavily.
5. The algorithm is well suited to automatic classification of classes with large sample sizes; classes with small sample sizes are more prone to misclassification.
Disadvantages:
1. KNN is a lazy learning method (lazy learning: essentially no model is built up front), so classification is much slower than with eager learning algorithms.
2. The category scores are not normalized (unlike probability scores).
3. The output is not very interpretable; decision tree output, for example, is more interpretable.
4. A major shortcoming at classification time arises when the samples are unbalanced: if one class has a very large sample size while the other classes are very small, the K nearest neighbors of a new sample may be dominated by samples from the large class. Since the algorithm decides only by counting the "nearest" neighbors, a class with many samples can dominate the vote whether or not those samples are genuinely close to the target; sheer quantity should not determine the result. This can be improved with distance weighting (giving neighbors closer to the sample a larger weight), as in the sketch after this list.
5. The computational cost is high. The common remedy is to edit the known sample points in advance, removing samples that contribute little to classification.
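A minimal sketch of the distance-weighted KNN variant mentioned in point 4, using scikit-learn (weights="distance" gives closer neighbors larger votes); the corpus is again an invented toy:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    docs = ["the referee stopped the basketball game",
            "the election results were announced today",
            "the team scored in the final minute",
            "parliament passed the new budget bill"]
    labels = ["sports", "politics", "sports", "politics"]

    # Lazy learner: fit() just stores the vectors; all work happens at predict().
    model = make_pipeline(TfidfVectorizer(),
                          KNeighborsClassifier(n_neighbors=3, weights="distance"))
    model.fit(docs, labels)
    print(model.predict(["the basketball team played well"]))  # -> ['sports']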

IV. Decision Trees
Advantages:
1. Decision trees are easy to understand and explain; people can grasp a tree's meaning after a brief explanation.
2. Data preparation for decision trees is often simple or unnecessary, whereas other techniques often require the data to be generalized first, for example by removing redundant or blank attributes.
3. Decision trees can handle both numerical and categorical attributes, while other techniques often require attributes of a single type.
4. A decision tree is a white-box model: given an observation, it is easy to derive the corresponding logical expression from the resulting tree (illustrated in the sketch at the end of this section).
5. The model is easy to evaluate with static tests, which means its credibility can be measured.
6. Feasible, effective results can be produced from large data sources in a relatively short time.
7. A decision tree can be constructed for datasets with many attributes.
8. Decision trees scale well to large databases, and the size of the tree is independent of the size of the database.
Disadvantages:
1. For data in which the classes have inconsistent sample counts, the information-gain results in a decision tree are biased toward features with more distinct values.
2. Decision trees have difficulty handling missing data.
3. Overfitting problems arise.
4. Correlations between attributes in the dataset are ignored.
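To illustrate the white-box property from point 4 of the advantages, here is a minimal scikit-learn sketch that trains a small tree on word counts and prints the learned rules (corpus and labels invented for the example):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.tree import DecisionTreeClassifier, export_text

    docs = ["the referee stopped the basketball game",
            "the election results were announced today",
            "the team scored in the final minute",
            "parliament passed the new budget bill"]
    labels = ["sports", "politics", "sports", "politics"]

    vec = CountVectorizer()
    X = vec.fit_transform(docs)
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, labels)

    # The tree reads directly as if/else rules over word counts.
    print(export_text(tree, feature_names=list(vec.get_feature_names_out())))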

V. AdaBoost
Advantages:
1. AdaBoost is a classifier with high accuracy.
2. Sub-classifiers can be constructed by a variety of methods; the AdaBoost algorithm supplies the framework.
3. When simple classifiers are used, the computed results are understandable, and constructing the weak classifiers is extremely simple.
4. It is simple and requires no feature screening.
5. Overfitting is not a major concern.
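A minimal sketch of AdaBoost on text features with scikit-learn; by default the weak learner is a depth-1 decision tree (a stump), matching the point above that the sub-classifiers can be extremely simple. The corpus is invented:

    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline

    docs = ["the referee stopped the basketball game",
            "the election results were announced today",
            "the team scored in the final minute",
            "parliament passed the new budget bill"]
    labels = ["sports", "politics", "sports", "politics"]

    # AdaBoost reweights training samples and combines many weak stumps.
    model = make_pipeline(CountVectorizer(),
                          AdaBoostClassifier(n_estimators=50, random_state=0))
    model.fit(docs, labels)
    print(model.predict(["the referee watched the basketball team"]))  # -> ['sports']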

VI. Support Vector Machine (SVM)
Advantages:
1. It can solve machine learning problems with small samples.
2. It can improve generalization performance.
3. It can handle high-dimensional problems.
4. It can solve nonlinear problems.
5. It avoids the problems of neural network structure selection and local minima.
Disadvantages:
1. It is sensitive to missing data.
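A minimal sketch of an SVM text classifier with scikit-learn: TF-IDF produces exactly the kind of high-dimensional, sparse features SVMs handle well, and a linear kernel is a common default for text (the corpus is invented):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    docs = ["the referee stopped the basketball game",
            "the election results were announced today",
            "the team scored in the final minute",
            "parliament passed the new budget bill"]
    labels = ["sports", "politics", "sports", "politics"]

    # Linear kernel: finds the maximum-margin hyperplane in TF-IDF space.
    model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear", C=1.0))
    model.fit(docs, labels)
    print(model.predict(["the referee praised the basketball team"]))  # -> ['sports']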
