A summary of text categorization

Source: Internet
Author: User

Having worked on text classification for a while, I am writing this post to summarize what I have learned.

A text classification problem is defined as follows: based on the content of a document, select the most appropriate category for it from a predefined set of category labels.

The basic steps of Chinese text classification are Chinese word segmentation, feature extraction, model training, and category prediction. It should be noted that statistics-based text classification generally requires a well-annotated corpus as a training set; the model is trained on it and then used to classify unlabeled text.

An unavoidable step in Chinese text processing is word segmentation. Unlike English, Chinese has no spaces between words to act as separators, so most Chinese natural language processing pipelines cannot skip this step. Simple segmentation algorithms include forward and backward maximum matching (mechanical segmentation), as well as n-gram, maximum entropy, and hidden Markov models (statistical segmentation). Among existing segmentation packages, the ICTCLAS system from the Chinese Academy of Sciences works well; those interested can try it, or implement a simple mechanical segmenter themselves. For text classification, as long as you have a good word list (a dictionary can be crawled from the Sogou thesaurus), the difference in effect between maximum matching and more complex segmentation algorithms is not large.
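
To make the mechanical approach concrete, here is a minimal sketch of forward maximum matching in Python. The tiny dictionary and the sample sentence are hypothetical stand-ins for a real word list, such as one crawled from the Sogou thesaurus.

```python
# Minimal sketch of forward maximum matching (mechanical segmentation).
# DICTIONARY is a hypothetical stand-in for a real word list.
DICTIONARY = {"北京", "大学", "北京大学", "生活", "学生"}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)

def forward_max_match(sentence):
    """Greedily take the longest dictionary word starting at each position."""
    words, i = [], 0
    while i < len(sentence):
        # Try the longest candidate first and shrink until a match is found;
        # a single character is always accepted as a fallback.
        for length in range(min(MAX_WORD_LEN, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in DICTIONARY:
                words.append(candidate)
                i += length
                break
    return words

print(forward_max_match("北京大学生活"))  # ['北京大学', '生活']
```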

After word segmentation, a Chinese text becomes a sequence of words; these words are the features of the text, and documents are distinguished by which words they contain and how often each word appears. However, if the word set is modeled directly after segmentation, the feature space is very large, which hurts performance, and the set contains many low-frequency and meaningless words, which hurts classification accuracy. Experiments show that classification after feature extraction beats using the raw, unreduced word space in both speed and accuracy.

Feature extraction means selecting the features that are most representative of a text and most discriminative between texts. The first step is to remove stop words from the feature space. Stop words are mostly meaningless high-frequency words, such as modal particles and pronouns like "you", "I", and "he". These words appear in almost every document, often many times, yet have almost no relation to the topic a document expresses, so they need to be removed.
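
A minimal sketch of this filtering step, assuming the tokens come from a segmenter like the one above; the stop-word list here is a hypothetical placeholder for a real published list.

```python
# Hypothetical stop-word list; in practice, load a published Chinese stop-word file.
STOP_WORDS = {"的", "了", "啊", "吗", "你", "我", "他"}

def remove_stop_words(tokens):
    """Drop stop words from a segmented document."""
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["我", "喜欢", "跳水", "啊"]))  # ['喜欢', '跳水']
```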

After stop words are removed, the feature space is still very large. At this point, statistical criteria are needed to select the features that are most discriminative between texts. Existing methods include chi-square statistics, information gain, mutual information, odds ratio, cross-entropy, and inter-class information. The chi-square statistic is described below; the other methods can also achieve good results, and interested readers can look them up.


(The chi-square formula was pasted as an image in the original post because CSDN does not render formulas well; a standard form is reproduced below.)
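
Since the original image is missing, here is the usual textbook form of the chi-square statistic for a term t and a category c (not necessarily the exact variant in the original screenshot):

$$\chi^2(t, c) = \frac{N\,(AD - BC)^2}{(A+B)(C+D)(A+C)(B+D)}$$

where N is the total number of training documents, A is the number of documents in category c containing t, B the number of documents outside c containing t, C the number in c not containing t, and D the number outside c not containing t. The larger chi-square(t, c) is, the more strongly the presence of t depends on c, so the highest-scoring terms are kept as features.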

After feature extraction we have a feature set, and each document can be uniquely represented as a vector over that set. This is the VSM, or vector space model. The model assumes that features are mutually independent of one another, so although a lot of information is lost, it still works very well in practice and is currently the most widely used representation in statistical text classification.

How does this model work? Take an example: suppose there are two documents. After segmentation and feature extraction, document 1 has the word set (Olympic Games, sports, champion, diving) and document 2 has the word set (Palestinian-Israeli conflict, capitalism, America). The feature word set is then (Olympic Games, sports, champion, diving, Palestinian-Israeli conflict, capitalism, America). Representing each document as a vector following the order of words in the feature set, document 1 becomes (1,1,1,1,0,0,0) and document 2 becomes (0,0,0,0,1,1,1). The 0s and 1s are the weights of each word in a document; here the weight is simply the number of occurrences. There is a classic formula for computing weights, TF-IDF, invented back in the 1960s and 70s; it works very well and has been applied in many settings. Oddly, it is said that the person who proposed it arrived at it by mistake; I do not know why. Anyway, back to the subject.
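
As a rough sketch, assuming the common tf × log(N/df) weighting (real variants differ in smoothing and normalization), TF-IDF vectors for the two toy documents above can be computed like this:

```python
import math

# The two toy documents from the example, already segmented and filtered.
docs = [
    ["Olympic Games", "sports", "champion", "diving"],
    ["Palestinian-Israeli conflict", "capitalism", "America"],
]
features = sorted({w for d in docs for w in d})  # fixed feature order

# Document frequency: in how many documents each feature appears.
df = {f: sum(f in d for d in docs) for f in features}
N = len(docs)

def tf_idf_vector(doc):
    """Weight each feature by term frequency * log(N / document frequency)."""
    return [doc.count(f) * math.log(N / df[f]) for f in features]

for d in docs:
    print([round(w, 3) for w in tf_idf_vector(d)])
```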

After vectorization, a classification algorithm is used to build a model and classify. Machine learning offers many algorithms for classifying vectors, such as naive Bayes, KNN, support vector machines, and neural networks. Naive Bayes is relatively simple, so it is the method mainly introduced here.

A basic introduction to the naive Bayes algorithm is as follows:
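
The formula appeared as an image in the original post and is missing here; a standard statement, in the usual notation with a document represented by features x = (x1, ..., xn) and a category c, is:

$$P(c \mid x) = \frac{P(c)\,P(x \mid c)}{P(x)} \propto P(c)\prod_{i=1}^{n} P(x_i \mid c)$$

The prior P(c) is estimated from category frequencies in the training set, the product relies on the naive assumption that features are conditionally independent given the category, and the predicted category is the c that maximizes the right-hand side.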


Depending on how the prior and the conditional probability P(x|c) are estimated, naive Bayes is divided into the multinomial model, the Bernoulli model, the Poisson model, and so on. This post mainly introduces the Bernoulli model and the multinomial model.
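
The original formula images are missing; the standard Laplace-smoothed estimates are as follows. In the multinomial model, P(t|c) is estimated at term granularity:

$$P(t \mid c) = \frac{\mathrm{count}(t, c) + 1}{\sum_{t'} \mathrm{count}(t', c) + |V|}$$

where count(t, c) is the number of occurrences of term t across all documents of category c and |V| is the vocabulary size. In the Bernoulli model, P(t|c) is estimated at document granularity:

$$P(t \mid c) = \frac{N_{t,c} + 1}{N_c + 2}$$

where N_{t,c} is the number of documents of category c containing t and N_c is the total number of documents of category c; terms absent from a document contribute a factor of 1 − P(t|c).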


As the formulas above show, the Bernoulli model works at document granularity (whether a term appears in a document at all), while the multinomial model works at term granularity (how many times a term appears).
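
To tie the steps together, here is a minimal sketch of training and prediction with the multinomial model, using a hypothetical toy training set; a real system would train on a properly annotated corpus.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set: (segmented document, category) pairs.
train = [
    (["Olympic Games", "champion", "diving"], "sports"),
    (["sports", "champion"], "sports"),
    (["Palestinian-Israeli conflict", "America"], "politics"),
    (["capitalism", "America"], "politics"),
]

doc_counts = Counter(c for _, c in train)   # documents per category
term_counts = defaultdict(Counter)          # term occurrences per category
for words, c in train:
    term_counts[c].update(words)
vocab = {w for words, _ in train for w in words}

def predict(doc):
    """Pick the category maximizing log P(c) + sum over terms of log P(t|c)."""
    best_c, best_score = None, float("-inf")
    for c in doc_counts:
        score = math.log(doc_counts[c] / len(train))  # log prior
        total = sum(term_counts[c].values())
        for t in doc:
            # Laplace-smoothed multinomial estimate of P(t|c).
            score += math.log((term_counts[c][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best_c, best_score = c, score
    return best_c

print(predict(["diving", "champion"]))  # 'sports'
```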


Well, that completes this summary of text classification.

