Text Categorization Overview

Tags: svm, idf

Transferred from: http://blog.csdn.net/csdwb/article/details/7082066

    1. Overview
    2. Feature Selection
    3. Classifiers
I. Overview

Text classification is a very important module in text processing, and its applications are extensive, such as spam filtering, news classification, part-of-speech tagging, and so on. It is not essentially different from other classification tasks: the core approach is to extract features from the labelled data, find the best match, and thus classify. But text also has its own characteristics, and accordingly the flow of text classification is: 1. preprocessing; 2. text representation and feature selection; 3. constructing the classifier; 4. classification. Each module is described below.

1. Preprocessing

As you know, Chinese writing, unlike English, does not separate words with spaces; the characters run together, so the first step is word segmentation, splitting the text into words (or characters). The quality of the segmentation has a large impact on all subsequent steps (segmentation methods include dictionary-based methods, statistical methods, and so on; see my article on the application of probability and statistics models in search engines). The second step is to remove common, insignificant words (called "stop words"), such as 的 and 是. With that, our preprocessing stage is complete.
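
As a minimal sketch of this preprocessing step (my own illustration, not from the original article), the following assumes the jieba segmenter and a tiny, hypothetical stop-word list; a real system would use a much fuller list.

```python
# Minimal preprocessing sketch: segment Chinese text, then drop stop words.
# jieba is assumed as the segmenter; the stop-word set here is only illustrative.
import jieba

STOP_WORDS = {"的", "是", "了", "，", "。"}  # hypothetical stop words / punctuation

def preprocess(text: str) -> list[str]:
    """Segment the text into words and remove stop words."""
    return [w for w in jieba.cut(text) if w.strip() and w not in STOP_WORDS]

print(preprocess("深入搜索引擎，学习搜索引擎"))
```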

2. Text representation and feature selection

The current text representation is the vector space model: each word obtained from segmentation becomes one element of a vector. But what value should that element take? The first thought is the frequency with which the word occurs. For example, the segmented text "into / search engine, learning / search engine" can be represented as (0, ..., 2, ..., 0, 1, ..., 0, 1, 0, ...). Why are there so many ellipses in the vector? Because all vectors must be unified: the position of each word in the vector must be fixed. To fix it, all the words can be sorted, say in dictionary order, and each word assigned a position. Assuming there are 10,000 Chinese words, then to classify the short text "into the search engine, learning search engine" I need to give it at least a 10,000-dimensional vector, of which only 3 positions are actually used.

However, raw frequency leads to an unfair result: a word like "we" occurs very often, so its component is large. For this reason raw word frequency is almost never used directly as the feature weight; commonly used methods are TF/IDF, mutual information, information gain, the χ2 statistic, and so on.

As just mentioned, the vector dimension is very large, even after removing stop words. Since the dimension is too large, there are generally two ways to handle it: feature selection and feature extraction. Feature selection picks a representative subset of the features to represent the text; for example, when using TF/IDF, one can discard the very large and very small values and keep the rest as features. Feature extraction constructs new features from the existing ones; the new features have lower dimension than the original ones, i.e. it is dimensionality reduction, and the most common method is latent semantic analysis using singular value decomposition (SVD). We know that a common technique in signal processing is to map a spatial-domain signal into the frequency domain (for example with the FFT or wavelets), where the signal energy is more concentrated and easier to handle; the idea here is the same.
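
The following sketch illustrates the vector space model with TF/IDF weighting and SVD-based dimensionality reduction (latent semantic analysis) using scikit-learn; the tiny corpus is made up for illustration.

```python
# Build a TF/IDF term-document matrix, then reduce its dimension with SVD (LSA).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    "learning search engine",              # made-up documents
    "search engine index",
    "spam filtering news classification",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)   # shape: (n_documents, n_terms), mostly zeros
svd = TruncatedSVD(n_components=2)     # project onto a 2-dimensional latent space
X_reduced = svd.fit_transform(X)
print(X.shape, X_reduced.shape)
```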

3. Constructing classifiers

The next task is to construct the classification rules, that is, the classifier: we train on a large amount of data to produce a model. Commonly used methods include KNN, naive Bayes, support vector machines, neural networks, decision trees, Rocchio, linear least-squares fit, and so on.

4. Classification

After the classifier model has been generated, we simply take a new text, throw it at the classifier, and it produces the category of the text.
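
As a rough end-to-end sketch of steps 3 and 4 (my own illustration, not the author's code), one can train a simple model on labelled texts and then classify a new text; the corpus and labels below are made up.

```python
# Train a classifier on labelled texts, then classify a new document.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["cheap pills buy now", "election results announced",
               "limited offer click here", "new policy passed by parliament"]
train_labels = ["spam", "news", "spam", "news"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)              # construct the classifier model
print(model.predict(["click now for a cheap offer"]))
```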

II. Feature Selection

The above is an overview of text classification. Below we go deeper into each of its modules, starting with feature selection. Commonly used feature scoring methods include TF/IDF, mutual information, information gain, the χ2 statistic, and so on.

1. TF/IDF

The idea of TF/IDF is that a word contributes more to a document the more frequently it occurs in that document (high term frequency, TF) and the rarer it is across documents (low document frequency, i.e. high inverse document frequency, IDF).

For details, see the easy-to-understand introduction to TF/IDF that Wu Jun wrote on the Google China Blackboard blog; if you still do not understand it after that, there is no helping it.
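
As a small worked sketch (one of several common TF/IDF variants, not taken from the article), the score of a term can be computed from made-up counts like this:

```python
# TF/IDF sketch: tf = term count / document length, idf = log(N / document frequency).
import math

def tf_idf(term_count_in_doc, doc_length, docs_containing_term, total_docs):
    tf = term_count_in_doc / doc_length
    idf = math.log(total_docs / docs_containing_term)
    return tf * idf

# "search" occurs 2 times in a 10-word document and in 100 of 10,000 documents.
print(tf_idf(2, 10, 100, 10_000))   # 0.2 * log(100) ≈ 0.92
```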

2. Information gain

The idea of information gain is to measure how much information a feature brings to the classification system: the more information it brings, the more important the feature. The amount of information is measured by entropy. For a variable X that can take n values, each with probability p_i, the entropy of X is: H(X) = -∑_i p_i log2 p_i.

For the classification system, the class C is a variable that can take the values c_1, c_2, ..., c_n, appearing with probabilities P(c_1), P(c_2), ..., P(c_n), where n is the total number of categories. The entropy of the classification system is therefore: H(C) = -∑_j P(c_j) log2 P(c_j). Information gain is defined for one feature: for a feature t_i, it is the difference between the amount of information in the system when the feature is not considered and when it is considered; this difference is the amount of information the feature brings to the system. Clearly, the amount of information without considering any feature is H(C). When a feature is considered there are two cases: the feature appears, and the feature does not appear.

The contribution when feature t_i appears is: P(t_i) H(C|t_i)

The contribution when feature t_i does not appear is: P(t_i') H(C|t_i')

Therefore, the conditional entropy of the system when feature t_i is considered is:

H(C|T_i) = P(t_i) H(C|t_i) + P(t_i') H(C|t_i')

= -P(t_i) ∑_j P(c_j|t_i) log2 P(c_j|t_i) - P(t_i') ∑_j P(c_j|t_i') log2 P(c_j|t_i')

where ∑_j runs over j = 1, ..., n.

Finally, the information gain of feature t_i is:

IG(t_i) = H(C) - H(C|T_i)

= -∑_j P(c_j) log2 P(c_j) + P(t_i) ∑_j P(c_j|t_i) log2 P(c_j|t_i) + P(t_i') ∑_j P(c_j|t_i') log2 P(c_j|t_i')

With the above formula, the remaining quantities are easy to compute. P(c_j) is the probability that category c_j appears in the corpus; if every category has the same number of documents it is simply 1 divided by the total number of categories n. P(t_i) is the probability that feature t_i appears in the corpus: the number of documents containing t_i divided by the total number of documents. Likewise, P(c_j|t_i), the probability of category c_j given that t_i appears, is the number of documents that contain t_i and belong to c_j divided by the number of documents containing t_i. P(t_i') is the probability of a document in the corpus not containing t_i, and P(c_j|t_i') is the conditional probability that a document belongs to c_j given that it does not contain t_i.
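
A small sketch of this computation from raw document counts follows (my own illustration); the counts in the example are made up.

```python
# Information gain of a feature t_i, computed from document counts.
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(docs_per_class, docs_with_term_per_class, total_docs):
    """docs_per_class[j]: documents in class c_j;
    docs_with_term_per_class[j]: documents in c_j that contain the feature t_i."""
    n_t = sum(docs_with_term_per_class)                 # documents containing t_i
    n_not_t = total_docs - n_t
    h_c = entropy([c / total_docs for c in docs_per_class])
    h_c_t = entropy([a / n_t for a in docs_with_term_per_class]) if n_t else 0.0
    absent = [c - a for c, a in zip(docs_per_class, docs_with_term_per_class)]
    h_c_not_t = entropy([a / n_not_t for a in absent]) if n_not_t else 0.0
    return h_c - (n_t / total_docs) * h_c_t - (n_not_t / total_docs) * h_c_not_t

# Two classes of 50 documents each; t_i appears in 30 docs of class 1 and 5 of class 2.
print(information_gain([50, 50], [30, 5], 100))
```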

3. χ2 Statistic

The χ2 statistic measures the degree of correlation between the feature t_i and the category c_j, under the assumption that t_i and c_j follow a χ2 distribution with one degree of freedom. The higher the χ2 value of a feature for a class, the greater its correlation with that class, and the more class information it carries.

Now assume that N is the total number of documents in the training corpus; A is the number of documents that contain the feature t_i and belong to class c_j; B is the number of documents that contain t_i but do not belong to c_j; C is the number of documents that do not contain t_i but belong to c_j; and D is the number of documents that neither contain t_i nor belong to c_j. Then the number of documents that would theoretically contain t_i and belong to c_j is E_A = (A + C)(A + B)/N (A + C is the number of documents in class c_j, and (A + B)/N is the probability that feature t_i occurs), and its χ2 contribution is d_A = (A - E_A)^2 / E_A. The contributions d_B, d_C, and d_D of the other three cells are computed in the same way, so

χ2(t_i, c_j) = d_A + d_B + d_C + d_D = N(AD - CB)^2 / [(A + C)(B + D)(A + B)(C + D)]
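
A direct sketch of this formula from the A, B, C, D counts (the example numbers are made up):

```python
# Chi-square statistic of feature t_i and class c_j from 2x2 contingency counts.
def chi_square(a, b, c, d):
    n = a + b + c + d
    return n * (a * d - c * b) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))

# t_i appears in 40 docs of c_j and 10 docs outside c_j; it is absent from
# 60 docs of c_j and 890 docs outside c_j.
print(chi_square(40, 10, 60, 890))
```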

4. Mutual information

The idea of mutual information (MI) is that the greater the mutual information, the greater the degree of co-occurrence between the feature t_i and the category c_j:

I(t_i, c_j) = log[ P(t_i, c_j) / (P(t_i) P(c_j)) ]

= log[ P(t_i|c_j) / P(t_i) ]

≈ log[ A·N / ((A + C)(A + B)) ]

where the symbols A, B, C, and N have the same meaning as in the χ2 section above.
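
A sketch of this estimate using the same counts (made-up numbers):

```python
# Mutual information of feature t_i and class c_j, estimated from counts.
import math

def mutual_information(a, b, c, n):
    """a: docs with t_i in c_j; b: docs with t_i outside c_j;
    c: docs in c_j without t_i; n: total number of documents."""
    return math.log(a * n / ((a + c) * (a + b)))

print(mutual_information(40, 10, 60, 1000))
```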

The most common way to select features is to compute a score for each word using one of the four methods above, set a threshold, and keep every word whose score exceeds the threshold as a feature; these selected words make up the feature vector.
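
For illustration only, a threshold-based selection over made-up scores might look like this:

```python
# Keep every word whose score exceeds the chosen threshold.
scores = {"engine": 0.92, "search": 0.85, "the": 0.01, "offer": 0.47}  # made-up scores
threshold = 0.4
features = [word for word, score in scores.items() if score > threshold]
print(features)
```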

III. Classifiers

The following introduces classifiers for text classification. Commonly used methods include KNN, naive Bayes, support vector machines, neural networks, decision trees, Rocchio, linear least-squares fit, and so on. Here we only introduce four classifiers with good performance: KNN, naive Bayes, Rocchio, and the support vector machine (SVM).


1. K-Nearest Neighbor algorithm (KNN)

The basic idea of the KNN algorithm is: given a test document, the system finds the k nearest (most similar) documents in the training set, and then decides the category of the test document according to the categories of those k documents. The steps are as follows (a code sketch follows the list):

(1) Perform feature selection on the training documents and the test document, and represent them as text vectors.

(2) From the training set, select the k texts most similar to the test document, using the cosine similarity formula:

Sim(T, D) = ∑_{i=1..N} (T_i × D_i) / [ (∑_{i=1..N} T_i^2) × (∑_{i=1..N} D_i^2) ]^(1/2)

where T is the test document and D is the training document.

(3) Among the k nearest-neighbor documents found, compute the weight of each class in turn, using the formula:

p(T, c_j) = ∑_{d_i ∈ KNN} Sim(T, d_i) · y(d_i, c_j)

where T is the test document, d_i is one of the k selected training documents, and y(d_i, c_j) is the membership function: its value is 1 if document d_i belongs to class c_j, and 0 otherwise.

(4) Compare the class weights and assign the text to the category with the largest weight.
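
The sketch below follows these four steps using cosine similarity on already-built vectors; the data layout and the choice k = 3 are assumptions for illustration.

```python
# KNN text classification: find the k most similar training vectors,
# accumulate per-class weights by similarity, and pick the heaviest class.
import numpy as np

def cosine(t, d):
    return float(t @ d / (np.linalg.norm(t) * np.linalg.norm(d) + 1e-12))

def knn_classify(test_vec, train_vecs, train_labels, k=3):
    sims = [cosine(test_vec, d) for d in train_vecs]
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    weights = {}
    for i in top:
        weights[train_labels[i]] = weights.get(train_labels[i], 0.0) + sims[i]
    return max(weights, key=weights.get)   # category with the largest weight
```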



2. Naive Bayes

The basic idea of the naive Bayes classifier is to use the joint probability of feature terms and classes to estimate the class probabilities of a given document, assuming a unigram model of the text, i.e. that the words are mutually independent. According to Bayes' formula, the probability that document d belongs to class c_j is:

P(c_j|d) = P(d|c_j) × P(c_j) / P(d)

Furthermore, P(d|c_j) = P(d(t_1)|c_j) × ... × P(d(t_n)|c_j), P(d) is a constant, P(c_j) = N(c_j)/N, and P(d(t_i)|c_j) = (N(t_i, c_j) + 1) / (N(c_j) + M), where N(c_j) is the number of training texts belonging to class c_j, N is the total number of training texts, N(t_i, c_j) is the number of training texts in class c_j that contain the feature t_i, and M is the total number of feature items in the training set.

We just compute P(c_j|d) for every category; whichever value is largest gives the category the document belongs to.
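
A minimal sketch of this scoring rule (my own illustration), working in log space and using the add-one smoothing from the formula above; all counts passed in are assumed to come from the training corpus.

```python
# Naive Bayes score of one class c_j for a document, up to the constant P(d).
import math

def nb_log_score(doc_terms, class_doc_count, term_class_counts, total_docs, vocab_size):
    """class_doc_count = N(c_j); term_class_counts[t] = N(t, c_j);
    total_docs = N; vocab_size = M."""
    score = math.log(class_doc_count / total_docs)            # log P(c_j)
    for t in doc_terms:
        n_t_c = term_class_counts.get(t, 0)
        score += math.log((n_t_c + 1) / (class_doc_count + vocab_size))
    return score
```

Computing this score for every class and taking the largest gives the predicted category.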



3. Rocchio classifier

The basic idea of the Rocchio classifier is to first build a feature vector for each training text d, and then use the feature vectors of the training texts to build a prototype vector for each class. Given a text to be classified, compute the distance between its vector and the prototype vector of each class; the distance can be the vector dot product, the cosine of the angle between the vectors, or some other function, and the computed distances decide which category the text belongs to. There are several ways to compute the prototype vector; the simplest is to average the feature vectors of all the texts in the class. The effectiveness of this classifier is second only to the KNN and SVM methods.
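
A sketch of the simplest variant described above, where each prototype is the mean of its class's training vectors and similarity is the cosine; the vector representation is assumed to have been built already.

```python
# Rocchio: one prototype (mean vector) per class; classify by nearest prototype.
import numpy as np

def train_rocchio(train_vecs, train_labels):
    return {c: np.mean([v for v, y in zip(train_vecs, train_labels) if y == c], axis=0)
            for c in set(train_labels)}

def rocchio_classify(prototypes, test_vec):
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(prototypes, key=lambda c: cos(prototypes[c], test_vec))
```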



4. Support Vector Machine (SVM)

The basic idea of the SVM classification method is to find a decision plane (hyperplane) in the vector space that best separates the data points of the two categories.
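
As an illustrative sketch only (not the author's setup), a linear SVM over TF/IDF vectors can be trained with scikit-learn; the training texts and labels are made up.

```python
# Linear SVM text classifier: TF/IDF features + a linear decision plane.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts = ["buy cheap pills now", "parliament passed a new law",
         "limited cheap offer today", "news report on the election"]
labels = ["spam", "news", "spam", "news"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["a very cheap offer"]))
```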
