Feature Selection Methods in Text Classification: Chi-Square Test and Information Gain

Source: Internet
Author: User
Tags: idf

-1. Misunderstanding of TF-IDF

TF-IDF can effectively assess how important a word is to one document in a collection or corpus, because it combines the word's importance within the document with the word's ability to discriminate between documents. However, simply using TF-IDF is not enough to judge whether a feature is discriminative in text classification.

1) It does not consider the distribution of a feature word across classes. A good feature should appear frequently in one class and rarely in the others, i.e. its document frequency should differ between classes. If a feature word is evenly distributed across the classes, it contributes nothing to classification; if it is concentrated in one class and rare in the others, it can represent that class. TF-IDF cannot distinguish between these two situations.

2) It does not consider the distribution of a feature word among the documents inside a class. If the word is evenly distributed over the documents of the class, it can represent the class; if it appears in only a few documents of the class and not in the rest, it clearly cannot represent the class.

Feature selection aims at dimensionality reduction. Even if you pick a subset of the words at random, the classification result will not be astonishingly bad, so selecting features by TF-IDF can of course also give reasonable results. TF-IDF is, after all, designed for the vector space model, and it is quite effective for computing document similarity.
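
As a point of reference, here is a minimal sketch of the plain TF-IDF weighting (my own illustration, not code from the original article; the function name and the toy documents are made up). Note that the formula never looks at class labels, which is exactly the limitation described above.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of documents, each a list of words.
    Returns one {word: weight} dict per document."""
    N = len(docs)
    df = Counter(w for d in docs for w in set(d))            # document frequency of each word
    weights = []
    for d in docs:
        tf = Counter(d)                                      # term frequency inside this document
        weights.append({w: tf[w] * math.log(N / df[w]) for w in tf})
    return weights

print(tf_idf([["basketball", "game", "game"], ["market", "stock"]]))
```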

0. Feature Selection Method

Only two quantities can actually be observed in text: term frequency and document frequency. All methods are computed from these two quantities. TF-IDF, which combines the two, does not select features with respect to categories. Feature selection algorithms based on document frequency include the document frequency method itself (sorting directly by document frequency), the chi-square test, information gain, and mutual information.

Experiments on English text show that the chi-square test and information gain perform best as feature selection methods (the comparison is obtained by running the same classification algorithm with different feature selection algorithms); the document frequency method performs roughly as well as the first two; the term strength method performs moderately; and the mutual information method performs worst.

1. Chi-square test

The basic idea of the chi-square test is to judge whether a theory is correct from the deviation between the observed values and the theoretical values. In practice, we first assume that the two variables are independent (the "null hypothesis"), and then observe how far the actual (observed) values deviate from the theoretical values (the values that would be expected if the two really were independent). If the deviation is small enough, we regard it as natural sampling error, caused by imprecise measurement or by chance; the two variables really are independent, and we accept the null hypothesis. If the deviation is so large that it is unlikely to be caused by chance or by imprecise measurement, we conclude that the two variables are actually related, i.e. we reject the null hypothesis and accept the alternative hypothesis.

Let the theoretical value be E and the observed values be x_i. The degree of deviation is computed as

D = \sum_{i=1}^{n} \frac{(x_i - E)^2}{E}

This is the deviation measure used by the chi-square test. Given the observed values of several samples x_1, x_2, ..., x_i, ..., x_n, substituting them into the formula yields the chi-square value. This value is compared with a preset threshold: if it is greater than the threshold (i.e. the deviation is large), the null hypothesis is rejected; otherwise the null hypothesis is accepted.
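
A small worked example with made-up numbers: suppose the theoretical value under the null hypothesis is E = 50 and two observed samples are x_1 = 45 and x_2 = 58. Then

D = \frac{(45 - 50)^2}{50} + \frac{(58 - 50)^2}{50} = \frac{25}{50} + \frac{64}{50} = 1.78

and this value would be compared against the chosen threshold.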

In the feature selection stage of text classification, the null hypothesis is usually "word t is irrelevant to (independent of) class c". The larger the computed chi-square value, the greater the deviation from the null hypothesis, and the more we lean towards the opposite of the null hypothesis being true. The selection process computes the chi-square value between each word and class c, sorts the words from large to small (the larger the value, the more relevant), and takes the first k words as features.

For example, suppose N documents are classified into "Sports" and "non-Sports", and we investigate the correlation between the word "basketball" and the category "Sports":


Feature Selection                   Belongs to "Sports"   Not "Sports"   Total
Contains "basketball"                        A                 B          A + B
Does not contain "basketball"                C                 D          C + D
Total                                      A + C             B + D          N

According to the null hypothesis, the proportion of documents containing "basketball" within the "Sports" class should be the same as the proportion of documents containing "basketball" in the whole document set. Therefore the theoretical value of A should be:

E_{11} = (A + C) \cdot \frac{A + B}{N}

Difference value:

D_{11} = \frac{(A - E_{11})^2}{E_{11}}

Compute the difference values D_{12}, D_{21} and D_{22} for the remaining three cells in the same way. Finally, the chi-square value of "basketball" with respect to "Sports" is:

\chi^2(\text{basketball}, \text{Sports}) = D_{11} + D_{12} + D_{21} + D_{22} = \frac{N (AD - BC)^2}{(A+B)(A+C)(B+D)(C+D)}

This can be simplified further. Note that for a given document set (such as our training set) and a given category, N, the number of documents in the class M, and N - M (i.e. A + C and B + D) are the same for every word. Since we only care about the relative ordering of the words' chi-square values for a category, not the exact values, these constant factors can be dropped, and in actual computation we use

\chi^2(t, c) \propto \frac{(AD - BC)^2}{(A+B)(C+D)}
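
A minimal sketch in Python of this per-word score (my own illustration; the function name and the example counts are made up, not from the original article):

```python
def chi_square(A, B, C, D, simplified=True):
    """Chi-square score of a word with respect to one class.

    A: docs in the class that contain the word
    B: docs outside the class that contain the word
    C: docs in the class that do not contain the word
    D: docs outside the class that do not contain the word
    """
    numerator = (A * D - B * C) ** 2
    if simplified:
        # N, (A + C) and (B + D) are the same for every word given a fixed
        # document set and class, so dropping them does not change the ranking.
        return numerator / ((A + B) * (C + D))
    N = A + B + C + D
    return N * numerator / ((A + B) * (C + D) * (A + C) * (B + D))

# Example with made-up counts: 80 of 100 sports docs contain "basketball",
# and 5 of 200 non-sports docs contain it.
print(chi_square(A=80, B=5, C=20, D=195))
```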

The disadvantage of the chi-square test is that it only counts whether a word appears in a document, regardless of how many times it appears. This biases it towards low-frequency words (it exaggerates their role). In some cases, a word that appears only once in every document of a class can get a higher chi-square value than a word that appears ten times in 99% of the documents of that class; the latter word is actually more representative, but simply because it appears in slightly fewer documents than the former, it may be screened out during feature selection while the former is retained. This is the well-known "low-frequency word defect" of the chi-square test. For this reason, the chi-square test is often combined with other factors such as term frequency, so that their strengths complement each other.

2. Information Gain (IG)

Information entropy (the amount of information, defined for a system)

For a variable, for example the category variable C that takes the values c_1, c_2, ..., c_n with probabilities P(c_1), ..., P(c_n), the information entropy is

H(C) = -\sum_{i=1}^{n} P(c_i) \log P(c_i)

The more possible variations a variable has, the more information it carries (this has nothing to do with the specific values the variable takes, only with how many kinds of values there are and with their probabilities of occurrence). The more ordered a system is, the lower its information entropy; conversely, the more chaotic a system is, the higher its information entropy. Information entropy is therefore also a measure of the degree of order of a system.

Information gain (defined for a feature)

It refers to the effective reduction of expected information or information entropy.

For a feature t, consider how much information the system contains when it has the feature and when it does not; the difference between the two is the amount of information the feature brings to the system. With the feature, that amount is the information entropy; without it, it is the conditional entropy.

Conditional entropy: the amount of information in the system when a feature t is not allowed to vary (its value is fixed).

For a feature X that can take n values (x_1, x_2, ..., x_n), compute the conditional entropy for each value and then take the average weighted by the probability of each value:

H(C|X) = \sum_{i=1}^{n} P(x_i) H(C|X = x_i)

In text classification, the feature word T takes only the values t (T appears) and \bar{t} (T does not appear). So

H(C|T) = P(t) H(C|t) + P(\bar{t}) H(C|\bar{t})

Finally, the information gain is

IG(T) = H(C) - H(C|T) = H(C) - \bigl( P(t) H(C|t) + P(\bar{t}) H(C|\bar{t}) \bigr)
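
A minimal sketch of this computation (my own illustration, not from the original article): the information gain of a word for a binary classification task, computed from the same A, B, C, D document counts used in the chi-square table above.

```python
import math

def entropy(probs):
    """Entropy -sum p*log2(p), ignoring zero probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(A, B, C, D):
    """IG(T) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)].

    A: positive docs containing the word     B: negative docs containing it
    C: positive docs not containing it       D: negative docs not containing it
    """
    N = A + B + C + D
    h_c = entropy([(A + C) / N, (B + D) / N])           # H(C)
    p_t, p_not_t = (A + B) / N, (C + D) / N             # P(t), P(not t)
    h_c_t = entropy([A / (A + B), B / (A + B)]) if A + B else 0.0
    h_c_not_t = entropy([C / (C + D), D / (C + D)]) if C + D else 0.0
    return h_c - (p_t * h_c_t + p_not_t * h_c_not_t)

print(information_gain(A=80, B=5, C=20, D=195))
```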

However, the biggest problem with information gain is that it can only measure a feature's contribution to the whole system, not to a specific category. This makes it suitable only for "global" feature selection (all classes share the same feature set), not for "local" feature selection (each class has its own feature set, since a word may be discriminative for one class and irrelevant to another).

Appendix: Feature Extraction steps

1. Chi-square test

1.1 Count the total number of documents in the sample set (N).

1.2 For each word, count the number of positive-class documents in which it appears (A), the number of negative-class documents in which it appears (B), the number of positive-class documents in which it does not appear (C), and the number of negative-class documents in which it does not appear (D).

1.3 Compute the chi-square value of each word. The formula is:

\chi^2 = \frac{N (AD - BC)^2}{(A+B)(C+D)(A+C)(B+D)}

1.4 Sort the words by chi-square value from large to small and select the first K words as features; K is the feature dimension.
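
A sketch of steps 1.1-1.4 (my own illustration with made-up toy documents; the function name is hypothetical): build the A/B/C/D counts per word from a labelled corpus, score every word with the chi-square formula above, and keep the top-K words as features.

```python
from collections import Counter

def select_features_chi2(pos_docs, neg_docs, k):
    """pos_docs / neg_docs: lists of documents, each a list of words."""
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    N = n_pos + n_neg                                        # step 1.1
    df_pos = Counter(w for d in pos_docs for w in set(d))    # step 1.2: A per word
    df_neg = Counter(w for d in neg_docs for w in set(d))    # step 1.2: B per word
    scores = {}
    for word in set(df_pos) | set(df_neg):
        A, B = df_pos[word], df_neg[word]
        C, D = n_pos - A, n_neg - B                          # docs where the word is absent
        denom = (A + B) * (C + D) * (A + C) * (B + D)
        scores[word] = N * (A * D - B * C) ** 2 / denom if denom else 0.0  # step 1.3
    return sorted(scores, key=scores.get, reverse=True)[:k]  # step 1.4

docs_sports = [["basketball", "game"], ["basketball", "team"], ["score", "team"]]
docs_other = [["election", "vote"], ["market", "stock"], ["basketball", "market"]]
print(select_features_chi2(docs_sports, docs_other, k=3))
```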

2. Information Gain

2.1 Count the numbers of documents in the positive and negative classes: N1 and N2 (N = N1 + N2).

2.2 For each word, count the number of positive-class documents in which it appears (A), the number of negative-class documents in which it appears (B), the number of positive-class documents in which it does not appear (C), and the number of negative-class documents in which it does not appear (D).

2.3 Compute the information entropy of the class distribution:

H(C) = -\left( \frac{N_1}{N} \log \frac{N_1}{N} + \frac{N_2}{N} \log \frac{N_2}{N} \right)

2.4 Compute the information gain of each word:

IG(t) = H(C) - \left( \frac{A+B}{N} H(C \mid t) + \frac{C+D}{N} H(C \mid \bar{t}) \right)

where H(C \mid t) = -\left( \frac{A}{A+B} \log \frac{A}{A+B} + \frac{B}{A+B} \log \frac{B}{A+B} \right) and H(C \mid \bar{t}) is computed analogously from C and D.

2.5 Sort the words by information gain from large to small and select the first K words as features; K is the feature dimension.
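
A sketch of steps 2.1-2.5 (my own illustration with made-up toy documents; the function name is hypothetical): count A/B/C/D for every word, score it with information gain, and keep the K highest-scoring words as the feature set.

```python
import math
from collections import Counter

def entropy(probs):
    """Entropy -sum p*log2(p), ignoring zero probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def select_features_ig(pos_docs, neg_docs, k):
    """pos_docs / neg_docs: lists of documents, each a list of words."""
    n_pos, n_neg = len(pos_docs), len(neg_docs)              # step 2.1: N1, N2
    N = n_pos + n_neg
    h_c = entropy([n_pos / N, n_neg / N])                    # step 2.3: H(C)
    df_pos = Counter(w for d in pos_docs for w in set(d))    # step 2.2: A per word
    df_neg = Counter(w for d in neg_docs for w in set(d))    # step 2.2: B per word
    scores = {}
    for word in set(df_pos) | set(df_neg):
        A, B = df_pos[word], df_neg[word]
        C, D = n_pos - A, n_neg - B
        h_t = entropy([A / (A + B), B / (A + B)]) if A + B else 0.0
        h_not_t = entropy([C / (C + D), D / (C + D)]) if C + D else 0.0
        scores[word] = h_c - ((A + B) / N * h_t + (C + D) / N * h_not_t)  # step 2.4
    return sorted(scores, key=scores.get, reverse=True)[:k]  # step 2.5

docs_sports = [["basketball", "game"], ["basketball", "team"], ["score", "team"]]
docs_other = [["election", "vote"], ["market", "stock"], ["basketball", "market"]]
print(select_features_ig(docs_sports, docs_other, k=3))
```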
