Common preprocessing methods for text modeling -- feature selection methods (CHI and IG)


This article covers TF-IDF, CHI (the chi-square test), and IG (information gain).


References:

http://blog.sina.com.cn/s/blog_6622f5c30101datu.html

http://lovejuan1314.iteye.com/blog/651460



1) A common misunderstanding: TF-IDF as a feature selection method.

TF-IDF was designed for the vector space model, where it is quite effective for computing document similarity. But TF-IDF alone is not enough to judge whether a feature is discriminative for text classification.

=== It only considers how important a word is within a document and how well it differentiates documents.

= = = "It does not take into account the distribution of feature words between classes ." Feature selection should have more features in one class, and less in other classes, that is to investigate the differences in document frequencies. If a characteristic word is distributed evenly among the classes, such words have no contribution to the classification; but if a feature word is distributed in a class in a relatively concentrated way, and it hardly appears in other classes, such a word can represent the characteristics of this class well, and TF-IDF cannot distinguish between the two cases.

= = = "It does not take into account the distribution of feature words in the internal documents of the class." In a document inside a class, if the feature word is evenly distributed among them, the feature Word can represent the characteristics of the class well, and if it appears only in a few documents and does not appear in other documents in this class, it is obvious that such feature words cannot represent the characteristics of this class.



2) Overview of feature selection methods.

Only two quantities can actually be observed in a text: term frequency and document frequency. All feature selection methods are built on these two quantities.

Experiments on plain English text show that, as feature selection methods, the chi-square test and information gain perform best (comparing different feature selection algorithms under the same classification algorithm); the document frequency method (ranking features directly by document frequency) performs roughly as well as those two; the term strength method performs moderately; and the mutual information method performs worst.



3) Information gain.

In text categorization, the feature word $T$ takes only two values: $t$ ($T$ appears) and $\bar{t}$ ($T$ does not appear). So the entropy of the category system is

$$H(C) = -\sum_{i=1}^{n} P(c_i)\log P(c_i)$$

and the conditional entropy of the system given the feature $T$ is

$$H(C \mid T) = -P(t)\sum_{i=1}^{n} P(c_i \mid t)\log P(c_i \mid t) - P(\bar{t})\sum_{i=1}^{n} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})$$

Finally, the information gain is

$$IG(T) = H(C) - H(C \mid T)$$

But the biggest problem with information gain is that it measures a feature's contribution to the classification system as a whole, not to any particular category. This makes it suitable only for so-called "global" feature selection (all classes share one feature set), not for "local" feature selection (each category gets its own feature set, since a word may be highly discriminative for one category yet insignificant for another).

Implementation method:

1. Count the number of documents in the positive and negative classes: N1 and N2.

2. For each word, count A (the number of positive documents containing it), B (the number of negative documents containing it), C (the number of positive documents not containing it, C = N1 - A), and D (the number of negative documents not containing it, D = N2 - B).

3. Compute the entropy of the category system, H(C).

4. Compute the information gain IG(T) of each word.
5. Sort the words by information gain in descending order and select the top K as features, where K is the feature dimension (a code sketch follows this list).
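A minimal Python sketch of steps 1-5, assuming the A/B/C/D counts above have already been gathered (the words and counts below are hypothetical):

```python
import math

def entropy(probs):
    """Entropy of a discrete distribution, with 0*log(0) taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(a, b, c, d):
    """IG of a word for a two-class problem, from the four document counts:
    a/b = positive/negative documents containing the word,
    c/d = positive/negative documents not containing it."""
    n = a + b + c + d
    h_c = entropy([(a + c) / n, (b + d) / n])  # entropy of the category system
    p_t = (a + b) / n                          # P(word appears)
    h_given_t = entropy([a / (a + b), b / (a + b)]) if a + b else 0.0
    h_given_not_t = entropy([c / (c + d), d / (c + d)]) if c + d else 0.0
    return h_c - (p_t * h_given_t + (1 - p_t) * h_given_not_t)

# Hypothetical counts (A, B, C, D) for two words, with N1 = N2 = 100:
counts = {"alpha": (90, 5, 10, 95), "beta": (50, 50, 50, 50)}
K = 1
ranked = sorted(counts, key=lambda w: information_gain(*counts[w]), reverse=True)
print(ranked[:K])  # ["alpha"]: concentrated in one class; "beta" has IG = 0
```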



4) The chi-square test (CHI).

The basic idea of the chi-square test is to judge whether a theory is correct by measuring the deviation between observed values and theoretical values. In practice, one first assumes that two variables are independent (the "null hypothesis"), and then measures the deviation between the observed values and the theoretical values (the values that would hold if the two really were independent). If the deviation is small enough, we attribute it to natural sampling error, imprecise measurement, or chance, conclude that the two are indeed independent, and accept the null hypothesis. If the deviation is so large that it is unlikely to have arisen from chance or measurement error, we conclude that the two are actually related, i.e., we reject the null hypothesis and accept the alternative hypothesis.

The theoretical value is $E$, the observed value is $x$, and the degree of deviation is computed as:

$$\chi^2 = \sum_{i=1}^{n} \frac{(x_i - E)^2}{E}$$

This formula is the deviation measure used in the chi-square test. Given the observed values of several samples, $x_1, x_2, \ldots, x_i, \ldots, x_n$, substituting them into the formula yields the chi-square value, which is compared with a preset threshold: if it exceeds the threshold (i.e., the deviation is large), the null hypothesis is considered false; otherwise the null hypothesis is considered to hold.
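As a quick worked example (a hypothetical coin-fairness check, not from the source): under the null hypothesis that a coin is fair, 100 flips should yield $E = 50$ heads and $50$ tails. Observing 60 heads and 40 tails gives

$$\chi^2 = \frac{(60-50)^2}{50} + \frac{(40-50)^2}{50} = 2 + 2 = 4,$$

which exceeds 3.84, the 0.05 critical value for one degree of freedom, so the null hypothesis of fairness would be rejected.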

In the feature selection stage of text classification, the null hypothesis is usually "word t is unrelated to category c." The larger the computed chi-square value, the greater the deviation from the null hypothesis, and the more we lean toward its negation, i.e., that t and c are related. The selection process computes the chi-square value of every word with respect to category c, sorts the words in descending order (larger chi-square means more relevant), and takes the top K.

The disadvantage of the chi-square test is that it only counts whether a document contains a word, regardless of how many times the word appears. This biases it toward low-frequency words (it exaggerates their role). It can even happen that a word appearing just once in every document of a class receives a higher chi-square value than a word appearing ten times per document in 99% of that class's documents, even though the latter is far more representative, simply because the latter appears in slightly fewer documents. This is the well-known "low-frequency word defect" of the chi-square test. For this reason, the chi-square test is often combined with other factors, such as term frequency, to compensate for this weakness.

Implementation method:

1. Count the total number of documents in the sample set: N.

2. For each word, count A (the number of positive documents containing it), B (the number of negative documents containing it), C (the number of positive documents not containing it), and D (the number of negative documents not containing it).

3. Compute the chi-square value of each word using the following formula:

$$\chi^2(t, c) = \frac{N\,(AD - BC)^2}{(A+B)(C+D)(A+C)(B+D)}$$

4. Sort the words by chi-square value in descending order and select the top K as features, where K is the feature dimension (see the sketch after this list).
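A minimal Python sketch of steps 1-4, again assuming the A/B/C/D counts are already available (the words and counts below are hypothetical):

```python
def chi_square(a, b, c, d):
    """Chi-square value of a word with respect to the positive class:
    a/b = positive/negative documents containing the word,
    c/d = positive/negative documents not containing it."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denominator if denominator else 0.0

# Hypothetical counts (A, B, C, D) for two words:
counts = {"alpha": (90, 5, 10, 95), "beta": (50, 50, 50, 50)}
K = 1
ranked = sorted(counts, key=lambda w: chi_square(*counts[w]), reverse=True)
print(ranked[:K])  # ["alpha"]; for "beta", AD - BC = 0, so chi-square = 0
```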







The text modeling series will be continuously updated...
