Text classification: Feature selection statistics


In text categorization, the statistics commonly used for feature selection include the following:

1. Term frequency (Term Frequency, TF)

Principle: terms with very low frequency usually have little effect on classification and can be eliminated. At the same time, high frequency does not necessarily mean large impact; for example, a high-frequency term distributed uniformly across the texts of all classes contributes little to distinguishing between them.

Application: mainly used in text indexing, where some low-frequency features are deleted directly.
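As a minimal sketch (the function name, whitespace tokenization, and the min_tf threshold are illustrative assumptions, not from the original article), TF-based pruning could look like this in Python:

    # Minimal sketch of TF-based pruning; min_tf is an assumed threshold.
    from collections import Counter

    def prune_by_tf(documents, min_tf=2):
        """Count each term across the corpus and keep only the frequent ones."""
        tf = Counter(term for doc in documents for term in doc.split())
        return {term: count for term, count in tf.items() if count >= min_tf}

    docs = ["the cat sat", "the dog sat", "a cat ran"]
    print(prune_by_tf(docs))  # {'the': 2, 'cat': 2, 'sat': 2}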

2. Document frequency (Document Frequency, DF)

Principle: a rare term may be noise, but it may also have a distinctive effect on some particular category.

Application: mostly used together with TF.
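To make the distinction from TF concrete, here is a minimal sketch (names and data are illustrative) that counts document frequency, i.e., the number of documents a term appears in:

    # Minimal sketch of document frequency (DF). Using set() means a term is
    # counted at most once per document, unlike TF.
    from collections import Counter

    def document_frequency(documents):
        return Counter(term for doc in documents for term in set(doc.split()))

    docs = ["the cat sat", "the the dog", "a cat ran"]
    print(document_frequency(docs)["the"])  # 2 documents, although TF("the") == 3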

3. Information entropy (feature entropy)

Understanding the equation: the amount of information provided by a random variable taking the value x is log(1/p(x)); information entropy is then the average of these amounts over all possible values. In the classification setting, p_i is the probability that the feature w belongs to class c_i, i.e., P(c_i|w).

If p_i denotes the probability that event x_i occurs, and p_i is close to 1, then x_i is essentially common knowledge, so it has little value for prediction; that is, it carries little information. The amount of information should therefore be a decreasing function of p_i. So the information content of event x_i is I(x_i) = -k \log p_i (for a positive constant k), and averaging this quantity over every possible event of the source gives the information entropy of the source:

    H = -k \sum_i p_i \log p_i

If a coin toss is taken as the source, and heads and tails each have probability 0.5, the source is as random as possible, that is, its information entropy is greatest.

In the example above, the entropy in the first case (a heavily biased coin, corresponding to probabilities of roughly 0.99 and 0.01) is about 0.056k, while in the second case (the fair coin) it is ln 2 ≈ 0.693k (using the natural logarithm).
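These numbers can be checked directly (assuming the biased-coin probabilities 0.99 and 0.01, which reproduce the 0.056 figure; k = 1):

    # Verify the coin-toss entropies above, using the natural logarithm.
    import math

    def entropy(probs):
        return -sum(p * math.log(p) for p in probs if p > 0)

    print(round(entropy([0.99, 0.01]), 3))  # 0.056 (heavily biased coin)
    print(round(entropy([0.5, 0.5]), 3))    # 0.693 (fair coin, ln 2)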

Application: treat the feature t as an event and the class system C as the source, with each class c_i a possible outcome. The conditional entropy of the system when t occurs, H(C|t), is the uncertainty that remains when t appears in a text, i.e., how uncertain the text's category still is. The smaller this feature entropy, the greater the effect of the feature on classification.

Formula:

    H(C|t) = -\sum_i P(c_i|t) \log P(c_i|t)
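A minimal sketch of this feature entropy, computed from per-class counts of documents containing t (the class labels and counts are made-up illustrations):

    # Feature entropy H(C|t): entropy of the class distribution among the
    # documents that contain feature t. Counts are illustrative.
    import math

    def feature_entropy(class_counts_given_t):
        total = sum(class_counts_given_t.values())
        return -sum((n / total) * math.log(n / total)
                    for n in class_counts_given_t.values() if n > 0)

    # A feature concentrated in one class has low entropy, hence is more useful:
    print(feature_entropy({"sports": 98, "politics": 2}))   # ~0.098
    print(feature_entropy({"sports": 50, "politics": 50}))  # ~0.693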

4. Information gain (Information Gain, IG)

Principle: information gain measures the change in the uncertainty of the classification system between the feature's status being unknown and known. For any particular text, the status of a word t is in fact fixed: either t appears in it or it does not. But which of the two situations holds? Since we do not know in advance, the uncertainty must be averaged over both cases, weighted by their probabilities.

Formula:

Entropy of the classification system before the status of feature t is known:

    H(C) = -\sum_i P(c_i) \log P(c_i)

Conditional entropy of the system once the status of t is fixed (averaged over t appearing and t not appearing, written \bar{t}):

    H(C|T) = P(t) H(C|t) + P(\bar{t}) H(C|\bar{t})

Therefore, the information gain formula is:

    IG(t) = H(C) - H(C|T)

The above formula is also equivalent to:

    IG(t) = -\sum_i P(c_i) \log P(c_i) + P(t) \sum_i P(c_i|t) \log P(c_i|t) + P(\bar{t}) \sum_i P(c_i|\bar{t}) \log P(c_i|\bar{t})
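Putting the pieces together, here is a minimal sketch of information gain for one binary feature over a set of classes (all counts are illustrative assumptions, not data from the article):

    # IG(t) = H(C) - H(C|T), natural logarithm.
    import math

    def entropy(probs):
        return -sum(p * math.log(p) for p in probs if p > 0)

    def information_gain(n_t_c, n_nt_c):
        """n_t_c[i]: docs of class i containing t; n_nt_c[i]: docs of class i without t."""
        n_t, n_nt = sum(n_t_c), sum(n_nt_c)
        n = n_t + n_nt
        h_c = entropy([(a + b) / n for a, b in zip(n_t_c, n_nt_c)])   # H(C)
        h_c_t = (n_t / n) * entropy([a / n_t for a in n_t_c]) \
              + (n_nt / n) * entropy([b / n_nt for b in n_nt_c])      # H(C|T)
        return h_c - h_c_t

    # Two classes; t appears mostly in class 0, so the gain is clearly positive.
    print(information_gain(n_t_c=[40, 5], n_nt_c=[10, 45]))  # ~0.275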

5. Mutual information (Mutual Information, MI)

Principle: treat each category c_i of the system C as an event. When the appearance of the feature depends strongly on one particular class, the mutual information is large; when the feature and the class are independent, the mutual information is 0; when the feature seldom appears in the category, the mutual information is negative.

Formula:

    MI(t, c_i) = \log \frac{P(t, c_i)}{P(t)\, P(c_i)} = \log \frac{P(t|c_i)}{P(t)}
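A minimal sketch, estimating the probabilities from document counts (all counts are illustrative assumptions):

    # Pointwise mutual information between feature t and class c_i.
    import math

    def mutual_information(n_t_and_c, n_t, n_c, n):
        """n_t_and_c: docs of class c containing t; n_t: docs containing t;
           n_c: docs of class c; n: total number of docs."""
        return math.log((n_t_and_c / n) / ((n_t / n) * (n_c / n)))

    print(mutual_information(40, 45, 50, 100))  # > 0: t is tied to class c
    print(mutual_information(1, 45, 50, 100))   # < 0: t seldom appears in c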

6. χ² statistic (Chi-square, CHI)

Principle: not elaborated here, as the statistic is fairly intuitive: it measures how far the observed co-occurrence counts of feature t and class c deviate from what their independence would predict.

Formula (where A = #(t, c), B = #(t, \bar{c}), C = #(\bar{t}, c), D = #(\bar{t}, \bar{c}), and N = A + B + C + D):

    \chi^2(t, c) = \frac{N (AD - BC)^2}{(A+B)(C+D)(A+C)(B+D)}

Application: compute a global CHI value for the feature t over all classes, and select the features with the larger CHI values.

Global calculation, mode 1 (maximum over all classes):

    \chi^2_{max}(t) = \max_i \chi^2(t, c_i)

Global calculation, mode 2 (average weighted by the class priors):

    \chi^2_{avg}(t) = \sum_i P(c_i)\, \chi^2(t, c_i)
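A minimal sketch covering the per-class statistic and both global modes (the class labels, counts, and priors are illustrative assumptions):

    # Chi-square statistic from the 2x2 contingency counts defined above.
    def chi_square(a, b, c, d):
        """a = #(t, c), b = #(t, not c), c = #(not t, c), d = #(not t, not c)"""
        n = a + b + c + d
        return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

    # Per-class contingency counts (a, b, c, d) for one feature t:
    tables = {"sports": (40, 5, 10, 45), "politics": (5, 40, 45, 10)}
    priors = {"sports": 0.5, "politics": 0.5}

    chi = {ci: chi_square(*tbl) for ci, tbl in tables.items()}
    chi_max = max(chi.values())                             # global mode 1
    chi_avg = sum(priors[ci] * chi[ci] for ci in chi)       # global mode 2
    print(chi_max, chi_avg)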

