In text categorization, the statistics commonly used for feature selection include the following:
1. Term frequency (TF)
Principle: a low-frequency term usually has little effect on classification and can be removed. At the same time, a high frequency does not necessarily mean a large impact; for example, a high-frequency term that is distributed evenly across the corpus contributes little to distinguishing classes.
Application: mainly used in text indexing, to directly delete low-frequency features.
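A minimal sketch of TF-based pruning (the toy corpus and the MIN_TF threshold are assumptions for illustration):

```python
from collections import Counter

# Toy corpus: each document is already tokenized.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stocks fell as the market closed".split(),
]

# Collection-wide term frequency.
tf = Counter(token for doc in docs for token in doc)

# Drop features whose frequency falls below a threshold.
MIN_TF = 2  # hypothetical threshold, tuned per corpus in practice
vocabulary = {term for term, count in tf.items() if count >= MIN_TF}
print(sorted(vocabulary))  # ['cat', 'the']
```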
2. Document frequency (DF)
Principle: a rare term may be noise, but it may also be highly distinctive for a particular category.
Application: usually used together with TF.
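A similar sketch for DF, keeping terms that occur in a moderate number of documents (the toy corpus and the MIN_DF/MAX_DF thresholds are assumptions):

```python
from collections import Counter

# Toy corpus: each document reduced to its set of distinct terms.
docs = [
    {"the", "cat", "sat", "on", "mat"},
    {"the", "dog", "chased", "cat"},
    {"stocks", "fell", "as", "the", "market", "closed"},
]

# Document frequency: in how many documents each term occurs.
df = Counter(term for doc in docs for term in doc)

# Keep terms occurring in at least MIN_DF documents but not in every document.
MIN_DF, MAX_DF = 2, len(docs) - 1  # hypothetical thresholds
vocabulary = {t for t, n in df.items() if MIN_DF <= n <= MAX_DF}
print(sorted(vocabulary))  # ['cat']
```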
3. Information entropy (feature entropy)
Understanding the formula: a particular value x of a random variable carries an amount of information equal to log(1/p(x)); the information entropy is the average amount of information carried over all possible values. Here p_i is the probability that the feature w belongs to class c_i, i.e. p_i = P(c_i|w).
If p_i is the probability that event x_i occurs and p_i is close to 1, then x_i is close to common knowledge and has little predictive value, i.e. it carries little information; the information carried by an event should therefore be a monotonically decreasing function of p_i. Thus the information of event x_i is defined as -k·log(p_i) (with k a positive constant), and averaging over every possible random event gives the information entropy of the source: H = -k·Σ_i p_i·log(p_i).
If a coin toss is taken as the source, the probabilities of heads and tails are both 0.5; the more random a source is, the greater its information entropy.
In the examples above, the information entropy of the first case (p_i close to 1) is about 0.056k, and that of the fair coin is about 0.693k (using the natural logarithm).
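These two numbers can be checked with a short script; assuming the first case uses probabilities 0.99 and 0.01 (an assumption consistent with "p_i close to 1"), it reproduces 0.056k and 0.693k with k = 1:

```python
import math

def entropy(probs, k=1.0):
    """Information entropy H = -k * sum(p * ln p), in units of k nats."""
    return -k * sum(p * math.log(p) for p in probs if p > 0)

print(round(entropy([0.99, 0.01]), 3))  # 0.056: an almost certain outcome carries little information
print(round(entropy([0.5, 0.5]), 3))    # 0.693 (= ln 2): a fair coin is maximally uncertain for two outcomes
```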
Application: treat the feature t as an event and the class system C as the source, with each class c_i as a possible outcome. The conditional entropy of the system C when t occurs is the uncertainty of the system when t appears in a text, i.e. the remaining uncertainty about which class the text belongs to. Therefore, the smaller the feature entropy, the greater the effect of this feature on classification.
Formula: H(C|t) = -Σ_i P(c_i|t) · log P(c_i|t)
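A sketch of estimating H(C|t) from the class labels of the documents that contain t (the example term and labels are hypothetical):

```python
import math
from collections import Counter

def feature_entropy(labels_with_t):
    """H(C|t): class uncertainty given that feature t appears, estimated
    from the class labels of the documents containing t."""
    counts = Counter(labels_with_t)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())

# Hypothetical: class labels of the documents that contain the term "goal".
labels_with_goal = ["sports", "sports", "sports", "politics"]
print(round(feature_entropy(labels_with_goal), 3))  # 0.562: the lower the value, the more t tells us about the class
```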
4. Information gain (IG)
Principle: information gain measures how much the uncertainty of the classification system decreases between the feature being unknown and the feature being fixed. "Fixed" means that for every text we know whether the word appears or not. But which of the two situations, appearing or not appearing, is the fixed one? Both are possible, so the conditional entropy of each case must be averaged, weighted by the probability of that case.
Formula:
Entropy of the classification system before the feature t is fixed (whether t appears is unknown): H(C) = -Σ_i P(c_i) · log P(c_i)
Conditional entropy of the classification system once the feature t is fixed (we know whether t appears or not): H(C|T) = P(t) · H(C|t) + P(¬t) · H(C|¬t)
Therefore, the information gain formula is:
IG(t) = H(C) - H(C|T) = H(C) - P(t) · H(C|t) - P(¬t) · H(C|¬t)
The above formula is also equivalent to:
IG(t) = -Σ_i P(c_i) · log P(c_i) + P(t) · Σ_i P(c_i|t) · log P(c_i|t) + P(¬t) · Σ_i P(c_i|¬t) · log P(c_i|¬t)
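A sketch of IG(t) computed from a labeled toy corpus (the documents, labels, and the term "goal" are hypothetical):

```python
import math
from collections import Counter

def entropy(counts):
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values() if n)

def information_gain(term, docs, labels):
    """IG(t) = H(C) - P(t)*H(C|t) - P(not t)*H(C|not t)."""
    h_c = entropy(Counter(labels))
    with_t = [c for d, c in zip(docs, labels) if term in d]
    without_t = [c for d, c in zip(docs, labels) if term not in d]
    p_t = len(with_t) / len(docs)
    h_cond = 0.0
    if with_t:
        h_cond += p_t * entropy(Counter(with_t))
    if without_t:
        h_cond += (1 - p_t) * entropy(Counter(without_t))
    return h_c - h_cond

# Hypothetical toy corpus: documents as term sets, with parallel class labels.
docs = [{"goal", "match"}, {"goal", "team"}, {"election", "vote"}, {"vote", "law"}]
labels = ["sports", "sports", "politics", "politics"]
print(round(information_gain("goal", docs, labels), 3))  # 0.693 (= ln 2): "goal" removes all class uncertainty
```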
5. Mutual information (MI)
Principle: treat each category c_i of the system C as an event. When the occurrence of the feature depends strongly on a particular class, the mutual information is large; when the feature and the class are independent, the mutual information is 0; when the feature appears in the class less often than expected by chance, the mutual information is negative.
Formula: MI(t, c_i) = log( P(t, c_i) / (P(t) · P(c_i)) ) = log( P(t|c_i) / P(t) )
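A sketch that estimates MI(t, c_i) from document counts; the add-one smoothing and the toy corpus are assumptions for illustration:

```python
import math

def mutual_information(term, category, docs, labels):
    """Pointwise MI(t, c) = log(P(t|c) / P(t)), estimated from document counts.
    Add-one smoothing keeps log() defined when a count is zero on tiny data."""
    n = len(docs)
    n_c = sum(1 for c in labels if c == category)
    n_t = sum(1 for d in docs if term in d)
    n_tc = sum(1 for d, c in zip(docs, labels) if term in d and c == category)
    p_t = (n_t + 1) / (n + 2)
    p_t_given_c = (n_tc + 1) / (n_c + 2)
    return math.log(p_t_given_c / p_t)

docs = [{"goal", "match"}, {"goal", "team"}, {"election", "vote"}, {"vote", "law"}]
labels = ["sports", "sports", "politics", "politics"]
print(round(mutual_information("goal", "sports", docs, labels), 3))    # 0.405 > 0: positively associated
print(round(mutual_information("goal", "politics", docs, labels), 3))  # -0.693 < 0: rarer in the class than overall
```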
6. χ² statistic (chi-square, CHI)
Principle: measures the lack of independence between the feature t and a category c; the larger the value, the stronger the association. (Not elaborated further here, as it is fairly intuitive.)
Formula: χ²(t, c) = N · (A·D − B·C)² / ((A+C) · (B+D) · (A+B) · (C+D)), where A is the number of documents of class c containing t, B the number of documents of other classes containing t, C the number of documents of class c not containing t, D the number of documents of other classes not containing t, and N = A+B+C+D.
Application: compute a global CHI value for the feature t over all categories, and select the features with larger CHI values.
Global calculation, mode 1 (maximum over all categories): χ²_max(t) = max_i χ²(t, c_i)
Global calculation, mode 2 (average weighted by class priors): χ²_avg(t) = Σ_i P(c_i) · χ²(t, c_i)
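A sketch covering the per-class CHI value and both global modes (the toy corpus is hypothetical; all counts are at the document level):

```python
def chi_square(term, category, docs, labels):
    """CHI(t, c) = N*(A*D - B*C)^2 / ((A+C)*(B+D)*(A+B)*(C+D))."""
    A = sum(1 for d, c in zip(docs, labels) if term in d and c == category)      # in c, contains t
    B = sum(1 for d, c in zip(docs, labels) if term in d and c != category)      # not in c, contains t
    C = sum(1 for d, c in zip(docs, labels) if term not in d and c == category)  # in c, lacks t
    D = sum(1 for d, c in zip(docs, labels) if term not in d and c != category)  # not in c, lacks t
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - B * C) ** 2 / denom if denom else 0.0

def chi_square_global(term, docs, labels, mode="max"):
    """Mode 1: maximum over categories.  Mode 2 ('avg'): average weighted by P(c_i)."""
    scores = {c: chi_square(term, c, docs, labels) for c in set(labels)}
    if mode == "max":
        return max(scores.values())
    return sum((labels.count(c) / len(labels)) * s for c, s in scores.items())

# Hypothetical toy corpus.
docs = [{"goal", "match"}, {"goal", "team"}, {"election", "vote"}, {"vote", "law"}]
labels = ["sports", "sports", "politics", "politics"]
print(chi_square_global("goal", docs, labels, mode="max"))  # 4.0
print(chi_square_global("goal", docs, labels, mode="avg"))  # 4.0 (both classes give the same value on this corpus)
```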