The purpose of feature selection is to pick out the feature items (terms) that are most helpful for classification. Since a computer cannot work with words directly, the words must be quantified, and there are several methods for deciding which terms are most helpful.
In general, keeping around 3000 features already gives good overall performance; adding more keeps increasing the storage cost while the results show no obvious improvement.
Information gain: it measures the importance of a feature item according to how much information the term t_i provides for the classification as a whole, and on that basis decides whether to keep or drop the feature.
The information gain of a feature t_i is the difference between the amount of information available for classification with the feature and without it, where the amount of information is measured by entropy.
Entropy can be regarded as a measure of the uncertainty of a random variable: the larger the entropy, the greater the uncertainty, and the harder it is to correctly guess the variable's value.
『I've always felt that entropy is a great invention. We had no way to measure the amount of information, and entropy solved that problem completely. Hats off to Shannon.』
Specifically, in text categorization: we have a term t_i and want to compute its information gain to decide whether it is useful for classification. First look at the entropy of the documents without any feature, that is, how much information we have when there is no feature to help us classify. Then look at how much information we have after the feature is taken into account. Clearly, the difference between the two is the information this feature brings us. At this point a question may arise: the information before is small and the information afterwards is large, so wouldn't the subtraction give a negative number?
No. What we use here is entropy, the degree of disorder or uncertainty, so computing "how much information" really means computing "how much uncertainty". Before the feature is considered, the uncertainty is large, meaning there is little useful information to help us classify; after the new feature is considered, the uncertainty is smaller, meaning there is more information. The difference between the two, entropy before minus conditional entropy after, is exactly the information this feature brings us.
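Written out as a formula (the standard information-gain formula for a term t, matching the reasoning above; \bar{t} denotes the term's absence and c_1, ..., c_n are the classes):

IG(t) = H(C) - \big[ P(t)\,H(C \mid t) + P(\bar{t})\,H(C \mid \bar{t}) \big]
      = -\sum_{i=1}^{n} P(c_i)\log_2 P(c_i)
        + P(t)\sum_{i=1}^{n} P(c_i \mid t)\log_2 P(c_i \mid t)
        + P(\bar{t})\sum_{i=1}^{n} P(c_i \mid \bar{t})\log_2 P(c_i \mid \bar{t})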
Reference: "Statistical natural language Processing" Zongchengqing
The biggest problem with information gain is that it only measures a feature's contribution to the system as a whole rather than to a specific category. This makes it suitable only for so-called "global" feature selection (all classes share the same set of features), not for "local" feature selection (each category has its own feature set), because a word that discriminates one category very well may be insignificant for another.
Reference: Http://baike.baidu.com/view/1231985.htm?fromTaglist
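To make the computation and the "global" selection described above concrete, here is a minimal Python sketch (my own illustration, not taken from the references; the toy corpus and the top_k value are made up) that scores every term by information gain and keeps one shared top-k feature set for all classes:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(term, docs, labels):
    """IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)]; docs are token sets."""
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = 0.0
    if with_t:
        conditional += len(with_t) / n * entropy(with_t)
    if without_t:
        conditional += len(without_t) / n * entropy(without_t)
    return entropy(labels) - conditional

# Hypothetical toy corpus: each document is a set of terms plus a class label.
docs = [{"ball", "team", "score"}, {"team", "win"},
        {"stock", "market"}, {"market", "price", "win"}]
labels = ["sports", "sports", "finance", "finance"]

vocab = sorted(set().union(*docs))
top_k = 3  # the notes above suggest roughly 3000 features on a real corpus
ranked = sorted(vocab, key=lambda t: information_gain(t, docs, labels), reverse=True)
print(ranked[:top_k])  # one shared ("global") feature set used for every class
```

Terms that split the classes cleanly score highest, and the same ranked list is reused for every category, which is exactly the "global" behaviour, and the limitation, noted above.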