When events A and B both occur, the mutual information between them is defined as:

    I(A, B) = log( P(A, B) / (P(A) P(B)) )

It measures the amount of information gained from the fact that event A is associated with event B.
In classification problems, we can use mutual information to measure the correlation between a feature and a particular category: the higher the mutual information, the stronger the correlation between the feature and the category, and vice versa.
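As a minimal sketch of this idea, the snippet below computes the mutual information between a word and a category from document counts, following the formula above. The corpus here is a made-up toy example, not the Sogou data.

```python
import math

# Hypothetical toy corpus: each document is (category, set of words).
docs = [
    ("sports",  {"game", "team", "win"}),
    ("sports",  {"team", "score"}),
    ("finance", {"stock", "market", "win"}),
    ("finance", {"stock", "fund"}),
]

def mutual_information(term, category, docs):
    """I(term, category) = log( P(term, category) / (P(term) * P(category)) ),
    estimated from document counts."""
    n = len(docs)
    n_t = sum(1 for _, words in docs if term in words)        # docs containing term
    n_c = sum(1 for cat, _ in docs if cat == category)        # docs in category
    n_tc = sum(1 for cat, words in docs                       # docs with both
               if cat == category and term in words)
    if n_tc == 0:
        return float("-inf")  # term never co-occurs with the category
    return math.log((n_tc / n) / ((n_t / n) * (n_c / n)))

print(mutual_information("stock", "finance", docs))  # positive: strongly associated
print(mutual_information("win", "finance", docs))    # 0.0: independent of the category
```

A word concentrated in one category scores high; a word spread evenly across categories scores near zero.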
Take the Sogou Lab corpus as an example. Select the finance, IT, sports, entertainment, and stock categories, and use mutual information to select the words that build the vector space model. Before selection, one preprocessing step is needed: remove words that appear in only one category and have a very low frequency, because such words are bound to have high mutual information with that particular category and low mutual information with all the others. The results are as follows:
From the table above, we can see that selecting words by mutual information alone is not ideal. Why? Careful analysis shows that low word frequency still has a considerable influence on mutual information: if a word is infrequent but appears mainly in one category, it receives a high mutual information score and ends up as noise among the selected features. To avoid this problem, first sort the words by frequency, then sort them by mutual information, and then pick the words you want; this handles the problem much better. The results are as follows:
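The two-stage procedure described above can be sketched as follows. All word statistics here are invented toy values, and the thresholds (`min_freq`, `freq_top`, `k`) are hypothetical parameters chosen for illustration: first drop rare single-category words, then keep the most frequent survivors, and only then rank by mutual information.

```python
# Hypothetical vocabulary statistics:
# word -> (total_frequency, mutual_information, number_of_categories_seen_in)
stats = {
    "stock":  (460,  2.1, 2),
    "team":   (300,  1.8, 1),
    "zxqwt":  (2,    5.9, 1),   # rare single-category word with inflated MI
    "market": (150,  1.5, 2),
    "the":    (9000, 0.1, 5),
}

def select_features(stats, min_freq=5, freq_top=4, k=3):
    """Two-stage selection: (1) drop words occurring in only one category
    with frequency below min_freq, (2) keep the freq_top most frequent
    survivors, (3) rank those by mutual information and return the top k."""
    survivors = [
        (word, freq, mi)
        for word, (freq, mi, n_cats) in stats.items()
        if not (n_cats == 1 and freq < min_freq)
    ]
    by_freq = sorted(survivors, key=lambda x: x[1], reverse=True)[:freq_top]
    by_mi = sorted(by_freq, key=lambda x: x[2], reverse=True)
    return [word for word, _, _ in by_mi[:k]]

print(select_features(stats))  # "zxqwt" is filtered out despite its high MI
```

Sorting by frequency first keeps the rare noise word out of the feature set even though its raw mutual information is the highest in the table.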