Data Mining, Classifier Notes: Mutual Information for Feature Selection

Source: Internet
Author: User

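When event a and event b occur together, their mutual information is defined as follows:

I(a, b) = log( P(a, b) / (P(a) P(b)) ) = log( P(a | b) / P(a) )

It measures the amount of information that the occurrence of event b provides about event a, that is, how strongly the two events are associated.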

In classification problems, we can use mutual information to measure the association between a feature and a specific category. The larger the mutual information, the stronger the association between the feature and the category; the smaller it is, the weaker the association.
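As a rough illustration of this measure, the sketch below estimates the mutual information between the event "a document contains the word" and the event "a document belongs to the category" from document counts, using log( P(t, c) / (P(t) P(c)) ). The function name, the base-2 logarithm, and the counts are assumptions made for the example; they do not come from the original article.

```python
import math

def pointwise_mutual_information(n_both, n_term, n_category, n_total):
    """Mutual information between a word and a category, estimated from counts.

    n_both:     documents in the category that contain the word
    n_term:     documents in any category that contain the word
    n_category: documents in the category
    n_total:    documents in the whole corpus
    """
    if n_both == 0:
        # The word never occurs in this category.
        return float("-inf")
    # log2( P(t, c) / (P(t) * P(c)) ) with probabilities replaced by count ratios.
    return math.log2((n_both * n_total) / (n_term * n_category))

# Hypothetical counts: 1000 documents, 200 of them in the category;
# the word occurs in 50 documents, 40 of which are in the category.
score = pointwise_mutual_information(40, 50, 200, 1000)
print(score)
```

A score above zero means the word occurs in the category more often than independence would predict, a score of zero means no association, and a negative score means the word is under-represented in the category.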

Take the Sogou Lab corpus as an example. Select five categories (finance, IT products, sports, entertainment, and stocks) and use mutual information to choose the words that form the vector space model. Before selecting, remove words that appear in only one category and have a very low frequency: such words are bound to have high mutual information with that one category and low mutual information with all the others, so they contribute little. The results are as follows:
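The pre-filtering step could look something like the sketch below. It assumes the corpus has already been reduced to per-category occurrence counts for each word; the data layout, the function name, and the threshold min_count=5 are all hypothetical, since the article does not state what "very low frequency" means.

```python
def prefilter_vocabulary(counts, min_count=5):
    """Drop words that occur in only one category and fewer than min_count times.

    counts maps word -> {category: number of occurrences}.
    min_count is an assumed threshold, not a value from the article.
    """
    kept = {}
    for word, per_category in counts.items():
        categories_seen = [c for c, n in per_category.items() if n > 0]
        total = sum(per_category.values())
        if len(categories_seen) == 1 and total < min_count:
            # Rare and confined to a single category: skip it.
            continue
        kept[word] = per_category
    return kept

# Made-up counts for illustration only.
counts = {
    "stock":   {"finance": 120, "stocks": 300, "sports": 2},
    "goal":    {"sports": 80, "entertainment": 3},
    "zzxq":    {"finance": 2},   # rare and single-category: removed
    "listing": {"stocks": 40},   # single-category but frequent: kept
}
filtered = prefilter_vocabulary(counts)
print(sorted(filtered))
```

Only words that are both rare and confined to one category are dropped; a frequent word that happens to be specific to one category is exactly the kind of feature worth keeping.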

From the table above, we can see that selecting words by mutual information alone is far from ideal. Why? On closer analysis, low word frequency still has a strong influence on mutual information. A word that occurs rarely but mostly within one category receives a high mutual information score, which lets noise through the filter. To avoid this problem, first sort the words by frequency, then sort them by mutual information, and only then select the words you want. The results are as follows:
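One possible reading of this two-stage procedure is sketched below: keep the most frequent words first, then rank the survivors by mutual information with the target category. The cut-offs top_by_freq and top_by_mi, the use of document counts, and the sample numbers are assumptions for illustration; the article does not say how many words are kept at each stage.

```python
import math

def select_features(doc_counts, category_docs, total_docs, top_by_freq, top_by_mi):
    """Two-stage selection: frequency first, then mutual information.

    doc_counts maps word -> (documents containing the word,
                             documents in the category containing the word).
    """
    # Stage 1: keep only the most frequent words.
    frequent = sorted(doc_counts, key=lambda w: doc_counts[w][0], reverse=True)
    frequent = frequent[:top_by_freq]

    # Stage 2: rank the remaining words by mutual information with the category.
    def mutual_information(word):
        n_term, n_both = doc_counts[word]
        if n_both == 0:
            return float("-inf")
        return math.log2((n_both * total_docs) / (n_term * category_docs))

    return sorted(frequent, key=mutual_information, reverse=True)[:top_by_mi]

# Made-up counts for a "sports" category with 200 of 1000 documents.
doc_counts = {
    "match":   (150, 120),
    "team":    (200, 140),
    "price":   (180, 10),
    "offside": (2, 2),   # rare but confined to the category: highest raw score
}
selected = select_features(doc_counts, category_docs=200, total_docs=1000,
                           top_by_freq=3, top_by_mi=2)
print(selected)
```

In this toy data, "offside" would rank first on mutual information alone even though it appears in only two documents; the frequency stage removes it before the ranking takes place.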

 
