Some basic knowledge of information theory

Source: Internet
Author: User
Tags: id3

Basic Concepts

First of all: in information theory, log by default means log base 2. Many books and materials jump straight to Shannon's self-information formula, shown as formula (1), without giving a basic reason why it takes that form. Here is a more intuitive example. Suppose one integer is drawn uniformly at random from 0 to 63 (64 integers in total), and we want to identify the selected number in binary; how many bits do we need at least? As for why bits: each bit is simply a yes/no binary decision, so no further justification is needed. The answer is 6, meaning we need 6 bits to identify each of the numbers.
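Working it out: with 64 equally likely outcomes, each number has probability 1/64, and

$$ \log_2 64 = -\log_2 \frac{1}{64} = 6, $$

which is exactly the form of formula (1) applied to p(x) = 1/64.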

Here are some common calculation formulas

Self-Information
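For an event x with probability p(x), this is the formula (1) mentioned above:

$$ I(x) = -\log_2 p(x) \qquad (1) $$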

Joint Self-Information
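For a pair of events x and y with joint probability p(x, y):

$$ I(x, y) = -\log_2 p(x, y) $$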

Conditional Self-Information
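The self-information of x once y is known:

$$ I(x \mid y) = -\log_2 p(x \mid y) $$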

Information entropy
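For a random variable X:

$$ H(X) = E[I(x)] = -\sum_x p(x) \log_2 p(x) $$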

Entropy is the expected self-information over all possible values (all categories). It is very similar to entropy in physics and chemistry, a concept many people find vague or hard to grasp. Entropy is an indicator of how disordered things are, and things always tend to evolve toward disorder, which is why entropy in an isolated system can only increase. There is no need to define it strictly by the assumptions of physics; we can simply say that the higher the entropy, the more disordered things are, and the lower the entropy, the more regular. It is like throwing a neat stack of paper into the air: the sheets will only become more and more disordered.

Conditional entropy
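The uncertainty remaining about X once Y is known:

$$ H(X \mid Y) = \sum_y p(y)\, H(X \mid Y = y) = -\sum_{x, y} p(x, y) \log_2 p(x \mid y) $$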

Joint entropy
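The uncertainty of the pair (X, Y) taken together:

$$ H(X, Y) = -\sum_{x, y} p(x, y) \log_2 p(x, y) $$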

According to the chain rule, we have
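$$ H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y) $$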

From this we can derive
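$$ H(X \mid Y) = H(X, Y) - H(Y), \qquad H(Y \mid X) = H(X, Y) - H(X) $$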

Information Gain

The original entropy of the system is H(X), and the entropy of the system once condition Y is known (the conditional entropy) is H(X|Y). The information gain is the difference between these two entropies, as in formula (7):
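$$ IG(Y) = H(X) - H(X \mid Y) \qquad (7) $$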

Entropy represents the uncertainty of the system, so the greater the information gain, the greater the contribution of condition Y to determining the system. To put it bluntly: the initial entropy is large, and whichever condition makes it drop the fastest, i.e., makes the system as regular as possible, is the best one; so for formula (7), the larger the difference, the better.

Application of information gain in feature selection

The information gain of a term w can be derived directly from formula (7): X in formula (7) represents the set of categories, and Y covers the two cases "w present" and "w absent".
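In its usual text-categorization form (writing w̄ for "w absent"):

$$ IG(w) = -\sum_i P(c_i)\log_2 P(c_i) + P(w)\sum_i P(c_i \mid w)\log_2 P(c_i \mid w) + P(\bar{w})\sum_i P(c_i \mid \bar{w})\log_2 P(c_i \mid \bar{w}) $$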

P(ci) is the probability that a document of category ci appears, P(w) is the proportion of documents in the whole training set that contain w, P(ci|w) is the proportion of documents containing w that belong to category ci, and P(ci|w̄) is the proportion of documents not containing w that belong to category ci.

The application of information gain in decision trees

The following example is worked through in detail, because information gain is the core idea of the ID3 algorithm.

Outlook Temperature Humidity Windy Play
Sunny Hot High FALSE No
Sunny Hot High TRUE No
Overcast Hot High FALSE Yes
Rainy Mild High FALSE Yes
Rainy Cool Normal FALSE Yes
Rainy Cool Normal TRUE No
Overcast Cool Normal TRUE Yes
Sunny Mild High FALSE No
Sunny Cool Normal FALSE Yes
Rainy Mild Normal FALSE Yes
Sunny Mild Normal TRUE Yes
Overcast Mild High TRUE Yes
Overcast Hot Normal FALSE Yes
Rainy Mild High TRUE No

In formula (7), X here stands for the two cases: playing and not playing.

Looking only at the last column: the probability of playing is 9/14, and the probability of not playing is 5/14. Therefore, in the absence of any other information, the entropy (uncertainty) of the system is
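$$ H(X) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.94 $$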

Outlook    yes  no   Temperature  yes  no   Humidity  yes  no   Windy  yes  no   Play  yes  no
sunny        2   3   hot            2   2   high        3   4   false    6   2           9   5
overcast     4   0   mild           4   2   normal      6   1   true     3   3
rainy        3   2   cool           3   1

If Outlook is selected as the root node of the decision tree, Y in formula (7) is the set {sunny, overcast, rainy}, and the conditional entropy at this point is
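$$ H(X \mid \text{Outlook}) = \frac{5}{14}\left(-\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5}\right) + \frac{4}{14}\cdot 0 + \frac{5}{14}\left(-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}\right) \approx 0.693 $$

(the overcast branch is pure, 4 yes / 0 no, so its entropy is 0).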

That is, when you select Outlook as the root node of the decision tree, the information gain is 0.94-0.693=0.247.

The same method is used to calculate the information gain when Temperature, Humidity, or Windy is chosen as the root node, and the attribute with the largest IG value is selected as the final root node; a short sketch of this procedure in code is shown below.
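As a minimal sketch (the variable names, data literals, and helper functions are illustrative, not taken from any particular library), entropy, conditional entropy, and information gain over the table above can be computed like this:

```python
from collections import Counter
from math import log2

# The weather table: (Outlook, Temperature, Humidity, Windy, Play)
data = [
    ("Sunny", "Hot", "High", False, "No"),
    ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Rainy", "Mild", "High", False, "Yes"),
    ("Rainy", "Cool", "Normal", False, "Yes"),
    ("Rainy", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),
    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),
    ("Rainy", "Mild", "High", True, "No"),
]
attributes = ["Outlook", "Temperature", "Humidity", "Windy"]

def entropy(labels):
    """H(X) = -sum_x p(x) log2 p(x) over the class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr_index):
    """IG = H(X) - H(X | attribute), i.e. formula (7)."""
    labels = [row[-1] for row in rows]
    n = len(rows)
    conditional = 0.0
    for value in set(row[attr_index] for row in rows):
        subset = [row[-1] for row in rows if row[attr_index] == value]
        conditional += (len(subset) / n) * entropy(subset)
    return entropy(labels) - conditional

gains = {a: information_gain(data, i) for i, a in enumerate(attributes)}
print(gains)                      # Outlook has the largest gain (about 0.247)
print(max(gains, key=gains.get))  # -> 'Outlook'
```

Running it picks Outlook as the root attribute with a gain of about 0.247, matching the hand calculation above.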

Mutual Information

The mutual information of y_j about x_i is defined as the logarithm of the ratio of the posterior probability to the prior probability:
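$$ I(x_i; y_j) = \log_2 \frac{p(x_i \mid y_j)}{p(x_i)} $$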

The greater the mutual information, the greater the contribution of y_j to determining the value of x_i.

Average mutual information of the system
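Averaging over all pairs of values:

$$ I(X; Y) = \sum_{i, j} p(x_i, y_j)\log_2 \frac{p(x_i \mid y_j)}{p(x_i)} = H(X) - H(X \mid Y) $$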

As can be seen, the average mutual information is exactly the information gain!

The application of mutual information in feature selection

The mutual information of term w with category c_i is
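$$ I(w; c_i) = \log_2 \frac{P(w \mid c_i)}{P(w)} $$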

P(w) represents the proportion of all documents that contain w, and P(w|c_i) represents the proportion of documents in category c_i that contain w.

For the whole system, the mutual information of term w is
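commonly taken as the category-weighted average (some variants instead take the maximum over categories):

$$ I(w) = \sum_i P(c_i)\log_2 \frac{P(w \mid c_i)}{P(w)} $$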

Finally, the k terms with the largest mutual information are selected as feature items.

One more note: it is actually rare to use the ID3 algorithm for classification these days; the C4.5 algorithm, an optimization built on ID3, is far more popular. C4.5 uses not information gain but the concept of information gain ratio. The information gain ratio and the C4.5 story will be written up later.
