Machine Learning [2] Calculation of entropy and information gain in decision trees, and construction of decision tree ID3


Information entropy can be understood as the average amount of information you receive once you learn the outcome of an event. The more uncertain an event is, the more information is needed to determine its outcome; in other words, the larger the information entropy, the more disordered and uncertain the system is. Entropy is a measure of disorder and uncertainty.

Calculation of information entropy:

H(S) = -Σ p(i) * log2(p(i)), where p(i) is the probability of the i-th class and the logarithm is taken base 2.

public static double calcEntropy(int[] p) {
    // p[i] is the number of samples in class i; p[i] / sum is its probability
    double entropy = 0;
    double sum = 0;
    int len = p.length;
    for (int i = 0; i < len; i++) {
        sum += p[i];
    }
    for (int i = 0; i < len; i++) {
        if (p[i] == 0) continue;          // by convention, 0 * log2(0) = 0
        entropy -= p[i] / sum * log2(p[i] / sum);
    }
    return entropy;
}

// base-2 logarithm, since Java's Math class has no log2
private static double log2(double x) {
    return Math.log(x) / Math.log(2);
}
Given an array of per-class sample counts, the first loop computes the total number of samples, the probability of each class then follows as p[i] / sum, and the formula above gives the entropy. (A zero count is skipped, since 0 * log2(0) is taken to be 0.)
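As a quick check (my own addition, using the yes/no counts from the dataset below), the 9 positive and 5 negative samples of the full table reproduce the 0.940 that we will compute by hand later in this post:

// quick check: 9 yes and 5 no, as in the play dataset below
System.out.println(calcEntropy(new int[]{9, 5}));   // prints approximately 0.940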

Information gain is the change in information entropy. The attribute that decreases the entropy fastest can be used as the root node of the decision tree, which keeps the tree short.

The information gain of an attribute A with respect to a sample set S is:

Gain(S, A) = H(S) - Σ_v (|S_v| / |S|) * H(S_v)

where the sum runs over the values v of A, and S_v is the subset of S in which A takes the value v. The subtracted term is the weighted information entropy of S once the value of A is known.
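To make the formula concrete, here is a companion helper in the same style as calcEntropy above. It is a sketch of my own rather than code from the original post; it assumes partitions[v] holds the class counts of the samples taking the v-th value of A, and total holds the class counts of the whole set S.

// Sketch: information gain of one attribute, reusing calcEntropy from above.
// partitions[v] = class counts of the samples whose value of A is the v-th value
// total        = class counts of the whole sample set S
public static double calcGain(int[][] partitions, int[] total) {
    double sum = 0;                       // |S|, total number of samples
    for (int t : total) sum += t;
    double weighted = 0;                  // weighted entropy after the split
    for (int[] part : partitions) {
        double partSize = 0;              // |S_v|
        for (int c : part) partSize += c;
        weighted += partSize / sum * calcEntropy(part);
    }
    return calcEntropy(total) - weighted;
}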

Outlook Temperature Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No
Based on this data, we construct a decision tree that can then be used to decide whether to go out and play under given weather conditions.

First, calculate the information entropy without knowing any weather attribute, by looking at the Play column directly. There are 9 yes and 5 no, so applying the formula:

H = -(9/14) * log2(9/14) - (5/14) * log2(5/14) ≈ 0.940

Next, calculate the information entropy of each attribute in turn. Start with the outlook attribute and compute the entropy for each of its values.

1. When outlook = sunny, the Play column has 2 yes and 3 no, and the information entropy is

H = -(2/5) * log2(2/5) - (3/5) * log2(3/5) ≈ 0.971

2. When outlook = overcast, the Play column has 4 yes and 0 no, and the information entropy is

H = 0

3. When outlook = rainy, the Play column has 3 yes and 2 no, and the information entropy is

H ≈ 0.971

The probabilities of outlook being sunny, overcast, and rainy are 5/14, 4/14, and 5/14 respectively.

Therefore, once the value of outlook is known, the weighted information entropy is 5/14 * 0.971 + 4/14 * 0 + 5/14 * 0.971 ≈ 0.693.

The information gain of the outlook attribute is therefore Gain(S, outlook) = 0.940 - 0.693 = 0.247.
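Plugging the outlook counts into the calcGain sketch above confirms this number (an illustrative check, not part of the original post):

// outlook: sunny {2 yes, 3 no}, overcast {4 yes, 0 no}, rainy {3 yes, 2 no}
System.out.println(calcGain(new int[][]{{2, 3}, {4, 0}, {3, 2}}, new int[]{9, 5}));
// prints approximately 0.247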


Similarly, the information gains of temperature, humidity, and windy are calculated as 0.029, 0.152, and 0.048 respectively.

The attribute with the maximum information gain is outlook, so it becomes the root node of the decision tree. The yes/no counts for each attribute value are summarized in the following table.

Outlook         Temperature     Humidity       Windy         Play
         Yes No          Yes No         Yes No        Yes No  Yes No
Sunny     2   3  Hot      2   2  High    3   4  False  6   2   9   5
Overcast  4   0  Mild     4   2  Normal  6   1  True   3   3
Rainy     3   2  Cool     3   1
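Reading the counts off this table, a small driver fragment (again my own sketch, meant to live in a main method next to calcEntropy and calcGain) reproduces all four gains and picks the root:

// per-value {yes, no} counts for each attribute, taken from the table above
int[] total = {9, 5};
int[][][] attrs = {
    {{2, 3}, {4, 0}, {3, 2}},   // outlook: sunny, overcast, rainy
    {{2, 2}, {4, 2}, {3, 1}},   // temperature: hot, mild, cool
    {{3, 4}, {6, 1}},           // humidity: high, normal
    {{6, 2}, {3, 3}},           // windy: false, true
};
String[] names = {"outlook", "temperature", "humidity", "windy"};
int best = 0;
for (int i = 0; i < attrs.length; i++) {
    System.out.printf("%s: %.3f%n", names[i], calcGain(attrs[i], total));  // 0.247, 0.029, 0.152, 0.048
    if (calcGain(attrs[i], total) > calcGain(attrs[best], total)) best = i;
}
System.out.println("root = " + names[best]);   // prints: root = outlook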

All samples in the overcast branch are positive examples, so that branch becomes a leaf node with target class yes. The sunny and rainy branches still contain both classes, so ID3 splits them recursively on the remaining attributes in the same way.
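To round off the construction, the following is a compact, self-contained ID3 sketch over this dataset. It is my own illustration rather than the original author's code (the class name ID3Demo and the helpers split and build are made up for the example). Running it prints the familiar tree: outlook at the root, a humidity test under sunny, a windy test under rainy, and a yes leaf under overcast.

import java.util.*;

public class ID3Demo {
    // Entropy of the class column (the last element of each row).
    static double entropy(List<String[]> rows) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] r : rows) counts.merge(r[r.length - 1], 1, Integer::sum);
        double h = 0, n = rows.size();
        for (int c : counts.values()) h -= c / n * (Math.log(c / n) / Math.log(2));
        return h;
    }

    // Group the rows by their value in attribute column a.
    static Map<String, List<String[]>> split(List<String[]> rows, int a) {
        Map<String, List<String[]>> parts = new LinkedHashMap<>();
        for (String[] r : rows) parts.computeIfAbsent(r[a], k -> new ArrayList<>()).add(r);
        return parts;
    }

    // Information gain of splitting the rows on attribute column a.
    static double gain(List<String[]> rows, int a) {
        double weighted = 0, n = rows.size();
        for (List<String[]> p : split(rows, a).values())
            weighted += p.size() / n * entropy(p);
        return entropy(rows) - weighted;
    }

    // Recursive ID3; prints the tree with indentation. (A full implementation
    // would take a majority vote when attributes run out on an impure subset;
    // that case never occurs on this dataset.)
    static void build(List<String[]> rows, List<Integer> attrs, String[] names, String ind) {
        if (entropy(rows) == 0 || attrs.isEmpty()) {
            System.out.println(ind + "=> " + rows.get(0)[rows.get(0).length - 1]);
            return;
        }
        int best = attrs.get(0);
        for (int a : attrs) if (gain(rows, a) > gain(rows, best)) best = a;
        for (Map.Entry<String, List<String[]>> e : split(rows, best).entrySet()) {
            System.out.println(ind + names[best] + " = " + e.getKey());
            List<Integer> rest = new ArrayList<>(attrs);
            rest.remove(Integer.valueOf(best));
            build(e.getValue(), rest, names, ind + "  ");
        }
    }

    public static void main(String[] args) {
        String[] names = {"outlook", "temperature", "humidity", "windy"};
        String[][] data = {
            {"sunny", "hot", "high", "false", "no"},
            {"sunny", "hot", "high", "true", "no"},
            {"overcast", "hot", "high", "false", "yes"},
            {"rainy", "mild", "high", "false", "yes"},
            {"rainy", "cool", "normal", "false", "yes"},
            {"rainy", "cool", "normal", "true", "no"},
            {"overcast", "cool", "normal", "true", "yes"},
            {"sunny", "mild", "high", "false", "no"},
            {"sunny", "cool", "normal", "false", "yes"},
            {"rainy", "mild", "normal", "false", "yes"},
            {"sunny", "mild", "normal", "true", "yes"},
            {"overcast", "mild", "high", "true", "yes"},
            {"overcast", "hot", "normal", "false", "yes"},
            {"rainy", "mild", "high", "true", "no"},
        };
        build(new ArrayList<>(Arrays.asList(data)), Arrays.asList(0, 1, 2, 3), names, "");
    }
}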
