Information entropy measures the average amount of information you gain once the outcome of an event becomes known. The greater the uncertainty of an event, the more information is needed to determine its outcome; in other words, the larger the information entropy, the more disordered and uncertain the data.
Calculation of information entropy:
H = -Σ p(i) * log2(p(i)), where p(i) is the probability of class i and the logarithm is taken base 2.
public static double calcEntropy(int[] p) {
    double entropy = 0;
    // sum holds the total number of samples; p[i] / sum is the probability of class i
    double sum = 0;
    int len = p.length;
    for (int i = 0; i < len; i++) {
        sum += p[i];
    }
    for (int i = 0; i < len; i++) {
        if (p[i] == 0) {
            continue; // treat 0 * log2(0) as 0 so empty classes do not produce NaN
        }
        entropy -= p[i] / sum * log2(p[i] / sum);
    }
    return entropy;
}

// base-2 logarithm used by the formula above
public static double log2(double x) {
    return Math.log(x) / Math.log(2);
}
Given an array of class counts, the first loop sums up the total number of samples, which gives the probability of each class; the second loop then applies the formula to compute the entropy.
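As a quick sanity check (illustrative, not part of the original code; it assumes the snippet lives in the same class as calcEntropy and log2), a 50/50 split gives the maximum entropy of 1 bit and a pure class gives 0:

public static void main(String[] args) {
    System.out.println(calcEntropy(new int[]{1, 1})); // 1.0: a 50/50 split has maximum uncertainty
    System.out.println(calcEntropy(new int[]{4, 0})); // 0.0: a pure class has no uncertainty
}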
Information gain is the reduction in information entropy. The attribute that makes the entropy drop fastest can be used as the root node of the decision tree, which keeps the tree short.
The information gain of an attribute A with respect to the sample set S is:
Gain(S, A) = H(S) - Σ (|Sv| / |S|) * H(Sv), summed over every value v of attribute A; that is, the entropy of S minus the weighted entropy of S once the value of A is known.
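The following is a minimal sketch of this formula, not taken from the original post: a hypothetical calcGain method that reuses calcEntropy above. valueCounts[v] is assumed to hold the class counts (yes/no) of the subset of S in which A takes its v-th value, and totalCounts holds the class counts of S itself.

// Sketch: Gain(S, A) = H(S) - sum over values v of (|Sv| / |S|) * H(Sv)
public static double calcGain(int[] totalCounts, int[][] valueCounts) {
    double total = 0;
    for (int c : totalCounts) {
        total += c; // |S|, the total number of samples
    }
    double conditional = 0; // weighted entropy of S once the value of A is known
    for (int[] counts : valueCounts) {
        double subsetSize = 0;
        for (int c : counts) {
            subsetSize += c; // |Sv|
        }
        conditional += subsetSize / total * calcEntropy(counts);
    }
    return calcEntropy(totalCounts) - conditional;
}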
Outlook | Temperature | Humidity | Windy | Play
Sunny | Hot | High | False | No
Sunny | Hot | High | True | No
Overcast | Hot | High | False | Yes
Rainy | Mild | High | False | Yes
Rainy | Cool | Normal | False | Yes
Rainy | Cool | Normal | True | No
Overcast | Cool | Normal | True | Yes
Sunny | Mild | High | False | No
Sunny | Cool | Normal | False | Yes
Rainy | Mild | Normal | False | Yes
Sunny | Mild | Normal | True | Yes
Overcast | Mild | High | True | Yes
Overcast | Hot | Normal | False | Yes
Rainy | Mild | High | True | No
Based on this data, a decision tree is constructed so that, in the future, we can decide whether to go out to play under different weather conditions.
First, compute the information entropy before any weather condition is known, by looking directly at the Play column: 9 samples are yes and 5 are no, so the formula gives
H = -9/14 * log2(9/14) - 5/14 * log2(5/14) = 0.940
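As an illustrative check (assuming the yes/no counts are passed to the calcEntropy method above):

double h = calcEntropy(new int[]{9, 5}); // ≈ 0.940, the entropy of the Play column before any split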
Next, compute the conditional information entropy of each attribute in turn. Start with the Outlook attribute and calculate the entropy when the value of Outlook is known:
1. When Outlook = Sunny, the Play column has 2 yes and 3 no, so the entropy is
H = -2/5 * log2(2/5) - 3/5 * log2(3/5) = 0.971
2. When Outlook = Overcast, the Play column has 4 yes and 0 no, so the entropy is
H = 0
3. When Outlook = Rainy, the Play column has 3 yes and 2 no, so the entropy is
H = 0.971
The probabilities of Outlook being Sunny, Overcast, and Rainy are 5/14, 4/14, and 5/14 respectively.
Therefore, when the value of Outlook is known, the weighted information entropy is 5/14 * 0.971 + 4/14 * 0 + 5/14 * 0.971 = 0.693.
The information gain of the Outlook attribute is therefore Gain = 0.940 - 0.693 = 0.247.
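The same figure can be reproduced with the hypothetical calcGain sketch given earlier, using the yes/no counts derived above:

int[] play = {9, 5};                        // yes / no over the whole data set
int[][] outlook = {{2, 3}, {4, 0}, {3, 2}}; // Sunny, Overcast, Rainy
double gain = calcGain(play, outlook);      // ≈ 0.247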
Similarly, the information gains of Temperature, Humidity, and Windy work out to 0.029, 0.152, and 0.048 respectively.
Outlook has the largest information gain, so it becomes the root node of the decision tree.
Summary of yes / no counts for each attribute value:
Outlook (Yes / No) | Temperature (Yes / No) | Humidity (Yes / No) | Windy (Yes / No) | Play (Yes / No)
Sunny: 2 / 3 | Hot: 2 / 2 | High: 3 / 4 | False: 6 / 2 | 9 / 5
Overcast: 4 / 0 | Mild: 4 / 2 | Normal: 6 / 1 | True: 3 / 3 |
Rainy: 3 / 2 | Cool: 3 / 1 | | |
All samples in the Overcast branch are positive examples, so that branch becomes a leaf node with the target category yes.
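To make the overall ID3 step concrete, here is a brief sketch (not from the original post) of how the split attribute would be chosen in code: compute the gain of every attribute with the hypothetical calcGain above and keep the largest one.

// Sketch: pick the attribute with the largest information gain as the split node.
// attributeCounts.get(name)[v] is assumed to hold the class counts for the v-th value of that attribute.
public static String chooseBestAttribute(int[] totalCounts,
                                         java.util.Map<String, int[][]> attributeCounts) {
    String best = null;
    double bestGain = Double.NEGATIVE_INFINITY;
    for (java.util.Map.Entry<String, int[][]> e : attributeCounts.entrySet()) {
        double gain = calcGain(totalCounts, e.getValue());
        if (gain > bestGain) {
            bestGain = gain;
            best = e.getKey();
        }
    }
    return best; // for the weather data this would be "Outlook" (gain 0.247)
}

ID3 then recurses on each branch with the remaining attributes, stopping when all samples in a branch share one class (a leaf, like the Overcast branch here) or no attributes are left.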