The C4.5 algorithm is a classification decision tree algorithm in machine learning whose core is ID3. C4.5 makes the following improvements over ID3:
1. It uses the information gain ratio instead of information gain, because information gain is biased toward attributes with many distinct values.
2. It prunes during tree construction.
3. It can handle continuous attributes by discretizing them.
4. It can handle incomplete (missing) data.
The C4.5 algorithm has the following advantage: the resulting classification rules are easy to understand and the accuracy is high.
Disadvantage: during tree construction the data set must be scanned and sorted repeatedly, which makes the algorithm inefficient.
Entropy:
The greater the uncertainty of a variable, the greater its entropy. Entropy quantifies information: the more uncertain the outcome, the higher the entropy. Therefore, when building a classification decision tree, the attribute whose split leaves the least entropy can be chosen as the splitting feature.
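As a quick illustration, entropy can be computed directly from a class-probability distribution. A minimal sketch (the function name `entropy` is my own, not from the text):

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping p == 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

# A fair coin is maximally uncertain: entropy is 1 bit.
print(entropy([0.5, 0.5]))              # 1.0
# A biased coin is more predictable, so its entropy is lower.
print(round(entropy([0.9, 0.1]), 3))    # 0.469
```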
Information gain:
As the name suggests, information gain is the difference between the information before and after a change. In decision tree classification, it is the difference in information before and after splitting on the selected attribute, which can be written as:
Gain() = InfoBeforeSplit() - InfoAfterSplit()
i.e. the entropy of the data set minus the weighted entropy after the split.
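This formula can be sketched in Python: `info_gain` groups the samples by one attribute's values and subtracts the weighted post-split entropy from the pre-split entropy (the helper names are mine, not from the text):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Class entropy of a list of labels, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Gain(A) = InfoBeforeSplit - InfoAfterSplit for attribute A,
    given one attribute value and one class label per sample."""
    n = len(labels)
    branches = {}
    for v, y in zip(values, labels):
        branches.setdefault(v, []).append(y)
    info_after = sum(len(b) / n * entropy(b) for b in branches.values())
    return entropy(labels) - info_after

# The Windy column and Play labels from the example table below.
windy = [False, True, False, False, False, True, True,
         False, False, False, True, True, False, True]
play = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]
print(round(info_gain(windy, play), 3))   # 0.048
```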
An example follows:
| Outlook | Temperature | Humidity | Windy | Play? |
| --- | --- | --- | --- | --- |
| Sunny | Hot | High | False | No |
| Sunny | Hot | High | True | No |
| Overcast | Hot | High | False | Yes |
| Rain | Mild | High | False | Yes |
| Rain | Cool | Normal | False | Yes |
| Rain | Cool | Normal | True | No |
| Overcast | Cool | Normal | True | Yes |
| Sunny | Mild | High | False | No |
| Sunny | Cool | Normal | False | Yes |
| Rain | Mild | Normal | False | Yes |
| Sunny | Mild | Normal | True | Yes |
| Overcast | Mild | High | True | Yes |
| Overcast | Hot | Normal | False | Yes |
| Rain | Mild | High | True | No |
The training set above has 4 attributes, namely the attribute set A = {outlook, temperature, humidity, windy}, and 2 class labels, namely the class label set C = {yes, no}, corresponding to suitable and not suitable for outdoor play. This is in fact a binary classification problem.
We now calculate the information gain, listed directly as follows.
Data set D contains 14 training samples, of which 9 belong to class "Yes" and 5 to class "No". Its information entropy is:
Info(D) = -9/14 * log2(9/14) - 5/14 * log2(5/14) = 0.940
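This arithmetic can be checked directly (a sketch; `info_d` is my own variable name):

```python
from math import log2

# Entropy of D: 9 "Yes" and 5 "No" out of 14 samples.
info_d = -9 / 14 * log2(9 / 14) - 5 / 14 * log2(5 / 14)
print(round(info_d, 3))   # 0.940
```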
The weighted information entropy after splitting on each attribute in the attribute set is computed as follows:
Info(OUTLOOK) = 5/14 * [- 2/5 * log2(2/5) - 3/5 * log2(3/5)] + 4/14 * [- 4/4 * log2(4/4) - 0/4 * log2(0/4)] + 5/14 * [- 3/5 * log2(3/5) - 2/5 * log2(2/5)] = 0.694
Info(TEMPERATURE) = 4/14 * [- 2/4 * log2(2/4) - 2/4 * log2(2/4)] + 6/14 * [- 4/6 * log2(4/6) - 2/6 * log2(2/6)] + 4/14 * [- 3/4 * log2(3/4) - 1/4 * log2(1/4)] = 0.911
Info(HUMIDITY) = 7/14 * [- 3/7 * log2(3/7) - 4/7 * log2(4/7)] + 7/14 * [- 6/7 * log2(6/7) - 1/7 * log2(1/7)] = 0.789
Info(WINDY) = 6/14 * [- 3/6 * log2(3/6) - 3/6 * log2(3/6)] + 8/14 * [- 6/8 * log2(6/8) - 2/8 * log2(2/8)] = 0.892
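Each of these four Info values is a weighted average of per-branch class entropies. A sketch that reproduces them from the (yes, no) counts per attribute value (function names are mine; results match the figures above up to rounding in the last digit):

```python
from math import log2

def entropy(counts):
    """Class entropy from raw counts, in bits; zero counts contribute 0."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def weighted_info(branches):
    """Info_A(D): weighted average class entropy over the branches
    produced by splitting on A; each branch is its (yes, no) counts."""
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * entropy(b) for b in branches)

# (yes, no) counts per attribute value, read off the table above.
print(round(weighted_info([(2, 3), (4, 0), (3, 2)]), 3))  # Outlook: 0.694
print(round(weighted_info([(2, 2), (4, 2), (3, 1)]), 3))  # Temperature: 0.911
print(round(weighted_info([(3, 4), (6, 1)]), 3))          # Humidity: 0.788
print(round(weighted_info([(3, 3), (6, 2)]), 3))          # Windy: 0.892
```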
Based on the above data, we can calculate the information gain for each candidate root attribute:
Gain(OUTLOOK) = Info(D) - Info(OUTLOOK) = 0.940 - 0.694 = 0.246
Gain(TEMPERATURE) = Info(D) - Info(TEMPERATURE) = 0.940 - 0.911 = 0.029
Gain(HUMIDITY) = Info(D) - Info(HUMIDITY) = 0.940 - 0.789 = 0.151
Gain(WINDY) = Info(D) - Info(WINDY) = 0.940 - 0.892 = 0.048
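The subtraction can be verified in a few lines (the dictionaries are my own shorthand; the Info values are the ones computed above):

```python
from math import log2

info_d = -9 / 14 * log2(9 / 14) - 5 / 14 * log2(5 / 14)  # 0.940

# Weighted post-split entropies from the previous step.
info = {"outlook": 0.694, "temperature": 0.911,
        "humidity": 0.789, "windy": 0.892}

# Gain(A) = Info(D) - Info(A).
gain = {a: round(info_d - v, 3) for a, v in info.items()}
print(gain)
# {'outlook': 0.246, 'temperature': 0.029, 'humidity': 0.151, 'windy': 0.048}
```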
Next, we calculate the split information metric H(A).
The attribute Outlook has 3 values: sunny with 5 samples, rain with 5 samples, and overcast with 4 samples, so:
H(OUTLOOK) = - 5/14 * log2(5/14) - 5/14 * log2(5/14) - 4/14 * log2(4/14) = 1.5774
The attribute Temperature has 3 values: hot with 4 samples, mild with 6 samples, and cool with 4 samples, so:
H(TEMPERATURE) = - 4/14 * log2(4/14) - 6/14 * log2(6/14) - 4/14 * log2(4/14) = 1.5567
The attribute Humidity has 2 values: normal with 7 samples and high with 7 samples, so:
H(HUMIDITY) = - 7/14 * log2(7/14) - 7/14 * log2(7/14) = 1.0
The attribute Windy has 2 values: true with 6 samples and false with 8 samples, so:
H(WINDY) = - 6/14 * log2(6/14) - 8/14 * log2(8/14) = 0.9852
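Unlike Info_A(D), the split information H(A) looks only at how the attribute's own values are distributed and ignores the class labels. A sketch (the function name is mine):

```python
from math import log2

def split_info(counts):
    """H(A): entropy of the attribute's own value distribution,
    computed from the number of samples per attribute value."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts)

print(round(split_info([5, 5, 4]), 4))  # Outlook: 1.5774
print(round(split_info([4, 6, 4]), 4))  # Temperature: 1.5567
print(round(split_info([7, 7]), 4))     # Humidity: 1.0
print(round(split_info([6, 8]), 4))     # Windy: 0.9852
```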
Based on the results above, we can calculate the information gain ratio for each attribute:
IGR(OUTLOOK) = Gain(OUTLOOK) / H(OUTLOOK) = 0.246 / 1.5774 = 0.156
IGR(TEMPERATURE) = Gain(TEMPERATURE) / H(TEMPERATURE) = 0.029 / 1.5567 = 0.019
IGR(HUMIDITY) = Gain(HUMIDITY) / H(HUMIDITY) = 0.151 / 1.0 = 0.151
IGR(WINDY) = Gain(WINDY) / H(WINDY) = 0.048 / 0.9852 = 0.049
Outlook has the largest gain ratio, so it is selected as the splitting attribute at the root of the tree.
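Putting the pieces together, the attribute with the largest gain ratio is chosen for the split; a sketch using the rounded figures above:

```python
# Information gains and split informations computed above.
gain = {"outlook": 0.246, "temperature": 0.029,
        "humidity": 0.151, "windy": 0.048}
split = {"outlook": 1.5774, "temperature": 1.5567,
         "humidity": 1.0, "windy": 0.9852}

# IGR(A) = Gain(A) / H(A); the attribute maximizing it becomes the root.
igr = {a: gain[a] / split[a] for a in gain}
best = max(igr, key=igr.get)
print(best)   # outlook
```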