The C4.5 data mining algorithm


C4.5 is a decision-tree classification algorithm in machine learning. Its core is the ID3 algorithm, which C4.5 improves on in the following ways:

1. It uses the information gain ratio instead of raw information gain to select splitting attributes. Information gain is biased toward attributes with many distinct values; the gain ratio corrects this bias.

2. It prunes the tree during construction.

3. It can discretize continuous attributes, so numeric features can be split on a threshold (see the sketch after this list).

4. It can handle incomplete data (missing attribute values).
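
For point 3, here is a minimal sketch (not Quinlan's exact procedure) of how a C4.5-style binary split on a continuous attribute can be found: sort the values, try the midpoint between each pair of adjacent distinct values as a threshold, and keep the threshold with the lowest weighted entropy. The numeric temperatures are hypothetical stand-ins for the categorical Temperature column in the table below.

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Shannon entropy (in bits) of a list of class labels."""
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    def best_threshold(values, labels):
        """Return (threshold, weighted_entropy) of the best binary split."""
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        best_t, best_info = None, float("inf")
        for i in range(1, n):
            if pairs[i - 1][0] == pairs[i][0]:
                continue  # equal values leave no boundary to split on
            t = (pairs[i - 1][0] + pairs[i][0]) / 2
            left = [lab for _, lab in pairs[:i]]
            right = [lab for _, lab in pairs[i:]]
            info = len(left) / n * entropy(left) + len(right) / n * entropy(right)
            if info < best_info:
                best_t, best_info = t, info
        return best_t, best_info

    # Hypothetical numeric temperatures, paired with the Play labels
    # from the 14-row weather table below.
    temps = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]
    play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
            "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
    print(best_threshold(temps, play))  # prints the chosen threshold and its entropy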

The C4.5 algorithm has the following advantages: the classification rules it produces are easy to understand, and its accuracy is high.

Disadvantage: while constructing the tree, the data set must be sorted and scanned repeatedly, which makes the algorithm inefficient.

Entropy:

The greater the uncertainty of a variable, the greater its entropy; entropy quantifies information content. In a classification decision tree, we therefore prefer the attribute whose split leaves the branches with the lowest (weighted) entropy, i.e., the purest class distributions.
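
As a quick illustration, a few lines of Python (a hypothetical helper, not from any particular library) compute the entropy of a label distribution:

    from collections import Counter
    from math import log2

    def entropy(labels):
        """H = -sum(p * log2(p)) over the empirical class distribution."""
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    # A 50/50 split is maximally uncertain (1 bit); a pure set carries 0 bits.
    print(entropy(["Yes", "No"]))    # 1.0
    print(entropy(["Yes", "Yes"]))   # -0.0, i.e. zero bits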

Information gain:

As the name suggests, it is the difference between the information before and after a split. In decision-tree learning, it is the reduction in entropy achieved by partitioning the data on a chosen attribute, which can be written as:

Gain(A) = Info(D) - Info(A)

that is, the entropy of the data set before the split minus the weighted entropy after the split.
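
In code, the same definition might look as follows (a sketch reusing the entropy helper above; the function names are my own):

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    def info_gain(attr_values, labels):
        """Gain(A) = Info(D) - Info(A): entropy before the split minus
        the size-weighted entropy of each branch induced by attribute A."""
        n = len(labels)
        info_after = 0.0
        for v in set(attr_values):
            branch = [lab for a, lab in zip(attr_values, labels) if a == v]
            info_after += len(branch) / n * entropy(branch)
        return entropy(labels) - info_after

    # A perfectly informative attribute recovers the full label entropy.
    print(info_gain(["a", "a", "b", "b"], ["Yes", "Yes", "No", "No"]))  # 1.0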

A worked example follows.

Outlook   Temperature  Humidity  Windy  Play?
Sunny     Hot          High      False  No
Sunny     Hot          High      True   No
Overcast  Hot          High      False  Yes
Rain      Mild         High      False  Yes
Rain      Cool         Normal    False  Yes
Rain      Cool         Normal    True   No
Overcast  Cool         Normal    True   Yes
Sunny     Mild         High      False  No
Sunny     Cool         Normal    False  Yes
Rain      Mild         Normal    False  Yes
Sunny     Mild         Normal    True   Yes
Overcast  Mild         High      True   Yes
Overcast  Hot          Normal    False  Yes
Rain      Mild         High      True   No

The training set above has 4 attributes, i.e. the attribute set A = {Outlook, Temperature, Humidity, Windy}, and 2 class labels, i.e. the class set C = {Yes, No}, meaning suitable and not suitable for outdoor play respectively. This is therefore a binary classification problem.
We now calculate the information gain, listed directly here. Data set D contains 14 training samples, of which 9 belong to class "Yes" and 5 to class "No". The entropy of D is:

Info(D) = -9/14 * log2(9/14) - 5/14 * log2(5/14) = 0.940
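
This value is easy to verify in Python:

    from math import log2

    # Entropy of the full data set: 9 "Yes" and 5 "No" out of 14 samples.
    print(round(-9/14 * log2(9/14) - 5/14 * log2(5/14), 3))  # 0.94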

The conditional entropy (the weighted entropy after splitting) is then computed for each attribute in the attribute set, with the convention 0 * log2(0) = 0:

Info(OUTLOOK) = 5/14 * [-2/5 * log2(2/5) - 3/5 * log2(3/5)] + 4/14 * [-4/4 * log2(4/4) - 0/4 * log2(0/4)] + 5/14 * [-3/5 * log2(3/5) - 2/5 * log2(2/5)] = 0.694
Info(TEMPERATURE) = 4/14 * [-2/4 * log2(2/4) - 2/4 * log2(2/4)] + 6/14 * [-4/6 * log2(4/6) - 2/6 * log2(2/6)] + 4/14 * [-3/4 * log2(3/4) - 1/4 * log2(1/4)] = 0.911
Info(HUMIDITY) = 7/14 * [-3/7 * log2(3/7) - 4/7 * log2(4/7)] + 7/14 * [-6/7 * log2(6/7) - 1/7 * log2(1/7)] = 0.789
Info(WINDY) = 6/14 * [-3/6 * log2(3/6) - 3/6 * log2(3/6)] + 8/14 * [-6/8 * log2(6/8) - 2/8 * log2(2/8)] = 0.892
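
To reproduce these four values, here is a sketch that recomputes the conditional entropies directly from the table (the tuple encoding of the rows is my own):

    from collections import Counter
    from math import log2

    # The 14 rows of the table, as (outlook, temperature, humidity, windy, play).
    rows = [
        ("Sunny", "Hot", "High", False, "No"),
        ("Sunny", "Hot", "High", True, "No"),
        ("Overcast", "Hot", "High", False, "Yes"),
        ("Rain", "Mild", "High", False, "Yes"),
        ("Rain", "Cool", "Normal", False, "Yes"),
        ("Rain", "Cool", "Normal", True, "No"),
        ("Overcast", "Cool", "Normal", True, "Yes"),
        ("Sunny", "Mild", "High", False, "No"),
        ("Sunny", "Cool", "Normal", False, "Yes"),
        ("Rain", "Mild", "Normal", False, "Yes"),
        ("Sunny", "Mild", "Normal", True, "Yes"),
        ("Overcast", "Mild", "High", True, "Yes"),
        ("Overcast", "Hot", "Normal", False, "Yes"),
        ("Rain", "Mild", "High", True, "No"),
    ]

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    def cond_info(col):
        """Info(A): size-weighted entropy of the class labels per branch."""
        n = len(rows)
        total = 0.0
        for v in set(r[col] for r in rows):
            branch = [r[-1] for r in rows if r[col] == v]
            total += len(branch) / n * entropy(branch)
        return total

    for name, col in [("OUTLOOK", 0), ("TEMPERATURE", 1),
                      ("HUMIDITY", 2), ("WINDY", 3)]:
        # Matches the values above; HUMIDITY comes out as 0.788 before
        # the article's rounding to 0.789.
        print(name, round(cond_info(col), 3))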

Based on these values, we can calculate the information gain for each candidate root attribute:

Gain(OUTLOOK) = Info(D) - Info(OUTLOOK) = 0.940 - 0.694 = 0.246
Gain(TEMPERATURE) = Info(D) - Info(TEMPERATURE) = 0.940 - 0.911 = 0.029
Gain(HUMIDITY) = Info(D) - Info(HUMIDITY) = 0.940 - 0.789 = 0.151
Gain(WINDY) = Info(D) - Info(WINDY) = 0.940 - 0.892 = 0.048
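
The same subtraction in code, using the rounded values above:

    # Gain(A) = Info(D) - Info(A) for each candidate attribute.
    info_d = 0.940
    for name, info_a in [("OUTLOOK", 0.694), ("TEMPERATURE", 0.911),
                         ("HUMIDITY", 0.789), ("WINDY", 0.892)]:
        print(f"Gain({name}) = {info_d - info_a:.3f}")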

Next, we calculate the split information H(V) (the intrinsic value of an attribute: the entropy of its own value distribution):

    • Outlook attribute

The attribute Outlook has 3 values: Sunny with 5 samples, Rain with 5 samples, and Overcast with 4 samples, so:

H(OUTLOOK) = -5/14 * log2(5/14) - 5/14 * log2(5/14) - 4/14 * log2(4/14) = 1.577

    • Temperature attribute

The attribute Temperature has 3 values: Hot with 4 samples, Mild with 6 samples, and Cool with 4 samples, so:

H(TEMPERATURE) = -4/14 * log2(4/14) - 6/14 * log2(6/14) - 4/14 * log2(4/14) = 1.557

    • Humidity attribute

The attribute Humidity has 2 values: Normal with 7 samples and High with 7 samples, so:

H(HUMIDITY) = -7/14 * log2(7/14) - 7/14 * log2(7/14) = 1.0

    • Windy attribute

The attribute Windy has 2 values: True with 6 samples and False with 8 samples, so:

H(WINDY) = -6/14 * log2(6/14) - 8/14 * log2(8/14) = 0.985
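
The split information is simply the entropy of the attribute's value distribution, computed without looking at the class labels; a small sketch:

    from collections import Counter
    from math import log2

    def split_info(attr_values):
        """H(V): entropy of the attribute's own value distribution; it grows
        when an attribute scatters the samples across many or uneven branches."""
        n = len(attr_values)
        return -sum(c / n * log2(c / n) for c in Counter(attr_values).values())

    # Windy has 6 True and 8 False samples.
    print(round(split_info([True] * 6 + [False] * 8), 3))  # 0.985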

Based on the results above, we can calculate the information gain ratio of each attribute:

IGR(OUTLOOK) = Gain(OUTLOOK) / H(OUTLOOK) = 0.246 / 1.577 = 0.156
IGR(TEMPERATURE) = Gain(TEMPERATURE) / H(TEMPERATURE) = 0.029 / 1.557 = 0.019
IGR(HUMIDITY) = Gain(HUMIDITY) / H(HUMIDITY) = 0.151 / 1.0 = 0.151
IGR(WINDY) = Gain(WINDY) / H(WINDY) = 0.048 / 0.985 = 0.049

Outlook has the largest gain ratio, so C4.5 selects it as the root node and then recurses on each of its branches.
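
Putting the pieces together, here is a minimal end-to-end sketch of the attribute-selection step by gain ratio (my own encoding of the table as per-column lists; real C4.5 also handles continuous attributes, missing values, and pruning, which are omitted here):

    from collections import Counter
    from math import log2

    def entropy(values):
        n = len(values)
        return -sum(c / n * log2(c / n) for c in Counter(values).values())

    def gain_ratio(attr, labels):
        """IGR(A) = Gain(A) / H(A), as derived above."""
        n = len(labels)
        info_after = 0.0
        for v in set(attr):
            branch = [lab for a, lab in zip(attr, labels) if a == v]
            info_after += len(branch) / n * entropy(branch)
        gain = entropy(labels) - info_after
        return gain / entropy(attr)  # H(A) is the entropy of A's own values

    outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
               "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]
    temperature = ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                   "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"]
    humidity = ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                "High", "Normal", "Normal", "Normal", "High", "Normal", "High"]
    windy = [False, True, False, False, False, True, True,
             False, False, False, True, True, False, True]
    play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
            "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

    candidates = {"Outlook": outlook, "Temperature": temperature,
                  "Humidity": humidity, "Windy": windy}
    for name, attr in candidates.items():
        print(name, round(gain_ratio(attr, play), 3))
    print("root:", max(candidates, key=lambda k: gain_ratio(candidates[k], play)))
    # root: Outlook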

