The C4.5 algorithm is a classification decision tree algorithm in machine learning whose core is ID3. C4.5 makes the following improvements over ID3:
1. It uses the information gain ratio instead of information gain, because information gain is biased toward attributes with many distinct values.
2. It prunes during tree construction.
3. It can handle continuous attributes by discretizing them.
4. It can handle incomplete (missing) data.
The C4.5 algorithm has the following advantage: the resulting classification rules are easy to understand and the accuracy is high.
Disadvantage: during tree construction the data set must be scanned and sorted repeatedly, which makes the algorithm inefficient.
Entropy:
The greater the uncertainty of a variable, the greater its entropy. Entropy quantifies information: the more uncertain the outcome, the higher the entropy. Therefore, when building a classification decision tree, the attribute whose split leaves the least entropy can be chosen as the splitting feature.
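As a quick illustration, entropy can be computed directly from a class-probability distribution. A minimal sketch (the function name `entropy` is my own, not from the text):

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping p == 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

# A fair coin is maximally uncertain: entropy is 1 bit.
print(entropy([0.5, 0.5]))              # 1.0
# A biased coin is more predictable, so its entropy is lower.
print(round(entropy([0.9, 0.1]), 3))    # 0.469
```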
Information gain:
As the name suggests, information gain is the difference between the information before and after a change. In decision tree classification, it is the difference in information before and after splitting on the selected attribute, which can be written as:
Gain() = InfoBeforeSplit() - InfoAfterSplit()
i.e. the entropy of the data set minus the weighted entropy after the split.
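This formula can be sketched in Python: `info_gain` groups the samples by one attribute's values and subtracts the weighted post-split entropy from the pre-split entropy (the helper names are mine, not from the text):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Class entropy of a list of labels, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Gain(A) = InfoBeforeSplit - InfoAfterSplit for attribute A,
    given one attribute value and one class label per sample."""
    n = len(labels)
    branches = {}
    for v, y in zip(values, labels):
        branches.setdefault(v, []).append(y)
    info_after = sum(len(b) / n * entropy(b) for b in branches.values())
    return entropy(labels) - info_after

# The Windy column and Play labels from the example table below.
windy = [False, True, False, False, False, True, True,
         False, False, False, True, True, False, True]
play = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]
print(round(info_gain(windy, play), 3))   # 0.048
```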
An example follows:
| Outlook | Temperature | Humidity | Windy | Play? |
| --- | --- | --- | --- | --- |
| Sunny | Hot | High | False | No |
| Sunny | Hot | High | True | No |
| Overcast | Hot | High | False | Yes |
| Rain | Mild | High | False | Yes |
| Rain | Cool | Normal | False | Yes |
| Rain | Cool | Normal | True | No |
| Overcast | Cool | Normal | True | Yes |
| Sunny | Mild | High | False | No |
| Sunny | Cool | Normal | False | Yes |
| Rain | Mild | Normal | False | Yes |
| Sunny | Mild | Normal | True | Yes |
| Overcast | Mild | High | True | Yes |
| Overcast | Hot | Normal | False | Yes |
| Rain | Mild | High | True | No |
The training set above has 4 attributes, namely the attribute set A = {outlook, temperature, humidity, windy}, and 2 class labels, namely the class label set C = {yes, no}, corresponding to suitable and not suitable for outdoor play. This is in fact a binary classification problem.
We now calculate the information gain, listed directly as follows.
Data set D contains 14 training samples, of which 9 belong to class "Yes" and 5 to class "No". Its information entropy is:
Info(D) = -9/14 * log2(9/14) - 5/14 * log2(5/14) = 0.940
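This arithmetic can be checked directly (a sketch; `info_d` is my own variable name):

```python
from math import log2

# Entropy of D: 9 "Yes" and 5 "No" out of 14 samples.
info_d = -9 / 14 * log2(9 / 14) - 5 / 14 * log2(5 / 14)
print(round(info_d, 3))   # 0.940
```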
The weighted information entropy after splitting on each attribute in the attribute set is computed as follows:
Info(OUTLOOK) = 5/14 * [- 2/5 * log2(2/5) - 3/5 * log2(3/5)] + 4/14 * [- 4/4 * log2(4/4) - 0/4 * log2(0/4)] + 5/14 * [- 3/5 * log2(3/5) - 2/5 * log2(2/5)] = 0.694
Info(TEMPERATURE) = 4/14 * [- 2/4 * log2(2/4) - 2/4 * log2(2/4)] + 6/14 * [- 4/6 * log2(4/6) - 2/6 * log2(2/6)] + 4/14 * [- 3/4 * log2(3/4) - 1/4 * log2(1/4)] = 0.911
Info(HUMIDITY) = 7/14 * [- 3/7 * log2(3/7) - 4/7 * log2(4/7)] + 7/14 * [- 6/7 * log2(6/7) - 1/7 * log2(1/7)] = 0.789
Info(WINDY) = 6/14 * [- 3/6 * log2(3/6) - 3/6 * log2(3/6)] + 8/14 * [- 6/8 * log2(6/8) - 2/8 * log2(2/8)] = 0.892
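Each of these four Info values is a weighted average of per-branch class entropies. A sketch that reproduces them from the (yes, no) counts per attribute value (function names are mine; results match the figures above up to rounding in the last digit):

```python
from math import log2

def entropy(counts):
    """Class entropy from raw counts, in bits; zero counts contribute 0."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def weighted_info(branches):
    """Info_A(D): weighted average class entropy over the branches
    produced by splitting on A; each branch is its (yes, no) counts."""
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * entropy(b) for b in branches)

# (yes, no) counts per attribute value, read off the table above.
print(round(weighted_info([(2, 3), (4, 0), (3, 2)]), 3))  # Outlook: 0.694
print(round(weighted_info([(2, 2), (4, 2), (3, 1)]), 3))  # Temperature: 0.911
print(round(weighted_info([(3, 4), (6, 1)]), 3))          # Humidity: 0.788
print(round(weighted_info([(3, 3), (6, 2)]), 3))          # Windy: 0.892
```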
Based on the above data, we can calculate the information gain for each candidate root attribute:
Gain(OUTLOOK) = Info(D) - Info(OUTLOOK) = 0.940 - 0.694 = 0.246
Gain(TEMPERATURE) = Info(D) - Info(TEMPERATURE) = 0.940 - 0.911 = 0.029
Gain(HUMIDITY) = Info(D) - Info(HUMIDITY) = 0.940 - 0.789 = 0.151
Gain(WINDY) = Info(D) - Info(WINDY) = 0.940 - 0.892 = 0.048
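The subtraction can be verified in a few lines (the dictionaries are my own shorthand; the Info values are the ones computed above):

```python
from math import log2

info_d = -9 / 14 * log2(9 / 14) - 5 / 14 * log2(5 / 14)  # 0.940

# Weighted post-split entropies from the previous step.
info = {"outlook": 0.694, "temperature": 0.911,
        "humidity": 0.789, "windy": 0.892}

# Gain(A) = Info(D) - Info(A).
gain = {a: round(info_d - v, 3) for a, v in info.items()}
print(gain)
# {'outlook': 0.246, 'temperature': 0.029, 'humidity': 0.151, 'windy': 0.048}
```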
Next, we calculate the split information metric H(A).
The attribute Outlook has 3 values: sunny with 5 samples, rain with 5 samples, and overcast with 4 samples, so:
H(OUTLOOK) = - 5/14 * log2(5/14) - 5/14 * log2(5/14) - 4/14 * log2(4/14) = 1.5774
The attribute Temperature has 3 values: hot with 4 samples, mild with 6 samples, and cool with 4 samples, so:
H(TEMPERATURE) = - 4/14 * log2(4/14) - 6/14 * log2(6/14) - 4/14 * log2(4/14) = 1.5567
The attribute Humidity has 2 values: normal with 7 samples and high with 7 samples, so:
H(HUMIDITY) = - 7/14 * log2(7/14) - 7/14 * log2(7/14) = 1.0
The attribute Windy has 2 values: true with 6 samples and false with 8 samples, so:
H(WINDY) = - 6/14 * log2(6/14) - 8/14 * log2(8/14) = 0.9852
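Unlike Info_A(D), the split information H(A) looks only at how the attribute's own values are distributed and ignores the class labels. A sketch (the function name is mine):

```python
from math import log2

def split_info(counts):
    """H(A): entropy of the attribute's own value distribution,
    computed from the number of samples per attribute value."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts)

print(round(split_info([5, 5, 4]), 4))  # Outlook: 1.5774
print(round(split_info([4, 6, 4]), 4))  # Temperature: 1.5567
print(round(split_info([7, 7]), 4))     # Humidity: 1.0
print(round(split_info([6, 8]), 4))     # Windy: 0.9852
```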
Based on the results above, we can calculate the information gain ratio for each attribute:
IGR(OUTLOOK) = Gain(OUTLOOK) / H(OUTLOOK) = 0.246 / 1.5774 = 0.156
IGR(TEMPERATURE) = Gain(TEMPERATURE) / H(TEMPERATURE) = 0.029 / 1.5567 = 0.019
IGR(HUMIDITY) = Gain(HUMIDITY) / H(HUMIDITY) = 0.151 / 1.0 = 0.151
IGR(WINDY) = Gain(WINDY) / H(WINDY) = 0.048 / 0.9852 = 0.049
Outlook has the largest gain ratio, so it is selected as the splitting attribute at the root of the tree.
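Putting the pieces together, the attribute with the largest gain ratio is chosen for the split; a sketch using the rounded figures above:

```python
# Information gains and split informations computed above.
gain = {"outlook": 0.246, "temperature": 0.029,
        "humidity": 0.151, "windy": 0.048}
split = {"outlook": 1.5774, "temperature": 1.5567,
         "humidity": 1.0, "windy": 0.9852}

# IGR(A) = Gain(A) / H(A); the attribute maximizing it becomes the root.
igr = {a: gain[a] / split[a] for a in gain}
best = max(igr, key=igr.get)
print(best)   # outlook
```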