C4.5 is another decision tree classification algorithm in machine learning. It is an important algorithm built on top of ID3, and it improves on ID3 in the following key ways:
1) It uses the information gain ratio to select splitting attributes. ID3 selects attributes by the information gain of the resulting subtrees; there are many ways to quantify information, and ID3 uses entropy (a measure of impurity), i.e. the change in entropy, whereas C4.5 uses the information gain ratio.
2) It prunes during decision tree construction. Nodes covering very few samples can make the constructed tree overfit the training data, and the tree is often better if these nodes are pruned away.
3) It can handle continuous (non-discrete) attributes.
4) It can handle incomplete data.
How is the information gain ratio calculated?
If you are familiar with the ID3 algorithm, you already know how to calculate the information gain; the formula (from Wikipedia) is Gain(a) = Info(D) - Info_a(D), the entropy of D minus the conditional entropy of D given attribute a.
Or, expressed step by step, which is more intuitive and easier to follow:
- Partition the training dataset D by class label and compute its information entropy: Info(D) = -Σ_i p_i * log2(p_i), where p_i is the proportion of samples with class label i.
- For each attribute a in the attribute set A, partition D by the values of a and compute the conditional entropy: Info_a(D) = Σ_v (|D_v| / |D|) * Info(D_v), where D_v is the subset of D in which attribute a takes value v.
- Calculate the information gain: the former minus the latter, Gain(a) = Info(D) - Info_a(D). Doing this for every attribute in A yields a set of information gains.
- Calculate the information gain ratio
The formula for the information gain ratio is as follows (from Wikipedia): IGR(a) = IG(a) / IV(a).
Here IG is the information gain, calculated according to the process described above, and IV is what we still need to compute: the split information (also called intrinsic value), which measures how broadly and uniformly the attribute splits the data. Its formula (from Wikipedia) is:
IV(a) = -Σ_v (|D_v| / |D|) * log2(|D_v| / |D|)
where v ranges over all values taken by attribute a in the attribute set A.
Example analysis
We illustrate how the C4.5 algorithm calculates the information gain ratio and chooses a decision node using a classic, much-quoted training dataset D.
The training set has 4 attributes, the attribute set A = {outlook, temperature, humidity, windy}, and 2 class labels, the class label set C = {yes, no}, meaning suitable or unsuitable for outdoor sports. This is a binary classification problem.
We have already calculated the information gains (in the ID3 discussion); they are listed here directly.
Dataset D contains 14 training samples, of which 9 belong to class "yes" and 5 to class "no". Its information entropy is:
Info(D) = -9/14 * log2(9/14) - 5/14 * log2(5/14) = 0.940
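This calculation can be reproduced with a short Python sketch (the `entropy` helper name is ours, not part of the original text):

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a class-count distribution."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

info_d = entropy([9, 5])  # 9 "yes" samples, 5 "no" samples
print(round(info_d, 3))   # 0.94
```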
Next, the conditional information entropy is computed separately for each attribute in the attribute set, as follows:
Info(OUTLOOK) = 5/14 * [- 2/5 * log2(2/5) - 3/5 * log2(3/5)] + 4/14 * [- 4/4 * log2(4/4) - 0/4 * log2(0/4)] + 5/14 * [- 3/5 * log2(3/5) - 2/5 * log2(2/5)] = 0.694
Info(TEMPERATURE) = 4/14 * [- 2/4 * log2(2/4) - 2/4 * log2(2/4)] + 6/14 * [- 4/6 * log2(4/6) - 2/6 * log2(2/6)] + 4/14 * [- 3/4 * log2(3/4) - 1/4 * log2(1/4)] = 0.911
Info(HUMIDITY) = 7/14 * [- 3/7 * log2(3/7) - 4/7 * log2(4/7)] + 7/14 * [- 6/7 * log2(6/7) - 1/7 * log2(1/7)] = 0.789
Info(WINDY) = 6/14 * [- 3/6 * log2(3/6) - 3/6 * log2(3/6)] + 8/14 * [- 6/8 * log2(6/8) - 2/8 * log2(2/8)] = 0.892
Based on the above, we can calculate the information gain each attribute would give if chosen as the root node:
Gain(OUTLOOK) = Info(D) - Info(OUTLOOK) = 0.940 - 0.694 = 0.246
Gain(TEMPERATURE) = Info(D) - Info(TEMPERATURE) = 0.940 - 0.911 = 0.029
Gain(HUMIDITY) = Info(D) - Info(HUMIDITY) = 0.940 - 0.789 = 0.151
Gain(WINDY) = Info(D) - Info(WINDY) = 0.940 - 0.892 = 0.048
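The conditional entropies and gains above can be reproduced in one sketch. The helper names are ours; the (yes, no) counts per attribute value are read off the worked figures above, and tiny differences against the hand-rounded values are expected:

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a class-count distribution."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def cond_entropy(partitions, n):
    """Weighted average entropy of the subsets an attribute splits D into."""
    return sum(sum(p) / n * entropy(p) for p in partitions)

n = 14
info_d = entropy([9, 5])
# (yes, no) counts within each attribute value, from the worked example
splits = {
    "outlook":     [(2, 3), (4, 0), (3, 2)],  # sunny, overcast, rainy
    "temperature": [(2, 2), (4, 2), (3, 1)],  # hot, mild, cool
    "humidity":    [(3, 4), (6, 1)],          # high, normal
    "windy":       [(3, 3), (6, 2)],          # true, false
}
gains = {a: info_d - cond_entropy(p, n) for a, p in splits.items()}
print(gains)
```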
Next, we calculate the split information H for each attribute.
Attribute outlook has 3 values: sunny with 5 samples, rainy with 5 samples, and overcast with 4 samples, so:
H(OUTLOOK) = - 5/14 * log2(5/14) - 5/14 * log2(5/14) - 4/14 * log2(4/14) = 1.577406282852345
Attribute temperature has 3 values: hot with 4 samples, mild with 6 samples, and cool with 4 samples, so:
H(TEMPERATURE) = - 4/14 * log2(4/14) - 6/14 * log2(6/14) - 4/14 * log2(4/14) = 1.5566567074628228
Attribute humidity has 2 values: normal with 7 samples and high with 7 samples, so:
H(HUMIDITY) = - 7/14 * log2(7/14) - 7/14 * log2(7/14) = 1.0
Attribute windy has 2 values: true with 6 samples and false with 8 samples, so:
H(WINDY) = - 6/14 * log2(6/14) - 8/14 * log2(8/14) = 0.9852281360342516
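The split information is just the entropy of the attribute's own value distribution, ignoring class labels. A minimal sketch (the `split_info` helper name is ours):

```python
import math

def split_info(value_counts):
    """Intrinsic value: entropy (base 2) of an attribute's value distribution."""
    total = sum(value_counts)
    return -sum(c / total * math.log2(c / total) for c in value_counts if c > 0)

h_outlook     = split_info([5, 5, 4])  # sunny, rainy, overcast
h_temperature = split_info([4, 6, 4])  # hot, mild, cool
h_humidity    = split_info([7, 7])     # normal, high
h_windy       = split_info([6, 8])     # true, false
print(h_outlook, h_humidity)
```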
Based on the results above, we can calculate the information gain ratios as follows:
IGR(OUTLOOK) = Gain(OUTLOOK) / H(OUTLOOK) = 0.246 / 1.577406282852345 = 0.15595221261270145
IGR(TEMPERATURE) = Gain(TEMPERATURE) / H(TEMPERATURE) = 0.029 / 1.5566567074628228 = 0.018629669509642094
IGR(HUMIDITY) = Gain(HUMIDITY) / H(HUMIDITY) = 0.151 / 1.0 = 0.151
IGR(WINDY) = Gain(WINDY) / H(WINDY) = 0.048 / 0.9852281360342516 = 0.048719680492692784
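Putting it together, attribute selection by gain ratio is just a division and an argmax. A sketch with the gain and split-information values hardcoded from the example above:

```python
# Gains and split information from the worked example above
gains = {"outlook": 0.246, "temperature": 0.029, "humidity": 0.151, "windy": 0.048}
split = {"outlook": 1.5774, "temperature": 1.5567, "humidity": 1.0, "windy": 0.9852}

igr = {a: gains[a] / split[a] for a in gains}   # information gain ratios
best = max(igr, key=igr.get)                    # attribute chosen for the split
print(best)  # outlook
```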
Based on the information gain ratios obtained, the attribute with the highest ratio (here, outlook) is selected from the attribute set as the decision tree node, and the node is split on its values.
Summary
The advantage of the C4.5 algorithm is that its classification rules are easy to understand and its accuracy is high.
The disadvantage of the C4.5 algorithm is that constructing the tree requires scanning and sorting the dataset multiple times, which makes the algorithm inefficient.