Classification algorithm: Decision Tree (C4.5)

Source: Internet
Author: User

C4.5 is another decision tree classification algorithm in machine learning. It is an important algorithm built on top of ID3, and its key improvements over ID3 are as follows:

1) It uses the information gain rate (gain ratio) to select splitting attributes. ID3 selects attributes by the information gain of the resulting subtrees; information can be defined in several ways, and ID3 uses entropy (a measure of purity), i.e., the change in entropy, whereas C4.5 uses the information gain rate.

2) It prunes during the construction of the decision tree: nodes containing very few elements may cause the constructed tree to overfit the training data, and the tree may generalize better if those nodes are discarded.

3) It can handle continuous (non-discrete) attributes.

4) It can handle incomplete data (missing attribute values).
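Improvement (3), handling continuous attributes, works by sorting the values and trying candidate thresholds between consecutive distinct values, keeping the binary split with the highest information gain. A minimal Python sketch (the function names and the toy data are illustrative, not from the original article):

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        ent -= p * math.log2(p)
    return ent

def best_threshold(values, labels):
    """Try midpoints between consecutive sorted values of a continuous
    attribute and keep the binary split with the highest gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, -1.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        gain = base - (len(left) / len(pairs)) * entropy(left) \
                    - (len(right) / len(pairs)) * entropy(right)
        if gain > best[1]:
            best = (t, gain)
    return best

# e.g. numeric temperatures with yes/no labels
t, g = best_threshold([64, 65, 68, 69, 70, 71],
                      ["yes", "no", "yes", "yes", "yes", "no"])
```

The resulting binary test (value <= t versus value > t) is then treated like any other attribute when comparing splits.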

How is the information gain rate calculated?
If you are familiar with the ID3 algorithm, you already know how to calculate the information gain; the formula is as follows (from Wikipedia):

    Gain(a) = Info(D) - Info_a(D)

Or, expressed step by step in a way that is more intuitive and easy to understand:

    • First, compute the information entropy of the training dataset D, partitioned by class label:

      Info(D) = - Σ_i p_i * log2(p_i), where p_i is the proportion of samples in class i

    • For each attribute a in the attribute set A, compute the conditional information entropy obtained by partitioning D on that attribute's values:

      Info_a(D) = Σ_j (|D_j| / |D|) * Info(D_j)

    • Calculate information gain

The information gain is then the former minus the latter, yielding one gain value for each attribute in A:

    Gain(a) = Info(D) - Info_a(D)

In this way, the information gain is calculated.
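The steps above can be sketched in a few lines of Python (a minimal illustration; the helper names are mine, not from the article):

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D): entropy of the class-label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(column, labels):
    """Gain(a) = Info(D) - Info_a(D), where `column` holds the value of
    attribute a for each sample and `labels` holds the class labels."""
    n = len(labels)
    subsets = {}
    for value, label in zip(column, labels):
        subsets.setdefault(value, []).append(label)
    conditional = sum(len(sub) / n * entropy(sub) for sub in subsets.values())
    return entropy(labels) - conditional
```

For instance, `info_gain(["a", "a", "b", "b"], ["yes", "no", "yes", "yes"])` compares the entropy of the whole label list against the size-weighted entropies of the "a" and "b" subsets.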

    • Calculate information gain Rate

The formula for calculating the information gain rate is as follows (from Wikipedia):

    IGR(a) = IG(a) / IV(a)

Here IG is the information gain, calculated according to the process described earlier, and IV is what we need to compute next: the split information, which measures the breadth and uniformity with which an attribute splits the data. Intuitively, it is the entropy of the attribute's own value distribution (from Wikipedia):

    IV(a) = - Σ_{j=1..v} (|D_j| / |D|) * log2(|D_j| / |D|)

where v is the number of distinct values of attribute a in the attribute set A and D_j is the subset of samples taking the j-th value.
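The split information can be sketched the same way (again an illustrative helper, not code from the article):

```python
import math
from collections import Counter

def split_info(column):
    """IV(a): entropy of attribute a's own value distribution. It is
    large when a splits the data into many small, even subsets."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())
```

An attribute with many distinct values (an ID-like column, say) has a high IV, and dividing the gain by IV is exactly what penalizes such attributes, the main bias of plain information gain.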

Example analysis

We illustrate how the C4.5 algorithm calculates the information gain and chooses the decision node using a classic training dataset D that has been cited many times: the standard 14-sample weather dataset from Quinlan's work.

    outlook   temperature  humidity  windy  play
    sunny     hot          high      false  no
    sunny     hot          high      true   no
    overcast  hot          high      false  yes
    rainy     mild         high      false  yes
    rainy     cool         normal    false  yes
    rainy     cool         normal    true   no
    overcast  cool         normal    true   yes
    sunny     mild         high      false  no
    sunny     cool         normal    false  yes
    rainy     mild         normal    false  yes
    sunny     mild         normal    true   yes
    overcast  mild         high      true   yes
    overcast  hot          normal    false  yes
    rainy     mild         high      true   no

The above training set has 4 attributes, i.e. the attribute set A = {outlook, temperature, humidity, windy}, and 2 class labels, i.e. the class label set C = {yes, no}, meaning suitable and not suitable for outdoor sports, respectively. This is a binary classification problem.


We have already calculated the information gain (the calculation is the same as in ID3); it is listed directly below:

Dataset D contains 14 training samples, of which 9 belong to class "yes" and 5 to class "no". Its information entropy is:

Info(D) = -9/14 * log2(9/14) - 5/14 * log2(5/14) = 0.940

The conditional information entropy is then computed for each attribute in the attribute set, as follows:

Info(OUTLOOK) = 5/14 * [- 2/5 * log2(2/5) - 3/5 * log2(3/5)] + 4/14 * [- 4/4 * log2(4/4) - 0/4 * log2(0/4)] + 5/14 * [- 3/5 * log2(3/5) - 2/5 * log2(2/5)] = 0.694
Info(TEMPERATURE) = 4/14 * [- 2/4 * log2(2/4) - 2/4 * log2(2/4)] + 6/14 * [- 4/6 * log2(4/6) - 2/6 * log2(2/6)] + 4/14 * [- 3/4 * log2(3/4) - 1/4 * log2(1/4)] = 0.911
Info(HUMIDITY) = 7/14 * [- 3/7 * log2(3/7) - 4/7 * log2(4/7)] + 7/14 * [- 6/7 * log2(6/7) - 1/7 * log2(1/7)] = 0.789
Info(WINDY) = 6/14 * [- 3/6 * log2(3/6) - 3/6 * log2(3/6)] + 8/14 * [- 6/8 * log2(6/8) - 2/8 * log2(2/8)] = 0.892

Based on the above data, we can calculate the information gain of each attribute, which determines the selection of the first (root) node:

Gain(OUTLOOK) = Info(D) - Info(OUTLOOK) = 0.940 - 0.694 = 0.246
Gain(TEMPERATURE) = Info(D) - Info(TEMPERATURE) = 0.940 - 0.911 = 0.029
Gain(HUMIDITY) = Info(D) - Info(HUMIDITY) = 0.940 - 0.789 = 0.151
Gain(WINDY) = Info(D) - Info(WINDY) = 0.940 - 0.892 = 0.048
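These four gains can be reproduced in a few lines of Python. The dataset literal below is the standard 14-sample weather dataset this example is based on (a reconstruction, since the original table image is not reproduced here); tiny differences in the last digit come from the article rounding intermediate values to two decimals:

```python
import math

# the 14-sample weather dataset used in the worked example
data = {
    "outlook":     ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy",
                    "overcast", "sunny", "sunny", "rainy", "sunny", "overcast",
                    "overcast", "rainy"],
    "temperature": ["hot", "hot", "hot", "mild", "cool", "cool", "cool",
                    "mild", "cool", "mild", "mild", "mild", "hot", "mild"],
    "humidity":    ["high", "high", "high", "high", "normal", "normal", "normal",
                    "high", "normal", "normal", "normal", "high", "normal", "high"],
    "windy":       [False, True, False, False, False, True, True,
                    False, False, False, True, True, False, True],
}
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]

def entropy(ys):
    n = len(ys)
    return -sum((ys.count(c) / n) * math.log2(ys.count(c) / n) for c in set(ys))

def gain(col, ys):
    """Info(D) - Info_a(D) for the attribute column `col`."""
    n = len(ys)
    cond = 0.0
    for v in set(col):
        sub = [y for x, y in zip(col, ys) if x == v]
        cond += len(sub) / n * entropy(sub)
    return entropy(ys) - cond

for name, col in data.items():
    # outlook ≈ 0.247, temperature ≈ 0.029, humidity ≈ 0.152, windy ≈ 0.048
    # (the article's 0.246 / 0.151 reflect two-decimal intermediate rounding)
    print(name, round(gain(col, labels), 3))
```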

Next, we calculate the split information measure H(V) for each attribute:

    • Outlook attribute

The attribute outlook has 3 values: sunny with 5 samples, rainy with 5 samples, and overcast with 4 samples, so

H(OUTLOOK) = - 5/14 * log2(5/14) - 5/14 * log2(5/14) - 4/14 * log2(4/14) = 1.577406282852345
    • Temperature attribute

The attribute temperature has 3 values: hot with 4 samples, mild with 6 samples, and cool with 4 samples, so

H(TEMPERATURE) = - 4/14 * log2(4/14) - 6/14 * log2(6/14) - 4/14 * log2(4/14) = 1.5566567074628228
    • Humidity attribute

The attribute humidity has 2 values: normal with 7 samples and high with 7 samples, so

H(HUMIDITY) = - 7/14 * log2(7/14) - 7/14 * log2(7/14) = 1.0
    • Windy attribute

The attribute windy has 2 values: true with 6 samples and false with 8 samples, so

H(WINDY) = - 6/14 * log2(6/14) - 8/14 * log2(8/14) = 0.9852281360342516

Based on the results above, we can calculate the information gain rate of each attribute:

IGR(OUTLOOK) = Gain(OUTLOOK) / H(OUTLOOK) = 0.246 / 1.577406282852345 = 0.15595221261270145
IGR(TEMPERATURE) = Gain(TEMPERATURE) / H(TEMPERATURE) = 0.029 / 1.5566567074628228 = 0.018629669509642094
IGR(HUMIDITY) = Gain(HUMIDITY) / H(HUMIDITY) = 0.151 / 1.0 = 0.151
IGR(WINDY) = Gain(WINDY) / H(WINDY) = 0.048 / 0.9852281360342516 = 0.048719680492692784

According to the information gain rates obtained, the attribute with the highest rate, outlook, is selected from the attribute set as the decision tree node, and the node is then split on its values.
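Putting the pieces together, selecting the root is just an argmax over Gain/IV. Using the rounded figures from the worked example above:

```python
# gains and split-information values from the worked example
gain = {"outlook": 0.246, "temperature": 0.029, "humidity": 0.151, "windy": 0.048}
iv   = {"outlook": 1.577, "temperature": 1.557, "humidity": 1.000, "windy": 0.985}

# gain ratio per attribute, then pick the best one
gain_ratio = {a: gain[a] / iv[a] for a in gain}
root = max(gain_ratio, key=gain_ratio.get)  # outlook has the highest ratio
```

(One detail not covered here: in Quinlan's full C4.5, the gain-ratio test is applied only among attributes whose plain gain is at least average, which guards against attributes with a very small IV dominating the ratio.)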

Summary

The advantage of the C4.5 algorithm is that its classification rules are easy to understand and its accuracy is high.
Its disadvantage is that the dataset must be scanned and sorted repeatedly while the tree is constructed, which makes the algorithm inefficient.
