C4.5 is another decision tree classification algorithm in machine learning. It is an important algorithm built on top of ID3, and it improves on ID3 in the following key ways:
1) It uses the information gain ratio to select splitting attributes. ID3 selects attributes by the information gain of the resulting subtrees; there are many ways to quantify information, and ID3 uses entropy (a measure of impurity), i.e. the change in entropy, whereas C4.5 uses the information gain ratio.
2) It prunes during decision tree construction. Nodes covering very few samples can make the constructed tree overfit the training data, and the tree is often better if these nodes are pruned away.
3) It can handle continuous (non-discrete) attributes.
4) It can handle incomplete data.
How is the information gain ratio calculated?
If you are familiar with the ID3 algorithm, you already know how to calculate the information gain; the formula (from Wikipedia) is Gain(a) = Info(D) - Info_a(D), the entropy of D minus the conditional entropy of D given attribute a.
Or, expressed step by step, which is more intuitive and easier to follow:
- Partition the training dataset D by class label and compute its information entropy: Info(D) = -Σ_i p_i * log2(p_i), where p_i is the proportion of samples with class label i.
- For each attribute a in the attribute set A, partition D by the values of a and compute the conditional entropy: Info_a(D) = Σ_v (|D_v| / |D|) * Info(D_v), where D_v is the subset of D in which attribute a takes value v.
- Calculate the information gain: the former minus the latter, Gain(a) = Info(D) - Info_a(D). Doing this for every attribute in A yields a set of information gains.
- Calculate the information gain ratio
The formula for the information gain ratio is as follows (from Wikipedia): IGR(a) = IG(a) / IV(a).
Here IG is the information gain, calculated according to the process described above, and IV is what we still need to compute: the split information (also called intrinsic value), which measures how broadly and uniformly the attribute splits the data. Its formula (from Wikipedia) is:
IV(a) = -Σ_v (|D_v| / |D|) * log2(|D_v| / |D|)
where v ranges over all values taken by attribute a in the attribute set A.
Example analysis
We illustrate how the C4.5 algorithm calculates the information gain ratio and chooses a decision node using a classic, much-quoted training dataset D.
The training set has 4 attributes, the attribute set A = {outlook, temperature, humidity, windy}, and 2 class labels, the class label set C = {yes, no}, meaning suitable or unsuitable for outdoor sports. This is a binary classification problem.
We have already calculated the information gains (in the ID3 discussion); they are listed here directly.
Dataset D contains 14 training samples, of which 9 belong to class "yes" and 5 to class "no". Its information entropy is:
Info(D) = -9/14 * log2(9/14) - 5/14 * log2(5/14) = 0.940
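This calculation can be reproduced with a short Python sketch (the `entropy` helper name is ours, not part of the original text):

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a class-count distribution."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

info_d = entropy([9, 5])  # 9 "yes" samples, 5 "no" samples
print(round(info_d, 3))   # 0.94
```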
Next, the conditional information entropy is computed separately for each attribute in the attribute set, as follows:
Info(OUTLOOK) = 5/14 * [- 2/5 * log2(2/5) - 3/5 * log2(3/5)] + 4/14 * [- 4/4 * log2(4/4) - 0/4 * log2(0/4)] + 5/14 * [- 3/5 * log2(3/5) - 2/5 * log2(2/5)] = 0.694
Info(TEMPERATURE) = 4/14 * [- 2/4 * log2(2/4) - 2/4 * log2(2/4)] + 6/14 * [- 4/6 * log2(4/6) - 2/6 * log2(2/6)] + 4/14 * [- 3/4 * log2(3/4) - 1/4 * log2(1/4)] = 0.911
Info(HUMIDITY) = 7/14 * [- 3/7 * log2(3/7) - 4/7 * log2(4/7)] + 7/14 * [- 6/7 * log2(6/7) - 1/7 * log2(1/7)] = 0.789
Info(WINDY) = 6/14 * [- 3/6 * log2(3/6) - 3/6 * log2(3/6)] + 8/14 * [- 6/8 * log2(6/8) - 2/8 * log2(2/8)] = 0.892
Based on the above, we can calculate the information gain each attribute would give if chosen as the root node:
Gain(OUTLOOK) = Info(D) - Info(OUTLOOK) = 0.940 - 0.694 = 0.246
Gain(TEMPERATURE) = Info(D) - Info(TEMPERATURE) = 0.940 - 0.911 = 0.029
Gain(HUMIDITY) = Info(D) - Info(HUMIDITY) = 0.940 - 0.789 = 0.151
Gain(WINDY) = Info(D) - Info(WINDY) = 0.940 - 0.892 = 0.048
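The conditional entropies and gains above can be reproduced in one sketch. The helper names are ours; the (yes, no) counts per attribute value are read off the worked figures above, and tiny differences against the hand-rounded values are expected:

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a class-count distribution."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def cond_entropy(partitions, n):
    """Weighted average entropy of the subsets an attribute splits D into."""
    return sum(sum(p) / n * entropy(p) for p in partitions)

n = 14
info_d = entropy([9, 5])
# (yes, no) counts within each attribute value, from the worked example
splits = {
    "outlook":     [(2, 3), (4, 0), (3, 2)],  # sunny, overcast, rainy
    "temperature": [(2, 2), (4, 2), (3, 1)],  # hot, mild, cool
    "humidity":    [(3, 4), (6, 1)],          # high, normal
    "windy":       [(3, 3), (6, 2)],          # true, false
}
gains = {a: info_d - cond_entropy(p, n) for a, p in splits.items()}
print(gains)
```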
Next, we calculate the split information H for each attribute.
Attribute outlook has 3 values: sunny with 5 samples, rainy with 5 samples, and overcast with 4 samples, so:
H(OUTLOOK) = - 5/14 * log2(5/14) - 5/14 * log2(5/14) - 4/14 * log2(4/14) = 1.577406282852345
Attribute temperature has 3 values: hot with 4 samples, mild with 6 samples, and cool with 4 samples, so:
H(TEMPERATURE) = - 4/14 * log2(4/14) - 6/14 * log2(6/14) - 4/14 * log2(4/14) = 1.5566567074628228
Attribute humidity has 2 values: normal with 7 samples and high with 7 samples, so:
H(HUMIDITY) = - 7/14 * log2(7/14) - 7/14 * log2(7/14) = 1.0
Attribute windy has 2 values: true with 6 samples and false with 8 samples, so:
H(WINDY) = - 6/14 * log2(6/14) - 8/14 * log2(8/14) = 0.9852281360342516
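The split information is just the entropy of the attribute's own value distribution, ignoring class labels. A minimal sketch (the `split_info` helper name is ours):

```python
import math

def split_info(value_counts):
    """Intrinsic value: entropy (base 2) of an attribute's value distribution."""
    total = sum(value_counts)
    return -sum(c / total * math.log2(c / total) for c in value_counts if c > 0)

h_outlook     = split_info([5, 5, 4])  # sunny, rainy, overcast
h_temperature = split_info([4, 6, 4])  # hot, mild, cool
h_humidity    = split_info([7, 7])     # normal, high
h_windy       = split_info([6, 8])     # true, false
print(h_outlook, h_humidity)
```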
Based on the results above, we can calculate the information gain ratios as follows:
IGR(OUTLOOK) = Gain(OUTLOOK) / H(OUTLOOK) = 0.246 / 1.577406282852345 = 0.15595221261270145
IGR(TEMPERATURE) = Gain(TEMPERATURE) / H(TEMPERATURE) = 0.029 / 1.5566567074628228 = 0.018629669509642094
IGR(HUMIDITY) = Gain(HUMIDITY) / H(HUMIDITY) = 0.151 / 1.0 = 0.151
IGR(WINDY) = Gain(WINDY) / H(WINDY) = 0.048 / 0.9852281360342516 = 0.048719680492692784
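Putting it together, attribute selection by gain ratio is just a division and an argmax. A sketch with the gain and split-information values hardcoded from the example above:

```python
# Gains and split information from the worked example above
gains = {"outlook": 0.246, "temperature": 0.029, "humidity": 0.151, "windy": 0.048}
split = {"outlook": 1.5774, "temperature": 1.5567, "humidity": 1.0, "windy": 0.9852}

igr = {a: gains[a] / split[a] for a in gains}   # information gain ratios
best = max(igr, key=igr.get)                    # attribute chosen for the split
print(best)  # outlook
```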
Based on the information gain ratios obtained, the attribute with the highest ratio (here, outlook) is selected from the attribute set as the decision tree node, and the node is split on its values.
Summary
The advantage of the C4.5 algorithm is that its classification rules are easy to understand and its accuracy is high.
The disadvantage of the C4.5 algorithm is that constructing the tree requires scanning and sorting the dataset multiple times, which makes the algorithm inefficient.