Decision tree learning with the C4.5 algorithm


A decision tree is a predictive model consisting of three parts: decision nodes, branches, and leaf nodes. A decision node represents a test on a sample, usually on one attribute of the sample to be classified; each possible outcome of that test gives rise to a branch, so a branch represents one value of the attribute tested at the decision node. Each leaf node represents a possible classification result.

Training the decision tree algorithm on a training set produces a decision tree model. When the model is used to judge the class of an unknown sample (one whose category is not known), we start from the root node of the tree and search top-down, following one branch at each node, until a leaf node is reached; the category label of that leaf is the category assigned to the unknown sample.

There is a well-known example on the web that illustrates how decisions are made with a decision tree (a mother introduces a potential partner to her daughter, and the daughter decides whether to meet him):

Daughter: How old is he?
Mother: ...
Daughter: Is he handsome?
Mother: Very handsome.
Daughter: Is his income high?
Mother: Not very high; about average.
Daughter: Is he a civil servant?
Mother: Yes, he works at the tax bureau.
Daughter: All right, I'll meet him.

Take another example: the data set below contains 14 samples, each with 4 attributes: outlook, temperature, humidity, and windy. The last column is the classification result, which can be read as whether the day is suitable for going out to play (play).
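
Written out in Python, here is a sketch of that data; it assumes the standard Weka weather.nominal data set, which matches the description above (14 samples, the four attributes, and the play label):

# Assumed data: the standard Weka "weather.nominal" samples.
# Columns: outlook, temperature, humidity, windy, play.
weather = [
    ("sunny",    "hot",  "high",   False, "no"),
    ("sunny",    "hot",  "high",   True,  "no"),
    ("overcast", "hot",  "high",   False, "yes"),
    ("rainy",    "mild", "high",   False, "yes"),
    ("rainy",    "cool", "normal", False, "yes"),
    ("rainy",    "cool", "normal", True,  "no"),
    ("overcast", "cool", "normal", True,  "yes"),
    ("sunny",    "mild", "high",   False, "no"),
    ("sunny",    "cool", "normal", False, "yes"),
    ("rainy",    "mild", "normal", False, "yes"),
    ("sunny",    "mild", "normal", True,  "yes"),
    ("overcast", "mild", "high",   True,  "yes"),
    ("overcast", "hot",  "normal", False, "yes"),
    ("rainy",    "mild", "high",   True,  "no"),
]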


The following is a decision tree built from the above samples:


According to the model we built, given another sample <outlook = rainy, temperature = cool, humidity = normal, windy = true>, we can start at the root node, walk down the tree, and arrive at the answer: no play.
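
As a minimal sketch of this top-down lookup, assuming the usual tree for this data (outlook at the root, with humidity and windy below it), written out by hand rather than produced by the algorithm:

# Hand-written nested-dict form of the decision tree, for illustration only.
tree = {
    "outlook": {
        "sunny":    {"humidity": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rainy":    {"windy": {True: "no", False: "yes"}},
    }
}

def classify(node, sample):
    """Walk from the root to a leaf, following the branch that matches the sample."""
    while isinstance(node, dict):
        attribute = next(iter(node))                   # attribute tested at this node
        node = node[attribute][sample[attribute]]      # follow the matching branch
    return node                                        # leaf reached: the class label

sample = {"outlook": "rainy", "temperature": "cool", "humidity": "normal", "windy": True}
print(classify(tree, sample))                          # -> no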

Thinking about it carefully, this is a bit like the tree-building process in the FP-tree algorithm, but the two are not the same. In fact, from the same data set we could build many different decision trees, and outlook does not have to be the root node. An FP-tree simply stores all of the sample information on a single tree, while building a decision tree clearly involves choosing, at each node, an attribute with which to split the data.

So here is the problem: how do we choose an attribute as the splitting attribute and divide the samples into smaller subsets? And when do we stop growing the tree, so that the decision tree not only classifies the training samples accurately but also predicts unknown (test) samples well? A common stopping criterion is that all samples in a subset belong to the same category, or that all samples have identical attribute values.

Different decision tree algorithms adopt different strategies. The following introduces the C4.5 algorithm, focusing on the strategy C4.5 uses to select the attribute that divides a node into subsets.

The C4.5 algorithm was proposed in 1993 by Professor Ross Quinlan of the University of Sydney, Australia, as an improvement on the ID3 algorithm. It can handle both continuous and discrete attributes, can handle attributes with missing values, uses the information gain ratio rather than the information gain as the attribute selection criterion for the decision tree, and prunes branches to reduce over-fitting.

The following is the general decision tree induction framework:

TreeGrowth(E, F)                              // E: training set, F: attribute set
  if stopping_cond(E, F) = true then          // stop condition reached (e.g. all samples in the subset belong to one class)
    leaf = createNode()                       // build a leaf node
    leaf.label = Classify(E)                  // label the leaf with the class of E
    return leaf
  else
    root = createNode()                       // create an internal node
    root.test_cond = find_best_split(E, F)    // decide which attribute splits E into smaller subsets
    let V = {v | v is a possible outcome of root.test_cond}
    for each v in V do
      Ev = {e | root.test_cond(e) = v and e in E}
      child = TreeGrowth(Ev, F)
      add child under root and label the edge (root --> child) as v
    end for
  end if
  return root


Main process: the root node represents the entire given data set. Starting from the root node, each node (including the root) selects an attribute and uses it to divide that node's data set into smaller subsets (subtrees), splitting one set into several. Splitting stops at a node once all of the samples in its subset belong to one category.
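
Below is a minimal runnable Python sketch of this recursive framework; find_best_split is passed in as a parameter, and the C4.5-specific way of choosing it is what the rest of this post describes:

from collections import Counter

def tree_growth(samples, attributes, find_best_split):
    """Generic recursive tree induction following the TreeGrowth framework above.

    samples:         list of (attribute_dict, label) pairs
    attributes:      attribute names still available for splitting
    find_best_split: function(samples, attributes) -> name of the splitting attribute
    """
    labels = [label for _, label in samples]
    # Stopping condition: every sample has the same class, or no attribute is left.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]    # leaf: majority class label
    best = find_best_split(samples, attributes)        # choose the splitting attribute
    node = {best: {}}
    for v in {attrs[best] for attrs, _ in samples}:    # one branch per attribute value
        subset = [(attrs, label) for attrs, label in samples if attrs[best] == v]
        remaining = [a for a in attributes if a != best]
        node[best][v] = tree_growth(subset, remaining, find_best_split)
    return node

ID3, C4.5, and CART all fit this skeleton; they differ mainly in how find_best_split and the stopping and pruning rules are defined.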

How a node chooses this attribute is exactly the question C4.5 answers.

As mentioned earlier, C4.5 uses the information gain ratio rather than the information gain as the attribute selection criterion for the decision tree. The following explains it step by step, starting from entropy:

Entropy: in information theory, the entropy of a set S determines the minimum average number of bits required to encode the classification of any member of S:

Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i

where p_i is the proportion of class i in the set S (see the worked example below).

Understand "minimum":

1. For the 2 classification problem, a description of bits 1 and 0 is sufficient to classify; for 4 classification problems, at least two bits are required to describe the 11;C classification log2 (c)

2. For classification problems, also consider the proportion of the class (sample imbalance problem), m+n samples where the M sample classification represents the number of bits and n samples represent a different number of bits, all the definition of entropy there is a weighted average idea.

Simply put, entropy describes the purity of a sample set: the purer the set, the smaller the entropy.

In other words, "the greater the uncertainty of a variable, the greater the entropy; the more ordered a system, the lower its information entropy" (Baidu Encyclopedia).

For a two-class problem, the entropy lies in [0, 1]. If all samples belong to the same class, the entropy is 0; in that case, given a sample, its category is certain. If the two classes each account for half of the samples, the entropy is 1 (each of the two terms of the formula contributes 1/2); in that case, given a sample, its class cannot be determined at all, just as we cannot predict whether a tossed coin will land heads or tails.
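
A quick numeric check of these two extremes (a small Python sketch):

import math

def entropy(proportions):
    """Entropy in bits: -sum(p * log2(p)) over the non-zero class proportions."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)

print(entropy([1.0]))         # all samples in one class -> 0.0
print(entropy([0.5, 0.5]))    # half and half            -> 1.0
print(entropy([0.25] * 4))    # 4 equally likely classes -> 2.0 = log2(4)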

For a c-class problem, the entropy lies in [0, log2(c)].

Several formulas used in C4.5:

1 Information entropy of the training set

Entropy(S) = -\sum_{i=1}^{m} p_i \log_2 p_i

where m is the number of categories and p_i is the proportion of samples of category i in the data set.
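
For example, assuming the weather data sketched above, which has 9 "yes" and 5 "no" samples:

Entropy(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940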

2 Division information entropy: suppose attribute A is selected to divide the data set S; compute the division entropy of attribute A with respect to the set S.

Case 1: A is discrete with k distinct values. According to these k values, S is divided into k subsets {S1, S2, ..., Sk}, and the entropy of dividing S by attribute A is (where |Si| and |S| denote the numbers of samples in the subset and in S):

Entropy_A(S) = \sum_{i=1}^{k} \frac{|S_i|}{|S|} Entropy(S_i)
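
Continuing that example, and assuming the usual per-value class counts (sunny: 2 yes / 3 no; overcast: 4 yes / 0 no; rainy: 3 yes / 2 no), dividing by outlook gives:

Entropy_{outlook}(S) = \frac{5}{14}(0.971) + \frac{4}{14}(0) + \frac{5}{14}(0.971) \approx 0.694

where 0.971 = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} is the entropy of the sunny (and, by symmetry, the rainy) subset.
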
Case 2: A is continuous. Sort the samples by the value of attribute A and treat the midpoint of each pair of adjacent values as a possible split point. For each possible split point compute:

Entropy_A(S) = \frac{|S_L|}{|S|} Entropy(S_L) + \frac{|S_R|}{|S|} Entropy(S_R)

where S_L and S_R are the left and right subsets produced by the split point. The split point with the smallest Entropy_A(S) is selected as the best split point of attribute A, and the entropy at that split point is taken as the division entropy of attribute A with respect to the set S.
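
A small Python sketch of how the candidate split points of a continuous attribute could be enumerated (the numeric values in the example call are made up for illustration):

import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_numeric_split(values, labels):
    """Try the midpoint between each pair of adjacent sorted values and
    return (weighted entropy, threshold) of the best split point."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (float("inf"), None)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                   # no split between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        e = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        best = min(best, (e, threshold))
    return best

print(best_numeric_split([64, 65, 68, 69, 70], ["yes", "no", "yes", "yes", "no"]))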

3 Information gain

The information gain Gain(S, A) obtained by partitioning the data set S by attribute A is the entropy of the sample set S minus the division entropy of the sample subsets after S is divided by attribute A, i.e.

Gain(S, A) = Entropy(S) - Entropy_A(S)
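
With the numbers from the running weather example:

Gain(S, outlook) = 0.940 - 0.694 \approx 0.247
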
4 Splitting information

To adjust the information gain, the splitting information of attribute A is introduced:

SplitInfo_A(S) = -\sum_{i=1}^{k} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}
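
For outlook in the running example, whose three values split the 14 samples 5 / 4 / 5:

SplitInfo_{outlook}(S) = -\frac{5}{14}\log_2\frac{5}{14} - \frac{4}{14}\log_2\frac{4}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 1.577
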
5 Information gain ratio

GainRatio(A) = \frac{Gain(S, A)}{SplitInfo_A(S)}

The information gain ratio uses the splitting information as its denominator: the more values an attribute has, the larger its splitting information, which partially offsets the advantage that attributes with many values would otherwise enjoy.
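
Putting the running example together:

GainRatio(outlook) = \frac{0.247}{1.577} \approx 0.156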

Compared with ID3, which selects the best attribute directly by information gain, this avoids the situation where an attribute with many distinct values obtains a large information gain and is therefore too easily chosen as the splitting attribute.

The formulas may seem numerous and dazzling, but they all exist simply to obtain the information gain ratio.

The following uses the weather data set introduced at the beginning of this post to illustrate attribute selection.

Specific process:
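
Below is a runnable Python sketch of this selection step (again assuming the standard Weka weather.nominal samples; the expected output in the final comment follows from that assumption), which computes the gain ratio of every attribute at the root:

import math
from collections import Counter

# Assumed data: the standard Weka weather.nominal samples, repeated here so the
# snippet is self-contained. Columns: outlook, temperature, humidity, windy, play.
rows = [
    ("sunny", "hot", "high", False, "no"),       ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),   ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),   ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"), ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),   ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"), ("rainy", "mild", "high", True, "no"),
]
attributes = ["outlook", "temperature", "humidity", "windy"]   # names of the first 4 columns

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, col):
    n = len(rows)
    subsets = {}
    for r in rows:
        subsets.setdefault(r[col], []).append(r[-1])   # group class labels by attribute value
    divide = sum(len(s) / n * entropy(s) for s in subsets.values())       # Entropy_A(S)
    gain = entropy([r[-1] for r in rows]) - divide                        # Gain(S, A)
    split_info = sum(-(len(s) / n) * math.log2(len(s) / n) for s in subsets.values())
    return gain / split_info                                              # GainRatio(A)

for col, name in enumerate(attributes):
    print(name, round(gain_ratio(rows, col), 3))
# outlook should come out highest (about 0.156), so it becomes the root node,
# consistent with the corrected value mentioned at the end of this post.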





When the root node selects the outlook attribute, the data set is divided as follows:


Carrying out this process recursively produces the decision tree shown at the beginning of the post.

This post cites some content from Data Mining and Machine Learning: Weka Application Technology and Practice and corrects an error in the decision tree calculation in the original book: the information gain ratio of outlook given in the book, 0.44, is wrong.

Please credit the source when reprinting: http://blog.csdn.net/u010498696/article/details/46333911

Reference: "Data mining and machine learning Weka application technology and practice"
