Decision tree learning with the C4.5 algorithm

Source: Internet
Author: User
Tags: id3

A decision tree is a predictive model that consists of three parts: decision nodes, branches, and leaf nodes.

A decision node represents a test on a sample, typically on one attribute of the sample to be classified; each branch leaving the node corresponds to a different outcome (value) of that test. Each leaf node represents a possible classification result.

Training the decision tree algorithm on a training set yields a decision tree model. To predict the category of an unknown sample, start at the root node of the tree and follow the branches downward according to the sample's attribute values until a leaf node is reached; the category label of that leaf is the predicted category of the unknown sample.

There is a well-known example on the web that vividly illustrates the decision-making process of a decision tree: a daughter deciding whether to meet a potential partner her mother has introduced. For example:


Daughter: How old is he?
Mother: ...
Daughter: Is he handsome?
Mother: Quite handsome.

Daughter: Is his income high?
Mother: Not very high; average.
Daughter: Is he a civil servant?
Mother: Yes, he works at the tax bureau.
Daughter: Alright then, I'll go meet him.

Now look at the example dataset: it contains 14 samples in total, and each sample has 4 attributes, representing the weather outlook, temperature, humidity, and whether it is windy. The last column is the classification result and can be understood as whether it is suitable to go out and play.


The following is a decision tree built from the above samples:


According to the model built, given a new sample <outlook = Rainy, temperature = Cool, humidity = Normal, windy = True>, we can walk from the root node downward and finally obtain the classification: no (do not play).
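To make this traversal concrete, here is a minimal Python sketch (an illustration, not taken from the original post). The dictionary below assumes the classic tree usually built from this weather dataset, with Outlook at the root, Humidity under Sunny and Windy under Rainy.

# A minimal sketch of top-down classification with a decision tree.
# The tree below assumes the classic tree built from the 14-sample
# weather dataset; it is illustrative, not the post's exact figure.
tree = {
    "attribute": "outlook",
    "branches": {
        "Sunny": {
            "attribute": "humidity",
            "branches": {"High": "no", "Normal": "yes"},
        },
        "Overcast": "yes",
        "Rainy": {
            "attribute": "windy",
            "branches": {True: "no", False: "yes"},
        },
    },
}

def classify(node, sample):
    """Walk from the root down to a leaf and return its class label."""
    while isinstance(node, dict):              # internal node
        value = sample[node["attribute"]]      # test the sample's attribute
        node = node["branches"][value]         # follow the matching branch
    return node                                # leaf node = class label

sample = {"outlook": "Rainy", "temperature": "Cool",
          "humidity": "Normal", "windy": True}
print(classify(tree, sample))                  # -> "no" (do not play)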

On reflection, this is somewhat similar to the tree-construction process in the FP-tree algorithm, but it is not the same. In fact, from the same dataset we can build many different decision trees, and Outlook does not have to be the root node. An FP-tree simply stores all of the sample information on a single tree, whereas building a decision tree clearly involves a process of choosing which attribute each node should test.

So the questions are: how do we select an attribute as the splitting attribute and divide the samples into smaller subsets, and when do we stop growing the tree? The decision tree should be constructed so that it classifies the training samples accurately and also predicts unknown samples (test samples) accurately. A common stopping strategy is to stop splitting a node when all of its samples belong to the same category, or when all of their attribute values are equal.

Different decision tree algorithms adopt different strategies. The following mainly introduces the C4.5 algorithm, focusing on the strategy C4.5 uses to select the splitting attribute at each node.

The C4.5 algorithm was proposed by Professor Ross Quinlan of the University of Sydney, Australia, in 1993 as an improvement of the ID3 algorithm. It can handle both continuous and discrete attributes, can handle attributes with missing values, uses the information gain rate rather than the information gain as the attribute selection criterion for the decision tree, and prunes the resulting branches to reduce overfitting.

The general decision tree algorithm framework is as follows:

TreeGrowth(E, F)                          // E -- training set, F -- attribute set
  if stopping_cond(E, F) = true then      // stop-split condition reached (e.g. all samples in the subset belong to the same class)
    leaf = createNode()                   // build a leaf node
    leaf.label = Classify(E)              // leaf node category label
    return leaf
  else
    root = createNode()                   // create an internal node
    root.test_cond = find_best_split(E, F)   // determine which attribute is selected to split E into smaller subsets
    let V = {v | v is a possible outcome of root.test_cond}
    for each v in V do
      Ev = {e | root.test_cond(e) = v and e ∈ E}
      child = TreeGrowth(Ev, F)
      // add child as a child of root and mark the edge (root --> child) as v
    end for
  end if
  return root
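As a minimal illustration (not from the original post), the same skeleton could be sketched in Python as follows. The helper names stopping_cond, find_best_split and majority_class are assumptions that mirror the pseudocode; find_best_split is the place where C4.5's information-gain-rate criterion, described below, would plug in.

from collections import Counter

def tree_growth(E, F):
    """Recursive decision tree construction, mirroring the pseudocode above.
    E: list of (attributes_dict, label) samples; F: list of attribute names."""
    if stopping_cond(E, F):                       # e.g. all samples share one class, or F is empty
        return {"leaf": True, "label": majority_class(E)}
    attr = find_best_split(E, F)                  # C4.5: attribute with the highest gain rate
    node = {"leaf": False, "attribute": attr, "children": {}}
    values = {sample[attr] for sample, _ in E}    # possible outcomes of the test
    for v in values:
        Ev = [(s, y) for s, y in E if s[attr] == v]
        node["children"][v] = tree_growth(Ev, [a for a in F if a != attr])
    return node

def stopping_cond(E, F):
    labels = {y for _, y in E}
    return len(labels) <= 1 or not F

def majority_class(E):
    return Counter(y for _, y in E).most_common(1)[0][0]

def find_best_split(E, F):
    # Placeholder: C4.5 would pick the attribute with the largest information gain rate.
    return F[0]

The function returns nested dictionaries, one per node, so the resulting tree can be walked exactly as in the classification sketch earlier.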


Main process: first, the given dataset is represented by the root node; then, starting from the root node (the root node included), an attribute is selected at each node to split that node's dataset into smaller subsets (subtrees). Splitting continues until, after some attribute is used, all the samples in a subset belong to one category, at which point that branch stops splitting.

How a node selects its splitting attribute is exactly what C4.5 specifies.

As mentioned earlier, C4.5 uses the information gain rate rather than the information gain as the attribute selection criterion for the decision tree. The following explains this step by step, starting from entropy:

Entropy: in information theory, entropy measures the minimum average number of bits required to encode the class of a randomly drawn member of a set S:

Entropy(S) = - Σ_{i=1..m} p_i · log2(p_i)

where p_i is the proportion of class i in the set S (see the later examples for a concrete calculation).

Understand "minimum":

1. For a 2-class problem, one bit (0 or 1) is enough to describe the class; for a 4-class problem, at least two bits (00, 01, 10, 11) are needed; for a c-class problem, log2(c) bits are needed.

2. A classification problem must also take the sizes of the classes into account (class imbalance). Among m + n samples, the number of bits needed to encode the class of the m samples is not the same as for the n samples, so the definition of entropy uses the idea of a weighted average.

Simply put, entropy describes the purity of a sample set: the purer the set, the smaller the entropy.

In other words, the greater the uncertainty about a variable, the greater its entropy; the more ordered a system is, the lower its information entropy (Baidu Encyclopedia).

For a two-class problem, the entropy lies in [0, 1]. If all the samples belong to the same class, the entropy is 0, and the category of a given sample is fully determined. If the two classes each make up half of the samples, the entropy is -(1/2 · log2(1/2) + 1/2 · log2(1/2)) = 1; in this case the class of a given sample is completely undeterminable, just as tossing a fair coin tells us nothing in advance about whether it will land heads or tails.

For a c-class problem, the entropy lies in [0, log2(c)].
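A small Python sketch of the entropy formula (an illustration, not part of the original post) confirms these bounds:

from math import log2

def entropy(proportions):
    """Entropy of a class distribution given as a list of proportions p_i."""
    return 0.0 - sum(p * log2(p) for p in proportions if p > 0)

print(entropy([1.0]))          # all samples in one class      -> 0.0
print(entropy([0.5, 0.5]))     # two classes split half/half   -> 1.0
print(entropy([0.25] * 4))     # 4 classes, uniform            -> 2.0 = log2(4)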

Several formulas used in the C4.5:

1 Information entropy of the training set


Entropy(S) = - Σ_{i=1..m} p_i · log2(p_i)

where m is the number of categories and p_i is the proportion of samples of category i in the dataset.
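As a worked example, assuming the 14-sample weather dataset above has the standard class counts of 9 "yes" and 5 "no" samples:

Entropy(S) = -(9/14) · log2(9/14) - (5/14) · log2(5/14) ≈ 0.940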

2 Division information entropy ---- if attribute A is selected to divide the dataset S, calculate attribute A's division information entropy for the set S.

Case 1: A is a discrete attribute with k different values. According to these k values, S is divided into k subsets {S1, S2, ..., Sk}, and the division information entropy of S by attribute A is:

Entropy_A(S) = Σ_{i=1..k} (|Si| / |S|) · Entropy(Si)

where |Si| and |S| denote the numbers of samples in Si and S.

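Continuing the assumed example, splitting by Outlook gives the subsets Sunny (2 yes, 3 no), Overcast (4 yes, 0 no) and Rainy (3 yes, 2 no), so

Entropy_Outlook(S) = (5/14) · 0.971 + (4/14) · 0 + (5/14) · 0.971 ≈ 0.694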

Case 2: A is a continuous attribute. The samples are sorted by the value of attribute A, and the midpoint of each pair of adjacent values is treated as a possible split point. For each possible split point, compute:

Entropy_A(S) = (|SL| / |S|) · Entropy(SL) + (|SR| / |S|) · Entropy(SR)


where SL and SR are the subsets of samples on the left and right of the split point. The split point with the smallest Entropy_A(S) value is selected as the best split point for attribute A, and the division entropy of S by attribute A at this best split point is taken as attribute A's division information entropy of S.
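As a hedged sketch of this split-point search (the temperature readings and labels below are made up for illustration and are not claimed to be the post's data):

from math import log2

def class_entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 0.0 - sum((c / total) * log2(c / total) for c in counts.values())

def best_continuous_split(values, labels):
    """Try the midpoint of every pair of adjacent sorted values and
    return (best threshold, smallest division entropy)."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                                  # identical values give no midpoint
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        e = (len(left) / len(pairs)) * class_entropy(left) \
            + (len(right) / len(pairs)) * class_entropy(right)
        if e < best[1]:
            best = (t, e)
    return best

# Hypothetical temperature readings and play labels, for illustration only.
temps = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
plays = ["yes", "no", "yes", "yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no"]
print(best_continuous_split(temps, plays))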

3 Information Gain

The information gain Gain(S, A) of partitioning the dataset S by attribute A is the entropy of the sample set S minus the division information entropy of S by attribute A, i.e.

Gain(S, A) = Entropy(S) - Entropy_A(S)
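Continuing the assumed example: Gain(S, Outlook) = Entropy(S) - Entropy_Outlook(S) ≈ 0.9403 - 0.6936 ≈ 0.247.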


4 Splitting information

The splitting information of an attribute is introduced to adjust the information gain:

SplitInfo_A(S) = - Σ_{i=1..k} (|Si| / |S|) · log2(|Si| / |S|)
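For the assumed example, Outlook splits the 14 samples into subsets of size 5, 4 and 5, so

SplitInfo_Outlook(S) = -(5/14) · log2(5/14) - (4/14) · log2(4/14) - (5/14) · log2(5/14) ≈ 1.577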


5 Information gain rate


GainRatio(S, A) = Gain(S, A) / SplitInfo_A(S)

The information gain rate uses the splitting information as the denominator. The more values an attribute has, the greater its splitting information, which partially offsets the influence of the number of attribute values.
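For the assumed example, GainRatio(S, Outlook) = 0.247 / 1.577 ≈ 0.156.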

Compared with ID3, which selects the best attribute directly by information gain, this avoids the bias whereby an attribute with many distinct values tends to have a large information gain and is therefore more easily chosen as the splitting attribute.

The formulas may look a bit dazzling, but in the end they simply produce the information gain rate.

The following uses the weather dataset introduced at the beginning of this post as an example of attribute selection.

The detailed calculation process is as follows:
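As a hedged Python sketch of this selection step (assuming the standard 14-sample weather.nominal data distributed with Weka; the numbers printed are this sketch's own output, not values quoted from the book), the gain rate of every attribute can be computed and compared as follows:

from math import log2

# Assumed copy of the classic 14-sample weather.nominal dataset
# (outlook, temperature, humidity, windy -> play); treat it as illustrative data.
data = [
    ("Sunny", "Hot", "High", False, "no"),     ("Sunny", "Hot", "High", True, "no"),
    ("Overcast", "Hot", "High", False, "yes"), ("Rainy", "Mild", "High", False, "yes"),
    ("Rainy", "Cool", "Normal", False, "yes"), ("Rainy", "Cool", "Normal", True, "no"),
    ("Overcast", "Cool", "Normal", True, "yes"), ("Sunny", "Mild", "High", False, "no"),
    ("Sunny", "Cool", "Normal", False, "yes"), ("Rainy", "Mild", "Normal", False, "yes"),
    ("Sunny", "Mild", "Normal", True, "yes"),  ("Overcast", "Mild", "High", True, "yes"),
    ("Overcast", "Hot", "Normal", False, "yes"), ("Rainy", "Mild", "High", True, "no"),
]
attributes = ["outlook", "temperature", "humidity", "windy"]

def class_entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return 0.0 - sum((labels.count(c) / n) * log2(labels.count(c) / n)
                     for c in set(labels))

def gain_rate(rows, col):
    """Information gain rate of splitting the rows by the attribute at index col."""
    labels = [r[-1] for r in rows]
    subsets = {}
    for r in rows:
        subsets.setdefault(r[col], []).append(r[-1])
    div_entropy = sum(len(s) / len(rows) * class_entropy(s) for s in subsets.values())
    split_info = 0.0 - sum(len(s) / len(rows) * log2(len(s) / len(rows))
                           for s in subsets.values())
    gain = class_entropy(labels) - div_entropy
    return gain / split_info if split_info > 0 else 0.0

for i, name in enumerate(attributes):
    print(name, round(gain_rate(data, i), 3))
# Outlook should come out with the largest gain rate and is chosen as the root.

Comparing the printed values shows which attribute C4.5 would pick for the root node.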





For example, the Outlook attribute, which has the largest information gain rate, is selected as the root node:


Recursing with the same procedure on each subset yields the decision tree shown at the beginning of this post.

This post cites some content from "Data Mining and Machine Learning: Weka Application Technology and Practice" and corrects an error in the decision tree calculation in the original book: the information gain rate of 0.44 given for Outlook in the book is wrong.

Please credit the original source when reprinting: http://blog.csdn.net/u010498696/article/details/46333911

References: "Data Mining and Machine Learning: Weka Application Technology and Practice"

