1. Introduction to the algorithm background

The classification tree (decision tree) is a very common classification method. It is a form of supervised learning. Supervised learning is simple to describe: we are given a set of samples, each with a set of attributes and a category drawn from a predetermined set of categories, and by learning from them we obtain a classifier that can assign the correct class to new objects. Classification is essentially a mapping from attributes to categories. The C4.5 classification tree is one of the most popular decision tree algorithms. The following data set serves as the basis for the examples:

This golf data set is the basis of the discussion in this blog post. The goal of the classification is to determine whether a day is suitable for playing golf based on weather conditions such as outlook, temperature, humidity, and wind.

2. Algorithm description

Strictly speaking, C4.5 is not a single algorithm but a set of algorithms: C4.5 has a number of functions, each function corresponds to an algorithm, and together they make up C4.5. The framework of the C4.5 classification tree construction algorithm is shown below:

Figure 1

The framework of the algorithm is fairly clear: starting from the root node, it repeatedly splits, recurses, and grows until the final tree is obtained. The root node represents the entire training set, and the algorithm recursively divides the data set into smaller data sets by testing an attribute at each node. The subtree rooted at a node corresponds to the subset of the original data set that satisfies the attribute tests on the path to that node. The recursion continues until the data set covered by a subtree belongs to a single class. The decision tree for the golf data set is as follows:

Figure 1 gives only the framework of the C4.5 classification tree construction algorithm; some details are not spelled out. The questions that arise are analyzed in detail below:

**What are the tests in the classification tree?**

A test in a classification tree is a test performed on one attribute of a sample. Sample attributes come in two kinds: discrete and continuous. For a discrete attribute this is simple: the attribute takes several values, each value corresponds to one branch of the test, and the test checks which branch a sample's attribute value falls into. The data set is thereby divided into several groups. For a continuous attribute, the test always has two branches, separated by a split threshold; how this threshold is chosen is discussed below.
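The two kinds of test can be sketched as follows. This is only a minimal illustration, not C4.5's actual code: the function names, the dictionary-based sample representation, and the sample values are assumptions made for the example.

```python
from collections import defaultdict


def split_discrete(rows, attr):
    """Group samples by the value of a discrete attribute (one branch per value)."""
    branches = defaultdict(list)
    for row in rows:
        branches[row[attr]].append(row)
    return dict(branches)


def split_continuous(rows, attr, threshold):
    """Split samples into exactly two branches around a threshold."""
    return {
        "<=": [row for row in rows if row[attr] <= threshold],
        ">": [row for row in rows if row[attr] > threshold],
    }


# Hypothetical samples, for illustration only.
rows = [
    {"outlook": "sunny", "temperature": 85},
    {"outlook": "rainy", "temperature": 70},
    {"outlook": "sunny", "temperature": 72},
]
by_outlook = split_discrete(rows, "outlook")
by_temperature = split_continuous(rows, "temperature", 71.0)
```

A discrete test yields as many groups as there are observed values, while a continuous test always yields two.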

**How do I choose a test?**

Each node in the classification tree corresponds to a test, but how are these tests chosen? C4.5 chooses tests according to information-theoretic criteria. One is the gain. In information theory, entropy measures the amount of information in a given distribution: its value is the minimum average number of bits required to represent the distribution without loss. In essence, entropy corresponds to uncertainty, the richness of possible variation. The gain is the reduction in this uncertainty after a test is applied, which corresponds to the benefit of the classification. The other criterion is the gain ratio, which equals the gain divided by the entropy of the split itself. It is used to overcome a drawback of the gain: using the gain as the measure leads the classification tree to prefer tests with more branches, a preference that needs to be suppressed. During tree growth, the algorithm always "greedily" chooses the test with the highest value of the information criterion.

**How do I choose a threshold for a continuous variable?**

As noted under "What are the tests in the classification tree?", the branches of a continuous attribute are separated by a threshold. How is this threshold determined? It is simple: sort the samples (at the root node) or the subset of samples (at a subtree) by the continuous attribute in ascending order. If the attribute takes N distinct values, there are N-1 candidate split thresholds, each being the midpoint of two successive values in the sorted list. The task is then to select, among the N-1 candidates, the one that maximizes the information criterion described above. For example, to select a threshold for the temperature attribute of the golf data set, first sort the samples by temperature as follows:

This gives up to 13 candidate thresholds, one between each pair of adjacent samples (fewer when values repeat): middle[64,65], middle[65,68], ..., middle[83,85]. Which one is optimal? Take middle[71,72], marked by the red line, as an example. Its entropy is calculated as follows:

The calculation above gives an entropy of 0.939 after the split, nearly as large as the 0.940 entropy of the unsplit data set computed later in this post, so the gain of this test is minimal. (The gain is the entropy before the test minus the entropy after it, so the larger the entropy after the test, the smaller the gain; this can be seen from the gain formula given later.) Following this description, finding the optimal threshold requires computing the gain or entropy of every candidate threshold, that is, N-1 calculations (up to 13 for the temperature attribute). Can this be improved so that fewer calculations are needed? Yes, as the following figure shows:

The green lines in the figure mark the candidate thresholds that can possibly be optimal. According to information theory, a split point such as middle[72,75] (red line), where the neighboring samples 72 and 75 belong to the same class, cannot be the optimal threshold. (Splitting samples of the same class into different groups does not help the classification, because it does not reduce the uncertainty.)
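The threshold search and the shortcut above can be sketched as follows. This is a minimal sketch under stated assumptions, not C4.5's actual code: the function names are made up for the example, and `GOLF` assumes the (temperature, play) pairs of the usual 14-sample golf data set, since the original table is not reproduced here.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())


def candidate_thresholds(samples):
    """Midpoints between successive distinct attribute values."""
    values = sorted({value for value, _ in samples})
    return [(a + b) / 2 for a, b in zip(values, values[1:])]


def boundary_thresholds(samples):
    """Drop midpoints whose two neighboring values share one single class."""
    classes = {}
    for value, label in samples:
        classes.setdefault(value, set()).add(label)
    values = sorted(classes)
    return [
        (a + b) / 2
        for a, b in zip(values, values[1:])
        # A value that occurs with mixed classes is conservatively kept.
        if len(classes[a] | classes[b]) > 1
    ]


def split_entropy(samples, threshold):
    """Weighted entropy of the two subsets produced by a threshold."""
    left = [label for value, label in samples if value <= threshold]
    right = [label for value, label in samples if value > threshold]
    total = len(samples)
    return len(left) / total * entropy(left) + len(right) / total * entropy(right)


def best_threshold(samples):
    """Boundary threshold with the smallest entropy after the split.

    Assumes the samples contain at least two classes.
    """
    return min(boundary_thresholds(samples), key=lambda t: split_entropy(samples, t))


# Assumed (temperature, play) pairs of the 14-sample golf data set.
GOLF = [(64, "yes"), (65, "no"), (68, "yes"), (69, "yes"), (70, "yes"),
        (71, "no"), (72, "no"), (72, "yes"), (75, "yes"), (75, "yes"),
        (80, "no"), (81, "yes"), (83, "yes"), (85, "no")]
print(round(split_entropy(GOLF, 71.5), 3))  # 0.939
```

Only the boundary candidates need their entropy computed, which is where the saving comes from.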

**How does tree growth terminate?**

As mentioned earlier, tree growth is a recursive process, so when does the recursion reach a terminating condition? There are two ways. First, if all samples covered by a branch of a node belong to the same class, the recursion terminates and the branch produces a leaf node. Second, if the number of samples covered by a branch is less than a threshold, a leaf node is also produced, which terminates tree growth.

**How do I determine the class of a leaf node?**

As mentioned above, there are two ways to terminate tree growth. In the first, the leaf node covers samples of a single class, so the class of the leaf node is obvious. In the second, the samples covered by the leaf node may not all belong to the same class; a straightforward approach is to assign the leaf node the class to which the majority of its samples belong.
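The two stopping conditions and the majority rule can be sketched in a few lines. The function name `leaf_label_or_none` and the `min_samples` parameter are hypothetical names chosen for this illustration, and the default threshold of 2 is an arbitrary assumption.

```python
from collections import Counter


def leaf_label_or_none(labels, min_samples=2):
    """Return the leaf class if growth should stop here, otherwise None."""
    counts = Counter(labels)
    if len(counts) == 1:
        # First condition: every covered sample belongs to the same class.
        return labels[0]
    if len(labels) < min_samples:
        # Second condition: too few samples, so use the majority class.
        return counts.most_common(1)[0][0]
    return None  # keep splitting


print(leaf_label_or_none(["yes", "yes", "yes"]))               # yes
print(leaf_label_or_none(["yes", "no", "no"], min_samples=5))  # no
print(leaf_label_or_none(["yes", "no", "no"], min_samples=2))  # None
```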

**How is a test chosen, as asked earlier?** The choice of test is based on information-theoretic criteria, of which there are two: the gain and the gain ratio. First, consider how the gain is calculated. Suppose a random variable x may belong to any one of C classes, and the sample statistics give the probability p_i that it belongs to class i. The entropy needed to classify a sample is then

**Equation 1**

$$\mathrm{Info}(D) = -\sum_{i=1}^{C} p_i \log_2 p_i$$

For the golf data set, with 9 "yes" and 5 "no" samples, the entropy of {PlayGolf?} according to this formula is

$$\mathrm{Info}(D) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14}$$

The value of this expression is 0.940. Its information-theoretic meaning is that communicating the answer to PlayGolf? to someone else takes at least 0.940 bits on average. The goal of C4.5 is to reduce this entropy through classification. So we consider each attribute test in turn: an attribute test divides the samples into subsets, which makes the samples progressively more ordered, so the entropy becomes smaller. This reduction in entropy is the basis for choosing an attribute test. For the golf data set, for example, the gain of Outlook, Gain(Outlook), is given by the following formula, where Info denotes the entropy and D_v is the subset of D with the v-th value of Outlook:

$$\mathrm{Gain}(\mathrm{Outlook}) = \mathrm{Info}(D) - \sum_{v=1}^{V} \frac{|D_v|}{|D|}\,\mathrm{Info}(D_v)$$

In essence, an attribute test divides the data set D into V subsets, which makes D more ordered and its entropy smaller. The entropy after the split is the weighted sum of the entropies of the subsets. By calculation we get Gain(Outlook) = 0.940 - 0.694 = 0.246, Gain(Windy) = 0.940 - 0.892 = 0.048, and so on.

It follows that the first test attribute is Outlook. Note that the test attribute is selected from a candidate set consisting of all attributes in the data set. For attributes already tested on the path from the root to the current node (which we call inherited attributes), the formula readily shows that their entropy gain is 0, so these inherited attributes need not be considered and can be removed from the candidates.
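The entropy and gain figures quoted above can be checked with a short script. This is a sketch under assumptions: `GOLF` lists the (outlook, windy, play) columns of the usual 14-sample golf data set, since the original table is not reproduced here, and the function names are made up for the example.

```python
from collections import Counter, defaultdict
from math import log2

# Assumed (outlook, windy, play) columns of the 14-sample golf data set.
GOLF = [
    ("sunny", False, "no"), ("sunny", True, "no"), ("overcast", False, "yes"),
    ("rainy", False, "yes"), ("rainy", False, "yes"), ("rainy", True, "no"),
    ("overcast", True, "yes"), ("sunny", False, "no"), ("sunny", False, "yes"),
    ("rainy", False, "yes"), ("sunny", True, "yes"), ("overcast", True, "yes"),
    ("overcast", False, "yes"), ("rainy", True, "no"),
]
OUTLOOK, WINDY, PLAY = 0, 1, 2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())


def gain(rows, attr):
    """Entropy of the whole set minus the weighted entropy of its subsets."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[attr]].append(row[PLAY])
    after = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy([row[PLAY] for row in rows]) - after


print(round(entropy([row[PLAY] for row in GOLF]), 3))  # 0.94
print(round(gain(GOLF, OUTLOOK), 3))                   # 0.247
print(round(gain(GOLF, WINDY), 3))                     # 0.048
```

The unrounded gain of Outlook is about 0.2467; subtracting the already rounded values 0.940 and 0.694 gives the 0.246 quoted in the text.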

Everything seems perfect so far and the gain criterion makes good sense, but in fact the gain has a drawback. Consider the Day attribute in the golf data set (assume it is a genuine attribute, although you probably would not think of it as one). Day has 14 different values, so a test on Day has 14 branches, and each branch clearly covers a "pure" data set (one whose samples all belong to the same class). Its entropy gain is therefore the largest, and Day would be chosen as the first attribute by default. The reason is that the gain criterion is inherently biased towards attributes with more branches, that is, more values. We want to overcome this bias and evaluate all attributes impartially. Thus another criterion was proposed: the gain ratio. The gain ratio of an attribute A is defined as follows:

$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}(A)}$$

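Gain(A) is the gain calculated previously, and SplitInfo(A) is calculated as follows, where D_v is the subset of D with the v-th of the V values of A:

**Equation 2**

$$\mathrm{SplitInfo}(A) = -\sum_{v=1}^{V} \frac{|D_v|}{|D|}\log_2\frac{|D_v|}{|D|}$$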

Comparing with Equation 1, you will find that SplitInfo(A) is itself an entropy, but a different one: it is not the entropy of the final classification {PlayGolf?} but the entropy of the grouping by attribute {A?}, which reflects the amount of information in attribute A itself. By calculation we easily get GainRatio(Outlook) = 0.246/1.577 = 0.156. The gain ratio is in effect a normalization of the gain, which counteracts the criterion's tendency to favor attributes with many branches.
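The gain ratio of Outlook can be reproduced from class counts alone. In this sketch the helper names are made up for the example, and the per-branch counts for Outlook are assumed from the usual golf data set (sunny: 2 yes / 3 no, overcast: 4 / 0, rainy: 3 / 2).

```python
from math import log2


def entropy_from_counts(counts):
    """Entropy (in bits) of a distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)


def gain_ratio(branches):
    """branches: one (yes_count, no_count) pair per attribute value.

    Assumes at least two non-empty branches, otherwise SplitInfo is 0.
    """
    sizes = [sum(branch) for branch in branches]
    total = sum(sizes)
    class_totals = [sum(column) for column in zip(*branches)]
    after = sum(n / total * entropy_from_counts(b) for n, b in zip(sizes, branches))
    gain = entropy_from_counts(class_totals) - after
    split_info = entropy_from_counts(sizes)  # entropy of the grouping itself
    return gain / split_info


# Assumed class counts for Outlook: sunny, overcast, rainy.
outlook = [(2, 3), (4, 0), (3, 2)]
print(round(entropy_from_counts([5, 4, 5]), 3))  # 1.577
print(round(gain_ratio(outlook), 3))             # 0.156
```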

The method above yields a decision tree, and everything seems perfect, but this is not enough. A decision tree can help us classify new samples, but some questions remain that it does not answer well. For example, which attribute contributes most to the final classification? Can a more concise rule be used to decide which class a sample belongs to? And so on. To address these questions, researchers have proposed a number of new methods that build on the resulting decision tree.

3. Function of C4.5

3.1 Pruning of decision trees

Why should the decision tree be pruned? The reason is to avoid "over-fitting" the decision tree to the training samples. The decision tree generated by the preceding algorithm is very detailed and large: every attribute is considered in detail, and the training samples covered by each leaf node are "pure". If you use this decision tree to classify the training samples, you will find that it classifies them 100% correctly, because the tree itself is the product of fitting the training samples perfectly. This leads to a problem, however: if the training samples contain errors, the preceding algorithm makes the decision tree learn every one of these errors as well, which is "over-fitting". Professor Quinlan, the creator of C4.5, noticed this problem early on and showed experimentally that, on one data set, an over-fitted decision tree had a higher error rate than a simplified one. So the question now is how to build a simplified decision tree by pruning the original over-fitted tree.

The first and simplest method is pruning based on misclassification (reduced-error pruning). The idea is straightforward: since the complete decision tree is over-fitted, we use a separate test data set to correct it. For each subtree rooted at a non-leaf node of the complete tree, we try replacing it with a leaf node whose class is the most frequent class among the training samples covered by the subtree. This yields a simplified decision tree, and the two trees are compared on the test data set. If the simplified tree makes fewer errors on the test data set, and the subtree does not contain another subtree with the same property (that is, a subtree whose replacement by a leaf node lowers the error rate on the test data set), then the subtree can be replaced by the leaf node. The algorithm visits all subtrees bottom-up and terminates when no replacement improves performance on the test data set.
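The bottom-up procedure can be sketched as follows. The tree representation is an assumption made for this example: a leaf is a class label, and an internal node is a dictionary holding the tested attribute, its branches, and the majority class of the training samples it covers. The names `classify`, `errors`, and `prune` are likewise hypothetical.

```python
def classify(tree, row):
    """Follow branches until a leaf (a plain class label) is reached."""
    while isinstance(tree, dict):
        # Unseen attribute values fall back to the node's majority class.
        tree = tree["branches"].get(row[tree["attr"]], tree["majority"])
    return tree


def errors(tree, rows):
    """Number of rows that the tree (or leaf label) misclassifies."""
    return sum(classify(tree, row) != row["label"] for row in rows)


def prune(tree, rows):
    """Bottom-up pruning; rows are the test samples that reach this node."""
    if not isinstance(tree, dict):
        return tree
    attr = tree["attr"]
    candidate = dict(tree)
    # Prune the children first, each on its own share of the test samples.
    candidate["branches"] = {
        value: prune(subtree, [row for row in rows if row[attr] == value])
        for value, subtree in tree["branches"].items()
    }
    leaf = tree["majority"]
    # Replace the subtree only if the leaf makes strictly fewer errors.
    if errors(leaf, rows) < errors(candidate, rows):
        return leaf
    return candidate


# Hypothetical over-fitted tree and test samples.
tree = {"attr": "windy", "majority": "yes",
        "branches": {True: "no", False: "yes"}}
test_rows = [{"windy": True, "label": "yes"}, {"windy": False, "label": "yes"}]
print(prune(tree, test_rows))  # yes
```

Many implementations also prune on ties, because a smaller tree with the same error count is preferred; the strict comparison here follows the wording of the text.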

The first method is straightforward but requires an additional test data set. Can we do without it? Pessimistic pruning was proposed to solve this problem. It prunes based on the misclassification rate on the training set itself. Every node of a classification tree covers a set of samples, and these sample sets usually have some misclassification rate: a node becomes a leaf when it covers fewer samples than a certain threshold, so leaf nodes may misclassify some samples. Every internal node contains at least one leaf node, so it also has some misclassification rate. Pessimistic pruning recursively estimates the misclassification rate of the samples covered by each internal node. After pruning, the internal node becomes a leaf node whose class is determined by the best leaf of the original internal node. The error rates of the node before and after pruning are then compared to decide whether to prune. The procedure is the same as in the first method; the difference lies in how the error rate of an internal node is estimated before pruning.

The idea of pessimistic pruning is ingenious. When a subtree (with multiple leaf nodes) is replaced by a single leaf node, the misclassification rate on the training set certainly rises: for the same subset of samples, a subtree can assign several classes while a single leaf can assign only one, so the subtree is bound to be more accurate. We therefore add an empirical penalty factor when calculating the subtree's misclassification rate. For a leaf node that covers N samples, E of them misclassified, the error rate of the leaf is (E+0.5)/N, where 0.5 is the penalty factor. For a subtree with L leaf nodes, the error rate is then estimated as $(\sum_i E_i + 0.5 L) / \sum_i N_i$, where E_i and N_i are the error count and sample count of the i-th leaf. So although a subtree has multiple leaves, the penalty means that its estimated error rate is not necessarily lower. After pruning, the internal node becomes a leaf node, and its number of errors J also receives the penalty, becoming J+0.5. Whether the subtree can be pruned depends on whether J+0.5 lies within the standard error of the subtree's estimated number of errors.
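The penalized estimates are simple enough to write down directly. In this sketch the function names and the example counts are assumptions; each leaf is given as a hypothetical (errors, samples) pair.

```python
PENALTY = 0.5  # empirical penalty added per leaf


def leaf_error_rate(errors, samples):
    """Pessimistic error rate of a single leaf: (E + 0.5) / N."""
    return (errors + PENALTY) / samples


def subtree_error_rate(leaves):
    """Pessimistic error rate of a subtree given (E_i, N_i) for each leaf."""
    total_errors = sum(e for e, _ in leaves) + PENALTY * len(leaves)
    total_samples = sum(n for _, n in leaves)
    return total_errors / total_samples


# Hypothetical subtree with two leaves covering 6 and 4 samples.
leaves = [(1, 6), (0, 4)]
print(subtree_error_rate(leaves))  # 0.2
print(leaf_error_rate(2, 10))      # 0.25
```

With the penalty, the two-leaf subtree is charged one full extra error, so a single leaf with only one more raw error comes out close to it.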

The sample error rate e can be modeled with various distributions based on experience, such as the binomial distribution or the normal distribution. Take the binomial distribution as an example. What is the binomial distribution? In n independent repeated trials, let X be the number of occurrences of event A. If the probability of A in each trial is p, then the probability that A occurs exactly k times in the n trials is

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$

Its expectation is np and its variance is np(1-p). Rolling a die is a typical example: if a die is rolled 10 times, the number of times a 4 comes up follows the binomial distribution with n = 10 and p = 1/6.
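The dice example can be checked numerically. This short sketch computes the binomial probabilities for n = 10 and p = 1/6 and confirms that the mean and variance match np and np(1-p); the function name `binomial_pmf` is chosen for the example.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 1 / 6  # ten rolls, counting how often a 4 comes up
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
mean = sum(k * q for k, q in enumerate(pmf))
variance = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))

print(round(mean, 4), round(n * p, 4))                # both 1.6667
print(round(variance, 4), round(n * p * (1 - p), 4))  # both 1.3889
```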

If n = 1, that is, only a single trial is performed, the outcome can take only the two values 1 or 0, and it follows the Bernoulli distribution B(1, p). The Bernoulli distribution is a special case of the binomial distribution. For example, toss a coin and assign heads the value 1 and tails the value 0; the probability of heads is p = 0.5, so the value of the coin follows the Bernoulli distribution with probability 0.5.

As n tends to infinity, the binomial distribution approaches the normal distribution, as the figure shows.

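Now let a sample misclassified by the tree take the value 1 and a correctly classified sample the value 0, and let e be the probability that the tree misclassifies a sample (the misclassification rate; e is an intrinsic property of the distribution and can be estimated statistically). Each sample then follows a Bernoulli distribution, so the number of errors the tree makes on the N samples it covers follows a binomial distribution, and we can estimate the mean and standard deviation of the number of errors:

$$E(\text{subtree errors}) = N e, \qquad \mathrm{std}(\text{subtree errors}) = \sqrt{N e (1 - e)}$$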

When the subtree is replaced with a leaf node, the number of errors of the leaf also follows a binomial distribution, with error rate e = (J+0.5)/N, so the expected number of errors of the leaf node is

$$E(\text{leaf errors}) = N e = J + 0.5$$

Then, if the subtree can be substituted by a leaf node, it must meet the following criteria:

This condition is the pruning criterion. Using a confidence interval with a chosen significance level, we can estimate upper and lower bounds on the number of errors.
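The comparison can be sketched as follows. The exact inequality is not reproduced in this post, so the sketch assumes the commonly cited one-standard-error form: prune when the leaf's penalized error count does not exceed the subtree's penalized error count plus one standard deviation. The function name and the example counts are hypothetical.

```python
from math import sqrt


def should_prune(leaves, leaf_errors):
    """Decide whether a subtree can be replaced by a single leaf.

    leaves: (E_i, N_i) pairs for the leaves of the subtree.
    leaf_errors: J, the training errors if the subtree becomes one leaf.
    Assumes the one-standard-error rule described in the lead-in.
    """
    samples = sum(n for _, n in leaves)
    subtree_errors = sum(e for e, _ in leaves) + 0.5 * len(leaves)
    rate = subtree_errors / samples
    std = sqrt(samples * rate * (1 - rate))  # binomial standard deviation
    return leaf_errors + 0.5 <= subtree_errors + std


# Hypothetical subtree with two leaves covering 6 and 4 samples.
leaves = [(1, 6), (0, 4)]
print(should_prune(leaves, 2))  # True:  2.5 <= 2 + 1.26...
print(should_prune(leaves, 5))  # False: 5.5 >  2 + 1.26...
```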

The number of errors can also be modeled with a normal distribution; interested readers can work out the derivation themselves.

3.2 Improvements for continuous-valued attributes

The classification tree algorithm tends to prefer continuous-valued attributes over discrete-valued ones, because a continuous-valued attribute offers more possible splits and so tends to achieve the largest entropy gain. The algorithm needs to overcome this tendency. Recall how we overcame the bias towards discrete attributes with many values: we used the gain ratio. The gain ratio can likewise be used to overcome the bias towards continuous-valued attributes, and as the basis for selecting attributes this works without problems. However, if the gain ratio is also used to select the split point of a continuous-valued attribute, it has a side effect. The split point divides the samples into two parts, and the ratio of their sizes also affects the gain ratio. From the gain ratio formula we can see that when the split point divides the samples into two subsets of equal size (call this the equal-split point), the suppression by the gain ratio is strongest, so this split point is penalized excessively. Letting the subset sizes influence the split point is clearly unreasonable. Therefore the gain is still used to determine the split point, and the gain ratio is used only to select the attribute. This improvement suppresses the bias towards continuous-valued attributes well. There are other ways to suppress this bias, such as MDL; interested readers can consult the relevant articles.
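The division of labor described above, gain for the split point and gain ratio for the attribute, can be sketched as follows. The function name `evaluate_continuous` and the toy samples are assumptions for illustration; the sketch expects at least two distinct attribute values.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())


def evaluate_continuous(samples):
    """Return (threshold, gain_ratio) for (value, label) samples.

    The threshold is picked by plain gain; only the final score that is
    compared against other attributes uses the gain ratio.
    """
    total = len(samples)
    base = entropy([label for _, label in samples])
    values = sorted({value for value, _ in samples})
    best = None
    for a, b in zip(values, values[1:]):
        threshold = (a + b) / 2
        left = [label for value, label in samples if value <= threshold]
        right = [label for value, label in samples if value > threshold]
        after = (len(left) * entropy(left) + len(right) * entropy(right)) / total
        if best is None or base - after > best[1]:
            best = (threshold, base - after, len(left))
    threshold, gain, n_left = best
    split_info = entropy(["<="] * n_left + [">"] * (total - n_left))
    return threshold, gain / split_info


# Hypothetical samples: the only class change lies between 1 and 2.
toy = [(1, "a"), (2, "b"), (3, "b"), (4, "b")]
threshold, ratio = evaluate_continuous(toy)
```

Here the threshold 1.5 wins on gain even though it splits the samples unevenly, which is exactly the behavior that choosing the split point by gain ratio would distort.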

3.3 Handling missing attributes

What if some training samples, or samples to be classified, are missing some attribute values? Three questions need to be considered: I) When deciding which attribute to branch on, what should be done if some training samples are missing values for some attributes? II) Once an attribute has been selected, how should samples that are missing this attribute be handled when they are assigned to branches? III) Once the decision tree has been built, how should a sample to be classified be handled if it is missing some attributes? Quinlan proposed a series of ideas and methods for these three questions.

For question I), when calculating the gain or gain ratio of an attribute A, if some samples have no value for A, there are several ways to handle them: (I) ignore the samples that are missing attribute A; (C) give those samples the mean or the most common value of attribute A; (R) when calculating the gain or gain ratio, "discount" it according to the proportion of samples that are missing the attribute; (S) find a way to fill in the missing values based on the samples' other attributes.
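Option (R) can be sketched as follows: the gain is computed on the samples whose value is known and then scaled by the fraction of such samples. The function names, the use of `None` for a missing value, and the toy rows are assumptions made for this example.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())


def discounted_gain(rows, attr, target="label"):
    """Gain on the rows with a known value, discounted by their share."""
    known = [row for row in rows if row.get(attr) is not None]
    subsets = defaultdict(list)
    for row in known:
        subsets[row[attr]].append(row[target])
    after = sum(len(s) / len(known) * entropy(s) for s in subsets.values())
    gain = entropy([row[target] for row in known]) - after
    return len(known) / len(rows) * gain


# Hypothetical rows: half of them are missing the attribute "x".
rows = [
    {"x": "a", "label": "yes"}, {"x": "a", "label": "yes"},
    {"x": "b", "label": "no"}, {"x": "b", "label": "no"},
    {"x": None, "label": "yes"}, {"x": None, "label": "no"},
    {"x": None, "label": "yes"}, {"x": None, "label": "no"},
]
print(discounted_gain(rows, "x"))  # 0.5
```

On the known rows the attribute separates the classes perfectly (gain 1.0), but because only half of the rows have a value, the discounted gain is 0.5.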

For question II), when attribute A has been selected and the samples are being assigned to branches, samples that are missing attribute A can be handled as follows: (I) ignore these samples; (C) assign them the mean or the most common value of attribute A and then process them normally; (R) distribute them among the subsets in proportion to the sizes of the subsets formed by the samples that do have attribute A (there is no fixed rule for which of these samples go to subset 1 and which to subset 2, so they can be assigned at random); (A) assign them to all subsets, so that every subset contains these samples; (U) create a separate branch for the samples that are missing the attribute; (S) try to infer a value of attribute A for each such sample from its other attributes, and then assign it to the corresponding subset.

For question III), there are several options for a sample to be classified that is missing attribute A: (U) if there is a separate branch for missing values, follow that branch; (C) assign the sample the most common value of attribute A and then follow the corresponding branch; (S) fill in a value of attribute A based on the sample's other attributes and then follow the corresponding branch; (F) at the node that tests attribute A, follow all branches, explore all possible classification results, and combine them into a classification based on probability; (H) stop the classification at the node that tests attribute A and assign the class with the highest probability among the leaf nodes covered by that node.
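Option (F) can be sketched as a recursive walk that follows every branch when the tested attribute is missing and weights each result by the share of training samples that went down that branch. The tree layout, the stored weights, and the deliberately simplified one-level tree below are assumptions made for the example.

```python
def class_distribution(tree, row):
    """Return {class: probability} for a row that may lack attribute values.

    A leaf is a class label; an internal node is a dict with "attr" and
    "branches", where each branch maps a value to (weight, subtree) and the
    weight is the fraction of training samples that took the branch.
    """
    if not isinstance(tree, dict):
        return {tree: 1.0}
    value = row.get(tree["attr"])
    if value is not None:
        _, subtree = tree["branches"][value]
        return class_distribution(subtree, row)
    combined = {}
    for weight, subtree in tree["branches"].values():
        for label, prob in class_distribution(subtree, row).items():
            combined[label] = combined.get(label, 0.0) + weight * prob
    return combined


# Hypothetical, deliberately simplified one-level tree.
tree = {"attr": "outlook", "branches": {
    "sunny": (5 / 14, "no"),
    "overcast": (4 / 14, "yes"),
    "rainy": (5 / 14, "yes"),
}}
dist = class_distribution(tree, {})  # outlook is missing
prediction = max(dist, key=dist.get)
print(prediction)  # yes
```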

3.4 Inference Rules

C4.5 can generate a set of rules from a decision tree; a decision tree can be regarded as a combination of rules. A rule corresponds to a path from the root node to a leaf node: the conditions of the rule are the conditions on the path, and its conclusion is the class of the leaf node. C4.5 first generates one rule for each leaf node of the decision tree. For each rule in the rule set, the algorithm then uses hill-climbing search to see whether any conditions can be removed. Because removing a condition is essentially the same as pruning an internal node, the pessimistic pruning algorithm described earlier is also used here to simplify the rules. The MDL criterion can also be used here to measure the amount of information needed to encode a rule and to rank the candidate rules. The number of simplified rules is much smaller than the number of leaf nodes in the decision tree, and the original decision tree cannot be reconstructed from the simplified rule set. Rule sets are easier to act on than decision trees, so in many cases we need to derive a rule set from the decision tree. One drawback of C4.5 is that as the data set grows, the learning time grows rapidly.
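The first step, turning each root-to-leaf path into a rule, can be sketched as follows. The rule simplification by hill-climbing is not shown. The tree layout and the small golf-like tree are assumptions made for this example, not the tree from the original figure.

```python
def extract_rules(tree, conditions=()):
    """Return one (conditions, class) rule per leaf of the tree.

    A leaf is a class label; an internal node is a dict with "attr" and
    "branches" mapping each attribute value to a subtree.
    """
    if not isinstance(tree, dict):
        return [(list(conditions), tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        rules += extract_rules(subtree, conditions + ((tree["attr"], value),))
    return rules


# Hypothetical tree, for illustration only.
tree = {"attr": "outlook", "branches": {
    "sunny": {"attr": "humidity", "branches": {"high": "no", "normal": "yes"}},
    "overcast": "yes",
    "rainy": {"attr": "windy", "branches": {True: "no", False: "yes"}},
}}

for conditions, label in extract_rules(tree):
    tests = " and ".join(f"{attr} = {value}" for attr, value in conditions)
    print(f"if {tests} then {label}")
```

Each leaf produces exactly one rule, so this tree yields five rules before any simplification.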

4. Available C4.5 packages


The original implementation of C4.5 can be obtained from Professor Quinlan's personal webpage, http://www.rulequest.com/Personal/. It is written in C, and it is not free: you must follow its license requirements for commercial use. The open-source packages MLC++ and Weka both include implementations of C4.5 that you can refer to. Some commercial implementations are also available, such as ODBCMINE.

C4.5 is an early machine learning algorithm that is widely used in industry.
