C4.5 algorithm (excerpt)


1. Introduction to the C4.5 algorithm

C4.5 is a family of algorithms used for classification problems in machine learning and data mining. It performs supervised learning: given a dataset in which each tuple is described by a set of attribute values and belongs to one of a set of mutually exclusive classes, the goal of C4.5 is to learn a mapping from attribute values to classes that can be applied to classify new, previously unseen entities.

C4.5 was proposed by J. Ross Quinlan as an extension of ID3. The ID3 algorithm is used to construct decision trees. A decision tree is a flowchart-like tree structure in which each internal (non-leaf) node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. Once a decision tree has been built, a tuple with an unknown class label is classified by tracing a path from the root node to a leaf node; that leaf holds the prediction for the tuple. The advantage of decision trees is that they require neither domain knowledge nor parameter setting, which makes them well suited to exploratory knowledge discovery.

Two algorithms were derived from ID3, C4.5 and CART, both of which are very important in data mining. The following is a typical example of the C4.5 algorithm producing a decision tree from a dataset.

The dataset shown in Figure 1 represents the relationship between weather conditions and whether to go golfing.

Figure 1 Data set

Figure 2 Decision tree generated by C4.5 on the dataset

2. Algorithm description

C4.5 is not a single algorithm but a set of algorithms: C4.5, C4.5 without pruning, and C4.5 rules. The pseudo-code below gives the basic workflow:

Figure 3 C4.5 algorithm flow

A natural question arises: a tuple in the dataset has many attributes, so how do we know which attribute to test first and which to test next? In other words, in Figure 2, how do we know that the first attribute to test is outlook rather than windy? The concept that answers these questions is the attribute selection measure.
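To make this workflow concrete, here is a minimal sketch in R (not Quinlan's actual pseudo-code): it handles only categorical attributes, skips pruning, and uses information gain as the selection measure, whereas C4.5 itself uses the gain ratio introduced below. The function and variable names are my own.

# Minimal sketch of the recursive tree-building workflow (categorical attributes only).
entropy <- function(y) {                        # Info(D): entropy of a vector of class labels
  p <- table(y) / length(y)
  -sum(ifelse(p > 0, p * log2(p), 0))
}

info_gain <- function(x, y) {                   # Gain(A) = Info(D) - Info_A(D)
  subsets <- split(y, x, drop = TRUE)
  info_a <- sum(sapply(subsets, function(s) length(s) / length(y) * entropy(s)))
  entropy(y) - info_a
}

build_tree <- function(data, target, attrs) {
  y <- data[[target]]
  # Stop when the node is pure or no attributes remain: create a leaf with the majority class.
  if (length(unique(y)) == 1 || length(attrs) == 0)
    return(names(which.max(table(y))))
  scores <- sapply(attrs, function(a) info_gain(data[[a]], y))
  best <- attrs[which.max(scores)]              # attribute with the best measure score
  branches <- lapply(split(data, data[[best]], drop = TRUE),
                     function(d) build_tree(d, target, setdiff(attrs, best)))
  list(split_on = best, branches = branches)    # internal node: a test on 'best'
}

Called on the weather data of Figure 1, for example as build_tree(weather, "play", c("outlook", "temperature", "humidity", "windy")) with hypothetical column names, it returns a nested list describing a tree like the one in Figure 2.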

3. Attribute Selection Metrics

Attribute selection measures are also called splitting rules because they determine how the tuples at a given node are split. An attribute selection measure provides a ranking for each attribute describing the given training tuples, and the attribute with the best score is chosen as the splitting attribute for those tuples. At present, the most popular attribute selection measures are information gain, gain ratio, and the Gini index.

(1) Information gain

Information gain is the attribute selection measure used in the ID3 algorithm. It chooses the attribute with the highest information gain as the splitting attribute of node N. This attribute minimizes the information needed to classify the tuples in the resulting partitions. The expected information needed to classify a tuple in D is given by:

Info(D) = - Σ_{i=1..m} p_i * log2(p_i)    (1)

where p_i is the probability that a tuple in D belongs to class C_i.

Info(D) is also called the entropy of D.
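As a quick illustration of formula (1) (a tiny R snippet with made-up class counts), an evenly split node has entropy 1 bit while a pure node has entropy 0:

# Entropy of a class distribution given as counts, per formula (1).
info <- function(counts) {
  p <- counts / sum(counts)
  -sum(ifelse(p > 0, p * log2(p), 0))
}

info(c(7, 7))   # evenly split classes -> 1 bit
info(c(14, 0))  # pure node            -> 0 bits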

Now suppose that the tuples in D are partitioned on attribute A, and that A splits D into v different subsets. After this partitioning, the information still needed to arrive at an exact classification is measured by:

Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) * Info(D_j)    (2)

The information gain is defined as the difference between the original information requirement (i.e., based only on the class proportions) and the new requirement (i.e., the one obtained after partitioning on A), that is:

Gain(A) = Info(D) - Info_A(D)    (3)

I think many people find this part hard to understand at first, so after studying how the literature describes it and comparing it with the three formulas above, here is my own understanding.

In general, a tuple with many attributes can almost never be separated using a single attribute; otherwise the depth of the decision tree would only be 2. As can be seen, once we select an attribute A and it divides the tuples into, say, two parts A1 and A2, each of A1 and A2 can be subdivided further using other attributes, which raises a new question: which attribute do we choose next for classification? The expected information needed to classify the tuples of D is Info(D); similarly, when we divide D by A into v subsets D_j (j = 1, 2, ..., v), the information needed to classify the tuples of D_j is Info(D_j), and since there are v subsets, the information needed to classify all of them is given by formula (2). So if formula (2) is smaller, does that not mean that less information is needed to classify the subsets produced by A? For a given training set, Info(D) is in fact fixed, so we select the attribute with the largest information gain as the split point.

However, information gain has a drawback: it is biased toward attributes with a large number of distinct values. What does that mean? In a training set, the more distinct values an attribute takes, the more likely it is to be chosen as the splitting attribute. For example, suppose a training set has 10 tuples and an ID-like attribute A takes the 10 values 1-10. Splitting on A produces 10 partitions, each containing a single tuple, so Info(D_j) = 0 for every partition, formula (2) evaluates to 0, and the information gain (3) of this split is the largest possible. Yet the split is obviously meaningless.
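To make the bias concrete, the following R snippet (an illustration with an assumed 6 "yes" / 4 "no" labeling of the 10 tuples) shows that an ID-like attribute with one distinct value per tuple achieves the maximum possible gain, namely Info(D) itself, even though the split is useless:

entropy <- function(y) { p <- table(y) / length(y); -sum(ifelse(p > 0, p * log2(p), 0)) }
info_a <- function(x, y) {                      # formula (2)
  sum(sapply(split(y, x, drop = TRUE), function(s) length(s) / length(y) * entropy(s)))
}

y  <- c(rep("yes", 6), rep("no", 4))            # assumed class labels for the 10 tuples
id <- as.character(1:10)                        # attribute with a distinct value per tuple

entropy(y)                  # Info(D)    = 0.971
info_a(id, y)               # Info_id(D) = 0, since every partition is pure
entropy(y) - info_a(id, y)  # Gain(id)   = 0.971, the largest possible gain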

(2) Information gain ratio

It is for this reason that C4.5, the successor to ID3, uses the concept of the gain ratio, which normalizes the information gain using a split information value. The split information is defined analogously to Info(D):

SplitInfo_A(D) = - Σ_{j=1..v} (|D_j| / |D|) * log2(|D_j| / |D|)    (4)

This value represents the information generated by splitting the training dataset D into v partitions corresponding to the v outcomes of a test on attribute A. The gain ratio is defined as:

GainRatio(A) = Gain(A) / SplitInfo_A(D)    (5)

The attribute with the maximum gain ratio is selected as the splitting attribute.
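Continuing the 10-tuple illustration above (same assumed labels), the split information of the ID-like attribute is log2(10) ≈ 3.32, so normalizing by it sharply reduces that attribute's score:

entropy <- function(y) { p <- table(y) / length(y); -sum(ifelse(p > 0, p * log2(p), 0)) }

y  <- c(rep("yes", 6), rep("no", 4))
id <- as.character(1:10)

gain       <- entropy(y) - 0     # Info_id(D) = 0, as computed in the previous snippet
split_info <- entropy(id)        # formula (4): each value holds 1/10 of the tuples -> log2(10)
gain / split_info                # gain ratio ~= 0.971 / 3.32 = 0.29, no longer overwhelming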

(3) Gini index

The Gini index is used in CART. It measures the impurity of a data partition or training tuple set D and is defined as:

Gini(D) = 1 - Σ_{i=1..m} p_i^2    (6)

where p_i is the probability that a tuple in D belongs to class C_i.
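Although the rest of this article uses the gain ratio, formula (6) is just as easy to evaluate. The snippet below computes the Gini index of the 9 "yes" / 5 "no" class distribution used in the worked example that follows:

# Gini index of a class distribution given as counts, per formula (6).
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

gini(c(9, 5))   # ~0.459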

Using the dataset above (all values here are discrete; continuous values are discussed below), let us walk through node selection by gain ratio:

The training set above has 4 attributes, i.e. the attribute set A = {outlook, temperature, humidity, windy}, and there are 2 class labels, i.e. the class label set C = {yes, no}, denoting suitable and not suitable for outdoor sports respectively. It is in fact a binary classification problem.
Dataset D contains 14 training samples, of which 9 belong to class "yes" and 5 to class "no". The entropy of D, i.e. the value of formula (1), is calculated as:

Info(D) = -9/14 * log2(9/14) - 5/14 * log2(5/14) = 0.940

The expected information Info_A(D) from formula (2), written Info(OUTLOOK), Info(TEMPERATURE), and so on, is then computed for each attribute in the attribute set:

Info(OUTLOOK) = 5/14 * [- 2/5 * log2(2/5) - 3/5 * log2(3/5)] + 4/14 * [- 4/4 * log2(4/4) - 0/4 * log2(0/4)] + 5/14 * [- 3/5 * log2(3/5) - 2/5 * log2(2/5)] = 0.694
Info(TEMPERATURE) = 4/14 * [- 2/4 * log2(2/4) - 2/4 * log2(2/4)] + 6/14 * [- 4/6 * log2(4/6) - 2/6 * log2(2/6)] + 4/14 * [- 3/4 * log2(3/4) - 1/4 * log2(1/4)] = 0.911
Info(HUMIDITY) = 7/14 * [- 3/7 * log2(3/7) - 4/7 * log2(4/7)] + 7/14 * [- 6/7 * log2(6/7) - 1/7 * log2(1/7)] = 0.789
Info(WINDY) = 6/14 * [- 3/6 * log2(3/6) - 3/6 * log2(3/6)] + 8/14 * [- 6/8 * log2(6/8) - 2/8 * log2(2/8)] = 0.892

Based on the values above, we can calculate the information gain of each candidate attribute for the root node as follows:

Gain(OUTLOOK) = Info(D) - Info(OUTLOOK) = 0.940 - 0.694 = 0.246
Gain(TEMPERATURE) = Info(D) - Info(TEMPERATURE) = 0.940 - 0.911 = 0.029
Gain(HUMIDITY) = Info(D) - Info(HUMIDITY) = 0.940 - 0.789 = 0.151
Gain(WINDY) = Info(D) - Info(WINDY) = 0.940 - 0.892 = 0.048
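These numbers can be reproduced in a few lines of R. The sketch below starts from the per-value yes/no counts implied by the calculations above (e.g. outlook: sunny 2/3, overcast 4/0, rainy 3/2) rather than from the raw table in Figure 1; the value labels are my own reading of those counts.

# Entropy from class counts (formula (1)) and expected information Info_A(D) (formula (2)).
info <- function(counts) {
  p <- counts / sum(counts)
  -sum(ifelse(p > 0, p * log2(p), 0))
}
info_a <- function(m) sum(rowSums(m) / sum(m) * apply(m, 1, info))  # rows = attribute values

info_D <- info(c(9, 5))                                        # 0.940

outlook     <- rbind(sunny = c(2, 3), overcast = c(4, 0), rainy = c(3, 2))  # (yes, no) counts
temperature <- rbind(hot = c(2, 2), mild = c(4, 2), cool = c(3, 1))
humidity    <- rbind(high = c(3, 4), normal = c(6, 1))
windy       <- rbind(true = c(3, 3), false = c(6, 2))

info_D - info_a(outlook)      # Gain(OUTLOOK)     = 0.246
info_D - info_a(temperature)  # Gain(TEMPERATURE) = 0.029
info_D - info_a(humidity)     # Gain(HUMIDITY)    = 0.151
info_D - info_a(windy)        # Gain(WINDY)       = 0.048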

Next, we calculate the split information SplitInfo from formula (4), written here as H(·):

    • Outlook attribute

Attribute outlook has 3 values: sunny with 5 samples, rainy with 5 samples, and overcast with 4 samples, so:

H(OUTLOOK) = - 5/14 * log2(5/14) - 5/14 * log2(5/14) - 4/14 * log2(4/14) = 1.577406282852345

    • Temperature attribute

Attribute temperature has 3 values: hot with 4 samples, mild with 6 samples, and cool with 4 samples, so:

H(TEMPERATURE) = - 4/14 * log2(4/14) - 6/14 * log2(6/14) - 4/14 * log2(4/14) = 1.5566567074628228

    • Humidity attribute

Attribute humidity has 2 values: normal with 7 samples and high with 7 samples, so:

H(HUMIDITY) = - 7/14 * log2(7/14) - 7/14 * log2(7/14) = 1.0

    • Windy attribute

Attribute windy has 2 values: true with 6 samples and false with 8 samples, so:

H(WINDY) = - 6/14 * log2(6/14) - 8/14 * log2(8/14) = 0.9852281360342516

Based on the results above, we can calculate the information gain ratio of each attribute as follows:

IGR(OUTLOOK) = Gain(OUTLOOK) / H(OUTLOOK) = 0.246 / 1.577406282852345 = 0.15595221261270145
IGR(TEMPERATURE) = Gain(TEMPERATURE) / H(TEMPERATURE) = 0.029 / 1.5566567074628228 = 0.018629669509642094
IGR(HUMIDITY) = Gain(HUMIDITY) / H(HUMIDITY) = 0.151 / 1.0 = 0.151
IGR(WINDY) = Gain(WINDY) / H(WINDY) = 0.048 / 0.9852281360342516 = 0.048719680492692784

Based on the computed gain ratios, the attribute with the best score in the attribute set is chosen as the decision tree node and split on. Since the gain ratio IGR of outlook is the largest, we select outlook as the first (root) node.
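The split information and gain ratios above can be reproduced the same way, this time from the number of samples per attribute value:

info <- function(counts) {                 # reused here as SplitInfo, formula (4)
  p <- counts / sum(counts)
  -sum(ifelse(p > 0, p * log2(p), 0))
}

gain <- c(outlook = 0.246, temperature = 0.029, humidity = 0.151, windy = 0.048)

split_info <- c(outlook     = info(c(5, 5, 4)),   # sunny, rainy, overcast
                temperature = info(c(4, 6, 4)),   # hot, mild, cool
                humidity    = info(c(7, 7)),      # normal, high
                windy       = info(c(6, 8)))      # true, false

round(gain / split_info, 3)            # 0.156, 0.019, 0.151, 0.049
names(which.max(gain / split_info))    # "outlook" is chosen as the root node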

4. Algorithmic features

4.1 Pruning of decision trees

When a decision tree is created, many of its branches reflect anomalies in the training data caused by noise and outliers. Pruning methods are used to deal with this overfitting problem; they typically use statistical measures to cut off the least reliable branches.

Pruning methods generally fall into two categories: pre-pruning and post-pruning.

Pre-pruning prunes the tree in advance by stopping its construction early (for example, by deciding that a node will not be split further, so the subset of training tuples at that node is not divided). Once growth stops, the node becomes a leaf, which may be labeled with the most frequent class among the tuples it holds. There are many pre-pruning criteria, for example (a simple sketch of such checks is given below):

(1) Stop growing the tree once it reaches a certain height.
(2) Stop growing when the instances reaching a node share the same feature vector, even if they do not all belong to the same class.
(3) Stop growing when the number of instances reaching a node is smaller than some threshold; the drawback is that special cases with small amounts of data cannot be handled.
(4) Compute the gain in system performance of each expansion, and stop growing if it falls below a certain threshold.

A disadvantage of pre-pruning is the "horizon effect": under the same criterion, the current expansion may not meet the requirement even though a further expansion would. Pre-pruning can therefore stop the growth of the decision tree prematurely.
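Criteria (1), (3) and (4) above amount to simple checks made before each split. The sketch below is only an illustration; the thresholds max_depth, min_samples and min_gain are hypothetical names and values, not C4.5 parameters.

# Hypothetical pre-pruning check: stop growing (make a leaf) if any criterion fires.
should_stop <- function(y, depth, best_gain,
                        max_depth = 5, min_samples = 5, min_gain = 0.01) {
  depth >= max_depth      ||   # criterion (1): the tree has reached a maximum height
  length(y) < min_samples ||   # criterion (3): too few instances reach this node
  best_gain < min_gain         # criterion (4): the best split improves things too little
}

should_stop(y = rep("yes", 3), depth = 2, best_gain = 0.2)   # TRUE: only 3 instances remain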

Another, more common approach is post-pruning, in which subtrees are cut from a fully grown tree. A node is pruned by removing its branches and replacing it with a leaf, which is usually labeled with the most frequent class in the subtree. There are two post-pruning methods:

The first and simplest method is pruning based on misclassification error. The idea is direct: the complete decision tree overfits, so we use a separate test dataset to correct it. For each subtree rooted at a non-leaf node of the complete tree, we try replacing it with a leaf node whose class is the majority class of the training samples covered by the subtree, producing a simplified decision tree. We then compare the performance of the two trees on the test dataset. If the simplified tree makes fewer errors on the test dataset, and the subtree does not contain another subtree with the same property (that is, one whose replacement by a leaf would also lower the error rate on the test set), then the subtree can be replaced by a leaf node. The algorithm traverses all subtrees bottom-up and terminates when no subtree can be replaced in a way that improves performance on the test dataset.

The first method is straightforward but requires an additional test dataset. Can we avoid that extra dataset? Pessimistic pruning was proposed to solve this problem. Pessimistic pruning recursively estimates the misclassification rate of the samples covered by each internal node. After pruning, an internal node becomes a leaf node whose class is determined by the optimal leaf of the original internal node. The error rates of the node before and after pruning are then compared to decide whether to prune. The method is consistent with the first approach described above; the difference lies in how the error rate of an internal node of the classification tree is estimated before pruning.

If a subtree (with multiple leaf nodes) is replaced by a leaf node, the misclassification rate on the training set will certainly rise, but it will not necessarily rise on new data. So when computing the subtree's misclassification count we add an empirical penalty factor. For a leaf node covering N_i samples of which E are misclassified, the error rate of that leaf is (E + 0.5) / N_i; the 0.5 is a penalty factor (see continuity correction for details). For a subtree with L leaf nodes covering N samples in total and making E errors in total, the misclassification rate of the subtree is estimated as e = (E + 0.5 * L) / N. Thus, although a subtree has multiple leaf nodes, its estimated misclassification rate is not necessarily lower once the penalty factor is added. After pruning, the internal node becomes a leaf node, and its error count J also receives the penalty factor, becoming J + 0.5. Whether the subtree can be pruned then depends on comparing J + 0.5 with the subtree's corrected error count plus its standard deviation. For the sample error rate e, we can assume a distribution model based on experience, such as a binomial distribution or a normal distribution.

So, for a tree built on the data, score a misclassified sample as 1 and a correctly classified sample as 0. If the corrected misclassification rate of the subtree is e_1 (which can be computed as above), then the subtree's number of misclassifications follows a binomial distribution, and its mean and standard deviation can be estimated as:

E(subtree error count) = N * e_1
std(subtree error count) = sqrt(N * e_1 * (1 - e_1))

where N is the number of samples covered by the subtree and e_1 = (E + 0.5 * L) / N is the corrected error rate defined above.

When the subtree is replaced with a leaf node, the leaf's number of misclassifications also follows a binomial distribution. With N the number of samples reaching the leaf node and the leaf's corrected error rate e_2 = (J + 0.5) / N, the expected number of misclassifications of the leaf node is:

E(leaf error count) = N * e_2 = J + 0.5

On the training data, the subtree always makes fewer errors than the leaf node that replaces it, but this is no longer guaranteed once the corrected error estimates are used. We decide to prune when the leaf's expected error count does not exceed the subtree's expected error count plus one standard deviation:

E(leaf error count) <= E(subtree error count) + std(subtree error count)

This condition is the pruning criterion.

Put simply, we check whether the error after pruning would become much larger, that is, larger than the error before pruning plus its standard deviation. If so, we do not prune; otherwise, we prune. Here is a concrete example of how the pruning decision is made.

For example, consider the following sub-decision tree, in which T1, T2, T3, T4, T5 are non-leaf nodes and T6, T7, T8, T9, T10, T11 are leaf nodes. The total number of samples is N = 80, of which 55 belong to class A and 25 to class B.

Node   E(subtree)   SD(subtree)   E(subtree) + SD(subtree)   E(leaf)   Prune?
T1     8            2.68          10.68                      25.5      No
T2     5            2.14          7.14                       10.5      No
T3     3            1.60          4.60                       5.5       No
T4     4            1.92          5.92                       4.5       Yes
T5     1            0.95          1.95                       4.5       No

At this point only node T4 satisfies the pruning criterion, so we can prune T4, that is, turn T4 directly into a leaf node of class A.
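The T1 row of the table can be reproduced with a small helper. This is a sketch: it takes the already-corrected subtree error count E(subtree) as input, and only T1 is recomputed here because the text does not give the per-node sample counts for T2-T5.

# Pessimistic pruning check: prune if E(leaf) <= E(subtree) + SD(subtree).
prune_check <- function(e_subtree, n, errors_as_leaf) {
  sd_subtree <- sqrt(e_subtree * (n - e_subtree) / n)   # binomial std of the error count
  e_leaf     <- errors_as_leaf + 0.5                    # leaf error count with the 0.5 penalty
  list(e_plus_sd = e_subtree + sd_subtree, e_leaf = e_leaf,
       prune = e_leaf <= e_subtree + sd_subtree)
}

# Node T1: N = 80 samples, corrected subtree error count 8; as a class-A leaf it would
# misclassify the 25 class-B samples.
prune_check(e_subtree = 8, n = 80, errors_as_leaf = 25)   # 10.68 vs 25.5 -> prune = FALSE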

The margin does not have to be exactly one standard deviation, however. The method can be extended to pruning based on a confidence interval (CI): model the error rate e of the leaf node as a random variable following a binomial distribution, and for a confidence threshold CI take an upper bound e_max such that P(e < e_max) = 1 - CI (the default CI value in the C4.5 algorithm is 0.25); if P(e < e_max) > 1 - CI, then prune. Going one step further, we can approximate e with a normal distribution (as long as N is large enough). Under these constraints, the upper bound on the expected error used by the C4.5 algorithm, e_max (generally taken from the Wilson score interval), is:

e_max = (f + z^2 / (2N) + z * sqrt(f/N - f^2/N + z^2 / (4N^2))) / (1 + z^2 / N)

where f is the observed error rate over the N samples reaching the node.
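Under the stated assumptions, the Wilson upper bound can be computed as follows (a sketch: E and N are the error count and sample count at the node, and the example call uses arbitrary numbers):

# Upper confidence bound on a node's error rate (Wilson score interval), as used by
# C4.5-style error-based pruning.
e_max <- function(E, N, CI = 0.25) {
  f <- E / N                       # observed error rate at the node
  z <- qnorm(1 - CI)               # one-sided normal quantile, ~0.674 for CI = 0.25
  (f + z^2 / (2 * N) + z * sqrt(f / N - f^2 / N + z^2 / (4 * N^2))) / (1 + z^2 / N)
}

e_max(E = 2, N = 6)   # pessimistic error estimate for a leaf with 2 errors out of 6 samples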

Here z is chosen according to the desired confidence, assuming z follows a normal random variable with zero mean and unit variance, i.e. N(0, 1). The Wilson score interval is chosen as the upper bound mainly because it keeps good properties when there are few samples or extreme probability values in the dataset. About the Wilson score interval, see: http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Normal_approximation_interval

4.2 Handling of continuous attributes

Discretization: continuous attribute variables are discretized to form the decision tree training set, in three steps:

1. Sort the samples (at the root node) or the sample subset (at a subtree) in ascending order of the continuous variable.
2. If the attribute takes N distinct values, there are N-1 candidate split thresholds, each being the midpoint of a pair of adjacent values in the sorted list.
3. Use the gain ratio to choose the best of these candidate splits.

4.3 Handling of missing values

Missing values: in some cases, the available data may lack values for some attributes. For example, (x, y) is a training instance in the sample set S with x = (F1_v, F2_v, ..., Fn_v), but the value Fi_v of attribute Fi is unknown. Processing strategies:

1. One strategy is to assign the value of this attribute that is most common among the training instances at node t.
2. A more complex strategy is to assign a probability to each possible value of Fi. For example, for a Boolean attribute Fi, if node t contains 6 instances with Fi_v = 1 and 4 with Fi_v = 0, then the probability that Fi_v = 1 is 0.6 and the probability that Fi_v = 0 is 0.4. Thus 60% of instance x is assigned to the Fi_v = 1 branch and 40% to the other branch. These fractional examples are used to compute the information gain and, if a second attribute with missing values must be tested, they can be subdivided further in subsequent branches of the tree. (This is the strategy used in C4.5.)
3. The simplest strategy is to discard these samples.

4.4 Advantages and disadvantages of the C4.5 algorithm

Advantages: the classification rules produced are easy to understand, and the accuracy is high.
Disadvantage: during tree construction the dataset must be scanned and sorted repeatedly, which makes the algorithm inefficient.

5. Code implementation

The code runs in the R language on the iris dataset; the three packages "RWeka", "party" and "partykit" just need to be installed first, and then the example code is run (a sketch of that code is given below). In that listing, lines 6-8 load the packages and need no explanation, line 9 loads the iris dataset, and line 10 calls the function J48 (i.e. C4.5) from Weka; the parameters are clear: Species is the dependent variable, the remaining columns are the independent variables, and the dataset is iris.
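The code itself is not included in this excerpt, so the listing below is a reconstruction of what it most likely contains rather than the author's exact code; the package names and the J48 call follow the description above.

# Install the required packages once (RWeka additionally needs a Java runtime).
install.packages(c("RWeka", "party", "partykit"))

library(RWeka)                        # load the packages (lines 6-8 of the original listing)
library(party)
library(partykit)

data(iris)                            # load the iris dataset (line 9)
m <- J48(Species ~ ., data = iris)    # C4.5/J48: Species is the dependent variable (line 10)
m                                     # print the pruned tree (line 11)
plot(m)                               # tree structure chart with per-leaf bar plots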
Lines 10 and 11 build and display the pruned tree. In the first line of the printed result, when petal width <= 0.6 there are 50 setosa samples; when petal width > 0.6, the tree next checks whether petal width is greater than 1.7, and so on. Reading the output alongside the tree structure chart makes it easier to follow; I believe you can work out the rest. In the tree-shaped result graph, the horizontal axes of the last five bar plots show the flower species, and the bars indicate the classification accuracy. The last two lines of the output give the number of leaf nodes and the size of the tree (the total number of nodes).

References:
http://blog.csdn.net/xuxurui007/article/details/18045943
http://www.cnblogs.com/zhangchaoyang/articles/2842490.html
http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Normal_approximation_interval
http://www.biostatistic.net/thread-95651-1-1.html

Original link: http://blog.csdn.net/x454045816/article/details/44726921
