C4.5 (Decision Tree)


C4.5 is a family of algorithms used for classification problems in machine learning and data mining. It performs supervised learning: given a dataset in which each tuple is described by a set of attribute values and belongs to exactly one class from a set of mutually exclusive categories, C4.5 learns a mapping from attribute values to classes, and this mapping can then be used to classify new entities whose class is unknown.

C4.5 was proposed by J. Ross Quinlan as an extension of ID3. The ID3 algorithm is used to construct decision trees. A decision tree is a flowchart-like tree structure in which each internal (non-leaf) node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. Once a decision tree has been built, a tuple with an unknown class label is classified by tracing a path from the root node to a leaf node; the leaf node holds the prediction for that tuple. An advantage of decision trees is that they require no domain knowledge or parameter setting and are well suited to exploratory knowledge discovery.

Two algorithms were derived from ID3: C4.5 and CART, both of which are very important in data mining. Below is a typical example of C4.5 producing a decision tree from a dataset.

The dataset shown in Figure 1 represents the relationship between weather conditions and whether to go golfing.

Figure 1: Dataset

Figure 2: Decision tree generated by C4.5 on the dataset
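The figures themselves are not reproduced here. For concreteness, here is a minimal sketch assuming the classic play-golf weather data used in Quinlan's examples; the exact rows and attribute names are assumptions on my part, since Figure 1 is not shown, and the classify() function mirrors the tree C4.5 is usually shown producing on this data (presumably the tree in Figure 2):

```python
# A hypothetical version of the Figure 1 dataset: 14 weather observations and
# whether golf was played. Each tuple: (outlook, temperature, humidity, windy, play)
golf_dataset = [
    ("sunny",    "hot",  "high",   False, "no"),
    ("sunny",    "hot",  "high",   True,  "no"),
    ("overcast", "hot",  "high",   False, "yes"),
    ("rainy",    "mild", "high",   False, "yes"),
    ("rainy",    "cool", "normal", False, "yes"),
    ("rainy",    "cool", "normal", True,  "no"),
    ("overcast", "cool", "normal", True,  "yes"),
    ("sunny",    "mild", "high",   False, "no"),
    ("sunny",    "cool", "normal", False, "yes"),
    ("rainy",    "mild", "normal", False, "yes"),
    ("sunny",    "mild", "normal", True,  "yes"),
    ("overcast", "mild", "high",   True,  "yes"),
    ("overcast", "hot",  "normal", False, "yes"),
    ("rainy",    "mild", "high",   True,  "no"),
]
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]

def classify(outlook, temperature, humidity, windy):
    """Classify one tuple by tracing a root-to-leaf path, as described above.
    This hand-written tree mirrors the classic result on the golf data; note
    that temperature is not tested anywhere in it."""
    if outlook == "overcast":
        return "yes"
    if outlook == "sunny":
        return "yes" if humidity == "normal" else "no"
    # outlook == "rainy"
    return "no" if windy else "yes"

print(classify("sunny", "cool", "high", False))  # -> "no"
```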

Algorithm description

Strictly speaking, C4.5 is not a single algorithm but a set of algorithms: C4.5, non-pruning C4.5, and C4.5 rules. Figure 3 gives the basic workflow of the C4.5 algorithm:

Figure 3: C4.5 algorithm flow

One might ask: a tuple has many attributes, so how do we know which attribute to test first and which to test next? In other words, in Figure 2, how do we know that the first attribute to test is outlook rather than windy? The concept that answers these questions is the attribute selection measure.

Attribute selection measures

Attribute selection measures are also called splitting rules, because they determine how the tuples at a given node are split. An attribute selection measure provides a ranking for each attribute describing the given training tuples, and the attribute with the best score is chosen as the splitting attribute for those tuples. Currently the most popular attribute selection measures are information gain, gain ratio, and the Gini index.

First, some notation. Let D be a training set of class-labeled tuples, and suppose the class label attribute has m distinct values defining m distinct classes C_i (i = 1, 2, ..., m). Let C_{i,D} be the set of tuples of class C_i in D, and let |D| and |C_{i,D}| denote the number of tuples in D and in C_{i,D}, respectively.

(1) Information gain

Information gain is the attribute selection measure used in the ID3 algorithm. It chooses the attribute with the highest information gain as the splitting attribute of node N; this attribute minimizes the information needed to classify the tuples in the resulting partitions. The expected information needed to classify a tuple in D is given by:

Info(D) = −Σ_{i=1}^{m} p_i · log2(p_i)    (1)

where p_i = |C_{i,D}| / |D| is the probability that a tuple in D belongs to class C_i. Info(D) is also called the entropy of D.
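As a concrete illustration of formula (1), here is a minimal Python sketch (the function name and data layout are my own, not from the article) that computes Info(D) from a list of class labels:

```python
import math
from collections import Counter

def info(labels):
    """Info(D), formula (1): entropy of a collection of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

# The golf data above has 9 "yes" and 5 "no" tuples:
print(round(info(["yes"] * 9 + ["no"] * 5), 3))  # ~0.940 bits
```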

Now suppose the tuples in D are partitioned on attribute A, and that A splits D into v different subsets D_1, D_2, ..., D_v. The information still needed to arrive at an exact classification after this partitioning is measured by:

Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)    (2)

The information gain is defined as the difference between the original information requirement (based on the class proportions alone) and the new requirement (obtained after partitioning on A):

Gain(A) = Info(D) − Info_A(D)    (3)

I think many people find this part hard to grasp at first, so after studying the literature on this topic and comparing it with the three formulas above, here is my own understanding.

In general, for tuples with multiple attributes, a single attribute is almost never enough to separate them completely; otherwise the depth of the decision tree could only ever be 2. Once we select an attribute A, suppose it divides the tuples into two parts, A1 and A2. Since A1 and A2 can themselves be divided further on other attributes, a new question arises: which attribute do we choose to split on next? The expected information needed to classify a tuple in D is Info(D); by the same reasoning, once we partition D into v subsets D_j (j = 1, 2, ..., v), the expected information needed to classify a tuple in D_j is Info(D_j), and since there are v subsets, the total information needed across all of them is given by formula (2). So the smaller formula (2) is, the less information we still need to finish the classification after splitting on A. For a given training set, Info(D) is fixed, so we select the attribute with the largest information gain as the split point.
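To make formulas (2) and (3) concrete, here is a sketch that partitions the hypothetical golf_dataset from above on a single attribute and computes Info_A(D) and Gain(A), reusing the info() helper defined earlier:

```python
from collections import defaultdict

def info_after_split(dataset, attr_index):
    """Info_A(D), formula (2): weighted entropy of the partitions D_j induced by attribute A."""
    partitions = defaultdict(list)
    for row in dataset:
        partitions[row[attr_index]].append(row[-1])  # collect class labels per attribute value
    total = len(dataset)
    return sum(len(labels) / total * info(labels) for labels in partitions.values())

def gain(dataset, attr_index):
    """Gain(A), formula (3): Info(D) minus Info_A(D)."""
    return info([row[-1] for row in dataset]) - info_after_split(dataset, attr_index)

# Information gain of each attribute on the golf data; "outlook" scores highest
# (~0.247 bits), which is consistent with it being chosen as the root split.
for index, name in enumerate(ATTRIBUTES):
    print(name, round(gain(golf_dataset, index), 3))
```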

However, using information gain has a drawback: it is biased toward attributes with a large number of distinct values. That is, in a training set, the more distinct values an attribute takes, the more likely it is to be chosen as the splitting attribute. For example, suppose a training set has 10 tuples and some attribute A takes the values 1 through 10 across them. Splitting on A produces 10 partitions, each with Info(D_j) = 0, so formula (2) evaluates to 0 and the information gain (3) of this split is the largest possible. But clearly, such a partition is meaningless.
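This bias is easy to reproduce with the helpers defined earlier: append a hypothetical identifier-like attribute that takes a different value on every tuple, and its information gain equals Info(D), the maximum possible, even though the split is useless:

```python
# Append a unique "id" value to every tuple: each partition D_j then holds a single
# tuple, so Info_id(D) = 0 and Gain(id) = Info(D), the largest possible gain.
dataset_with_id = [row[:-1] + (i,) + (row[-1],) for i, row in enumerate(golf_dataset)]
id_index = len(dataset_with_id[0]) - 2
print(round(gain(dataset_with_id, id_index), 3))  # equals Info(D) ~ 0.940
```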

(2) Information gain ratio

It is for this reason that C4.5, the successor to ID3, uses the concept of the information gain ratio. The gain ratio normalizes the information gain with a split information value. The split information is defined analogously to Info(D):

SplitInfo_A(D) = −Σ_{j=1}^{v} (|D_j| / |D|) · log2(|D_j| / |D|)    (4)

This value represents the information generated by splitting the training dataset D into v partitions, corresponding to the v outcomes of a test on attribute A. The gain ratio is defined as:

GainRatio(A) = Gain(A) / SplitInfo_A(D)    (5)

The attribute with the maximum gain ratio is selected as the splitting attribute.
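Here is a sketch of formulas (4) and (5) built on the imports and helpers above; it shows how the split information penalizes the many-valued id attribute from the previous example:

```python
def split_info(dataset, attr_index):
    """SplitInfo_A(D), formula (4): entropy of the sizes of the partitions D_j."""
    total = len(dataset)
    counts = Counter(row[attr_index] for row in dataset)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain_ratio(dataset, attr_index):
    """GainRatio(A), formula (5): Gain(A) normalized by SplitInfo_A(D)."""
    si = split_info(dataset, attr_index)
    return gain(dataset, attr_index) / si if si > 0 else 0.0

# The id-like attribute splits D into 14 singleton partitions, so its split
# information is log2(14) ~ 3.807 bits, and its gain ratio drops from 0.940
# (its raw gain) to ~0.247.
print(round(gain_ratio(dataset_with_id, id_index), 3))
```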

(3) Gini index

The Gini index is used in CART. It measures the impurity of a data partition or training tuple set D, and is defined as:

Gini(D) = 1 − Σ_{i=1}^{m} p_i^2    (6)

where p_i = |C_{i,D}| / |D| is the probability that a tuple in D belongs to class C_i.
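For comparison, a minimal sketch of formula (6) as used by CART (not by C4.5):

```python
from collections import Counter

def gini(labels):
    """Gini(D), formula (6): 1 minus the sum of squared class probabilities."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

# With 9 "yes" and 5 "no" tuples:
print(round(gini(["yes"] * 9 + ["no"] * 5), 3))  # ~0.459
```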

Other features

Tree pruning

When a decision tree is created, many of its branches reflect anomalies in the training data caused by noise and outliers. Pruning methods are used to deal with this overfitting problem; they usually rely on statistical measures to cut off the least reliable branches.

Pruning methods generally fall into two categories: pre-pruning and post-pruning.

Pre-pruning prunes the tree in advance by halting its construction early (for example, by deciding that a node should not be split further or that a subset of the training tuples should not be partitioned). Once halted, the node becomes a leaf, which is labeled with the most frequent class among the subset of tuples it holds. There are many criteria for pre-pruning, for example (a small sketch follows the list):

(1) Stop growing the decision tree when it reaches a certain height;

(2) Stop growing when the instances reaching a node all have the same feature vector, even if they do not all belong to the same class;

(3) Stop growing when the number of instances reaching a node falls below a certain threshold; the drawback is that this cannot handle special cases where the amount of data is very small;

(4) Compute the gain in performance from each expansion, and stop growing when it falls below a certain threshold.
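As referenced above, here is a small sketch of what such pre-pruning checks might look like; the threshold names and default values are hypothetical, chosen only to illustrate criteria (1), (3), and (4):

```python
def should_stop(depth, class_labels, best_gain, *,
                max_depth=10, min_tuples=5, min_gain=1e-3):
    """Hypothetical pre-pruning test: return True if the current node should
    become a leaf instead of being split further."""
    if depth >= max_depth:              # criterion (1): the tree has reached a set height
        return True
    if len(set(class_labels)) <= 1:     # pure node: nothing left to separate
        return True
    if len(class_labels) < min_tuples:  # criterion (3): too few instances reach this node
        return True
    if best_gain < min_gain:            # criterion (4): the best split gains too little
        return True
    return False
```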

A drawback of pre-pruning is the horizon effect: under the same criterion, the current expansion may not meet the requirement, while a further expansion would. Pre-pruning can therefore stop the growth of the decision tree prematurely.

The other, more common approach is post-pruning, which forms the pruned tree by cutting subtrees off a fully grown tree. A node is replaced by removing its branches and substituting a leaf, which is usually labeled with the most frequent class in the subtree.

C4.5 uses a pessimistic pruning method: it uses the training set both to generate the decision tree and to prune it, so no independent pruning set is needed.

The basic idea of pessimistic pruning is as follows. Let T be the decision tree built from the training set, and classify the n training tuples with T. For a given leaf node, let K be the number of tuples reaching it and J the number of them that are misclassified. Because T was generated from the training set and fits it, J/K is not a reliable estimate of the error rate, so the corrected estimate (J + 0.5)/K is used instead. Let S be a subtree of T with L(S) leaf nodes, let ΣK_i be the total number of tuples reaching the leaves of S, and let ΣJ_i be the number of tuples misclassified within S. When classifying new tuples, the corrected number of misclassifications of S is estimated as

E(S) = ΣJ_i + 0.5 × L(S)

and its standard error is

SE(E(S)) = sqrt( E(S) × (ΣK_i − E(S)) / ΣK_i ).

Now suppose the subtree S is replaced by a single leaf, and let E be the number of training tuples that this leaf misclassifies. When the following condition holds, the subtree S is deleted and replaced by the leaf, and the subtrees of S need not be examined further:

E + 0.5 ≤ E(S) + SE(E(S)).
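A minimal sketch of the pruning test just described, using the reconstructed formulas above; the counts passed in are hypothetical and serve only to illustrate the check:

```python
import math

def should_prune(leaf_counts, leaf_errors, replacement_errors):
    """Pessimistic pruning test: prune subtree S and replace it with a single leaf
    if the corrected error of the leaf is within one standard error of the
    corrected error of S.
    leaf_counts[i]     -- training tuples reaching leaf i of subtree S
    leaf_errors[i]     -- number of those tuples misclassified at leaf i
    replacement_errors -- training tuples misclassified by the single replacement leaf
    """
    n_leaves = len(leaf_counts)
    total = sum(leaf_counts)
    e_subtree = sum(leaf_errors) + 0.5 * n_leaves            # E(S)
    se = math.sqrt(e_subtree * (total - e_subtree) / total)  # SE(E(S))
    e_leaf = replacement_errors + 0.5                        # corrected error of the leaf
    return e_leaf <= e_subtree + se

# Hypothetical subtree with 3 leaves covering 40 tuples and 5 training errors;
# collapsing it to one leaf would misclassify 7 tuples.
print(should_prune([20, 12, 8], [2, 2, 1], 7))  # True -> prune
```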
