Tree. J48

Weka is a Java-based machine learning tool. It is easy to get started with, provides a graphical interface, and offers tools for classification, clustering, and frequent itemset mining. This article mainly covers the J48 algorithm among Weka's classifiers and how it is actually implemented.

First, the algorithm

J48 is a decision tree algorithm; it is Weka's implementation of C4.5. There is already a great deal of material on the C4.5 algorithm, so part of it is reproduced here (source: http://blog.csdn.net/zjd950131/article/details/8027081).

C4.5 is a family of algorithms used for classification problems in machine learning and data mining.

Its goal is supervised learning: given a data set in which each tuple is described by a set of attribute values and belongs to one of a set of mutually exclusive classes, C4.5 learns a mapping from attribute values to classes, and this mapping can then be used to classify new, previously unseen entities.

C4.5 was proposed by J. Ross Quinlan on the basis of ID3. The ID3 algorithm is used to construct decision trees. A decision tree is a flowchart-like tree structure in which each internal (non-leaf) node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. Once the decision tree has been built, a tuple without a class label is classified by tracing a path from the root node to a leaf node; the leaf node holds the prediction for that tuple. The advantage of a decision tree is that it requires no domain knowledge or parameter setting, which makes it suitable for exploratory knowledge discovery.

From the ID3 algorithm, two algorithms were derived: C4.5 and CART. Both are important in data mining. The following is a typical example of C4.5 producing a decision tree from a dataset.

The dataset is shown in Figure 1; it describes the relationship between weather conditions and whether golf is played.

Figure 1: Dataset

Figure 2: Decision tree generated by C4.5 on the dataset

Algorithm description

C4.5 is not a single algorithm but a set of algorithms: C4.5, non-pruned C4.5, and C4.5 rules. Figure 3 gives the basic workflow of the C4.5 algorithm:

Figure 3: C4.5 algorithm flow

We may wonder: a tuple has many attributes, so how do we know which attribute to test first and which to test next? In other words, in Figure 2, how do we know that the first attribute to test is outlook rather than windy? The concept that answers these questions is the attribute selection measure.

Attribute selection measures

Attribute selection measures are also called splitting rules, because they determine how the tuples at a given node are split. An attribute selection measure provides a ranking for each attribute describing the given training tuples; the attribute with the best score is chosen as the splitting attribute for those tuples. The most popular attribute selection measures are information gain, gain ratio, and the Gini index.

First, some notation: let D be a training set of class-labeled tuples, and suppose the class label attribute has m distinct values defining m distinct classes C_i (i = 1, 2, ..., m). Let C_{i,D} be the set of tuples of class C_i in D, and let |D| and |C_{i,D}| denote the number of tuples in D and C_{i,D}, respectively.

(1) Information gain

Information gain is the attribute selection measure used in the ID3 algorithm. It chooses the attribute with the highest information gain as the splitting attribute of node N. This attribute minimizes the amount of information needed to classify the tuples in the resulting partitions. The expected information needed to classify a tuple in D is given by:

Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)    (1)

where p_i = |C_{i,D}| / |D| is the probability that a tuple in D belongs to class C_i. Info(D) is also called the entropy of D.
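
As a quick worked example, using the golf data of Figure 1 (the usual 14 tuples with 9 "play = yes" and 5 "play = no"; these counts are assumed here since the figure itself is not reproduced):

    Info(D) = -(9/14) \log_2(9/14) - (5/14) \log_2(5/14) ≈ 0.940 bits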

Now suppose we partition the tuples in D on attribute A, and that A splits D into v different subsets.

After this partitioning, the information still needed to arrive at an exact classification is measured by:

Info_A(D) = \sum_{j=1}^{v} (|D_j| / |D|) \cdot Info(D_j)    (2)

The information gain is defined as the difference between the original information requirement (based only on the class proportions) and the new requirement (obtained after partitioning on A). That is:

Gain(A) = Info(D) - Info_A(D)    (3)

I think many people find this part hard to understand at first, so I studied the literature on this topic and worked through the three formulas above. Here is my own understanding.

Generally speaking, tuples with many attributes can almost never be separated by a single attribute; otherwise the depth of the decision tree could only be 2. From this we can see that once we select an attribute A and split the tuples into, say, A1 and A2, then because A1 and A2 can be split again on other attributes, a new question arises: which attribute should we choose to split on next? The expected information needed to classify the tuples in D is Info(D); likewise, when we partition D by A into v subsets D_j (j = 1, 2, ..., v), the information needed to classify the tuples of D_j is Info(D_j), and there are v subsets in total. So the information needed to re-classify the v subsets is given by formula (2). The smaller formula (2) is, the less information we still need to classify the subsets produced by A. For a given training set, Info(D) is fixed, so we select the attribute with the largest information gain as the split point.
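
To make formulas (1)-(3) concrete, here is a minimal sketch in plain Java (not Weka code); the class counts in main are assumptions taken from the usual golf data of Figure 1:

    // A minimal sketch of formulas (1)-(3): entropy of a class distribution,
    // and the information gain of an attribute split.
    public class InfoGainDemo {

        // Formula (1): Info(D) = -sum p_i * log2(p_i)
        static double entropy(int... classCounts) {
            int total = 0;
            for (int c : classCounts) total += c;
            double info = 0.0;
            for (int c : classCounts) {
                if (c == 0) continue;                    // 0 * log(0) is treated as 0
                double p = (double) c / total;
                info -= p * (Math.log(p) / Math.log(2));
            }
            return info;
        }

        // Formula (2): Info_A(D) = sum (|Dj| / |D|) * Info(Dj),
        // where each row of 'partitions' holds the class counts of one subset Dj.
        static double infoAfterSplit(int[][] partitions) {
            int total = 0;
            for (int[] dj : partitions) for (int c : dj) total += c;
            double info = 0.0;
            for (int[] dj : partitions) {
                int size = 0;
                for (int c : dj) size += c;
                info += ((double) size / total) * entropy(dj);
            }
            return info;
        }

        public static void main(String[] args) {
            // Assumed class counts for the golf data: 9 "yes", 5 "no".
            double infoD = entropy(9, 5);                                          // about 0.940
            // Outlook splits D into sunny {2 yes, 3 no}, overcast {4 yes, 0 no}, rainy {3 yes, 2 no}.
            double infoOutlook = infoAfterSplit(new int[][]{{2, 3}, {4, 0}, {3, 2}}); // about 0.694
            // Formula (3): Gain(outlook) = Info(D) - Info_outlook(D), about 0.247.
            System.out.printf("Info(D)=%.3f  Info_outlook(D)=%.3f  Gain=%.3f%n",
                    infoD, infoOutlook, infoD - infoOutlook);
        }
    }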

However, information gain has a drawback: it is biased toward attributes with a large number of distinct values.

What does this mean? In the training set, the more distinct values an attribute takes, the more likely it is to be chosen as the splitting attribute.

For example, suppose a training set has 10 tuples and an attribute A takes the values 1 through 10, one per tuple. Splitting on A then produces 10 partitions, each containing a single tuple, so Info(D_j) = 0 for every partition, formula (2) evaluates to 0, and the information gain in formula (3) is maximal. But such a split is obviously meaningless.

(2) Information gain ratio

It is for this reason that C4.5, the successor to ID3, uses the gain ratio instead. The gain ratio normalizes the information gain using a "split information" value. The split information is defined analogously to Info(D), as follows:

SplitInfo_A(D) = -\sum_{j=1}^{v} (|D_j| / |D|) \cdot \log_2(|D_j| / |D|)    (4)

This value represents the potential information generated by splitting the training set D into v partitions, corresponding to the v outcomes of the test on attribute A. The gain ratio is defined as:

GainRatio(A) = Gain(A) / SplitInfo_A(D)    (5)

The attribute with the maximum gain ratio is selected as the splitting attribute.
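
Formulas (4) and (5) can be sketched the same way; again, the subset sizes and the gain value below are assumptions based on the usual golf data:

    // A minimal sketch of formulas (4)-(5): split information and gain ratio.
    public class GainRatioDemo {

        // Formula (4): SplitInfo_A(D) = -sum (|Dj|/|D|) * log2(|Dj|/|D|)
        static double splitInfo(int... subsetSizes) {
            int total = 0;
            for (int s : subsetSizes) total += s;
            double info = 0.0;
            for (int s : subsetSizes) {
                if (s == 0) continue;
                double frac = (double) s / total;
                info -= frac * (Math.log(frac) / Math.log(2));
            }
            return info;
        }

        public static void main(String[] args) {
            // Outlook splits the 14 golf tuples into subsets of size 5, 4 and 5 (assumed counts).
            double split = splitInfo(5, 4, 5);      // about 1.577
            double gain = 0.247;                    // Gain(outlook), from the previous sketch
            // Formula (5): GainRatio(A) = Gain(A) / SplitInfo_A(D), about 0.157.
            System.out.printf("SplitInfo=%.3f  GainRatio=%.3f%n", split, gain / split);
        }
    }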

Second, the algorithm description

(1) We want to construct a decision tree. Naturally, each level of the tree tests the value of one attribute, and each leaf node points to the class assigned to the tuples that reach it, as can be seen in Figure 2.

(2) The natural question is then how to choose the right attribute at each level so that the constructed tree is as close to optimal as possible, that is, so that the paths from root to leaf are as short as possible.

(3) The most critical question is therefore how, at each level, to find the most suitable splitting attribute among the attributes that have not yet been used.

(4) The ID3 algorithm selects the optimal node by choosing the attribute with the highest information gain. Information gain can be simply understood as the reduction in uncertainty after partitioning on an attribute.

(5) The C4.5 algorithm improves on this by choosing the attribute with the highest gain ratio, which avoids the bias toward many-valued attributes that makes the tree too wide.

(6) After the tree has been built, some pruning operations are performed. This is not on the main line of today's discussion and is not covered in detail, but it is worth knowing how Weka implements it (a small sketch of the relevant J48 options follows below).
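
As a small illustration of item (6), here is a sketch of the J48 options that control pruning; the setters shown exist on the J48 class, and the comments describe the corresponding command-line flags:

    import weka.classifiers.trees.J48;

    public class J48PruningDemo {
        public static void main(String[] args) {
            // Default behaviour: C4.5 error-based pruning, controlled by the confidence factor (-C, default 0.25).
            J48 pruned = new J48();
            pruned.setConfidenceFactor(0.25f);

            // Switch pruning off entirely (-U): the full tree is kept.
            J48 unpruned = new J48();
            unpruned.setUnpruned(true);

            // Both would then be trained with buildClassifier(...) as usual.
            System.out.println(java.util.Arrays.toString(pruned.getOptions()));
            System.out.println(java.util.Arrays.toString(unpruned.getOptions()));
        }
    }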

Third, the main data structures used in the algorithm

(1) Instances Object

An Instances object represents a single table, which can correspond to an ARFF file or a CSV file. Through the Instances object you can obtain statistics such as the mean and variance of a column; essentially it is a wrapper around a number of row records.

(2) Instance

An Instance represents a single row (record); in other words, an Instances object contains multiple Instance objects. Each Instance has a special column, the classIndex, which indicates the class the instance belongs to; concretely, this is the "play golf" column in Figure 1.
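
A minimal sketch of how these two classes are typically used with the standard weka.core API (the file name golf.arff is a placeholder):

    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class LoadDataDemo {
        public static void main(String[] args) throws Exception {
            // Load an ARFF (or CSV) file into an Instances object - one table of records.
            Instances data = DataSource.read("golf.arff");   // hypothetical file name
            // Tell Weka which column holds the class label (here: the last attribute).
            data.setClassIndex(data.numAttributes() - 1);

            // Each row is an Instance; iterate and print its class value.
            for (int i = 0; i < data.numInstances(); i++) {
                Instance row = data.instance(i);
                System.out.println(row + "  ->  class: "
                        + row.stringValue(data.classIndex()));
            }
        }
    }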

(3) Classifier interface

Every classifier in Weka inherits from this interface (it is called an interface here, although in fact it is a class that other classifiers subclass). The interface provides a buildClassifier method, which takes an Instances object for training, and a classifyInstance method, which takes an Instance and predicts which class it belongs to.

(4) J48

The main class of the classifier; it implements the Classifier interface.
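
A minimal sketch of how buildClassifier and classifyInstance are used together with J48 (the file name is a placeholder; the classes come from weka.classifiers.trees and weka.core):

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class J48Demo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("golf.arff");   // hypothetical file
            data.setClassIndex(data.numAttributes() - 1);

            J48 tree = new J48();                // C4.5-style decision tree
            tree.buildClassifier(data);          // train on the Instances

            // Predict the class of the first instance (normally you would use unseen data).
            double predicted = tree.classifyInstance(data.instance(0));
            System.out.println("Predicted class: "
                    + data.classAttribute().value((int) predicted));
            System.out.println(tree);            // prints the learned tree as text
        }
    }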

(5) ClassifierTree interface

Represents a node in the tree; these nodes maintain and make up the structure of the tree. The implementations used by J48 are C45PruneableClassifierTree and PruneableClassifierTree.

(6) ModelSelection interface

This interface is responsible for evaluating and selecting the best attribute; the instances are then placed into different subsets based on that attribute, and ClassifierTree uses a ModelSelection to generate the structure of the tree.

This kind of abstraction is well worth learning from. The implementations of this interface used by J48 are BinC45ModelSelection and C45ModelSelection; as the names suggest, the former generates a binary tree (each node has only two branches), while the latter generates a standard C4.5 tree.
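
J48 does not expose these classes directly, but its options decide which ones are used internally. A small sketch (the setters shown exist on J48; the comments describe which internal classes they select, as discussed above):

    import weka.classifiers.trees.J48;

    public class J48OptionsDemo {
        public static void main(String[] args) {
            J48 tree = new J48();
            // Binary splits only (-B) -> BinC45ModelSelection is used instead of C45ModelSelection.
            tree.setBinarySplits(true);
            // Reduced-error pruning (-R) -> PruneableClassifierTree instead of C45PruneableClassifierTree.
            tree.setReducedErrorPruning(true);
            System.out.println(java.util.Arrays.toString(tree.getOptions()));
        }
    }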
