Decision Tree algorithm


1. Decision Tree/Judgment tree

A decision tree is a flowchart-like tree structure in which each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class or a class distribution. The topmost node of the tree is the root node.

As shown in the figure, the root node contains 9 "play" and 5 "don't play" samples. Splitting on the Outlook attribute produces three branches: sunny, overcast, and rain. The overcast branch is already pure (4 "play", 0 "don't play"), so it becomes a leaf, while the sunny and rain branches are divided further on other attributes.

2. Decision Tree Construction

The decision tree construction process does not rely on domain knowledge; it uses an attribute selection measure to pick the attribute that best separates the tuples into distinct classes. Building a decision tree is essentially the process of applying the attribute selection measure to determine the topology of the tree over the characteristic attributes.

The key step in constructing a decision tree is splitting on an attribute: at a given node, different branches are constructed according to different partitions of a chosen attribute, with the goal of making each resulting subset as "pure" as possible, that is, having as many of its tuples as possible belong to the same class. Splitting on an attribute falls into three cases (a small code sketch of the three cases follows this list):

1. The attribute is discrete and a binary decision tree is not required. In this case, each distinct value of the attribute becomes a branch.

2. The attribute is discrete and a binary decision tree is required. In this case, a subset of the attribute's values is used as the test, and two branches are generated according to "belongs to this subset" and "does not belong to this subset".

3. The attribute is continuous. In this case, a value split_point is chosen as the split point, and two branches are generated according to > split_point and <= split_point.
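
As an illustration (a minimal sketch, not from the original text), the following Python fragment partitions a node's tuples in each of the three ways; the toy rows, the chosen value subset, and the threshold are made-up values:

```python
from collections import defaultdict

# Toy tuples: (outlook, humidity, label); the values are invented for illustration.
rows = [("sunny", 85, "no"), ("overcast", 78, "yes"),
        ("rain", 96, "no"), ("sunny", 70, "yes")]

# Case 1: discrete attribute, multiway split, one branch per distinct value.
multiway = defaultdict(list)
for r in rows:
    multiway[r[0]].append(r)

# Case 2: discrete attribute, binary split on a chosen subset of its values.
subset = {"sunny", "rain"}
binary = {"in subset": [r for r in rows if r[0] in subset],
          "not in subset": [r for r in rows if r[0] not in subset]}

# Case 3: continuous attribute, binary split on a threshold (split_point).
split_point = 80
continuous = {"<= split_point": [r for r in rows if r[1] <= split_point],
              "> split_point": [r for r in rows if r[1] > split_point]}
```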

The key to constructing a decision tree is therefore the attribute selection measure. An attribute selection measure is a splitting criterion: a heuristic for deciding how to "best" separate the class-labeled training tuples D into individual classes, and it determines both the topology of the tree and the choice of split_point for continuous attributes.

3. ID3 algorithm

    • The tree begins as a single node representing all the training samples.
    • If the samples all belong to the same class, the node becomes a leaf and is labeled with that class.
    • Otherwise, the algorithm uses an entropy-based measure called information gain as its heuristic for choosing the attribute that best separates the samples into classes. That attribute becomes the "test" or "decision" attribute of the node. In this version of the algorithm all attributes are categorical, that is, discrete-valued; continuous attributes must be discretized first. A branch is created for each known value of the test attribute, and the samples are partitioned accordingly. The algorithm then recursively builds a decision tree on each partition in the same way. Once an attribute has been used at a node, it does not need to be considered again at any of that node's descendants.
    • The recursive partitioning stops only when one of the following conditions holds (a minimal sketch of the whole procedure is given after these conditions):

(a) All samples of a given node belong to the same class.

(b) There are no remaining attributes on which the samples can be further partitioned. In this case, majority voting is used: the node is converted to a leaf and labeled with the class to which the majority of its samples belong. Alternatively, the class distribution of the node's samples can be stored.

(c) The branch test_attribute = Ai contains no samples. In this case, a leaf is created and labeled with the majority class of the samples at the parent node.
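
The following is a minimal sketch of this recursive procedure in Python; it is not the original author's code, and it assumes each sample is a dict mapping attribute names to values plus a "label" key, with illustrative helper names:

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D): expected information needed to classify a tuple, from a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(samples, attr):
    """Gain(attr) = Info(D) - Info_attr(D) for a discrete attribute."""
    labels = [s["label"] for s in samples]
    before = entropy(labels)
    after = 0.0
    for value in set(s[attr] for s in samples):
        subset = [s["label"] for s in samples if s[attr] == value]
        after += len(subset) / len(samples) * entropy(subset)
    return before - after

def id3(samples, attributes):
    labels = [s["label"] for s in samples]
    # (a) all samples belong to the same class: return a leaf with that class
    if len(set(labels)) == 1:
        return labels[0]
    # (b) no attributes left: return a leaf labeled with the majority class
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # otherwise choose the attribute with the highest information gain
    best = max(attributes, key=lambda a: information_gain(samples, a))
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    for value in set(s[best] for s in samples):
        branch = [s for s in samples if s[best] == value]
        # condition (c) cannot arise here, because only values that actually
        # occur among the samples produce branches
        tree[best][value] = id3(branch, remaining)
    return tree
```

Calling id3(samples, attributes) on a list of such dicts returns either a class label or a nested dict whose keys alternate between attribute names and attribute values, with class labels at the leaves.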

Several related concepts:

1. Let D be the set of class-labeled training tuples, divided among m classes. The entropy of D, Info(D), is defined as:

Info(D) = - Σ_{i=1}^{m} pi * log2(pi)

where pi is the probability that a tuple in D belongs to class i, which can be estimated as the number of tuples of that class divided by the total number of tuples in D. Intuitively, Info(D) is the average amount of information needed to identify the class label of a tuple in D.
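
For instance, for the 9 "play" and 5 "don't play" tuples at the root of the earlier weather example, Info(D) works out to about 0.940 bits. A quick sketch to check this (illustrative, not from the original text):

```python
import math

def info(counts):
    """Info(D) for a node, given the number of tuples in each class."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(info([9, 5]), 3))  # prints 0.94
```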

2. Suppose the training tuples D are partitioned on attribute A into v subsets D1, ..., Dv. The expected information still required to classify a tuple after splitting on A is:

Info_A(D) = Σ_{j=1}^{v} (|Dj| / |D|) * Info(Dj)

3. The information gain is the difference between the two values:

Gain(A) = Info(D) - Info_A(D)
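
A minimal Python sketch of these two formulas, assuming each branch of a candidate split is summarized by its list of class counts (the function names and the toy counts below are made up for illustration):

```python
import math

def info(counts):
    """Info(D): entropy of a node given its class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_after_split(branch_counts):
    """Info_A(D): weighted entropy of the subsets D1..Dv produced by splitting on A."""
    total = sum(sum(b) for b in branch_counts)
    return sum(sum(b) / total * info(b) for b in branch_counts)

def gain(overall_counts, branch_counts):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(overall_counts) - info_after_split(branch_counts)

# Toy split: 8 tuples (3 vs. 5 overall) divided into two branches of 4 tuples each.
print(round(gain([3, 5], [[3, 1], [0, 4]]), 3))  # roughly 0.549
```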

For example, consider a training set whose tuples are described by the attributes age, income, student, and credit_rating and labeled with the class buys_computer; in the standard textbook version of this example there are 14 tuples, 9 labeled "yes" and 5 labeled "no". Computing the information gain of age gives Gain(age) = 0.246.
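
That number can be checked directly. Assuming the class counts per age group from the standard textbook version of this example (youth: 2 yes / 3 no, middle_aged: 4 yes / 0 no, senior: 3 yes / 2 no), a self-contained sketch of the calculation is:

```python
import math

def info(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

info_d = info([9, 5])                             # Info(D), about 0.940
info_age = sum(sum(b) / 14 * info(b)              # Info_age(D), about 0.694
               for b in [[2, 3], [4, 0], [3, 2]])
print(round(info_d - info_age, 3))                # about 0.247
```

(The exact value is about 0.247; the commonly quoted 0.246 comes from rounding the intermediate values Info(D) ≈ 0.940 and Info_age(D) ≈ 0.694 before subtracting.)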

Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048. Age yields the largest information gain, so it is selected as the splitting attribute at the root node, and the same steps are repeated on each branch until a stopping condition is met. The result is a tree whose root branches on age and whose leaf nodes carry the class labels.

4. C4.5 algorithm

One problem with the ID3 algorithm is that it is biased towards multi-valued attributes. For example, if there is a unique identifier attribute ID, ID3 will choose it as the splitting attribute, which makes each partition perfectly pure but is almost useless for classification. ID3's successor, C4.5, uses an extension of information gain called the gain ratio to try to overcome this bias.

The C4.5 algorithm first defines the "split information" of an attribute A, which can be expressed as:

SplitInfo_A(D) = - Σ_{j=1}^{v} (|Dj| / |D|) * log2(|Dj| / |D|)

Each symbol has the same meaning as in the ID3 formulas above, and the gain ratio is then defined as:

GainRatio(A) = Gain(A) / SplitInfo_A(D)

C4.5 chooses the attribute with the maximum gain ratio as the splitting attribute; apart from that, it is applied in the same way as ID3.
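
A minimal sketch of the gain-ratio computation, using the same class-count representation as the earlier sketches (the function names are illustrative, not from the original text):

```python
import math

def info(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(overall_counts, branch_counts):
    total = sum(overall_counts)
    return info(overall_counts) - sum(sum(b) / total * info(b) for b in branch_counts)

def split_info(branch_counts):
    """SplitInfo_A(D): entropy of the branch sizes themselves."""
    total = sum(sum(b) for b in branch_counts)
    return -sum(sum(b) / total * math.log2(sum(b) / total)
                for b in branch_counts if sum(b) > 0)

def gain_ratio(overall_counts, branch_counts):
    return gain(overall_counts, branch_counts) / split_info(branch_counts)

# Same toy split as before: 3 vs. 5 tuples overall, two branches of 4 tuples each.
print(round(gain_ratio([3, 5], [[3, 1], [0, 4]]), 3))  # roughly 0.549
```

A many-valued attribute such as a unique ID drives SplitInfo_A(D) up (one tuple per branch), so even though its Gain(A) is maximal, its gain ratio stays small, which is exactly the bias correction described above.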
