Decision Tree Algorithm

1. Summary

In the previous two articles, the naive Bayes and Bayesian network classification algorithms were introduced and discussed. Both are based on Bayes' theorem and can be used to infer probabilities for classification and decision problems. In this article, we discuss another widely used classification algorithm: the decision tree. Compared with Bayesian algorithms, a decision tree requires no domain knowledge or parameter setting during construction, so in practice decision trees are better suited to exploratory knowledge discovery.

2. An Intuitive Introduction to Decision Trees

Informally, the idea of decision tree classification is similar to finding a partner. Imagine a girl whose mother wants to introduce a boyfriend to her, and the following conversation takes place:

Daughter: How old is he?

Mother: 26.

Daughter: Is he handsome?

Mother: Very handsome.

Daughter: Does he have a high income?

Mother: Not very high. Moderate.

Daughter: Is he a civil servant?

Mother: Yes, he works at the tax bureau.

Daughter: OK, I'll meet him.

This girl's decision-making process is a typical classification tree decision. It is equivalent to dividing men into two categories, "meet" and "do not meet", according to age, looks, income, and whether he is a civil servant. Suppose the girl's requirements for a man are: under 30 years old, of medium or better looks, and either a high earner or a civil servant with at least a medium income. Then her decision logic can be represented by the decision tree below. (Disclaimer: this decision tree is purely invented for the sake of this article; it has no empirical basis and does not represent any girl's actual preferences in choosing a partner, so please don't take it too seriously ^_^)

The figure fully expresses the girl's strategy for deciding whether to meet a date. The green nodes represent judgment conditions, the orange nodes represent decision results, and the arrows represent the decision paths taken under different conditions. The red arrows indicate the girl's decision process in the example above.

This figure can more or less be regarded as a decision tree. We say "more or less" because the judgment conditions in the figure are not quantified (for example, what exactly counts as high, medium, or low income), so it is not a decision tree in the strict sense. If all the conditions were quantified, it would become a true decision tree.

With the intuitive understanding above, we can formally define the decision tree:

A decision tree is a tree structure (binary or non-binary). Each non-leaf node represents a test on a feature attribute, each branch represents the output of that test over a range of values of the attribute, and each leaf node stores a category. Using a decision tree to make a decision starts at the root node: the corresponding feature attribute of the item to be classified is tested, the output branch is selected according to its value, and this process is repeated until a leaf node is reached; the category stored at that leaf node is the decision result.
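To make the definition concrete, here is a minimal sketch (not from the original article; the node layout and the example tree are my own assumptions, loosely and simplistically modeled on the dating example above) of how an item is classified by walking a decision tree from the root to a leaf:

```python
# A non-leaf node is represented as a dict:
#   {"attribute": <feature name>, "branches": {<attribute value>: <child node>}}
# A leaf node is represented as a plain class label (string).

def classify(node, sample):
    """Walk the tree from the root until a leaf (class label) is reached."""
    while isinstance(node, dict):             # non-leaf: keep testing
        value = sample[node["attribute"]]     # test the feature attribute
        node = node["branches"][value]        # follow the matching output branch
    return node                               # the leaf stores the category

# Hypothetical tree (simplified: income is omitted); all names are made up.
tree = {
    "attribute": "age",
    "branches": {
        "under_30": {
            "attribute": "looks",
            "branches": {
                "medium_or_better": {
                    "attribute": "civil_servant",
                    "branches": {"yes": "meet", "no": "do not meet"},
                },
                "below_medium": "do not meet",
            },
        },
        "30_or_over": "do not meet",
    },
}

print(classify(tree, {"age": "under_30", "looks": "medium_or_better",
                      "civil_servant": "yes"}))   # -> "meet"
```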

As we can see, the decision process of a decision tree is very intuitive and easy to understand. Decision trees have been successfully applied in many fields, such as medicine, manufacturing, astronomy, biology, and business. Having covered the definition of a decision tree and how it is applied, we now describe how a decision tree is constructed.

3. Decision Tree Construction

Unlike Bayesian algorithms, the construction of a decision tree does not rely on domain knowledge; it uses an attribute selection measure to choose the attribute that best partitions the tuples into distinct classes. Constructing a decision tree is essentially the process of applying an attribute selection measure to determine the topological structure among the feature attributes.

The key step in constructing a decision tree is splitting on attributes. Splitting on an attribute means building different branches at a node according to the different values of some feature attribute, with the goal of making each resulting subset as "pure" as possible. As "pure" as possible means that the items in a split subset should, as far as possible, belong to the same category. Attribute splits fall into three cases (a partition sketch follows the list):

1. The attribute is discrete and a binary decision tree is not required. In this case, each distinct value of the attribute forms its own branch.

2. The attribute is discrete and a binary decision tree must be generated. In this case, a subset of the attribute's values is used for the test, and two branches are formed according to "belongs to this subset" and "does not belong to this subset".

3. The attribute is continuous. In this case, a value is chosen as the split point split_point, and two branches are generated according to > split_point and <= split_point.
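As a rough illustration of the three cases (my own sketch, assuming the training tuples are stored as a list of dicts keyed by attribute name), the splits might be implemented like this:

```python
from collections import defaultdict

def split_discrete_multiway(rows, attr):
    """Case 1: one branch per distinct value of a discrete attribute."""
    branches = defaultdict(list)
    for row in rows:
        branches[row[attr]].append(row)
    return dict(branches)

def split_discrete_binary(rows, attr, value_subset):
    """Case 2: two branches, 'in the value subset' vs. 'not in the value subset'."""
    inside = [r for r in rows if r[attr] in value_subset]
    outside = [r for r in rows if r[attr] not in value_subset]
    return inside, outside

def split_continuous(rows, attr, split_point):
    """Case 3: two branches, > split_point vs. <= split_point."""
    greater = [r for r in rows if r[attr] > split_point]
    less_equal = [r for r in rows if r[attr] <= split_point]
    return greater, less_equal
```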

The key to constructing a decision tree is the attribute selection measure. An attribute selection measure is a splitting criterion: a heuristic for partitioning the data of a class-labeled training set into individual classes as "best" as possible. It determines both the topology of the tree and the choice of split points.

There are many attribute selection measures. Most algorithms use a top-down recursive splitting approach with a greedy strategy and no backtracking. Here we introduce two common algorithms, ID3 and C4.5.

3.1. ID3 algorithm

From information theory, we know that the smaller the expected information, the larger the information gain, and thus the higher the purity. Therefore, the core idea of the ID3 algorithm is to use information gain as the attribute selection measure and to split on the attribute that yields the largest information gain after splitting. We first define a few concepts that will be used.

Let D be the partition of the training tuples by class. The entropy (expected information) of D is:

info(D) = -Σ_{i=1..m} p_i log2(p_i)

where p_i is the probability that the i-th class appears in the entire training set D, which can be estimated as the number of elements of that class divided by the total number of elements in D. The practical meaning of the entropy is the average amount of information needed to identify the class label of a tuple in D.
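A minimal sketch of the entropy computation (my own, not from the article), assuming the class labels of D are given as a plain list:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """info(D) = -sum(p_i * log2(p_i)), with p_i estimated from label frequencies."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Hypothetical distribution: 7 labels of one class, 3 of the other.
print(entropy(["yes"] * 7 + ["no"] * 3))   # ~0.881
```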

Now suppose we partition the training tuples D by attribute A into v subsets {D_1, D_2, ..., D_v}. The expected information of D with respect to A is:

info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × info(D_j)

The information gain is the difference between the two:

gain(A) = info(D) - info_A(D)
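Continuing the sketch, the information gain of a discrete attribute can be computed by reusing the entropy() function above (again my own illustration, with the same list-of-dicts assumption):

```python
from collections import defaultdict

def information_gain(rows, attr, label_key):
    """gain(A) = info(D) - info_A(D) for a discrete attribute `attr`."""
    labels = [r[label_key] for r in rows]
    groups = defaultdict(list)
    for r in rows:
        groups[r[attr]].append(r[label_key])        # class labels per subset D_j
    expected = sum(len(g) / len(rows) * entropy(g)  # info_A(D)
                   for g in groups.values())
    return entropy(labels) - expected               # info(D) - info_A(D)
```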

Each time a split is needed, the ID3 algorithm calculates the information gain of every attribute and then splits on the attribute with the highest information gain. Next we use the example of fake account detection in an SNS community to illustrate how to construct a decision tree with the ID3 algorithm. For simplicity, assume the training set contains 10 elements:

where s, m, and l denote small, medium, and large, respectively.

Let L, F, H, and R denote log density, friend density, whether a real profile photo is used, and whether the account is real, respectively. The information gain of each attribute is calculated below.

Carrying out this calculation for the log density L gives an information gain of 0.276.

In the same way, the information gains of H and F are 0.033 and 0.553, respectively.

Because F has the largest information gain, F is selected as the splitting attribute for the first split. The result after splitting is shown below:

This method is then applied recursively to compute the splitting attribute of each child node, eventually yielding the entire decision tree.
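The recursion described above might be sketched as follows (my own illustration, reusing information_gain() from earlier; it handles only discrete attributes and produces nodes in the same format as the earlier classify() sketch):

```python
from collections import Counter

def build_id3(rows, attributes, label_key):
    labels = [r[label_key] for r in rows]
    if len(set(labels)) == 1:                    # pure subset: make a leaf
        return labels[0]
    if not attributes:                           # attributes used up: majority vote
        return Counter(labels).most_common(1)[0][0]
    # pick the attribute with the largest information gain
    best = max(attributes, key=lambda a: information_gain(rows, a, label_key))
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    for value in set(r[best] for r in rows):     # one branch per attribute value
        subset = [r for r in rows if r[best] == value]
        node["branches"][value] = build_id3(subset, remaining, label_key)
    return node
```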

In the example above, the feature attributes were discretized for simplicity; in reality, both log density and friend density are continuous attributes. When a feature attribute is continuous, the ID3 algorithm can handle it as follows:

First, sort the elements of D by the value of the continuous attribute. The midpoint of every two adjacent values is taken as a potential split point. Starting from the first potential split point, split D into two sets at each candidate and compute the expected information of the split. The candidate with the smallest expected information is called the best split point of this attribute, and its expected information is used as the expected information of the attribute.
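A sketch of this procedure (my own, reusing entropy() from above) for a single continuous attribute:

```python
def best_split_point(rows, attr, label_key):
    """Return the candidate split point with the smallest expected information."""
    values = sorted(set(r[attr] for r in rows))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]  # midpoints
    best_point, best_info = None, float("inf")
    for point in candidates:
        left = [r[label_key] for r in rows if r[attr] <= point]
        right = [r[label_key] for r in rows if r[attr] > point]
        expected = (len(left) / len(rows)) * entropy(left) \
                 + (len(right) / len(rows)) * entropy(right)
        if expected < best_info:
            best_point, best_info = point, expected
    return best_point, best_info
```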

3.2. C4.5 algorithm

The ID3 algorithm is biased toward multi-valued attributes. For example, if the data contains a unique identifier attribute such as an ID, ID3 will select it as the splitting attribute; this makes the partition completely pure, but such a split is almost useless for classification. ID3's successor, C4.5, uses the gain ratio to overcome this bias.

The C4.5 algorithm first defines the "split information", which can be expressed as:

split_info_A(D) = -Σ_{j=1..v} (|D_j| / |D|) log2(|D_j| / |D|)

The symbols have the same meaning as in the ID3 algorithm, and the gain ratio is then defined as:

gain_ratio(A) = gain(A) / split_info_A(D)

C4.5 selects the attribute with the maximum gain ratio as the splitting attribute. Its use is otherwise similar to ID3 and is not repeated here.
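A minimal sketch of the gain ratio (my own, reusing information_gain() from the ID3 sketches above):

```python
from collections import Counter
from math import log2

def gain_ratio(rows, attr, label_key):
    """gain_ratio(A) = gain(A) / split_info_A(D) for a discrete attribute."""
    total = len(rows)
    counts = Counter(r[attr] for r in rows)                  # |D_j| per value
    split_info = -sum((c / total) * log2(c / total) for c in counts.values())
    if split_info == 0:                                      # single-valued attribute
        return 0.0
    return information_gain(rows, attr, label_key) / split_info
```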

4. Some additional information about decision trees

4.1. What should I do if the attributes are used up?

The following can happen during decision tree construction: all attributes have been used as splitting attributes, but some subsets are still not pure, i.e., their elements do not all belong to the same category. In this case, since no further information is available, a "majority vote" is usually performed on such a subset: the most frequent category in the subset is taken as the node's category, and the node becomes a leaf node.
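As a tiny illustration (my own) of the majority-vote rule, this is exactly what the attributes-used-up branch of the build_id3() sketch above does:

```python
from collections import Counter

labels = ["yes", "yes", "no"]                    # hypothetical impure subset
leaf_category = Counter(labels).most_common(1)[0][0]
print(leaf_category)                             # -> "yes"
```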

4.2. About pruning

In practice, pruning is usually needed when constructing a decision tree; it addresses the overfitting problem caused by noise and outliers in the data. There are two types of pruning:

Pre-pruning: during construction, when a node meets the pruning condition, construction of that branch is stopped immediately.

Post-pruning: first build the complete decision tree, then traverse the tree and prune it according to certain conditions.

Specific pruning algorithms are not described here. If you are interested, refer to the relevant literature.

This article is from http://www.cnblogs.com/hexinuaa/articles/2143531.html
