Zhang Yang: Algorithm Grocery Store - Decision Trees for Classification (Decision Tree)
2010-09-19 16:30 by T2 phage
3.1. Abstract
In the previous two articles, two classification algorithms, naive Bayesian classification and Bayesian networks, were introduced and discussed. Both are based on Bayes' theorem and can be used to infer classification probabilities and make decisions. In this article we discuss another widely used classification algorithm: the decision tree. Compared with the Bayesian algorithms, one advantage of the decision tree is that its construction requires no domain knowledge or parameter setting, so in practice decision trees are better suited to exploratory knowledge discovery.
3.2. An Introduction to Decision Trees
In layman's terms, the idea behind decision tree classification is much like how people size up a potential date. Imagine a mother who wants to introduce a boyfriend to her daughter, which leads to the following dialogue:
Daughter: How old is he?
Mother: 26.
Daughter: Is he handsome?
Mother: Very handsome.
Daughter: Does he earn a lot?
Mother: Not very much; about average.
Daughter: Is he a civil servant?
Mother: Yes, he works at the tax bureau.
Daughter: All right, I'll go meet him.
This girl's decision-making process is a typical classification-tree decision: based on age, looks, income, and whether he is a civil servant, men are effectively divided into two categories, "meet" and "don't meet". Suppose the girl's requirement is a civil servant under 30 who is at least average-looking and has a high or medium income; then the figure below represents her decision logic. (Disclaimer: this decision tree is purely something made up for the sake of this article, has no factual basis, and does not represent any girl's criteria for choosing a partner, so please don't take me to task ^_^)
The figure fully expresses the girl's logic for deciding whether to meet a date: the green nodes represent decision conditions, the orange nodes represent decision outcomes, the arrows indicate the decision paths taken under different conditions, and the red arrows trace the girl's decision process in the dialogue above.
This figure can basically be regarded as a decision tree. We say "basically" because the decision conditions in the figure are not quantified (for example, high, medium, or low income), so it is not strictly a decision tree; if all of the conditions were quantified, it would become a true decision tree.
With the intuitive understanding above, we can formally define the decision tree:
A decision tree is a tree structure (either a binary tree or a non-binary tree). Each non-leaf node represents a test on a feature attribute, each branch represents the output of that test over some range of values, and each leaf node stores a category. The process of making a decision with a decision tree starts at the root node: the corresponding feature attribute of the item to be classified is tested, the output branch is selected according to its value, and this is repeated until a leaf node is reached; the category stored at that leaf node is the decision result.
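To make this definition concrete, here is a minimal sketch in Python (not code from the original article) of a decision tree node and the top-down classification procedure just described. The tree, attribute names, and values are hypothetical and loosely follow the dating example (income is omitted for brevity):

```python
# A minimal decision-tree node and the top-down classification procedure
# described above. The tree below is purely illustrative.

class Node:
    def __init__(self, attribute=None, branches=None, label=None):
        self.attribute = attribute      # feature attribute tested at this node (None for leaves)
        self.branches = branches or {}  # attribute value -> child Node
        self.label = label              # category stored at a leaf node

def classify(node, sample):
    """Walk from the root to a leaf, following the branch that matches
    the sample's value for the attribute tested at each non-leaf node."""
    while node.label is None:
        node = node.branches[sample[node.attribute]]
    return node.label

# Hypothetical tree for the dating example: meet only if he is under 30,
# at least average-looking, and a civil servant.
tree = Node("age<=30", {
    False: Node(label="don't meet"),
    True: Node("looks", {
        "below average": Node(label="don't meet"),
        "average or better": Node("civil servant", {
            False: Node(label="don't meet"),
            True: Node(label="meet"),
        }),
    }),
})

print(classify(tree, {"age<=30": True, "looks": "average or better", "civil servant": True}))
# -> meet
```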
As can be seen, the decision-making process of a decision tree is very intuitive and easy to understand. Decision trees have been applied successfully in many fields, such as medicine, manufacturing, astronomy, biology, and business. Now that we know what a decision tree is and how it is used for decision making, the algorithms for constructing one are described below.
3.3. Constructing a Decision Tree
Unlike the Bayesian algorithms, the construction of a decision tree does not rely on domain knowledge; it uses an attribute selection measure to choose the attribute that best partitions the tuples into distinct classes. Constructing a decision tree therefore amounts to applying an attribute selection measure to determine the topological structure among the feature attributes.
The key step in constructing a decision tree is attribute splitting. Splitting on an attribute means building different branches at a node according to the different values of some feature attribute, with the goal of making each split subset as "pure" as possible; "as pure as possible" means that the items in a split subset should, as far as possible, belong to the same category. Attribute splitting falls into three cases (a small code sketch after the list below illustrates them):
1. The attribute is discrete and a binary decision tree is not required. In this case, each distinct value of the attribute is used as a branch.
2. The attribute is discrete and a binary decision tree is required. In this case, a subset of the attribute's values is used for the test, and two branches are produced according to "belongs to this subset" and "does not belong to this subset".
3. The attribute is continuous. In this case, a value split_point is chosen as the split point, and two branches are produced according to > split_point and <= split_point.
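As an illustration only (the value lists, subset, and helper names below are hypothetical and not part of the original article), the three cases might be sketched like this:

```python
# Illustrative sketch of the three splitting cases described above.
# `values` is the list of values that attribute A takes in the data set D.

def split_discrete_multiway(values):
    # Case 1: discrete attribute, non-binary tree -> one branch per distinct value.
    return {v: f"branch for A == {v!r}" for v in set(values)}

def split_discrete_binary(values, subset):
    # Case 2: discrete attribute, binary tree -> test membership in a chosen subset.
    return {
        "in subset": [v for v in values if v in subset],
        "not in subset": [v for v in values if v not in subset],
    }

def split_continuous(values, split_point):
    # Case 3: continuous attribute -> two branches, > split_point and <= split_point.
    return {
        "> split_point": [v for v in values if v > split_point],
        "<= split_point": [v for v in values if v <= split_point],
    }

print(split_discrete_multiway(["s", "m", "l", "m"]))
print(split_discrete_binary(["s", "m", "l", "m"], subset={"s", "m"}))
print(split_continuous([0.1, 0.7, 0.4], split_point=0.5))
```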
The key element in constructing a decision tree, then, is the attribute selection measure. An attribute selection measure is a splitting criterion: a heuristic for "best" partitioning the class-labeled training data D into the individual classes, and it determines both the topological structure of the tree and the choice of split points (split_point).
There are many attribute selection measures; the tree is generally built top-down and recursively, using a greedy strategy with no backtracking. Two common algorithms, ID3 and C4.5, are introduced here.
3.3.1. The ID3 Algorithm
From information theory we know that the smaller the expected information, the greater the information gain and thus the higher the purity. The core idea of the ID3 algorithm is therefore to use information gain as the attribute selection measure and to split on the attribute that yields the greatest information gain after splitting. We first define a few concepts that will be used below.
Let D be the partition of the training tuples by class. Then the entropy of D is expressed as:

$\mathrm{info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
where p_i is the probability that the i-th class appears in the entire set of training tuples; it can be estimated as the number of elements belonging to that class divided by the total number of elements in the training set. The practical meaning of entropy is the average amount of information needed to identify the class label of a tuple in D.
Now assume that the training tuples D are partitioned by attribute A into v subsets D_1, ..., D_v. Then the expected information required to classify D after partitioning by A is:

$\mathrm{info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \, \mathrm{info}(D_j)$
The information gain is the difference between the two:

$\mathrm{gain}(A) = \mathrm{info}(D) - \mathrm{info}_A(D)$
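As a rough illustration of these definitions (the function names and the four-row toy data set below are hypothetical, not the 10-element table from the article), entropy and information gain could be computed like this:

```python
import math
from collections import Counter

def entropy(labels):
    """info(D): average information needed to identify the class label
    of a tuple in D, using base-2 logarithms."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(rows, labels, attribute_index):
    """gain(A) = info(D) - info_A(D), where info_A(D) is the expected
    information after partitioning D by attribute A."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute_index], []).append(label)
    expected = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - expected

# Hypothetical toy data: each row is (log density, friend density, real avatar).
rows = [("s", "s", "no"), ("m", "m", "yes"), ("l", "l", "yes"), ("m", "s", "no")]
labels = ["no", "yes", "yes", "no"]          # whether the account is genuine
print(info_gain(rows, labels, 0))            # gain of the log-density attribute
```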
Each time a split is needed, the ID3 algorithm computes the information gain of every attribute and then splits on the attribute with the largest information gain. Below we use an example of fake-account detection in an SNS community to illustrate how to construct a decision tree with ID3. For simplicity, assume the training set consists of 10 elements:
where s, m, and l denote small, medium, and large respectively.
Let L, F, H, and R denote log density, friend density, whether a real avatar is used, and whether the account is genuine, respectively. The information gain of each attribute is computed below.
Working through the formulas above, the information gain of the log density L comes out to 0.276.
Similarly, the information gains of H and F are 0.033 and 0.553, respectively.
Since F has the largest information gain, F is selected as the splitting attribute for the first split; the result of the split is shown in the figure below:
Applying the same method recursively to compute the splitting attribute of each child node, the entire decision tree can eventually be obtained.
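A compact recursive construction in this spirit might look like the following sketch (an illustrative outline of ID3 with hypothetical toy data, not the article's implementation):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())

def build_tree(rows, labels, attributes):
    """Recursive ID3 sketch: pick the attribute with the largest information
    gain, split on it, and recurse on each branch."""
    if len(set(labels)) == 1:             # subset is already "pure"
        return labels[0]
    if not attributes:                    # attributes exhausted: majority vote
        return Counter(labels).most_common(1)[0][0]

    def gain(a):
        parts = {}
        for row, lab in zip(rows, labels):
            parts.setdefault(row[a], []).append(lab)
        expected = sum(len(p) / len(labels) * entropy(p) for p in parts.values())
        return entropy(labels) - expected

    best = max(attributes, key=gain)
    remaining = [a for a in attributes if a != best]
    tree = {"attribute": best, "branches": {}}
    for value in set(row[best] for row in rows):
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*sub)
        tree["branches"][value] = build_tree(list(sub_rows), list(sub_labels), remaining)
    return tree

# Hypothetical toy data: (log density, friend density, real avatar) -> account genuine?
rows = [("s", "s", "no"), ("m", "m", "yes"), ("l", "l", "yes"), ("m", "s", "no")]
labels = ["no", "yes", "yes", "no"]
print(build_tree(rows, labels, attributes=[0, 1, 2]))
```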
To keep the example above simple, the feature attributes were discretized; in reality, log density and friend density are continuous attributes. For a continuous-valued feature attribute, the ID3 algorithm can proceed as follows:
First sort the elements of D by the value of the feature attribute. The midpoint of each pair of adjacent values is taken as a potential split point. Starting from the first potential split point, split D at each candidate and compute the expected information of the two resulting sets; the point with the minimum expected information is called the best split point of the attribute, and its expected information is taken as the expected information of this attribute.
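A sketch of this procedure (the values, labels, and function names below are hypothetical) could look like:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())

def best_split_point(values, labels):
    """Sort by the continuous attribute, consider the midpoint of each pair of
    adjacent values as a candidate split point, and return the candidate that
    minimizes the expected information of the two resulting subsets."""
    pairs = sorted(zip(values, labels))
    best, best_info = None, float("inf")
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        point = (v1 + v2) / 2
        left = [l for v, l in pairs if v <= point]
        right = [l for v, l in pairs if v > point]
        expected = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if expected < best_info:
            best, best_info = point, expected
    return best, best_info

# Hypothetical continuous attribute (e.g. a raw log-density value) and labels.
values = [0.1, 0.4, 0.5, 0.8, 0.9]
labels = ["no", "no", "yes", "yes", "yes"]
print(best_split_point(values, labels))   # -> (0.45, 0.0)
```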
3.3.2. The C4.5 Algorithm
One problem with ID3 is that it is biased toward multi-valued attributes. For example, if there is a unique identifier attribute such as ID, ID3 will choose it as the splitting attribute, which makes the partition perfectly pure but is almost useless for classification. C4.5, the successor of ID3, uses an extension of information gain known as the gain ratio to try to overcome this bias.
The C4.5 algorithm first defines the "split information", which can be expressed as:

$\mathrm{split\_info}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\!\left(\frac{|D_j|}{|D|}\right)$
where each symbol has the same meaning as in the ID3 algorithm. The gain ratio is then defined as:

$\mathrm{gain\_ratio}(A) = \frac{\mathrm{gain}(A)}{\mathrm{split\_info}_A(D)}$
C4.5 selects the attribute with the maximum gain ratio as the splitting attribute; its concrete use is otherwise similar to ID3 and is not repeated here.
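A minimal sketch of the gain ratio computation, using the same kind of hypothetical toy data as before (illustrative only, not the article's code):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())

def gain_ratio(rows, labels, attribute_index):
    """gain_ratio(A) = gain(A) / split_info(A); the split information
    penalizes attributes that partition the data into many small subsets."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute_index], []).append(label)
    expected = sum(len(p) / total * entropy(p) for p in partitions.values())
    gain = entropy(labels) - expected
    split_info = -sum(len(p) / total * math.log2(len(p) / total)
                      for p in partitions.values())
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical toy data, as before: (log density, friend density, real avatar).
rows = [("s", "s", "no"), ("m", "m", "yes"), ("l", "l", "yes"), ("m", "s", "no")]
labels = ["no", "yes", "yes", "no"]
print(gain_ratio(rows, labels, 2))   # gain ratio of the real-avatar attribute
```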
3.4. A Few Supplementary Notes on Decision Trees
3.4.1. What to Do When Attributes Are Exhausted
The following situation can arise during decision tree construction: all attributes have been used as splitting attributes, but some subsets are still not pure, i.e., the elements within them do not all belong to the same category. In this case, since no further information is available to split on, a "majority vote" is generally taken on such a subset: the most frequent category in the subset is used as the node's category, and the node is made a leaf node.
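A one-line sketch of this majority vote (illustrative only, with made-up labels):

```python
from collections import Counter

def majority_vote(labels):
    """When attributes are exhausted but a subset is still impure, label the
    leaf with the most frequent category in that subset."""
    return Counter(labels).most_common(1)[0][0]

print(majority_vote(["yes", "yes", "no"]))   # -> yes
```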
3.4.2. About Pruning
In practice, when a decision tree is constructed, pruning is usually performed to deal with the overfitting caused by noise and outliers in the data. There are two kinds of pruning:
Pre-pruning: during construction, when a node satisfies the pruning condition, construction of that branch is stopped immediately.
Post-pruning: the complete decision tree is constructed first, and then the tree is traversed and pruned according to certain conditions.
The specific pruning algorithms are not covered in detail here; interested readers can refer to the relevant literature.
Reprinted from: Algorithm Grocery Store - Decision Trees for Classification (Decision Tree)