Abstract: The decision tree is an important method for classification and regression, notable for its high readability and fast prediction speed. A decision tree is a tree-shaped structure that performs classification or regression through a series of if-then rules.
1. Definition of a Decision Tree
A tree structure should be familiar to everyone: it consists of two elements, nodes and edges, and comes with a few key terms: root node, parent node, child node, and leaf node.
Parent and child nodes are relative terms: a child node is produced by splitting its parent according to some rule, and each child in turn becomes a new parent and continues to split until it can be split no further. The root node is the node without a parent, i.e., the initial split point; leaf nodes are the nodes without children, as shown in the figure (not reproduced here).
A decision tree makes decisions with this tree structure: starting from the root node, the data is split again and again, and the leaf node finally reached gives the output.
2. How a Decision Tree Makes Decisions
Start with a simple classification example:
A bank uses a person's information to decide whether that person intends to take out a loan. The specific information is as follows:
| Occupation   | Age | Income | Education          | Loan? |
|--------------|-----|--------|--------------------|-------|
| Freelancer   | 28  | 5000   | High school        | Yes   |
| Worker       | 36  | 5500   | High school        | No    |
| Worker       | 42  | 2800   | Junior high school | Yes   |
| White collar | 45  | 3300   | Primary school     | Yes   |
| White collar | 25  | 10000  | Undergraduate      | Yes   |
| White collar | 32  | 8000   | Master             | No    |
| White collar | 28  | 13000  | Doctorate          | Yes   |
| Freelancer   | 21  | 4000   | Undergraduate      | No    |
| Freelancer   | 22  | 3200   | Primary school     | No    |
| Worker       | 33  | 3000   | High school        | No    |
| Worker       | 48  | 4200   | Primary school     | No    |
(Note: the data in the table above is fabricated by me and has no practical significance.)
The decision tree's approach is to use the tree structure to classify on one attribute at a time, until the desired result is obtained or the data can no longer be split, as shown in the figure (not reproduced here).
From our training data we can obtain the decision tree above; to analyze whether a customer has loan intention, we simply run the customer's information through the tree.
If a customer's information is {occupation, age, income, education} = {worker, 39, 1800, primary school}, feeding it into the decision tree gives the following steps and conclusion:
Step 1: judge by the customer's occupation and take the "worker" branch.
Step 2: judge by the customer's age and take the "<= 40" branch.
Step 3: judge by the customer's education, take the "primary school" branch, and conclude that the customer has no loan intention.
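The walk through the tree is just a chain of if-then rules. A minimal sketch in Python: only the "worker" branch is spelled out in the text, so the other branches below are illustrative placeholders, not part of the original tree.

```python
def predict_loan(customer):
    """Follow the decision tree: occupation first, then age, then education.
    Only the 'worker' branch is given in the text; the other returns are
    illustrative placeholders."""
    if customer["occupation"] == "worker":
        if customer["age"] <= 40:
            # leaf: a young worker with only primary-school education
            # is judged to have no loan intention
            return customer["education"] != "primary school"
        return True   # placeholder branch
    return True       # placeholder branch

customer = {"occupation": "worker", "age": 39,
            "income": 1800, "education": "primary school"}
print(predict_loan(customer))  # → False (no loan intention)
```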
3. How to Build a Decision Tree
Building a decision tree is a process of repeatedly splitting the data; the steps are as follows:
Step 1: treat all the data as a single node and go to step 2;
Step 2: select one feature from all the data features, split the node into several child nodes, and go to step 3;
Step 3: examine each child node; if it meets a stopping condition, go to step 4; otherwise, go back to step 2;
Step 4: mark the node as a leaf node whose output is the class with the most samples in that node.
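The four steps above can be sketched as a recursive procedure. This is a minimal sketch: the split feature is simply the first one remaining (a real tree picks it by information gain, as discussed below), and the threshold `min_samples` is an illustrative choice.

```python
from collections import Counter

def build_tree(rows, features, min_samples=2):
    """Steps 1-4: stop when the node is pure, too small, or out of
    features; otherwise split on a feature and recurse on each child."""
    labels = [r["label"] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or len(rows) < min_samples or not features:
        return {"leaf": True, "label": majority}  # step 4: majority class
    feature = features[0]  # placeholder choice of splitting attribute
    children = {}
    for value in set(r[feature] for r in rows):   # step 2: split the node
        subset = [r for r in rows if r[feature] == value]
        children[value] = build_tree(subset, features[1:], min_samples)
    return {"leaf": False, "feature": feature, "children": children}
```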
As the steps above show, three important issues arise while the tree is generated:
(1) How the data is split
(2) How to choose the splitting attribute
(3) When to stop splitting
3.1 Data Splitting
Attributes come in two kinds, discrete and continuous. For a discrete attribute, the node is split by attribute value, one child node per value. For a continuous attribute, the usual practice is to sort the data by that attribute and then divide it into intervals such as [0,10], (10,20], (20,30], ...; each interval corresponds to one child node, and a record belongs to the node whose interval contains its attribute value.
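For the continuous case, the interval lookup can be done with a binary search over the right-closed boundary points; the boundaries 20 and 40 below match the age example that follows.

```python
import bisect

def bin_value(x, boundaries=(20, 40)):
    """Map a continuous attribute value to its interval index:
    0 -> [0, 20], 1 -> (20, 40], 2 -> (40, infinity).
    bisect_left keeps a value equal to a boundary in the lower,
    right-closed interval."""
    return bisect.bisect_left(boundaries, x)

print(bin_value(15), bin_value(40), bin_value(42))  # → 0 1 2
```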
Example:
Table 3.1 Classification information table
| Occupation   | Age | Loan? |
|--------------|-----|-------|
| White collar | 30  | No    |
| Worker       | 40  | No    |
| Worker       | 20  | No    |
| Student      | 15  | No    |
| Student      | 18  | Yes   |
| White collar | 42  | Yes   |
(1) Attribute 1 (occupation) is a discrete variable with three values: white collar, worker, and student. Splitting the original data on these three values gives the table below:
Table 3.2 Split on attribute 1
| Value        | Loans | No loans |
|--------------|-------|----------|
| White collar | 1     | 1        |
| Worker       | 0     | 2        |
| Student      | 1     | 1        |
Represented as a decision tree, the structure is as shown in the figure (not reproduced here).
(2) Attribute 2 (age) is a continuous variable. Here the data is divided into three intervals, [0,20], (20,40], and (40,∞); the counts for each interval are as follows:
| Interval | Loans | No loans |
|----------|-------|----------|
| [0,20]   | 1     | 2        |
| (20,40]  | 0     | 2        |
| (40,∞)   | 1     | 0        |
Represented as a decision tree, the structure is as shown in the figure (not reproduced here).
3.2 Choosing the Splitting Attribute
The decision tree splits greedily; that is, at each step it chooses the attribute that yields the best split result. So what is the best split result? Ideally we could find an attribute that separates the classes completely, but in most cases this cannot be done in one step, so instead we want the data in the child nodes to be as "pure" as possible after each split. For example:
As can be seen from Figures 3.1 and 3.2 (the two split trees above), the child nodes produced by splitting on attribute 2 are clearly purer than those produced by splitting on attribute 1: after the attribute-1 split, the white-collar and student children each still contain the two classes in equal numbers, no improvement over the root node; after the attribute-2 split, the (20,40] child outputs "no loan" and the (40,∞) child outputs "loan" unambiguously.
Choosing the splitting attribute therefore means finding the attribute that makes the data in all child nodes as pure as possible. Decision trees use the information gain or the information gain ratio as the basis for this choice.
(1) Information gain
The information gain expresses the change in data complexity between the node before splitting and its child nodes after splitting:

Info_Gain = Gain − Σ_i (N_i / N) · Gain_i

where Gain denotes the complexity of the node before splitting, Gain_i the complexity of child node i, N the number of records in the node, and N_i the number in child node i; the larger the value of Gain, the higher the complexity. In plain terms, the information gain is the data complexity before splitting minus the size-weighted complexity of the child nodes: the larger the information gain, the more the complexity decreases after splitting and the better the classification effect.
The complexity of a node can be calculated in two different ways:
a) Entropy
Entropy describes the disorder of the data: the larger the entropy, the more disordered, i.e., the less pure, the data; conversely, the smaller the entropy, the purer the data. The entropy is calculated as:

Entropy = −Σ_i p_i · log2(p_i)

where p_i is the proportion of class i among the node's records. Taking a two-class problem as an example: if the two classes are equally represented, the node's purity is at its lowest and the entropy equals 1; if all of the node's data belongs to one class, the node's purity is at its highest and the entropy equals 0.
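The entropy just described can be computed directly from a node's class counts:

```python
import math

def entropy(counts):
    """Entropy = -sum(p_i * log2(p_i)) over the class proportions p_i;
    empty classes contribute nothing."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([3, 3]))              # equal classes, lowest purity → 1.0
print(round(entropy([2, 4]), 3))    # root node of Table 3.1 → 0.918
```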
b) Gini value
The Gini value is calculated as (Equation 3.3):

Gini = 1 − Σ_i p_i²

As with the two-class entropy example above: when the two classes are equally represented, the Gini value equals 0.5, and when all of the node's data belongs to one class, the Gini value equals 0. The larger the Gini value, the less pure the data.
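The Gini value can be computed the same way:

```python
def gini(counts):
    """Gini = 1 - sum(p_i^2); 0 for a pure node, 0.5 for an even
    two-class split."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([3, 3]), gini([6, 0]))  # → 0.5 0.0
```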
In what follows, entropy is used as the complexity statistic. For the root node of Table 3.1 (2 loans, 4 no-loans):

Entropy(root) = −(2/6)·log2(2/6) − (4/6)·log2(4/6) ≈ 0.918

Attribute 1:

Info_Gain = 0.918 − [(2/6)·1 + (2/6)·0 + (2/6)·1] ≈ 0.918 − 0.667 = 0.252

Attribute 2:

Info_Gain = 0.918 − [(3/6)·0.918 + (2/6)·0 + (1/6)·0] ≈ 0.918 − 0.459 = 0.459

Because the information gain of attribute 2 is larger, attribute 2 is the better splitting attribute and is selected.
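The two gains above can be reproduced directly from the class counts in the split tables:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, child_counts):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in child_counts)
    return entropy(parent_counts) - weighted

# (loans, no-loans) per child node, taken from the two split tables
gain1 = info_gain([2, 4], [[1, 1], [0, 2], [1, 1]])  # attribute 1: occupation
gain2 = info_gain([2, 4], [[1, 2], [0, 2], [1, 0]])  # attribute 2: age intervals
print(round(gain1, 3), round(gain2, 3))  # → 0.252 0.459
```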
(2) Information gain ratio
Using information gain as the splitting criterion has one unavoidable drawback: it tends to favor attributes that split into many branches. To solve this problem, the information gain ratio is introduced: the information gain divided by the split information of the node:

Gain_Ratio = Info_Gain / Split_Info

Here Info_Gain is the information gain defined above, and Split_Info reflects how the data volume is distributed across the child nodes:

Split_Info = −Σ_{i=1}^{m} (N_i / N) · log2(N_i / N)

where m is the number of child nodes, N the number of records in the parent node, and N_i the number in child node i; Split_Info is in effect the entropy of the split itself. The higher the information gain ratio, the better the splitting effect.
Continuing the information gain example above (child sizes 2/2/2 for attribute 1 and 3/2/1 for attribute 2):
Attribute 1: Gain_Ratio = 0.252 / log2(3) ≈ 0.252 / 1.585 ≈ 0.159
Attribute 2: Gain_Ratio = 0.459 / 1.459 ≈ 0.315
Because the gain ratio of attribute 2 is larger, attribute 2 is selected as the splitting attribute.
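The gain ratio calculation only adds the split-information denominator:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(parent_counts, child_counts):
    """Information gain divided by the entropy of the child-size
    distribution (the split information)."""
    n = sum(parent_counts)
    gain = entropy(parent_counts) - sum(
        sum(child) / n * entropy(child) for child in child_counts)
    split_info = entropy([sum(child) for child in child_counts])
    return gain / split_info

gr1 = gain_ratio([2, 4], [[1, 1], [0, 2], [1, 1]])  # attribute 1
gr2 = gain_ratio([2, 4], [[1, 2], [0, 2], [1, 0]])  # attribute 2
print(round(gr1, 3), round(gr2, 3))  # → 0.159 0.315
```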
3.3 Conditions to Stop Splitting
(1) The node contains too few records
When the amount of data in a node falls below a certain number, it is not split further. There are two reasons: first, with little data a split is more likely to fit noise; second, it limits the growth and complexity of the tree. Terminating the splitting early in this way also helps reduce overfitting.
(2) The entropy or Gini value is below a threshold
As shown above, the entropy and the Gini value measure the complexity of the data; when the entropy or Gini value is small enough, the data is already quite pure and there is no need to split further.
(3) All features have been used and splitting cannot continue
This is a passive stopping condition: when there are no attributes left to split on, the current node is simply made a leaf node.
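The three stopping conditions can be combined into one check; the thresholds `min_samples` and `min_entropy` below are illustrative choices, not values from the text.

```python
import math
from collections import Counter

def node_entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def should_stop(labels, remaining_features, min_samples=5, min_entropy=0.1):
    """Stop splitting if the node is too small (1), nearly pure (2),
    or there are no features left to split on (3)."""
    return (len(labels) < min_samples
            or node_entropy(labels) < min_entropy
            or len(remaining_features) == 0)

print(should_stop(["yes"] * 10, ["age"]))       # pure node → True
print(should_stop(["yes", "no"] * 5, ["age"]))  # mixed, large enough → False
```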