Decision Tree Summary (i)

Abstract: A decision tree is an important method for classification and regression, characterized by strong readability and fast classification speed. A decision tree is a tree structure that performs classification or regression through a series of if-then rules.

1. Definition of a decision tree

A tree structure should be familiar to everyone: it is built from two elements, nodes and edges. A few keywords matter here: root node, parent node, child node, and leaf node.

Parent and child nodes are relative terms: plainly speaking, child nodes are obtained by splitting a parent node according to some rule, and each child node then continues to split as a new parent node until it can no longer be split. The root node is the node with no parent, that is, the initial split point; a leaf node is a node with no children, as shown in the figure below:

A decision tree uses this tree structure to make decisions: starting from the root node, the data is split again and again, and the leaf node finally reached gives the output.

2. How decision trees make decisions

Let's start with a simple classification example:

A bank uses a person's information to determine whether that person is interested in taking out a loan. The specific information is as follows:

Occupation   | Age | Income | Education          | Loan?
-------------|-----|--------|--------------------|------
Freelancer   | 28  | 5000   | High school        | Yes
Worker       | 36  | 5500   | High school        | No
Worker       | 42  | 2800   | Junior high school | Yes
White collar | 45  | 3300   | Primary school     | Yes
White collar | 25  | 10000  | Undergraduate      | Yes
White collar | 32  | 8000   | Master             | No
White collar | 28  | 13000  | Doctorate          | Yes
Freelancer   | 21  | 4000   | Undergraduate      | No
Freelancer   | 22  | 3200   | Primary school     | No
Worker       | 33  | 3000   | High school        | No
Worker       | 48  | 4200   | Primary school     | No

(Note: the data in the above table was made up by the author and has no practical significance.)

The decision tree's approach is to use a tree structure, classifying on one attribute at a time, until we get the result we want or the data can no longer be divided, as shown in the figure below:

From the training data we can build the decision tree above. If we want to know whether a customer intends to take out a loan, we can read the result directly from the tree using the customer's information.

Suppose a customer's information is {occupation, age, income, education} = {Worker, 39, 1800, Primary school}. Feeding this information into the decision tree gives the following steps and conclusion.

Step 1: Judge by the customer's occupation and follow the "Worker" branch.

Step 2: Judge by the customer's age and follow the "<= 40" branch.

Step 3: Judge by the customer's education, follow the "Primary school" branch, and conclude that the customer has no intention of taking out a loan.
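As a rough illustration, here is a minimal hand-coded version of this lookup in Python. Only the "Worker" path is spelled out in the text above; the outcomes on the other branches are hypothetical placeholders added for the example.

```python
def predict_loan_intention(customer):
    """Walk the hand-built loan decision tree described above.

    Only the "Worker" branch follows the path given in the text;
    the other outcomes are hypothetical placeholders.
    """
    if customer["occupation"] == "Worker":          # step 1: split on occupation
        if customer["age"] <= 40:                   # step 2: split on age
            if customer["education"] == "Primary school":  # step 3: split on education
                return "No"    # no loan intention, as in the walkthrough
            return "Yes"       # assumed outcome for other education levels
        return "No"            # assumed outcome for age > 40
    return "Unknown"           # branches for other occupations are not given

customer = {"occupation": "Worker", "age": 39,
            "income": 1800, "education": "Primary school"}
print(predict_loan_intention(customer))  # -> No
```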

3. How to build a decision tree

Constructing a decision tree is a process of repeatedly splitting the data; the steps are as follows:

Step 1: Consider all the data as a node and go to step 2;

Step 2: Select one feature from all the data features, split the node on it, and go to step 3;

Step 3: Generate several child nodes and examine each of them; if a child node meets the conditions to stop splitting, go to step 4; otherwise, go back to step 2;

Step 4: Mark the node as a leaf node; its output is the category that accounts for the largest share of records in that node.
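A minimal recursive sketch of these four steps in Python, assuming each record is a plain dict; choose_split_feature and should_stop are stand-in helpers here, since attribute selection and stopping conditions are covered in sections 3.2 and 3.3.

```python
from collections import Counter

def build_tree(rows, labels, features, choose_split_feature, should_stop):
    """Recursively build a tree following steps 1-4 above."""
    # Step 4: stop splitting and output the majority class of this node.
    if should_stop(labels, features):
        return Counter(labels).most_common(1)[0][0]

    # Step 2: choose one feature to split the node on.
    feature = choose_split_feature(rows, labels, features)

    # Step 3: one child node per value of the chosen (discrete) feature.
    node = {"feature": feature, "children": {}}
    remaining = [f for f in features if f != feature]
    for value in {row[feature] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[feature] == value]
        node["children"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            remaining, choose_split_feature, should_stop)
    return node
```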

As you can see from the above steps, there are three important issues in the tree-generation process:

(1) How the data is split

(2) How to choose the splitting attribute

(3) When to stop splitting

3.1 Data Segmentation

Attributes fall into two cases, discrete and continuous. For a discrete attribute, the data is split by attribute value, with one child node per value. For a continuous attribute, the usual practice is to sort the data by that attribute and then divide it into intervals such as [0,10], (10,20], (20,30], ...; each interval corresponds to a node, and a record belongs to the node whose interval contains its attribute value.
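A minimal sketch of both kinds of split, assuming each record is a Python dict; the interval edges are whatever boundaries the modeller chooses (the values below are just examples).

```python
from collections import defaultdict
import bisect

def split_discrete(rows, attribute):
    """One child group per attribute value, e.g. per occupation."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row)
    return dict(groups)

def split_continuous(rows, attribute, edges):
    """Group rows by which interval their value falls into; for example,
    edges=[10, 20, 30] gives (-inf,10], (10,20], (20,30], (30,+inf)."""
    groups = defaultdict(list)
    for row in rows:
        groups[bisect.bisect_left(edges, row[attribute])].append(row)
    return dict(groups)

rows = [{"occupation": "Worker", "age": 36},
        {"occupation": "White collar", "age": 25}]
print(split_discrete(rows, "occupation"))
print(split_continuous(rows, "age", edges=[20, 40]))
```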

Example:

Table 3.1 Classification Information Table

Occupation   | Age | Loan?
-------------|-----|------
White collar | 30  | No
Worker       | 40  | No
Worker       | 20  | No
Student      | 15  | No
Student      | 18  | Yes
White collar | 42  | Yes

(1) Attribute 1 (occupation) is a discrete variable with three values: white collar, worker, and student. Splitting the original data on these three values gives the following table:

Table 3.2 Data split on attribute 1

Value        | Loans | No loans
-------------|-------|---------
White collar | 1     | 1
Worker       | 0     | 2
Student      | 1     | 1

Represented as a decision tree, the structure is as follows:

(2) Attribute 2 (age) is a continuous variable. Here the data is divided into three intervals, [0,20], (20,40], and (40,+∞); the counts for each interval are as follows:

Interval | Loans | No loans
---------|-------|---------
[0,20]   | 1     | 2
(20,40]  | 0     | 2
(40,+∞)  | 1     | 0

The structure that is represented as a decision tree is as follows:
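As a quick check, the counts in the two tables above can be reproduced from the Table 3.1 data with a short sketch (the data is entered by hand from the table):

```python
from collections import Counter
import bisect

# Table 3.1 entered by hand: (occupation, age, loan?)
data = [
    ("White collar", 30, "No"),
    ("Worker",       40, "No"),
    ("Worker",       20, "No"),
    ("Student",      15, "No"),
    ("Student",      18, "Yes"),
    ("White collar", 42, "Yes"),
]

# Split on attribute 1 (occupation): loan / no-loan counts per value.
by_occupation = Counter((occ, loan) for occ, age, loan in data)
print(by_occupation)  # ('Worker', 'No'): 2, etc. -- matches Table 3.2

# Split on attribute 2 (age) into the intervals [0,20], (20,40], (40,+inf).
edges = [20, 40]
by_interval = Counter((bisect.bisect_left(edges, age), loan)
                      for occ, age, loan in data)
print(by_interval)  # bin 0 = [0,20], bin 1 = (20,40], bin 2 = (40,+inf)
```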

3.2 Selection of the split attribute

The decision tree splits greedily, that is, at each step it chooses the attribute that gives the best split. So what counts as the best split? Ideally we would find an attribute that separates the classes completely, but in most cases this cannot be done in one step; instead, we want the data in the child nodes to be as "pure" as possible after each split. For example:

As can be seen from Figures 3.1 and 3.2, the child nodes obtained by splitting on attribute 2 are clearly purer than those obtained by splitting on attribute 1: after splitting on attribute 1, each child node still contains the two classes in equal numbers, which is no improvement over the root node; after splitting on attribute 2, the first child node outputs class 1 and the second child node outputs class 2.

Selecting the split attribute therefore means finding the attribute that makes the data in all child nodes as pure as possible.

Decision trees use information gain or information gain rate as the basis for selecting attributes.

(1) Information gain

The information gain measures the difference between the data complexity before the split and the data complexity of the child nodes after the split. It can be written as

    Gain = I(parent) − Σ_j ( N(v_j) / N ) × I(v_j)

where I(·) denotes the complexity (impurity) of a node: the larger I is, the higher the complexity. N is the number of records in the node being split and N(v_j) is the number of records in child node v_j. Plainly speaking, the information gain is the data complexity before splitting minus the weighted data complexity of the child nodes; the greater the information gain, the more the complexity drops after the split and the more effective the classification.

The complexity of the nodes can be calculated in the following two different ways:

a) Entropy

Entropy describes how mixed the data is: the larger the entropy, the more mixed the data and the lower its purity; conversely, the smaller the entropy, the less mixed the data and the higher its purity. Entropy is calculated as

    Entropy = − Σ_i p_i × log2(p_i)

where p_i is the proportion of class i among the records in the node. Taking a two-class problem as an example: if the two classes contain the same number of records, the node's purity is at its lowest and the entropy equals 1; if all of the node's data belongs to the same class, the node's purity is at its highest and the entropy equals 0.
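A minimal sketch of this calculation (log base 2, which matches the values 1 and 0 quoted above):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a node: -sum(p_i * log2(p_i)) over the class proportions."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(entropy(["Yes", "No"]))           # two balanced classes -> 1.0
print(entropy(["Yes", "Yes", "Yes"]))   # a pure node -> 0 (Python may print -0.0)
```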

b) Gini value

The Gini value is calculated as shown in formula 3.3:

    Gini = 1 − Σ_i p_i²

where p_i is again the proportion of class i in the node. Similarly to the two-class entropy example above, when the two classes contain equal numbers of records the Gini value equals 0.5, and when all of the node's data belongs to the same class the Gini value equals 0. The larger the Gini value, the less pure the data.
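And a matching sketch for the Gini value:

```python
from collections import Counter

def gini(labels):
    """Gini value of a node: 1 - sum(p_i^2) over the class proportions."""
    counts = Counter(labels)
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

print(gini(["Yes", "No"]))           # two balanced classes -> 0.5
print(gini(["Yes", "Yes", "Yes"]))   # a pure node -> 0.0
```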

In what follows, entropy is used as the complexity measure:

Attribute 1:

Attribute 2:

From these values, attribute 1 has the larger information gain and appears to be the better split attribute, so attribute 1 is selected for the split.
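As a concrete illustration of the formula, the same computation can be run on the Table 3.1 data from section 3.1; note that this is a different data set from the figure example above, so the ranking of the two attributes need not match.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Parent entropy minus the weighted entropy of the child nodes."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

# Labels from Table 3.1 and the two candidate splits from section 3.1.
labels = ["No", "No", "No", "No", "Yes", "Yes"]
occupation_split = [["No", "Yes"], ["No", "No"], ["No", "Yes"]]  # Table 3.2
age_split = [["No", "No", "Yes"], ["No", "No"], ["Yes"]]         # age intervals

print(information_gain(labels, occupation_split))  # ~0.25
print(information_gain(labels, age_split))         # ~0.46
```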

(2) Information gain rate

Using information gain as the criterion for choosing the split has one unavoidable drawback: it tends to favour attributes that split into many branches. The information gain rate is introduced to solve this problem. The information gain rate is the information gain divided by a measure of how the split distributes the data over the child nodes, calculated as

    GainRatio = Gain / SplitInfo

where Gain is the information gain defined above and SplitInfo measures how the data volume of the split node is distributed over the child nodes. It is calculated as

    SplitInfo = − Σ_{j=1..m} ( N(v_j) / N ) × log2( N(v_j) / N )

where m is the number of child nodes, N is the amount of data in the parent node being split, and N(v_j) is the amount of data in child node v_j; SplitInfo is in fact the entropy of the split itself. The higher the information gain rate, the better the splitting effect.
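A minimal sketch of the gain-rate calculation; the two splits below are hypothetical, constructed only to show how a split with many small branches is penalized relative to a two-way split with the same information gain.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_info(groups):
    """Entropy of the split itself: -sum( N(v_j)/N * log2(N(v_j)/N) )."""
    n = sum(len(g) for g in groups)
    return -sum(len(g) / n * math.log2(len(g) / n) for g in groups if g)

def gain_ratio(labels, groups):
    """Information gain divided by the split information."""
    n = len(labels)
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)
    si = split_info(groups)
    return gain / si if si > 0 else 0.0

# Two hypothetical splits of the same 8 records: eight tiny branches vs. two.
labels = ["A"] * 4 + ["B"] * 4
many_branches = [["A"], ["A"], ["A"], ["A"], ["B"], ["B"], ["B"], ["B"]]
two_branches = [["A", "A", "A", "A"], ["B", "B", "B", "B"]]

print(gain_ratio(labels, many_branches))  # gain 1.0, split info 3.0 -> ~0.33
print(gain_ratio(labels, two_branches))   # gain 1.0, split info 1.0 -> 1.0
```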

Returning to the same example, now using the information gain rate:

Attribute 1:

Attribute 2:

From these values, attribute 2 now has the higher information gain rate, so attribute 2 is selected as the split attribute.

3.3 Conditions to stop splitting

(1) The node contains too few records

When the amount of data in a node falls below a certain number, the node is not split further. There are two reasons: first, when the amount of data is small, a split is more likely to fit noise in the data; second, it reduces the complexity of growing the tree. Terminating the splitting early in this way also helps reduce overfitting.

(2) The entropy or Gini value is below a threshold

As discussed above, the entropy and the Gini value measure the complexity of the data; when the entropy or Gini value is sufficiently small, the data in the node is already quite pure, so there is little benefit in splitting it further.

(3) All features have already been used and the node cannot be split further

This is a passive stopping condition: when there are no attributes left to split on, the current node is simply made a leaf node.
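Putting the three conditions together, a minimal stopping check that could be plugged into the build_tree sketch from section 3; min_samples and impurity_threshold are arbitrary example values, not values from the article.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def should_stop(labels, features, min_samples=5, impurity_threshold=0.1):
    """Stop splitting when any of the three conditions above holds."""
    if len(labels) < min_samples:             # (1) too few records in the node
        return True
    if entropy(labels) < impurity_threshold:  # (2) node is already pure enough
        return True
    if not features:                          # (3) no attributes left to split on
        return True
    return False
```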
