"Python data Mining" decision tree

Source: Internet
Author: User
Tags: ID3


Definition of a decision tree

A decision tree is a tree structure (either a binary or a non-binary tree). Each non-leaf node represents a test on a feature attribute, each branch represents one output of that test over a range of values, and each leaf node holds a category. Making a decision with a decision tree means starting from the root node, testing the corresponding feature attribute of the item to be classified, choosing an output branch according to its value, and repeating this until a leaf node is reached; the category stored at that leaf node is the decision result.

A tree is a structure consisting of two elements, nodes and edges. To understand a tree you need to understand a few key terms: root node, parent node, child node, and leaf node.

Parent and child nodes are relative: a parent node is split into child nodes according to some rule, and each child node in turn becomes a new parent node and continues to split until it can no longer be split.

The root node is the node with no parent, i.e. the initial node of the splitting; a leaf node is a node with no child nodes.

A decision tree uses this tree structure to make decisions: each non-leaf node is a judgment condition and each leaf node is a conclusion. Starting from the root node, a conclusion is reached after a series of judgments.

 

How decision trees make decisions

Case study: predicting whether you can go out to play today

  

Step 1: Treat all of the data as a single node and go to step 2;

Step 2: Select one feature from the data's features and split the node on it, then go to step 3;

Step 3: Generate a number of child nodes and check each one; if a child node satisfies the stopping condition, go to step 4; otherwise go back to step 2;

Step 4: Set the node as a leaf node whose output is the category with the largest number of samples in that node.
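As an end-to-end illustration of these steps, here is a minimal sketch (not from the original article) that assumes scikit-learn and pandas are installed and uses a tiny, made-up weather table for the play/no-play case study:

# Minimal sketch of the play/no-play decision, assuming scikit-learn and pandas.
# The tiny weather table below is made up for illustration only.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "outlook": ["sunny", "sunny", "overcast", "rainy", "rainy", "overcast"],
    "windy":   [False, True, False, False, True, True],
    "play":    ["no", "no", "yes", "yes", "no", "yes"],
})

# One-hot encode the discrete features so the tree can split on them.
X = pd.get_dummies(data[["outlook", "windy"]])
y = data["play"]

clf = DecisionTreeClassifier(criterion="entropy")  # entropy-based splits, ID3-style
clf.fit(X, y)

# Predict for a new day: sunny and not windy.
new_day = pd.get_dummies(pd.DataFrame([{"outlook": "sunny", "windy": False}]))
new_day = new_day.reindex(columns=X.columns, fill_value=0)
print(clf.predict(new_day))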

  

Data segmentation

The data types of split attributes are divided into discrete and continuous cases.

For discrete attributes, the data is split by attribute value: each attribute value corresponds to one child node;

For continuous attributes, the usual practice is to sort the data by that attribute and then divide it into intervals such as [0,10], [10,20], [20,30], and so on; each interval corresponds to one node, and if a sample's attribute value falls into an interval, the sample belongs to the corresponding node.
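As a small illustration (not from the original article; it assumes pandas is available), the snippet below bins a continuous attribute into the intervals mentioned above, where each bin would correspond to one child node:

import pandas as pd

# A continuous attribute is divided into intervals such as (0, 10], (10, 20], (20, 30];
# each interval corresponds to one child node of the split.
values = pd.Series([3, 27, 15, 8, 22, 11])
intervals = pd.cut(values, bins=[0, 10, 20, 30])
print(intervals.value_counts().sort_index())   # each of the three intervals receives 2 samples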

Selection of Split attributes

Decision trees split greedily: at each step, the attribute that yields the best split is chosen.

A metric for measuring purity needs to be symmetric (e.g. 4 Yes / 0 No should score the same as 0 Yes / 4 No).

a) Entropy

Entropy describes how disordered the data is: the greater the entropy, the higher the disorder and the lower the purity; conversely, the smaller the entropy, the lower the disorder and the higher the purity.

In the entropy formula H(S) = -Σ pi * log2(pi), pi is the proportion of samples that belong to class i.

Take a two-class problem as an example: if the two classes contain the same number of samples, the node's purity is lowest and the entropy equals 1; if all of the node's data belongs to the same class, the node's purity is highest and the entropy equals 0.

Example:

H(S) = -P(Yes) * log2 P(Yes) - P(No) * log2 P(No)

3 Yes / 3 No: H(S) = -(3/6) * log2(3/6) - (3/6) * log2(3/6) = 1

4 Yes / 0 No: H(S) = -(4/4) * log2(4/4) - (0/4) * log2(0/4) = 0 (taking 0 * log2(0) = 0)
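A small Python helper (a sketch, not from the original article) reproduces this entropy calculation, using base-2 logarithms and the convention that 0 * log2(0) = 0:

from math import log2

def entropy(counts):
    """Entropy of a node, given the number of samples in each class."""
    total = sum(counts)
    ps = [c / total for c in counts if c > 0]    # skip empty classes: 0 * log(0) -> 0
    return sum(p * log2(1 / p) for p in ps)      # same as -sum(p * log2(p))

print(entropy([3, 3]))   # 3 Yes / 3 No -> 1.0, lowest purity
print(entropy([4, 0]))   # 4 Yes / 0 No -> 0.0, highest purity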

b) Information gain

Information gain measures the change in data complexity (entropy) between a node before splitting and its child nodes after splitting. The formula is:

Gain(S, A) = H(S) - Σ ( |Sv| / |S| ) * H(Sv)

where S is the sample set at the node, A is the split attribute, and Sv is the subset of samples whose value of A is v.

In other words, information gain is the complexity of the data before splitting minus the weighted complexity of the data in the child nodes; the greater the information gain, the more the complexity decreases after splitting and the more effective the classification.
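Building on the entropy helper above, here is a minimal sketch of the information gain calculation (class counts are passed in directly for the parent node and for each child node):

def information_gain(parent_counts, children_counts):
    """Entropy before the split minus the weighted entropy of the child nodes."""
    total = sum(parent_counts)
    weighted_children = sum(
        sum(child) / total * entropy(child) for child in children_counts
    )
    return entropy(parent_counts) - weighted_children

# Splitting 3 Yes / 3 No into two pure children (3/0 and 0/3) removes all disorder,
# so the gain equals the full parent entropy of 1.0.
print(information_gain([3, 3], [[3, 0], [0, 3]]))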

Common algorithms

There are three main decision tree construction algorithms: ID3, C4.5, and CART. ID3 and C4.5 build classification trees, while CART builds classification and regression trees.

ID3 algorithm:

1) Compute the information gain of every candidate attribute for the current sample set;

2) Select the attribute with the largest information gain as the test attribute, and group the samples that share the same value of the test attribute into one subset;

3) If the samples in a subset all belong to a single class, create a leaf node, label it with that class, and return to the caller; otherwise, apply this algorithm recursively to the subset.
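The following is a rough Python sketch of this ID3 recursion (a simplified illustration, not the article's code; it handles discrete attributes only and reuses the entropy and information_gain helpers defined above):

from collections import Counter

def id3(rows, attributes, target):
    """rows: list of dicts, attributes: candidate attribute names, target: class key."""
    labels = [row[target] for row in rows]
    # Stop: all samples share one class, or no attributes are left to split on.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]        # leaf: majority class

    # Information gain of splitting the current rows on one attribute.
    def gain(attr):
        parent = list(Counter(labels).values())
        children = [
            list(Counter(r[target] for r in rows if r[attr] == v).values())
            for v in set(r[attr] for r in rows)
        ]
        return information_gain(parent, children)

    # Choose the attribute with the largest information gain and recurse.
    best = max(attributes, key=gain)
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = id3(subset, remaining, target)
    return tree

Called on a list of sample dictionaries, e.g. id3(rows, ["outlook", "windy"], "play"), it returns a nested dictionary describing the tree.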

Overfitting

The decision tree generated by the algorithm above often overfits in practice: the tree achieves a very low error rate on the training data but a very high error rate when applied to test data. The causes of overfitting include:

Noisy data: the training data contains noise, and some nodes of the decision tree use that noise as a split criterion, so the tree no longer represents the real data.

Lack of representative data: the training data does not contain all representative cases, for example splitting on a condition such as the date, so that one class of data is matched poorly; this can be detected by examining the confusion matrix.

Multiple comparisons: given enough candidate splits, n data points can always be divided into n groups that are 100% pure.

Optimization approaches:

1. Reduce unnecessary splits: only allow a split if the purity it produces exceeds a specified percentage threshold.

2. Branch pruning (using a validation set)

For example, if the data set contains 14 samples, use 10 as training data and 4 as the validation set. After the decision tree has been built, prune a branch manually and use the validation set to check whether classification accuracy is preserved, i.e. whether the branch is superfluous.
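A related, concrete way to do this with scikit-learn is sketched below; it is not the manual branch removal described above, but applies the same validation-set idea to scikit-learn's cost-complexity pruning path (the iris toy dataset stands in for the 14-sample example):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hold out part of the data as a validation set.
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Each ccp_alpha on the pruning path corresponds to a progressively more pruned tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score >= best_score:            # keep the most pruned tree that validates best
        best_alpha, best_score = alpha, score
print(best_alpha, best_score)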

C4.5 algorithm

One problem with the ID3 algorithm is that it favors multi-valued attributes.

For example, if the data has a unique identifier attribute such as an ID, ID3 will choose it as the splitting attribute. This makes the partition perfectly pure, but it is almost useless for classification.

ID3's successor, C4.5, uses the gain ratio, an extension of information gain, to try to overcome this bias.

The symbols have the same meaning as in the ID3 algorithm; the gain ratio is defined as:

GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A), where SplitInfo(S, A) = -Σ ( |Sv| / |S| ) * log2( |Sv| / |S| )

C4.5 chooses the attribute with the largest gain ratio as the splitting attribute; its use is otherwise similar to ID3 and is not repeated here.
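A minimal sketch of the gain ratio, again reusing the earlier entropy and information_gain helpers (the split information is simply the entropy of the child subset sizes):

def gain_ratio(parent_counts, children_counts):
    """C4.5 gain ratio = information gain / split information."""
    child_sizes = [sum(child) for child in children_counts]
    split_info = entropy(child_sizes)          # entropy of how the samples are spread
    if split_info == 0:                        # a single child: the split is useless
        return 0.0
    return information_gain(parent_counts, children_counts) / split_info

# An ID-like attribute splits 3 Yes / 3 No into six pure singleton children:
# its information gain is maximal (1.0), but the split information log2(6)
# penalizes it, so the gain ratio is well below 1.
print(gain_ratio([3, 3], [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]))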

"Python data Mining" decision tree

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.