Machine Learning: Decision Trees

Tags: id3

Since we talked about random forests last time, and a random forest is composed of multiple decision trees, let's take a closer look at decision trees.

There are already good posts about decision trees on this blog that describe the structure of the ID3 and C4.5 decision trees in detail. This post focuses on how to determine the best splitting attribute at each node of the tree, and on pruning.

1. Determining the best splitting attribute

Take ID3 (Quinlan, 1986) as the typical example. The ID3 algorithm selects attributes by information gain; I will not derive information gain in detail here. At a node N of the decision tree, ID3 computes, on the training sample set S that reaches the node, the information gain obtained by splitting on each available INPUT attribute, and selects the attribute with the highest gain. Information gain is defined as

Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v),

where S is the training sample set on node N, A is an INPUT attribute, and S_v is the subset of S on which A takes the value v. Among all input attributes available at node N, we select the one with the largest information gain for splitting. Because the first term of the formula, Entropy(S), is the same for every input attribute, the smaller the second term (the weighted entropy after the split), the larger the information gain.
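
For concreteness, here is a minimal Python sketch of ID3-style attribute selection by information gain; the toy dataset, attribute names, and helper functions are made up for illustration and are not from the original article.

```python
# A minimal sketch of ID3-style attribute selection by information gain.
# The toy dataset, attribute names, and helpers below are hypothetical.
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    n = len(labels)
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [y for row, y in zip(rows, labels) if row[attribute] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Toy weather-style training samples reaching the current node.
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rain",  "windy": False},
    {"outlook": "rain",  "windy": True},
]
labels = ["no", "no", "yes", "yes"]

# ID3 splits on the attribute with the largest information gain.
best = max(["outlook", "windy"], key=lambda a: information_gain(rows, labels, a))
print(best)  # "outlook": it separates the two classes perfectly here
```

At a real node, the same comparison would simply be repeated over whichever attributes have not yet been used on the path from the root.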

In fact, besides information gain (i.e., entropy), there are many other criteria for determining the best splitting attribute; collectively they are called impurity measures. When an attribute is used to divide the training sample set, the division is pure if all samples in each branch belong to the same class, and impure if the samples in a branch are spread across many different classes.

An impurity measure quantifies how impure the division of the training samples by a given attribute is: the higher the value, the less pure the partition (it measures impurity, not purity), so when splitting a node we try to choose the input attribute that yields the lowest impurity. Entropy can serve as an impurity measure: the smaller the entropy, the larger the information gain, the smaller the impurity, and the better the division. Besides entropy, the following impurity measures can also be used (each gives the impurity of a single branch; for the impurity of the whole split, take the weighted sum of the branch impurities). A small numeric sketch comparing these measures follows the list.

(1) Gini index: 1 - (f1^2 + f2^2 + ... + fm^2), where fi is the fraction of training samples in the branch that belong to class i. The Gini index was proposed by Breiman in 1984 and is mainly used in CART (Classification and Regression Trees).

(2) Misclassification error: 1 - max(f1, f2, ..., fm).
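
As a small illustration (the class fractions below are arbitrary), the following sketch evaluates entropy, the Gini index, and the misclassification error on the class fractions of a single branch; all three are 0 for a pure branch and maximal for an evenly mixed one.

```python
# A small sketch comparing the three impurity measures on the class
# fractions of one branch; the example fractions are made up.
from math import log2

def entropy(fractions):
    return sum(-f * log2(f) for f in fractions if f > 0)

def gini(fractions):
    return 1.0 - sum(f * f for f in fractions)

def misclassification_error(fractions):
    return 1.0 - max(fractions)

pure = [1.0, 0.0]    # every sample in the branch has the same class
mixed = [0.5, 0.5]   # evenly mixed branch: maximally impure

for name, f in [("pure", pure), ("mixed", mixed)]:
    print(name, entropy(f), gini(f), misclassification_error(f))
# pure  -> 0.0 0.0 0.0
# mixed -> 1.0 0.5 0.5
```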

2. Pruning

When growing a decision tree, splitting stops at a node when (1) all samples on the node belong to the same class, or (2) there are no usable INPUT attributes left.

However, a decision tree trained under only these stopping conditions will often overfit the training data: the model has very high classification accuracy on the training data but poor accuracy on new data. There are two ways to avoid overfitting:

(1) Stop tree growth early, that is, prepruning.

For example, if the number of training samples reaching a node is smaller than a certain percentage of the entire training set (say, 5%), the node is not split; instead, it becomes a leaf labeled by a "majority vote" over its training samples. The reason is that a decision based on too few instances tends to generalize poorly, that is, it leads to a large generalization error. A minimal sketch of this rule is given below.
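
The sketch below assumes a 5% threshold and a hypothetical majority-vote helper; the names and numbers are illustrative only, not part of the original article.

```python
# A minimal sketch of the prepruning rule above: if fewer than min_fraction
# of the full training set reaches a node, turn it into a leaf labeled by
# majority vote. The threshold, names, and numbers are illustrative.
from collections import Counter

def should_stop(node_labels, total_training_size, min_fraction=0.05):
    """Prepruning test: too few samples reach this node to justify a split."""
    return len(node_labels) < min_fraction * total_training_size

def majority_vote(node_labels):
    """Class assigned to a node that is turned into a leaf."""
    return Counter(node_labels).most_common(1)[0][0]

# 60 of 1000 training samples reach this node, so it is still split;
# with fewer than 50 it would instead become a leaf labeled "yes".
node_labels = ["yes"] * 40 + ["no"] * 20
print(should_stop(node_labels, total_training_size=1000))  # False
print(majority_vote(node_labels))                          # yes
```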

(2) Let the tree grow fully and prune it afterwards, that is, postpruning.

There are many postpruning methods; here we introduce only one variant of the training and validation set approach:

First let the decision tree grow completely, then find the subtrees that overfit and cut them off. Concretely: set aside part of the original training data as a validation set and use the rest as the training set. Grow a full decision tree on the training set. Then, for each subtree of that tree, consider replacing it with a single leaf node covering all the training samples of the subtree, where the leaf's class is decided by majority vote. Using the validation set, test how the classification performance changes before and after the replacement. If the classification accuracy increases after the replacement, we have reason to believe the subtree was too complex and overfit the training samples, so we replace it with the leaf node; otherwise we keep the subtree.

The common practice is to use 2/3 of the available training samples as the training set and the remaining 1/3 as the validation set.
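
The following simplified Python sketch illustrates this validation-set pruning bottom-up on a tiny hand-built tree. The dict-based tree representation, the assumption that each internal node stores the majority class of the training samples that reached it, and the toy data are all illustrative choices, not the article's implementation; ties on validation accuracy are resolved in favor of the simpler leaf.

```python
# A simplified sketch of the training/validation-set pruning described above.
# The dict-based tree, the "majority" field (majority class of the training
# samples that reached each internal node when the tree was grown), and the
# toy data are illustrative assumptions only.

def predict(node, sample):
    """Follow attribute tests until a leaf (a bare class label) is reached."""
    while isinstance(node, dict):
        node = node["children"][sample[node["attribute"]]]
    return node

def prune(node, val_samples, val_labels):
    """Bottom-up pruning: replace a subtree by a majority-vote leaf whenever
    that classifies the validation samples reaching it at least as well."""
    if not isinstance(node, dict):
        return node
    # Prune the children first, routing validation samples down the split.
    for value in list(node["children"]):
        subset = [(s, y) for s, y in zip(val_samples, val_labels)
                  if s[node["attribute"]] == value]
        node["children"][value] = prune(node["children"][value],
                                        [s for s, _ in subset],
                                        [y for _, y in subset])
    # Compare the subtree with a majority-vote leaf on the validation samples
    # that reach this node; ties favor the simpler leaf.
    subtree_hits = sum(predict(node, s) == y for s, y in zip(val_samples, val_labels))
    leaf_hits = sum(y == node["majority"] for y in val_labels)
    return node["majority"] if leaf_hits >= subtree_hits else node

# Toy fully grown tree (hypothetical): the "windy" split overfits.
tree = {
    "attribute": "outlook", "majority": "yes",
    "children": {
        "sunny": "no",
        "rain": {"attribute": "windy", "majority": "yes",
                 "children": {True: "no", False: "yes"}},
    },
}
val_samples = [{"outlook": "rain", "windy": True},
               {"outlook": "rain", "windy": False},
               {"outlook": "sunny", "windy": False}]
val_labels = ["yes", "yes", "no"]
print(prune(tree, val_samples, val_labels))
# {'attribute': 'outlook', ..., 'children': {'sunny': 'no', 'rain': 'yes'}}
```

In this toy run the "windy" subtree misclassifies one of the two rainy validation samples, while the majority-vote leaf "yes" classifies both correctly, so the subtree is replaced by that leaf.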

Comparing the two pruning methods: prepruning is faster, since it is done while the decision tree is being built; postpruning is slower, because every subtree must be tentatively replaced and validated after the tree has been constructed, but its accuracy is usually higher than that of prepruning.

 

