Pattern Recognition Learning Notes (29) -- Pruning of Decision Trees


With a limited sample, if the decision tree grows very large and has many branches, the tree can become overly sensitive to chance and noise in the sampling, which leads to overfitting and thus poor generalization ability.

First look at a figure:

[Figure: accuracy on the training data and on the test data as a function of decision-tree size, for a tree grown with the ID3 algorithm]
The figure, from an experiment with the ID3 algorithm, shows the relationship between the size of the decision tree and the accuracy on the training data and on the test data. Overfitting is easy to see: when the sample is limited, once the tree reaches a certain size, the accuracy on the training data keeps increasing while the accuracy on the test data no longer improves. An algorithm that grows the tree until every leaf node contains samples of only a single class is therefore flawed. Our goal is to balance the accuracy on the training data against the accuracy on the test data; as with all pattern recognition problems, we must ensure good generalization ability.
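This behavior is easy to reproduce. The following is a minimal Python sketch, not the original experiment: it assumes scikit-learn (whose trees are CART-style rather than ID3) and a synthetic noisy dataset, and the depth values are arbitrary illustrative choices. Training accuracy keeps climbing with tree size while test accuracy stalls:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data: flip_y injects label noise, so a fully grown
# tree will memorize noise instead of generalizing.
X, y = make_classification(n_samples=400, n_features=20, flip_y=0.1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in [1, 2, 4, 8, 16, None]:  # None lets the tree grow fully
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_tr, y_tr)
    print(f"depth={depth}: nodes={clf.tree_.node_count}, "
          f"train={clf.score(X_tr, y_tr):.2f}, "
          f"test={clf.score(X_te, y_te):.2f}")
```

On such noisy data the fully grown tree typically fits the training set almost perfectly while its test accuracy is no better than that of a much smaller tree.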

In decision tree algorithms, the main means of controlling performance and preventing overfitting are controlling the termination condition of the tree-growing algorithm and pruning the tree. The method for controlling the size of the decision tree is pruning, and there are two main strategies: pre-pruning and post-pruning.

Pre-pruning

Definition: control the growth of the decision tree during the growing process itself, deciding at each node whether it should continue branching or directly become a leaf node; once a node is judged to be a leaf node, that branch stops growing.

Some ways to decide when the decision tree should stop growing:

1) Data partitioning method: divide the sample data into a training set and a test set, grow the decision tree on the training set, and stop growing when the accuracy on the test set reaches its maximum.

Disadvantage: because the sample data is divided, only part of the total sample is used for training and the data is not fully exploited, so multiple rounds of cross-validation are needed;

2) Threshold method: set a suitable threshold on the information gain (the size of the decrease in entropy impurity); when the information gain at the current node falls below the threshold, stop growing (see the sketch after this list);

Disadvantage: a good threshold value is hard to choose;

3) Statistical significance analysis of the information gain: examine the distribution of the information gains of all existing nodes; if the information gain obtained by growing a further branch is not significant compared with that distribution, stop growing (the sketch after this list includes such a test);

Disadvantages: not intuitive, and the significance is hard to test well; the chi-square distribution may be considered;
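To make methods 2) and 3) concrete, here is a minimal Python sketch. The representation (label lists for a parent node and its candidate child nodes) is invented for illustration, and the gain threshold of 0.1 and significance level of 0.05 are arbitrary choices, not values from the text:

```python
import numpy as np
from collections import Counter
from scipy.stats import chi2_contingency

def entropy(labels):
    """Shannon entropy (impurity) of a sequence of class labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """Entropy decrease achieved by splitting `parent` into `children`."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

def should_stop(parent, children, gain_threshold=0.1, alpha=0.05):
    """Pre-pruning test: stop if the gain is below the threshold
    (method 2) or the split is not statistically significant (method 3)."""
    if information_gain(parent, children) < gain_threshold:
        return True
    # Chi-square test on the class-count contingency table of the split
    classes = sorted(set(parent))
    table = [[c.count(k) for k in classes] for c in children]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value > alpha  # not significant -> stop growing

# Toy example: a candidate split of 10 samples into two child nodes
parent = ['+'] * 6 + ['-'] * 4
children = [['+', '+', '+', '+', '-'], ['+', '+', '-', '-', '-']]
print(information_gain(parent, children))  # modest gain
print(should_stop(parent, children))       # True: gain not significant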

Post-pruning

Definition: post-pruning waits until the decision tree has finished growing, then optimizes and trims it, mainly by merging branches. Starting from the leaf nodes, if merging the leaves that share a parent node would not cause a significant increase in entropy, eliminate them and make the parent node a new leaf node; keep backtracking in this way until no more branches are suitable for merging.
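Here is a minimal sketch of this bottom-up merging, assuming a hypothetical Node class (each node stores the training labels that reach it) and an arbitrary entropy-increase threshold; it illustrates the idea, not C4.5's or CART's actual procedure:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

class Node:
    def __init__(self, labels, children=None):
        self.labels = labels            # training samples reaching this node
        self.children = children or []  # empty list -> leaf node

def post_prune(node, max_entropy_increase=0.1):
    """Prune children first (post-order), then collapse this node into
    a leaf if merging its leaf children barely increases the entropy."""
    if not node.children:
        return node
    for child in node.children:
        post_prune(child, max_entropy_increase)
    if all(not c.children for c in node.children):  # all children are leaves
        n = len(node.labels)
        child_entropy = sum(len(c.labels) / n * entropy(c.labels)
                            for c in node.children)
        # Entropy increase caused by merging = information gain of the split
        if entropy(node.labels) - child_entropy < max_entropy_increase:
            node.children = []  # merge: the parent becomes a new leaf
    return node

# Toy usage: a split that gains almost nothing gets merged away
root = Node(['+', '+', '-', '-'],
            [Node(['+', '-']), Node(['+', '-'])])
post_prune(root)
print(root.children)  # [] -> the branch was eliminated
```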

In post-pruning, the following three principles can be followed when merging branches:

1) Reduced classification error rate pruning: estimate the change in classification error rate before and after pruning, and decide based on the error rate whether the branches should be merged;

2) Minimal cost-complexity trade-off: consider two indicators at the same time, the error rate after pruning and the reduction in complexity; a compromise between the two finally yields a decision tree with good overall performance (see the sketch after this list);

3) Minimum description length (MDL) criterion: based on the idea that "simpler is better", first encode the decision tree, then prune so as to obtain the decision tree with the shortest encoding.
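Principle 2) is the pruning scheme used by CART. As an illustration, here is a minimal sketch using scikit-learn's built-in minimal cost-complexity pruning (the library and the synthetic dataset are assumptions, not part of the original text): each value of the complexity parameter ccp_alpha trades training error against tree size along the pruning path:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Effective alphas at which subtrees would be pruned away
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_tr, y_tr)

for alpha in path.ccp_alphas[::5]:  # sample a few alphas along the path
    clf = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    clf.fit(X_tr, y_tr)
    print(f"alpha={alpha:.4f}: leaves={clf.get_n_leaves()}, "
          f"test={clf.score(X_te, y_te):.2f}")
```

In practice one would pick the alpha that maximizes accuracy on held-out data, yielding the compromise between error rate and complexity that the principle describes.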

Summary

The choice between pre-pruning and post-pruning must be analyzed for the specific problem at hand. Pre-pruning looks more straightforward, but its difficulty lies in deciding when to stop growing the tree: tree growth is greedy, each step based only on the current criterion, with no global view and no backtracking, so pre-pruning does not take the final result into account and may terminate growth prematurely. Post-pruning (the more successful approach in practice; both C4.5 and CART use it) makes full use of all the sample information and considers the tree globally (a compensation mechanism); its drawback is that when the sample data is large the computational burden is heavy. In practice, the two can also be combined.



