System Implementation and Pruning Optimization of Decision Tree Algorithms

Decision trees are a widely used method for analyzing classification problems in depth. In practice, however, the trees generated by induction algorithms are often large and complex, which makes them hard for users to understand. This suggests that research on tree pruning should be strengthened while preserving the precision of multiclass classification. Taking the program implementation of a decision tree algorithm as an example, this article discusses the problems involved in pruning and optimizing trees. The aim is to give decision tree researchers a clear and simplified technical view of the subject.

Introduction

Machine learning research on data classification has focused mainly on prediction accuracy. In many real businesses, however, a classification rule is acceptable only if the structure of its predictions is easy to understand, that is, if it is as clear as the decision-making problem it solves. In machine learning and statistics, decision tree induction is widely studied as a solution to classification problems. Because many tree-simplification techniques produce increasingly small and simple decision trees, tree simplification has become the second focus of research after prediction accuracy. The key difficulty in summarizing tree-simplification technology lies in the diversity of approaches. To manage this diversity, the methods can be divided into five categories, which can be characterized by viewing tree simplification as a heuristic search over the space of candidate trees.

I. Program Implementation of Decision Tree Algorithms

Decision tree induction algorithms are widely used in machine learning, statistics, and other fields that deal with classification problems. A decision tree classifies a query case as follows: given a query Q to be classified, the tree is traversed along a path from the root down to a leaf node, where a class label is assigned to Q. The tests along the path usually evaluate a single feature of the case or a combination of features (for example, a Boolean or linear combination). The decision tree algorithm implemented by the author takes four inputs:
1. C, the training case set, in which each case is described by a feature set and a class label
2. R, the set of candidate tests, used to divide the training case set into subsets
3. E(), the evaluation function, used to assess the quality of a candidate partition
4. stop(), the stopping function, which determines when to stop expanding the decision tree.
Each leaf of the decision tree T output by the algorithm represents a class and usually carries a single class label. The tree is built by recursive partitioning downward from the root. The make-T function generates a decision tree and then performs any subsequent pruning; there are three possible outcomes: the tree is left as it is, it is pruned, or it is converted into another data structure. The induce-T function outputs a tree T given a case subset C' of C, the test set R, the evaluation function E(), and the stopping function stop(). Induce-T performs a hill-climbing search, that is, it only moves forward and never backtracks, and stop() determines when the search ends. The initial state is a tree containing a single node, the root, together with all cases C. The state changes as the tree expands, which is expressed through recursive calls.

Each call to induce-T creates a node containing the input subset C'. The "best" test for dividing C' is chosen by the best() function, which relies on E() to evaluate the quality of each candidate test and the partition it produces, and returns the best test. Applying the chosen test to the case subset C' partitions it according to the test's outcome values V. If the result does not satisfy stop(), the tree continues to expand. The set of terminal states is defined by stop(): when it returns true, expansion of the tree ends. For example, under the homogeneity rule, stop() returns true if all cases in C' have the same class label. The homogeneity rule can serve as the default stopping rule because it is implicitly contained in every other stopping rule. The stop() function thus decides whether a node N becomes a leaf or an internal node. If N is an internal node, the tree continues to expand: N is labeled with the chosen test, and one subtree is created for each outcome value v in V. If N is a leaf, its class label is determined by the cases C' it contains; usually the label shared by the majority of the cases in C' is chosen.
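
The recursive procedure just described can be sketched in code. The following Python sketch is only an illustration under assumed names (induce_tree, partition, the dictionary node format); it is not the author's actual program.

```python
# Illustrative sketch of the recursive induction loop described above.
# Names and the dict-based node format are assumptions, not the author's code.
from collections import Counter

def majority_label(cases):
    """Class label shared by most of the (features, label) pairs."""
    return Counter(label for _, label in cases).most_common(1)[0][0]

def partition(cases, test):
    """Group cases by the outcome of applying `test` to each case's features."""
    groups = {}
    for features, label in cases:
        groups.setdefault(test(features), []).append((features, label))
    return groups

def induce_tree(cases, tests, evaluate, stop):
    """Grow a decision tree by greedy, forward-only (hill-climbing) search.

    cases    -- the case subset C': a list of (feature_dict, class_label) pairs
    tests    -- the candidate tests R: callables mapping a feature_dict to an outcome
    evaluate -- E(): scores the partition produced by a test (higher is better)
    stop     -- stop(): returns True when the node should become a leaf
    """
    if stop(cases):                              # e.g. all cases share one label
        return {"leaf": True, "label": majority_label(cases)}

    best = max(tests, key=lambda t: evaluate(partition(cases, t)))   # best()
    parts = partition(cases, best)
    if len(parts) < 2:                           # no test separates the cases
        return {"leaf": True, "label": majority_label(cases)}

    node = {"leaf": False, "test": best, "children": {}}
    for value, subset in parts.items():          # one subtree per outcome value v in V
        node["children"][value] = induce_tree(subset, tests, evaluate, stop)
    return node
```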
Tree induction algorithms such as the one described here must be computationally efficient, because building a decision tree is a complex task: the size of the search space grows exponentially with the depth of the tree (that is, the distance from the root to the lowest leaf). The algorithm is therefore built on finding, at each node, the test that maximizes the evaluation function E(). Designing and optimizing the evaluation function E() is naturally a central concern for system developers.
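
As one standard choice for E(), the sketch below scores a partition by information gain (the reduction in class-label entropy). The article does not say which evaluation function the author used, so this is purely an assumed example; it can be passed as the evaluate argument of the induce_tree sketch above.

```python
# A standard example of an evaluation function E(): information gain,
# i.e. the drop in class-label entropy after partitioning the cases.
import math
from collections import Counter

def entropy(cases):
    """Shannon entropy of the class labels in a list of (features, label) pairs."""
    counts = Counter(label for _, label in cases)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(parts):
    """E(): entropy before the split minus the weighted entropy after it."""
    parent = [case for subset in parts.values() for case in subset]
    total = len(parent)
    weighted = sum(len(subset) / total * entropy(subset) for subset in parts.values())
    return entropy(parent) - weighted
```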

II. Issues to Consider During Tree Pruning and Optimization

1. The size of the decision tree
An oversized decision tree can have several causes. One is an inadequate feature description: some feature representations cannot accurately express the target concept, and when such a representation is used the resulting model becomes very complex, whereas a better-suited representation greatly reduces the model's complexity. This reminds us that the representation should be taken into account during pruning. The other main cause of a huge tree is noise. When the cases contain a large amount of feature noise (that is, wrongly recorded feature values) or class noise (that is, wrongly recorded class labels), induction expands the tree without bound in response to irrelevant case features. Noise allows unrelated cases to contaminate the selected tests, which leads to "unnecessary modeling": the tree models both the target concept and the noise itself. This is a common problem, because any given case set contains noise to some degree. Although there are several ways to reduce the effect of noise, no single method is effective against every kind of noise.
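
The effect of class noise on tree size is easy to demonstrate with an off-the-shelf learner. The snippet below uses scikit-learn with an arbitrary synthetic dataset and noise rate purely as an illustration, not as part of the author's system.

```python
# Illustration (not from the article): class noise inflates an unpruned tree.
# Dataset, noise rate, and seeds are arbitrary choices for demonstration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

rng = np.random.default_rng(0)
y_noisy = y.copy()
flip = rng.random(len(y)) < 0.15          # mislabel 15% of the cases (class noise)
y_noisy[flip] = 1 - y_noisy[flip]

clean = DecisionTreeClassifier(random_state=0).fit(X, y)
noisy = DecisionTreeClassifier(random_state=0).fit(X, y_noisy)

# The noisy labels force the unpruned tree to model the noise as well,
# so it typically grows far more nodes than the tree fit on clean labels.
print("nodes, clean labels:", clean.tree_.node_count)
print("nodes, noisy labels:", noisy.tree_.node_count)
```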

A huge tree is usually fragmented: it has too many leaves, each covering only a few examples. Such leaves are more prone to classification errors and more susceptible to noise than leaves covering many examples. These leaf nodes (or, more precisely, the tree paths leading to them) are small disjuncts with a low likelihood of occurrence. One way to simplify a tree is therefore to eliminate this fragmentation by cutting off leaves that cover only a few examples, as sketched below. Whatever its cause, a complex tree is not only hard to understand but also classifies less reliably, because small disjuncts are more error-prone than large ones. However, it is not easy to tell from the training set alone which parts of the tree reflect genuine features. Without considering the effect on performance over unseen test cases, pruning a tree usually reduces its classification accuracy on the training set.
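
A direct way to remove such small, noise-prone leaves is to require a minimum number of cases per leaf. The snippet below illustrates this with scikit-learn's min_samples_leaf parameter; the dataset, noise rate, and threshold are arbitrary choices, not values from the article.

```python
# Illustration: requiring a minimum number of cases per leaf removes the small,
# noise-prone leaves described above. Data setup mirrors the previous snippet.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
rng = np.random.default_rng(0)
flip = rng.random(len(y)) < 0.15
y_noisy = np.where(flip, 1 - y, y)

unrestricted = DecisionTreeClassifier(random_state=0).fit(X, y_noisy)
restricted = DecisionTreeClassifier(min_samples_leaf=20,   # arbitrary threshold
                                    random_state=0).fit(X, y_noisy)

print("leaves, no minimum   :", unrestricted.get_n_leaves())
print("leaves, >=20 per leaf:", restricted.get_n_leaves())
```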

2. Weighing accuracy against simplicity
Pruning methods aim to improve comprehensibility while preserving accuracy. These two goals are not necessarily contradictory; there may be ways to improve accuracy and comprehensibility at the same time. In fact, tree simplification was originally introduced to control the effect of noise in the training set, and it was then found to improve accuracy on many noisy datasets.

How much to simplify (or prune) has always been a perennial question. Pruning a decision tree at the expense of accuracy is seldom wise, yet conservative pruning can sometimes greatly improve accuracy, which matters in practical applications. Many researchers have therefore studied the optimal balance between the accuracy and the simplicity of a decision tree, but it is hard to achieve with a randomly selected training set: from the training cases alone it is difficult to tell which parts of the tree are complicated by noise and which reflect genuine properties of the target. Moreover, domain knowledge is not contained in the training set. It is therefore necessary to assess the noise level, the inherent complexity of the target, and the degree to which the tree should be simplified; induction algorithms that lack such knowledge cannot do this. Instead, each algorithm makes assumptions about model complexity and training noise, and these assumptions affect the entire simplification process. The algorithms also differ in their representational bias. For example, many assume that the model takes the form of a disjunction of rules and that each test is a function of a single feature, while others allow tests to be linear combinations of feature values. Naturally, some algorithms suit particular tasks better than others. If you cannot determine which algorithm is best for a given database, run several of them together and compare the results, as illustrated below.
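
One concrete way to "run them together and compare" is to sweep a single pruning parameter and report both tree size and held-out accuracy. The sketch below does this with scikit-learn and arbitrary settings, purely as an illustration of the accuracy/simplicity trade-off.

```python
# Illustration of the accuracy/simplicity trade-off: sweep one pruning knob
# (max_depth here, chosen arbitrarily) and compare size against held-out accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, flip_y=0.1,
                           random_state=0)          # flip_y injects label noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (None, 10, 5, 3):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: {clf.tree_.node_count:4d} nodes, "
          f"test accuracy {clf.score(X_te, y_te):.3f}")
```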

III. Decision Tree Pruning Algorithms
There are many pruning algorithms, and they fall into roughly five categories:
1. Controlling the size of the tree directly, either by pre-pruning (adding a stopping rule while the tree is being expanded), by post-pruning (cutting off subtrees after the tree has been generated), or by adjusting the size of the tree incrementally.
2. Modifying the space of tests. New tests are formed from data-driven or hypothesis-driven combinations of features (the latter use the tree built so far to construct component features), features are combined or separated, and multivariate tests are introduced. These methods effectively extend the set of expressible trees.
3. Changing the search itself: selecting a different evaluation function for the tests, improving the handling of continuous features, or modifying the search algorithm.
4. Restricting the database, that is, simplifying the tree by reducing the set of cases or the set of features that describe them.
5. Converting the tree into another data structure, such as a decision table or a decision graph.
These approaches can be combined within the same algorithm to reinforce one another; for example, methods that control tree size are often combined with changes to the tests or to the search space. The most common way to simplify a decision tree is to control its size during or after construction, namely pre-pruning and post-pruning.

Pre-pruning prevents the decision tree from growing into a "full" tree under the default homogeneity stopping rule. To do this, another stopping rule must be added to the tree-generation process. Simple pre-pruning methods that directly restrict the tree can work well; more generally, however, the stopping rule estimates the utility of expanding the tree further and halts expansion when that utility is zero (or very small), or when further processing would have little effect on the final form of the tree. Pre-pruning is more efficient than post-pruning because it ends the tree-generation phase as early as possible. However, a stopping rule filters relevant and irrelevant tests at the same time, and a problem can arise when the same node-biased criterion is used both to select tests and to prune, because the absolute value of the criterion often varies with sample size. Pre-pruning may also stop the tree before it is fully mature, that is, the tree may stop expanding when it should not; this is known as the horizon effect. Even so, pre-pruning is worth studying for large-scale practical applications because it is quite efficient, and it is hoped that future algorithms will overcome the horizon effect. A minimal stopping rule of this kind is sketched below.
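
A minimal pre-pruning stop() can be written against the induce_tree and partition sketches from Section I; the gain and case-count thresholds below are assumed, illustrative values rather than anything specified in the article.

```python
# Illustrative pre-pruning stop() for the induce_tree sketch shown earlier.
# It reuses partition() and an evaluation function such as information_gain().
def make_prepruning_stop(tests, evaluate, min_gain=0.01, min_cases=5):
    """min_gain and min_cases are assumed, illustrative thresholds."""
    def stop(cases):
        labels = {label for _, label in cases}
        if len(labels) <= 1 or len(cases) < min_cases:
            return True                       # homogeneous, or too few cases to split
        best_gain = max(evaluate(partition(cases, t)) for t in tests)
        return best_gain < min_gain           # further expansion has (almost) no utility
    return stop
```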

Post-pruning is the most widely studied family of tree-simplification methods. A post-pruning algorithm takes an unpruned tree T as input and outputs a tree T' obtained by pruning one or more subtrees. The algorithm does not examine every possible T'; it uses a heuristic search. A subtree is pruned by replacing the internal node at its root with a leaf. Unlike pre-pruning, post-pruning does not use an extra function to suppress detail during growth: the tree is first grown to completion under the default homogeneity stopping rule and then trimmed. When the tree overfits the training set (that is, when noise has been modeled), pruning can effectively improve accuracy. For example, if a leaf covers N cases of the training set, of which n' <= N carry the leaf's class label, and the training set contains m cases in total, then that leaf contributes (N - n')/m to the resubstitution error. The lowest leaves have the least impact on resubstitution accuracy, so they are trimmed first. Post-pruning methods use a variety of evaluation functions to decide whether removing a node weakens or improves accuracy on the case set. Proper pruning improves classification accuracy, and it is especially effective when the noise level of the training set is high. A small bottom-up pruning sketch in this spirit follows.
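
The following bottom-up pruning sketch is written against the dictionary-based trees grown by the induce_tree sketch above (it reuses majority_label and Counter from that sketch). The rule used here, replace a subtree by a leaf whenever the leaf makes no more errors on the supplied cases, is a simple reduced-error-style choice for illustration, not necessarily the author's method.

```python
# Illustrative bottom-up post-pruning over the dict-based trees from induce_tree.
from collections import Counter

def classify(node, features):
    """Send a case down the tree (assumes each outcome was seen while growing)."""
    while not node["leaf"]:
        node = node["children"][node["test"](features)]
    return node["label"]

def subtree_errors(node, cases):
    """Number of cases in `cases` that the subtree misclassifies."""
    return sum(classify(node, features) != label for features, label in cases)

def leaf_errors(cases):
    """Errors if the subtree were collapsed to a majority leaf: N - n'.
    Dividing by the total number of training cases m gives the (N - n')/m
    contribution to resubstitution error mentioned in the text."""
    return len(cases) - max(Counter(label for _, label in cases).values())

def prune(node, cases):
    """Prune bottom-up: keep a subtree only if it beats a single leaf on `cases`."""
    if node["leaf"] or not cases:
        return node
    for value, child in node["children"].items():
        child_cases = [c for c in cases if node["test"](c[0]) == value]
        node["children"][value] = prune(child, child_cases)
    if leaf_errors(cases) <= subtree_errors(node, cases):
        return {"leaf": True, "label": majority_label(cases)}
    return node
```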
Some post-pruning methods divide the training set into two subsets: one used to grow the tree (the growing set) and one used for pruning (the pruning set). The growing set is used to generate a tree, from which a set S of candidate pruned trees is derived by varying the amount of pruning, and the pruning set then selects the best tree from S. In some variants the pruning set is used to prune the tree directly, not just to select among candidates. An advantage of the pruning-set approach is that it produces a set of trees rather than a single tree: when domain experts are not satisfied with the tree chosen automatically, a tree can be selected manually from the set, and inducing a set of trees can also improve prediction accuracy. The disadvantage of splitting the training set into two subsets is that the split itself is a human design decision, and a smaller growing set may produce a smaller tree simply because parts of it are cut off, which is precisely why the growing set should be made as large as possible; reducing the size of the training set also increases the uncertainty of the accuracy estimates used for pruning. Many post-pruning algorithms are available, such as MCCP, REP, MEP, CVP, PEP, and EBP. They differ in accuracy and in the degree of simplification; interested readers can consult the relevant literature, which is not covered in detail here. A sketch of the growing-set/pruning-set scheme is given below.
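
The growing-set/pruning-set scheme can be illustrated with scikit-learn: the growing set produces a family of candidate pruned trees via the library's minimal cost-complexity pruning path (a relative of the MCCP family mentioned above), and the pruning set picks among them. The dataset, split sizes, and seeds below are arbitrary choices for demonstration.

```python
# Illustration of the growing-set / pruning-set scheme: the growing set yields a
# family of candidate pruned trees (via minimal cost-complexity pruning), and the
# pruning set selects the best one.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, flip_y=0.1,
                           random_state=0)
X_grow, X_prune, y_grow, y_prune = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Candidate pruning levels computed on the growing set only.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_grow, y_grow)

candidates = [
    DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_grow, y_grow)
    for alpha in path.ccp_alphas
]

# The pruning set selects the best tree from the candidate set S.
best = max(candidates, key=lambda t: t.score(X_prune, y_prune))
print("selected tree:", best.tree_.node_count, "nodes,",
      "prune-set accuracy", round(best.score(X_prune, y_prune), 3))
```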

Conclusion
The trade-off between accuracy and simplicity may be a topic that decision tree research can never escape. It is important to remember, however, that even when the focus is on pruning, accuracy cannot be ignored: if pruning causes a significant drop in accuracy, the pruning is meaningless.
