The Ten Classical Algorithms of Data Mining -- CART: Classification and Regression Tree


I. Types of Decision Trees
In data mining, there are two main types of decision trees:

The output of the classification tree is the class label of the sample.
The output of a regression tree is a real number (such as the price of a house, the time a patient spends in a hospital, etc.).

The term classification and regression tree (CART) covers both kinds of decision tree and was first proposed by Breiman et al. Classification trees and regression trees share some similarities and some differences, for example in how they decide where to split.

Classification and regression trees (CART) are also decision trees; earlier articles introduced decision trees based on the ID3 and C4.5 algorithms.

This article only discusses how CART is used for classification.

A CART tree is a binary tree: every non-leaf node has exactly two children, so the number of leaf nodes is always one more than the number of non-leaf nodes.


Differences between CART and ID3:
The impurity measure CART uses to select variables is the Gini index.
If the target variable is nominal with more than two categories, CART may consider merging the target categories into two superclasses (the twoing rule).
If the target variable is continuous, CART finds a set of tree-based regression splits to predict the target variable.


II. Building Decision Trees

When building a decision tree, a top-down approach is typically used: at each step, the best attribute to split on is chosen.

The "best" definition is to make the training set in the child nodes as pure as possible. Different algorithms use different metrics to define "best". This section describes one of the most common metrics.


Four different impurity measures can be used to find a CART split, depending on the type of the target variable: for a categorical target variable, Gini, twoing, or ordered twoing can be used; for a continuous target variable, least squares deviation (LSD) or least absolute deviation (LAD) can be used.
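As an illustration of the continuous case, here is a minimal sketch of how a candidate split could be scored by least squares deviation; the function names and the data are invented for this example and are not from the original article.

```python
# Illustrative sketch only: scoring a candidate split of a continuous target
# by least squares deviation (LSD), i.e. the total squared deviation of each
# child's targets from that child's mean. Names and data are made up.

def lsd(values):
    """Sum of squared deviations of the values from their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def lsd_of_split(xs, ys, threshold):
    """Total LSD after splitting the samples on x <= threshold."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    return lsd(left) + lsd(right)

# Example: house areas (x) against prices (y); a lower score is a better split.
areas = [50, 60, 80, 100, 120]
prices = [150, 180, 240, 310, 360]
print(lsd_of_split(areas, prices, threshold=80))
```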

Below we only talk about the Gini index.


Gini index:
1. It is a measure of inequality.
2. It is most often used to measure income inequality, but it can measure any uneven distribution.
3. It is a number between 0 and 1: 0 means complete equality, 1 means complete inequality.
4. The more mixed the classes contained in a population are, the larger the Gini index (very similar to the concept of entropy).

CART analysis steps

1. Start from the root node (t = 1). Among all candidate splits S, find the split s* that reduces impurity the most, then use s* to split node t = 1 into two nodes, t = 2 and t = 3.
2. Repeat the split search on t = 2 and t = 3, and so on.
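As a rough sketch of this top-down loop (not the article's own code), the following fragment enumerates candidate splits of the form "attribute equals value vs. not", scores each by the weighted Gini impurity of the children (the Gini index is defined in the next section), and recurses; the dict-based tree representation and the stopping rule are simplifying assumptions.

```python
# Illustrative sketch of the top-down CART splitting loop. The dict-based
# tree representation, candidate enumeration and stopping rule are
# simplifying assumptions, not the article's implementation.
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Search all (attribute, value) candidates and return the one whose
    weighted child impurity (Gini_Gain) is smallest."""
    n, best = len(labels), None
    for attr in range(len(rows[0])):
        for value in {row[attr] for row in rows}:
            left = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            right = [lab for row, lab in zip(rows, labels) if row[attr] != value]
            if not left or not right:
                continue
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[0]:
                best = (score, attr, value)
    return best

def grow(rows, labels):
    """Split recursively until a node is pure or no split is possible."""
    split = best_split(rows, labels)
    if gini(labels) == 0.0 or split is None:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    _, attr, value = split
    left = [i for i, row in enumerate(rows) if row[attr] == value]
    right = [i for i, row in enumerate(rows) if row[attr] != value]
    return {"test": (attr, value),
            "left": grow([rows[i] for i in left], [labels[i] for i in left]),
            "right": grow([rows[i] for i in right], [labels[i] for i in right])}
```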

Gini impurity index

In the CART algorithm, Gini impurity represents how often a randomly chosen sample from a node would be misclassified if it were labeled at random according to the class distribution in that node.

It equals the probability of the sample being chosen multiplied by the probability of it being misclassified. When all the samples in a node belong to one class, the Gini impurity is zero.

If the possible values of Y are {1, 2, ..., m} and f_i is the fraction of samples with label i, the Gini index is calculated as

Gini(f) = \sum_{i=1}^{m} f_i (1 - f_i) = 1 - \sum_{i=1}^{m} f_i^2

For example: [figure not reproduced here]

The earlier example has 8 attributes, and each attribute can take several discrete values. At every node of the decision tree, we can split on any value of any attribute. For example, at the root we could split by:

1) surface covering: hair vs. non-hair

2) surface covering: scales vs. non-scales

3) body temperature: warm-blooded vs. not warm-blooded

and so on, each producing the left and right children of the current node. Below we use the Gini index to choose among these splits:

Gini index

The more mixed the classes in a group are, the larger the Gini index (very similar to the concept of entropy).

For example, the warm-blooded group contains 5 mammals and 2 birds, so:

Gini = 1 - [(\frac{5}{7})^2 + (\frac{2}{7})^2] = \frac{20}{49}

The not-warm-blooded group contains 3 reptiles, 3 fish, and 2 amphibians, so:

Gini = 1 - [(\frac{3}{8})^2 + (\frac{3}{8})^2 + (\frac{2}{8})^2] = \frac{21}{32}

So if we split on "body temperature: warm-blooded vs. not warm-blooded", the Gini gain (analogous to information gain) is the weighted sum of the children's Gini indices:

Gini_Gain = \frac{7}{15} \cdot \frac{20}{49} + \frac{8}{15} \cdot \frac{21}{32} = \frac{227}{420} \approx 0.54

The best split is the one that makes Gini_Gain smallest.
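As a quick check of the numbers above, here is a small script written for this article, using exact fractions and the class counts from the example:

```python
# Recomputing the example: the Gini index of each child and the weighted
# Gini_Gain for the split "warm-blooded vs. not warm-blooded".
from collections import Counter
from fractions import Fraction

def gini(labels):
    n = len(labels)
    return 1 - sum(Fraction(c, n) ** 2 for c in Counter(labels).values())

warm = ["mammal"] * 5 + ["bird"] * 2
cold = ["reptile"] * 3 + ["fish"] * 3 + ["amphibian"] * 2

g_warm, g_cold = gini(warm), gini(cold)
n = len(warm) + len(cold)
gain = Fraction(len(warm), n) * g_warm + Fraction(len(cold), n) * g_cold

print(g_warm)  # 20/49
print(g_cold)  # 21/32
print(gain)    # 227/420, roughly 0.54
```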

Termination conditions

After a node produces left and right children, the children can be split recursively, yielding the classification and regression tree. What is the termination condition here, i.e., when can a node stop splitting? Intuitively, when all the data records at a node belong to the same class, the split can stop.

That is only a special case. More generally, we compute a χ² statistic to judge how strongly the splitting condition and the class are related. When χ² is very small, the splitting condition and the class are essentially independent, i.e., splitting on that condition is meaningless, and the node stops splitting. Note that the "splitting condition" here is the one obtained by the minimum-Gini_Gain principle.

Suppose the splitting condition obtained in the first step of building the classification and regression tree is "body temperature: warm-blooded vs. not warm-blooded". Then:

[Figure not reproduced: the contingency table of body temperature against class and the resulting χ² value.]
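As an illustration of this χ² check (not the article's own computation), one can tabulate body temperature against class using the counts from the example and run a standard chi-square test, here with SciPy:

```python
# Illustrative chi-square check of "body temperature vs. class", using the
# counts from the example: warm-blooded = 5 mammals + 2 birds, not
# warm-blooded = 3 reptiles + 3 fish + 2 amphibians. A large statistic
# (small p-value) means the split condition and the class are strongly
# related, so the split is worth keeping; a very small statistic would
# suggest stopping the split.
from scipy.stats import chi2_contingency

table = [
    [5, 2, 0, 0, 0],  # warm-blooded:     mammal, bird, reptile, fish, amphibian
    [0, 0, 3, 3, 2],  # not warm-blooded: mammal, bird, reptile, fish, amphibian
]
chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value, dof)
```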


III. Pruning

Why does a decision tree need pruning? The reason is to prevent the decision tree from overfitting the training samples.

The decision tree generated by the algorithm above is very detailed and very large: every attribute is considered, and the training samples covered by each leaf node are "pure". If you use this tree to classify the training samples, you will find that it performs extremely well on them: the error rate is very low and every sample in the training set is classified correctly. However, noise in the training samples is also learned as part of the decision tree, so its performance on test data is not as good as expected, or even quite poor. This is the so-called overfitting problem. Professor Quinlan observed that the error rate of an overfitted decision tree on test data is higher than that of a simplified decision tree.

How to Prune

The question now is how to generate a simplified decision tree from the original, overfitted one. This can be done by pruning the overfitted decision tree.

Pruning can be divided into two types: pre-pruning and post-pruning. We look at both below:
Pre-pruning: stop the growth of the tree early; the key question is deciding when the tree should stop growing.
Post-pruning: prune the fully grown, overfitted decision tree to obtain a simplified pruned version.
In essence, the criterion for pruning is how to determine the right size of the decision tree. Some pruning ideas worth trying:
1: Use a training set and a validation set to evaluate how useful pruning each candidate node is.
2: Train on the entire training set, but use a statistical test to estimate whether pruning (or further expanding) a particular node will improve performance beyond the training data, for example the chi-square test (Quinlan, 1986): does expanding the node improve performance on the whole population, or only on the current training data?
3: Use an explicit measure of the complexity of the training examples and the decision tree, and stop growing the tree when the encoding length is minimized, for example the MDL (Minimum Description Length) criterion.


1. Reduced-Error Pruning (REP)
This pruning method considers every node in the tree as a candidate for pruning. Deciding whether to prune a node consists of the following steps:
1: Delete the subtree rooted at this node.
2: Make the node a leaf.
3: Assign it the most common class of the training data associated with this node.
4: If the pruned tree performs no worse than the original tree on the validation set, the node is actually deleted.
Because overfitting to the training set can be detected on the validation set, the above operation is repeated, processing nodes bottom-up and always deleting the node whose removal most increases accuracy on the validation set, until further pruning is harmful (that is, until it reduces accuracy on the validation set).
REP is one of the simplest post-pruning methods, but when the amount of data is small it tends to overfit and is less often used: characteristics present in the training data may be ignored during pruning, so be careful when the validation set is much smaller than the training set.
Despite this drawback, REP is still often used as a baseline for evaluating the performance of other pruning algorithms.

It provides a good starting point for understanding the advantages and disadvantages of two-stage decision tree learning methods.

Because the validation set does not participate in building the decision tree, the tree obtained after REP pruning is much less biased on unseen examples, which alleviates the overfitting problem to some extent.
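A minimal sketch of this bottom-up procedure on a toy tree structure follows; the Node class and helpers are assumptions made for the illustration, not the article's implementation.

```python
# Illustrative sketch of reduced-error pruning (REP) on a toy tree structure.
# The Node class and helpers are assumptions made for the example.

class Node:
    def __init__(self, test=None, left=None, right=None, majority=None):
        self.test = test          # function row -> bool; None means leaf
        self.left, self.right = left, right
        self.majority = majority  # majority class of the training data here

    def predict(self, row):
        if self.test is None:
            return self.majority
        return (self.left if self.test(row) else self.right).predict(row)

def accuracy(tree, rows, labels):
    return sum(tree.predict(r) == y for r, y in zip(rows, labels)) / len(labels)

def rep_prune(root, node, val_rows, val_labels):
    """Bottom-up: try turning each internal node into a leaf labelled with its
    majority class; keep the change only if validation accuracy does not drop."""
    if node.test is None:
        return
    rep_prune(root, node.left, val_rows, val_labels)
    rep_prune(root, node.right, val_rows, val_labels)
    before = accuracy(root, val_rows, val_labels)
    saved = node.test
    node.test = None                     # temporarily make the node a leaf
    if accuracy(root, val_rows, val_labels) >= before:
        node.left = node.right = None    # pruning did not hurt: keep it
    else:
        node.test = saved                # pruning hurt: restore the subtree
```

Calling rep_prune(root, root, X_val, y_val) performs one bottom-up pass; as described above, the pass can be repeated until no further pruning improves validation accuracy.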

2. Pessimistic Error Pruning (PEP)
First, the accuracy of the rule on the training examples it applies to is calculated. This observed accuracy is then treated as following a binomial distribution, and its standard deviation is computed.

For a given confidence interval, the lower bound of this estimate is used as the measure of rule performance. The result is that for large data sets the pruning decision is based on something very close to the observed accuracy, while as the data set shrinks it moves further away from the observed accuracy. Although this pruning method is not statistically sound, it is effective in practice.


PEP is designed to improve the reliability of predictions on test data. PEP adds a continuity correction to the error estimate.

The PEP method holds that if

E(t) \le E(T_t) + SE(E(T_t))

holds, then the subtree T_t should be pruned.

In the formula:

E(t) is the (corrected) error count of node t; i ranges over the leaf nodes of the subtree T_t; N_t is the number of leaves of T_t, so that with the continuity correction E(T_t) = \sum_i e(i) + N_t/2; n(t) is the number of training samples reaching node t; and SE(E(T_t)) = \sqrt{E(T_t)\,[n(t) - E(T_t)] / n(t)}. PEP works top-down: if a non-leaf node satisfies the inequality above, its subtree is cut and the node becomes a leaf.

This algorithm is considered one of the most accurate post-pruning algorithms for decision trees, but it still has flaws. First, PEP is the only method here that prunes top-down; such a strategy can cause the same problem as pre-pruning: a node may be pruned even when its descendants would not need to be, so PEP can prune incorrectly.
Although the PEP method has these limitations, it shows high accuracy in practical applications.

In addition, PEP does not need to split off a separate validation set, which is an advantage when data is scarce. Moreover, its pruning strategy is more efficient and faster than other methods: during pruning each subtree is visited at most once, so in the worst case its time complexity is linear in the number of non-leaf nodes of the unpruned tree.
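Below is a sketch of the PEP test for a single internal node, following the formulas above; the argument names mirror E(t), E(T_t), and SE(E(T_t)), and the example numbers at the bottom are invented.

```python
# Sketch of the PEP pruning test for one internal node, following the
# formulas above. node_errors = misclassified training samples if node t is
# collapsed to a leaf; leaf_errors = misclassification counts of the leaves
# of subtree T_t; n = number of training samples reaching t. The example
# numbers at the bottom are invented.
import math

def pep_should_prune(node_errors, leaf_errors, n):
    E_t = node_errors + 0.5                           # E(t) with continuity correction
    E_Tt = sum(leaf_errors) + 0.5 * len(leaf_errors)  # E(T_t) = sum e(i) + N_t / 2
    se = math.sqrt(E_Tt * (n - E_Tt) / n)             # SE(E(T_t))
    return E_t <= E_Tt + se

print(pep_should_prune(node_errors=7, leaf_errors=[2, 0, 3], n=16))  # True
```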

3. Cost-Complexity Pruning (CCP)
The CCP approach consists of two steps:
1: From the original decision tree T0, generate a sequence of subtrees {T0, T1, T2, ..., Tn}, where Ti+1 is generated from Ti and Tn is the root node alone.
2: From this subtree sequence, select the optimal decision tree based on an estimate of the true error of each tree.

For each non-leaf node t in the classification and regression tree, compute its surface error rate gain value:

\alpha = \frac{R(t) - R(T_t)}{|N_{T_t}| - 1}

where:

|N_{T_t}| is the number of leaf nodes in the subtree T_t;

R(t) is the error cost of node t if the node is pruned (turned into a leaf):

R(t) = r(t) \cdot p(t)

r(t) is the error rate of node t;

p(t) is the proportion of all data that reaches node t;

R(T_t) is the error cost of the subtree T_t if the node is not pruned; it equals the sum of the error costs of all the leaf nodes of T_t:

R(T_t) = \sum_i r(i) \, p(i)

For example, consider a non-leaf node t4 (figure not reproduced here).

It is known that there are 60 data records in total. The node error cost of t4 is

R(t4) = r(t4) \cdot p(t4), with p(t4) = \frac{16}{60}, since the subtree below t4 covers 5 + 2 + 9 = 16 records (the error rate r(t4) is read from the figure).

The subtree error cost is:

R(T_{t4}) = \sum_i r(i)\,p(i) = \frac{2}{5}\cdot\frac{5}{60} + \frac{0}{2}\cdot\frac{2}{60} + \frac{3}{9}\cdot\frac{9}{60} = \frac{5}{60}

The subtree rooted at t4 has 3 leaf nodes, so finally:

\alpha = \frac{R(t4) - R(T_{t4})}{|N_{T_{t4}}| - 1} = \frac{R(t4) - 5/60}{2}

Find the non-leaf node with the smallest α value and prune it, i.e., set its left and right children to null (turn it into a leaf).

When several non-leaf nodes attain the minimum α at the same time, prune the one whose |N_{T_t}| is largest, i.e., the subtree with the most leaves.
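As an illustration (not from the original article), a small helper can compute α for the t4 example; the error rate r(t4) of t4 as a leaf would come from the figure that is not reproduced here, so it is passed in as a hypothetical parameter.

```python
# Computing the surface error rate gain alpha for the t4 example. R(T_t4)
# and |N_Tt4| follow from the text; r(t4) would be read off the original
# figure, so the value passed below is only a hypothetical placeholder.
from fractions import Fraction

def alpha(r_t, p_t, R_subtree, num_leaves):
    R_t = r_t * p_t                       # R(t) = r(t) * p(t)
    return (R_t - R_subtree) / (num_leaves - 1)

R_Tt4 = (Fraction(2, 5) * Fraction(5, 60)
         + Fraction(0, 2) * Fraction(2, 60)
         + Fraction(3, 9) * Fraction(9, 60))   # = 5/60
p_t4 = Fraction(5 + 2 + 9, 60)                 # 16 of the 60 records reach t4

print(R_Tt4)                                   # 1/12 (= 5/60)
print(alpha(r_t=Fraction(7, 16),               # hypothetical r(t4)
            p_t=p_t4, R_subtree=R_Tt4, num_leaves=3))  # 1/60 with this r(t4)
```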

The pruning process is particularly important; it occupies an important position in generating the optimal decision tree.

Studies have shown that the pruning process matters more than the tree-growing process: for the maximum trees generated by different splitting criteria, the most important attribute splits are preserved after pruning, with little difference between them, so the pruning method is the key to producing the optimal tree.

