CART (Classification and Regression Tree) Learning Notes


CART: Classification and Regression Tree. (It is a binary tree.)

CART is a decision tree algorithm consisting of three parts: feature selection, tree generation, and pruning. It is used for both classification and regression problems, which are described separately below.

1. Regression tree: using the minimum squared error criterion

The training set is D = {(x1, y1), (x2, y2), ..., (xN, yN)}.

The output y is a continuous variable. The input space is divided into M regions R1, R2, ..., RM, with output value cm on each region. The regression tree model can be expressed as:

f(x) = sum_{m=1}^{M} cm * I(x in Rm)

where I(.) is the indicator function.

The squared error over the training data is:

sum_{i=1}^{N} (yi - f(xi))^2

If feature j and split value s are used to divide the input space into two regions, the regions are:

R1(j, s) = {x | x^(j) <= s} and R2(j, s) = {x | x^(j) > s}

We need to find the (j, s) that minimizes the loss function, namely:

min over (j, s) of [ sum over xi in R1(j,s) of (yi - c1)^2 + sum over xi in R2(j,s) of (yi - c2)^2 ]

Here c1 and c2 are the means of the output values in regions R1 and R2 respectively. (This differs from the formula in the statistical learning textbook, where c1 and c2 are each chosen to minimize their region's squared error; but for a fixed region, the mean of the outputs is exactly the value that minimizes the squared error, so for simplicity the region mean is used directly.)
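A quick numeric check of that claim: for a fixed region, the region mean minimizes the squared error. This is a sketch in Python with an arbitrary, made-up output list:

```python
# Arbitrary example outputs for one region (made-up values)
ys = [1.0, 2.0, 6.0]
mean = sum(ys) / len(ys)  # the region mean, here 3.0

def sq_err(c):
    """Squared error of the region if its output is the constant c."""
    return sum((y - c) ** 2 for y in ys)

# The squared error at the mean is no larger than at nearby candidates
assert all(sq_err(mean) <= sq_err(c) for c in [2.0, 2.9, 3.1, 4.0])
```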

To minimize the squared error, we iterate over each feature in turn, compute the error at every possible split point, select the split point with the smallest error to divide the input space into two parts, and then apply the same steps recursively until splitting ends. The tree built by this method is called a least squares regression tree.

Least squares regression tree generation algorithm:

1) Loop over each feature j and each value s of that feature, compute the loss function for each split point (j, s), and select the split point with the smallest loss.

2) Divide the current input space into two parts using the split point from the previous step.

3) Compute split points again within each of the two resulting regions, and so on, until no further division is possible.

4) Finally the input space is divided into M regions R1, R2, ..., RM, and the resulting decision tree is:

f(x) = sum_{m=1}^{M} cm * I(x in Rm)

where cm is the average of the output values in region Rm.
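The split search in steps 1)-2) can be sketched as a brute-force loop in Python. The function name and the list-based data layout are illustrative choices, not from the source:

```python
def best_split(xs, ys):
    """Exhaustively search every feature j and every observed value s of
    that feature for the split (j, s) minimizing the total squared error."""
    def sq_err(vals):
        # For a fixed region, the region mean minimizes the squared error
        if not vals:
            return 0.0
        c = sum(vals) / len(vals)
        return sum((v - c) ** 2 for v in vals)

    n_features = len(xs[0])
    best_j, best_s, best_err = None, None, float("inf")
    for j in range(n_features):
        for s in sorted({row[j] for row in xs}):
            left = [y for row, y in zip(xs, ys) if row[j] <= s]
            right = [y for row, y in zip(xs, ys) if row[j] > s]
            if not left or not right:
                continue  # skip degenerate splits with an empty region
            err = sq_err(left) + sq_err(right)
            if err < best_err:
                best_j, best_s, best_err = j, s, err
    return best_j, best_s, best_err
```

Splitting recursively on the returned (j, s) and stopping when no valid split remains yields the least squares regression tree described above.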

Summary: the complexity of this method is high; in particular, each search for a split point must traverse every possible value of every current feature. If there are F features in total, each feature has N values, and the resulting decision tree has S internal nodes, the algorithm's time complexity is O(F*N*S).

2. Classification tree: using the Gini index minimization criterion

Gini index: if there are K classes in total and the probability that a sample belongs to class k is pk, then the Gini index of the probability distribution is:

Gini(p) = sum_{k=1}^{K} pk * (1 - pk) = 1 - sum_{k=1}^{K} pk^2

The larger the Gini index, the greater the uncertainty.

For binary classification, with p the probability of the first class:

Gini(p) = 2p(1 - p)

Using feature A with value a, D is divided into two parts: D1 (the samples satisfying A = a) and D2 (the samples not satisfying A = a). The Gini index of D under the condition A = a is:

Gini(D, A) = (|D1|/|D|) * Gini(D1) + (|D2|/|D|) * Gini(D2)

Gini(D): represents the uncertainty of set D.

Gini(D, A): represents the uncertainty of set D after splitting on A = a.

CART generation algorithm:

1) Iterate over each possible value a of each feature A, and compute the Gini index for each split point (A, a).

2) Select the split point with the smallest Gini index as the optimal split point, then use it to split the current dataset into two subsets.

3) Recursively apply steps 1) and 2) to the two subsets from the previous step until the stopping condition is met. (The algorithm stops when the number of samples is less than a predetermined threshold, the Gini index of the sample set is less than a predetermined threshold, or no features remain.)

4) The CART decision tree is generated.
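A minimal sketch of steps 1)-2) in Python: compute Gini(D) and Gini(D, A), then pick the (feature, value) pair with the smallest conditional Gini index. The function names and the list-of-lists data layout are my own assumptions, not from the source:

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_k pk^2 over the class proportions of D."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_given(samples, labels, feature, value):
    """Gini(D, A): weighted Gini after splitting on A = value vs. A != value."""
    d1 = [y for x, y in zip(samples, labels) if x[feature] == value]
    d2 = [y for x, y in zip(samples, labels) if x[feature] != value]
    n = len(labels)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

def best_gini_split(samples, labels):
    """Try every (feature, value) pair; keep the one with the smallest Gini."""
    candidates = [(f, v) for f in range(len(samples[0]))
                  for v in {x[f] for x in samples}]
    return min(candidates,
               key=lambda fv: gini_given(samples, labels, fv[0], fv[1]))
```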

3. CART tree pruning

The decision tree generated by CART is denoted T0; it is then pruned from the bottom of T0 up to the root node. During pruning, the loss function is:

Ca(T) = C(T) + a*|T|

Note: for ease of editing, the parameter alpha is written as a here.

a >= 0; C(T) is the prediction error on the training data; |T| is the number of leaf nodes, measuring the model's complexity.

For a fixed a, there must be a subtree Ta of T0 that minimizes the loss function Ca(T). That is, for every fixed a there is a corresponding tree that minimizes the loss. Different values of a thus produce different optimal trees, and we do not know in advance which of these optimal trees is best. So we divide the value space of a into a series of intervals, take one a from each interval to obtain the corresponding optimal tree, and finally choose among those trees the one with the smallest loss.

Now take a series of values for a: a0 < a1 < ... < an < +infinity, producing a series of intervals [ai, ai+1). Taking one value ai in each interval, for each ai we can get an optimal tree Tai. Thus we get a list of optimal trees {T0, T1, ..., Tn}.

So, for a fixed a, how do we find the optimal subtree?

If node t is treated as a single-node tree, its loss function is:

Ca(t) = C(t) + a*1

For the subtree Tt rooted at node t, the loss function is:

Ca(Tt) = C(Tt) + a*|Tt|

When a = 0, there is no pruning, and Ca(t) > Ca(Tt), because classifying with the full subtree certainly fits the training data better than collapsing all of its samples into one class, even if it overfits.

However, as a increases, the order of Ca(t) and Ca(Tt) changes (that is, Ca(t) - Ca(Tt) decreases monotonically with a; this is a guess here, without proof). At some point Ca(t) = Ca(Tt): t and Tt then have the same loss, and since t has fewer nodes, we choose t.

Setting Ca(t) = Ca(Tt), that is, C(t) + a = C(Tt) + a*|Tt|, gives:

a = (C(t) - C(Tt)) / (|Tt| - 1)

Based on this analysis, for each internal node t in T0 we compute:

g(t) = (C(t) - C(Tt)) / (|Tt| - 1)

which represents how much the overall loss function decreases after pruning at t. In T0, prune the subtree Tt whose g(t) is smallest; the resulting new tree is denoted T1, and this smallest g(t) is denoted a1. Then T1 is the optimal tree on the interval [a1, a2).

Analysis of the relationship between a and the loss function: when a = 0 there is no pruning, and because the resulting tree overfits, the loss is large. As a increases, the overfitting slowly fades and the loss slowly decreases. Once a grows past a certain critical value, the tree becomes simpler and simpler and the loss begins to increase again. So we need to find the critical value of a that minimizes the loss function.

How do we find the a that minimizes the loss? We iterate over each internal node of the generated tree and compute the overall loss both with and without that node's subtree pruned; the value of a at which the two losses become equal is the smallest a for which pruning at that node is worthwhile. Each internal node thus yields its own a, representing how much pruning there reduces the overall loss function.

So which a should we choose to prune the generated tree? We choose the smallest of the values computed above. If we chose an a that is not the smallest, there would be at least two internal nodes eligible for pruning, and the loss after pruning in several places would be greater than after pruning in just one (this statement may not be precise). To keep the loss minimal, we therefore prune with the smallest a.

After choosing a, we compute the corresponding subtree with minimal loss: starting from the root of the tree, traverse the internal nodes layer by layer and decide at each one whether pruning is required. The pruned tree is the tree we need.

CART pruning algorithm:

1) Set k = 0, T = T0, a = +infinity.

2) Traverse the internal nodes from the bottom up, computing C(Tt), |Tt|, and g(t) = (C(t) - C(Tt)) / (|Tt| - 1); then set a = min(a, g(t)).

3) Visit each internal node t from the top down; if g(t) = a, prune the subtree at t and determine the class of t by majority vote. This gives the tree T.

4) If T is not a tree consisting of the root node alone, repeat steps 2) and 3) to obtain a sequence of subtrees.

5) Finally, use cross-validation to select the optimal subtree from the subtree sequence.
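Step 2)'s bottom-up computation of g(t) can be sketched on a toy tree structure. The Node class and its error bookkeeping are illustrative assumptions, not from the source:

```python
# Hypothetical node type: a leaf has no children; `err` stores C(t), the
# training error of node t if its subtree were collapsed to a single leaf.
class Node:
    def __init__(self, err, children=()):
        self.err = err
        self.children = list(children)

def subtree_error(t):
    """C(Tt): total training error over the leaves of the subtree at t."""
    if not t.children:
        return t.err
    return sum(subtree_error(c) for c in t.children)

def n_leaves(t):
    """|Tt|: number of leaf nodes in the subtree rooted at t."""
    if not t.children:
        return 1
    return sum(n_leaves(c) for c in t.children)

def g(t):
    """g(t) = (C(t) - C(Tt)) / (|Tt| - 1), for an internal node t."""
    return (t.err - subtree_error(t)) / (n_leaves(t) - 1)

def min_g(t):
    """Bottom-up pass: the smallest g over all internal nodes is the value
    of a at which the first subtree should be pruned."""
    if not t.children:
        return float("inf")
    return min([g(t)] + [min_g(c) for c in t.children])
```

Pruning then removes the subtrees whose g(t) equals this minimum and repeats on the smaller tree, producing the subtree sequence that step 5) selects from by cross-validation.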

References:

[1] Li Hang, Statistical Learning Methods.
