Decision Trees: CART

Source: Internet
Author: User
Tags: id3

Following the previous article on ID3 and C4.5, this post discusses another decision tree algorithm: CART (Classification and Regression Tree). CART was proposed by Breiman et al. in 1984 and is one of the most widely used decision tree algorithms. Unlike ID3 and C4.5, CART builds a binary tree: every split on a feature produces exactly two child nodes, whereas in ID3 and C4.5 a node branches into as many children as the chosen feature has distinct values (continuous features can be handled by discretization). CART covers both regression and classification, so the regression tree and the classification tree are introduced in turn below.

Regression Tree and Model Tree

Let us first briefly recall linear regression: for a data set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, the linear regression model is:

\[\bar{y}_i = \theta^T x_i \]

and its loss function can be written as:

\[l(\theta) = \frac{1}{n}\sum_i (\bar{y}_i - y_i)^2\]
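As a quick illustration of this model and loss, here is a minimal NumPy sketch that fits $\theta$ by ordinary least squares on a toy data set; the function name `fit_linear` and the toy data are only illustrative, not part of the original article.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares: minimize (1/n) * sum((theta^T x_i - y_i)^2)."""
    # Illustrative sketch; lstsq solves the least-squares problem directly.
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

# Toy data: y ~ 2*x + 1 with noise; prepend a bias column of ones.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=50)
X = np.column_stack([np.ones_like(x), x])

theta = fit_linear(X, y)
mse = np.mean((X @ theta - y) ** 2)
print(theta, mse)
```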

Linear regression is already quite powerful and handles linear data sets well. For samples that lie on a curve, one can introduce higher-order terms into the linear regression, or use locally weighted regression. Locally weighted regression uses the same model as linear regression, but makes an adjustment to the loss function:

\[l(\theta) = \frac{1}{n}\sum_i w_i (\bar{y}_i - y_i)^2\]

\[w_i = \exp\left(-\frac{(x_i - x)^2}{2\tau^2}\right)\]

Here $\tau$ controls how quickly the weights decay: every sample $x_i$ has its own weight, and samples closer to the query point $x$ receive larger weights, with the scale adjusted by $\tau$. The fits produced by locally weighted regression and plain linear regression are compared in the figures below.
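To make the weighting concrete, below is a minimal NumPy sketch of locally weighted regression at a single query point, using the Gaussian weights above on a one-dimensional feature with a bias column; the name `lwr_predict` and the toy data are illustrative assumptions, not code from the original article.

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=1.0):
    """Locally weighted regression prediction at one query point.

    Each training point gets weight w_i = exp(-(x_i - x)^2 / (2 * tau^2)),
    and theta is then fit by weighted least squares around the query point.
    (Illustrative sketch; distances use the raw feature column, not the bias.)
    """
    diff = X[:, 1] - x_query[1]
    w = np.exp(-(diff ** 2) / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equations: theta = (X^T W X)^{-1} X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

# Toy curved data that a single global line fits poorly.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, size=100))
y = np.sin(x) + rng.normal(scale=0.1, size=100)
X = np.column_stack([np.ones_like(x), x])

x_q = np.array([1.0, 3.0])   # query point with bias term
print(lwr_predict(X, y, x_q, tau=0.5))
```

A smaller $\tau$ concentrates the weights on the immediate neighborhood of the query point, which is what lets the fit follow a curved data set locally.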

Linear regression models the data globally: the entire data set is optimized toward a single objective. When the data is not linear, as in the figure shown, sharing one global objective is clearly not a good choice; when the data can be partitioned, piecewise regression on the partitions works better. The data shown can be divided into five segments and handled separately: fitting a model to each segment gives what is called a model tree, while simply predicting a constant value for each segment gives the more common regression tree. The regression tree and the model tree are introduced next.

The regression tree uses the mean squared error as its loss function. When the tree is grown, the feature space is split recursively on the optimal feature and the optimal split value until a stopping condition is satisfied. The stopping conditions can be set manually, for example: stop splitting a node when its sample count falls below a minimum, or stop and emit a leaf node when the reduction in loss from a split is smaller than $\varepsilon$.

For the resulting regression tree, the prediction of each leaf node is the mean of the labels of the data that fall into that leaf. Suppose the feature space is divided into $M$ parts, that is, there are $M$ leaf nodes $R_1, R_2, \ldots, R_M$ containing $N_1, N_2, \ldots, N_M$ samples respectively. The predicted value of leaf node $R_m$ is:

\[c_m = \frac{1}{N_m}\sum_{x_i \in R_m} y_i \qquad (*) \]

The regression tree is a binary tree: each split is made on one value of one feature, each internal node tests the corresponding feature, and a sample is routed down the tree until it reaches a leaf node that gives its prediction. The difficulty in building such a tree is how to select the best split feature and the best split value of that feature. Suppose the data is split on value $s$ of the $j$-th feature; the two regions after the split are:

\[R_1(j,s) = \left\{x_i \mid x_i^{j} \le s \right\} \qquad R_2(j,s) = \left\{x_i \mid x_i^{j} > s \right\}\]

Using $(*)$, compute the estimates $c_1$ and $c_2$ for $R_1$ and $R_2$ separately, and then compute the loss after splitting on $(j,s)$:

\[\min_{j,s} \left[\sum_{x_i \in R_1} (y_i - c_1)^2 + \sum_{x_i \in R_2} (y_i - c_2)^2 \right]\]

Find the $(j,s)$ pair that minimizes this loss, and apply the selection recursively until the stopping condition is met (a code sketch of the full procedure is given after the algorithm below). The regression tree algorithm is as follows:

Input: training data set $D$

Output: regression tree $T$

1) Select the optimal split feature $j$ and split value $s$; $(j,s)$ should satisfy:

\[\min_{j,s} \left[\sum_{x_i \in R_1} (y_i - c_1)^2 + \sum_{x_i \in R_2} (y_i - c_2)^2 \right]\]

2) Split the data set with the optimal $(j,s)$ and compute the predictions of the two regions:

\[R_1(j,s) = \left\{x_i \mid x_i^{j} \le s \right\} \qquad R_2(j,s) = \left\{x_i \mid x_i^{j} > s \right\}\]

\[c_1 = \frac{1}{N_1}\sum_{x_i \in R_1} y_i \qquad c_2 = \frac{1}{N_2}\sum_{x_i \in R_2} y_i \]

3) Apply steps 1) and 2) recursively to the two resulting regions until the stopping condition is met.

4) Return the regression tree $T$.
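As referenced above, here is a minimal Python sketch of this regression tree algorithm. The function names (`best_split`, `build_tree`, `predict`) and the stopping parameters `min_samples` and `min_gain` are illustrative choices corresponding to the node-size and loss-reduction stopping conditions described earlier, not code from the original article.

```python
import numpy as np

def squared_error(y):
    """Sum of squared deviations from the mean: sum((y_i - c)^2) with c = mean(y)."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(X, y):
    """Scan every feature j and threshold s for the (j, s) that minimizes the split loss."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left = X[:, j] <= s
            right = ~left
            if not left.any() or not right.any():
                continue
            loss = squared_error(y[left]) + squared_error(y[right])
            if loss < best[2]:
                best = (j, s, loss)
    return best

def build_tree(X, y, min_samples=5, min_gain=1e-3):
    """Recursively grow the binary regression tree; each leaf predicts the mean label c_m.

    min_samples and min_gain are illustrative stopping thresholds (node size and
    minimum loss reduction), as described in the text above.
    """
    if len(y) < min_samples:
        return {"leaf": True, "value": float(y.mean())}
    j, s, loss = best_split(X, y)
    if j is None or squared_error(y) - loss < min_gain:
        return {"leaf": True, "value": float(y.mean())}
    left = X[:, j] <= s
    return {"leaf": False, "feature": j, "threshold": s,
            "left": build_tree(X[left], y[left], min_samples, min_gain),
            "right": build_tree(X[~left], y[~left], min_samples, min_gain)}

def predict(tree, x):
    """Route a sample from the root to a leaf by comparing the split feature to the threshold."""
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["value"]

# Toy piecewise-constant data: the tree should recover the step at x = 5.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.where(X[:, 0] < 5, 1.0, 3.0) + rng.normal(scale=0.1, size=200)
tree = build_tree(X, y)
print(predict(tree, np.array([2.0])), predict(tree, np.array([8.0])))
```

A model tree would differ only at the leaves: instead of storing the mean $c_m$, each leaf would store a (linear) model fitted on the samples that fall into it.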

Classification Tree
