Following the discussion of ID3 and C4.5 in the previous article, this article covers another decision tree algorithm: the Classification and Regression Tree (CART), proposed by Breiman et al. in 1984 and still widely used. Unlike ID3 and C4.5, CART is a binary tree: every split on a feature produces exactly two child nodes, whereas ID3 and C4.5 create one branch per distinct value of the selected feature (continuous features must be discretized first). CART handles both regression and classification; the regression tree and the classification tree are introduced in turn below.
Regression Tree and Model Tree
First, a quick recap of linear regression: for a data set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, the linear regression model is:
\[\bar{y}_i = \theta^T x_i \]
and its loss function can be written as:
\[l(\theta) = \frac{1}{n}\sum_i (\bar{y}_i - y_i)^2\]
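As a minimal sketch of this setup, the loss above can be minimized in closed form by least squares (the function names here are illustrative, not from the original post):

```python
import numpy as np

def fit_linear(X, y):
    """Fit theta minimizing the mean squared error above."""
    # Append a bias column of ones so theta also learns an intercept.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    theta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return theta

def predict(theta, X):
    """Compute the model's predictions y_bar = theta^T x."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ theta
```

On perfectly linear data this recovers the generating line exactly; on noisy data it returns the least-squares fit.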
Linear regression is already quite powerful and handles linearly distributed data well. When the samples follow a curve, one option is to introduce higher-order terms (polynomial regression); another is locally weighted regression. Locally weighted regression uses the same model as linear regression, but modifies the loss function:
\[l(\theta) = \frac{1}{n}\sum_i w_i (\bar{y}_i - y_i)^2\]
\[w_i = \exp\left(-\frac{(x_i - x)^2}{2\tau^2}\right)\]
Here $\tau$ controls how quickly the weight decays: every training sample $x_i$ receives a weight $w_i$, and samples closer to the query point $x$ receive larger weights, with $\tau$ adjusting the bandwidth. Locally weighted regression thus shares its model with linear regression; the fits produced by the two methods are shown below:
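The weighted loss above can be minimized per query point by weighted least squares. A sketch for one-dimensional inputs, assuming a Gaussian weight with bandwidth `tau` as in the formula (the function name is mine):

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Predict y at x_query with locally weighted linear regression."""
    # Weight each training sample by its distance to the query point.
    w = np.exp(-(X - x_query) ** 2 / (2 * tau ** 2))        # shape (n,)
    # Design matrix [x, 1] so the local line has an intercept.
    Xb = np.stack([X, np.ones_like(X)], axis=1)
    W = np.diag(w)
    # Solve the weighted normal equations (Xb^T W Xb) theta = Xb^T W y.
    theta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
    return np.array([x_query, 1.0]) @ theta
```

Note that a fresh local model is solved for each query point, which is why locally weighted regression is expensive at prediction time.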
Linear regression fits a single global model, optimizing one objective over the entire data set. When the data is not linear, as in the data shown, sharing one global objective is clearly a poor choice; if the data can be partitioned, piecewise regression over the partitions works much better. The data shown can be divided into 5 segments and handled separately: fitting a separate regression model on each segment gives what is called a model tree, while simply predicting a constant for each segment gives the more common regression tree. The regression tree and the model tree are introduced in turn below.
The regression tree uses the mean squared error as its loss function. To grow the tree, the feature space is split recursively on the optimal feature and the optimal split value until a stopping condition is met. The stopping condition can be set by hand: for example, stop splitting a node once its sample count falls below a threshold $n_{min}$, or once the loss reduction from a split falls below $\varepsilon$, at which point a leaf node is generated.
For the resulting regression tree, the prediction at each leaf node is the mean of the labels of the samples that fall into it. Suppose the feature space is divided into $M$ parts, i.e. there are $M$ leaf nodes $R_1, R_2, \ldots, R_M$ containing $N_1, N_2, \ldots, N_M$ samples respectively. The predicted value of leaf $R_m$ is then:
\[c_m = \frac{1}{N_m}\sum_{x_i \in R_m} y_i \ \ \ \ \ \ (*) \]
The regression tree is a binary tree: each internal node tests one feature against a split value, and a sample is routed down the tree until it reaches a leaf, whose value becomes its prediction. The difficulty in building the tree is choosing the best split feature and the best split value for that feature. If the data is split on feature $j$ at value $s$, the two resulting regions are:
\[R_1(j,s) = \left \{x_i \mid x_i^{(j)} \le s \right \} \ \ R_2(j,s) = \left \{x_i \mid x_i^{(j)} > s \right \}\]
Using $(*)$, compute the estimates $c_1$ and $c_2$ for $R_1$ and $R_2$ separately, then compute the loss after splitting on $(j,s)$:
\[\min_{j,s} \left[\sum_{x_i \in R_1} (y_i - c_1)^2 + \sum_{x_i \in R_2} (y_i - c_2)^2 \right]\]
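The search over $(j,s)$ is usually done exhaustively: try every feature and every observed value as a threshold, and keep the pair with the smallest summed squared error. A sketch under that assumption (the function name is mine):

```python
import numpy as np

def best_split(X, y):
    """Return the (j, s, loss) minimizing the two-region squared error."""
    n, d = X.shape
    best_j, best_s, best_loss = None, None, np.inf
    for j in range(d):                      # candidate split features
        for s in np.unique(X[:, j]):        # candidate split values
            left = X[:, j] <= s             # region R1(j, s)
            if left.all() or not left.any():
                continue                    # skip splits leaving a region empty
            c1, c2 = y[left].mean(), y[~left].mean()   # leaf means, as in (*)
            loss = ((y[left] - c1) ** 2).sum() + ((y[~left] - c2) ** 2).sum()
            if loss < best_loss:
                best_j, best_s, best_loss = j, s, loss
    return best_j, best_s, best_loss
```

This brute-force search is $O(n^2 d)$ per node as written; real implementations sort each feature once and update the region sums incrementally.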
Find the pair $(j,s)$ that minimizes this loss, and recursively repeat the selection on each region until the stopping condition is met. The regression tree algorithm is as follows:
Input: Training data Set $D $
Output: Regression tree $T $
1) Select the optimal split feature $j$ and split value $s$; $(j,s)$ should satisfy:
\[\min_{j,s} \left[\sum_{x_i \in R_1} (y_i - c_1)^2 + \sum_{x_i \in R_2} (y_i - c_2)^2 \right]\]
2) Split the data set using the optimal $(j,s)$:
\[R_1(j,s) = \left \{x_i \mid x_i^{(j)} \le s \right \} \ \ R_2(j,s) = \left \{x_i \mid x_i^{(j)} > s \right \}\]
\[c_1 = \frac{1}{N_1}\sum_{x_i \in R_1} y_i \ \ \ \ c_2 = \frac{1}{N_2}\sum_{x_i \in R_2} y_i \]
3) Recursively repeat steps 1) $\sim$ 2) on $R_1$ and $R_2$ until the stopping condition is met.
4) Returns the decision tree $T $.
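The four steps above can be sketched end to end. This is a minimal dict-based implementation under my own naming, using the two stopping conditions mentioned earlier (minimum node size and minimum loss reduction):

```python
import numpy as np

def _best_split(X, y):
    # Step 1): exhaustive search for the (j, s) minimizing the split loss.
    best_j, best_s, best_loss = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left = X[:, j] <= s
            if left.all() or not left.any():
                continue
            c1, c2 = y[left].mean(), y[~left].mean()
            loss = ((y[left] - c1) ** 2).sum() + ((y[~left] - c2) ** 2).sum()
            if loss < best_loss:
                best_j, best_s, best_loss = j, s, loss
    return best_j, best_s, best_loss

def build_tree(X, y, min_samples=2, eps=1e-6):
    value = y.mean()                       # leaf prediction c_m, as in (*)
    if len(y) < min_samples:               # stop: node too small
        return {"value": value}
    j, s, loss = _best_split(X, y)
    # Stop: no split reduces the loss by at least eps.
    if j is None or ((y - value) ** 2).sum() - loss < eps:
        return {"value": value}
    left = X[:, j] <= s                    # step 2): split into R1 and R2
    return {"feature": j, "threshold": s,  # step 3): recurse on each region
            "left": build_tree(X[left], y[left], min_samples, eps),
            "right": build_tree(X[~left], y[~left], min_samples, eps)}

def predict_one(tree, x):
    # Route x down the binary tree until a leaf is reached.
    while "value" not in tree:
        branch = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["value"]
```

Each internal node stores the chosen $(j,s)$ and each leaf stores its mean label, matching the structure described above.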
Classification Tree