1. Brief Introduction
Linear regression fits all sample points with a single global model (locally weighted linear regression being the exception). When the data has many features with complex relationships among them, building one global model is both difficult and clumsy. Moreover, many practical problems are nonlinear, for example the frequently encountered piecewise functions, which a global linear model cannot fit.
Tree regression splits the dataset into multiple subsets that are each easy to model, and then fits each subset with a simple model such as a constant or a linear regression. This article introduces the classic CART (Classification And Regression Trees) algorithm.
2. Basic Flow of the Classification and Regression Tree
Build tree (a code sketch follows the list):
1. Find the [best split feature] and its split value
2. If the dataset can no longer be split, save the node as a [leaf node] and return
3. Split the dataset into left and right subsets on the best split feature (by convention here, samples whose value on that feature is greater than the split value go left, the rest go right)
4. [Build tree] on the left subset
5. [Build tree] on the right subset
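Below is a minimal Python sketch of this recursion, in the spirit of the createTree function named in section 6 but not the original implementation; choose_best_split and bin_split are sketched after the next list, and leaf_fn / err_fn are the leaf and error functions discussed in section 3:

```python
import numpy as np

def create_tree(data, leaf_fn, err_fn, ops=(1, 4)):
    """Recursively build a regression tree over a 2-D array whose
    last column is the target y."""
    feat, val = choose_best_split(data, leaf_fn, err_fn, ops)
    if feat is None:                     # no worthwhile split: become a leaf
        return val
    left, right = bin_split(data, feat, val)
    return {"feat": feat, "val": val,
            "left": create_tree(left, leaf_fn, err_fn, ops),    # values > val
            "right": create_tree(right, leaf_fn, err_fn, ops)}  # values <= val
```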
Best split feature (a code sketch follows the list):
1. For each feature:
   1.1 For each value of that feature:
       1.1.1 Compute the [error] that results from splitting the dataset on that value
2. Select the feature and value with the smallest error as the best split and return them
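A sketch of this selection, in the spirit of the chooseBestSplit function named in section 6; tol_s and tol_n are the two stopping bounds discussed in section 3, and the leaf and error functions are passed in so the same code serves both regression trees and model trees:

```python
import numpy as np

def bin_split(data, feat, val):
    """Binary split: rows with feature value > val go left, the rest right."""
    return data[data[:, feat] > val], data[data[:, feat] <= val]

def choose_best_split(data, leaf_fn, err_fn, ops=(1, 4)):
    """Return (feature, value) of the lowest-error split, or (None, leaf)
    when no split beats the stopping bounds in ops."""
    tol_s, tol_n = ops                   # min error drop, min samples per side
    if len(np.unique(data[:, -1])) == 1:         # all targets equal already
        return None, leaf_fn(data)
    base_err = err_fn(data)
    best_err, best_feat, best_val = np.inf, None, None
    for feat in range(data.shape[1] - 1):        # last column is the target y
        for val in np.unique(data[:, feat]):
            left, right = bin_split(data, feat, val)
            if len(left) < tol_n or len(right) < tol_n:
                continue                         # too few samples on a side
            err = err_fn(left) + err_fn(right)
            if err < best_err:
                best_err, best_feat, best_val = err, feat, val
    if base_err - best_err < tol_s:              # error drop below tolerance
        return None, leaf_fn(data)
    return best_feat, best_val
```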
Prediction with a regression tree (a code sketch follows the list):
1. If the current node is a leaf node, [predict]; otherwise go to 2
2. Compare the test sample's value of the node's split feature with the node's split value. If the test value is greater, descend into the left subtree: if it is not a leaf node, recursively [predict with the regression tree], otherwise [predict]. If the test value is not greater, descend into the right subtree in the same way: if it is not a leaf node, recursively [predict with the regression tree], otherwise [predict]
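A sketch of this descent; leaf_eval is the function that turns a leaf into a prediction (reg_tree_eval here, corresponding to regTreeEval in section 6; a model-tree counterpart appears in section 5):

```python
def is_tree(node):
    """Internal nodes are the dicts built by create_tree; leaves are not."""
    return isinstance(node, dict)

def tree_forecast(tree, x, leaf_eval):
    """Walk one sample x down the tree and evaluate the leaf it reaches."""
    if not is_tree(tree):
        return leaf_eval(tree, x)
    if x[tree["feat"]] > tree["val"]:            # greater values go left
        return tree_forecast(tree["left"], x, leaf_eval)
    return tree_forecast(tree["right"], x, leaf_eval)

def reg_tree_eval(leaf, x):
    """Plain regression tree: the leaf is simply a stored mean."""
    return float(leaf)
```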
3. Practice Notes for Classification and Regression Trees
The error, leaf node, and prediction functions are closely related. The simplest choice: the error of a node is the variance of its y values multiplied by the number of samples (i.e., the total squared error), the leaf node stores the mean of the y values of all samples that reach it, and prediction returns the mean stored at the leaf that the test sample lands in.
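In code, these two choices (corresponding to regErr and regLeaf in section 6) are one line each; a sketch, again assuming the last column of the data is y:

```python
import numpy as np

def reg_leaf(data):
    """Leaf node: the mean of y over the samples that reach this node."""
    return np.mean(data[:, -1])

def reg_err(data):
    """Node error: variance of y times sample count = total squared error."""
    return np.var(data[:, -1]) * data.shape[0]
```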
When choosing the best split feature there are two parameters: the tolerated error reduction and the minimum number of samples per split. For the error-reduction tolerance, the drop in error achieved by a split must be at least this bound, or the split is rejected; for the minimum sample count, each subset produced by a split must contain at least this many samples. Both strategies are designed to avoid overfitting.
4. Tree Pruning
Avoiding overfitting by setting these parameters while choosing the best split feature is a form of pre-pruning; pruning after the regression tree has been built is a form of post-pruning.
The post-pruning process is as follows:
1. If either subtree is itself a tree, recursively prune that subtree
2. Compute the error on the test data after merging the current two leaf nodes
3. Compute the error without merging
4. Compare the errors before and after merging; if merging reduces the error, merge the leaf nodes (a code sketch follows the list)
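A sketch of this procedure, in the spirit of the prune function named in section 6, for a tree built with mean-valued leaves; it reuses is_tree and bin_split from the earlier sketches:

```python
import numpy as np

def get_mean(tree):
    """Collapse a subtree into the average of its leaf values."""
    l = get_mean(tree["left"]) if is_tree(tree["left"]) else tree["left"]
    r = get_mean(tree["right"]) if is_tree(tree["right"]) else tree["right"]
    return (l + r) / 2.0

def prune(tree, test):
    """Post-prune against held-out test data; returns the pruned tree."""
    if len(test) == 0:                   # no test data reaches this subtree
        return get_mean(tree)
    if is_tree(tree["left"]) or is_tree(tree["right"]):
        left, right = bin_split(test, tree["feat"], tree["val"])
        if is_tree(tree["left"]):
            tree["left"] = prune(tree["left"], left)
        if is_tree(tree["right"]):
            tree["right"] = prune(tree["right"], right)
    if not is_tree(tree["left"]) and not is_tree(tree["right"]):
        left, right = bin_split(test, tree["feat"], tree["val"])
        err_split = (np.sum((left[:, -1] - tree["left"]) ** 2) +
                     np.sum((right[:, -1] - tree["right"]) ** 2))
        merged = (tree["left"] + tree["right"]) / 2.0
        err_merge = np.sum((test[:, -1] - merged) ** 2)
        if err_merge < err_split:        # merging lowers the test error
            return merged                # replace the two leaves by their mean
    return tree
```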
5. Model Tree
When the leaf nodes hold models instead of constants, the result is a model tree; again this comes down to the error, leaf node, and prediction functions. The linear regression introduced earlier can be used to build the leaf nodes: the error is the squared error of the linear fit at the node, the leaf stores the coefficients fitted on the node's samples, and prediction applies the leaf's coefficients to the test sample.
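A sketch of the three model-tree counterparts plus the shared solver (corresponding to modelErr, modelLeaf, modelTreeEval, and linearSolve named in section 6):

```python
import numpy as np

def linear_solve(data):
    """Least-squares fit y = X @ w on the samples at this node;
    X gets a leading column of ones for the intercept."""
    X = np.hstack([np.ones((data.shape[0], 1)), data[:, :-1]])
    y = data[:, -1]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w, X, y

def model_leaf(data):
    """Leaf node: the fitted regression coefficients, not a mean."""
    return linear_solve(data)[0]

def model_err(data):
    """Node error: squared residuals of the node's own linear fit."""
    w, X, y = linear_solve(data)
    return float(np.sum((y - X @ w) ** 2))

def model_tree_eval(w, x):
    """Prediction: apply the leaf's coefficients to the test sample
    (prepend a 1 for the intercept)."""
    return float(np.concatenate([[1.0], np.asarray(x, dtype=float)]) @ w)
```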
6. Programming Implementation
Here createTree is responsible for building the tree, the chooseBestSplit function selects the best split feature, the ops parameter holds the two bounds described in section 3, and prune performs the post-pruning.
regErr, regLeaf, and regTreeEval are the error, leaf-node, and prediction functions based on simple means, while modelErr, modelLeaf, and modelTreeEval (plus linearSolve) are their counterparts based on the linear regression model.
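A hypothetical end-to-end run tying the earlier sketches together on synthetic piecewise data (the real dataset from the link below would be loaded instead):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, (200, 1))                       # one feature
y = np.where(x[:, 0] < 0.5, 1.0, 3.0) + rng.normal(0, 0.1, 200)
data = np.hstack([x, y.reshape(-1, 1)])               # last column is y

tree = create_tree(data, reg_leaf, reg_err, ops=(0.1, 10))
print(tree_forecast(tree, [0.25], reg_tree_eval))     # close to 1.0
print(tree_forecast(tree, [0.75], reg_tree_eval))     # close to 3.0
```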
Dataset link: http://pan.baidu.com/share/link?shareid=3744521160&uk=973467359 Password: 9IVD