Decision Tree (regression tree) analysis and application modeling

First, CART Decision Tree Model Overview (Classification and Regression Trees)

A decision tree classifies data through a series of rules; it provides a rule-like description of what value is produced under what conditions. The decision tree algorithm belongs to supervised learning, meaning the original data must contain both predictor variables and a target variable. Decision trees are divided into classification trees (the target variable is categorical) and regression trees (the target variable is continuous). For a classification tree, the majority class of the output variable in a leaf node is the classification result; for a regression tree, the mean of the output variable in a leaf node is the predicted result.
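As a minimal illustration of this distinction (a sketch only, using the rpart package with R's built-in iris and mtcars data sets rather than any data used later in this article), the same function can grow either kind of tree:

library(rpart)

# Classification tree: each leaf predicts the majority class of Species
fit_cls <- rpart(Species ~ ., data = iris, method = "class")

# Regression tree: each leaf predicts the mean of mpg for the observations in it
fit_reg <- rpart(mpg ~ ., data = mtcars, method = "anova")

# Leaf predictions
head(predict(fit_cls, type = "class"))  # majority class of each observation's leaf
head(predict(fit_reg))                  # mean response of each observation's leaf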

A decision tree is an inverted tree structure consisting of internal nodes, leaf nodes, and edges; the topmost node is called the root node. Constructing a decision tree requires a training set, that is, a collection of examples, each described by some attributes (or features) and a class label. The purpose of constructing a decision tree is to discover the relationship between attributes and classes; once that relationship is found, it can be used to predict the classes of records whose class is unknown. A system with this predictive capability is called a decision tree classifier.

Decision trees have several notable advantages:

1) The decision tree does not require any domain knowledge; it follows a simple if ... then ... logic;

2) Decision tree can deal with high-dimensional data very well, and can filter out important variables.

3) The result of decision tree is easy to understand and grasp;

4) Decision trees are also fast to run;

5) In general, decision trees also achieve fairly good prediction accuracy.

The CART decision tree is also called the classification and regression tree. When the dependent variable of the data set is continuous, the tree is a regression tree, and the mean of the observations in a leaf node is used as the predicted value; when the dependent variable is discrete, the tree is a classification tree, which solves classification problems. Note, however, that the algorithm builds a binary tree, that is, each non-leaf node can only extend two branches, so when a discrete variable has more than two levels, that variable may be used for splitting multiple times.

The decision tree algorithm involves two core problems: feature selection and pruning.

The currently popular methods of feature selection are information gain, gain ratio, the Gini coefficient, and the chi-square test. Below we introduce feature selection based on the Gini coefficient, because the CART decision tree described in this article selects features by the Gini coefficient.

For pruning, the main approaches are pre-pruning and post-pruning. Pre-pruning restricts the tree before it grows, for example by limiting the number of layers or the number of observations in a leaf node; post-pruning is applied after the tree has fully grown, cutting it back based on a loss matrix or a complexity measure. Post-pruning is used below to modify the tree.

Second, the core problems of decision trees

The core problems of the decision tree are two: one is to use the training data to complete the tree-generation process, and the other is to use the testing data to complete the simplification of the tree. As mentioned earlier, the generated inference rules are often too numerous and need to be streamlined.

1) Growth of decision trees

The essence of the decision tree growth process is the repeated grouping (branching) of the training data; generation stops when grouping (branching) is no longer meaningful. Note the question of when grouping is no longer meaningful. The core of the growth algorithm is therefore to determine the criterion by which the data are split, that is, the branching criterion.

When is grouping no longer meaningful? When, after a branch, the differences in the output no longer decrease significantly, continuing to group makes no sense. In other words, the purpose of grouping is to make the variation of the output variable within each group as small as possible: ideally, by the time a leaf node is reached, the output variable within each leaf belongs to a single category, or the tree satisfies a user-specified stopping criterion.

Thus the branching criterion involves two aspects: 1. how to select the best grouping variable from the many input variables, and 2. how to find the best split point among the many values of the grouping variable. Different decision tree algorithms, such as C4.5, C5.0, CHAID, QUEST, and CART, adopt different strategies.

2) Pruning of decision trees

A fully grown decision tree is not the best tree for predicting the classes of new data objects, because it describes the training data too "precisely". As the decision tree grows, the number of samples handled at each branching step decreases, and the tree's representativeness of the data as a whole declines. At the root node the entire sample is processed; as the tree branches downward, each branch handles only the samples in its own subgroup. As the tree grows and the sample size shrinks, the data characteristics of the deeper nodes become more and more individualized, and inference rules like the following may appear: "a person whose annual income is greater than 50,000 yuan, whose age is over 50, and whose name is Zhang San purchased this product." Such over-learning accurately reflects the features of the training data but loses general representativeness and cannot be applied to classify new data; this is called overfitting (over-learning). So what should we do? Prune!

The usual pruning techniques are pre-pruning and post-pruning.

Pre-pruning specifies in advance the maximum depth of the decision tree, or the minimum sample size, to prevent the tree from growing excessively. The premise is that the user has a fairly clear grasp of the data and variables and is prepared to adjust the settings repeatedly; otherwise, reasonable values cannot be given. Note that a tree that grows too deep overfits and predicts new data poorly, while one that grows too shallow also predicts new data poorly.
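As a minimal sketch of pre-pruning (illustrative only, using R's built-in iris data rather than the data analysed later in this article), the maximum depth and minimum node size can be fixed before the tree is grown:

library(rpart)

# Pre-pruning: cap the tree depth and require at least 40 observations before a split
fit_pre <- rpart(Species ~ ., data = iris, method = "class",
                 control = rpart.control(maxdepth = 2, minsplit = 40))
fit_pre   # the printed tree has at most two levels of splits below the root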

Post-pruning trims the tree back from the edges: based on the fully grown tree, a maximum allowable error rate is set, and subtrees are trimmed away while the accuracy or error of the output is recalculated. Pruning stops as soon as the error rate would exceed the maximum allowed value.

Whereas pre-pruning relies on the training data, post-pruning should be based on testing data.

The C4.5, C5.0, CHAID, CART, and QUEST algorithms use different pruning strategies.

Case: analyzing diabetes blood test indicators with an rpart() regression tree

install.packages("rpart")
library(rpart)
install.packages("rpart.plot")
library(rpart.plot)

1. Main functions used:

1) Function for building a regression tree: rpart()

rpart(formula, data, weights, subset, na.action = na.rpart, method,
      model = FALSE, x = FALSE, y = TRUE, parms, control, cost, ...)

Main parameter descriptions (a combined usage sketch follows this list):

formula: the form of the model equation, for example y ~ x1 + x2 + x3.

data: a data frame containing the variables in the formula above.

na.action: how to handle missing data. The default (na.rpart) deletes observations for which the dependent variable is missing but keeps those with missing independent variables.

method: depends on the type of the outcome at the end of the tree; it takes four values: "anova" for a continuous outcome, "class" for a discrete outcome, "poisson" for count data (a Poisson process), and "exp" for survival analysis. The program chooses the method automatically from the type of the dependent variable, but it is generally better to specify this parameter explicitly so that the intended tree model is clear.

parms: used to set three optional parameters: the prior probabilities, the loss matrix, and the measure of classification purity (splitting index).

cost: a vector of non-negative costs, one per variable, used as scalings when evaluating splits on that variable. The misclassification loss matrix itself is supplied through parms = list(loss = ...), as in the case below; when a loss matrix is taken into account, the comparison of the weighted error of the leaf nodes with the parent's error at pruning time changes from a "reduction in error" to a "reduction in loss".

control: controls settings such as the minimum sample size at each node, the number of cross-validations, and the complexity parameter (cp), i.e. how much the model's goodness of fit must improve at each step. These settings are supplied through rpart.control:

xval: the number of cross-validations (e.g. 10 for 10-fold cross-validation);

minsplit: the minimum number of observations a node must contain before a split is attempted; for example, with minsplit = 20 a node is split further only if it holds at least 20 observations, otherwise splitting stops;

minbucket: the minimum number of observations in any leaf node; maxdepth: the maximum depth of the tree;

cp: short for complexity parameter, the amount by which the model's goodness of fit must improve for a split to be kept, which saves the unnecessary computation of growing splits that would only be pruned away.
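Putting the parameters above together, here is a hedged usage sketch; the names train and UNS anticipate the knowledge-mastery case at the end of this article, and the numeric settings are only illustrative:

library(rpart)

fit_demo <- rpart(UNS ~ ., data = train,
                  method = "class",                       # classification tree
                  parms  = list(split = "gini"),          # split on the Gini index
                  control = rpart.control(minsplit = 20,  # min. observations to attempt a split
                                          minbucket = 7,  # min. observations in a leaf
                                          maxdepth = 10,  # maximum tree depth
                                          cp = 0.01,      # complexity parameter
                                          xval = 10))     # 10-fold cross-validation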

2) Function for pruning: prune()

prune(tree, cp, ...)

Main parameter descriptions:

tree: a regression tree object, usually the result of rpart().

cp: the complexity parameter, specifying the threshold to use for pruning. cp is short for complexity parameter; it is the minimum improvement in the model's goodness of fit that a split must provide, and pruning away splits below this threshold saves unnecessary computation.
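A common idiom (a sketch, not the only possibility, assuming a fitted rpart object named fit as in the case later in this article) is to read the cross-validation results stored in the model's cptable and prune back to the cp value with the smallest cross-validated error:

# cptable has the columns CP, nsplit, rel error, xerror, xstd
best_cp    <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
fit_pruned <- prune(fit, cp = best_cp)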

Third, feature selection

The feature selection of the CART algorithm is based on the Gini coefficient; the criterion is that each child node achieves the highest possible purity, that is, all observations falling into a child node ideally belong to the same class. The following is a brief introduction to the calculation of the Gini coefficient:

Assume that the dependent variable in data set D has m levels, that is, the data set can be divided into m classes. The Gini coefficient of data set D can then be expressed as

Gini(D) = 1 − (p_1^2 + p_2^2 + ... + p_m^2),

where p_i is the proportion of observations in D belonging to class i.


Since the CART algorithm builds binary trees, a discrete independent variable with m levels can split the data set D in 2^m − 2 meaningful ways. For example, if the age variable has the levels {youth, middle age, old age}, then its subsets are {youth, middle age, old age}, {youth, middle age}, {youth, old age}, {middle age, old age}, {youth}, {middle age}, {old age}, and {}. The full set {youth, middle age, old age} and the empty set {} are meaningless splits, so there are 6 = 2^3 − 2 possibilities.
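A quick way to verify this count in R (an illustrative snippet only):

levels_age <- c("youth", "middle age", "old age")
# All non-empty proper subsets of the three levels
subsets <- unlist(lapply(1:(length(levels_age) - 1),
                         function(k) combn(levels_age, k, simplify = FALSE)),
                  recursive = FALSE)
length(subsets)   # 6, i.e. 2^3 - 2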

For a discrete variable, the weighted sum of the impurities of the two resulting partitions must be computed; that is, if variable A splits D into D1 and D2, the Gini coefficient of D given A is

Gini_A(D) = (|D1|/|D|) × Gini(D1) + (|D2|/|D|) × Gini(D2).

For a continuous variable, the midpoints of adjacent sorted values are used as candidate thresholds (split points), and the same formula is used to compute the weighted sum of the impurities of each candidate partition.

According to the feature selection criterion, the variable and the threshold whose partition yields the smallest Gini coefficient are chosen as the splitting variable and splitting point. If this part is difficult to follow, you can refer to the book "Data Mining: Concepts and Techniques", which works through a calculation example.
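To make the computation concrete, here is a small R sketch (illustrative only, with made-up vectors x and y) that evaluates Gini(D) and the weighted Gini of every candidate split point of a continuous variable, exactly as described above:

# Gini coefficient of a vector of class labels
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Weighted Gini of splitting on x <= threshold
split_gini <- function(x, y, threshold) {
  left  <- y[x <= threshold]
  right <- y[x >  threshold]
  (length(left)  / length(y)) * gini(left) +
  (length(right) / length(y)) * gini(right)
}

# Made-up example data
x <- c(0.1, 0.3, 0.35, 0.6, 0.8, 0.9)
y <- c("low", "low", "low", "high", "high", "high")

# Candidate thresholds: midpoints of adjacent sorted values of x
xs   <- sort(unique(x))
mids <- (head(xs, -1) + tail(xs, -1)) / 2

# Pick the threshold with the smallest weighted Gini coefficient
scores <- sapply(mids, function(t) split_gini(x, y, t))
best   <- mids[which.min(scores)]
best   # here 0.475, which separates the two classes perfectly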

Fourth, pruning

Pruning is meant to prevent the model from overfitting so that it is better suited to out-of-sample prediction. In general decision trees there are two ways of pruning, pre-pruning and post-pruning, and post-pruning is the most frequently used. Post-pruning in turn divides into the loss matrix method and the complexity method. The loss matrix method assigns a penalty coefficient to prediction errors, which can reduce the prediction error to some extent; the complexity method treats the complexity of the tree as a function of the number of leaf nodes and the error rate of the tree (the proportion of misclassified observations). This is a little abstract, so a simple example below illustrates the role of post-pruning.

Fifth, sharing of cases

Take the "knowledge mastery" data as an example to show how a decision tree classifies data (data source: http://archive.ics.uci.edu/ml/datasets/User+Knowledge+Modeling).

The data set measures the mastery of knowledge by 5 dimensions, namely:

STG: The length of study time for the target subject;

SCG: The degree of repetition of the target subject;

STR: Length of study in other related subjects;

LPR: test scores for other relevant subjects;

PEG: The test score of the target subject.

The mastery of knowledge is indicated by UNS, which has 4 levels: very_low, low, middle, and high.

# Read in the external files
train <- read.csv(file = file.choose())
test  <- read.csv(file = file.choose())

# Load the package needed for the CART algorithm and build the model
library(rpart)
fit <- rpart(UNS ~ ., data = train)

# View the rules output by the model
fit



The rules printed above look a bit confusing, so we try to describe the generated rules with a decision tree plot. The plot function in the rpart package can draw the decision tree, but the result is rather ugly, so below we use the rpart.plot package to draw a better-looking decision tree:

# Load rpart.plot and draw the decision tree
library(rpart.plot)
rpart.plot(fit, branch = 1, branch.type = 1, type = 2, extra = 102,
           shadow.col = "gray", box.col = "green", border.col = "blue",
           split.col = "red", main = "CART decision tree")


Now the specific rules can be seen at a glance: for example, the root node contains 258 observations, of which 88 are middle; when PEG >= 0.68 the node contains 143 observations, of which 78 are middle; when PEG >= 0.12 and PEG < 0.34 the node contains 115 observations, of which 81 are low; and so on for the other rules.
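If you prefer to read the rules as plain text, a possible alternative (assuming a reasonably recent version of the rpart.plot package, which provides rpart.rules()) is:

rpart.rules(fit)   # prints one row per leaf with the conditions that lead to it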

# Use the model for prediction
pred <- predict(object = fit, newdata = test[,-6], type = "class")

# Build the confusion matrix
cm <- table(test[,6], pred)
cm


# Compute the model's prediction accuracy
accuracy <- sum(diag(cm)) / sum(cm)
accuracy


The results show that the model reaches over 91% prediction accuracy on the test set. But is there room to improve the predictive accuracy? Below we prune the model, first with the loss matrix method and then with the complexity method:

According to the confusion matrix, the prediction accuracy for high reaches 100% (39/39), for low 91.3% (42/46), for middle 88.2% (30/34), and for very_low 80.8% (21/26). To improve the prediction accuracy for very_low, its penalty value needs to be increased. After some trial adjustment, the following loss matrix is constructed:

vec  <- c(0, 1,   1, 1,
          1, 0,   1, 1,
          1, 2,   0, 1,
          1, 3.3, 1, 0)
cost <- matrix(vec, nrow = 4, byrow = TRUE)
cost


fit2  <- rpart(UNS ~ ., data = train, parms = list(loss = cost))
pred2 <- predict(fit2, test[,-6], type = "class")
cm2   <- table(test[,6], pred2)
cm2


accuracy2 <- sum(diag(cm2)) / sum(cm2)
accuracy2


Accuracy increases by 1.4 percentage points, and while the accuracy for high, low, and middle stays unchanged, the accuracy for very_low rises to 88.5% from the original 80.8%.

Next we prune with the complexity method. First look at the CP values of the original model:

printcp(fit)



The complexity pruning method chooses the smallest possible CP value subject to the condition that the cross-validated prediction error (xerror) is as small as possible (not necessarily the minimum; any value within one standard deviation (xstd) of the minimum error is allowed). Here cp = 0.01 is chosen.
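The rule described above (often called the one-standard-error rule) can also be read off the cptable programmatically; the following is a possible sketch rather than the exact procedure used here:

cp_tab    <- fit$cptable
min_row   <- which.min(cp_tab[, "xerror"])
threshold <- cp_tab[min_row, "xerror"] + cp_tab[min_row, "xstd"]
# The first (simplest) tree whose xerror is within one xstd of the minimum
best_cp   <- cp_tab[which(cp_tab[, "xerror"] <= threshold)[1], "CP"]
best_cp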

fit3  <- prune(fit, cp = 0.01)
pred3 <- predict(fit3, test[,-6], type = "class")
cm3   <- table(test[,6], pred3)
cm3


accuracy3 <- sum(diag(cm3)) / sum(cm3)
accuracy3


Obviously, the accuracy of the model does not improve. The CP value that satisfies the condition is 0.01, which is also the default cp value of rpart(), so model fit3 is identical to fit.

Reproduced from: http://chuansong.me/n/540927751643
