Tree models: regression trees, model trees, and tree pruning

In the previous introduction to decision trees, we used the ID3 algorithm to construct the tree; here we use the CART algorithm to build a regression tree and a model tree. The ID3 algorithm splits the data on the best feature at each step and creates one branch for every possible value of that feature. For example, if a feature has 4 values, the data is cut into 4 parts. Clearly, this approach is not suited to data whose target value is continuous.

The CART algorithm handles continuous variables with binary splitting, that is, each split cuts the dataset into exactly two parts.
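The code later in this post splits datasets with a helper called data_spilt() that is never shown. A minimal sketch, assuming each sample is a list whose last element is the label and that the split is made on a single feature index against a threshold value, might look like this:

# binary split helper assumed by the rest of this post (name and behaviour are my guess)
def data_spilt(dataset, index, value):
    left, right = [], []
    for row in dataset:
        if row[index] < value:   # samples below the threshold go to the left branch
            left.append(row)
        else:                    # the rest go to the right branch
            right.append(row)
    return left, right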

Regression tree

The regression tree is built with the CART algorithm and uses binary splitting to divide the data; each of its leaf nodes holds a single value. The pseudocode for the function createtree() that creates the regression tree is roughly as follows:

Find the best feature to split on:

If the node cannot be split further, save it as a leaf node

Perform the binary split

Call the createtree() method on the left subtree

Call the createtree() method on the right subtree
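As a concrete illustration of the pseudocode, here is a minimal sketch of createtree(). It relies on the data_spilt(), spilt_loss() and decide_label() functions shown in this post; the max_depth and min_size parameters are my own assumptions and double as the early-termination (pre-pruning) conditions discussed later:

def createtree(dataset, max_depth=10, min_size=4, depth=1):
    # early termination: the node is small enough or the tree deep enough -> leaf node
    if len(dataset) <= min_size or depth >= max_depth:
        return decide_label(dataset)
    # find the best feature and cut value, i.e. the split with the smallest total variance
    best_index, best_value, best_loss, best_split = None, None, float('inf'), None
    n_features = len(dataset[0]) - 1          # the last column is the label
    for index in range(n_features):
        for row in dataset:
            left, right = data_spilt(dataset, index, row[index])
            if not left or not right:
                continue
            loss = spilt_loss(left, right)
            if loss < best_loss:
                best_index, best_value, best_loss, best_split = index, row[index], loss, (left, right)
    # if the node cannot be split again, save it as a leaf node
    if best_split is None:
        return decide_label(dataset)
    left, right = best_split
    return {'index': best_index, 'value': best_value,
            'left': createtree(left, max_depth, min_size, depth + 1),
            'right': createtree(right, max_depth, min_size, depth + 1)}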

The process of creating a regression tree is similar to that of a decision tree; only the splitting method differs. To measure how disordered the data is, a decision tree uses Shannon entropy, but our target values are continuous, so that method does not apply. How, then, do we measure the disorder of continuous values? First compute the mean of all the values, then the difference between each value and that mean, and then sum the squares of those differences; in other words, we use the total variance to measure the disorder of continuous values. Since a leaf node of a regression tree holds a single value, the total variance can be obtained by multiplying the variance by the number of sample points in the dataset.

The code that computes the disorder (total variance) of continuous values is as follows:

from numpy import *   # var(), mean(), mat(), power() etc. used throughout this post

# compute the cost of a split: the smaller the total variance, the less disordered the data
def spilt_loss(left, right):
    loss = 0.0
    left_size = len(left)
    left_label = [row[-1] for row in left]      # the last column is the label
    right_size = len(right)
    right_label = [row[-1] for row in right]
    loss += var(left_label) * left_size + var(right_label) * right_size
    return loss

The code that returns a leaf node's prediction value:

# decide the output label: take the label values of the leaf node's data and return their mean
def decide_label(data):
    output = [row[-1] for row in data]
    return mean(output)

Model Tree

The difference between a model tree and a regression tree is that a leaf node of a regression tree stores the mean of the node's label values, while a leaf node of a model tree stores a linear model (the simplest ordinary least squares can be used to build it). The leaf returns the coefficients w of the linear model, so the prediction y is obtained by multiplying the test data X by w, i.e. y = X * w. The model is therefore made up of multiple piecewise linear segments.
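The code below relies on a helper Linearmodel() that the post never shows. A minimal ordinary least squares sketch (the name, signature and return values are assumptions inferred from how it is called) could be:

from numpy import mat, ones, shape, linalg

# assumed helper: fit a linear model Y = X * ws to the node's data by least squares
def Linearmodel(DataSet):
    data = mat(DataSet)
    m, n = shape(data)
    X = mat(ones((m, n)))
    X[:, 1:n] = data[:, 0:n-1]     # the first column stays 1 for the intercept term
    Y = data[:, -1]                # the last column is the label
    xTx = X.T * X
    if linalg.det(xTx) == 0.0:
        raise ValueError('singular matrix, cannot invert')
    ws = xTx.I * (X.T * Y)         # ordinary least squares: w = (X^T X)^-1 X^T y
    return ws, X, Y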

In the same way, here are the code for the leaf node's prediction value and the code that computes the disorder of the dataset to be split:

# generate a leaf node: for the model tree, the leaf stores the linear-model coefficients
def decide_label(DataSet):
    ws, X, Y = Linearmodel(DataSet)
    return ws

# compute the model error on a dataset
def spilt_loss(DataSet):
    ws, X, Y = Linearmodel(DataSet)
    yHat = X * ws
    return sum(power(yHat - Y, 2))
    
# use the model tree to predict a single data row
def Modeltreeforecast(ws, dataRow):
    data = mat(dataRow)
    n = shape(data)[1]
    X = mat(ones((1, n)))
    X[:, 1:n] = data[:, 0:n-1]     # prepend a 1 for the intercept term
    return X * ws

So how do we decide whether the regression tree or the model tree is the better model? A more objective method is to compute the correlation coefficient between the predicted values and the actual values. The correlation coefficient can be obtained with the NumPy call corrcoef(yHat, y, rowvar=0), where yHat is the prediction and y is the actual value of the target variable.
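For illustration only, a hypothetical comparison with made-up column vectors (in practice these would be the stacked predictions of each tree on the test set) might look like this:

from numpy import corrcoef, mat, random

m = 100
y_test = mat(random.rand(m, 1))                       # stand-in actual values
yHat_reg = y_test + 0.10 * mat(random.rand(m, 1))     # stand-in regression tree predictions
yHat_model = y_test + 0.05 * mat(random.rand(m, 1))   # stand-in model tree predictions

# the closer the correlation coefficient is to 1, the better the model fits
print('regression tree R:', corrcoef(yHat_reg, y_test, rowvar=0)[0, 1])
print('model tree R:', corrcoef(yHat_model, y_test, rowvar=0)[0, 1])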


Pruning

The process of avoiding overfitting by reducing the complexity of the tree is called pruning. Tree pruning is divided into pre-pruning and post-pruning; in general, both techniques can be used together in order to find the best model.

Pre-pruning: while building the tree, we limit the depth of the tree (the number of recursive splits) and require that a node keep at least a minimum number of samples; setting such early-termination conditions is what is called pre-pruning. Zhou Zhihua's "watermelon book" describes pre-pruning methods in detail, and interested readers can look there. Since I implement pre-pruning only through these early-termination conditions, which is relatively simple, I will not describe it further.
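Isolated as a helper for clarity, the early-termination check is nothing more than the first test in the createtree() sketch above (max_depth and min_size are the same assumed parameters):

# pre-pruning by early termination: stop splitting when the tree is deep enough
# or the current node holds too few samples, and turn the node into a leaf instead
def should_stop(dataset, depth, max_depth=10, min_size=4):
    return depth >= max_depth or len(dataset) <= min_size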

Post-pruning: post-pruning requires splitting the data into a training set and a test set. The test set is used to decide whether merging leaf nodes would reduce the test error; if so, they are merged.

The code follows:

# post-pruning process

# check whether a node is a dict (i.e. an internal node rather than a leaf)
def istree(obj):
    return type(obj).__name__ == 'dict'

# collapse a subtree: use the mean of its leaf values as the node's prediction
def getmean(tree):
    if istree(tree['right']): tree['right'] = getmean(tree['right'])
    if istree(tree['left']): tree['left'] = getmean(tree['left'])
    return (tree['left'] + tree['right']) / 2.0

# perform post-pruning: the test set is sent down the tree that was built earlier,
# the total variance between its label values and the leaf predictions is computed,
# and if the variance after merging is smaller, the pruning (merge) is carried out
def prune(testData, tree):
    if len(testData) == 0:
        return getmean(tree)
    if istree(tree['left']) or istree(tree['right']):
        # if either child is a dict, split the test data on this node's feature and value
        lset, rset = data_spilt(testData, tree['index'], tree['value'])
    if istree(tree['left']):
        # recurse until tree['left'] is a leaf, then continue with the function body
        tree['left'] = prune(lset, tree['left'])
    if istree(tree['right']):
        # recurse on the right side in the same way, so all leaf values are reachable
        tree['right'] = prune(rset, tree['right'])
    if not istree(tree['left']) and not istree(tree['right']):
        # both children are leaves: decide whether to merge them
        lset, rset = data_spilt(testData, tree['index'], tree['value'])
        left_value = [row[-1] for row in lset]     # label values of the left test subset
        right_value = [row[-1] for row in rset]    # label values of the right test subset
        if tree['left'] is None or tree['right'] is None:
            return tree                            # nothing to compare, do not prune
        # total variance between the test labels and the leaf predictions without merging
        errorNoMerge = sum(power(array(left_value) - tree['left'], 2)) + \
                       sum(power(array(right_value) - tree['right'], 2))
        treeMean = (tree['left'] + tree['right']) / 2.0
        testset_value = [row[-1] for row in testData]
        # total variance between the test labels and the merged leaf prediction
        errorMerge = sum(power(array(testset_value) - treeMean, 2))
        if errorMerge < errorNoMerge:   # if the variance after pruning is smaller, prune
            print('merging')
            return treeMean
        else:
            return tree
    else:
        return tree
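Putting it together, a hypothetical end-to-end run on a tiny synthetic dataset (one feature plus a label per row; createtree() is the sketch given earlier) might look like this:

from numpy import random

data = [[x, 2.0 * x + random.rand()] for x in range(200)]   # made-up data: y is roughly 2x plus noise
random.shuffle(data)
train_data, test_data = data[:150], data[150:]

tree = createtree(train_data)          # build the regression tree on the training set
pruned_tree = prune(test_data, tree)   # post-prune it against the held-out test set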

The above is a summary of what I have learned while studying regression trees, model trees, and tree pruning.
