The Python implementation of CART tree regression and its pruning


Reposted from Mu Chen

Contents

    • Preface
    • Regression tree
    • Optimization work of regression tree: pruning
    • Model tree
    • Use of regression tree/model tree
    • Summary
Preface

The regression algorithms discussed so far are all global methods for linear problems; even locally weighted linear regression has its drawbacks (see the preceding article for details).

Using a global model can make the model very bloated, because every sample point has to participate in the computation, and real-world samples often carry a large amount of feature information.

On the other hand, many real-world problems are nonlinear.

The tree regression family of algorithms addresses these problems.

Regression tree

In the earlier article on decision trees, trees were built with the ID3 algorithm. For regression, the problem with ID3 is that it branches on every possible value of a feature.

As a result, the ID3 algorithm cannot handle continuous data.

Instead, a binary splitting method can be used, taking a particular value as the boundary. Under this scheme, each node has at most two subtrees.
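
As a quick illustration of binary splitting (a toy sketch using numpy, not part of the original code), splitting on feature 0 with boundary value 0.5 looks like this:

from numpy import mat, nonzero

# Toy dataset: each row is (feature 0, target)
data = mat([[0.2, 1.0],
            [0.7, 2.0],
            [0.9, 3.0]])
# Rows whose feature 0 exceeds the boundary go to one side, the rest to the other
left  = data[nonzero(data[:, 0] >  0.5)[0], :]   # [[0.7, 2.0], [0.9, 3.0]]
right = data[nonzero(data[:, 0] <= 0.5)[0], :]   # [[0.2, 1.0]]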

In addition, the splitting criterion must be changed from Shannon entropy (since the data are now continuous) to a measure suitable for regression, so that the tree can be used for regression. Such a tree is called a regression tree.
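
A common regression criterion, and the one used by the regErr function in the code below, is the total squared error of the target values in a node, which equals the variance times the number of samples; a minimal sketch:

from numpy import var, mean

# Total squared error of a node: sum((y - mean(y))**2) == var(y) * len(y)
y = [1.0, 2.0, 4.0]
total_sq_err = var(y) * len(y)                 # 4.666...
check = sum((v - mean(y)) ** 2 for v in y)     # same value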

Pseudo-code for constructing a regression tree:

Find the best feature to split on:
    If the node cannot be split, save the node as a leaf node.
    Otherwise, perform the binary split.
    Recursively call this function on the left and right subtrees.

Pseudo-code for the binary split:

For each feature:
    For each value of that feature:
        Split the dataset into two parts.
        Calculate the error of the split.
        If the error is less than the current minimum error, record this split as the best one and update the minimum error.

In particular, there are three conditions under which splitting terminates (and a leaf node is created directly):
1. All remaining target values in the node are identical.
2. A split would leave a subset that is too small.
3. A split does not improve the error enough.
These checks are collectively called "pre-pruning".
The following is a small program that gives a complete regression tree:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

'''
Created on 20**-**-**

@author: Fangmeng
'''

from numpy import *

def loadDataSet(fileName):
    'Load test data'
    dataMat = []
    fr = open(fileName)
    for line in fr.readlines():
        curLine = line.strip().split('\t')
        # Convert all elements to floats (functional style)
        fltLine = list(map(float, curLine))
        dataMat.append(fltLine)
    return dataMat

#========================================
# Input:
#     dataSet: dataset to be split
#     feature: index of the splitting feature
#     value: splitting value
# Output:
#     mat0, mat1: the two halves of the split
#========================================
def binSplitDataSet(dataSet, feature, value):
    'Split the dataset in two'
    mat0 = dataSet[nonzero(dataSet[:, feature] > value)[0], :]
    mat1 = dataSet[nonzero(dataSet[:, feature] <= value)[0], :]
    return mat0, mat1

#========================================
# Input:
#     dataSet: dataset
# Output:
#     mean(dataSet[:, -1]): the mean of the target values (the leaf's content)
#========================================
def regLeaf(dataSet):
    'Generate a leaf node'
    return mean(dataSet[:, -1])

#========================================
# Input:
#     dataSet: dataset
# Output:
#     var(dataSet[:, -1]) * shape(dataSet)[0]: total squared error
#========================================
def regErr(dataSet):
    'Compute the total squared error'
    return var(dataSet[:, -1]) * shape(dataSet)[0]

#========================================
# Input:
#     dataSet: dataset
#     leafType: leaf node generator
#     errType: error function
#     ops: tuning parameters (tolS, tolN)
# Output:
#     bestIndex: best splitting feature
#     bestValue: best splitting value
#========================================
def chooseBestSplit(dataSet, leafType=regLeaf, errType=regErr, ops=(1, 4)):
    'Choose the best split'

    # tolS: minimum error reduction required; tolN: minimum samples per subset
    tolS = ops[0]; tolN = ops[1]

    # If all target values are identical, build a leaf node directly
    if len(set(dataSet[:, -1].T.tolist()[0])) == 1:
        return None, leafType(dataSet)

    m, n = shape(dataSet)
    # Error before splitting
    S = errType(dataSet)
    # Minimum error so far
    bestS = inf
    # Split corresponding to the minimum error
    bestIndex = 0; bestValue = 0

    # For every feature
    for featIndex in range(n - 1):
        # For every value of that feature
        for splitVal in set(dataSet[:, featIndex].T.tolist()[0]):
            # Perform the split
            mat0, mat1 = binSplitDataSet(dataSet, featIndex, splitVal)
            # Skip if either resulting subset is too small
            if (shape(mat0)[0] < tolN) or (shape(mat1)[0] < tolN):
                continue
            # Error of the current split
            newS = errType(mat0) + errType(mat1)
            # Keep this split if its error is the smallest so far
            if newS < bestS:
                bestIndex = featIndex
                bestValue = splitVal
                bestS = newS

    # If the best split does not reduce the error enough, do not split
    if (S - bestS) < tolS:
        return None, leafType(dataSet)

    # Perform the best split
    mat0, mat1 = binSplitDataSet(dataSet, bestIndex, bestValue)
    # If either resulting subset is too small, do not split
    if (shape(mat0)[0] < tolN) or (shape(mat1)[0] < tolN):
        return None, leafType(dataSet)

    return bestIndex, bestValue

#========================================
# Input:
#     dataSet: dataset
#     leafType: leaf node generator
#     errType: error function
#     ops: tuning parameters
# Output:
#     retTree: regression tree
#========================================
def createTree(dataSet, leafType=regLeaf, errType=regErr, ops=(1, 4)):
    'Build a regression tree'

    # Choose the best split
    feat, val = chooseBestSplit(dataSet, leafType, errType, ops)
    # feat is None when no split was made: return the leaf value
    if feat is None:
        return val

    # Recursively build the left and right subtrees
    retTree = {}
    retTree['spInd'] = feat
    retTree['spVal'] = val
    lSet, rSet = binSplitDataSet(dataSet, feat, val)
    retTree['left'] = createTree(lSet, leafType, errType, ops)
    retTree['right'] = createTree(rSet, leafType, errType, ops)

    return retTree

def test():
    'Show results'

    # Load data
    myDat = loadDataSet('/home/fangmeng/ex0.txt')
    # Build a regression tree
    myDat = mat(myDat)
    print(createTree(myDat))

if __name__ == '__main__':
    test()

Test results:
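
For reference, the printed tree is a nested dict of roughly the following shape (the numbers here are placeholders, not actual output):

{'spInd': 1,            # index of the splitting feature
 'spVal': 0.4,          # splitting value
 'left':  {'spInd': 1, 'spVal': 0.6, 'left': 1.98, 'right': 1.02},
 'right': 0.05}         # a leaf: the mean target value of that region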

Optimization work of regression tree: pruning

In the code above, the conditions for terminating recursion already do part of the "pruning" work.

Pruning performed while the tree is being built is called pre-pruning. It is very necessary: a pre-pruned tree can be 1% of the size of an unpruned tree, or even smaller, with similar performance.
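
How much pre-pruning happens is controlled by the ops parameter of createTree; a quick sketch (myMat stands for a data matrix loaded as in test() above):

# ops = (tolS, tolN): tolS is the minimum error reduction a split must achieve,
# tolN is the minimum number of samples allowed in each resulting subset.
# Larger values prune more aggressively and yield a smaller tree.
defaultTree = createTree(myMat, ops=(1, 4))       # default settings
coarseTree  = createTree(myMat, ops=(10000, 4))   # far fewer splits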

Pruning done after the tree is built, based on a training set and a separate test set, is more effective still; it is called "post-pruning".

As you can see, pruning involves a considerable amount of work and is a very important optimization step for the tree.

The pseudo-code for the post-pruning process is as follows:

Split the test data according to the existing tree:
    If either subset is itself a tree, recurse on that subset.
    Calculate the error after merging the current two leaf nodes.
    Calculate the error without merging.
    If merging reduces the error, merge the leaf nodes.

The specific implementation functions are as follows:

#===================================
# Input:
#     obj: object to examine
# Output:
#     (type(obj).__name__ == 'dict'): whether the object is a tree
#===================================
def isTree(obj):
    'Determine whether the object is a tree (a dict)'
    return (type(obj).__name__ == 'dict')

#===================================
# Input:
#     tree: tree to process
# Output:
#     (tree['left'] + tree['right']) / 2.0: replacement value after collapsing
#===================================
def getMean(tree):
    'Collapse a tree into a single value'
    if isTree(tree['right']):
        tree['right'] = getMean(tree['right'])
    if isTree(tree['left']):
        tree['left'] = getMean(tree['left'])
    return (tree['left'] + tree['right']) / 2.0

#===================================
# Input:
#     tree: tree to prune
#     testData: test dataset
# Output:
#     tree: the pruned tree
#===================================
def prune(tree, testData):
    'Post-pruning'

    # No test data: collapse this tree
    if shape(testData)[0] == 0:
        return getMean(tree)

    # If either subtree is itself a tree, split the test set and prune recursively
    if isTree(tree['right']) or isTree(tree['left']):
        # Split the test set
        lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])
        # Recursively prune each subtree with its part of the test set
        if isTree(tree['left']):
            tree['left'] = prune(tree['left'], lSet)
        if isTree(tree['right']):
            tree['right'] = prune(tree['right'], rSet)

    # If both children are leaves, evaluate the errors and decide whether to merge
    if not isTree(tree['left']) and not isTree(tree['right']):
        lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])
        errorNoMerge = sum(power(lSet[:, -1] - tree['left'], 2)) + \
                       sum(power(rSet[:, -1] - tree['right'], 2))
        treeMean = (tree['left'] + tree['right']) / 2.0
        errorMerge = sum(power(testData[:, -1] - treeMean, 2))
        if errorMerge < errorNoMerge:
            return treeMean
        else:
            return tree
    else:
        return tree
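
A minimal usage sketch of post-pruning (the file names are hypothetical; any training and test sets in the same tab-separated format will do):

# Build a deliberately over-grown tree, then post-prune it with a separate test set
trainMat = mat(loadDataSet('train.txt'))     # hypothetical training file
testMat  = mat(loadDataSet('test.txt'))      # hypothetical test file
bigTree  = createTree(trainMat, ops=(0, 1))  # almost no pre-pruning
prunedTree = prune(bigTree, testMat)
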
Model tree

The model tree is another powerful tree-regression algorithm.

Instead of storing a single value in each leaf node, this algorithm fits a linear model to the data belonging to each leaf, for example a basic least-squares linear regression model.

What is stored in a leaf node is therefore a set of linear regression coefficients. The internal nodes are constructed in the same way as in the regression tree.
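
For reference, the least-squares solution used in the leaves is the normal equation ws = (X^T X)^-1 X^T y; a tiny sketch with toy numbers, mirroring the linearSolve helper below:

from numpy import mat

X = mat([[1.0, 0.2],
         [1.0, 0.7],
         [1.0, 0.9]])           # first column of ones is the intercept term
y = mat([[1.1], [2.3], [2.9]])
ws = (X.T * X).I * (X.T * y)    # the coefficient vector stored in a leaf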

Here is the signature of the tree-building function shown above:

def createTree(dataSet, leafType=regLeaf, errType=regErr, ops=(1, 4)):

For a model tree, you only need to change the implementations of leafType (the leaf node constructor) and errType (the error analyzer), which correspond to the modelLeaf and modelErr functions below:

#=========================
# Input:
#     dataSet: dataset
# Output:
#     ws, X, Y: regression model
#=========================
def linearSolve(dataSet):
    'Helper function for building a linear regression model'
    m, n = shape(dataSet)
    X = mat(ones((m, n))); Y = mat(ones((m, 1)))
    # First column of X stays 1 (intercept term); the rest are the features
    X[:, 1:n] = dataSet[:, 0:n-1]; Y = dataSet[:, -1]
    xTx = X.T * X
    if linalg.det(xTx) == 0.0:
        raise NameError('Coefficient matrix is not invertible')
    ws = xTx.I * (X.T * Y)
    return ws, X, Y

#=======================
# Input:
#     dataSet: dataset
# Output:
#     ws: regression coefficients
#=======================
def modelLeaf(dataSet):
    'Leaf node builder'
    ws, X, Y = linearSolve(dataSet)
    return ws

#=======================================
# Input:
#     dataSet: dataset
# Output:
#     sum(power(Y - yHat, 2)): squared error
#=======================================
def modelErr(dataSet):
    'Error analyzer'
    ws, X, Y = linearSolve(dataSet)
    yHat = X * ws
    return sum(power(Y - yHat, 2))
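
A usage sketch for building a model tree (the dataset name is hypothetical; any tab-separated data with a roughly piecewise-linear shape works well):

# Pass modelLeaf/modelErr instead of the default regLeaf/regErr
myMat = mat(loadDataSet('data.txt'))               # hypothetical dataset
modelTree = createTree(myMat, modelLeaf, modelErr, (1, 10))
# Each leaf of modelTree now holds a column vector of regression coefficients
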
Use of regression tree/model tree

The previous sections introduced the construction of two kinds of trees: the regression tree and the model tree. Next we learn how to use these trees to make predictions.

Of course, the essence is recursive traversal of the tree.

The following is the traversal code; by adjusting one parameter, it can be used with either a regression tree or a model tree:

#==============================
# Input:
#     model: a leaf
#     inDat: test data
# Output:
#     float(model): leaf value
#==============================
def regTreeEval(model, inDat):
    'Regression tree prediction'
    return float(model)

#==============================
# Input:
#     model: a leaf
#     inDat: test data
# Output:
#     float(X * model): leaf value
#==============================
def modelTreeEval(model, inDat):
    'Model tree prediction'
    n = shape(inDat)[1]
    X = mat(ones((1, n + 1)))
    X[:, 1:n+1] = inDat
    return float(X * model)

#==============================
# Input:
#     tree: tree to traverse
#     inData: test data (one sample, as a 1 x n row)
#     modelEval: leaf value evaluator
# Output:
#     the predicted value
#==============================
def treeForeCast(tree, inData, modelEval=regTreeEval):
    'Predict with a regression/model tree (selected via the modelEval parameter)'

    # If this is not a tree, return the leaf value
    if not isTree(tree):
        return modelEval(tree, inData)

    # Traverse the left branch
    if inData[0, tree['spInd']] > tree['spVal']:
        if isTree(tree['left']):
            return treeForeCast(tree['left'], inData, modelEval)
        else:
            return modelEval(tree['left'], inData)
    # Traverse the right branch
    else:
        if isTree(tree['right']):
            return treeForeCast(tree['right'], inData, modelEval)
        else:
            return modelEval(tree['right'], inData)

Using it is very simple: pass in the tree and the sample to be predicted. For a model tree, change the third parameter of the prediction function treeForeCast to modelTreeEval.

The experimental process is not demonstrated here.
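
As an illustration only, here is a sketch of batch prediction and of comparing the two kinds of trees with corrcoef; the createForeCast helper and the file names are not part of the original article and assume a single-feature dataset whose last column is the target:

def createForeCast(tree, testData, modelEval=regTreeEval):
    'Run treeForeCast on every row of a test set'
    m = len(testData)
    yHat = mat(zeros((m, 1)))
    for i in range(m):
        yHat[i, 0] = treeForeCast(tree, mat(testData[i]), modelEval)
    return yHat

trainMat = mat(loadDataSet('train.txt'))   # hypothetical single-feature training set
testMat  = mat(loadDataSet('test.txt'))    # hypothetical test set
regTree   = createTree(trainMat, ops=(1, 20))
modelTree = createTree(trainMat, modelLeaf, modelErr, (1, 20))
yHatReg   = createForeCast(regTree,   testMat[:, 0], regTreeEval)
yHatModel = createForeCast(modelTree, testMat[:, 0], modelTreeEval)
# The higher correlation coefficient indicates the better model (see the summary)
print(corrcoef(yHatReg,   testMat[:, 1], rowvar=0)[0, 1])
print(corrcoef(yHatModel, testMat[:, 1], rowvar=0)[0, 1])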

Summary

1. Which regression method to choose depends on which one yields the higher correlation coefficient between predictions and true values (this can be computed with the corrcoef function).

2. Tree regression, like the classification tree algorithms, is essentially a greedy algorithm that repeatedly searches for locally optimal splits.

3. This concludes the discussion of regression; next we move on to the unsupervised learning section.
