The regression in the previous section is a global regression model: we choose a model, whether linear or non-linear, and then fit the data to obtain its parameters. In reality, some data is so complex that no single model is apparent, so building one global model is not appropriate. This section describes tree regression to handle such problems: it builds decision nodes that split the data into regions and then fits a regression locally in each region. Let's take a look at the classification and regression tree (CART: Classification And Regression Trees). The advantage of this model is that it can model complex, non-linear data; the disadvantage is that the result is not easy to interpret. As the name suggests, it can be used for either classification or regression. Classification was already covered in the decision tree section, so I skip it here. You can understand the approach directly by analyzing the regression tree code:
from numpy import *

def loadDataSet(fileName):      # general function to parse tab-delimited floats
    dataMat = []                # assume last column is target value
    fr = open(fileName)
    for line in fr.readlines():
        curLine = line.strip().split('\t')
        fltLine = map(float, curLine)  # map all elements to float()
        dataMat.append(fltLine)
    return dataMat

def binSplitDataSet(dataSet, feature, value):
    mat0 = dataSet[nonzero(dataSet[:, feature] > value)[0], :]   # rows where feature > value
    mat1 = dataSet[nonzero(dataSet[:, feature] <= value)[0], :]  # rows where feature <= value
    return mat0, mat1
The first function loads the sample data. The second function splits the data set on a given feature and value, as shown in (Figure 1):
(Figure 1)
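To make the splitting behavior concrete, here is a tiny sketch (not from the original article) that assumes the two functions above are already defined in the current session; the 4x4 identity matrix is only toy data:

from numpy import *

testMat = mat(eye(4))  # simple 4x4 identity matrix as toy data
mat0, mat1 = binSplitDataSet(testMat, 1, 0.5)
# mat0: rows where column 1 > 0.5 (here only the second row)
# mat1: rows where column 1 <= 0.5 (the remaining three rows)
print mat0
print mat1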
Note that CART builds its tree by binary segmentation: every split cuts the data into exactly two parts. The decision tree in the previous section chose splits by minimizing Shannon entropy, and its nodes held discrete values; Shannon entropy cannot be used here because we need to perform regression. Instead, the total squared error (variance times the number of samples) of the split data is used as the metric, and each tree node holds a continuous value (in fact, a feature value) that minimizes this error. The smaller the error, the better that node expresses the data. Let's take a look at the construction code of the tree:
def createTree(dataSet, leafType=regLeaf, errType=regErr, ops=(1, 4)):
    # assume dataSet is a NumPy matrix so we can use array filtering
    feat, val = chooseBestSplit(dataSet, leafType, errType, ops)  # choose the best split
    if feat == None: return val  # if the split hit a stop condition, return val (the leaf node value)
    retTree = {}
    retTree['spInd'] = feat
    retTree['spVal'] = val
    lSet, rSet = binSplitDataSet(dataSet, feat, val)
    retTree['left'] = createTree(lSet, leafType, errType, ops)
    retTree['right'] = createTree(rSet, leafType, errType, ops)
    return retTree
The main task of this code is to select the best splitting feature and then split on it. If the split produces a leaf node, the leaf value is returned; otherwise the function recurses and builds a tree structure. The best-split function is chooseBestSplit. In the earlier decision tree construction, entropy was used to evaluate a split; here the squared error (variance) is used instead. Let's look at the code first:
def chooseBestSplit(dataSet, leafType=regLeaf, errType=regErr, ops=(1, 4)):
    tolS = ops[0]; tolN = ops[1]
    # if all the target variables are the same value: quit and return value
    if len(set(dataSet[:, -1].T.tolist()[0])) == 1:  # exit cond 1
        return None, leafType(dataSet)
    m, n = shape(dataSet)
    # the choice of the best feature is driven by reduction in RSS error from mean
    S = errType(dataSet)
    bestS = inf; bestIndex = 0; bestValue = 0
    for featIndex in range(n - 1):
        for splitVal in set(dataSet[:, featIndex].T.tolist()[0]):
            mat0, mat1 = binSplitDataSet(dataSet, featIndex, splitVal)
            if (shape(mat0)[0] < tolN) or (shape(mat1)[0] < tolN): continue
            newS = errType(mat0) + errType(mat1)
            if newS < bestS:
                bestIndex = featIndex
                bestValue = splitVal
                bestS = newS
    # if the decrease (S - bestS) is less than a threshold don't do the split
    if (S - bestS) < tolS:
        return None, leafType(dataSet)  # exit cond 2
    mat0, mat1 = binSplitDataSet(dataSet, bestIndex, bestValue)
    if (shape(mat0)[0] < tolN) or (shape(mat1)[0] < tolN):  # exit cond 3
        return None, leafType(dataSet)
    return bestIndex, bestValue  # returns the best feature to split on
                                 # and the value used for that split
The backbone of this code is:

For each feature:
    For each value of that feature:
        Split the data set into two parts.
        Calculate the splitting error.
        If the split error is less than the current minimum error, update the minimum error and record the current split as the best split.
Return the feature index and threshold value of the best split.
Pay special attention to the final return value, because it is what builds every node of the tree. In addition, errType=regErr in the code calls the regErr function to compute the total squared error (variance times sample count), which is shown below:
def regErr(dataSet): return var(dataSet[:,-1]) * shape(dataSet)[0]
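As a quick sanity check (not from the original article), the sketch below verifies that multiplying the variance of the target column by the number of samples gives the total squared error around the mean, which is exactly the quantity chooseBestSplit compares; the tiny matrix here is made up purely for illustration:

from numpy import *

toy = mat([[0.0, 1.0], [0.0, 2.0], [0.0, 4.0]])          # hypothetical toy data, last column is the target
totalErr = var(toy[:, -1]) * shape(toy)[0]                # what regErr returns
sumSqDev = sum(power(toy[:, -1] - mean(toy[:, -1]), 2))   # explicit sum of squared deviations
print totalErr, sumSqDev                                  # both print the same value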
If the error does not decrease much (the (S - bestS) check in the code), a leaf node is generated. The leaf node function is:
def regLeaf(dataSet):#returns the value used for each leaf return mean(dataSet[:,-1])
With this, the code for building the regression tree has been analyzed; the running result is shown in (Figure 2):
(Figure 2)
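For reference, a run like the one in (Figure 2) can be produced with a small driver such as the sketch below; it assumes the functions above have been saved in a module, here hypothetically named regTrees.py, and that ex00.txt contains the data listed at the end of this article:

from numpy import *
import regTrees  # hypothetical module containing the functions above

myDat = regTrees.loadDataSet('ex00.txt')  # the data given at the end of the article
myMat = mat(myDat)
print regTrees.createTree(myMat)
# the output is a nested dict of the form
# {'spInd': 0, 'spVal': ..., 'left': ..., 'right': ...}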
Data ex00.txt is given at the end of the article. Its distribution is shown in Figure 3:
(Figure 3)
Comparing with (Figure 3), we can see that the running result of the code in (Figure 2) is reasonable: X (feature index 0) is used as the splitting feature, and then a central value is chosen for each of the left and right leaf nodes to describe the data. The tree has only a few nodes, but it illustrates the idea. The following shows the result of running on a more complex data set, as shown in (Figure 4):
(Figure 4)
Figure 5 shows the corresponding data:
(Figure 5)
The reasonableness of the tree's leaf nodes and split values can be checked by comparing them one by one against (Figure 5). Below is a brief description of tree pruning. If the feature dimension is relatively high, it is easy to end up with too many nodes, which leads to overfitting. Overfitting produces high variance, while underfitting produces high bias; this is a larger topic that machine learning theory generally covers. When overfitting occurs, regularization is normally used, but since the regression tree has no explicit objective function, the remedy for overfitting here is to prune the tree. Simply put, only a small number of key splits are kept.

Let's look at how to prune the tree: recursively traversing a subtree is easy. Starting from the leaf nodes, compute the error after merging the two child nodes of the same parent and the error without merging; if merging reduces the error on test data, merge the leaf nodes. Speaking of errors, the earlier chooseBestSplit function contains the following code:
# if the decrease (S - bestS) is less than a threshold don't do the split
if (S - bestS) < tolS:
tolS is a threshold: when the error does not decrease by more than tolS, the split is not performed. This is in fact also a form of pruning, but it is pre-pruning (done while the tree is built), whereas computing the merging error is post-pruning. The following is the code:
def isTree(obj):  # helper: a subtree is stored as a dict, a leaf as a value
    return (type(obj).__name__ == 'dict')

def getMean(tree):
    if isTree(tree['right']): tree['right'] = getMean(tree['right'])
    if isTree(tree['left']): tree['left'] = getMean(tree['left'])
    return (tree['left'] + tree['right']) / 2.0

def prune(tree, testData):
    if shape(testData)[0] == 0: return getMean(tree)  # if we have no test data collapse the tree
    if (isTree(tree['right']) or isTree(tree['left'])):  # if the branches are trees, try to prune them
        lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])
    if isTree(tree['left']): tree['left'] = prune(tree['left'], lSet)
    if isTree(tree['right']): tree['right'] = prune(tree['right'], rSet)
    # if they are now both leafs, see if we can merge them
    if not isTree(tree['left']) and not isTree(tree['right']):
        lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])
        errorNoMerge = sum(power(lSet[:, -1] - tree['left'], 2)) + \
                       sum(power(rSet[:, -1] - tree['right'], 2))
        treeMean = (tree['left'] + tree['right']) / 2.0
        errorMerge = sum(power(testData[:, -1] - treeMean, 2))
        if errorMerge < errorNoMerge:
            print "merging"
            return treeMean
        else:
            return tree
    else:
        return tree
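As a usage sketch (not from the original article): post-pruning needs a separate test set. The fragment below builds an intentionally over-grown tree by loosening the pre-pruning thresholds and then prunes it with test data; regTrees, ex2.txt, and ex2test.txt are only placeholders for your own module and training/test files.

from numpy import *
import regTrees  # hypothetical module containing the functions above

trainMat = mat(regTrees.loadDataSet('ex2.txt'))      # placeholder training file
testMat = mat(regTrees.loadDataSet('ex2test.txt'))   # placeholder test file

# ops=(0, 1): almost no pre-pruning, so the tree grows very large and overfits
bigTree = regTrees.createTree(trainMat, ops=(0, 1))
prunedTree = regTrees.prune(bigTree, testMat)        # merge leaves whenever that lowers test error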
Having covered tree regression, we can briefly mention the model tree. In the regression tree, every node stores a feature and a split value chosen to minimize the total squared error, and every leaf stores a constant. If the leaf node is replaced with a piecewise linear function, the tree becomes a model tree, as shown in (Figure 6):
(Figure 6)
The data in (Figure 6) consists of two straight-line segments, split on the X coordinate into (0.0-0.3) and (0.3-1.0). If we use two leaf nodes to store two linear regression models, we can fit the data. The implementation is also relatively simple. The code is as follows:
def linearSolve(dataSet):  # helper function used in two places
    m, n = shape(dataSet)
    X = mat(ones((m, n))); Y = mat(ones((m, 1)))  # create a copy of data with 1 in 0th position
    X[:, 1:n] = dataSet[:, 0:n-1]; Y = dataSet[:, -1]  # and strip out Y
    xTx = X.T * X
    if linalg.det(xTx) == 0.0:
        raise NameError('This matrix is singular, cannot do inverse,\n\
        try increasing the second value of ops')
    ws = xTx.I * (X.T * Y)
    return ws, X, Y

def modelLeaf(dataSet):  # create linear model and return coefficients
    ws, X, Y = linearSolve(dataSet)
    return ws

def modelErr(dataSet):
    ws, X, Y = linearSolve(dataSet)
    yHat = X * ws
    return sum(power(Y - yHat, 2))
The code is similar to the regression tree, except that modelLeaf returns a leaf node containing the coefficients of a linear regression computed by linearSolve. The last function, modelErr, plays the same role for the model tree as regErr does for the regression tree.
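Because createTree takes the leaf and error functions as parameters, switching to a model tree is just a matter of passing modelLeaf and modelErr instead of the defaults. A minimal sketch (not from the original article), assuming the data from (Figure 6) is stored in a hypothetical file exp2.txt and that regTrees is the module holding the functions above:

from numpy import *
import regTrees  # hypothetical module containing the functions above

myMat2 = mat(regTrees.loadDataSet('exp2.txt'))  # placeholder file with piecewise-linear data
# ops=(1, 10): require at least 10 samples per leaf so each leaf has enough points for a fit
modelTree = regTrees.createTree(myMat2, regTrees.modelLeaf, regTrees.modelErr, (1, 10))
print modelTree  # each leaf now holds a column matrix of regression coefficients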
Thankfully, this article has not shown a single formula, yet I hope the explanations are still clear without mathematical language.
Data ex00.txt:
0.036098 0.155096
0.993349 1.077553
0.530897 0.893462
0.712386 0.564858
0.343554 -0.371700
0.098016 -0.332760
0.691115 0.834391
0.091358 0.099935
0.727098 1.000567
0.951949 0.945255
0.768596 0.760219
0.541314 0.893748
0.146366 0.034283
0.673195 0.915077
0.183510 0.184843
0.339563 0.206783
0.517921 1.493586
0.703755 1.101678
0.008307 0.069976
0.243909 -0.029467
0.306964 -0.177321
0.036492 0.408155
0.295511 0.002882
0.837522 1.229373
0.202054 -0.087744
0.919384 1.029889
0.377201 -0.243550
0.814825 1.095206
0.611270 0.982036
0.072243 -0.420983
0.410230 0.331722
0.869077 1.114825
0.620599 1.334421
0.101149 0.068834
0.820802 1.325907
0.520044 0.961983
0.488130 -0.097791
0.819823 0.835264
0.975022 0.673579
0.953112 1.064690
0.475976 -0.163707
0.273147 -0.455219
0.804586 0.924033
0.074795 -0.349692
0.625336 0.623696
0.656218 0.958506
0.834078 1.010580
0.781930 1.074488
0.009849 0.056594
0.302217 -0.148650
0.678287 0.907727
0.180506 0.103676
0.193641 -0.327589
0.343479 0.175264
0.145809 0.136979
0.996757 1.035533
0.590210 1.336661
0.238070 -0.358459
0.561362 1.070529
0.377597 0.088505
0.099142 0.025280
0.539558 1.053846
0.790240 0.533214
0.242204 0.209359
0.152324 0.132858
0.252649 -0.055613
0.895930 1.077275
0.133300 -0.223143
0.559763 1.253151
0.643665 1.024241
0.877241 0.797005
0.613765 1.621091
0.645762 1.026886
0.651376 1.315384
0.697718 1.212434
0.742527 1.087056
0.901056 1.055900
0.362314 -0.556464
0.948268 0.631862
0.000234 0.060903
0.750078 0.906291
0.325412 -0.219245
0.726828 1.017112
0.348013 0.048939
0.458121 -0.061456
0.280738 -0.228880
0.567704 0.969058
0.750918 0.748104
0.575805 0.899090
0.507940 1.107265
0.071769 -0.110946
0.553520 1.391273
0.401152 -0.121640
0.406649 -0.366317
0.652121 1.004346
0.347837 -0.153405
0.081931 -0.269756
0.821648 1.280895
0.048014 0.064496
0.130962 0.184241
0.773422 1.125943
0.789625 0.552614
0.096994 0.227167
0.625791 1.244731
0.589575 1.185812
0.323181 0.180811
0.822443 1.086648
0.360323 -0.204830
0.950153 1.022906
0.527505 0.879560
0.860049 0.717490
0.007044 0.094150
0.438367 0.034014
0.574573 1.066130
0.536689 0.867284
0.782167 0.886049
0.989888 0.744207
0.761474 1.058262
0.985425 1.227946
0.132543 -0.329372
0.346986 -0.150389
0.768784 0.899705
0.848921 1.170959
0.449280 0.069098
0.066172 0.052439
0.813719 0.706601
0.661923 0.767040
0.529491 1.022206
0.846455 0.720030
0.448656 0.026974
0.795072 0.965721
0.118156 -0.077409
0.084248 -0.019547
0.845815 0.952617
0.576946 1.234129
0.772083 1.299018
0.696648 0.845423
0.595012 1.213435
0.648675 1.287407
0.897094 1.240209
0.552990 1.036158
0.332982 0.210084
0.065615 -0.306970
0.278661 0.253628
0.773168 1.140917
0.203693 -0.064036
0.355688 -0.119399
0.988852 1.069062
0.518735 1.037179
0.514563 1.156648
0.976414 0.862911
0.919074 1.123413
0.697777 0.827805
0.928097 0.883225
0.900272 0.996871
0.344102 -0.061539
0.148049 0.204298
0.130052 -0.026167
0.302001 0.317135
0.337100 0.026332
0.314924 -0.001952
0.269681 -0.165971
0.196005 -0.048847
0.129061 0.305107
0.936783 1.026258
0.305540 -0.115991
0.683921 1.414382
0.622398 0.766330
0.902532 0.861601
0.712503 0.933490
0.590062 0.705531
0.723120 1.307248
0.188218 0.113685
0.643601 0.782552
0.520207 1.209557
0.233115 -0.348147
0.465625 -0.152940
0.884512 1.117833
0.663200 0.701634
0.268857 0.073447
0.729234 0.931956
0.429664 -0.188659
0.737189 1.200781
0.378595 -0.296094
0.930173 1.035645
0.774301 0.836763
0.273940 -0.085713
0.824442 1.082153
0.626011 0.840544
0.679390 1.307217
0.578252 0.921885
0.785541 1.165296
0.597409 0.974770
0.014083 -0.132525
0.663870 1.187129
0.552381 1.369630
0.683886 0.999985
0.210334 -0.006899
0.604529 1.212685
0.250744 0.046297
Please indicate the source when reprinting: http://blog.csdn.net/cuoqu/article/details/9502711
References:
[1] Peter Harrington. Machine Learning in Action.