Chapter 9: Tree Regression
Topics: the CART algorithm, regression trees and model trees, tree-pruning algorithms, and building a GUI in Python.
Linear regression needs to fit all the sample points (except in locally weighted linear regression). When the data has many features and the relationships between them are complex, it is impractical to fit a single global linear model.
One alternative is to cut the data set into many partitions that are easy to model, then apply linear regression techniques to each partition.
This chapter introduces the tree-building algorithm CART (Classification And Regression Trees), which can be used for both classification and regression.
9.1 Local modeling of complex data
The decision trees of Chapter 3 cut the data into smaller data sets until all the target variables are identical or the data can no longer be split. A decision tree is a greedy algorithm: it makes the best choice at each step without considering whether it reaches the global optimum. The ID3 algorithm picks the best feature to split on each time and splits on every possible value of that feature, after which the feature is never used again. Another method is binary splitting, which cuts the data set in two on each split: if a sample's value of the split feature is greater than the required value, the sample goes into the left subtree; otherwise it goes into the right subtree. Binary splitting handles continuous features and reduces tree-construction time.
CART uses binary splitting to handle continuous variables and is widely used.
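As a minimal sketch of the binary-split idea (toy data; the real binSplitDataSet function appears in the next section), NumPy boolean indexing partitions a matrix on one feature:

from numpy import mat, nonzero

data = mat([[0.2, 1.0],
            [0.7, 2.0],
            [0.9, 3.0]])
left = data[nonzero(data[:, 0] > 0.5)[0], :]    # rows with feature 0 > 0.5
right = data[nonzero(data[:, 0] <= 0.5)[0], :]  # all remaining rows
print left    # [[ 0.7  2. ]  [ 0.9  3. ]]
print right   # [[ 0.2  1. ]]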
Construction of 9.2-continuous and discrete-type tree
As in Chapter 3, the structure of a tree is stored in a dictionary. It contains four elements:
the feature to split on; the value to split on; the left subtree (a single value when no further split is needed); and the right subtree (similar to the left subtree).
Unlike the flexible trees of Chapter 3, the CART algorithm can fix the tree's data structure: each node holds left and right keys, and each key can store either another subtree or a single value.
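For example, a tree built this way might be stored as the following nested dictionary (the split points and leaf values here are made up for illustration):

retTree = {'spInd': 0,               # index of the feature to split on
           'spVal': 0.5,             # value to split at
           'left': 1.02,             # left subtree collapsed to a single leaf value
           'right': {'spInd': 0,     # right subtree: another split
                     'spVal': 0.2,
                     'left': 0.45,
                     'right': -0.04}}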
Pseudo code:
Find the best feature to split:
If the node cannot be divided again, save the node as a leaf node
Perform a binary split
Call the createTree() method on the right subtree
Call the createTree() method on the left subtree
Coding:
#!/usr/bin/env python
# coding=utf-8
from numpy import *

def loadDataSet(fileName):
    dataMat = []
    fr = open(fileName)
    for line in fr.readlines():
        curLine = line.strip().split('\t')
        fltLine = map(float, curLine)  # map every element of the line to a float
        dataMat.append(fltLine)
    return dataMat

def binSplitDataSet(dataSet, feature, value):
    # split the data set into two subsets on the given feature and value
    mat0 = dataSet[nonzero(dataSet[:, feature] > value)[0], :]
    mat1 = dataSet[nonzero(dataSet[:, feature] <= value)[0], :]
    return mat0, mat1

# regLeaf and regErr are defined in section 9.3 below
def createTree(dataSet, leafType=regLeaf, errType=regErr, ops=(1, 4)):
    feat, val = chooseBestSplit(dataSet, leafType, errType, ops)  # split the data set
    if feat == None:
        return val  # no further split: return the leaf value
    retTree = {}
    retTree['spInd'] = feat
    retTree['spVal'] = val
    lSet, rSet = binSplitDataSet(dataSet, feat, val)
    retTree['left'] = createTree(lSet, leafType, errType, ops)   # recursive split
    retTree['right'] = createTree(rSet, leafType, errType, ops)
    return retTree

testMat = mat(eye(4))
mat0, mat1 = binSplitDataSet(testMat, 1, 0.5)
print testMat
print mat0
print mat1
9.3 Using CART for regression
A regression tree assumes each leaf node is a constant value. This strategy holds that the complex relationships in the data can be summarized by a tree structure.
To build a tree with piecewise constants as leaf nodes, we need a way to measure the consistency of the data: first compute the mean of all the values, then compute each value's deviation from that mean. Absolute values or squared differences are generally used instead of raw differences, as in the variance calculation. But where variance is the mean of the squared error, here we need the total squared error, which is the mean squared error (NumPy's var function) multiplied by the number of sample points in the data set.
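A quick check of that identity on made-up values (assuming, as in the book's data sets, that the target variable sits in the last column):

from numpy import mat, mean, power, shape, sum, var

d = mat([[1.0, 2.0], [1.0, 3.0], [1.0, 7.0]])
total = var(d[:, -1]) * shape(d)[0]              # mean squared error times m
print total                                      # 14.0
print sum(power(d[:, -1] - mean(d[:, -1]), 2))   # total squared error: also 14.0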
Building the tree
The goal of the chooseBestSplit() function is to find the best place to split the data set: it iterates over every feature and every possible value of that feature to find the split threshold that minimizes the error.
Pseudo code:
For each feature:
For each value of that feature:
Cut the dataset into two pieces
Calculate the error of the segmentation
If the current error is less than the current minimum error, set the current split as the best split and update the minimum error
Return the feature and threshold of the best split
Data:
Figure 9-1: part of the experimental sample data
Coding:
#============== split functions for the regression tree =============
def regLeaf(dataSet):
    # generate a leaf node: the mean of the target variable
    return mean(dataSet[:, -1])

def regErr(dataSet):
    # error estimate: total squared error of the target variable
    return var(dataSet[:, -1]) * shape(dataSet)[0]

def chooseBestSplit(dataSet, leafType=regLeaf, errType=regErr, ops=(1, 4)):
    tolS = ops[0]  # minimum error reduction a split must achieve
    tolN = ops[1]  # minimum number of samples in a split
    if len(set(dataSet[:, -1].T.tolist()[0])) == 1:
        # all target values are equal: exit and return a leaf
        return None, leafType(dataSet)
    m, n = shape(dataSet)
    S = errType(dataSet)
    bestS = inf
    bestIndex = 0
    bestValue = 0
    for featIndex in range(n - 1):  # iterate over every feature
        for splitVal in set(dataSet[:, featIndex].T.tolist()[0]):  # and every value of it
            # split the data into two subsets on this feature and value
            mat0, mat1 = binSplitDataSet(dataSet, featIndex, splitVal)
            if (shape(mat0)[0] < tolN) or (shape(mat1)[0] < tolN):
                continue  # a subset has fewer than tolN rows: skip this split
            newS = errType(mat0) + errType(mat1)  # error of the two subsets combined
            if newS < bestS:  # does the new split reduce the error?
                bestIndex = featIndex
                bestValue = splitVal
                bestS = newS
    if (S - bestS) < tolS:
        return None, leafType(dataSet)  # error reduction too small: exit
    mat0, mat1 = binSplitDataSet(dataSet, bestIndex, bestValue)
    if (shape(mat0)[0] < tolN) or (shape(mat1)[0] < tolN):
        return None, leafType(dataSet)  # the best split yields a tiny subset: exit
    return bestIndex, bestValue
#=====================================================================

myDat = loadDataSet("ex00.txt")
myMat = mat(myDat)
print createTree(myMat)
myDat1 = loadDataSet("ex0.txt")
myMat1 = mat(myDat1)
print createTree(myMat1)
Splitting results:
Figure 9-2: splitting results

9.4 Tree Pruning
If a tree has too many nodes, the model may have overfit the data. Cross-validation on a test set can be used to detect overfitting.
Pruning reduces the complexity of a decision tree to avoid overfitting. It comes in two flavors, prepruning and postpruning; postpruning requires both a training set and a test set.
Pre-pruning
The tree-construction algorithm is very sensitive to the input parameters tolS and tolN, and obtaining reasonable results by repeatedly tweaking these stopping conditions by hand is not a good idea.
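To see the sensitivity, compare the trees produced by different ops values; a sketch using ex2.txt, the data set that is pruned later in this section:

myMat2 = mat(loadDataSet("ex2.txt"))
print createTree(myMat2)                  # default ops=(1, 4): a sprawling tree
print createTree(myMat2, ops=(10000, 4))  # a huge tolS: only a handful of leaves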
Post-pruning
Post-pruning requires a test set. First set the parameters so that the tree grows large and complex enough to be worth pruning. Then find pairs of leaf nodes from the top down, and use the test set to judge whether merging them would reduce the test error; if so, merge them.
Pseudo code:
Split the test data on the existing tree:
If either subset is a tree, recursively run the pruning process on that subset
Calculate the error after merging the current two leaf nodes
Calculate the error without merging
If merging reduces the error, merge the leaf nodes
Coding:
#============= pruning functions for the regression tree ==========
def isTree(obj):
    # test whether the object is a tree (dict) or a leaf node
    return (type(obj).__name__ == 'dict')

def getMean(tree):
    # recurse from the top down to the leaves, averaging pairs of leaves on the way back up
    if isTree(tree['right']):
        tree['right'] = getMean(tree['right'])
    if isTree(tree['left']):
        tree['left'] = getMean(tree['left'])
    return (tree['left'] + tree['right']) / 2.0

def prune(tree, testData):
    if shape(testData)[0] == 0:
        return getMean(tree)  # no test data: collapse the tree
    if isTree(tree['right']) or isTree(tree['left']):
        # if either branch is a tree, split the test data and prune recursively
        lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])
    if isTree(tree['left']):
        tree['left'] = prune(tree['left'], lSet)
    if isTree(tree['right']):
        tree['right'] = prune(tree['right'], rSet)
    if not isTree(tree['left']) and not isTree(tree['right']):
        # both branches are now leaves: test whether merging them lowers the error
        lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])
        errorNoMerge = sum(power(lSet[:, -1] - tree['left'], 2)) + \
                       sum(power(rSet[:, -1] - tree['right'], 2))
        treeMean = (tree['left'] + tree['right']) / 2.0
        errorMerge = sum(power(testData[:, -1] - treeMean, 2))
        if errorMerge < errorNoMerge:
            print "merging"
            return treeMean
        else:
            return tree
    else:
        return tree

myMat2 = mat(loadDataSet("ex2.txt"))
myTree = createTree(myMat2, ops=(0, 1))  # grow a large tree to prune
myDatTest = loadDataSet("ex2test.txt")
myMat2Test = mat(myDatTest)
print prune(myTree, myMat2Test)
9.5 Model Tree
When modeling data with a tree, besides setting the leaf nodes to constant values, we can also set them to piecewise linear functions; piecewise linear means the model is composed of multiple linear segments.
Figure 9-3: piecewise linear data used to test the model-tree construction functions
We can fit two straight lines, one over 0.0 to 0.3 and one over 0.3 to 1.0, obtaining two linear models, i.e. a piecewise linear model.
Two straight lines are easier to interpret than a big tree with many nodes; this interpretability is one advantage the model tree has over the regression tree. Model trees also tend to have higher prediction accuracy. The idea is to use the tree-generation algorithm to split the data so that each piece is easy to describe with a linear model; the key is finding the best splits.
Coding:
#========== leaf-generation functions for the model tree =========
def linearSolve(dataSet):
    # run simple linear regression on the data set
    m, n = shape(dataSet)
    X = mat(ones((m, n)))
    Y = mat(ones((m, 1)))
    X[:, 1:n] = dataSet[:, 0:n-1]  # format the data in X and Y
    Y = dataSet[:, -1]
    xTx = X.T * X
    if linalg.det(xTx) == 0.0:
        raise NameError("This matrix is singular, cannot do inverse")
    ws = xTx.I * (X.T * Y)
    # alternatively: ws = linalg.pinv(xTx) * (X.T * Y)
    return ws, X, Y

def modelLeaf(dataSet):
    # generate the leaf-node model once the data no longer needs splitting
    ws, X, Y = linearSolve(dataSet)
    return ws

def modelErr(dataSet):
    ws, X, Y = linearSolve(dataSet)
    yHat = X * ws
    return sum(power(Y - yHat, 2))  # squared error of the linear model

myMat2 = mat(loadDataSet("exp2.txt"))
# print myMat2, type(myMat2)
print createTree(myMat2, modelLeaf, modelErr, (1, 10))
Results:
Figure 9-4: splitting results
The two fitted linear models are y = 3.468 + 1.185x and y = 0.00168 + 11.964x. The actual data were generated from y = 3.5 + 1.0x and y = 0 + 12x plus Gaussian noise, so the results are very close to the truth.

9.6 Example: Comparing tree regression with standard regression
To compare the model tree, the regression tree, and other models objectively, we can compute the correlation coefficient between the predicted values and the actual values, also known as the R² value. NumPy computes the Pearson correlation coefficient with corrcoef(yHat, y, rowvar=0).
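A minimal illustration with made-up predictions (the closer the coefficient is to 1.0, the better the fit):

from numpy import mat, corrcoef

y = mat([[1.0], [2.0], [3.0], [4.0]])       # actual values
yHat = mat([[1.1], [1.9], [3.2], [3.9]])    # hypothetical predictions
print corrcoef(yHat, y, rowvar=0)[0, 1]     # Pearson R, close to 1.0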
Coding:
#================ code for forecasting with tree regression =============
def regTreeEval(model, inDat):
    return float(model)

def modelTreeEval(model, inDat):
    n = shape(inDat)[1]
    X = mat(ones((1, n + 1)))
    X[:, 1:n+1] = inDat
    return float(X * model)

def treeForeCast(tree, inData, modelEval=regTreeEval):
    if not isTree(tree):
        return modelEval(tree, inData)  # a leaf: return a single float
    if inData[tree['spInd']] > tree['spVal']:
        if isTree(tree['left']):
            return treeForeCast(tree['left'], inData, modelEval)
        else:
            return modelEval(tree['left'], inData)
    else:
        if isTree(tree['right']):
            return treeForeCast(tree['right'], inData, modelEval)
        else:
            return modelEval(tree['right'], inData)

def createForeCast(tree, testData, modelEval=regTreeEval):
    m = len(testData)
    yHat = mat(zeros((m, 1)))
    for i in range(m):
        # call treeForeCast once per sample and stack the results into a column
        yHat[i, 0] = treeForeCast(tree, mat(testData[i]), modelEval)
    return yHat

trainMat = mat(loadDataSet("bikeSpeedVsIq_train.txt"))
testMat = mat(loadDataSet("bikeSpeedVsIq_test.txt"))
myTree = createTree(trainMat, ops=(1, 20))
yHat = createForeCast(myTree, testMat[:, 0])
print "Pearson correlation coefficient of the regression tree:", corrcoef(yHat, testMat[:, 1], rowvar=0)[0, 1]
myTree = createTree(trainMat, modelLeaf, modelErr, (1, 20))
yHat = createForeCast(myTree, testMat[:, 0], modelTreeEval)
print "Pearson correlation coefficient of the model tree:", corrcoef(yHat, testMat[:, 1], rowvar=0)[0, 1]
ws, X, Y = linearSolve(trainMat)
print "linear regression coefficients:", ws
for i in range(shape(testMat)[0]):
    yHat[i] = testMat[i, 0] * ws[1, 0] + ws[0, 0]
print "Pearson correlation coefficient of the linear regression model:", corrcoef(yHat, testMat[:, 1], rowvar=0)[0, 1]
Results:
Figure 9-5: Pearson correlation coefficients of the regression tree, the model tree, and simple linear regression
We can see that the result of the model tree is better than that of the regression tree.
9.7 Using Python's Tkinter library to create a GUI
This section uses a graphical user interface (GUI) framework for Python: Tkinter.
A Tkinter GUI is composed of widgets: objects such as text boxes, buttons, labels, and check buttons.
Figure 9-6: Tkinter's Hello World
When grid() is called on myLabel, the location of myLabel is reported to the layout manager; the grid() method arranges widgets in a two-dimensional table.
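The Hello World of Figure 9-6 comes down to a few lines (a minimal sketch):

from Tkinter import *   # Python 2; in Python 3 the module is tkinter

root = Tk()
myLabel = Label(root, text="Hello World")
myLabel.grid()          # hand myLabel to the grid layout manager
root.mainloop()         # start the event loop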
Coding:
#!/usr/bin/env python
# coding=utf-8
# Tkinter widgets for building the tree-manager interface
from numpy import *
from Tkinter import *
import regTrees

def reDraw(tolS, tolN):
    pass

def drawNewTree():
    pass

root = Tk()
Label(root, text="Plot Place Holder").grid(row=0, columnspan=3)  # a label in row 0 spanning 3 columns
Label(root, text="tolN").grid(row=1, column=0)
tolNentry = Entry(root)  # Entry is a text box that accepts a single line of input
tolNentry.grid(row=1, column=1)  # place it at row 1, column 1, then insert a default value
tolNentry.insert(0, '10')
Label(root, text="tolS").grid(row=2, column=0)
tolSentry = Entry(root)
tolSentry.grid(row=2, column=1)
tolSentry.insert(0, '1.0')
Button(root, text="ReDraw", command=drawNewTree).grid(row=1, column=2, rowspan=3)  # the ReDraw button spans 3 rows
chkBtnVar = IntVar()  # IntVar holds the integer state of the check button
chkBtn = Checkbutton(root, text="Model Tree", variable=chkBtnVar)
chkBtn.grid(row=3, column=0, columnspan=2)
reDraw.rawDat = mat(regTrees.loadDataSet("sine.txt"))
reDraw.testDat = arange(min(reDraw.rawDat[:, 0]), max(reDraw.rawDat[:, 0]), 0.01)
reDraw(1.0, 10)
root.mainloop()
Results:
Figure 9-7: the tree manager built from several Tkinter widgets
Integrating Matplotlib and Tkinter
Matplotlib images can be drawn inside the GUI. Matplotlib separates its front end, such as the plot and scatter functions, from its back ends, which implement the interface between the drawing code and different output targets; changing the back end lets it render images to PNG, PDF, or SVG files, among others. Here we set the back end to TkAgg: TkAgg calls Agg from a Tk GUI framework and renders Agg's output onto the canvas.
Coding:
import matplotlib
matplotlib.use('TkAgg')  # set the back end to TkAgg before importing the canvas
from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg
from matplotlib.figure import Figure

def reDraw(tolS, tolN):
    reDraw.f.clf()  # clear the previous figure
    reDraw.a = reDraw.f.add_subplot(111)  # add a fresh subplot
    if chkBtnVar.get():  # the check box decides: model tree or regression tree
        if tolN < 2:
            tolN = 2
        myTree = regTrees.createTree(reDraw.rawDat, regTrees.modelLeaf,
                                     regTrees.modelErr, (tolS, tolN))
        yHat = regTrees.createForeCast(myTree, reDraw.testDat, regTrees.modelTreeEval)
    else:
        myTree = regTrees.createTree(reDraw.rawDat, ops=(tolS, tolN))
        yHat = regTrees.createForeCast(myTree, reDraw.testDat)
    reDraw.a.scatter(reDraw.rawDat[:, 0], reDraw.rawDat[:, 1], s=5)  # scatter plot of the true values
    reDraw.a.plot(reDraw.testDat, yHat, linewidth=2.0)  # line plot of the predictions
    reDraw.canvas.show()

def getInputs():
    # read the user's input: tolN should be an integer, tolS a float
    try:
        tolN = int(tolNentry.get())  # call get() on the Entry widget
    except:
        tolN = 10
        print "enter Integer for tolN"
        tolNentry.delete(0, END)
        tolNentry.insert(0, '10')
    try:
        tolS = float(tolSentry.get())
    except:
        tolS = 1.0
        print "enter Float for tolS"
        tolSentry.delete(0, END)
        tolSentry.insert(0, '1.0')
    return tolN, tolS

def drawNewTree():
    # called whenever the ReDraw button is clicked
    tolN, tolS = getInputs()  # read the values from the entry boxes
    reDraw(tolS, tolN)

root = Tk()
reDraw.f = Figure(figsize=(5, 4), dpi=100)
reDraw.canvas = FigureCanvasTkAgg(reDraw.f, master=root)
reDraw.canvas.show()
reDraw.canvas.get_tk_widget().grid(row=0, columnspan=3)
# (the Label/Entry/Button setup and root.mainloop() from the previous listing follow here)
Results:
Figure 9-8: a regression tree built with the treeExplore GUI
Figure 9-9: a model tree built with parameters tolN=1, tolS=0

9.8 Summary
When the relationship between the input data and the target variable is nonlinear, a tree structure can be used to split the prediction into pieces, including piecewise constant and piecewise linear models. If the leaf nodes are piecewise constants, the tree is a regression tree; if they are linear regression equations, it is a model tree.