These past few days I returned to tree-related topics to finish this part of the study. There is a lot of material here, and the harvest was correspondingly large; I am very happy to have finally worked through all of it.
Tree regression. This chapter covers CART (Classification And Regression Tree), which, as the name says, can be used for both classification and regression. The earlier decision-tree write-up did not mention it, so I add it here and take the chance to summarize the three decision trees we have learned:
ID3: uses information gain to choose the splitting feature and can only handle classification problems. Its drawback is a bias toward features with many distinct values: if feature A has 2 values and feature B has 3, and both reduce impurity about equally, ID3 will tend to pick B. This makes the tree more complex and harder to build, and the extra nodes also invite overfitting. Because of this shortcoming, C4.5 was invented as an optimized version of ID3.
C4.5: uses the information gain ratio to choose the splitting feature. Dividing the information gain by a term that grows with the number of values a feature takes offsets the bias above, so the gain ratio reflects more objectively how well a feature reduces label impurity.
CART: uses the Gini index (a concept similar to entropy) to choose features for classification, and total squared error to choose splits for regression. In the two algorithms above, once a feature has been used for a split it is never used again; the split consumes the feature too 'quickly' and cannot fully exploit its values. CART's binary splits fix this, which is how it came about.
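To make "similar to entropy" concrete, here is a minimal sketch (my own toy example, not from the book) computing both impurity measures for a list of labels; both are 0 for a pure set and grow with mixing:

from collections import Counter
from math import log

def gini(labels):
    # Gini index: 1 - sum(p_k^2); smaller means purer
    m = float(len(labels))
    return 1.0 - sum((c / m) ** 2 for c in Counter(labels).values())

def entropy(labels):
    # Shannon entropy: -sum(p_k * log2(p_k)); smaller means purer
    m = float(len(labels))
    return -sum((c / m) * log(c / m, 2) for c in Counter(labels).values())

print(gini(['a', 'a', 'b', 'b']))     # 0.5, the most impure two-class case
print(entropy(['a', 'a', 'b', 'b']))  # 1.0
print(gini(['a', 'a', 'a', 'a']))     # 0.0, a pure set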
ID3 and C4.5 can only handle classification problems; they cannot cope with continuous-valued labels. This is where CART's advantage shows: it can perform regression analysis and predict continuous values, which is called tree regression. CART can also be pruned to avoid overfitting. Pruning is divided into pre-pruning and post-pruning: post-pruning works on the finished tree, while pre-pruning amounts to adjusting the ops parameter of the tree-building function. It is also possible to replace the constant value at each leaf with a linear regression, giving a model tree, whose regression effect is slightly better than simple linear regression (as the three-way comparison at the end shows).
Step1:
Read data:
from numpy import *

# Read a tab-delimited data file into a list of float lists
def loadDataSet(fileName):
    feaNum = len(open(fileName).readline().strip().split('\t'))
    dataMat = []
    f = open(fileName)
    for i in f.readlines():
        line = i.strip().split('\t')
        l = []
        for j in range(feaNum):
            l.append(float(line[j]))
        dataMat.append(l)
    return dataMat
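A quick usage sketch; ex00.txt is the sample data file that ships with the book's source code (an assumption about your setup), holding one feature column and one label column:

data = loadDataSet('ex00.txt')   # assumed sample file from the book's source
print(len(data), len(data[0]))   # number of samples, number of columns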
Step2:
To create a regression tree recursively:
# Build the tree recursively.
# ops: ops[0] is the minimum error reduction allowed for a split,
#      ops[1] is the minimum number of samples in each subset
def createTree(dataSet, leafType=regLeaf, errType=regError, ops=(1, 4)):
    tree = {}
    fea, val = chooseBestFeature(dataSet, leafType, errType, ops)
    if fea is None:
        return val
    tree['fea'] = fea
    tree['val'] = val
    leftTree, rightTree = splitDataSet(dataSet, fea, val)
    tree['left'] = createTree(leftTree, leafType, errType, ops)
    tree['right'] = createTree(rightTree, leafType, errType, ops)
    return tree
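Assuming the Step 5 helpers below (regLeaf, regError) are defined first and the sample file ex00.txt is available, building a regression tree is a single call; a sketch:

myData = loadDataSet('ex00.txt')
regTree = createTree(myData)   # defaults: regLeaf, regError, ops=(1, 4)
print(regTree)                 # nested dict: {'fea': ..., 'val': ..., 'left': ..., 'right': ...}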
Step3:
The subset-splitting function used by the regression tree:
# Split the data into two subsets on the given feature at value val
def splitDataSet(data, feature, val):
    mat0 = []
    mat1 = []
    data = mat(data)
    m = shape(data)[0]
    for i in range(m):
        if data[i, feature] > val:
            mat0.append(data[i].tolist()[0])
        else:
            mat1.append(data[i].tolist()[0])
    return mat0, mat1
STEP4:
The function that finds the best splitting feature in the regression tree:
Stop conditions:
1: a split would leave fewer than ops[1] (here 4) samples in a subset;
2: the error reduction from the best split is less than the minimum allowed value ops[0];
3: all label values are identical, so no split is needed; return a leaf node.
# Find the best feature and value to split on
def chooseBestFeature(dataSet, leafType=regLeaf, errType=regError, ops=(1, 4)):
    d = mat(dataSet)
    leastNum = ops[1]
    leastErr = ops[0]
    m, n = shape(d)
    # stop condition 3: all labels identical
    if len(set(d[:, -1].T.tolist()[0])) == 1:
        return None, leafType(d)
    errorSum = errType(d)
    bestFea = -1
    bestVal = 1
    bestError = inf
    for i in range(n - 1):
        for temp in set(d[:, i].T.tolist()[0]):
            mat0, mat1 = splitDataSet(d, i, temp)
            # stop condition 1: a subset would be too small
            if shape(mat0)[0] < leastNum or shape(mat1)[0] < leastNum:
                continue
            tempSumError = errType(mat0) + errType(mat1)
            if tempSumError < bestError:
                bestFea = i
                bestVal = temp
                bestError = tempSumError
    # stop condition 2: the error reduction is too small
    if (errorSum - bestError) < leastErr:
        print(errorSum, bestError)
        return None, leafType(d)
    mat0, mat1 = splitDataSet(d, bestFea, bestVal)
    if shape(mat0)[0] < leastNum or shape(mat1)[0] < leastNum:
        return None, leafType(d)
    return bestFea, bestVal
STEP5:
Related functions:
# Total squared error: label variance times the sample count
def regError(dataSet):
    t = mat(dataSet)
    return var(t[:, -1]) * shape(t)[0]

# Leaf value for a regression tree: the mean label of the leaf
def regLeaf(dataSet):
    t = mat(dataSet)
    return mean(t[:, -1])
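For intuition, the total squared error is just the label variance scaled by the sample count. A tiny hand-checkable sketch (my own numbers):

toy = [[1.0, 2.0], [1.0, 4.0]]   # labels are 2.0 and 4.0
# var([2, 4]) = 1.0 (population variance), times 2 samples = 2.0
print(regError(toy))             # 2.0
print(regLeaf(toy))              # 3.0, the mean label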
# A node is a subtree if it is a dict, otherwise it is a leaf value
def isTree(tree):
    return isinstance(tree, dict)

# Collapse a subtree to the mean of its leaf values
def getMean(tree):
    if isTree(tree['left']):
        tree['left'] = getMean(tree['left'])
    if isTree(tree['right']):
        tree['right'] = getMean(tree['right'])
    return (tree['left'] + tree['right']) / 2.0
Let's look at the effect:
Figure 1:
The result is:
Figure 2:
The result is:
We can see that the results match our expectations.
STEP6:
Pruning of regression trees: pruning here differs from the pruning of classification trees, which cuts unnecessary nodes off; here pruning merges redundant nodes. The purpose is to keep CART from overfitting. Pruning is divided into pre-pruning and post-pruning.
What follows is post-pruning; pre-pruning amounts to adjusting the ops parameter of the tree-building function.
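Pre-pruning in code is therefore just a different ops. A minimal sketch, reusing myData from above (the value 10000 is only illustrative): raising the minimum error reduction ops[0] or the minimum subset size ops[1] stops splitting earlier and yields a smaller tree.

# default ops=(1, 4) grows a fine-grained tree
tree1 = createTree(myData)
# demanding a huge error reduction per split yields far fewer nodes
tree2 = createTree(myData, ops=(10000, 4))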
Here are two details:
1: When no test data reaches a tree node, we collapse that subtree into a single node.
2: When both children of a node are leaf values, we use the test error to decide whether to keep them split or merge them.
For a better pruning result, pre-pruning and post-pruning should be used together.
# Post-prune the regression tree using a test set
def pruneTree(tree, testData):
    # no test data reaches this node: collapse it
    if shape(testData)[0] == 0:
        return getMean(tree)
    if isTree(tree['left']) or isTree(tree['right']):
        lSet, rSet = splitDataSet(testData, tree['fea'], tree['val'])
    if isTree(tree['left']):
        tree['left'] = pruneTree(tree['left'], lSet)
    if isTree(tree['right']):
        tree['right'] = pruneTree(tree['right'], rSet)
    if (not isTree(tree['left'])) and (not isTree(tree['right'])):
        l, r = splitDataSet(testData, tree['fea'], tree['val'])
        lSet = mat(l)
        rSet = mat(r)
        e1 = 0.0
        e2 = 0.0
        # the book overlooks that one subset may have no elements
        if shape(lSet)[1] != 0:
            e1 = sum(power(lSet[:, -1] - tree['left'], 2))
        if shape(rSet)[1] != 0:
            e2 = sum(power(rSet[:, -1] - tree['right'], 2))
        errorNoMerge = e1 + e2
        average = (tree['left'] + tree['right']) / 2.0
        test = mat(testData)
        errorMerge = sum(power(test[:, -1] - average, 2))
        if errorMerge < errorNoMerge:
            print('merging')
            return average
        else:
            return tree
    else:
        return tree
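A usage sketch: grow a deliberately overgrown tree, then prune it against a separate test set. The file names ex2.txt and ex2test.txt are the sample files I assume from the book's source code:

trainData = loadDataSet('ex2.txt')
testData = loadDataSet('ex2test.txt')
bigTree = createTree(trainData, ops=(0, 1))   # ops=(0, 1) overfits on purpose
prunedTree = pruneTree(bigTree, testData)     # merge nodes that don't help on test data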
STEP7:
Model Tree:
To get a model tree, the pluggable functions of the regression tree are swapped out: the leaf function and the error function are replaced with linear-regression versions, so each leaf holds a linear model instead of a constant.
# Linear regression for the model tree (simple OLS; errors out
# when the matrix x.T * x has no inverse)
def lineSolve(data):
    lineData = mat(data)
    m, n = shape(lineData)
    x = mat(ones((m, n)))
    x[:, 1:n] = lineData[:, 0:n-1]   # first column stays 1 for the intercept
    y = lineData[:, -1]
    xTemp = x.T * x
    if linalg.det(xTemp) == 0.0:
        print('error, no inverse matrix')
        return
    w = xTemp.I * x.T * y
    return x, y, w

# Model-tree error: squared error of the linear fit
def modError(dataSet):
    x, y, w = lineSolve(dataSet)
    yHat = x * w
    return sum(power(yHat - y, 2))

# Model-tree leaf: store the regression coefficients w
def modLeaf(dataSet):
    x, y, w = lineSolve(dataSet)
    return w
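Building the model tree then reuses createTree with these functions swapped in; a sketch, where exp2.txt is the sample file I assume from the book's source and ops=(1, 10) is one reasonable setting:

modData = loadDataSet('exp2.txt')   # assumed sample file
modTree = createTree(modData, leafType=modLeaf, errType=modError, ops=(1, 10))
# each leaf now holds a coefficient vector w instead of a single mean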
The figure is:
The result is:
The value stored at a leaf node is w; x * w then gives the predicted value of y.
STEP8:
Predictive functions for tree regression:
# Leaf evaluation for a regression tree: the stored value itself
def regValue(value, dataSet):
    return float(value)

# Leaf evaluation for a model tree: build x and return x * w
def modValue(value, dataSet):
    data = mat(dataSet)
    n = shape(data)[1]
    x = mat(ones((1, n + 1)))
    x[:, 1:n + 1] = data
    return float(x * value)

# Predict every row of the test data with the tree
def predictTree(tree, t, valueType=regValue):
    testData = mat(t)
    m, n = shape(testData)
    yHat = mat(zeros((m, 1)))
    for i in range(m):
        yHat[i, 0] = predictValue(tree, testData[i], valueType)
    return yHat

# Walk a single sample down the tree to a leaf
def predictValue(tree, test, valueType=regValue):
    if not isTree(tree):
        return valueType(tree, test)
    if test[0, tree['fea']] > tree['val']:
        return predictValue(tree['left'], test, valueType)
    else:
        return predictValue(tree['right'], test, valueType)
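A prediction sketch, pairing each tree with its matching leaf evaluator; bikeSpeedVsIq_train.txt and bikeSpeedVsIq_test.txt are the sample files I assume from the book's source (feature in column 0, label in column 1):

trainData = loadDataSet('bikeSpeedVsIq_train.txt')   # assumed sample files
testMat = mat(loadDataSet('bikeSpeedVsIq_test.txt'))
regT = createTree(trainData)
yHatReg = predictTree(regT, testMat[:, 0], regValue)   # regression-tree predictions
modT = createTree(trainData, modLeaf, modError, (1, 20))
yHatMod = predictTree(modT, testMat[:, 0], modValue)   # model-tree predictions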
Next we compare three kinds of regression:
The figure is:
The closer the correlation coefficient is to 1, the stronger the correlation.
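NumPy's corrcoef gives this number directly; a sketch reusing yHatReg and yHatMod from the prediction step (rowvar=0 treats each column as a variable):

# Pearson correlation between predicted and actual labels
print(corrcoef(yHatReg, testMat[:, 1], rowvar=0)[0, 1])
print(corrcoef(yHatMod, testMat[:, 1], rowvar=0)[0, 1])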
1: CART tree regression:
The correlation coefficients are:
2: Model tree:
The correlation coefficients are:
3: Linear regression:
The correlation coefficients are:
The visible ordering is: model tree > regression tree > linear regression.
This finally completes the supervised-learning portion. I still don't really understand SVM; when I get the chance I will study its applications. Tomorrow I start on unsupervised learning, beginning with clustering algorithms. Keep going!
Machine Learning Day 14: Machine Learning in Action, tree regression (CART) and model trees