Linear regression was introduced earlier, but in practice it is often unrealistic to fit the entire dataset with a single linear model, because real data is rarely globally linear.
We also introduced locally weighted linear regression, which has its own limitations.
Here we introduce another idea: tree regression.
The basic idea is to use a decision tree to divide the dataset into several subsets, and then fit each subset with a simple model (a constant value or a linear regression).
A decision tree is built by a greedy algorithm. The simplest and most typical decision tree algorithm is ID3.
ID3 selects the best feature to split on at each step, and the number of branches is determined by the number of values that feature takes; for example, gender is split into male and female.
When choosing the best feature, Shannon entropy is used as the criterion, measuring whether the current split makes the data more ordered.
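As a quick refresher, here is a minimal sketch of how Shannon entropy can be computed for a set of class labels (plain Python for illustration, not the book's code):

from math import log

def shannon_entropy(labels):
    # H = -sum(p * log2(p)) over the classes that appear in labels
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    n = float(len(labels))
    return -sum((c / n) * log(c / n, 2) for c in counts.values())

# a 50/50 split is maximally disordered: shannon_entropy(['male', 'female']) == 1.0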
The limitations of ID3 are:
First, features can only take discrete values.
Second, the splitting is too aggressive: a feature is split into all of its values at once, so the data is quickly cut into pieces that are too small.
Third, it can only be used for classification problems.
Therefore, the CART (Classification And Regression Trees) algorithm is more practical. As the name suggests, it can be used for both classification and regression problems.
In fact, the biggest difference between CART and ID3 is that CART always splits a feature into exactly two parts, which makes it easy to handle continuous features and keeps the data from being cut up too quickly.
Regression tree
Next we look at the algorithm for building a regression tree on continuous features. In a regression tree, each leaf node is a single constant value.

from numpy import *

def binSplitDataSet(dataSet, feature, value):
    # split the rows of dataSet in two, according to whether the given feature is > value
    mat0 = dataSet[nonzero(dataSet[:, feature] > value)[0], :]
    mat1 = dataSet[nonzero(dataSet[:, feature] <= value)[0], :]
    return mat0, mat1

def createTree(dataSet, leafType=regLeaf, errType=regErr, ops=(1, 4)):
    # regLeaf and regErr are defined further below
    feat, val = chooseBestSplit(dataSet, leafType, errType, ops)
    if feat is None: return val          # no good split was found, so this is a leaf node
    retTree = {}
    retTree['spInd'] = feat              # index of the feature to split on
    retTree['spVal'] = val               # value to split at
    lSet, rSet = binSplitDataSet(dataSet, feat, val)
    retTree['left'] = createTree(lSet, leafType, errType, ops)
    retTree['right'] = createTree(rSet, leafType, errType, ops)
    return retTree
The code above is the general algorithm for building the regression tree; let us look at each function.
1. binSplitDataSet
The book's implementation of this binary-split function is hard to read, so I implemented my own:
def binSplitDataSet(dataSet, feature, value):
    m = dataSet[:, feature].T > value
    m = m.getA()[0]          # convert the numpy.matrix of booleans to a 1-d numpy.array mask
    mat0 = dataSet[m]        # rows where the feature is greater than value
    mat1 = dataSet[~m]       # the remaining rows
    return mat0, mat1

Boolean indexing in NumPy makes this easy to implement.
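A small illustration of the boolean indexing used above (the toy matrix is made up):

from numpy import mat

dataSet = mat([[1.0, 0.2],
               [2.0, 0.9],
               [3.0, 0.4]])
m = (dataSet[:, 0].T > 1.5).getA()[0]   # 1-d boolean mask over the rows: [False, True, True]
print(dataSet[m])                        # rows where feature 0 >  1.5
print(dataSet[~m])                       # rows where feature 0 <= 1.5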
2. createTree
Each node of the tree stores the split feature, the split value, the left subtree, and the right subtree.
So the most important piece is chooseBestSplit. This function finds the best feature and split value; when a leaf node should be created instead, it returns the value of that leaf node.
Next we give the implementation of chooseBestSplit:
def regLeaf(dataSet):
    # the leaf node is the mean of the target values
    return mean(dataSet[:, -1])

def regErr(dataSet):
    # total squared error: variance of the targets times the number of samples
    return var(dataSet[:, -1]) * shape(dataSet)[0]

def chooseBestSplit(dataSet, leafType=regLeaf, errType=regErr, ops=(1, 4)):
    tolS = ops[0]   # minimum error reduction; if a split reduces the error by less, stop splitting
    tolN = ops[1]   # minimum number of samples in a split; smaller subsets are not allowed
    if len(set(dataSet[:, -1].T.tolist()[0])) == 1:   # all target values are identical, no need to split
        return None, leafType(dataSet)
    m, n = shape(dataSet)
    S = errType(dataSet)
    bestS = inf; bestIndex = 0; bestValue = 0
    for featIndex in range(n - 1):                                  # for each feature
        for splitVal in set(dataSet[:, featIndex].T.tolist()[0]):   # for each value the feature takes in the training set
            mat0, mat1 = binSplitDataSet(dataSet, featIndex, splitVal)
            if (shape(mat0)[0] < tolN) or (shape(mat1)[0] < tolN):
                continue                          # a subset after this split is too small, skip this value
            newS = errType(mat0) + errType(mat1)
            if newS < bestS:                      # a better split has been found
                bestIndex = featIndex
                bestValue = splitVal
                bestS = newS
    if (S - bestS) < tolS:                        # the best split reduces the error by less than tolS,
        return None, leafType(dataSet)            # so generate a leaf node instead
    mat0, mat1 = binSplitDataSet(dataSet, bestIndex, bestValue)
    if (shape(mat0)[0] < tolN) or (shape(mat1)[0] < tolN):
        return None, leafType(dataSet)            # even the best split leaves a subset that is too small
    return bestIndex, bestValue

In addition to dataSet, it takes two function arguments:
leafType, a function that generates a leaf node.
errType, a function that computes the error of a dataset.
How should leafType and errType be defined for a regression tree?
First, how do we understand tree regression on plain values, i.e., the regression tree? My understanding is that it works like clustering.
So leafType, the function that generates a leaf node, simply computes the mean of the target values, i.e., it represents the data in that leaf by its "cluster center".
errType computes the total variance of the targets in a subset, so splitting with the decision tree tends to group close values into the same leaf.
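In other words, regErr is the total squared deviation of the targets from their mean (variance times the number of samples), so minimizing it pulls similar target values into the same leaf. A quick check of that equivalence on a made-up column of targets:

from numpy import mat, mean, var, shape, power

y = mat([[1.0], [1.2], [0.8], [5.0], [5.1]])
print(power(y - mean(y), 2).sum())   # total squared deviation from the mean
print(var(y) * shape(y)[0])          # regErr computes the same quantity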
Finally, the parameter ops = (tolS, tolN) is also very important, because it sets the thresholds that stop the splitting; this is known as prepruning.
The reason is to prevent the decision tree from overfitting: when the reduction in error is smaller than tolS, or the size of either subset after the split is smaller than tolN, splitting stops.
The problem, of course, is that the algorithm is very sensitive to this parameter.
For example, for the dataset above, two leaf nodes look like the reasonable answer, but if the values in ops are too small, many more leaf nodes will be produced.
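A sketch of that sensitivity on synthetic step data (the data and the ops values below are only illustrative): with the default thresholds the tree stays small, while tolS = 0 and tolN = 1 split on almost every point.

from numpy import mat, random, column_stack

# toy step data: y is roughly 0 for x < 0.5 and roughly 10 for x >= 0.5,
# so two leaf nodes are the "reasonable" tree
random.seed(0)
x = random.rand(200, 1)
y = (x >= 0.5) * 10.0 + random.randn(200, 1) * 0.1
data = mat(column_stack((x, y)))

tree_small = createTree(data, ops=(1, 4))     # default thresholds: a small tree
tree_overfit = createTree(data, ops=(0, 1))   # almost no prepruning: far too many leaves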
Therefore, an important step in decision tree algorithms is dealing with overfitting.
The common method is a pruning algorithm (postpruning).
def isTree(obj):
    return (type(obj).__name__ == 'dict')

def getMean(tree):
    # collapse a whole subtree into the mean of its leaf values
    if isTree(tree['right']): tree['right'] = getMean(tree['right'])
    if isTree(tree['left']): tree['left'] = getMean(tree['left'])
    return (tree['left'] + tree['right']) / 2.0

def prune(tree, testData):
    if shape(testData)[0] == 0: return getMean(tree)       # no test data reaches this subtree, collapse it
    if (isTree(tree['right']) or isTree(tree['left'])):    # there are subtrees, so split the test data for recursion
        lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])
    if isTree(tree['left']): tree['left'] = prune(tree['left'], lSet)      # recursively prune the left subtree
    if isTree(tree['right']): tree['right'] = prune(tree['right'], rSet)   # recursively prune the right subtree
    if not isTree(tree['left']) and not isTree(tree['right']):             # both children are leaves, try to merge them
        lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])
        errorNoMerge = sum(power(lSet[:, -1] - tree['left'], 2)) + \
                       sum(power(rSet[:, -1] - tree['right'], 2))          # test error without merging
        treeMean = (tree['left'] + tree['right']) / 2.0
        errorMerge = sum(power(testData[:, -1] - treeMean, 2))             # test error after merging
        if errorMerge < errorNoMerge:       # merging reduces the test error, so prune
            print("merging")
            return treeMean                 # replace the subtree with the mean of its leaves
        else:
            return tree
    else:
        return tree

The idea is simple:
if a node has subtrees, prune them recursively; then, starting from the leaves, check whether merging two sibling leaves (i.e., pruning) reduces the error on the test data, and merge them if it does.
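A sketch of driving the pruning step (synthetic data again; the key point is that prune takes a separately held-out test set, not the training set):

from numpy import mat, random, column_stack

def make_step_data(n):
    # noisy step data: y jumps from about 0 to about 10 at x = 0.5
    x = random.rand(n, 1)
    y = (x >= 0.5) * 10.0 + random.randn(n, 1) * 1.0
    return mat(column_stack((x, y)))

random.seed(1)
train = make_step_data(200)
test = make_step_data(100)

big_tree = createTree(train, ops=(0, 1))   # deliberately overgrown: it fits the noise
pruned_tree = prune(big_tree, test)        # sibling leaves are merged whenever that lowers the test error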
Model tree
The model tree is an extension of the regression tree. The regression tree essentially fits each partition with a single constant value, which is too coarse.
As described earlier, we can instead fit each partition with a linear model.
For example, if a plain regression tree is used to fit such a piecewise-linear training set, it will produce many leaf nodes.
If a model tree is used instead, there will be only two leaf nodes, each holding a linear model, which is obviously more reasonable and easier to interpret.
For the model tree, we can still use the createTree above directly;
we only need to change leafType and errType:
def linearSolve(dataSet):
    # fit a linear model to the subset by ordinary least squares
    m, n = shape(dataSet)
    X = mat(ones((m, n))); Y = mat(ones((m, 1)))
    X[:, 1:n] = dataSet[:, 0:n-1]; Y = dataSet[:, -1]    # first column of X stays 1 for the intercept
    xTx = X.T * X
    if linalg.det(xTx) == 0.0:
        raise NameError('This matrix is singular, cannot do inverse,\n'
                        'try increasing the second value of ops')
    ws = xTx.I * (X.T * Y)
    return ws, X, Y          # X and Y are returned because they are needed later to compute the error

def modelLeaf(dataSet):
    # the leaf node stores the fitted weights ws of the linear model
    ws, X, Y = linearSolve(dataSet)
    return ws

def modelErr(dataSet):
    ws, X, Y = linearSolve(dataSet)
    yHat = X * ws
    return sum(power(Y - yHat, 2))   # squared error between the predictions and the true values
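A sketch of building a model tree on piecewise-linear toy data (the data is made up; the only point is that leafType and errType are swapped for modelLeaf and modelErr):

from numpy import mat, random, column_stack

# piecewise-linear toy data: y = 2x below x = 0.5, y = 10 - 3x above it
random.seed(2)
x = random.rand(200, 1)
y = (x < 0.5) * (2.0 * x) + (x >= 0.5) * (10.0 - 3.0 * x) + random.randn(200, 1) * 0.05
data = mat(column_stack((x, y)))

model_tree = createTree(data, leafType=modelLeaf, errType=modelErr, ops=(1, 10))
# each leaf now stores a weight vector ws (intercept and slope) instead of a single constant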