Machine Learning Classic Algorithms and Python Implementation: CART Classification Decision Tree, Regression Tree, and Model Tree


Summary:

Classification and Regression Tree (CART) is an important machine learning algorithm that can be used to build either a classification tree or a regression tree. This article introduces the principles of CART for classification with discrete labels and for regression with continuous targets. For the tree-building process it covers the Gini index as the measure of impurity, the special handling of continuous and discrete features (including data sets in which both kinds coexist), and post-pruning; it then covers the principles, applicable scenarios, and construction process of regression trees and model trees. In my opinion, the regression tree and the model tree can both be regarded as "community classification" algorithms: the label values within each community follow a continuous distribution, while clear boundaries exist between communities.

(i) Understanding the CART algorithm

Classification and Regression Tree (CART) is one of the decision tree algorithms and a very important one, counted among the top ten machine learning algorithms. As the name implies, the CART algorithm can be used both to create a classification tree and to create a regression tree (or a model tree); the two differ slightly in the building process. A previous article, "The Classic Algorithms of Machine Learning and Their Python Implementation (Decision Trees)", introduced the principles of classification decision trees and the ID3 and C4.5 algorithms in detail; building on that, this article describes the application of the CART algorithm to decision tree classification and tree regression.

When creating a classification tree, CART selects, at each node, the feature split with the minimum Gini information gain in the current data set as the node's partition. The ID3 and C4.5 algorithms extract as much information as possible from the training sample set, but their decision trees tend to have many branches and a large scale; the CART algorithm can simplify the size of the decision tree and improve the efficiency of generating it. For continuous features, CART uses the same handling as C4.5. To avoid overfitting, the CART decision tree needs pruning. The prediction process is also very simple: follow the decision tree model, matching feature values down to a leaf node, and the leaf gives the predicted category.

When creating a regression tree, the target values of the observations are continuous and there are no categorical labels; the prediction rules are derived only from the observed data. In this case, the optimal partitioning rules of the classification tree are powerless, so CART uses minimization of the squared residuals (the residual variance) to determine the optimal partition of the regression tree. The partitioning criterion is that the error variance of the subtrees after the partition should be as small as possible. When a model tree is created, each leaf node holds a machine learning model, such as a linear regression model.

The CART algorithm rests on the following three foundations:

(1) Binary split: in each decision step, the observed variable is split into two parts.

The CART algorithm uses a binary recursive partitioning technique: the algorithm always divides the current sample set into two subsample sets, so that every non-leaf node of the resulting decision tree has exactly two branches. The decision tree generated by the CART algorithm is therefore a binary tree with a simple structure. As a result, the CART algorithm is well suited to scenarios in which a sample feature takes yes/no values, and its handling of continuous features is similar to the C4.5 algorithm.

(2) Univariate split (split based on one variable): each optimal partition is made with respect to a single variable.

(3) Pruning strategy: this is a key point of the CART algorithm and, indeed, a key step of every tree-based algorithm.

The pruning process is particularly important and occupies a central position in generating the optimal decision tree. Studies have shown that pruning matters more than tree growing: for the maximum trees generated under different partitioning criteria, the most important attribute partitions are preserved after pruning, with little difference between them. The pruning method is therefore more critical to obtaining the optimal tree.

(ii) CART classification decision tree

1. Information-theoretic basis and algorithm flow of CART

The difference between CART and C4.5 at a node split is that CART is based on the Gini index, which measures the impurity of a data partition or of the training data set D. The smaller the Gini value, the higher the purity of the sample set (i.e., the higher the probability that the samples belong to the same class). Measuring the Gini index over all values of a feature in the data set gives that feature's Gini split info, which is the Gini gain (GiniGain). Leaving pruning aside, the recursive creation of the classification decision tree selects, at each step, the split with the smallest GiniGain as the fork point, until every sub-data set belongs to a single class or all features are exhausted.

Because of CART's binary nature, when the training data has more than two categories, CART needs to consider merging the target categories into two superclasses, a process known as twoing. How are the categories within a superclass then distinguished? Are they further classified according to other features? TBD

(1) The concept of the Gini index:

The Gini index is a measure of inequality, often used to quantify income inequality, but it can measure any uneven distribution. It is a number between 0 and 1, where 0 means perfect equality and 1 means complete inequality. When used as a classification measure, the more mixed the classes contained in the population, the larger the Gini index (similar in spirit to the concept of entropy).

For a data set T, its Gini index is calculated as:

Gini(T) = 1 - \sum_{j=1}^{n} p_j^2

where n is the number of categories and p_j is the probability that a sample in T belongs to category j.
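As a concrete illustration (not code from the original article), here is a minimal calcGini sketch for a data set stored as a list of samples whose class label is in the last column:

    from collections import Counter

    def calcGini(dataSet):
        """Gini index: 1 minus the sum of squared class probabilities."""
        total = float(len(dataSet))
        labelCounts = Counter(sample[-1] for sample in dataSet)
        return 1.0 - sum((count / total) ** 2 for count in labelCounts.values())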

(2) Gini Gain

Measuring the Gini index over all values of a feature gives the Gini Split Info:

GiniGain(T, A) = \sum_{i} \frac{|T_i|}{|T|} Gini(T_i)

where i indexes the i-th value of feature A and T_i is the subset of T taking that value.

This is analogous to the information gain in the ID3 algorithm, so it can be called the Gini information gain, GiniGain. For CART, which always produces a binary split (i = 1, 2), the Gini information gain of a split is:

GiniGain(T, A) = \frac{|T_1|}{|T|} Gini(T_1) + \frac{|T_2|}{|T|} Gini(T_2)
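As a small sketch of the binary-split formula (the helper name giniGain is mine, not from the article), reusing calcGini from above:

    def giniGain(leftSet, rightSet):
        """Weighted Gini index of a binary split; CART keeps the split
        with the smallest value."""
        total = float(len(leftSet) + len(rightSet))
        return (len(leftSet) / total) * calcGini(leftSet) + \
               (len(rightSet) / total) * calcGini(rightSet)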


2. Handling discrete features with three or more values

Precisely because the CART tree is binary, a discrete feature with n >= 3 values can still only produce two branches: the values are grouped into pairs of complementary subsets, and the grouping with the minimum GiniGain is taken as the fork point of the tree. For example, if a feature takes the three values ['young', 'middle', 'old'], there are the following 3 possible binary groupings (the empty set and the full set are meaningless for a CART split):

[(('young',), ('middle', 'old')), (('middle',), ('young', 'old')), (('old',), ('young', 'middle'))]

Using the CART algorithm, the Gini index of the split is computed for each binary grouping in the list above, and the grouping with the smallest GiniGain is selected as the fork for this feature in the recursive tree construction. If a feature has 4 values there are 7 binary groupings, and with 5 values there are 15. The binary groupings of a multi-valued discrete feature can be generated with Python's itertools package, as in the following program:

Source Code:

    from itertools import combinations

    def featureSplit(features):
        """Enumerate every binary grouping of a feature's values
        (the empty set and the full set are excluded)."""
        count = len(features)
        combiList = []
        for i in range(1, count):                     # subset sizes 1 .. count-1
            combiList.extend(combinations(features, i))
        combiLen = len(combiList)
        # pair each subset with its complement: first half vs. reversed second half
        return list(zip(combiList[:combiLen // 2],
                        combiList[combiLen - 1:combiLen // 2 - 1:-1]))

    if __name__ == '__main__':
        for test in (list(range(3)), list(range(4)), list(range(5)),
                     ['young', 'middle', 'old']):
            splitGroup = featureSplit(test)
            print('splitGroup', len(splitGroup), splitGroup)

Therefore, CART is not well suited to discrete features that can take many values. In such cases, if you still want to use CART, it is best to reduce the number of values of the discrete feature beforehand (for example by manually merging them).

As for the left and right branches after such a split: if a branch's value tuple contains two or more elements, should the feature continue to participate in splitting the sub-data set on that branch? TBD

I think it should, so the feature continues to participate in the recursive construction of the classification decision tree until its value on a branch is unique (at which point the feature is no longer included). The splitDataSet function for discrete features should therefore behave as follows: if, after branching on the current feature, a branch's value tuple still has two or more elements, the branch's sub-data set keeps the feature and the tuple continues to participate in the recursive tree building; otherwise the branch's sub-data set drops the feature.

Source Code:

    def splitDataSet(dataSet, axis, valueTuple):
        """Return the samples whose value on column `axis` is in valueTuple;
        the column is removed only when valueTuple contains a single value."""
        retDataSet = []
        if len(valueTuple) == 1:
            for featVec in dataSet:
                if featVec[axis] == valueTuple[0]:
                    reducedFeatVec = featVec[:axis]        # chop out the axis used for splitting
                    reducedFeatVec.extend(featVec[axis + 1:])
                    retDataSet.append(reducedFeatVec)
        else:
            for featVec in dataSet:
                if featVec[axis] in valueTuple:
                    retDataSet.append(featVec)
        return retDataSet
3. Handling continuous features

The discretization of continuous attributes follows the C4.5 approach; the difference is that the CART algorithm chooses the boundary point by minimum GiniGain. (Whether the correction term used by C4.5 is also needed here is left open; see step (3) below.) The processing is as follows:

The continuous attribute is converted into a discrete one before processing. Although the attribute's values are continuous, for a finite sample they are effectively discrete: if there are N samples, there are N-1 candidate ways to discretize, sending values <= v_j to the left subtree and values > v_j to the right subtree. All N-1 cases are then evaluated and the best split is kept (maximum gain ratio for C4.5, minimum GiniGain for CART). In addition, the attribute values should first be sorted in ascending order, and only the points where the class label changes need to be considered as cut points, which significantly reduces the amount of computation.

(1) Sort the values of the feature in ascending order.

(2) Take the midpoint between each pair of adjacent feature values as a candidate split point, divide the data set into two parts, and compute the GiniGain of each candidate split point. An optimization is to evaluate only those feature values at which the class label changes.

(3) Select the split point with the smallest GiniGain as the best split point of the feature. (Note: if the correction is applied, the GiniGain of the best split point is reduced by log2(N-1)/|D|, where N-1 is the number of candidate split points of the continuous feature and |D| is the number of training samples.) A sketch of this search follows.
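To make steps (1)-(3) concrete, here is a minimal sketch (the function name chooseBestSplitValue is my own, the "only where the class changes" optimization and the correction term are omitted). It assumes the data set is a list of samples with the class label in the last column and reuses giniGain from section 1:

    def chooseBestSplitValue(dataSet, axis):
        """Scan the midpoints between adjacent values of continuous feature `axis`
        and return the split value with the smallest weighted Gini (GiniGain)."""
        values = sorted(set(sample[axis] for sample in dataSet))
        bestGini, bestValue = float('inf'), None
        for low, high in zip(values[:-1], values[1:]):
            splitValue = (low + high) / 2.0              # midpoint candidate
            left = [s for s in dataSet if s[axis] <= splitValue]
            right = [s for s in dataSet if s[axis] > splitValue]
            gini = giniGain(left, right)
            if gini < bestGini:
                bestGini, bestValue = gini, splitValue
        return bestValue, bestGini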

A Python program that partitions a continuous-feature data set (using a NumPy matrix; since values are simply compared against a threshold, the sort step can be omitted):

Source Code:

    import numpy as np

    def binSplitDataSet(dataSet, feature, value):
        """Split a NumPy matrix into two parts on a continuous feature."""
        mat0 = dataSet[np.nonzero(dataSet[:, feature] > value)[0], :]
        mat1 = dataSet[np.nonzero(dataSet[:, feature] <= value)[0], :]
        return mat0, mat1

Here dataSet is a NumPy matrix, feature is the column index of the continuous feature among all features of the data set, and value is a value of that feature.

It is important to note the difference between discrete and continuous branching. When a data set is partitioned on a discrete feature (one value per branch), the sub-data sets no longer contain that feature, because its value is the same for every sample in a branch and its information gain or Gini gain can no longer change. When branching on a continuous feature, however, the sub-data sets on both branches must still contain the feature (the left and right branches hold the samples with values less than or equal to, and greater than, the split value, respectively), because the continuous feature may still play a decisive role deeper in the tree.

4. Handling training data in which discrete and continuous features coexist

In the tree-building process of the C4.5 and CART algorithms, discrete features and continuous features are handled by different functions. When both kinds of features coexist in the training data, the distribution type must be recognized so that the right function is invoked. There are several options:

(1) Mark each feature with its distribution type, e.g., 0 for discrete and 1 for continuous. The distribution type can then be distinguished during both training and prediction.

(2) Infer the type from the number of distinct values of the feature inside the function, e.g., treat featureValueCount > 10 as a continuous distribution (assuming a discrete feature never has that many values) and anything else as discrete. The resulting decision tree model must then record the distribution type of each feature (for example, a list of length featureCount with 0 for discrete and 1 for continuous).

Note: a yes/no discrete feature can be treated as either discrete or continuous. Treating it as continuous is simple: splitting at 0.5 does the job. In that case the criterion becomes featureValueCount > 20 or featureValueCount == 2.

(3) Use one-hot encoding: Python's sklearn.preprocessing provides OneHotEncoder() to convert discrete values into a form that can be processed like continuous values. One-hot encoding, also called one-of-N encoding, uses an N-bit status register to encode N states; each state has its own register bit, and at any time only one of them is active. A feature with m possible values becomes m binary features after encoding, and these features are mutually exclusive, with only one active at a time, so the data becomes sparse. The main benefits are that classifiers which cannot handle categorical attributes can work with such data, and to some extent the feature set is also expanded. Refer to "OneHotEncoder for data preprocessing".
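For example, a minimal sketch using scikit-learn's OneHotEncoder (assuming scikit-learn 0.20 or later, which accepts string categories directly):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    # A single discrete feature with three possible values.
    X = np.array([['young'], ['middle'], ['old'], ['young']])

    encoder = OneHotEncoder()
    X_encoded = encoder.fit_transform(X).toarray()   # one binary column per value

    print(encoder.categories_)   # the value set learned for the feature
    print(X_encoded)             # each row contains exactly one 1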

5. Pruning of CART

It is not hard to see that the recursive construction of a classification or regression tree can overfit the data. Because of noise or outliers in the training data, many branches reflect anomalies of the training set; using such a decision tree to classify data of unknown category gives poor accuracy. The remedy is to detect and cut off such branches, a process called tree pruning. Tree pruning is the standard way to deal with a tree that fits the training data too closely. Typically it uses statistical measures to remove the least reliable branches, which speeds up classification and improves the tree's ability to generalize beyond the training data. There are two common approaches: pre-pruning and post-pruning. Pre-pruning stops tree growth early according to some rule, for example when the tree reaches a user-specified depth, when the number of samples at a node falls below a user-specified count, or when the best achievable decrease in impurity falls below a user-specified threshold. Post-pruning cuts branches from a fully grown tree, turning internal nodes into leaves by removing their subtrees; many post-pruning methods exist, such as cost-complexity pruning, minimum error pruning, and pessimistic error pruning.

The second key step in building the decision tree is to prune the tree grown on the training set using an independent validation data set. TBD

For the theory of post-pruning, refer to the pruning section of "The Ten Classic Algorithms of Data Mining - CART: Classification and Regression Tree".

6. Python implementation of the CART decision tree

Compared with the ID3 and C4.5 decision tree algorithms, the implementation of the CART algorithm is structurally similar, with the following differences:

(1) The best-feature measure is the Gini gain, so the calcShannonEnt method is replaced by a calcGini method (such as the sketch shown earlier).

(2) CART uses binary splits, so for a discrete feature with multiple values the minimum-Gini binary grouping and its GiniGain must be found first. The splitDataSet method must therefore split by a value tuple, and chooseBestFeatureToSplit must return both the best fork feature and its binary grouping, such as (('middle',), ('young', 'old')). A sketch follows after this list.

(3) The decision procedure of the resulting tree model must also test discrete features against value tuples, i.e., decide whether a feature value belongs to the left branch or the right branch; for continuous features, it must decide whether the value is greater than the split value or less than or equal to it.
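To make difference (2) concrete, here is a minimal sketch of the discrete-feature split search (the function signature and return convention are my own); it reuses featureSplit, splitDataSet, and giniGain from the earlier sections and handles discrete features only:

    def chooseBestFeatureToSplit(dataSet, featureCount):
        """Return the index of the best discrete feature and its best binary grouping,
        e.g. (('middle',), ('young', 'old')), chosen by minimum weighted Gini."""
        bestGini, bestFeature, bestSplit = float('inf'), -1, None
        for axis in range(featureCount):
            values = list(set(sample[axis] for sample in dataSet))
            for leftValues, rightValues in featureSplit(values):
                leftSet = splitDataSet(dataSet, axis, leftValues)
                rightSet = splitDataSet(dataSet, axis, rightValues)
                gini = giniGain(leftSet, rightSet)
                if gini < bestGini:
                    bestGini, bestFeature, bestSplit = gini, axis, (leftValues, rightValues)
        return bestFeature, bestSplit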

(iii) CART regression tree and model tree

When the data has many features with complex relationships among them, building a single global model is difficult and somewhat clumsy. Moreover, many real-life problems are nonlinear, and a global linear model cannot fit all of the data. One approach is to cut the data set into pieces that are easy to model and then apply linear regression within each piece; if a piece is still hard to fit after the first cut, keep cutting. With this approach, tree structures combined with regression are very useful.

A regression tree is similar to a classification tree, except that the values at the leaf nodes are continuous rather than discrete, so CART can be adapted to handle regression. Depending on whether a leaf holds a specific value or a machine learning model, the CART algorithm yields either a regression tree or a model tree. In both cases the applicable scenario is the same: the label values are continuously distributed but can be divided into communities with distinct gaps between them, i.e., within each community the distribution is similar, while across communities it differs. In this sense, the regression tree and the model tree both regress and classify.

Regression deals with scenarios where the predicted value is continuously distributed, and the returned value should be a specific predicted number. The leaf of a regression tree is a specific value, so in the sense of producing continuous predictions, the regression tree arguably cannot be called a "regression" algorithm: it returns the mean of a "cluster" of data rather than a specific continuous prediction (the label values of the training data are continuous, but the predictions of the regression tree can only take discrete values). The regression tree can therefore also be viewed as a "classification" algorithm whose applicable scenarios have a "birds of a feather flock together" character: the combination of feature values places the label in a "community", and there are relatively distinct "gaps" between communities. If a person's style is continuously distributed but can be grouped into the three communities of literary, ordinary, and "2B", a regression tree can judge whether a person is literary or "2B", but it cannot measure how literary or how "2B" they are. The regression tree is thus useful for dividing complex training data into relatively simple communities, within which other machine learning models can then be applied.

The leaf of a model tree is a machine learning model, such as a linear regression model, so it is more deserving of the name "regression" algorithm. A model tree could, for instance, measure how literary a person is.

Regression trees and model trees also need pruning, and the theory is the same as for classification trees. To obtain the best model, tree pruning usually combines pre-pruning and post-pruning.
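As a concrete illustration (not code from the article), here is a minimal post-pruning sketch for a regression tree. It assumes a hypothetical tree representation in which a leaf is a plain number and an internal node is a dict with the keys 'feat', 'val', 'left', and 'right' (the left branch holding samples with feature value <= 'val'); it reuses binSplitDataSet from section (ii) and collapses a node whenever merging its two leaves lowers the squared error on an independent validation set:

    import numpy as np

    def isTree(node):
        return isinstance(node, dict)

    def collapse(node):
        """Replace a subtree by the mean of its leaf values."""
        left = collapse(node['left']) if isTree(node['left']) else node['left']
        right = collapse(node['right']) if isTree(node['right']) else node['right']
        return (left + right) / 2.0

    def prune(node, testData):
        """Post-prune against a validation set (NumPy matrix, target in the last column)."""
        if testData.shape[0] == 0:
            return collapse(node)                       # no validation data reaches this node
        gtSet, leSet = binSplitDataSet(testData, node['feat'], node['val'])
        if isTree(node['left']):
            node['left'] = prune(node['left'], leSet)
        if isTree(node['right']):
            node['right'] = prune(node['right'], gtSet)
        if not isTree(node['left']) and not isTree(node['right']):
            # compare validation error with and without merging the two leaves
            errNoMerge = (np.sum(np.power(leSet[:, -1] - node['left'], 2)) +
                          np.sum(np.power(gtSet[:, -1] - node['right'], 2)))
            merged = (node['left'] + node['right']) / 2.0
            errMerge = np.sum(np.power(testData[:, -1] - merged, 2))
            if errMerge < errNoMerge:
                return merged                           # the merged leaf generalizes better
        return node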

So how do you use CART to build a regression tree or a model tree? Read on.

1. Regression tree: selecting the branching feature by total squared difference

In tree regression, measuring the consistency of the data is necessary in order to build a tree whose leaves are piecewise constants. When a classification decision tree is created, the impurity of the categorical data is computed at each node; how, then, do we compute the "impurity" of continuous values? It is actually simple: measure the total deviation of the label values before and after splitting on a feature, and at each step select the feature (and split) that minimizes this total deviation as the best branching feature. To treat positive and negative deviations equally, the absolute value or the square is used instead of the raw difference. Why use the deviation? The smaller the deviation, the more similar the values, and the more likely they belong to one community. If the variance is used as the measure, the total variance can be computed in two ways:

(1) Compute the mean of the data set, compute the squared deviation of each data point from the mean, and sum over the N points.

(2) Compute the variance Var of the data set, then Var_sum = Var * N, where N is the number of samples. NumPy provides a var method for matrices, which makes this approach simple and convenient (see the sketch below).
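Method (2) as a minimal sketch (the name regErr is my own), assuming the data set is a NumPy matrix whose last column holds the target values:

    import numpy as np

    def regErr(dataMat):
        """Total squared error of the target column: variance times sample count."""
        return np.var(dataMat[:, -1]) * dataMat.shape[0]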

As with the Gini gain for discrete and continuous features, a multi-valued discrete feature requires selecting the best binary grouping of its values, while a continuous feature requires finding the best split point.

The selection process for the best branching feature at each step is therefore:

Function chooseBestSplitFeature():

(1) Initialize the best total variance to infinity: bestVar = inf.

(2) For each feature in turn (featureCount iterations), compute the total variance of the data after splitting on that feature (the sum of the total variances of the left and right sub-data sets); if currentVar < bestVar, set bestVar = currentVar.

(3) Return the best branching feature, its branching value (a binary grouping for a discrete feature, a split point for a continuous feature), and the left and right sub-data sets. A sketch for the continuous-feature case follows.
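Putting steps (1)-(3) together for the continuous-feature case, here is a minimal sketch (the stopping thresholds minErrDrop and minSamples are my own additions, and discrete features are not handled); it reuses regErr and binSplitDataSet from above and returns only the feature and split value, from which the caller can recover the two sub-data sets with binSplitDataSet:

    import numpy as np

    def chooseBestSplitFeature(dataMat, minErrDrop=1.0, minSamples=4):
        """Return (bestFeature, bestValue), or (None, leafValue) when no split
        reduces the total squared error enough."""
        baseErr = regErr(dataMat)
        bestErr, bestFeat, bestVal = np.inf, None, None
        for feat in range(dataMat.shape[1] - 1):              # last column is the target
            for val in set(np.asarray(dataMat[:, feat]).flatten()):
                gtSet, leSet = binSplitDataSet(dataMat, feat, val)
                if gtSet.shape[0] < minSamples or leSet.shape[0] < minSamples:
                    continue
                newErr = regErr(gtSet) + regErr(leSet)
                if newErr < bestErr:
                    bestErr, bestFeat, bestVal = newErr, feat, val
        if bestFeat is None or baseErr - bestErr < minErrDrop:
            return None, float(np.mean(dataMat[:, -1]))       # stop: this node becomes a leaf
        return bestFeat, bestVal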

2. Building a model tree using the prediction error of linear regression

When modelling data with a tree, besides setting the leaf nodes to constant values, another option is to set each leaf node to a piecewise linear function; "piecewise linear" means the model consists of multiple linear segments. This is the model tree. Interpretability is one advantage of the model tree over the regression tree, and the model tree also tends to have higher predictive accuracy.

The creation of a model tree is essentially the same as that of a regression tree; the difference lies in how the deviation is computed when selecting the best branching feature during recursion. For the model tree, the given sub-data set is first fitted with a linear model, the differences between the true target values and the model's predictions are computed, and the squares of these differences are summed to give the total deviation; the feature (and split) with the smallest total deviation is chosen as the branching feature. For the linear regression itself, refer to the solution of the linear regression model.
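A minimal sketch of that deviation measure in the common least-squares formulation (the names linearSolve and modelErr are mine; the data set is assumed to be a NumPy matrix with the target in the last column). modelErr can simply replace regErr when choosing the best split for a model tree:

    import numpy as np

    def linearSolve(dataMat):
        """Ordinary least squares y = X*w on this subset, with a bias column of ones."""
        m, n = dataMat.shape
        X = np.mat(np.ones((m, n)))
        X[:, 1:n] = dataMat[:, 0:n - 1]                  # features; column 0 stays 1 (bias)
        y = dataMat[:, -1]
        xTx = X.T * X
        if np.linalg.det(xTx) == 0.0:
            raise ValueError('singular matrix: cannot fit a linear model on this subset')
        w = xTx.I * (X.T * y)
        return w, X, y

    def modelErr(dataMat):
        """Sum of squared differences between the targets and the linear fit."""
        w, X, y = linearSolve(dataMat)
        yHat = X * w
        return float(np.sum(np.power(y - yHat, 2)))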

3. Python implementation of the CART regression tree and model tree

The learning package for the CART regression tree and model tree is:

TBD


This article was written by Adan and originally appeared as "The Classic Algorithms of Machine Learning and Their Python Implementation: CART Classification Decision Tree and Classification and Regression Tree". Please indicate the source when reposting.

