Decision tree and random forest algorithm


Decision Tree

A decision tree model is a tree-structured model that classifies or regresses instances based on their features. According to some feature, the data is divided into several sub-regions (subtrees), and each sub-region is in turn divided recursively; when a stopping condition is satisfied, the region becomes a leaf node, and regions that do not meet the condition continue to be divided recursively.

Figure: a simple decision tree classification model; the red boxes are features.

Learning a decision tree model usually consists of three steps: feature selection, decision tree generation, and decision tree pruning.

1. Feature Selection

Selecting features in a different order produces different decision trees, and selecting good features makes the labels within each resulting subset purer. There are several criteria for measuring the quality of a subset, such as the error rate, information gain, information gain ratio, and Gini index.

1.1 Error Rate

Splitting the training data $D$ on feature $A$ produces a number of child nodes. For each child node $D_c$, the most frequent class label in that node, denoted $\hat{y}_c$, is used as the node's prediction. The error rate is then defined as $\frac{1}{|D|}\sum_{c}\sum_{i=1}^{|D_c|} I\{y_i \neq \hat{y}_c\}$, i.e. the fraction of training samples that do not belong to the majority class of their child node.
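As an illustration, here is a minimal sketch of this criterion, assuming the dataset is a list of rows whose last column is the class label (the same format used by the book code later in this article); the function name splitErrorRate is illustrative:

from collections import Counter

def splitErrorRate(dataSet, featIndex):
    # dataSet: list of rows, last column is the class label
    # featIndex: index of the feature used to split the data
    groups = {}
    for row in dataSet:
        groups.setdefault(row[featIndex], []).append(row[-1])
    errors = 0
    for labels in groups.values():
        majorityCount = Counter(labels).most_common(1)[0][1]
        errors += len(labels) - majorityCount      # samples not in the majority class of their node
    return errors / float(len(dataSet))            # (1/|D|) * total number of misclassified samples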

1.2 Information gain

Entropy and conditional entropy: entropy measures the uncertainty of a random variable. Let $X$ be a finite discrete random variable with $p_i = P(X = x_i)$. Its entropy is defined as $H(X) = -\sum_{i=1}^{n} p_i \log p_i$. The larger the entropy, the greater the uncertainty of the random variable; when $X$ takes one value with probability 1, the corresponding entropy $H(X)$ is 0, meaning the random variable has no uncertainty. Conditional entropy measures the uncertainty of a random variable $Y$ given a random variable $X$, and is defined as $H(Y|X) = \sum_{i=1}^{n} p_i H(Y|X = x_i)$, where $p_i = P(X = x_i)$. Here $X$ stands for a feature, so $H(Y|X)$ is the entropy of the labels $Y$ after the data has been partitioned according to that feature. The stronger a feature's ability to separate the classes, the smaller the conditional entropy $H(Y|X)$ and the smaller the remaining uncertainty.

Information gain: the information gain of feature $A$ on training dataset $D$ is defined as $g(D, A) = H(D) - H(D|A)$, i.e. the degree to which feature $A$ reduces the uncertainty of dataset $D$. The greater the information gain, the stronger the classification ability of the feature.
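A minimal sketch of computing $g(D, A)$, again assuming the list-of-rows dataset format; the names entropy and informationGain are illustrative:

from math import log
from collections import Counter

def entropy(labels):
    # H = -sum(p_i * log2(p_i)) over the class distribution
    total = float(len(labels))
    return -sum((c / total) * log(c / total, 2) for c in Counter(labels).values())

def informationGain(dataSet, featIndex):
    # g(D, A) = H(D) - H(D|A)
    baseEntropy = entropy([row[-1] for row in dataSet])
    groups = {}
    for row in dataSet:
        groups.setdefault(row[featIndex], []).append(row[-1])
    condEntropy = sum(len(lbls) / float(len(dataSet)) * entropy(lbls)
                      for lbls in groups.values())
    return baseEntropy - condEntropy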

1.3 Information gain ratio

The information gain ratio is another way to measure a feature's classification ability. Define the entropy of training data $D$ with respect to the values of feature $A$ as $H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$, where $|D|$ is the total number of training samples and $|D_i|$ is the number of samples for which feature $A$ takes its $i$-th value. The greater the information gain ratio, the stronger the feature's classification ability.

$g_R(D, A) = \frac{g(D, A)}{H_A(D)}$
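A minimal sketch, reusing the entropy and informationGain functions from the sketch above; the name gainRatio is illustrative:

def gainRatio(dataSet, featIndex):
    # g_R(D, A) = g(D, A) / H_A(D); H_A(D) is the entropy of the
    # distribution of feature A's values over the dataset
    featValues = [row[featIndex] for row in dataSet]
    splitInfo = entropy(featValues)
    if splitInfo == 0:                 # the feature takes only one value
        return 0.0
    return informationGain(dataSet, featIndex) / splitInfo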

1.4 Gini Index

Suppose the random variable $X$ can take $K$ discrete values with $P(X = k) = p_k$. The Gini index of $X$ is defined as $Gini(X) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$. For a given sample set $D$, the Gini index is $Gini(D) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|D|} \right)^2$, where $|C_k|$ is the number of samples in $D$ belonging to class $k$ and $K$ is the number of classes. The Gini index measures the uncertainty of the sample set, so the smaller the Gini index of a partition, the stronger the classification ability of the corresponding feature.
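A minimal sketch of the Gini index of a label set, together with the weighted Gini index of a binary split as used by CART later in this article; the names gini and giniSplit are illustrative:

from collections import Counter

def gini(labels):
    # Gini(D) = 1 - sum((|C_k| / |D|)^2)
    total = float(len(labels))
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def giniSplit(dataSet, featIndex, value):
    # weighted Gini index after splitting D into D1 (feature == value) and D2 (feature != value)
    d1 = [row[-1] for row in dataSet if row[featIndex] == value]
    d2 = [row[-1] for row in dataSet if row[featIndex] != value]
    total = float(len(dataSet))
    return len(d1) / total * gini(d1) + len(d2) / total * gini(d2)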

Figure: for the binary classification case, the relationship between entropy, the Gini index, and the error rate as measures of the uncertainty of a sample set (for ease of comparison, the entropy is halved).

2. Generation of decision Trees

ID3 and C4.5 are the classical classification decision tree algorithms. The only difference between them is that ID3 uses information gain as the feature selection criterion, while C4.5 uses the information gain ratio. The ID3 tree generation algorithm is described below; C4.5 is similar.

ID3 algorithm

Starting from the root node, compute the information gain of every candidate feature $A$ on dataset $D$, select the feature with the largest information gain as the splitting condition, and create a child node for each value of that feature. The same procedure is then applied recursively to each subset, until no features remain to choose from or the information gain becomes too small.

The following code is from Machine Learning in Action:

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):    # all samples belong to the same class
        return classList[0]                                # return that class as a leaf node
    if len(dataSet[0]) == 1:                               # no features left to choose from
        return majorityCnt(classList)                      # return the majority class as a leaf node
    bestFeat = chooseBestFeatureToSplit(dataSet)           # choose the feature A with the largest information gain
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}                           # create the node
    del(labels[bestFeat])                                  # remove feature A from the feature list
    featValues = [example[bestFeat] for example in dataSet]    # all possible values of feature A
    uniqueVals = set(featValues)                           # de-duplicate
    for value in uniqueVals:                               # build a subtree for each possible value
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)    # recurse
    return myTree
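createTree relies on several helper functions from the same book (calcShannonEnt, splitDataSet, chooseBestFeatureToSplit, majorityCnt). A minimal sketch of them is given here so the routine can actually be run; details may differ slightly from the book's version:

from math import log
from collections import Counter

def calcShannonEnt(dataSet):
    # entropy of the class labels stored in the last column
    total = float(len(dataSet))
    counts = Counter(row[-1] for row in dataSet)
    return -sum((c / total) * log(c / total, 2) for c in counts.values())

def splitDataSet(dataSet, axis, value):
    # rows whose feature 'axis' equals 'value', with that feature column removed
    return [row[:axis] + row[axis + 1:] for row in dataSet if row[axis] == value]

def chooseBestFeatureToSplit(dataSet):
    # index of the feature with the largest information gain
    baseEntropy = calcShannonEnt(dataSet)
    bestGain, bestFeat = 0.0, -1
    for i in range(len(dataSet[0]) - 1):
        subsets = [splitDataSet(dataSet, i, v) for v in set(row[i] for row in dataSet)]
        newEntropy = sum(len(sub) / float(len(dataSet)) * calcShannonEnt(sub) for sub in subsets)
        if baseEntropy - newEntropy > bestGain:
            bestGain, bestFeat = baseEntropy - newEntropy, i
    return bestFeat

def majorityCnt(classList):
    # most frequent class label
    return Counter(classList).most_common(1)[0][0]

A small usage example with the book's toy dataset (note that createTree modifies the labels list, hence the copy):

dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
labels = ['no surfacing', 'flippers']
print(createTree(dataSet, labels[:]))
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}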

3. Pruning of decision Trees

Without any constraint on complexity, a fully grown tree easily overfits: it fits the training data well but performs poorly on unseen data. The generated decision tree is therefore pruned to cut away unnecessary branches. The complexity of the decision tree can be controlled by adding a regularization term. Define $C_\alpha(T)$ as the loss of decision tree $T$, where $C(T)$ is the model's prediction error on the training data and $|T|$ is the complexity of the model, i.e. the number of leaf nodes.

$C_\alpha(T) = C(T) + \alpha |T|$

The parameter $\alpha$ trades off the training error against the model complexity.

(1) Calculate the empirical entropy of each node

(2) Recursively work upward from the leaf nodes of the tree. For each group of leaves, compute the loss of the tree before collapsing them back into their parent node, $C_\alpha(T_B)$, and after, $C_\alpha(T_A)$. If $C_\alpha(T_A) \le C_\alpha(T_B)$, the loss is smaller after pruning this node, so prune it.

(3) Repeat step (2) until no further pruning is possible. A minimal sketch of the comparison in step (2) is given after this list.
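A minimal sketch of the local cost comparison, assuming $C(T)$ is measured by the empirical entropy of the leaves weighted by their sample counts (as suggested by step (1)); the names leafLoss and shouldPrune are illustrative, and entropy() is the sketch from section 1.2:

def leafLoss(labels):
    # N_t * H_t(T): empirical entropy of a leaf, weighted by its number of samples
    return len(labels) * entropy(labels)

def shouldPrune(childLabelLists, alpha):
    # childLabelLists: the class labels of the samples in each leaf under the candidate node
    # loss before pruning: sum of the leaves' losses + alpha * (number of leaves)
    lossBefore = sum(leafLoss(lbls) for lbls in childLabelLists) + alpha * len(childLabelLists)
    # loss after pruning: the leaves are merged into a single leaf
    merged = [label for lbls in childLabelLists for label in lbls]
    lossAfter = leafLoss(merged) + alpha * 1
    return lossAfter <= lossBefore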

4. CART algorithm

The CART (classification and regression tree) algorithm is a decision tree algorithm that can perform both classification and regression.

Feature Selection

The regression tree uses the squared error minimization criterion: the squared error of dataset $D$ is defined as $\sum_{i=1}^{|D|} (y_i - y')^2$, where $y'$ is the mean of the target values in $D$.

The classification tree selects the Gini index minimization criterion.

CART Tree Generation

Generation of regression trees

(1) Consider every feature of dataset D; for each feature, traverse all possible values or split points, and divide D into two parts, D1 and D2.

(2) Compute the squared error of the two subsets, select the feature and split point with the smallest total squared error, and create the two corresponding child nodes.

(3) Apply steps (1) and (2) recursively to the two child nodes, until no feature can be selected, the number of samples in a node is too small, or the squared error falls below a threshold; such a node becomes a leaf and returns the mean of the samples it contains (a regression model can also be trained in the leaf and used as the prediction). A sketch of the split search in steps (1) and (2) is given after this list.
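A minimal sketch of the split search for the regression tree, assuming numeric features in a list-of-rows dataset whose last column is the continuous target; the names squaredError and chooseBestSplitRegression are illustrative:

def squaredError(values):
    # sum of (y_i - mean)^2 over a set of target values
    mean = sum(values) / float(len(values))
    return sum((y - mean) ** 2 for y in values)

def chooseBestSplitRegression(dataSet):
    # dataSet: list of rows, last column is a continuous target value
    bestErr, bestFeat, bestVal = float('inf'), None, None
    for featIndex in range(len(dataSet[0]) - 1):
        for value in set(row[featIndex] for row in dataSet):
            d1 = [row[-1] for row in dataSet if row[featIndex] <= value]
            d2 = [row[-1] for row in dataSet if row[featIndex] > value]
            if not d1 or not d2:
                continue
            err = squaredError(d1) + squaredError(d2)
            if err < bestErr:
                bestErr, bestFeat, bestVal = err, featIndex, value
    return bestFeat, bestVal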

Generation of classification trees

(1) Consider every feature of dataset D; for each feature, traverse all possible values or split points, and divide D into two parts, D1 and D2.

(2) Compute the Gini index of the two subsets, select the feature and split point with the smallest weighted Gini index, and create the two corresponding child nodes.

(3) Apply steps (1) and (2) recursively to the two child nodes, until no feature can be selected, the number of samples in a node is below a threshold, or the Gini index falls below a threshold; such a node becomes a leaf and returns the majority class of the samples it contains. A sketch of the Gini-based split search is given after this list.
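A minimal sketch of the split search for the classification tree, reusing the giniSplit function from the sketch in section 1.4; the name chooseBestSplitGini is illustrative:

def chooseBestSplitGini(dataSet):
    # dataSet: list of rows, last column is the class label
    bestGini, bestFeat, bestVal = float('inf'), None, None
    for featIndex in range(len(dataSet[0]) - 1):
        for value in set(row[featIndex] for row in dataSet):
            g = giniSplit(dataSet, featIndex, value)
            if g < bestGini:
                bestGini, bestFeat, bestVal = g, featIndex, value
    return bestFeat, bestVal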

CART Pruning

The CART pruning algorithm prunes the fully grown tree $T_0$ repeatedly, from the bottom of the tree up toward the root. For each internal node $t$ of the fully grown tree, a pruning coefficient $g(t)$ is computed; it represents the degree to which the overall loss decreases after the subtree rooted at $t$ is pruned, i.e. how much the training error increases relative to the reduction in complexity. The node with the smallest $g(t)$ in $T_0$ is pruned, yielding subtree $T_1$; the process continues in this way until only the root remains, producing the subtree sequence $T_0, T_1, T_2, \dots, T_n$. Each subtree in the sequence is then evaluated on a separate validation set using the squared error or Gini index, and the subtree with the smallest value is chosen as the optimal decision tree.
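In the standard CART cost-complexity pruning formulation, the pruning coefficient is $g(t) = \frac{C(t) - C(T_t)}{|T_t| - 1}$, where $C(t)$ is the training error of node $t$ when it is collapsed into a single leaf, $C(T_t)$ is the training error of the subtree rooted at $t$, and $|T_t|$ is its number of leaves. A minimal sketch (the function name is illustrative):

def pruningCoefficient(nodeError, subtreeError, numLeaves):
    # g(t) = (C(t) - C(T_t)) / (|T_t| - 1)
    # nodeError:    training error if node t is collapsed into a single leaf, C(t)
    # subtreeError: total training error of the leaves of the subtree rooted at t, C(T_t)
    # numLeaves:    number of leaves of that subtree, |T_t|
    return (nodeError - subtreeError) / float(numLeaves - 1)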

5. Random Forest

The simplest RF (Random Forest) algorithm is a combination of bagging and fully grown CART trees.

Bagging builds multiple classification or regression models and uses voting or averaging to produce the final prediction, which reduces overfitting.

Bagging uses bootstrap sampling to draw the training sample for each decision tree. Because the subset of samples used in each round is largely different, the trained models are less correlated with one another. To further reduce the correlation between models, the features of the training data can also be randomly sampled before each round of training, and random feature selection can be performed at each split of the decision tree. A minimal sketch of this combination follows.
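A minimal sketch of bagging over fully grown trees with random feature selection at each split, using NumPy and scikit-learn's DecisionTreeClassifier as the base CART learner; the function names baggedForest and forestPredict are illustrative, and scikit-learn's own RandomForestClassifier implements the same idea directly:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def baggedForest(X, y, nTrees=100, maxFeatures='sqrt', seed=None):
    # each tree is trained on a bootstrap sample of the training data;
    # max_features limits the features considered at each split (random feature selection)
    rng = np.random.default_rng(seed)
    trees = []
    n = len(X)
    for _ in range(nTrees):
        idx = rng.integers(0, n, size=n)                         # bootstrap sample (with replacement)
        tree = DecisionTreeClassifier(max_features=maxFeatures)  # fully grown CART tree (no depth limit)
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def forestPredict(trees, X):
    # majority vote over the individual trees' predictions (assumes integer class labels)
    preds = np.array([t.predict(X) for t in trees]).astype(int)
    return np.array([np.bincount(col).argmax() for col in preds.T])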
