Decision tree algorithm Summary


Reference: Tom Mitchell, Machine Learning, and http://blog.csdn.net/v_july_v/article/details/7577684

I. Introduction

A decision tree is a prediction model that represents a mapping between object attributes and object values. Each node in the tree represents an object, each fork path represents a possible attribute value, and each leaf node corresponds to the value of the object represented by the path from the root node to that leaf. A decision tree has only one output; to produce multiple outputs, you can build an independent decision tree for each output. Decision trees are a frequently used technique in data mining and can be applied to data analysis and prediction (for example, a bank officer might use one to predict loan risk).

The machine learning technique used to generate a decision tree from data is called decision tree learning, or simply decision trees.

A decision tree contains three types of nodes: 1. decision nodes, usually represented by rectangles; 2. chance nodes, usually represented by circles; 3. end nodes, usually represented by triangles.

Decision tree learning is also a common method in data mining. Each decision tree describes a tree structure whose branches classify objects by their attribute values. A decision tree can be trained by splitting the source data set into subsets based on attribute tests, repeating the process recursively on each derived subset; the recursion is complete when further splitting adds nothing or a branch contains only a single class. In addition, a random forest classifier combines many decision trees to improve classification accuracy.

II. Decision Tree Algorithms

1. ID3 algorithm

ID3 is a decision tree algorithm invented by Ross Quinlan. The algorithm is based on Occam's razor: the smaller the decision tree, the better (the simplicity principle). However, the algorithm does not always generate the smallest possible tree; it is a heuristic.

The ID3 algorithm as described in Tom Mitchell's Machine Learning:

A conceptual description of the ID3 algorithm (a personal summary, for reference only):

A. For the current example set, calculate the information gain of each attribute;

B. Select the attribute Ai with the largest information gain (information gain is defined in detail below);

C. Group the examples that share the same value of Ai into a subset; if Ai takes several distinct values, the set is split into that many subsets;

D. Recursively call the tree-building algorithm on the subset for each value in turn;

E. If a subset contains examples of only one class, that branch becomes a leaf node labeled with the corresponding class, and the call returns (a minimal sketch of this recursion follows the list).
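To make steps A through E concrete, here is a minimal, self-contained sketch of the recursion in C++. The data layout and the names used (Example, Node, BuildTree) are illustrative assumptions for this summary, not the structures used in the complete implementation linked later in this article.

// A minimal sketch of the ID3 recursion in steps A-E (illustrative names and data layout).
#include <cmath>
#include <cstdio>
#include <map>
#include <memory>
#include <set>
#include <string>
#include <vector>

struct Example {
    std::map<std::string, std::string> attrs;  // attribute name -> value
    std::string label;                         // target class, e.g. "yes" / "no"
};

struct Node {
    std::string attribute;                                  // attribute tested at this node
    std::string label;                                      // class label if this is a leaf
    std::map<std::string, std::unique_ptr<Node>> children;  // attribute value -> subtree
};

// Entropy of a sample set with respect to the class label.
static double Entropy(const std::vector<Example>& s) {
    std::map<std::string, int> counts;
    for (const auto& e : s) counts[e.label]++;
    double h = 0.0;
    for (const auto& kv : counts) {
        double p = static_cast<double>(kv.second) / s.size();
        h -= p * std::log2(p);
    }
    return h;
}

// Gain(S, A) = Entropy(S) - sum over values v of A of |Sv|/|S| * Entropy(Sv).
static double Gain(const std::vector<Example>& s, const std::string& a) {
    std::map<std::string, std::vector<Example>> parts;
    for (const auto& e : s) parts[e.attrs.at(a)].push_back(e);
    double expected = 0.0;
    for (const auto& kv : parts)
        expected += static_cast<double>(kv.second.size()) / s.size() * Entropy(kv.second);
    return Entropy(s) - expected;
}

// Steps A-E: pick the attribute with the largest gain, split on it, and recurse.
// Assumes s is non-empty; when attributes run out, a full implementation would
// label the leaf with the majority class instead of the first example's class.
std::unique_ptr<Node> BuildTree(const std::vector<Example>& s,
                                std::set<std::string> attributes) {
    auto node = std::make_unique<Node>();
    bool pure = true;                                       // step E: only one class left?
    for (const auto& e : s) pure = pure && (e.label == s.front().label);
    if (pure || attributes.empty()) {
        node->label = s.front().label;
        return node;
    }
    std::string best;                                       // steps A-B: best attribute
    double best_gain = -1.0;
    for (const auto& a : attributes) {
        double g = Gain(s, a);
        if (g > best_gain) { best_gain = g; best = a; }
    }
    node->attribute = best;
    attributes.erase(best);
    std::map<std::string, std::vector<Example>> parts;      // step C: one subset per value
    for (const auto& e : s) parts[e.attrs.at(best)].push_back(e);
    for (auto& [value, subset] : parts)                     // step D: recurse on each subset
        node->children[value] = BuildTree(subset, attributes);
    return node;
}

int main() {
    // Two toy examples with a single attribute, just to exercise BuildTree.
    std::vector<Example> s = {
        {{{"Wind", "Weak"}}, "yes"},
        {{{"Wind", "Strong"}}, "no"},
    };
    auto root = BuildTree(s, {"Wind"});
    for (const auto& [value, child] : root->children)
        std::printf("Wind = %s -> %s\n", value.c_str(), child->label.c_str());
    return 0;
}

A full implementation would also handle an empty example set and pick the majority class when attributes are exhausted; the sketch keeps only the core of steps A through E.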

 

2. Optimal Classification attributes

Determining which attribute is the best classification attribute to test at each node is the core issue of the ID3 algorithm. Two important concepts are introduced here: the measure on which information gain is based, entropy, and the information gain Gain(S, A) itself.

The following is adapted from Machine Learning and the reference above:

1) The measure used for information gain: entropy

To define information gain precisely, we first define a widely used measure from information theory called entropy, which characterizes the (im)purity of an arbitrary sample set. Given a sample set S containing positive and negative examples of some target concept, the entropy of S relative to this Boolean classification is:

Entropy(S) = -(p+) log2(p+) - (p-) log2(p-)

In the formula, p+ is the proportion of positive examples in S and p- is the proportion of negative examples; for example, in the play-badminton example from the reference, p+ is the fraction of days on which badminton is played and p- the fraction on which it is not. (In all calculations involving entropy we define 0 log2 0 = 0.)

Related code implementation (the code is somewhat terse; for details and explanation see http://blog.csdn.net/yangliuy/article/details/7322015, which contains the complete ID3 code):

// Compute the entropy for a given attribute and value over the remaining examples.
// remain_state holds the remaining training rows; attribute_row, MAXLEN, and yes are
// globals defined in the complete implementation linked above.
double ComputeEntropy(vector<vector<string> > remain_state, string attribute,
                      string value, bool ifparent) {
    vector<int> count(2, 0);                 // count[0]: positive, count[1]: negative
    unsigned int i, j;
    bool done_flag = false;                  // sentinel: stop once the attribute column is found
    for (j = 1; j < MAXLEN; j++) {
        if (done_flag) break;
        if (!attribute_row[j].compare(attribute)) {
            for (i = 1; i < remain_state.size(); i++) {
                // ifparent == true: compute entropy over all rows (the parent node);
                // otherwise only over rows whose attribute equals the given value.
                if ((!ifparent && !remain_state[i][j].compare(value)) || ifparent) {
                    if (!remain_state[i][MAXLEN - 1].compare(yes))
                        count[0]++;
                    else
                        count[1]++;
                }
            }
            done_flag = true;
        }
    }
    if (count[0] == 0 || count[1] == 0) return 0;  // all positive or all negative: entropy is 0
    // Entropy for [+count[0], -count[1]]; log2 is obtained from the natural log
    // via the change-of-base formula.
    double sum = count[0] + count[1];
    double entropy = -count[0] / sum * log(count[0] / sum) / log(2.0)
                     - count[1] / sum * log(count[1] / sum) / log(2.0);
    return entropy;
}

For example, suppose S is a set of 14 examples of a Boolean concept, containing 9 positive and 5 negative examples (we summarize such a sample with the notation [9+, 5-]). Then the entropy of S relative to this Boolean classification is:

Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Note that if all members of S belong to the same class, Entropy(S) = 0. For example, if all members are positive (p+ = 1 and p- = 0), then Entropy(S) = -1 * log2(1) - 0 * log2(0) = 0. If S contains equal numbers of positive and negative examples, Entropy(S) = 1; if the numbers are unequal, the entropy lies between 0 and 1. In information-theoretic terms, entropy specifies the minimum number of bits needed, on average, to encode the classification of an arbitrary member of S.

More generally, if the target attribute can take c different values, the entropy of S relative to this c-way classification is defined as:

Entropy(S) = sum for i = 1 to c of -pi log2(pi)

where pi is the proportion of S belonging to class i. Note that the logarithm is still base 2, because entropy measures the expected encoding length in bits; if the target attribute can take c possible values, the maximum possible entropy is log2(c).

2) Information gain Gain(S, A): the expected reduction in entropy

With entropy as a measure of the purity of a training sample set, we can now define a measure of how effectively an attribute classifies the training data. This measure is called information gain. Simply put, the information gain of an attribute is the reduction in expected entropy obtained by partitioning the samples on that attribute; in other words, it measures the ability of a given attribute to separate the training samples. More precisely, the information gain Gain(S, A) of an attribute A relative to a sample set S is defined as:

Gain(S, A) = Entropy(S) - sum over v in Values(A) of (|Sv| / |S|) * Entropy(Sv)

where Values(A) is the set of all possible values of attribute A, and Sv is the subset of S for which attribute A has value v. The first term in the formula is the entropy of the original set S; the second term is the expected entropy after S is partitioned on A, that is, the weighted sum of the entropies of the subsets, where the weight of each subset Sv is its fraction |Sv| / |S| of the original sample S. Gain(S, A) is therefore the reduction in expected entropy obtained by knowing the value of attribute A. Put differently, Gain(S, A) is the information provided about the target function value given the value of attribute A: it is the number of bits that can be saved when encoding the target value of an arbitrary member of S, once the value of A is known.

In summary, there are two basic formulas: the first, Entropy(S), is the definition of entropy; the second, Gain(S, A), is the definition of information gain and is computed from Entropy(S).

Following the example in Machine Learning, suppose S is a set of weather-related training samples whose attributes include Wind, with values Weak and Strong. As before, assume S contains 14 samples, [9+, 5-]. Of these 14 samples, suppose 6 of the positive and 2 of the negative examples have Wind = Weak, and the remainder have Wind = Strong. The information gain obtained by classifying these 14 samples on the attribute Wind can then be calculated as:

Gain(S, Wind) = Entropy(S) - (8/14) Entropy(S_Weak) - (6/14) Entropy(S_Strong)
= 0.940 - (8/14)(0.811) - (6/14)(1.00) = 0.048

Information gain is exactly the measure the ID3 algorithm uses to select the best attribute at each step while growing the tree. As a further illustration, the information gains of two different attributes, Humidity and Wind, can be computed to determine which is the better classification attribute for the training samples.
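As a quick numerical check of the Wind example, the following small standalone snippet recomputes Entropy(S) and Gain(S, Wind) from the counts given above ([9+, 5-] overall, [6+, 2-] for Weak, [3+, 3-] for Strong); the helper names are illustrative:

#include <cmath>
#include <cstdio>

// Entropy of a two-class set, given the positive and negative counts.
static double Entropy(double pos, double neg) {
    double sum = pos + neg, h = 0.0;
    if (pos > 0) h -= pos / sum * std::log2(pos / sum);
    if (neg > 0) h -= neg / sum * std::log2(neg / sum);
    return h;
}

int main() {
    // S = [9+, 5-]; Wind = Weak -> [6+, 2-]; Wind = Strong -> [3+, 3-].
    double hS = Entropy(9, 5);                           // about 0.940
    double gain = hS - 8.0 / 14 * Entropy(6, 2)          // weight |S_Weak| / |S| = 8/14
                     - 6.0 / 14 * Entropy(3, 3);         // weight |S_Strong| / |S| = 6/14
    std::printf("Entropy(S) = %.3f, Gain(S, Wind) = %.3f\n", hS, gain);  // 0.940, 0.048
    return 0;
}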

The calculation above shows that, relative to the target classification, Humidity provides greater information gain than Wind.

The figure excerpted from Machine Learning shows part of the decision tree formed after the first step of ID3: Outlook, the attribute with the maximum information gain, is selected as the root.

All samples on the middle branch (Outlook = Overcast) are positive examples, so that branch becomes a leaf node with target category Yes. The other two nodes are expanded further by selecting, over their new sample subsets, the attribute with the highest information gain.

For the complete code above, see http://blog.csdn.net/yangliuy/article/details/7322015

3. Another decision tree algorithm C4.5

Here is a brief introduction.

1) Overview:

Because the ID3 algorithm has some problems in practical applications, Quinlan proposed the C4.5 algorithm. Strictly speaking, C4.5 is only an improved version of ID3.

The C4.5 algorithm inherits the advantages of the ID3 algorithm and improves on it in the following respects:

  • The information gain ratio is used to select attributes, which overcomes the bias toward attributes with many values; for the definition of the information gain ratio, see Section 1.2 of "Decision Tree Classification Technology Research" by Yan Lihua and Ji Genlin.
  • Pruning during tree construction;
  • Discretization of continuous attributes (a sketch of the usual threshold-selection approach follows this list);
  • Can process incomplete data.
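As an illustration of the third point above, a common way to discretize a continuous attribute is to sort its values, treat the midpoints between adjacent distinct values as candidate thresholds, and keep the threshold whose binary split yields the highest information gain (C4.5 then compares attributes using the gain ratio). The following is a minimal sketch under that reading; the function names and the (value, label) pair representation are illustrative assumptions:

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

// Entropy of a two-class set, given the positive and negative counts.
static double Entropy(double pos, double neg) {
    double sum = pos + neg, h = 0.0;
    if (pos > 0) h -= pos / sum * std::log2(pos / sum);
    if (neg > 0) h -= neg / sum * std::log2(neg / sum);
    return h;
}

// Choose the threshold t that maximizes the information gain of the binary split
// (value <= t) versus (value > t). samples holds (attribute value, is_positive) pairs.
static double BestThreshold(std::vector<std::pair<double, bool>> samples) {
    std::sort(samples.begin(), samples.end());
    double pos = 0, neg = 0;
    for (const auto& s : samples) (s.second ? pos : neg) += 1;
    const double base = Entropy(pos, neg);
    const double n = static_cast<double>(samples.size());
    double best_gain = -1.0, best_t = samples.front().first;
    double lpos = 0, lneg = 0;                     // counts on the "<= threshold" side
    for (size_t i = 0; i + 1 < samples.size(); ++i) {
        (samples[i].second ? lpos : lneg) += 1;
        if (samples[i].first == samples[i + 1].first) continue;  // cannot split between equal values
        double left = lpos + lneg, right = n - left;
        double gain = base - left / n * Entropy(lpos, lneg)
                           - right / n * Entropy(pos - lpos, neg - lneg);
        if (gain > best_gain) {
            best_gain = gain;
            best_t = (samples[i].first + samples[i + 1].first) / 2;  // midpoint threshold
        }
    }
    std::printf("best threshold = %.1f (gain %.3f)\n", best_t, best_gain);
    return best_t;
}

int main() {
    // Illustrative humidity-style values with yes/no class labels.
    BestThreshold({{65, true}, {70, true}, {70, false}, {75, true},
                   {80, false}, {85, false}, {90, false}, {95, true}});
    return 0;
}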

The C4.5 algorithm has the following advantages: the generated classification rules are easy to understand, and their accuracy is high. Its disadvantage is that the data set must be scanned and sorted multiple times while the tree is constructed, which makes the algorithm inefficient. In addition, C4.5 is only suitable for data sets that can reside in memory; when the training set is too large to fit in memory, the program cannot run.

2) Main steps:

A. Read the file information and compute the basic statistics

B. Create a decision tree

    • If the sample set is empty, a tree node with zero information count is returned.
    • If the samples are of the same category, a leaf node is generated and returned.
    • Count the numbers of positive and negative samples at the node;
    • If only the class attribute remains for the samples, generate a leaf node and assign it the corresponding class index;
    • Otherwise, select the attribute with the highest gain ratio (continuous attributes must first be discretized, using the gain ratio to choose the split); based on the values of that attribute, define new sample subsets and a new attribute set, and build the corresponding subtrees.

C. Post-pruning (pessimistic error rate estimation)

D. Output Decision Tree

E. Release the decision tree when it is no longer needed

The key points are the calculation of the information gain ratio, the calculation of the pessimistic error rate during post-pruning, and the construction of the tree (divide and conquer).

The information gain ratio is calculated as:

GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)

SplitInfo(S, A) = - sum over the values i of A of (|Si| / |S|) log2(|Si| / |S|)

where Gain(S, A) is the information gain defined earlier, and SplitInfo(S, A) (the split information) is the entropy of S with respect to the values of attribute A itself, Si being the subset of S for which A takes its i-th value.
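A minimal sketch of the gain-ratio calculation, using the same two-class counting convention as the entropy examples above; the function names and the (positive, negative) pair representation of the subsets are illustrative assumptions:

#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

// Entropy of a two-class set, given the positive and negative counts.
static double Entropy(double pos, double neg) {
    double sum = pos + neg, h = 0.0;
    if (pos > 0) h -= pos / sum * std::log2(pos / sum);
    if (neg > 0) h -= neg / sum * std::log2(neg / sum);
    return h;
}

// GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A).
// subsets holds one (positive, negative) count pair per value of attribute A.
static double GainRatio(const std::vector<std::pair<double, double>>& subsets) {
    double pos = 0, neg = 0;
    for (const auto& s : subsets) { pos += s.first; neg += s.second; }
    const double total = pos + neg;
    double gain = Entropy(pos, neg);                       // start from Entropy(S)
    double split_info = 0.0;
    for (const auto& s : subsets) {
        double weight = (s.first + s.second) / total;      // |Si| / |S|
        gain -= weight * Entropy(s.first, s.second);       // subtract the expected entropy
        split_info -= weight * std::log2(weight);          // entropy of the partition itself
    }
    return split_info > 0 ? gain / split_info : 0.0;
}

int main() {
    // The Wind example from above: Weak = [6+, 2-], Strong = [3+, 3-].
    std::printf("GainRatio(S, Wind) = %.3f\n", GainRatio({{6, 2}, {3, 3}}));
    return 0;
}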

Pessimistic Error Pruning (PEP): (the following is only a brief guide; to go deeper, please consult books on pruning algorithms, as I have not studied this topic in depth)

The basic idea of the pruning measure can be summarized as follows:
  • Assume the original tree generated from the training data set is T; for a leaf node t, the number of instances it covers is nt and the number of misclassified instances is et;
  • The error rate of the node on the training data set is then defined as:

r(t) = et / nt

Because the training data set is used both to generate the decision tree and to prune it, this estimate is biased, and the tree pruned with it is not necessarily the most accurate;
  • Therefore, Quinlan adds a continuity correction to the error estimate and modifies the error rate formula to:

r'(t) = (et + 1/2) / nt

  • In the same way, if Tt is one of the subtrees of T and the subtree has Lt leaf nodes, the classification error rate of Tt is:

r'(Tt) = (sum over the Lt leaves i of Tt of (ei + 1/2)) / (sum over those leaves of ni)

In the quantitative analysis, for simplicity, we replace the error rates above with total error counts; for a leaf node t this gives:

e'(t) = et + 1/2

For the subtree Tt, the total number of classification errors is:

e'(Tt) = (sum over the leaves i of Tt of ei) + Lt / 2
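One common way to state the resulting pruning decision is: replace a subtree Tt with a leaf when the leaf's corrected error count e'(t) is no more than the subtree's corrected error count e'(Tt) plus one standard error of e'(Tt). The standard-error form below is the usual binomial approximation and is an assumption of this sketch rather than something defined in the text above:

#include <cmath>
#include <cstdio>

// Corrected error counts as defined above: each leaf contributes a 1/2 correction.
static double CorrectedLeafError(double errors) { return errors + 0.5; }
static double CorrectedSubtreeError(double errors, double leaves) { return errors + 0.5 * leaves; }

// Pessimistic-error pruning decision for one subtree Tt covering n training instances:
// replace Tt with a leaf when the leaf's corrected error count is within one standard
// error of the subtree's corrected error count.
static bool ShouldPrune(double n, double subtree_errors, double subtree_leaves,
                        double leaf_errors) {
    double e_subtree = CorrectedSubtreeError(subtree_errors, subtree_leaves);
    double e_leaf = CorrectedLeafError(leaf_errors);
    // Standard error of the subtree's corrected error count (binomial approximation).
    double se = std::sqrt(e_subtree * (n - e_subtree) / n);
    return e_leaf <= e_subtree + se;
}

int main() {
    // Illustrative numbers only: a subtree with 4 leaves covers 40 instances and
    // misclassifies 3 of them; collapsing it into a single leaf would misclassify 6.
    std::printf("prune subtree? %s\n", ShouldPrune(40, 3, 4, 6) ? "yes" : "no");
    return 0;
}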

 

 
