Classification Algorithms: Decision Trees

Tags: ID3

This article is about the classification problem and mainly introduces the decision tree algorithm; the broader series also covers naive Bayes, support vector machines, BP neural networks, lazy learning algorithms, random forests, AdaBoost (adaptive boosting), classification model selection, and result evaluation.

I. A Basic Introduction to Classification

Birds of a feather flock together: classification problems have been part of our lives since ancient times. Classification is an important branch of data mining and is widely used in areas such as medical diagnosis, spam filtering, junk-SMS interception, and customer analysis. Classification problems can be divided into two categories:

    • Classification: classification in the narrow sense deals with discrete data. For example, judging from a person's handwriting whether the writer is male or female involves only two categories, drawn from the discrete set {male, female}.
    • Prediction: prediction deals with continuous data, such as predicting tomorrow's humidity at 8 o'clock. The weather changes constantly, and the humidity at 8 o'clock is a specific value that does not belong to a finite set. Prediction is also called regression analysis and is widely used in the financial field.

Although discrete and continuous data are processed differently, they can be converted into each other. For example, we can compare some characteristic value against a threshold and label the record male if the value is greater than 0.5 and female otherwise, which lets a continuous value be handled discretely; conversely, humidity can be bucketed into ranges so that continuous data is handled as discrete classes.
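A minimal Python sketch of this idea; the function names, the 0.5 cut-off and the humidity bin boundaries are illustrative assumptions, not values from the original article:

```python
def to_discrete(score, threshold=0.5):
    """Turn a continuous score into a discrete label (0.5 is just the example cut-off)."""
    return "male" if score > threshold else "female"

def humidity_to_bin(humidity):
    """Bucket a continuous humidity value into discrete ranges (boundaries are illustrative)."""
    if humidity < 30:
        return "dry"
    elif humidity < 60:
        return "comfortable"
    else:
        return "humid"
```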

Data classification consists of two steps:

    1. Construct the model: train a classifier on the training data set;
    2. Classify the test data with the trained classifier model.

A good classifier has good generalization ability: it achieves high accuracy not only on the training data set but also on test data it has never seen. If a classifier performs well on the training data but poorly on the test data, it is overfitted: it has merely memorized the training data and has not captured the characteristics of the whole data space.

II. Decision Tree Classification

The decision tree algorithm classifies records using the branch structure of a tree: an internal node represents a test on an attribute, each branch of the node corresponds to an outcome of that test, and a leaf node represents a class label.

As an example, consider a decision tree that predicts whether a person will buy a computer. Using this tree we can classify new records, starting from the root node (age): if a person is middle-aged, we directly conclude that the person will buy a computer; if the person is a youth, we further test whether the person is a student; if the person is a senior, we further test the credit rating, until a leaf node determines the class of the record.
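Read as code, that tree is just a chain of nested tests. In the sketch below the leaf labels for the student and credit-rating branches are filled in to match the data table given later in this article; the function name is illustrative:

```python
def will_buy_computer(age, is_student, credit_rating):
    """Classify a record by walking the decision tree described above (illustrative sketch)."""
    if age == "middle-aged":
        return "yes"                                   # middle-aged records are a leaf: buys a computer
    elif age == "youth":
        return "yes" if is_student else "no"           # youths are split further on the student attribute
    else:                                              # senior
        return "yes" if credit_rating == "fair" else "no"  # seniors are split on credit rating
```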

One advantage of the decision tree algorithm is that it produces rules people can understand directly, a property that algorithms such as naive Bayes and neural networks lack; decision trees also achieve fairly high accuracy and classify effectively without requiring background knowledge of the data. The decision tree algorithm has many variants, including ID3, C4.5, C5.0, and CART, but their foundations are similar. Here is the basic idea of the decision tree algorithm:

  • Algorithm: GenerateDecisionTree(D, attributeList) generates a decision tree from the training data records D.
  • Input:
    • Data records D, a training data set with class labels;
    • Attribute list attributeList, the set of candidate attributes used for tests at internal nodes;
    • Attribute selection method AttributeSelectionMethod(), which chooses the best splitting attribute.
  • Output: a decision tree.
  • Procedure:
    1. Create a node N;
    2. If all records in D have the same class label C:
      • mark node N as a leaf node with class C and return node N;
    3. If the attribute list is empty:
      • mark node N as a leaf node labeled with the majority class in D and return node N;
    4. Call AttributeSelectionMethod(D, attributeList) to select the best splitting criterion splitCriterion;
    5. Label node N with the splitting criterion splitCriterion, which becomes the test at node N;
    6. If the splitting attribute is discrete and multiway splits are allowed:
      • remove the splitting attribute from the attribute list: attributeList -= splitAttribute;
    7. For each value j of the splitting attribute:
      • let Dj be the set of records in D satisfying j;
      • if Dj is empty:
        • create a new leaf node F labeled with the majority class in D and attach F to N;
      • otherwise:
        • recursively call GenerateDecisionTree(Dj, attributeList) to obtain a subtree node Nj and attach Nj to N;
    8. Return node N;

Steps 1, 2, and 3 of the algorithm are straightforward. The attribute selection function in step 4 is introduced below; for now it is enough to know that it finds a criterion such that the subtrees produced by the split are as pure as possible, where "pure" means containing only a single class. Step 5 sets the test expression of node N according to the splitting criterion. In step 6, when building a multiway decision tree, a discrete attribute is used only once in node N and its subtree and is then removed from the list of available attributes: in the earlier example, once the attribute selection function determines that the best splitting attribute is age, each of age's three values gets its own branch and the age attribute is never used again further down. The time complexity of the algorithm is O(k·|D|·log|D|), where k is the number of attributes and |D| is the number of records in data set D.
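A minimal Python sketch of this recursion, assuming each record is a dict with a 'label' key plus its attribute values; the attribute selection method is passed in as a function, mirroring AttributeSelectionMethod() in the pseudocode:

```python
from collections import Counter

def generate_decision_tree(records, attribute_list, select_attribute):
    """Recursive skeleton of GenerateDecisionTree(D, attributeList)."""
    labels = [r["label"] for r in records]
    # Step 2: all records share one class -> leaf node
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # Step 3: no candidate attributes left -> leaf labeled with the majority class
    if not attribute_list:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # Steps 4-5: pick the best splitting attribute and label the node with it
    best = select_attribute(records, attribute_list)
    node = {"attribute": best, "branches": {}}
    # Step 6: a discrete attribute is used at most once on any path (multiway split)
    remaining = [a for a in attribute_list if a != best]
    # Step 7: one branch per value of the splitting attribute present in the data
    # (iterating over a fixed value domain instead would expose the empty-Dj case,
    #  which the pseudocode handles with a majority-class leaf)
    for value in {r[best] for r in records}:
        subset = [r for r in records if r[best] == value]
        node["branches"][value] = generate_decision_tree(subset, remaining, select_attribute)
    return node
```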

III. Attribute Selection Methods

The attribute selection method always chooses the attribute with the best splitting power, that is, the one that makes the classes of the records in each branch as pure as possible. It ranks the attributes in the attribute list by some criterion in order to pick the best one. There are many such criteria; here I present three common ones: information gain, gain ratio, and the Gini index.

    • Information gain

  Information gain is based on Shannon's information theory. It looks for the attribute R with the following property: the information gain obtained by splitting on R is larger than that of any other attribute. Information (Info) is defined as follows:

Info(D) = -∑_{i=1}^{m} p_i · log2(p_i)

Here m is the number of classes C in data set D, p_i is the probability that an arbitrary record in D belongs to class C_i, computed as p_i = (number of records of class C_i in D) / |D|. Info(D) measures the amount of information needed to separate the classes of data set D from one another.

If you know information theory, you will recognize that Info is simply entropy. Entropy is a measure of uncertainty: the more uncertain the class of a data set is, the larger its entropy. For example, throw a cube A into the air and let F1 be the face that lands downward; F1 takes values in {1, 2, 3, 4, 5, 6}, so Entropy(F1) = -(1/6·log2(1/6) + ... + 1/6·log2(1/6)) = -log2(1/6) ≈ 2.58. Now replace cube A with a regular tetrahedron B and let F2 be the face that lands downward; F2 takes values in {1, 2, 3, 4}, so Entropy(F2) = -(1/4·log2(1/4) + 1/4·log2(1/4) + 1/4·log2(1/4) + 1/4·log2(1/4)) = -log2(1/4) = 2. If we switch to a ball C, the face that lands downward, F3, is the same no matter how the ball is thrown; F3 takes the single value {1}, so Entropy(F3) = -1·log2(1) = 0. The more faces there are, the larger the entropy; with a ball, which effectively has only one face, the entropy is 0 and there is no uncertainty about which face lands downward.
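The three entropy values above can be reproduced with a few lines of Python (log base 2, as in the text):

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits: -sum(p * log2(p)), skipping zero probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([1/6] * 6))  # cube, 6 equally likely faces -> ~2.585
print(entropy([1/4] * 4))  # tetrahedron, 4 faces          -> 2.0
print(entropy([1.0]))      # ball, only one outcome        -> 0.0
```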

With this intuitive understanding of entropy, we return to information gain. Suppose we choose attribute R as the splitting attribute for data set D, and R has k distinct values {V1, V2, ..., Vk}; then D is divided by the value of R into k groups {D1, D2, ..., Dk}. After splitting on R, the amount of information still needed to separate the different classes of D is:

Info_R(D) = ∑_{j=1}^{k} (|Dj| / |D|) · Info(Dj)

Information gain is defined as the difference between these two amounts of information, before and after the split:

Gain(R) = Info(D) - Info_R(D)

The information gain Gain(R) measures how much information attribute R contributes to the classification. We look for the attribute with the maximum gain, which makes the split as pure as possible, that is, best separates the different classes. Note that Info(D) is the same for every attribute, so maximizing the gain is equivalent to minimizing Info_R(D); Info(D) is introduced here only to explain the underlying principle and does not need to be computed in an implementation. As an example, consider the following data set D:

Record ID | Age         | Income | Student | Credit rating | Buys computer
1         | Youth       | High   | No      | Fair          | No
2         | Youth       | High   | No      | Excellent     | No
3         | Middle-aged | High   | No      | Fair          | Yes
4         | Senior      | Medium | No      | Fair          | Yes
5         | Senior      | Low    | Yes     | Fair          | Yes
6         | Senior      | Low    | Yes     | Excellent     | No
7         | Middle-aged | Low    | Yes     | Excellent     | Yes
8         | Youth       | Medium | No      | Fair          | No
9         | Youth       | Low    | Yes     | Fair          | Yes
10        | Senior      | Medium | Yes     | Fair          | Yes
11        | Youth       | Medium | Yes     | Excellent     | Yes
12        | Middle-aged | Medium | No      | Excellent     | Yes
13        | Middle-aged | High   | Yes     | Fair          | Yes
14        | Senior      | Medium | No      | Excellent     | No

This data set predicts whether a person will buy a computer from the person's age, income, whether they are a student, and their credit rating; the last column, "Buys computer", is the class label. We now use information gain to choose the best splitting attribute, starting with the information needed after splitting on age:

Info_age(D) = 5/14 · (-2/5·log2(2/5) - 3/5·log2(3/5)) + 4/14 · (-4/4·log2(4/4)) + 5/14 · (-3/5·log2(3/5) - 2/5·log2(2/5)) ≈ 0.694

The sum has three terms. The first is for youths: of the 14 records, 5 are youths, of whom 2 (2/5) bought a computer and 3 (3/5) did not; the second term is for the middle-aged and the third for seniors. In the same way we obtain:

Info_income(D) ≈ 0.911, Info_student(D) ≈ 0.789, Info_credit_rating(D) ≈ 0.892

Info_age(D) is the smallest, which means that splitting on age yields the purest subsets, so age is chosen as the test attribute of the root node and the data is divided into three branches for youth, middle-aged, and senior, with corresponding subsets D1, D2, and D3.

Note that once the age attribute has been used it is not needed in later splits, that is, age is deleted from attributeList. The decision subtrees for D1, D2, and D3 are then built in the same way. The ID3 algorithm uses this information-gain-based attribute selection method.
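The calculation above can be checked with a short script over the table; the lowercase attribute and value strings below simply mirror the table, and the printed values should come out to roughly 0.694 for age, 0.911 for income, 0.789 for student, and 0.892 for credit rating, so age is selected:

```python
import math
from collections import Counter

# The 14 records from the table: (age, income, student, credit_rating, buys_computer)
DATA = [
    ("youth", "high", "no", "fair", "no"),        ("youth", "high", "no", "excellent", "no"),
    ("middle-aged", "high", "no", "fair", "yes"), ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),      ("senior", "low", "yes", "excellent", "no"),
    ("middle-aged", "low", "yes", "excellent", "yes"), ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),       ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"), ("middle-aged", "medium", "no", "excellent", "yes"),
    ("middle-aged", "high", "yes", "fair", "yes"), ("senior", "medium", "no", "excellent", "no"),
]
ATTRIBUTES = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def info(labels):
    """Info(D) = -sum(p_i * log2(p_i)) over the class distribution of the labels."""
    counts, total = Counter(labels), len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def info_after_split(data, attr_index):
    """Info_R(D): weighted average of Info(Dj) over the partitions induced by attribute R."""
    total, groups = len(data), {}
    for row in data:
        groups.setdefault(row[attr_index], []).append(row[-1])
    return sum(len(g) / total * info(g) for g in groups.values())

for name, idx in ATTRIBUTES.items():
    print(f"Info_{name}(D) = {info_after_split(DATA, idx):.3f}")
# age comes out smallest (about 0.694), so ID3 picks age as the root split.
```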

    • Gain ratio

  The information gain criterion has a serious flaw: it tends to favor attributes with many values. If we added a Name attribute to the data set above and every one of the 14 records had a different name, information gain would choose Name as the best attribute, because splitting on Name yields groups that each contain exactly one record, and each record belongs to a single class (either buys a computer or not), so the purity is maximal; the Name test node would then have 14 branches. But such a split is meaningless and has no generalization ability. The gain ratio improves on this by introducing the split information:

SplitInfo_R(D) = -∑_{j=1}^{k} (|Dj| / |D|) · log2(|Dj| / |D|)

The gain ratio is defined as the ratio of the information gain to the split information:

GainRatio(R) = Gain(R) / SplitInfo_R(D)

We look for the attribute with the maximum GainRatio as the best splitting attribute. If an attribute has many values, SplitInfo_R(D) is large, which makes GainRatio(R) smaller. However, the gain ratio has its own shortcomings: SplitInfo_R(D) can be 0, in which case the ratio is undefined, and when SplitInfo_R(D) tends to 0 the value of GainRatio(R) becomes unreliable. A common remedy is to smooth the denominator, for example by adding the average split information over all candidate attributes:

GainRatio(R) = Gain(R) / (SplitInfo_R(D) + SplitInfo_avg(D))
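Continuing the previous sketch, SplitInfo and the (unsmoothed) gain ratio can be computed the same way; this snippet reuses the DATA table and the info() / info_after_split() helpers defined above:

```python
import math
from collections import Counter

def split_info(data, attr_index):
    """SplitInfo_R(D) = -sum(|Dj|/|D| * log2(|Dj|/|D|)) over the partitions of attribute R."""
    total = len(data)
    counts = Counter(row[attr_index] for row in data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def gain_ratio(data, attr_index):
    """GainRatio(R) = Gain(R) / SplitInfo_R(D), guarding against a zero denominator."""
    gain = info([row[-1] for row in data]) - info_after_split(data, attr_index)
    si = split_info(data, attr_index)
    return gain / si if si > 0 else 0.0
```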

    • Gini index

  The Gini index is another measure of the purity of a data set, defined as follows:

Gini(D) = 1 - ∑_{i=1}^{m} p_i²

As before, m is the number of classes C in data set D, p_i is the probability that an arbitrary record in D belongs to class C_i, and p_i = (number of records of class C_i in D) / |D|. If all records belong to the same class, then p_1 = 1 and Gini(D) = 0, the point of lowest impurity (highest purity). The CART (Classification and Regression Tree) algorithm uses the Gini index to build a binary decision tree: for each attribute it enumerates the non-empty proper subsets of the attribute's values, and the Gini index of splitting on attribute R is:

Gini_R(D) = (|D1| / |D|) · Gini(D1) + (|D2| / |D|) · Gini(D2)

Here D1 is a non-empty proper subset of D (the records whose value of R falls in the chosen value subset) and D2 is its complement in D, so D1 + D2 = D. For a given attribute R there are several possible proper subsets, and therefore several values of Gini_R(D); we take the smallest of them as the Gini index of R. Finally, the reduction in impurity is:

ΔGini(R) = Gini(D) - Gini_R(D)

We choose the attribute with the largest Gini reduction ΔGini(R) as the best splitting attribute.
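A sketch of the Gini computation for a CART-style binary split, again assuming the DATA table from the earlier snippets; the subset enumeration is brute force, which is fine for the small attribute domains in this example:

```python
from collections import Counter
from itertools import combinations

def gini(labels):
    """Gini(D) = 1 - sum(p_i^2) over the class distribution of the labels."""
    counts, total = Counter(labels), len(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def best_binary_gini_split(data, attr_index):
    """Enumerate non-empty proper subsets of the attribute's values and return the
    smallest Gini_R(D) = |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2) with its value subset."""
    values = sorted({row[attr_index] for row in data})
    total, best = len(data), None
    for size in range(1, len(values)):            # proper, non-empty subsets only
        for left in combinations(values, size):
            d1 = [row[-1] for row in data if row[attr_index] in left]
            d2 = [row[-1] for row in data if row[attr_index] not in left]
            split_gini = len(d1) / total * gini(d1) + len(d2) / total * gini(d2)
            if best is None or split_gini < best[0]:
                best = (split_gini, set(left))
    return best  # (Gini_R(D), value subset defining the left branch)
```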

Reference Link: http://www.cnblogs.com/fengfenggirl/p/classsify_decision_tree.html
