Decision tree Classification algorithm (ID3)


1. What is a decision tree?

A decision tree is a flowchart-like tree structure: each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class or a class distribution. The topmost node of the tree is the root node.

2. Decision trees are an important algorithm among the classification methods of machine learning.

3. The basic algorithm for constructing a decision tree

3.1 The concept of entropy

Information is abstract, so how can it be measured? In 1948, Shannon proposed the concept of "information entropy". The amount of information is directly related to its uncertainty: to settle something very uncertain, or something we know nothing about, we need a lot of information, so the amount of information equals the amount of uncertainty.

Example: guessing the World Cup winner. If you know nothing about the teams, how many guesses do you need? Keep in mind that the odds of each team winning are not equal.

Information is measured in bits: the entropy of a variable X is H(X) = -Σ_i p(x_i) · log2 p(x_i).
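A minimal sketch of this bit-counting idea in Python (the 32-team guessing game is the standard illustration of the World Cup example; this is illustrative code, not from the original post):

```python
import math

# If all 32 World Cup teams are equally likely to win, the entropy of the
# outcome is log2(32) = 5 bits: at most 5 yes/no questions are needed.
uniform = [1 / 32] * 32
h_uniform = -sum(p * math.log2(p) for p in uniform)
print(h_uniform)  # 5.0

# If some teams are more likely than others, the uncertainty (entropy) is
# lower, so less information is needed on average to identify the winner.
biased = [0.5, 0.25] + [0.25 / 30] * 30
h_biased = -sum(p * math.log2(p) for p in biased)
print(round(h_biased, 3))  # 2.727, i.e. well below 5 bits
```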

The greater the uncertainty of a variable, the greater its entropy.

3.2 Decision tree induction algorithm (ID3)

Between 1970 and 1980, J. Ross Quinlan developed the ID3 algorithm. ID3 selects the attribute for each node by information gain:

Gain(A) = Info(D) - Info_A(D)

i.e. how much information is gained by using attribute A to partition the dataset D, where Info(D) is the entropy of the class distribution of D and Info_A(D) = Σ_j (|D_j|/|D|) · Info(D_j) is the expected entropy after splitting D on A.

On the AllElectronics example, Gain(age) = 0.246 and, similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048. Age has the highest information gain, so age is selected as the attribute for the root node.
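As a check, here is a short sketch that reproduces Gain(age) from the per-age class counts of the classic AllElectronics table (the counts are the standard textbook ones; the code itself is illustrative, not the original author's):

```python
import math

def info(counts):
    """Info(D) = -sum(p_i * log2(p_i)) over the class proportions."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

# (yes, no) class counts per age group in the AllElectronics table
partitions = {'youth': (2, 3), 'middle_aged': (4, 0), 'senior': (3, 2)}
total = sum(sum(c) for c in partitions.values())  # 14 samples, 9 yes / 5 no

info_d = info([9, 5])                                            # 0.940
info_age = sum(sum(c) / total * info(c) for c in partitions.values())  # 0.694
print(round(info_d - info_age, 3))  # 0.247 (textbooks quote 0.246 after rounding intermediates)
```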

This process is then repeated on each partition. The algorithm proceeds as follows (a minimal runnable sketch follows the list):
  • The tree begins as a single node representing the training samples (step 1).
  • If the samples all belong to the same class, the node becomes a leaf labeled with that class (steps 2 and 3).
  • Otherwise, the algorithm uses an entropy-based measure, information gain, as the heuristic for choosing the attribute that best separates the samples into classes (step 6). This attribute becomes the "test" or "decision" attribute of the node (step 7). In this version of the algorithm, all attributes are categorical, i.e. take discrete values; continuous attributes must be discretized first.
  • For each known value of the test attribute, a branch is created and the samples are partitioned accordingly (steps 8-10).
  • The algorithm recursively applies the same process to form a decision tree for the samples in each partition. Once an attribute has appeared at a node, it need not be considered at any of that node's descendants (step 13).
  • The recursive partitioning stops only when one of the following conditions holds:
  • (a) All samples at a given node belong to the same class (steps 2 and 3).
  • (b) No attributes remain on which the samples can be further partitioned (step 4). In this case, majority voting is used (step 5): the node is converted into a leaf labeled with the class to which the majority of its samples belong. Alternatively, the class distribution of the node's samples can be stored.
  • (c) A branch test_attribute = a_i has no samples (step 11). In this case, a leaf is created and labeled with the majority class of the samples (step 12).
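The steps above map directly onto a compact implementation. Below is a minimal ID3 sketch in Python (illustrative code written for this article, not the original post's; names such as `build_tree` and `info_gain` are made up for the example). An internal node is represented as an `(attribute, {value: subtree})` pair and a leaf as a bare class label:

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D): Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Gain(attr) = Info(D) - Info_attr(D) for one categorical attribute."""
    total = len(labels)
    expected = 0.0
    for value in {row[attr] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        expected += len(subset) / total * entropy(subset)
    return entropy(labels) - expected

def build_tree(rows, labels, attrs):
    # (a) all samples belong to one class -> leaf (steps 2-3)
    if len(set(labels)) == 1:
        return labels[0]
    # (b) no attributes remain -> majority vote (steps 4-5)
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    # pick the attribute with the highest information gain (steps 6-7)
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    branches = {}
    # one branch per known value of the test attribute (steps 8-10)
    for value in {row[best] for row in rows}:
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [l for _, l in pairs]
        # recurse, dropping the used attribute (step 13)
        branches[value] = build_tree(sub_rows, sub_labels,
                                     [a for a in attrs if a != best])
    return (best, branches)

# tiny demo on three samples
rows = [{'age': 'youth', 'student': 'no'},
        {'age': 'youth', 'student': 'yes'},
        {'age': 'senior', 'student': 'no'}]
print(build_tree(rows, ['no', 'yes', 'yes'], ['age', 'student']))
```

Note that case (c) cannot arise in this particular sketch, because branches are only created for attribute values that actually occur among the node's samples.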
3.3 Other algorithms

C4.5: Quinlan. Classification and Regression Trees (CART): L. Breiman, J. Friedman, R. Olshen, C. Stone.

In common: all are greedy algorithms, built top-down. Differences: the attribute selection measures differ: C4.5 uses the gain ratio, CART uses the Gini index, ID3 uses information gain.

3.4 How are attributes with continuous values handled?

They must be discretized, typically by choosing a threshold that splits the values into two ranges.

4. Tree pruning (to avoid overfitting)

4.1 Pre-pruning
4.2 Post-pruning

5. Advantages of decision trees

Intuitive, easy to understand, effective on small datasets.

6. Disadvantages of decision trees

Continuous variables are not handled well; when there are many categories, the error increases quickly; scalability is only average.

Code part:

1. Operating environment

```python
#!/usr/bin/python
# encoding: utf-8
"""
@author : du min
@contact: [email protected]
@File   : allelectronics.py
@time   : 2017/7/23 16:45
"""
from sklearn.feature_extraction import DictVectorizer
import csv
from sklearn import tree
from sklearn import preprocessing
from sklearn.externals.six import StringIO
import pydotplus
```

2. Load the file

```python
allElectronicsData = open('c:/users/007/desktop/01dtree/allelectronics.csv')
reader = csv.reader(allElectronicsData)
headers = next(reader)  # read the header row
```

3. Extract the data

```python
featureList = []
labelList = []

for row in reader:
    labelList.append(row[len(row) - 1])  # the last column is the class label
    rowDict = {}
    for i in range(1, len(row) - 1):
        rowDict[headers[i]] = row[i]  # save the attributes as a dict, excluding the last column
    featureList.append(rowDict)
```

4. Convert the data into a convenient form

```python
# turn the categorical features into a numeric (one-hot) matrix
vec = DictVectorizer()
dummyX = vec.fit_transform(featureList).toarray()

print("dummyX: " + str(dummyX))
print(vec.get_feature_names())
print("labelList: " + str(labelList))

# convert the yes/no labels into 1/0
lb = preprocessing.LabelBinarizer()
dummyY = lb.fit_transform(labelList)
print("dummyY: " + str(dummyY))
```

5. Build the model and draw the tree

```python
# classify with a decision tree, using entropy (information gain) as in ID3
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(dummyX, dummyY)
print("clf: " + str(clf))

# export the decision tree as a PDF
dot_data = tree.export_graphviz(clf, out_file=None)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("iris.pdf")
```

6. Validate the model

```python
# take the first row of data, modify it, and test the model on the result
oneRowX = dummyX[0, :]
print("oneRowX: " + str(oneRowX))

newRowX = oneRowX.copy()
newRowX[0] = 1
newRowX[2] = 0
print("newRowX: " + str(newRowX))

# the data must be reshaped to (1, -1) because predict expects a 2-D array
predictedY = clf.predict(newRowX.reshape(1, -1))
print("predictedY: " + str(predictedY))
```

Result graph: (the generated decision tree, written to iris.pdf)
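The script expects a CSV whose first column is an ID, whose middle columns are the attributes, and whose last column is the class label. The original post does not include the file, so the sketch below writes a stand-in reconstructed from the first rows of the well-known AllElectronics table (treat the exact format as an assumption):

```python
# Hypothetical stand-in for allelectronics.csv; the column layout is assumed
# from the classic AllElectronics dataset, not taken from the original post.
sample = """RID,age,income,student,credit_rating,buys_computer
1,youth,high,no,fair,no
2,youth,high,no,excellent,no
3,middle_aged,high,no,fair,yes
4,senior,medium,no,fair,yes
5,senior,low,yes,fair,yes
"""
with open("allelectronics.csv", "w") as f:
    f.write(sample)
```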
