1. What is a decision tree (also called a judgment tree)? A decision tree is a flowchart-like tree structure: each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class or a class distribution. The topmost node of the tree is the root node.

2. Decision trees are one of the important algorithms among machine-learning classification methods.

3. Constructing a decision tree: the basic algorithm

3.1 The concept of entropy: information is abstract, so how can it be measured? In 1948 Shannon proposed the concept of "information entropy". The amount of information is directly related to uncertainty: to pin down something very uncertain, or something we know nothing about, a lot of information is needed, i.e. the amount of information needed corresponds to the amount of uncertainty. Example: guessing the World Cup winner. If you know nothing about football, how many yes/no guesses do you need? If the teams' odds of winning are not equal, the uncertainty, and hence the information needed, is smaller. The amount of information is measured in bits: for a random variable whose outcomes have probabilities p_1, ..., p_n, the entropy is H = -(p_1·log2(p_1) + ... + p_n·log2(p_n)).
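To make the bit counting concrete, here is a minimal sketch (not code from the original post; it assumes 32 teams, as in the usual version of the World Cup example) that evaluates the entropy formula above:

```python
from math import log2

def entropy(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over the non-zero probabilities."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

# 32 equally likely teams -> log2(32) = 5 bits, i.e. 5 yes/no questions are needed
print(entropy([1 / 32] * 32))                             # 5.0

# if a few favourites are much more likely, the uncertainty drops below 5 bits
print(entropy([0.3, 0.2, 0.1, 0.1] + [0.3 / 28] * 28))    # noticeably less than 5.0
```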
The greater the uncertainty of a variable, the greater its entropy. For a set of samples D in which the classes occur with proportions p_1, ..., p_m, the entropy of D is Info(D) = -(p_1·log2(p_1) + ... + p_m·log2(p_m)).

3.2 Decision tree induction algorithm (ID3). Around 1970-1980, J. Ross Quinlan developed the ID3 algorithm, which selects the attribute for each node by information gain: Gain(A) = Info(D) - Info_A(D), where Info_A(D) is the weighted average of the entropies of the partitions of D produced by attribute A. Gain(A) therefore measures how much information is obtained by using A to split the node.
Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048. Since the gain for age is the largest, age is selected as the test attribute at the root node.
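These numbers come from the classic AllElectronics "buys_computer" example. As a hedged sketch, the gain for age can be reproduced as follows; the class counts used here are my assumption of the usual version of that table (9 "yes" and 5 "no" overall, with age splitting the 14 samples into groups of 5, 4 and 5 whose yes/no counts are 2/3, 4/0 and 3/2):

```python
from math import log2

def info(counts):
    """Entropy Info(D) in bits, computed from a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# assumed class counts of the standard AllElectronics example: 9 "yes", 5 "no"
info_d = info([9, 5])                                   # ~0.940 bits

# assumed partitions induced by age: youth 2/3, middle_aged 4/0, senior 3/2
age_partitions = [[2, 3], [4, 0], [3, 2]]
n = 14
info_age = sum(sum(part) / n * info(part) for part in age_partitions)   # ~0.694 bits

gain_age = info_d - info_age
print(round(gain_age, 3))   # ~0.247, larger than 0.029, 0.151 and 0.048
```

The result, roughly 0.25, is larger than the three gains quoted above, which is why age is chosen first.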
The same procedure is then repeated at each node. Algorithm (a compact Python sketch of the recursion follows the list of steps below):
- The tree starts as a single node representing all of the training samples (step 1).
- If the samples all belong to the same class, the node becomes a leaf and is labeled with that class (steps 2 and 3).
- Otherwise, the algorithm uses the entropy-based measure information gain as a heuristic and chooses the attribute that best separates the samples (step 6). This attribute becomes the "test" or "decision" attribute of the node (step 7). In this version of the algorithm,
- all attributes are categorical, i.e. they take discrete values; continuous attributes must be discretized beforehand.
- For each known value of the test attribute, a branch is created and the samples are divided accordingly (steps 8-10).
- The algorithm applies the same process recursively to form a decision tree for the samples of each partition. Once an attribute has appeared at a node, it does not need to be considered again at any of that node's descendants (step 13).
- The recursive partitioning step is stopped only if one of the following conditions is true:
- (a) All samples of a given node belong to the same class (steps 2 and 3).
- (b) There are no remaining attributes that can be used to further divide the samples (step 4). In this case, majority voting is used (step 5).
- This converts the given node into a leaf labeled with the class held by the majority of its samples. Alternatively, the class distribution of the node's samples can be stored.
- (c) The branch test_attribute = a_i contains no samples (step 11). In this case, a leaf is created and labeled with the majority class of the samples (step 12).
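The numbered steps above describe a recursive procedure. Below is a compact Python sketch of that recursion (my own illustration; the function and variable names are not from the original post). It covers the information-gain split and stopping conditions (a) and (b); condition (c) is noted in a comment, since branches built from observed values are never empty:

```python
from collections import Counter
from math import log2

def info(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Information gain Gain(attr) = Info(D) - Info_attr(D) for this node's samples."""
    split = {}
    for row, label in zip(rows, labels):
        split.setdefault(row[attr], []).append(label)
    info_attr = sum(len(part) / len(labels) * info(part) for part in split.values())
    return info(labels) - info_attr

def id3(rows, labels, attributes):
    """rows: list of dicts mapping attribute -> discrete value; labels: class labels."""
    # (a) all samples belong to the same class -> leaf labeled with that class (steps 2-3)
    if len(set(labels)) == 1:
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    # (b) no attributes left to split on -> leaf labeled with the majority class (steps 4-5)
    if not attributes:
        return majority
    # choose the attribute with the highest information gain as the test attribute (steps 6-7)
    best = max(attributes, key=lambda a: gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    node = {}
    # one branch per known value of the test attribute (steps 8-10)
    for value in {row[best] for row in rows}:
        branch = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        # (c) an empty branch would get a majority-class leaf (steps 11-12); branches built
        # from observed values are never empty, so that case does not arise here
        sub_rows = [r for r, _ in branch]
        sub_labels = [l for _, l in branch]
        node[value] = id3(sub_rows, sub_labels, remaining)   # recurse on each partition (step 13)
    return {best: node}
```

With the featureList and labelList built in the code section further down, id3(featureList, labelList, list(featureList[0].keys())) would return the tree as a nested dictionary.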
3.3 Other algorithms: C4.5 (Quinlan) and Classification and Regression Trees (CART) (L. Breiman, J. Friedman, R. Olshen, C. Stone).
In common: all are greedy algorithms that build the tree top-down.
Differences: the attribute selection measures differ: C4.5 uses the gain ratio, CART uses the Gini index, and ID3 uses information gain.

3.4 How are continuous attributes handled? ID3 requires them to be discretized first; C4.5 and CART instead choose a numeric split threshold (e.g. A <= t vs. A > t) from the sorted attribute values.

4. Tree pruning (to avoid overfitting)
4.1 Pre-pruning (stop growing the tree early)
4.2 Post-pruning (grow the full tree, then cut branches back)
(A short scikit-learn sketch at the end of this post illustrates both selection criteria and simple pre-pruning.)

5. Advantages of decision trees: intuitive, easy to understand, and effective on small-scale datasets.

6. Disadvantages of decision trees: continuous variables are not handled well; when there are many classes, the error rate grows quickly; only suitable for moderate-scale data.

Code part:

1. Operating environment

```python
#!/usr/bin/python
# encoding: utf-8
"""
@author : du min
@contact: [email protected]
@File   : allelectronics.py
@time   : 2017/7/23 16:45
"""
from sklearn.feature_extraction import DictVectorizer
import csv
from sklearn import tree
from sklearn import preprocessing
from sklearn.externals.six import StringIO
import pydotplus
```

2. Load the file

```python
allElectronicsData = open('c:/users/007/desktop/01dtree/allelectronics.csv')
reader = csv.reader(allElectronicsData)
headers = next(reader)  # read the header row
```

3. Extract the data

```python
featureList = []
labelList = []
for row in reader:
    labelList.append(row[len(row) - 1])   # the last column is the class label
    rowDict = {}
    for i in range(1, len(row) - 1):      # skip the first (RID) column and the label column
        rowDict[headers[i]] = row[i]      # store each sample as a dict of attribute -> value
    featureList.append(rowDict)
```

4. Convert the data into a convenient form

```python
# DictVectorizer one-hot encodes the categorical features into a 0/1 numeric matrix
vec = DictVectorizer()
dummyX = vec.fit_transform(featureList).toarray()
print("dummyX: " + str(dummyX))
print(vec.get_feature_names())
print("labelList: " + str(labelList))

# the "yes"/"no" labels are converted to 1 and 0
lb = preprocessing.LabelBinarizer()
dummyY = lb.fit_transform(labelList)
print("dummyY: " + str(dummyY))
```

5. Build the model and draw the tree

```python
# use a decision tree for classification, with entropy (information gain) as the criterion
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(dummyX, dummyY)
print("clf: " + str(clf))

# export the decision tree as a PDF via Graphviz
dot_data = tree.export_graphviz(clf, out_file=None)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("Iris.pdf")
```

6. Validate the model

```python
# modify the first row of the data and test the model on it
oneRowX = dummyX[0, :]
print("oneRowX: " + str(oneRowX))

newRowX = oneRowX
newRowX[0] = 1
newRowX[2] = 0
print("newRowX: " + str(newRowX))

predictedY = clf.predict(newRowX.reshape(1, -1))   # a single sample must be reshaped to (1, -1)
print("predictedY: " + str(predictedY))
```

Result graph:
Decision tree Classification algorithm (ID3)
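Finally, as a hedged illustration of sections 3.3, 3.4 and 4 (selection measures, continuous attributes, and pre-pruning): scikit-learn's DecisionTreeClassifier follows the CART approach, searches numeric split thresholds for continuous features on its own, and offers the entropy (information gain) and Gini criteria (to my knowledge it does not implement C4.5's gain ratio). The dataset and parameter values below are only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# iris has continuous (numeric) features; the tree picks split thresholds itself
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for criterion in ("entropy", "gini"):   # information gain (as in ID3) vs Gini index (as in CART)
    clf = DecisionTreeClassifier(
        criterion=criterion,
        max_depth=3,            # pre-pruning: stop growing beyond depth 3
        min_samples_leaf=5,     # pre-pruning: each leaf must keep at least 5 samples
        random_state=0,
    )
    clf.fit(X_train, y_train)
    print(criterion, clf.score(X_test, y_test))
```

max_depth and min_samples_leaf are simple pre-pruning controls in the sense of section 4.1; post-pruning (section 4.2) is not shown here.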