Decision tree Classification algorithm (ID3)


1. What is a decision tree?

A decision tree is a flowchart-like tree structure: each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class or a class distribution. The topmost node of the tree is the root node.

2. Decision trees are an important algorithm among the classification methods of machine learning.

3. The basic algorithm for constructing a decision tree

3.1 The concept of entropy

Information is abstract, so how can it be measured? In 1948, Shannon proposed the concept of "information entropy". The amount of information is directly related to its uncertainty: to settle something very uncertain, or something we know nothing about, we need a lot of information, so the amount of information equals the amount of uncertainty.

Example: guessing the World Cup winner. If you know nothing about the teams, how many guesses do you need? Keep in mind that the odds of each team winning are not equal.

Information is measured in bits: the entropy of a variable X is H(X) = -Σ_i p(x_i) · log2 p(x_i).
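A minimal sketch of this bit-counting idea in Python (the 32-team guessing game is the standard illustration of the World Cup example; this is illustrative code, not from the original post):

```python
import math

# If all 32 World Cup teams are equally likely to win, the entropy of the
# outcome is log2(32) = 5 bits: at most 5 yes/no questions are needed.
uniform = [1 / 32] * 32
h_uniform = -sum(p * math.log2(p) for p in uniform)
print(h_uniform)  # 5.0

# If some teams are more likely than others, the uncertainty (entropy) is
# lower, so less information is needed on average to identify the winner.
biased = [0.5, 0.25] + [0.25 / 30] * 30
h_biased = -sum(p * math.log2(p) for p in biased)
print(round(h_biased, 3))  # 2.727, i.e. well below 5 bits
```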

The greater the uncertainty of a variable, the greater its entropy.

3.2 Decision tree induction algorithm (ID3)

Between 1970 and 1980, J. Ross Quinlan developed the ID3 algorithm. ID3 selects the attribute for each node by information gain:

Gain(A) = Info(D) - Info_A(D)

i.e. how much information is gained by using attribute A to partition the dataset D, where Info(D) is the entropy of the class distribution of D and Info_A(D) = Σ_j (|D_j|/|D|) · Info(D_j) is the expected entropy after splitting D on A.

On the AllElectronics example, Gain(age) = 0.246 and, similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048. Age has the highest information gain, so age is selected as the attribute for the root node.
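As a check, here is a short sketch that reproduces Gain(age) from the per-age class counts of the classic AllElectronics table (the counts are the standard textbook ones; the code itself is illustrative, not the original author's):

```python
import math

def info(counts):
    """Info(D) = -sum(p_i * log2(p_i)) over the class proportions."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

# (yes, no) class counts per age group in the AllElectronics table
partitions = {'youth': (2, 3), 'middle_aged': (4, 0), 'senior': (3, 2)}
total = sum(sum(c) for c in partitions.values())  # 14 samples, 9 yes / 5 no

info_d = info([9, 5])                                            # 0.940
info_age = sum(sum(c) / total * info(c) for c in partitions.values())  # 0.694
print(round(info_d - info_age, 3))  # 0.247 (textbooks quote 0.246 after rounding intermediates)
```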

This process is then repeated on each partition. The algorithm proceeds as follows (a minimal runnable sketch follows the list):
  • The tree begins as a single node representing the training samples (step 1).
  • If the samples all belong to the same class, the node becomes a leaf labeled with that class (steps 2 and 3).
  • Otherwise, the algorithm uses an entropy-based measure, information gain, as the heuristic for choosing the attribute that best separates the samples into classes (step 6). This attribute becomes the "test" or "decision" attribute of the node (step 7). In this version of the algorithm, all attributes are categorical, i.e. take discrete values; continuous attributes must be discretized first.
  • For each known value of the test attribute, a branch is created and the samples are partitioned accordingly (steps 8-10).
  • The algorithm recursively applies the same process to form a decision tree for the samples in each partition. Once an attribute has appeared at a node, it need not be considered at any of that node's descendants (step 13).
  • The recursive partitioning stops only when one of the following conditions holds:
  • (a) All samples at a given node belong to the same class (steps 2 and 3).
  • (b) No attributes remain on which the samples can be further partitioned (step 4). In this case, majority voting is used (step 5): the node is converted into a leaf labeled with the class to which the majority of its samples belong. Alternatively, the class distribution of the node's samples can be stored.
  • (c) A branch test_attribute = a_i has no samples (step 11). In this case, a leaf is created and labeled with the majority class of the samples (step 12).
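The steps above map directly onto a compact implementation. Below is a minimal ID3 sketch in Python (illustrative code written for this article, not the original post's; names such as `build_tree` and `info_gain` are made up for the example). An internal node is represented as an `(attribute, {value: subtree})` pair and a leaf as a bare class label:

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D): Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Gain(attr) = Info(D) - Info_attr(D) for one categorical attribute."""
    total = len(labels)
    expected = 0.0
    for value in {row[attr] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        expected += len(subset) / total * entropy(subset)
    return entropy(labels) - expected

def build_tree(rows, labels, attrs):
    # (a) all samples belong to one class -> leaf (steps 2-3)
    if len(set(labels)) == 1:
        return labels[0]
    # (b) no attributes remain -> majority vote (steps 4-5)
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    # pick the attribute with the highest information gain (steps 6-7)
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    branches = {}
    # one branch per known value of the test attribute (steps 8-10)
    for value in {row[best] for row in rows}:
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [l for _, l in pairs]
        # recurse, dropping the used attribute (step 13)
        branches[value] = build_tree(sub_rows, sub_labels,
                                     [a for a in attrs if a != best])
    return (best, branches)

# tiny demo on three samples
rows = [{'age': 'youth', 'student': 'no'},
        {'age': 'youth', 'student': 'yes'},
        {'age': 'senior', 'student': 'no'}]
print(build_tree(rows, ['no', 'yes', 'yes'], ['age', 'student']))
```

Note that case (c) cannot arise in this particular sketch, because branches are only created for attribute values that actually occur among the node's samples.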
3.3 Other algorithms

C4.5: Quinlan. Classification and Regression Trees (CART): L. Breiman, J. Friedman, R. Olshen, C. Stone.

In common: all are greedy algorithms, built top-down. Differences: the attribute selection measures differ: C4.5 uses the gain ratio, CART uses the Gini index, ID3 uses information gain.

3.4 How are attributes with continuous values handled?

They must be discretized, typically by choosing a threshold that splits the values into two ranges.

4. Tree pruning (to avoid overfitting)

4.1 Pre-pruning
4.2 Post-pruning

5. Advantages of decision trees

Intuitive, easy to understand, effective on small datasets.

6. Disadvantages of decision trees

Continuous variables are not handled well; when there are many categories, the error increases quickly; scalability is only average.

Code part:

1. Operating environment

```python
#!/usr/bin/python
# encoding: utf-8
"""
@author : du min
@contact: [email protected]
@File   : allelectronics.py
@time   : 2017/7/23 16:45
"""
from sklearn.feature_extraction import DictVectorizer
import csv
from sklearn import tree
from sklearn import preprocessing
from sklearn.externals.six import StringIO
import pydotplus
```

2. Load the file

```python
allElectronicsData = open('c:/users/007/desktop/01dtree/allelectronics.csv')
reader = csv.reader(allElectronicsData)
headers = next(reader)  # read the header row
```

3. Extract the data

```python
featureList = []
labelList = []

for row in reader:
    labelList.append(row[len(row) - 1])  # the last column is the class label
    rowDict = {}
    for i in range(1, len(row) - 1):
        rowDict[headers[i]] = row[i]  # save the attributes as a dict, excluding the last column
    featureList.append(rowDict)
```

4. Convert the data into a convenient form

```python
# turn the categorical features into a numeric (one-hot) matrix
vec = DictVectorizer()
dummyX = vec.fit_transform(featureList).toarray()

print("dummyX: " + str(dummyX))
print(vec.get_feature_names())
print("labelList: " + str(labelList))

# convert the yes/no labels into 1/0
lb = preprocessing.LabelBinarizer()
dummyY = lb.fit_transform(labelList)
print("dummyY: " + str(dummyY))
```

5. Build the model and draw the tree

```python
# classify with a decision tree, using entropy (information gain) as in ID3
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(dummyX, dummyY)
print("clf: " + str(clf))

# export the decision tree as a PDF
dot_data = tree.export_graphviz(clf, out_file=None)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("iris.pdf")
```

6. Validate the model

```python
# take the first row of data, modify it, and test the model on the result
oneRowX = dummyX[0, :]
print("oneRowX: " + str(oneRowX))

newRowX = oneRowX.copy()
newRowX[0] = 1
newRowX[2] = 0
print("newRowX: " + str(newRowX))

# the data must be reshaped to (1, -1) because predict expects a 2-D array
predictedY = clf.predict(newRowX.reshape(1, -1))
print("predictedY: " + str(predictedY))
```

Result graph: (the generated decision tree, written to iris.pdf)
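The script expects a CSV whose first column is an ID, whose middle columns are the attributes, and whose last column is the class label. The original post does not include the file, so the sketch below writes a stand-in reconstructed from the first rows of the well-known AllElectronics table (treat the exact format as an assumption):

```python
# Hypothetical stand-in for allelectronics.csv; the column layout is assumed
# from the classic AllElectronics dataset, not taken from the original post.
sample = """RID,age,income,student,credit_rating,buys_computer
1,youth,high,no,fair,no
2,youth,high,no,excellent,no
3,middle_aged,high,no,fair,yes
4,senior,medium,no,fair,yes
5,senior,low,yes,fair,yes
"""
with open("allelectronics.csv", "w") as f:
    f.write(sample)
```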
