Practical notes for machine learning 3 (decision tree)


An advantage of decision trees is that the resulting model is very easy for people to understand, whereas the biggest drawback of kNN is that it cannot reveal the intrinsic meaning of the data.

1: Brief description of the concepts

There are several types of decision tree algorithms, including CART, ID3, and C4.5. CART is based on Gini impurity, which we will not cover in detail here, while ID3 and C4.5 are both based on information entropy and follow the same basic idea. The definitions below are given mainly for the ID3 algorithm. We first introduce the definition of information entropy.

If an event a_i occurs with probability P(a_i), then I(a_i) = -log2 P(a_i) measures the uncertainty of a_i and is called the self-information of a_i. The quantity H(S) = sum_i P(a_i) * I(a_i) = -sum_i P(a_i) * log2 P(a_i) is called the average information entropy of the source S.

The principle of ID3 is to maximize the information gain based on information entropy. Suppose the labels of the original problem are positive and negative, with p and n denoting the number of positive and negative instances. The information entropy of the original problem is

H(S) = -(p/(p+n)) * log2(p/(p+n)) - (n/(p+n)) * log2(n/(p+n))

After splitting on a feature, the new entropy is the weighted average of the entropies of the resulting subsets,

newEntropy = sum over i = 1..N of ((p_i + n_i)/(p + n)) * H(S_i)

where N is the number of values the feature can take; for example, for {rain, sunny}, N is 2. The information gain of the feature is then

Gain = baseEntropy - newEntropy

The principle of ID3 is to choose the feature that makes this gain as large as possible. Information gain is the reduction of entropy, that is, the reduction of disorder in the data.
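As a quick worked example (using the toy dataset that appears in the code below, with 2 'yes' and 3 'no' instances): the base entropy is -(2/5)*log2(2/5) - (3/5)*log2(3/5), roughly 0.971 bits. Splitting on the first feature leaves an expected entropy of about 0.551, so its gain is about 0.420, while the second feature only achieves a gain of about 0.171, so ID3 splits on the first feature.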

2: Python code implementation

(1) Calculate information entropy

# Calculate the Shannon entropy of a dataset
from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys():   # create a dictionary entry for a new class
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:                          # over all possible classes
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)            # log base 2
    return shannonEnt

(2) create a dataset

# Create the sample dataset
def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels
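A quick interactive check of (1) and (2), assuming the functions above are loaded in a Python session (the entropy of 2 'yes' and 3 'no' instances is about 0.971):

>>> myDat, labels = createDataSet()
>>> myDat
[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
>>> calcShannonEnt(myDat)
0.9709505944546686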

(3) Dividing Datasets

# Split the dataset on a given feature.
#   dataSet -- the dataset to split
#   axis    -- index of the feature to split on
#   value   -- the feature value to keep
# Returns the instances whose feature equals value, with that feature removed.
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
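For example, a quick check (assuming myDat from createDataSet above): splitting on feature 0 keeps only the matching instances and strips that feature off:

>>> splitDataSet(myDat, 0, 1)
[[1, 'yes'], [1, 'yes'], [0, 'no']]
>>> splitDataSet(myDat, 0, 0)
[[1, 'no'], [1, 'no']]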


(4) select the best features for classification

# Choose the best feature to split the dataset on; returns the index of that feature.
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1                # number of features (last column is the label)
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):                     # iterate over the features
        featureSet = set([example[i] for example in dataSet])   # set of values of feature i
        newEntropy = 0.0
        for value in featureSet:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)     # expected entropy after the split
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
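On the toy dataset this picks feature 0 ('no surfacing'), whose gain (about 0.420) is larger than that of 'flippers' (about 0.171). A quick check, assuming myDat from above:

>>> chooseBestFeatureToSplit(myDat)
0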

Note: the dataset passed to these functions must meet two requirements:

<1> The data must be a list of lists, and all instances (the inner lists) must have the same length.

<2> The last column, i.e. the last element of each instance, must be the class label of that instance.

(5) tree creation code

Python's dictionary type is used to store the tree structure; the function returns the tree as a nested dictionary, myTree.

# Create the tree; the tree structure is stored in a nested dictionary (myTree).
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]               # all classes identical: stop and return this label as a leaf node
    if len(dataSet[0]) == 1:              # all features have been used up
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]   # all values of the chosen feature
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
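Running this on the toy dataset gives the following tree (a quick check, assuming myDat and labels from createDataSet; note that createTree deletes entries from the labels list it is given, so pass a copy if you still need it):

>>> myTree = createTree(myDat, labels)
>>> myTree
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}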

The recursion ends when either all class labels in the current subset are identical, or all features have been used.


When all features are used up, majority voting is used to determine the classification of the leaf node. The code is as follows:

# Majority vote to decide the class of a leaf node -- used when all features
# are exhausted but the instances still belong to more than one class.
import operator

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]         # the most common class label

That is, if all attributes of the dataset have been processed but the class labels are still not unique, we must decide how to define the leaf node; in this case we usually use majority voting to determine its classification.
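A quick illustration with a made-up class list (assuming the operator import above):

>>> majorityCnt(['yes', 'no', 'no', 'yes', 'no'])
'no'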

(6) use decision trees to perform classification

# Use the decision tree to classify a test vector
def classify(inputTree, featLabels, testVec):
    firstStr = inputTree.keys()[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)    # index() finds the first element in featLabels that matches firstStr
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if type(secondDict[key]).__name__ == 'dict':
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:
                classLabel = secondDict[key]
    return classLabel
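A quick check, assuming myTree from section (5) and a fresh labels list (createTree removed an entry from the earlier one, so re-create it with createDataSet):

>>> myDat, labels = createDataSet()
>>> classify(myTree, labels, [1, 0])
'no'
>>> classify(myTree, labels, [1, 1])
'yes'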


The key point to note here is the use of recursion: classify calls itself until it reaches a leaf node.

(7) storage of Decision Trees

Constructing a decision tree is time-consuming. To save computing time, it is best to reuse an already constructed decision tree each time a classification is executed rather than rebuilding it. To do this we use the Python module pickle to serialize the object; a serialized object can be saved to disk and read back when needed.

# Store the decision tree on disk
def storeTree(inputTree, filename):
    import pickle                 # pickle serializes the object so it can be saved to disk
    fw = open(filename, 'w')
    pickle.dump(inputTree, fw)
    fw.close()

def grabTree(filename):           # read the tree back when needed
    import pickle
    fr = open(filename)
    return pickle.load(fr)
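A quick check, assuming myTree from above ('classifierStorage.txt' is just an example filename):

>>> storeTree(myTree, 'classifierStorage.txt')
>>> grabTree('classifierStorage.txt')
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}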


3: Matplotlib annotations

Matplotlib provides an annotation tool, annotate, which can add text annotations to plots; annotations are usually used to explain the content of the data.

I did not fully understand this code, so I only give the code from the book.

# -*- coding: cp936 -*-
import matplotlib.pyplot as plt

decisionNode = dict(boxstyle='sawtooth', fc='0.8')
leafNode = dict(boxstyle='round4', fc='0.8')
arrow_args = dict(arrowstyle='<-')

# Use a text annotation to draw a tree node
def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va='center', ha='center', bbox=nodeType,
                            arrowprops=arrow_args)

def createPlot():
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    createPlot.ax1 = plt.subplot(111, frameon=False)
    plotNode('a demo-node', (0.5, 0.1), (0.1, 0.5), decisionNode)
    plotNode('a leaf node', (0.8, 0.1), (0.3, 0.8), leafNode)
    plt.show()

# Get the number of leaf nodes and the depth of the tree
def getNumLeafs(myTree):
    numLeafs = 0
    firstStr = myTree.keys()[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs

def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = myTree.keys()[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth

# Updated createPlot code to draw the entire tree
def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)

def plotTree(myTree, parentPt, nodeTxt):
    # the first key tells you what feature was split on
    numLeafs = getNumLeafs(myTree)       # this determines the x width of this tree
    depth = getTreeDepth(myTree)
    firstStr = myTree.keys()[0]          # the text label for this node should be this
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            # test whether the node is a dictionary; if not, it is a leaf node
            plotTree(secondDict[key], cntrPt, str(key))      # recursion
        else:
            # it's a leaf node: print the leaf node
            plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD

# if you do get a dictionary you know it's a tree, and the first element will be another dict
def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)    # no ticks
    # createPlot.ax1 = plt.subplot(111, frameon=False)             # ticks for demo purposes
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5 / plotTree.totalW; plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()
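A quick check, assuming the plotting functions above and the myTree dictionary built in section 2 are in the same session:

>>> createPlot(myTree)    # opens a figure window showing the annotated tree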


4: Use decision trees to predict contact lens types
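The book's lenses example applies the functions above to a real dataset. A minimal sketch, assuming a tab-separated lenses.txt file with the features age, prescript, astigmatic and tearRate, as provided with <Machine Learning in Action>:

>>> fr = open('lenses.txt')
>>> lenses = [inst.strip().split('\t') for inst in fr.readlines()]
>>> lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
>>> lensesTree = createTree(lenses, lensesLabels)
>>> createPlot(lensesTree)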



Note: 1. These notes are based on the book <Machine Learning in Action>.

2. The KNN.py file and the data for these notes can be downloaded here (http://download.csdn.net/detail/lu597203933/7660737).

Source: small village chief, http://blog.csdn.net/lu597203933. You are welcome to reprint or share, but please be sure to declare the source of the article. (Sina Weibo: small village chief Zack. Thank you!)

