Python implementation of a decision tree
Advantages and disadvantages of Decision Tree algorithms:
- Advantages: low computational complexity, results that are easy for people to interpret, tolerance of missing values, and the ability to handle irrelevant features
- Disadvantage: prone to overfitting.
- Applicable data types: numeric and nominal
Algorithm idea:
1. The overall idea of decision tree construction
To put it bluntly, a decision tree works like a chain of if-else tests: starting at the root, the tree keeps testing features and following branches until it reaches a leaf node that gives the answer. The difference is that we do not write these if-else tests by hand; what we need is a method that lets the computer learn the decision tree from the data. The heart of that method is how to pick, out of all the available features, the most valuable one to test at each step, so that the tests from root to leaf are made in the best order. Once we can make that choice, the tree can be constructed recursively.
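As a hypothetical illustration (not from the original post), the same decision logic can be written as hand-coded if-else statements or stored as the nested-dict tree that the code below learns automatically; the feature names here match the example dataset used later:

def classify_by_hand(no_surfacing, flippers):
    # hand-written if-else version of the decision logic
    if no_surfacing == 1:
        if flippers == 1:
            return 'yes'
        return 'no'
    return 'no'

# The same logic as a learned tree: each key is the feature tested at a node,
# each value maps a feature value to a subtree (dict) or a class label (leaf).
tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}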
2. Information Gain
The principle of dividing a dataset is to make disordered data more orderly. Since this involves measuring how ordered or disordered the information is, it is natural to use information entropy (an alternative is the Gini impurity). The Shannon entropy formula is

H = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)

where p(x_i) is the proportion of instances belonging to class x_i. The information gain of a split is the reduction in entropy it achieves.
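As a worked example (not in the original post), for a dataset of five instances where two are labeled 'yes' and three 'no', the entropy is

H = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} \approx 0.971

and the information gain of a candidate split is this base entropy minus the weighted average entropy of the resulting subsets.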
Data requirements:
1. The data must be a list of lists (one inner list per instance), and every instance must have the same length.
2. The last element of each instance (i.e. the last column) must be the class label of that instance.
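For example, a dataset in the required format could look like this (the feature values are the standard two-feature example assumed throughout this post, with 1 meaning yes and 0 meaning no):

dataSet = [[1, 1, 'yes'],
           [1, 1, 'yes'],
           [1, 0, 'no'],
           [0, 1, 'no'],
           [0, 1, 'no']]
labels = ['no surfacing', 'flippers']  # names of the first two columns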
Functions:
calcShannonEnt(dataSet)
Calculates the Shannon entropy of a dataset in two steps: first count the frequency of each class label, then compute the entropy from those frequencies using the formula above.
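A minimal usage sketch (assuming the createDataSet() and calcShannonEnt() from the full listing below, with the example feature values shown above):

myDat, labels = createDataSet()
print(calcShannonEnt(myDat))  # roughly 0.971 for two 'yes' and three 'no' labels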
splitDataSet(dataSet, axis, value)
Splits a dataset: all instances satisfying instance[axis] == value are collected, and the resulting subset is returned with the axis-th feature removed (it has already been used for the split, so it is no longer needed).
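For instance (again assuming the example dataset above and the splitDataSet() defined below):

myDat, labels = createDataSet()
print(splitDataSet(myDat, 0, 1))  # instances with feature 0 == 1, with that feature removed:
                                  # [[1, 'yes'], [1, 'yes'], [0, 'no']]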
chooseBestFeatureToSplit(dataSet)
Selects the best feature to split on. The idea is simple: try splitting on every feature and see which split yields the largest information gain. A set() is used to collect the unique values of each feature, which is a very fast way to do it.
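A sketch of what this returns on the example dataset (the gain figures assume the feature values above):

myDat, labels = createDataSet()
print(chooseBestFeatureToSplit(myDat))  # 0: splitting on 'no surfacing' gives a gain
                                        # of about 0.420, versus about 0.171 for 'flippers'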
majorityCnt(classList)
Because the recursive tree construction consumes one feature at each split, it is possible to run out of features before every subset is pure. In that case the class of the node is decided by a majority vote over the remaining class labels.
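A quick sketch of its behaviour (assuming the majorityCnt() defined below):

print(majorityCnt(['yes', 'no', 'no']))  # -> 'no', the most frequent label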
createTree(dataSet, labels)
Builds the decision tree recursively. The labels argument holds the names of the features; it is not needed by the algorithm itself, but it makes the resulting tree easier to read and use later.
The complete code:

# coding=utf-8
import operator
from math import log
import time

def createDataSet():
    # Note: the feature values were lost in the original listing; the standard
    # two-feature example matching the labels (1 = yes, 0 = no) is assumed here.
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels

# Calculate Shannon entropy
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # the last column of each instance is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

# Because the recursive construction consumes one feature per split, the features
# may run out before classification is finished; in that case the node's class
# is decided by a majority vote.
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # pick the label with the highest count
    return max(classCount.items(), key=operator.itemgetter(1))[0]

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):  # all labels identical: stop splitting
        return classList[0]
    if len(dataSet[0]) == 1:  # all features have been used up
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]  # copy, so the caller's label list is not modified
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),
                                                  subLabels)
    return myTree

def main():
    data, label = createDataSet()
    t1 = time.time()
    myTree = createTree(data, label)
    t2 = time.time()
    print(myTree)
    print('execute for', t2 - t1)

if __name__ == '__main__':
    main()
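Running the script should print something like the following (assuming the example feature values filled in above; the elapsed time printed on the second line depends on the machine):

{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}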