Python implementation of the C4.5 decision tree algorithm (an improvement on ID3)
I. Introduction
C4.5 is mainly an improvement on ID3. ID3 selects, at each node, the attribute with the largest information gain as the splitting node. C4.5 introduces a new criterion, the information gain ratio, and instead selects the attribute with the highest gain ratio as the splitting node.
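Stated compactly (these are the standard definitions, not taken from the original article), the two algorithms choose the splitting attribute a for a data set S as

    ID3:   a^* = \arg\max_a \mathrm{Gain}(S, a)
    C4.5:  a^* = \arg\max_a \mathrm{GainRatio}(S, a) = \arg\max_a \frac{\mathrm{Gain}(S, a)}{\mathrm{SplitInfo}(S, a)}

where Gain is the information gain and SplitInfo the split information defined in the next two sections.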
II. Information Gain
Information gain is the selection criterion carried over from ID3: it measures how much splitting the data set on an attribute reduces the entropy of the class labels.
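The formula shown as an image in the original article is not reproduced here; the standard ID3 definitions it refers to are (S is the data set, p_c the proportion of samples in class c, A an attribute, and S_v the subset of S in which A takes the value v):

    \mathrm{Entropy}(S) = - \sum_c p_c \log_2 p_c

    \mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)

These are exactly the quantities that calcShannonEnt and the newEntropy loop of chooseBestFeatureToSplit compute in the code below.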
III. Information gain ratio
The information gain ratio is the information gain divided by the split information of the attribute (the entropy of the attribute's own value distribution).
For example, the following formula computes the gain ratio of the attribute "outlook":
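The worked formula for "outlook" appears as an image in the original article and is not reproduced here; the C4.5 definitions it applies are

    \mathrm{SplitInfo}(S, A) = - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}

    \mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)}

so for A = outlook the split information is computed from the proportions of sunny, overcast and rain rows in the training set, and the gain ratio is the information gain of outlook divided by that value.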
IV. Complete C4.5 code
from math import log
import operator


# Compute the Shannon entropy of the given data set.
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}  # class dictionary (class name -> number of samples of that class)
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:  # class not yet in the dictionary
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:  # accumulate the entropy contribution of each class
        prob = float(labelCounts[key]) / numEntries  # proportion of this class
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt  # return the entropy


# Split the data set on the given feature (axis) and value.
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:  # keep rows whose axis-th column equals value,
            reducedFeatVec = featVec[:axis]  # and drop that column
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet  # return the new matrix after splitting


# Choose the best attribute to split on (C4.5: highest information gain ratio).
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # number of attributes
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):  # evaluate every attribute
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)  # distinct values of attribute i
        newEntropy = 0.0
        splitInfo = 0.0
        for value in uniqueVals:  # entropy of each value subset, weighted by its probability
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))  # probability of this value in column i
            newEntropy += prob * calcShannonEnt(subDataSet)  # conditional entropy of attribute i
            splitInfo -= prob * log(prob, 2)  # split information of attribute i
        if splitInfo == 0.0:  # attribute has a single value and cannot split the data
            continue
        infoGain = (baseEntropy - newEntropy) / splitInfo  # information gain ratio of attribute i
        print(infoGain)
        if infoGain > bestInfoGain:  # remember the highest gain ratio and its column index i
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature


# Find the most frequent class name.
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


# Build the decision tree.
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]  # class labels of the training data (e.g. the outermost list is [N, N, Y, Y, Y, N, Y])
    if classList.count(classList[0]) == len(classList):  # all samples belong to one class: return it
        return classList[0]
    if len(dataSet[0]) == 1:  # only the class column is left (no attribute values): return the majority class
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)  # attribute with the highest gain ratio (index into the attribute list)
    bestFeatLabel = labels[bestFeat]  # attribute name used as the root of this subtree
    myTree = {bestFeatLabel: {}}  # create an empty tree with bestFeatLabel as its root
    del(labels[bestFeat])  # remove the chosen attribute from the attribute list
    featValues = [example[bestFeat] for example in dataSet]  # all training values of this attribute
    uniqueVals = set(featValues)  # set of distinct values of this attribute
    for value in uniqueVals:  # build one branch per attribute value
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)  # recurse on each branch
    return myTree  # the finished tree


# Classify a sample with the decision tree.
def classify(inputTree, featLabels, testVec):
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    classLabel = None
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if isinstance(secondDict[key], dict):
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:
                classLabel = secondDict[key]
    return classLabel


# Read the training data from the data file (returns a two-dimensional list).
def createTrainData():
    lines_set = open('../data/ID3/Dataset.txt').readlines()
    labelLine = lines_set[2]
    labels = labelLine.strip().split()
    lines_set = lines_set[4:11]
    dataSet = []
    for line in lines_set:
        data = line.split()
        dataSet.append(data)
    return dataSet, labels


# Read the test data from the data file (returns a two-dimensional list).
def createTestData():
    lines_set = open('../data/ID3/Dataset.txt').readlines()
    lines_set = lines_set[15:22]
    dataSet = []
    for line in lines_set:
        data = line.strip().split()
        dataSet.append(data)
    return dataSet


myDat, labels = createTrainData()
myTree = createTree(myDat, labels)
print(myTree)
bootList = ['outlook', 'temperature', 'humidity', 'windy']
testList = createTestData()
for testData in testList:
    dic = classify(myTree, bootList, testData)
    print(dic)
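If ../data/ID3/Dataset.txt is not available, the same functions can be exercised with a small in-memory data set. The sketch below is not part of the original article: it simply reuses the training rows from Section VI and replaces the file-based driver at the end of the script.

# Minimal sketch: build and query the tree without Dataset.txt
# (assumes the functions above are defined in the same module).
demo_data = [
    ['sunny',    'hot',  'high',   'false', 'N'],
    ['sunny',    'hot',  'high',   'true',  'N'],
    ['overcast', 'hot',  'high',   'false', 'Y'],
    ['rain',     'mild', 'high',   'false', 'Y'],
    ['rain',     'cool', 'normal', 'true',  'N'],
    ['overcast', 'cool', 'normal', 'true',  'Y'],
]
demo_labels = ['outlook', 'temperature', 'humidity', 'windy']

tree = createTree(demo_data, demo_labels[:])  # pass a copy: createTree deletes entries from the label list
print(tree)
print(classify(tree, demo_labels, ['sunny', 'cool', 'normal', 'false']))  # classified via the 'sunny' branch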
V. Differences between the C4.5 and ID3 code
The substantive difference from the ID3 implementation is inside chooseBestFeatureToSplit: ID3 scores each attribute by its information gain, while C4.5 additionally accumulates the split information and divides the gain by it to obtain the gain ratio.
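As a side-by-side sketch of that single difference (the helper names id3_score and c45_score are made up for this comparison and do not appear in the code above):

def id3_score(baseEntropy, newEntropy, splitInfo):
    # ID3 ranks attributes by the information gain alone
    return baseEntropy - newEntropy

def c45_score(baseEntropy, newEntropy, splitInfo):
    # C4.5 divides the gain by the split information to get the gain ratio
    return (baseEntropy - newEntropy) / splitInfo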
VI. Training and test data set example
Training set:

outlook   temperature  humidity  windy  class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  Y
rain      mild         high      false  Y
rain      cool         normal    true   N
overcast  cool         normal    true   Y

Test set:

outlook   temperature  humidity  windy
sunny     mild         high      false
sunny     cool         normal    false
rain      mild         normal    false
sunny     mild         normal    true
overcast  mild         high      true
overcast  hot          normal    false
rain      mild         high      true
This is the complete Python implementation of the C4.5 decision tree algorithm (an improvement on ID3). I hope it provides a useful reference.