Decision tree learning is one of the most widely used inductive inference algorithms. It is a method for approximating discrete-valued target functions, and the learned function is represented as a decision tree. A decision tree can take an unfamiliar collection of data and extract a set of rules from it, which the machine learning algorithm then uses to classify new samples. The advantages of decision trees are: low computational complexity, output that is easy to understand, insensitivity to missing intermediate values, and the ability to handle irrelevant features. The disadvantage is that they can overfit the training data. Decision trees are suitable for both discrete and continuous data types.
The most important question in building a decision tree is how to select the feature used for partitioning the data at each step.
The ID3 algorithm is generally chosen. The core problem of the ID3 algorithm is to select, at each node of the tree, the feature (attribute) to be tested, choosing the attribute that is most helpful for classifying the instances. How can the value of an attribute be measured quantitatively? The concepts of entropy and information gain need to be introduced here. Entropy is a metric widely used in information theory, and it describes the purity of a sample set. For a sample set S containing c classes, where p_i is the proportion of samples belonging to class i, the entropy is defined as Entropy(S) = −Σ_{i=1..c} p_i · log2(p_i).
Suppose there are 10 training samples, of which 6 are labeled Yes and 4 are labeled No. What is the entropy? In this example the number of classes is 2 (Yes, No), the probability of Yes is 0.6 and of No is 0.4, so the entropy is:

Entropy(S) = −0.6 · log2(0.6) − 0.4 · log2(0.4) ≈ 0.971
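As a quick sanity check, the same number can be reproduced with a few lines of Python (a standalone sketch, independent of the complete code further below):

from math import log

# Entropy of a set with 6 samples labeled Yes and 4 labeled No (10 in total)
p_yes, p_no = 0.6, 0.4
entropy = -p_yes * log(p_yes, 2) - p_no * log(p_no, 2)
print(round(entropy, 3))   # prints 0.971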
The information gain of an attribute A relative to a sample set S is defined as:

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

where Values(A) is the set of all possible values of attribute A, and S_v is the subset of S for which attribute A has value v. The first term of the formula is the entropy of the original set S; the second term is the expected entropy after S is partitioned by A, i.e. the weighted sum of the entropies of the subsets, where each weight |S_v| / |S| is the proportion of samples in S that belong to S_v. Gain(S, A) is therefore the expected reduction in entropy obtained by knowing the value of attribute A.
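To make the formula concrete, here is a small standalone sketch that computes Gain(S, A) for a made-up attribute A; the two subsets below are purely illustrative and not taken from any real data set:

from math import log

def entropy(labels):
    # Shannon entropy of a list of class labels
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    total = float(len(labels))
    return -sum(c / total * log(c / total, 2) for c in counts.values())

# 10 samples: 6 Yes, 4 No (entropy of S is about 0.971, as computed above)
labels = ['yes'] * 6 + ['no'] * 4
# Hypothetical attribute A with two values; each subset lists the labels of S_v
subsets = [['yes'] * 6 + ['no'],     # A = v1: 6 Yes, 1 No
           ['no'] * 3]               # A = v2: 3 No
expected = sum(len(s) / float(len(labels)) * entropy(s) for s in subsets)
gain = entropy(labels) - expected
print(round(gain, 3))                # prints 0.557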
The complete code:
from math import log
import operator


def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels


def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}                          # dictionary: class label -> count
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt


# Split the data set: keep the rows whose feature at index 'axis' equals 'value',
# and remove that feature from the returned rows
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet


# Try every feature, split the data set on it, and return the feature
# whose split gives the largest information gain
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1         # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)     # entropy of the whole data set
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]  # all values taken by feature i
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:              # expected entropy after splitting on feature i
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))  # weight of this subset
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy   # information gain of feature i
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature


# The functions above are all the building blocks of the decision tree.
# After the data set is split on the best attribute, each partition is passed
# down to the next node of the branch. The recursion ends when all features
# have been used up, or when every instance under a branch has the same class.
# If all instances share one class, that branch becomes a leaf node.
# If all features have been used but the class labels still differ, a majority vote is used.


# Return the class name that occurs most often
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


# Create the decision tree recursively
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]    # class labels of all samples
    if classList.count(classList[0]) == len(classList):
        return classList[0]               # all classes are identical: no further split
    if len(dataSet[0]) == 1:              # only the class label column is left:
        return majorityCnt(classList)     # all features used up, take a majority vote
    bestFeat = chooseBestFeatureToSplit(dataSet)
    print('the bestFeature in creating is:')
    print(bestFeat)
    bestFeatLabel = labels[bestFeat]      # for this data set the first one is 'no surfacing'
    myTree = {bestFeatLabel: {}}          # nested dictionary; its value starts empty
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]  # values of the chosen feature
    uniqueVals = set(featValues)
    for value in uniqueVals:              # build the next level for each value of the feature
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree


# A small test of the functions above on the simple data set
def testTree1():
    myDat, labels = createDataSet()
    val = calcShannonEnt(myDat)
    print('the classify accuracy is: %.2f%%' % val)   # actually the Shannon entropy of the set
    retDataSet1 = splitDataSet(myDat, 0, 1)
    print(myDat)
    print(retDataSet1)
    retDataSet0 = splitDataSet(myDat, 0, 0)
    print(myDat)
    print(retDataSet0)
    bestFeature = chooseBestFeatureToSplit(myDat)
    print('the bestFeature is:')
    print(bestFeature)
    tree = createTree(myDat, labels)
    print(tree)
The corresponding results are:
>>> import tree
>>> tree.testTree1()
the classify accuracy is: 0.97%
[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
[[1, 'yes'], [1, 'yes'], [0, 'no']]
[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
[[1, 'no'], [1, 'no']]
the bestFeature is:
0
the bestFeature in creating is:
0
the bestFeature in creating is:
0
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
Next, it is best to add a function that uses the decision tree to classify new samples.
Also, because building a decision tree is time-consuming, it is best to serialize the constructed tree with Python's pickle module, save the object to disk, and then read it back when it is needed.
def classify(inputTree, featLabels, testVec):
    firstStr = list(inputTree.keys())[0]       # feature name stored at this node
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict):          # still an internal node: recurse
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else:                                      # leaf node: its value is the class label
        classLabel = valueOfFeat
    return classLabel


def storeTree(inputTree, filename):
    import pickle
    fw = open(filename, 'wb')                  # binary mode for pickle
    pickle.dump(inputTree, fw)
    fw.close()


def grabTree(filename):
    import pickle
    fr = open(filename, 'rb')
    return pickle.load(fr)
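For completeness, a short usage sketch of these functions might look as follows; the file name 'classifierStorage.txt' is only an illustration:

# Build a tree, classify two test vectors, then persist and reload the tree
myDat, labels = createDataSet()
myTree = createTree(myDat, labels[:])        # pass a copy: createTree deletes used labels
print(classify(myTree, labels, [1, 0]))      # expected: 'no'
print(classify(myTree, labels, [1, 1]))      # expected: 'yes'

storeTree(myTree, 'classifierStorage.txt')   # serialize the tree with pickle
print(grabTree('classifierStorage.txt'))     # read it back from disk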
This concludes the Python decision tree example.