Python implements decision tree ID3 algorithm

Main ideas:
0. Training-set row format: feature 1, feature 2, ..., feature N, category. The label is kept in the last column; in the input file the category actually comes first, and createDataSet moves it to the end of each row (see the sample line after this list).
1. The tree is represented recursively as a nested Python dictionary (a sketch follows this list).
2. The information gain ID3 computes is always the gain with respect to the class, so what is calculated at every step is the entropy of the class labels (a quick numeric check follows this list).
3. Each time ID3 splits the data on the best feature, that feature is consumed, i.e. its column is removed from the resulting subsets.
4. Once enough features have been consumed, instances with identical feature values but different categories can appear. In that case no split gives a positive information gain, so no best feature can be chosen and chooseBestFeatureToSplit returns -1. The caller has to check for -1; otherwise Python would silently treat index -1 as the last column, i.e. the category itself, as the best feature (a toy example follows this list).
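To make points 0 and 1 concrete, here is a small sketch; the feature values and class names below are made up, not taken from the original data:

# One training line: the category comes first, followed by the ten feature columns,
# separated by tabs (createDataSet below moves the category to the end of each row):
#   web   10   0   0   1   192   168   1   1   80   example.com
#
# The learned tree is a nested dict: each key is a feature name and each value maps
# a feature value either to a sub-tree (another dict) or to a class label.
sample_tree = {'sport': {'80': 'web',
                         '53': {'domain': {'example.com': 'web',
                                           'other': 'dns'}}}}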
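A quick numeric check of point 2, with made-up labels: the entropy is taken over the class column alone, which is exactly what calcShannonEnt below computes for a data set.

from math import log

class_column = ['web', 'web', 'web', 'dns', 'dns']   # class labels of five samples
probs = [class_column.count(c) / float(len(class_column)) for c in set(class_column)]
entropy = -sum(p * log(p, 2) for p in probs)          # 0.971 bits for a 3/2 split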
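And a toy illustration of points 3 and 4 (the rows are hypothetical): splitting removes the consumed feature's column, so two rows can end up with identical features but different labels. Every candidate split then leaves the class entropy unchanged, chooseBestFeatureToSplit finds no gain and returns -1, and createTree falls back to the first class:

rows = [['80', 'example.com', 'web'],    # same remaining features...
        ['80', 'example.com', 'dns']]    # ...but a different category
# chooseBestFeatureToSplit(rows) returns -1, so createTree returns rows[0][-1], i.e. 'web'.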
# coding=utf-8
import operator
import sys
import time
from math import log


def createDataSet(traindatafile):
    print traindatafile
    dataSet = []
    try:
        fin = open(traindatafile)
        for line in fin:
            line = line.strip()
            cols = line.split('\t')
            # column 0 is the category; move it to the end so the label is the last column
            row = [cols[1], cols[2], cols[3], cols[4], cols[5], cols[6], cols[7],
                   cols[8], cols[9], cols[10], cols[0]]
            dataSet.append(row)
            # print row
    except:
        print 'Usage: xxx.py traindatafilepath outputtreefilepath'
        sys.exit()
    labels = ['cip1', 'cip2', 'cip3', 'cip4', 'sip1', 'sip2', 'sip3', 'sip4', 'sport', 'domain']
    print 'dataSet len', len(dataSet)
    return dataSet, labels


# calc Shannon entropy of the class labels
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]  # the entropy of the class (last column) is computed each time
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt


# return the subset of rows whose feature `axis` equals `value`, with that column removed
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet


def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # last col is the label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature  # -1 means no split gives a positive information gain


# features are exhausted: return the majority label
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    return max(classCount.items(), key=operator.itemgetter(1))[0]


def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):  # all data has the same label
        return classList[0]
    if len(dataSet[0]) == 1:  # all features are exhausted
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    if bestFeat == -1:  # identical features but different categories: the class no longer
        return classList[0]  # depends on the remaining features, so just take the first class
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del labels[bestFeat]
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree


def main():
    data, label = createDataSet(sys.argv[1])
    t1 = time.clock()
    myTree = createTree(data, label)
    t2 = time.clock()
    fout = open(sys.argv[2], 'w')
    fout.write(str(myTree))
    fout.close()
    print 'execute for', t2 - t1


if __name__ == '__main__':
    main()
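The script is written for Python 2 (print statements and time.clock). Assuming it is saved as, say, id3.py — the file names here are just placeholders — it takes the training file and an output path on the command line:

python id3.py train.txt tree.txt

The tree is written to the output file as the string form of the nested dictionary, so it can be read back later with, for example, ast.literal_eval.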