Implementing the ID3 decision tree algorithm in Python

Source: Internet
Author: User
Tags: id3

Main ideas:

0. Training set format: feature 1, feature 2, ..., feature N, class label. Each line of the training file is one instance with tab-separated fields, and the class label ends up in the last column.

1. The tree is represented recursively as nested Python dictionaries (see the sketch after this list).

2. The information gain in ID3 is taken with respect to the class, so every split evaluation computes the entropy of the class column (a worked example also follows the list).

3. Each time ID3 splits the data on the best feature, that feature is consumed (removed from the remaining candidates).

4. Once enough features have been consumed, identical feature vectors with different class labels can remain. At that point no feature can separate them, so no best feature exists and chooseBestFeatureToSplit returns -1.

The caller must catch that -1; otherwise Python would treat -1 as a column index and pick the last column (the class label itself) as the "best feature".
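To make item 1 concrete, here is a minimal sketch of the nested-dictionary representation. The feature names sport and domain come from the code below; the feature values and class labels are invented for illustration. Each internal node is a single-key dict mapping a feature name to a dict of feature value -> subtree or leaf label:

    # Hypothetical tree: feature names taken from the code below,
    # values and class labels invented for illustration only.
    tree = {
        'sport': {                  # internal node: split on feature 'sport'
            '80': 'benign',         # leaf: a class label string
            '53': {                 # subtree: split again on 'domain'
                'domain': {
                    'a.com': 'benign',
                    'b.com': 'malicious',
                }
            },
        }
    }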
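And item 2 in code form: a self-contained sketch of the entropy and information-gain arithmetic that calcShannonEnt and chooseBestFeatureToSplit below perform, run on a toy class column (the toy values are invented):

    from math import log

    def entropy(classes):
        # Shannon entropy: H = -sum(p * log2(p)) over class frequencies.
        counts = {}
        for c in classes:
            counts[c] = counts.get(c, 0) + 1
        n = float(len(classes))
        return -sum((k / n) * log(k / n, 2) for k in counts.values())

    parent = ['a', 'a', 'a', 'b']          # toy class column
    print(entropy(parent))                 # ~0.811 bits

    # Gain of a split that separates the classes perfectly:
    # H(parent) - sum(|child|/|parent| * H(child)) = 0.811 - 0
    left, right = ['a', 'a', 'a'], ['b']
    gain = entropy(parent) - (0.75 * entropy(left) + 0.25 * entropy(right))
    print(gain)                            # ~0.811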

# coding=utf-8
import sys
import time
from math import log


def createDataSet(traindatafile):
    """Load the training set: one instance per line, tab-separated.
    Column 0 of the file is the class label; columns 1-10 are the
    features, so each row is reordered to put the class label last."""
    print(traindatafile)
    dataSet = []
    try:
        with open(traindatafile) as fin:
            for line in fin:
                cols = line.strip().split('\t')
                row = cols[1:11] + [cols[0]]   # features first, class label last
                dataSet.append(row)
    except (OSError, IndexError):
        print('Usage: xxx.py traindatafilepath outputtreefilepath')
        sys.exit()
    labels = ['cip1', 'cip2', 'cip3', 'cip4',
              'sip1', 'sip2', 'sip3', 'sip4', 'sport', 'domain']
    print('dataset len', len(dataSet))
    return dataSet, labels


# calc Shannon entropy of the class column
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]   # always the class label: class entropy each time
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt


def splitDataSet(dataSet, axis, value):
    """Return the rows whose feature `axis` equals `value`, with that
    feature column removed (the feature is consumed)."""
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            retDataSet.append(featVec[:axis] + featVec[axis + 1:])
    return retDataSet


def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1    # last col is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        uniqueVals = set(example[i] for example in dataSet)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy   # information gain of splitting on i
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature   # stays -1 if no feature improves on the base entropy


# features are exhausted: return the majority class label
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # the original max(classCount) returned the largest *key*, not the
    # most frequent one; key=classCount.get returns the majority label
    return max(classCount, key=classCount.get)


def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):   # all rows share one label
        return classList[0]
    if len(dataSet[0]) == 1:                              # all features are exhausted
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    if bestFeat == -1:
        # identical feature vectors with different labels: no feature
        # separates them, so fall back to the first class label
        return classList[0]
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del labels[bestFeat]                                  # consume the feature
    uniqueVals = set(example[bestFeat] for example in dataSet)
    for value in uniqueVals:
        subLabels = labels[:]                             # fresh copy per branch
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree


def main():
    data, labels = createDataSet(sys.argv[1])
    t1 = time.perf_counter()
    myTree = createTree(data, labels)
    t2 = time.perf_counter()
    with open(sys.argv[2], 'w') as fout:
        fout.write(str(myTree))
    print('execute for', t2 - t1)


if __name__ == '__main__':
    main()
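The article stops at building and serializing the tree. For completeness, here is a hypothetical classify helper (not part of the original code) showing how the nested-dict tree would be walked at prediction time. Note that createTree deletes entries from labels in place, so you would pass a copy (labels[:]) when building the tree if you still need the full label list for classification:

    def classify(tree, featLabels, featVec):
        # Hypothetical helper, not in the original article. Walk the
        # nested-dict tree until a leaf (a plain class label) is reached.
        if not isinstance(tree, dict):
            return tree                     # leaf: the class label itself
        featName = next(iter(tree))         # the feature this node splits on
        featIndex = featLabels.index(featName)
        subtree = tree[featName].get(featVec[featIndex])
        if subtree is None:
            return None                     # feature value unseen during training
        return classify(subtree, featLabels, featVec)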
