1. A brief introduction to the theory
There are several types of decision trees, such as CART, ID3, and C4.5. CART splits on Gini impurity, while ID3 and C4.5 split on information entropy; the resulting trees are often very similar. The definitions below are given for the ID3 algorithm, so we first introduce information entropy.
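(For reference, and not part of the derivation below: the Gini impurity used by CART is Gini(D) = 1 - Σ_{i=1}^{m} p_i², with p_i the class probabilities defined in the next subsection, whereas ID3 and C4.5 rely on the entropy-based measures introduced next.)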
1.1 Entropy
Let D be the training set, partitioned by class label. The entropy of D is then defined as:

info(D) = - Σ_{i=1}^{m} p_i · log2(p_i)
where m is the number of label classes in the training set and p_i is the probability that an element belongs to the i-th class; it can be estimated as the number of elements of that class divided by the total number of elements in the training set. The term -log2(p_i) measures the uncertainty of event i and is called its self-information. The practical meaning of the entropy is the average amount of information needed to identify the class label of an element of D.
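For example, a data set split evenly between two classes (p_1 = p_2 = 0.5) has entropy -0.5·log2(0.5) - 0.5·log2(0.5) = 1 bit, the maximum for two classes, while a completely pure data set (a single class) has entropy 0.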
1.2 Information Gain
Now suppose the training set D is partitioned by feature A. The expected (conditional) entropy of D given A is:

info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) · info(D_j)
where v is the number of distinct values taken by feature A, |D_j| is the number of training elements for which feature A takes its j-th value, and |D| is the total number of elements in the training set.
The information gain obtained by partitioning the training set D on feature A is then:

gain(A) = info(D) - info_A(D)
1.3 The principle of the ID3 decision tree
From information theory we know that the smaller the expected (conditional) entropy after a split, the greater the information gain and the purer the resulting subsets. The core idea of the ID3 algorithm is therefore to use information gain as the feature-selection metric and, at each split, choose the feature whose split yields the greatest information gain. Below we use an example of fake-account detection in an SNS community to illustrate how to construct a decision tree with the ID3 algorithm. For simplicity, assume the training set consists of 10 elements.
In the training data, s, m, and l denote small, medium, and large, respectively.
First calculate the total entropy. There are 10 elements in all, and the label has two classes: "yes" (7 records), meaning the account is real, and "no" (3 records), meaning the account is not real.
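From these class counts the total entropy is:

info(D) = -(7/10)·log2(7/10) - (3/10)·log2(3/10) ≈ 0.360 + 0.521 ≈ 0.881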
Let L, F, H, and R denote the log density, the friend density, whether a real avatar is used, and whether the account is real, respectively. We now compute the information gain of each feature.
Start with the log density L. It takes three values: s (3 records), m (4 records), and l (3 records). When the log density is s, 1 record has a real account and 2 do not; when the log density is m, 3 records have a real account and 1 does not; when the log density is l, 3 records have a real account and none are fake.
From these counts we can compute the conditional entropy of D given the log density and, from it, the information gain of this feature.
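Working through the numbers above (rounded to three decimals):

info_L(D) = (3/10)·info(D_s) + (4/10)·info(D_m) + (3/10)·info(D_l)
          ≈ 0.3 × 0.918 + 0.4 × 0.811 + 0.3 × 0
          ≈ 0.600

where info(D_s) = -(1/3)·log2(1/3) - (2/3)·log2(2/3) ≈ 0.918, info(D_m) = -(3/4)·log2(3/4) - (1/4)·log2(1/4) ≈ 0.811, and info(D_l) = 0 because that subset is pure. Therefore:

gain(L) = info(D) - info_L(D) ≈ 0.881 - 0.600 ≈ 0.281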
Similarly, one can calculate gain(F) = 0.553 and gain(H) = 0.033. Since gain(F) is the largest, the first split uses F (friend density) as the splitting feature. The result of the split is as follows:
In the resulting tree diagram, green nodes represent test conditions and red nodes represent decision results.
Applying the same procedure recursively to determine the splitting feature of each child node yields the complete decision tree.
1.4 Advantages and disadvantages of decision trees
Advantages: the output is easy to understand, and the method is not sensitive to missing intermediate values.
Disadvantages: prone to overfitting.
2. Code implementation
2.1 Calculating entropy
The first step is to calculate the entropy of a given data set.
from math import log

# Calculate the Shannon entropy of a given data set
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:  # count the unique class labels and their occurrences
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)  # log base 2
    return shannonEnt
Test the method:
# Create a sample data set
def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']  # feature names (discrete values)
    return dataSet, labels

myDat, labels = createDataSet()
print 'myDat: ', myDat
print 'entropy_myDat: ', calcShannonEnt(myDat)
Operation Result:
myDat:  [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
entropy_myDat:  0.970950594455
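As a sanity check, the data set contains two 'yes' and three 'no' labels, so the entropy should be -(2/5)·log2(2/5) - (3/5)·log2(3/5) ≈ 0.971, which matches the output.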
2.2 Splitting data sets
We compute the entropy of a data set in order to split it. The first step is to find the best splitting feature; the second step is to split the data set on that feature. Here we address the second question first: assuming the splitting feature has already been chosen, how do we split the data set on it?
# Split a data set on a given feature
# dataSet: the data set (m x n); axis: the index of the splitting feature; value: the feature value to select
# Returns the subset of records whose feature at index axis equals value, with that feature column removed
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]  # chop out the axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
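Note that reducedFeatVec is built from slices of featVec combined with extend(), so the original records in dataSet are left untouched; deleting the column from featVec in place would corrupt the data set for later splits.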
To test the method, use the previous data set:
print 'Split the data set on feature 0 as the splitting feature.'
print 'Subset where the feature value is 1: ', splitDataSet(myDat, 0, 1)
print 'Subset where the feature value is 0: ', splitDataSet(myDat, 0, 0)
Operation Result:
Split the data set on feature 0 as the splitting feature.
Subset where the feature value is 1:  [[1, 'yes'], [1, 'yes'], [0, 'no']]
Subset where the feature value is 0:  [[1, 'no'], [1, 'no']]
What remains is to find the best splitting feature:
# Return the index of the best splitting feature
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # the last column holds the class label, so subtract 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1  # initialize
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]  # all values of the i-th feature
        uniqueVals = set(featList)  # de-duplicate to get the set of values of the i-th feature
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
Note that the data set passed in must meet the following criteria:
1) The data must be a list of lists, and all inner lists must have the same length.
2) The last element of each instance (the last column) is the class label of that instance.
Test the method:
print 'The index of the best splitting feature is: ', chooseBestFeatureToSplit(myDat)
Operation Result:
The index of the best splitting feature is:  0
2.3 Recursively building the decision tree
Now that we can split the data set on a splitting feature, the next step is to build the decision tree by splitting the data set repeatedly. This can be done with recursion. The recursion stops when all features have been used up, or when all instances under a branch share the same class. If the data set has exhausted all features but the class labels are still not unique, we usually decide the class of the leaf node by majority vote.
Let's first implement the code for the majority vote:
import operator

# Majority vote: return the class that occurs most often
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
The following code creates the decision tree:
# Create the decision tree recursively
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]  # stop splitting when all classes are identical
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)  # all features used up: return the majority class
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]  # copy the labels so the recursion doesn't mess up the existing list
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
Test the function above:
myTree = createTree(myDat, labels)
print myTree
Operation Result:
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
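The nested dictionary reads as follows: first test the feature 'no surfacing'; if its value is 0 the class is 'no'; if it is 1, test 'flippers', where 0 gives 'no' and 1 gives 'yes'.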
2.4 Using decision trees for classification
We have built a decision tree from the training data; now we can use it to classify new data. To perform classification we need the decision tree and the label vector that was used to construct it. The program compares the test data against the values in the decision tree and recurses until it reaches a leaf node; the test data is then assigned the class of that leaf node.
# Use the decision tree to perform classification
def classify(inputTree, featLabels, testVec):
    firstStr = inputTree.keys()[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)  # index() finds the first element in featLabels that matches firstStr
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if type(secondDict[key]).__name__ == 'dict':
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:
                classLabel = secondDict[key]
    return classLabel
Test the method:
myDat, labels = createDataSet()
print 'myTree: ', myTree
print 'labels: ', labels
print classify(myTree, labels, [1, 0])
print classify(myTree, labels, [1, 1])
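Following the tree printed above, the expected output is:

myTree:  {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
labels:  ['no surfacing', 'flippers']
no
yes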
Classify can also be implemented as follows:
def classify(inputTree, featLabels, testVec):
    firstStr = inputTree.keys()[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict):
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else:
        classLabel = valueOfFeat
    return classLabel
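This second version looks up the branch directly with testVec[featIndex] as the dictionary key instead of looping over secondDict.keys(). It is shorter, but it raises a KeyError if the test vector contains a feature value that never appeared in the training data, whereas the loop-based version would instead fail at the return with an unassigned classLabel in that case.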
2.5 Storing the decision tree
Because it is time-consuming to construct a decision tree, you can consider storing the created decision tree on your hard disk.
# Store and load a decision tree with pickle
def storeTree(inputTree, filename):
    import pickle
    fw = open(filename, 'w')
    pickle.dump(inputTree, fw)
    fw.close()

def grabTree(filename):
    import pickle
    fr = open(filename)
    return pickle.load(fr)
Test:
storeTree(myTree, 'myTree.txt')
print grabTree('myTree.txt')
Output:
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
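The code in this article follows the Python 2 style of Machine Learning in Action (print statements, dict.iteritems(), dict.keys()[0] indexing). As a minimal sketch under Python 3 (an adaptation, not part of the original article), the storage functions would open the files in binary mode, which pickle requires there:

import pickle

# Python 3 variant: pickle needs binary file mode
def store_tree(input_tree, filename):
    with open(filename, 'wb') as fw:
        pickle.dump(input_tree, fw)

def grab_tree(filename):
    with open(filename, 'rb') as fr:
        return pickle.load(fr)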
3. Machine Learning in Action: decision trees