"Reading notes" machine learning combat-decision tree (1)

Source: Internet
Author: User

Introduction to the algorithm

The kNN algorithm from the previous chapter applies statistical knowledge to make predictions and can handle many classification tasks, but its biggest drawback is that it cannot reveal the intrinsic meaning of the data. A decision tree, by contrast, represents the data in a form that is very easy to understand, and decision tree results are often used in expert systems.

The process of building a decision tree:

Check whether every item in the data set belongs to the same class:
    if so, return the class label
    else:
        find the best feature for splitting the data set
        split the data set
        create a branch node
        for each split subset:
            call createBranch and add the returned result to the branch node
        return the branch node

General flow of decision trees

    1. Collect data
    2. Prepare the data: the tree-building algorithm works only on nominal data, so numerical data must be discretized
    3. Analyze the data: after the tree is constructed, check that it matches expectations
    4. Train the algorithm: construct the tree data structure
    5. Test the algorithm: calculate the error rate
Information gain

The change in information before and after splitting the data set is called the information gain; the split that produces the highest information gain is the best choice. In other words, information gain and entropy are the attribute-selection functions of the decision tree. Entropy measures the disorder of the information in a data set, the same meaning the term carries in other fields.
The entropy used in the book is calculated as follows. The information of a class $x_i$ is defined as

$$l(x_i) = -\log_2 p(x_i)$$

where $p(x_i)$ is the probability of selecting that class.
The expected value of this information over all classes is the entropy:

$$H = -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i)$$
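Putting the two together (this formula is not written out in these notes, but it is the standard definition that the code in this chapter implements): if splitting on a feature A partitions the data set D into subsets D_v, the information gain is the base entropy minus the probability-weighted entropy of the subsets:

$$\operatorname{Gain}(D, A) = H(D) - \sum_{v} \frac{|D_v|}{|D|}\, H(D_v)$$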
The code that computes this entropy is as follows:

from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        # count the unique class labels and their occurrences
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)   # log base 2
    return shannonEnt
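As a quick sanity check, here is a small toy data set in the spirit of the book's sea-animal example (the feature names and values are illustrative, not taken from this excerpt):

# two nominal features plus a class label in the last column
myDat = [[1, 1, 'yes'],
         [1, 1, 'yes'],
         [1, 0, 'no'],
         [0, 1, 'no'],
         [0, 1, 'no']]
labels = ['no surfacing', 'flippers']

print(calcShannonEnt(myDat))   # about 0.9710 (2 'yes' versus 3 'no')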
Partitioning data sets

The key to partitioning a data set is finding the right feature to split on. We try splitting on each feature in the data set, calculate the entropy of each resulting partition, and pick the best split by comparing the results.

To split a data set on a given feature and feature value:

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            # remove the feature used for splitting
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
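Continuing with the illustrative myDat list defined above, splitting on feature 0 keeps only the matching rows and drops that column:

print(splitDataSet(myDat, 0, 1))   # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(splitDataSet(myDat, 0, 0))   # [[1, 'no'], [1, 'no']]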

We then compute the entropy of each candidate split over the remaining features; the feature whose split yields the largest reduction in entropy (the largest information gain) is the best feature for dividing the data set.

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      # the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):           # iterate over all the features
        featList = [example[i] for example in dataSet]   # all values of this feature
        uniqueVals = set(featList)         # the set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy    # the info gain, i.e. the reduction in entropy
        if infoGain > bestInfoGain:            # keep the best gain so far
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature                         # returns an integer index
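On the illustrative myDat list above, the first feature gives the larger information gain (roughly 0.42 versus 0.17), so the function returns index 0:

print(chooseBestFeatureToSplit(myDat))   # 0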
Recursively constructing the decision tree

Combining the examples in the book, we can see very intuitively how a decision tree is built: each step is simply the data-set-splitting process described above, applied again.
Recursion terminates when all features in the data set have been used up, or when all instances under a branch belong to the same class.
But what if all features have already been used and some branch still contains instances that do not all belong to the same class? The method given in the book is a majority vote, which is a reasonable choice.

import operator

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # the book uses iteritems() here (Python 2); items() works in Python 2 and 3
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
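A quick illustrative call:

print(majorityCnt(['yes', 'no', 'no']))   # 'no'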

This is very similar to the voting portion of the KNN algorithm.

The next step is the code that creates the decision tree, based on the methods above:

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    # stop when all instances under this branch have the same class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # stop when every feature in the data set has been used for splitting;
    # the two cases are combined here by taking a majority vote, regardless of
    # whether the instances under the last branch belong to the same class
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        # a list argument is passed by reference in Python, so copy the labels here
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
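Running this on the illustrative myDat / labels pair defined earlier produces a nested dictionary representing the tree (createTree mutates labels via del, so a copy is passed here to keep the original list intact):

myTree = createTree(myDat, labels[:])
print(myTree)
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}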

"Reading notes" machine learning combat-decision tree (1)

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.