"Reading notes" machine learning combat-decision tree (1)

Source: Internet
Author: User

Introduction to the algorithm

The kNN algorithm from the previous chapter applies statistical knowledge to make predictions and can handle many classification tasks, but its biggest drawback is that it cannot reveal the intrinsic meaning of the data. A decision tree, by contrast, represents the data in a form that is very easy to understand, and decision tree results are often used in expert systems.

The process of building a decision tree:

Check whether every item in the data set belongs to the same class:
    if so, return the class label
    else:
        find the best feature for splitting the data set
        split the data set
        create a branch node
        for each split subset:
            call createBranch and add the returned result to the branch node
        return the branch node

General flow of decision trees

    1. Collect data
    2. Prepare the data: the tree-building algorithm works only on nominal data, so numerical data must be discretized
    3. Analyze the data: after the tree is constructed, check that it matches expectations
    4. Train the algorithm: construct the tree data structure
    5. Test the algorithm: calculate the error rate
Information gain

The change in information before and after splitting the data set is called the information gain; the split that produces the highest information gain is the best choice. In other words, information gain and entropy are the attribute-selection functions of the decision tree. Entropy measures the disorder of the information in a data set, the same meaning the term carries in other fields.
The entropy used in the book is calculated as follows. The information of a class $x_i$ is defined as

$$l(x_i) = -\log_2 p(x_i)$$

where $p(x_i)$ is the probability of selecting that class.
The expected value of this information over all classes is the entropy:

$$H = -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i)$$
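Putting the two together (this formula is not written out in these notes, but it is the standard definition that the code in this chapter implements): if splitting on a feature A partitions the data set D into subsets D_v, the information gain is the base entropy minus the probability-weighted entropy of the subsets:

$$\operatorname{Gain}(D, A) = H(D) - \sum_{v} \frac{|D_v|}{|D|}\, H(D_v)$$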
The code that computes this entropy is as follows:

from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        # count the unique class labels and their occurrences
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)   # log base 2
    return shannonEnt
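As a quick sanity check, here is a small toy data set in the spirit of the book's sea-animal example (the feature names and values are illustrative, not taken from this excerpt):

# two nominal features plus a class label in the last column
myDat = [[1, 1, 'yes'],
         [1, 1, 'yes'],
         [1, 0, 'no'],
         [0, 1, 'no'],
         [0, 1, 'no']]
labels = ['no surfacing', 'flippers']

print(calcShannonEnt(myDat))   # about 0.9710 (2 'yes' versus 3 'no')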
Partitioning data sets

The key to partitioning a data set is finding the right feature to split on. We try splitting on each feature in the data set, calculate the entropy of each resulting partition, and pick the best split by comparing the results.

To split a data set on a given feature and feature value:

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            # remove the feature used for splitting
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
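Continuing with the illustrative myDat list defined above, splitting on feature 0 keeps only the matching rows and drops that column:

print(splitDataSet(myDat, 0, 1))   # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(splitDataSet(myDat, 0, 0))   # [[1, 'no'], [1, 'no']]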

We then compute the entropy of each candidate split over the remaining features; the feature whose split yields the largest reduction in entropy (the largest information gain) is the best feature for dividing the data set.

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      # the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):           # iterate over all the features
        featList = [example[i] for example in dataSet]   # all values of this feature
        uniqueVals = set(featList)         # the set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy    # the info gain, i.e. the reduction in entropy
        if infoGain > bestInfoGain:            # keep the best gain so far
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature                         # returns an integer index
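On the illustrative myDat list above, the first feature gives the larger information gain (roughly 0.42 versus 0.17), so the function returns index 0:

print(chooseBestFeatureToSplit(myDat))   # 0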
Recursively constructing the decision tree

Combining the examples in the book, we can see very intuitively how a decision tree is built: each step is simply the data-set-splitting process described above, applied again.
Recursion terminates when all features in the data set have been used up, or when all instances under a branch belong to the same class.
But what if all features have already been used and some branch still contains instances that do not all belong to the same class? The method given in the book is a majority vote, which is a reasonable choice.

import operator

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # the book uses iteritems() here (Python 2); items() works in Python 2 and 3
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
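A quick illustrative call:

print(majorityCnt(['yes', 'no', 'no']))   # 'no'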

This is very similar to the voting portion of the KNN algorithm.

The next step is the code that creates the decision tree, based on the methods above:

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    # stop when all instances under this branch have the same class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # stop when every feature in the data set has been used for splitting;
    # the two cases are combined here by taking a majority vote, regardless of
    # whether the instances under the last branch belong to the same class
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        # a list argument is passed by reference in Python, so copy the labels here
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
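Running this on the illustrative myDat / labels pair defined earlier produces a nested dictionary representing the tree (createTree mutates labels via del, so a copy is passed here to keep the original list intact):

myTree = createTree(myDat, labels[:])
print(myTree)
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}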

"Reading notes" machine learning combat-decision tree (1)

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.