The Path of Machine Learning: Decision Trees


First, Introduction:

In the previous chapter we discussed the kNN algorithm. Although kNN can accomplish many classification tasks, its biggest disadvantage is that it cannot reveal the intrinsic meaning of the data, whereas the main advantage of a decision tree is that its data form is very easy to understand. A decision tree algorithm can read a data collection, and one important task of the decision tree is to capture the knowledge contained in that data: given an unfamiliar data collection, the decision tree extracts a series of rules from it. The process by which the machine creates these rules from the data set is the machine learning process.

Second, relevant knowledge

1 Decision Tree algorithm

The first problem to solve when constructing a decision tree is determining which feature is decisive in classifying the data, that is, which feature produces the best split. To find this decisive feature, we need to evaluate every feature. Once the optimal feature is found, the data set is divided into subsets, which are distributed along the branches of the decision node. If the data under a branch all belong to the same class, the classification of that branch is complete and no further splitting is needed; if the data in a branch's subset contain more than one class, the splitting process is repeated on that subset according to the same principle used to divide the original data set: find the optimal feature within the subset and continue dividing, until either all features have been used or the data under every leaf node branch belong to the same class.

The pseudo-code for the branch-creating function createBranch() is as follows:

Detect whether every item in the data set belongs to the same class:
    If so, return the class label
    Else
        find the best feature for splitting the data set
        split the data set
        create a branch node
        for each subset produced by the split
            call createBranch() recursively and add the returned result to the branch node
        return the branch node

Having seen how branches are created, we can summarize the general process of building a decision tree:

(1) Collect data.

(2) Prepare data: the tree-construction algorithm only works on nominal data, so numerical data must be discretized.

(3) Analyze data.

(4) Train the algorithm: construct the tree's data structure.

(5) Test the algorithm: use the trained tree to calculate the error rate.

(6) Use the algorithm: apply the tree in practice to better understand the intrinsic meaning of the data.

2 The rule for selecting the best feature: information gain

The guiding principle in splitting a data set is to make disordered data more ordered. The change in information before and after splitting the data set is called the information gain. If we know how to calculate information gain, we can compute the gain obtained by splitting the data set on each feature, and the feature that yields the highest information gain is the best feature.

To learn how to calculate information gain, we first need the concept of Shannon entropy, or simply entropy. Entropy is defined as the expected value of information.

If the object to be classified can take multiple outcomes x, and the probability of the i-th outcome x_i is p(x_i), then the information associated with x_i can be computed as

    l(x_i) = p(x_i) * log2(1 / p(x_i)) = -p(x_i) * log2(p(x_i))

So, summed over all possible outcomes, the expected value of the information contained in the thing (its information entropy) is

    H = -Σ_i p(x_i) * log2(p(x_i)), where i ranges over all possible outcomes

Thus, suppose a feature A in the data set is used to classify the data set D (D contains n classes), and feature A takes k distinct values. The information gain obtained by splitting D on feature A is:

    information gain H(D, A) = H(D) - H(D|A)

that is, the information entropy of the original data set minus the information entropy after splitting it on feature A, where

    H(D|A) = Σ_j (|A_j| / |D|) * H(A_j), with j ranging over the k values of A

Here |A_j| is the number of samples on which feature A takes its j-th value (so |A_j| / |D| is its proportion of the total), and |D| is the total number of samples in the data set.
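To make the two formulas concrete, here is a minimal sketch (not from the original article) that computes H(D), H(D|A), and the resulting gain for a hypothetical data set of five samples, two labelled 'yes' and three 'no', which a binary feature A splits into subsets of three and two samples:

# Hand calculation of information gain on a tiny hypothetical example
from math import log

def entropy(labels):
    # Count how often each label occurs, then apply H = -sum(p * log2(p))
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    total = float(len(labels))
    return -sum((c / total) * log(c / total, 2) for c in counts.values())

D = ['yes', 'yes', 'no', 'no', 'no']            # the whole data set
H_D = entropy(D)                                # about 0.971
subsets = [['yes', 'yes', 'no'], ['no', 'no']]  # the split induced by feature A
H_D_given_A = sum(len(s) / float(len(D)) * entropy(s) for s in subsets)
gain = H_D - H_D_given_A                        # about 0.971 - 0.551 = 0.420
print(H_D, H_D_given_A, gain)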

Third, construct decision tree

Now that we know how to select the optimal feature for splitting the data, we can build a decision tree on that basis.

1 Since we will use the Shannon entropy formula more than once, we first write a function that calculates the entropy of a given data set:

# Calculate the Shannon entropy of a given data set
from math import log

def calcEnt(dataSet):
    # Number of rows in the data set
    numEntries = len(dataSet)
    # Dictionary that counts how often each class label appears
    labelCounts = {}
    # Walk through every feature vector (row) of the data set
    for featVec in dataSet:
        # The class label is the last column of the feature vector
        currentLabel = featVec[-1]
        # If the label is not yet a key of the dictionary, add it with count 0
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        # Increase the count of the current label by 1
        labelCounts[currentLabel] += 1
    # Initialize the entropy to 0
    ent = 0.0
    # For every class that appears in the data set
    for key in labelCounts:
        # Frequency with which this class occurs
        prob = float(labelCounts[key]) / numEntries
        # Accumulate the expected information of this class
        ent -= prob * log(prob, 2)
    # Return the entropy
    return ent

2 We also need a simple data set for the decision tree:

# Create a simple data set
# The data set has two features: 'no surfacing' and 'flippers'
# The class labels are 'yes' and 'no'
def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    # Return the data set and the feature labels
    return dataSet, labels
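As a quick check, the two pieces can be combined like this (a minimal sketch, assuming calcEnt() and createDataSet() above are defined in the same file):

# Build the toy data set and compute its entropy
myDat, labels = createDataSet()
print(myDat)           # [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(calcEnt(myDat))  # about 0.971 for two 'yes' and three 'no' labels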

It should be noted that the higher the entropy, the more mixed the data is; if we add more classes to the data set, the entropy increases.
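For example (a small sketch continuing the session above, not part of the original article), changing one label to a third class 'maybe' raises the entropy:

# A third class makes the data more mixed, so the entropy goes up
myDat[0][-1] = 'maybe'
print(calcEnt(myDat))   # about 1.37, up from about 0.971
myDat[0][-1] = 'yes'    # restore the original label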

3 Next we use the information gain formula above to find the best feature of the data set and split the data set on it.

First, the code that splits the data set:

# Split the data set on a given feature
# @dataSet: the data set to split
# @axis: the index of the feature to split on
# @value: the value of that feature to keep
def splitDataSet(dataSet, axis, value):
    # Note that Python passes the list by reference: modifying it inside the
    # function would change the caller's data set, so to leave the original
    # data untouched we create a new list object to work on
    retDataSet = []
    # Walk through every feature vector of the data set
    for featVec in dataSet:
        # Keep only the rows in which feature axis takes the given value
        if featVec[axis] == value:
            # Copy columns 0 .. axis-1 of the feature vector into reducedFeatVec
            reducedFeatVec = featVec[:axis]
            # Append columns axis+1 .. end, i.e. drop the feature being split on
            # extend() adds the elements of another list one by one,
            # e.g. a = [1,2,3], b = [4,5,6], then a.extend(b) gives [1,2,3,4,5,6]
            reducedFeatVec.extend(featVec[axis+1:])
            # append() would add the other list as a single object,
            # e.g. a.append(b) gives [1,2,3,[4,5,6]]
            retDataSet.append(reducedFeatVec)
    return retDataSet
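For instance, splitting the toy data set on feature 0 ('no surfacing') would look like this (a sketch, assuming the createDataSet() data set above):

myDat, labels = createDataSet()
print(splitDataSet(myDat, 0, 1))   # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(splitDataSet(myDat, 0, 0))   # [[1, 'no'], [1, 'no']]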

It should be noted that:

(1) The data set is passed to the splitting function as a list reference, so modifying the list object inside the function would change the caller's data. To avoid this side effect, we create a new list object inside the function and store the processed rows in that new object.

(2) Distinguish between the append() and extend() functions.

The two methods are similar in that both add new elements at the end of a list, but they give different results when the element being added is itself a list.

For example, let a = [1,2,3] and b = [4,5,6].

Then a.append(b) gives [1,2,3,[4,5,6]]: append() adds the whole list b to the end of a as a single new element.

a.extend(b) gives [1,2,3,4,5,6]: extend() adds the elements of b to the end of a one by one.
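A quick illustration (not from the original article):

a = [1, 2, 3]
b = [4, 5, 6]
a.append(b)
print(a)    # [1, 2, 3, [4, 5, 6]]; b is added as a single element

a = [1, 2, 3]
a.extend(b)
print(a)    # [1, 2, 3, 4, 5, 6]; b's elements are added one by one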

Next, let's look at the code that picks the best feature:

# Choose the best feature to split the data set on:
# split on each feature in turn and pick the one with the largest information gain
def chooseBestFeatureToSplit(dataSet):
    # Number of features (the last column is the class label and does not count)
    numFeatures = len(dataSet[0]) - 1
    # Information entropy of the data set before any split
    baseEntropy = calcEnt(dataSet)
    # Best information gain and best feature found so far
    bestInfoGain = 0.0; bestFeature = -1
    # Split the data set on each feature and compute the resulting information gain
    for i in range(numFeatures):
        # List of the values feature i takes in the data set
        featList = [example[i] for example in dataSet]
        # A set keeps only the unique values of feature i
        uniqueVals = set(featList)
        newEntropy = 0.0
        # Build the branch corresponding to each value of the feature
        for value in uniqueVals:
            # Subset of the data set on which feature i takes this value,
            # obtained with splitDataSet()
            subDataSet = splitDataSet(dataSet, i, value)
            # Proportion of the data set that falls into this subset
            prob = len(subDataSet) / float(len(dataSet))
            # Weighted entropy of the subset, accumulated into the total
            newEntropy += prob * calcEnt(subDataSet)
        # Information gain of splitting on this feature:
        # for feature A and data set D, H(D, A) = H(D) - H(D|A)
        infoGain = baseEntropy - newEntropy
        # Compare this gain with the largest gain saved so far
        if infoGain > bestInfoGain:
            # Save the larger information gain
            bestInfoGain = infoGain
            # and the feature i that produced it
            bestFeature = i
    # Return the index of the best feature
    return bestFeature
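On the toy data set this should select feature 0, i.e. 'no surfacing' (a sketch, assuming the functions above live in the same module):

myDat, labels = createDataSet()
print(chooseBestFeatureToSplit(myDat))   # 0: its gain (about 0.42) beats that of 'flippers' (about 0.17)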

When calling this function, the data must meet certain requirements. First, the data must be a list of lists, and all of the inner lists must have the same length. Second, the last column of the data, i.e. the last element of each instance, must be the class label of that instance. Under these conditions the program can split any data set in a uniform way.

4 Having worked through each of the modules above, we can now actually build the decision tree. The construction works as follows: first take the original data set, then split it on the best feature. Because a feature may take more than two values, the split may produce more than two branches. After the first split, the data is passed down to the next node along each branch of the tree, where we split the data again. We can therefore process the data set recursively to complete the construction of the decision tree.

The recursion stops when the program has used up all of the features that can split the data set, or when all instances under a branch have the same classification. If all instances have the same classification, that branch becomes a leaf node, or terminating block.

Of course, it can happen that all of the features have been used but the class labels under one or more branches are still not unique. In that case we need to decide how to define the leaf node; here we adopt majority voting and take the most frequent class label among the instances of that branch as the label of the leaf node.

For this we define a majority voting function, majorityCnt():

# When all features have been used but the class labels are still not unique
# (there are still instances of different classes under a branch),
# finish the classification by majority vote
import operator

def majorityCnt(classList):
    # Dictionary counting the class labels
    classCount = {}
    # Walk through every element of the class label list
    for vote in classList:
        # If the label is not yet in the dictionary, add it with count 0
        if vote not in classCount.keys():
            classCount[vote] = 0
        # Add 1 to the count of the current label
        classCount[vote] += 1
    # Sort the (label, count) pairs by count, from largest to smallest
    # @classCount.items(): the dictionary as a sequence of (key, value) pairs
    # @key=operator.itemgetter(1): sort on the count field
    # @reverse=True: descending order (the default is ascending)
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    # Return the most frequent class label
    return sortedClassCount[0][0]
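A small usage sketch (the label list here is hypothetical, purely for illustration):

print(majorityCnt(['yes', 'no', 'no', 'maybe', 'no']))   # 'no', the most frequent label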

With all of this in place, we can write the recursive code that builds the decision tree.

# Create the tree
def createTree(dataSet, labels):
    # Put the last column of the data set, the class labels, into classList
    classList = [example[-1] for example in dataSet]
    # count() gives the number of occurrences of the first class label;
    # if it equals the length of the list, all labels are the same and
    # the instances belong to one class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # When all features have been used, only the class label column is left
    if len(dataSet[0]) == 1:
        # Decide the class label by majority vote
        return majorityCnt(classList)
    # Determine the current optimal splitting feature
    bestFeat = chooseBestFeatureToSplit(dataSet)
    # Get the corresponding name from the feature label list
    bestFeatLabel = labels[bestFeat]
    # The tree is stored as a dictionary nested inside a dictionary
    myTree = {bestFeatLabel: {}}
    # Note: the book is wrong at this point; it uses del(labels[bestFeat]),
    # which modifies the original list and later raises the error
    # "'no surfacing' is not in list". The code below is corrected:
    # copy the feature label list so the original is not changed
    subLabels = labels[:]
    # Delete the current splitting feature from the copied label list
    del(subLabels[bestFeat])
    # Column of the data set that holds the optimal feature
    featValues = [example[bestFeat] for example in dataSet]
    # A set keeps only the unique values of that feature
    uniqueVals = set(featValues)
    # For every value of the feature, build the corresponding branch recursively
    for value in uniqueVals:
        # @splitDataSet(dataSet, bestFeat, value): the subset for this branch
        # @subLabels: the feature label list with the splitting feature removed
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

Note again that the parameter labels is a list reference, so we should not modify the caller's list inside the function; deleting a column directly with del(labels[bestFeat]) is not advisable. Instead we create a new list object with subLabels = labels[:] and then call del(subLabels[bestFeat]).

OK, next let's run the code:
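A minimal run on the toy data set (a sketch, assuming all of the functions above are defined in one module) should produce the nested-dictionary tree shown in the comment:

myDat, labels = createDataSet()
myTree = createTree(myDat, labels)
print(myTree)
# expected (key order may vary): {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}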

5 Next we can perform actual classification with the decision tree. Using the built tree, we feed in a test instance, compare its values against the tree, and recursively follow the matching branch until we reach a leaf node; the test instance is then assigned the classification of that leaf node, and the classification result is output.

The decision tree classification function code is:

#------------------------ Test the algorithm ------------------------
# After the decision tree has been constructed, use it for a concrete application
# @inputTree: the built decision tree
# @featLabels: the feature label list
# @testVec: the test instance
def classify(inputTree, featLabels, testVec):
    # Find the first splitting feature of the tree, i.e. the root node, 'no surfacing'
    # Note the difference between Python 2.x and 3.x: 2.x allows
    # firstStr = inputTree.keys()[0], which 3.x does not support
    firstStr = list(inputTree.keys())[0]
    # Get the branches of that feature from the tree; here the values 0 and 1
    secondDict = inputTree[firstStr]
    # Find the index of the splitting feature to read its nominal value;
    # the index of 'no surfacing' is 0
    featIndex = featLabels.index(firstStr)
    # Walk through all values of the splitting feature
    for key in secondDict.keys():
        # Follow the branch whose value matches the test instance
        if testVec[featIndex] == key:
            # The type() check tells whether the child node is a dictionary
            if type(secondDict[key]).__name__ == 'dict':
                # The child node is a dictionary: keep classifying down that branch
                classLabel = classify(secondDict[key], featLabels, testVec)
            # If it is a leaf node, return the node value
            else:
                classLabel = secondDict[key]
    return classLabel
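With the tree built above, classifying a couple of test instances might look like this (a sketch; the test vectors are just examples):

myDat, labels = createDataSet()
myTree = createTree(myDat, labels)
print(classify(myTree, labels, [1, 0]))   # expected: 'no'
print(classify(myTree, labels, [1, 1]))   # expected: 'yes'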

For each input instance, the classification function produces a predicted result, which can be compared with the actual result to calculate the error rate.

6 A good classification algorithm must be able to meet the needs of real applications, and the decision tree algorithm is no exception; whether an algorithm is good or not has to be tested in practice. So let us use an example: predicting contact lens type with a decision tree.

First, building a decision tree is a time-consuming task: even on a small data set it can take a few seconds, which is clearly computationally expensive. We can therefore save the built decision tree to disk and read it back whenever we need it.

For serializing the object, Python's pickle module is up to the task: any object can be serialized with pickle, and a dictionary is no exception. The code that stores and reads the decision tree file with the pickle module is as follows:

# Storing the decision tree: Python's pickle module serializes the decision
# tree object so that it can be saved on disk and read back when needed,
# which saves the tree-construction time on large data sets
import pickle

# Store the decision tree with the pickle module
def storeTree(inputTree, filename):
    # Create a writable file; opening it with 'w' here raises the error
    # "write() argument must be str, not bytes", so write in binary mode 'wb'
    fw = open(filename, 'wb')
    # pickle's dump function writes the decision tree into the file
    pickle.dump(inputTree, fw)
    # Close the file when writing is finished
    fw.close()

# Read the decision tree back
def grabTree(filename):
    # Matching the binary write, 'rb' reads the data in binary form
    fr = open(filename, 'rb')
    return pickle.load(fr)
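Usage might look like the sketch below, continuing from the tree built earlier (the file name classifierStorage.txt is just an example):

storeTree(myTree, 'classifierStorage.txt')   # write the tree to disk
print(grabTree('classifierStorage.txt'))     # read it back: the same nested dictionary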

Here the file is opened for writing with 'wb' (or 'wb+'), which writes the data in binary (byte) form, and correspondingly 'rb' reads the data back in binary form.

Next we build a decision tree from the contact lens data set to predict which type of contact lens a patient needs to wear. The steps are as follows:

(1) Collect data: the text data set 'lenses.txt'.

(2) Prepare data: parse the tab-delimited data rows.

(3) Analyze data: quickly check the data to make sure its content was parsed correctly.

(4) Train the algorithm: build the decision tree.

(5) Test the algorithm: verify that the constructed decision tree predicts the classifications accurately.

(6) Use the algorithm: once the classification accuracy meets the requirements, store the decision tree so that it can be read back and reused the next time it is needed.

The function that parses the text data set and builds the contact lens decision tree is as follows:

#------------------------ Example: using a decision tree to predict contact lens type ------------------------
def predictLensesType(filename):
    # Open the text data set
    fr = open(filename)
    # Split every data row of the text on the tab character and store the rows in lenses
    lenses = [inst.strip().split('\t') for inst in fr.readlines()]
    # Create the feature label list
    lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
    # Build the decision tree from the data set read from the file and the feature labels
    lensesTree = createTree(lenses, lensesLabels)
    return lensesTree
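Calling it might look like the sketch below (assuming the tab-delimited data file is named lenses.txt, as in step (1) above):

lensesTree = predictLensesType('lenses.txt')
print(lensesTree)   # prints the nested-dictionary decision tree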

Of course, we could also use Python's Matplotlib to draw a diagram of the decision tree, but that is too much material to cover here, so it is not explained.

Finally, a note on the problem of excessive matching (overfitting) in the decision tree algorithm: when the decision tree becomes very complex, it is likely to overfit. We can reduce the complexity of the tree and improve its generalization ability by pruning it. For example, if a leaf node of the decision tree adds only very little information, we can delete that node and merge it into an adjacent node, thereby reducing the complexity of the tree and mitigating the overfitting problem.
