Machine Learning Practice Note 3 (Decision Trees and Random Forests)

Source: Internet
Author: User
Tags: id3

The advantage of a decision tree is that the form of the data is very easy to understand, whereas kNN's biggest drawback is that it cannot give the intrinsic meaning of the data.

1: A simple textual description of the concepts

There are many types of decision trees, such as CART, ID3 and C4.5. CART is based on Gini impurity (the Gini index) and is not explained further here; ID3 and C4.5 are both based on information entropy and proceed in essentially the same way. The definitions below are given mainly for the ID3 algorithm. We first describe the definition of information entropy.

Suppose the probability that event a_i occurs is P(a_i). Then I(a_i) = -log2 P(a_i) represents the uncertainty of the event a_i and is called the self-information of a_i, and H(S) = sum_i P(a_i) * I(a_i) is called the average information content, or entropy, of the source S.
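As a quick numerical illustration (the probabilities are made up for this note), a source with three events of probabilities 0.5, 0.25 and 0.25 has self-information values of 1, 2 and 2 bits and an entropy of 1.5 bits:

from math import log

probs = [0.5, 0.25, 0.25]                       # hypothetical event probabilities
selfInfo = [-log(p, 2) for p in probs]          # self-information of each event, in bits
entropy = sum(p * i for p, i in zip(probs, selfInfo))
print(selfInfo)    # [1.0, 2.0, 2.0]
print(entropy)     # 1.5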

Decision tree learning uses a top-down recursive approach. The basic idea is to construct a tree along which entropy, used as the measure, decreases as fast as possible, until the entropy at every leaf node is zero; at that point the instances in each leaf node all belong to the same class.

The principle of ID3 is to maximize the information entropy gain. Suppose the labels of the original problem consist of positive and negative examples, and let p and n denote their respective counts. The information entropy of the original problem is

I(p, n) = -(p / (p + n)) * log2(p / (p + n)) - (n / (p + n)) * log2(n / (p + n))

After splitting on a feature, the new entropy is the weighted average of the entropies of the resulting subsets:

NewEntropy = sum over i = 1..n of P(value_i) * Entropy(subset_i)

where n is the number of values the feature can take. For example, for {Rain, Sunny}, n is 2.

Gain = BaseEntropy - NewEntropy

The ID3 principle is to make this gain as large as possible; that is, information gain is the decrease in entropy, or the decrease in the disorder of the data.
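As a worked example with hypothetical counts (the classic weather data, not part of the original note): with p = 9 positive and n = 5 negative examples, and a feature that splits them into subsets with (2+, 3-), (4+, 0-) and (3+, 2-), the gain works out to roughly 0.25 bits:

from math import log

def entropy(p, n):
    # I(p, n): entropy of a set with p positive and n negative examples
    total = float(p + n)
    return sum(-c / total * log(c / total, 2) for c in (p, n) if c)

baseEntropy = entropy(9, 5)                                              # ~0.940
subsets = [(2, 3), (4, 0), (3, 2)]                                       # counts after the hypothetical split
newEntropy = sum((p + n) / 14.0 * entropy(p, n) for p, n in subsets)     # ~0.694
gain = baseEntropy - newEntropy                                          # ~0.247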

A problem ID3 is prone to: suppose an attribute takes very many distinct values. Such an attribute splits the data into "purer" subsets more easily (especially for continuous values), so its information gain is larger and the decision tree will pick it as the top node of the tree first. The result is a very wide and very shallow tree, which is quite unreasonable.

C4.5 can be used to solve this.

C4.5's idea is to maximize the information gain rate, obtained by dividing the gain by the split information of the attribute:

GainRatio(A) = Gain(A) / SplitInfo(A), where SplitInfo(A) = -sum over i of (|S_i| / |S|) * log2(|S_i| / |S|) and S_i is the subset of examples taking the i-th value of A.

The base of the logarithm in the formula is 2.
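Continuing the hypothetical weather example above, the split produces subsets of sizes 5, 4 and 5 out of 14, so the gain rate can be sketched as:

from math import log

sizes = [5, 4, 5]                        # subset sizes from the hypothetical split above
total = float(sum(sizes))
splitInfo = -sum(s / total * log(s / total, 2) for s in sizes)   # ~1.577
gainRatio = 0.247 / splitInfo                                    # gain from the previous snippet, ~0.157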

2: Python code implementation

(1) Calculate information entropy

from math import log

# compute the Shannon entropy of a given data set
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys():    # build a dictionary of all possible classes
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)             # logarithm with base 2
    return shannonEnt

(2) Create a data set

# create a small example data set
def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels
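A quick usage check with the two functions above (run in the same session or module):

myDat, labels = createDataSet()
print(calcShannonEnt(myDat))    # ~0.971 for two 'yes' and three 'no' labels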

(3) Partitioning data sets

# split the data set according to one feature: axis is the index of the feature,
# value is the feature value to match; returns the matching subset with that feature column removed
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
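A usage sketch with the example data above:

myDat, labels = createDataSet()
print(splitDataSet(myDat, 0, 1))    # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(splitDataSet(myDat, 0, 0))    # [[1, 'no'], [1, 'no']]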


(4) Select the best features to divide

# choose the best feature to split the data set on; returns the index of that feature
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1                   # number of features
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):                        # iterate over the features
        featureSet = set([example[i] for example in dataSet])   # set of values of the i-th feature
        newEntropy = 0.0
        for value in featureSet:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)     # entropy corresponding to this split
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
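On the example data, feature 0 ('no surfacing') gives the larger gain (about 0.420 versus 0.171), so it is chosen:

myDat, labels = createDataSet()
print(chooseBestFeatureToSplit(myDat))    # 0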

Note: the data set needs to meet the following two requirements:

<1> the data must be a list of lists, and all instances (list elements) must have the same length;

<2> the last column of the data, i.e. the last element of each instance, must be the class label of that instance.

(5) Code to create a tree

Python uses the dictionary type to store the structure of the tree; the returned result is the dictionary myTree.

# function that creates the tree; its structure is stored in a dictionary
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]                  # all classes identical: stop splitting and return the class label (leaf node)
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)        # all features used up: return the most frequent class
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]    # all values of the chosen feature
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
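A usage sketch; note that createTree deletes entries from the labels list it is given, so a copy is passed in:

myDat, labels = createDataSet()
myTree = createTree(myDat, labels[:])    # pass a copy: createTree deletes used labels
print(myTree)    # {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}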

Recursion ends when either all the labels in a branch belong to exactly the same class, or all the features have been used; in the latter case the most frequent class is returned.


When all the features have been used up, a majority-vote method is used to determine the classification of the leaf node: the leaf node is assigned the class that has the largest number of samples in it. The code is as follows:

import operator

# the majority-vote method decides the classification of a leaf node,
# used when all features have been applied but the classes are still mixed
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)   # sort by count
    return sortedClassCount[0][0]
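A quick sanity check, assuming majorityCnt above is available in the session:

print(majorityCnt(['yes', 'no', 'no', 'yes', 'no']))    # 'no': the most frequent class wins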

That is, suppose the data set has already processed all of the attributes but the class labels are still not unique. In such a case we have to decide how to define the leaf node, and we usually use a majority-vote method to determine its classification.

(6) Run classification using decision tree

# run classification with the decision tree
def classify(inputTree, featLabels, testVec):
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)    # index finds the first element in the list that matches firstStr
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if type(secondDict[key]).__name__ == 'dict':
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:
                classLabel = secondDict[key]
    return classLabel
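A usage sketch with the data and tree from above (labels is kept intact so that classify can look up feature indices):

myDat, labels = createDataSet()
myTree = createTree(myDat, labels[:])       # keep labels intact for classify
print(classify(myTree, labels, [1, 0]))     # 'no'
print(classify(myTree, labels, [1, 1]))     # 'yes'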


It is important to pay attention to the idea of recursion.

(7) Storage of decision Trees

Constructing a decision tree is a very time-consuming task.

To save computing time, it is better to reuse the already constructed decision tree each time the classifier is run.

To do this, you can use the Python module pickle to serialize the object; the serialized object can be saved on disk and read back when it is needed.

# storage of the decision tree
def storeTree(inputTree, filename):
    # pickle serializes the object so that it can be saved on disk
    import pickle
    fw = open(filename, 'wb')
    pickle.dump(inputTree, fw)
    fw.close()

def grabTree(filename):
    # read the serialized tree back from disk when it is needed
    import pickle
    fr = open(filename, 'rb')
    return pickle.load(fr)
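A usage sketch (the file name is just an example):

myDat, labels = createDataSet()
myTree = createTree(myDat, labels[:])
storeTree(myTree, 'classifierStorage.txt')    # serialize the tree to disk
print(grabTree('classifierStorage.txt'))      # the same dictionary is read back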


3: Matplotlib annotations

Matplotlib provides an annotation tool, annotations, which is very useful for adding text annotations to data plots.

Annotations are often used to explain the content of the data.

I do not fully understand this code, so I simply give the code from the book directly.

# -*- coding: cp936 -*-
import matplotlib.pyplot as plt

decisionNode = dict(boxstyle='sawtooth', fc='0.8')
leafNode = dict(boxstyle='round4', fc='0.8')
arrow_args = dict(arrowstyle='<-')

# draw a tree node with a text annotation
def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va='center', ha='center', bbox=nodeType,
                            arrowprops=arrow_args)

def createPlot():
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    createPlot.ax1 = plt.subplot(111, frameon=False)
    plotNode('a decision node', (0.5, 0.1), (0.1, 0.5), decisionNode)
    plotNode('a leaf node', (0.8, 0.1), (0.3, 0.8), leafNode)
    plt.show()

# get the number of leaf nodes and the number of levels of the tree
def getNumLeafs(myTree):
    numLeafs = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs

def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth

# updated createPlot code that draws the whole tree (the functions below supersede the simple createPlot above)
def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va='center', ha='center', rotation=30)

def plotTree(myTree, parentPt, nodeTxt):
    numLeafs = getNumLeafs(myTree)           # the number of leafs determines the x width of this tree
    depth = getTreeDepth(myTree)
    firstStr = list(myTree.keys())[0]        # the text label for this node
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            plotTree(secondDict[key], cntrPt, str(key))     # a dictionary means another subtree: recurse
        else:                                               # otherwise it is a leaf node: plot it
            plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD

# if the first element is a dictionary it is a subtree, otherwise it is a leaf
def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)   # no ticks
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5 / plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()
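A usage sketch with the tree built in section 2:

myDat, labels = createDataSet()
myTree = createTree(myDat, labels[:])
createPlot(myTree)    # draws the two-level tree with 'no surfacing' at the root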

The index method returns the index of the first element in the current list that matches firstStr.

4: Use decision tree to predict contact lens type
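The original note only shows a figure for this part. Below is a minimal sketch, assuming the lenses.txt file from the book's download (tab-separated, class label in the last column), of how the data can be fed to the functions above:

fr = open('lenses.txt')
lenses = [inst.strip().split('\t') for inst in fr.readlines()]
lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
lensesTree = createTree(lenses, lensesLabels)
createPlot(lensesTree)    # visualize the resulting decision tree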



Note: 1: These notes are based on the book Machine Learning in Action.

2: The code and data for the notes can be downloaded here (http://download.csdn.net/detail/lu597203933/7660737).

————————————————————————————————————————————————————————
Random Forest
Below we add a section on random forests.
 

A single decision tree is prone to overfitting. We can address this in two ways: one is pruning, and the other is the random forest. The latter is actually simpler and more convenient, so random forests are introduced here.

  
Bagging method: before introducing random forests, we need to introduce the bagging method. Bagging stands for bootstrap aggregating. The bootstrapping idea is to rely on your own resources, so it is also called the self-help method; it is a sampling method with replacement. The metaphor is that you do not need outside help and rely only on your own strength to become better: pull yourself up by your own bootstraps!
The bagging strategy is as follows:

(Figure: the bagging strategy)

Note: re-sampling means drawing N samples from the original data with replacement.
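A toy sketch of one bootstrap re-sample (the data values are made up):

import random

data = ['x1', 'x2', 'x3', 'x4', 'x5']
bootstrap = [random.choice(data) for _ in range(len(data))]
print(bootstrap)    # e.g. ['x3', 'x1', 'x3', 'x5', 'x1']: some items repeat, others are left out
# on average a bootstrap sample contains about 63.2% of the distinct original examples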

A random forest applies the bagging method to decision trees. The algorithm used to build each tree can be C4.5, ID3 or CART; the description below uses CART:
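The original note gives no code for this part. As a rough sketch of the idea (my own assumption, not the author's code), bagging can be combined with the ID3 functions from section 2: train one tree per bootstrap sample and classify by majority vote. A real random forest would also select a random subset of features at each split, which the simple code here does not do.

import random

def buildForest(dataSet, labels, numTrees=10):
    # one ID3 tree per bootstrap sample, reusing createTree from section 2
    forest = []
    n = len(dataSet)
    for _ in range(numTrees):
        sample = [random.choice(dataSet) for _ in range(n)]    # sample n rows with replacement
        forest.append(createTree(sample, labels[:]))           # copy labels: createTree deletes used ones
    return forest

def forestClassify(forest, labels, testVec):
    # majority vote over the predictions of the individual trees
    votes = []
    for tree in forest:
        if not isinstance(tree, dict):       # degenerate tree: every label in its sample was identical
            votes.append(tree)
            continue
        try:
            votes.append(classify(tree, labels, testVec))
        except Exception:                    # this tree never saw the test value for some feature
            pass
    return max(set(votes), key=votes.count)

myDat, labels = createDataSet()
forest = buildForest(myDat, labels, numTrees=15)
print(forestClassify(forest, labels, [1, 1]))    # usually 'yes' on this toy data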
References: http://www.julyedu.com/video/play/id/17
Source: small village head, http://blog.csdn.net/lu597203933. You are welcome to reprint or share, but please be sure to indicate the source of the article.

(Sina Weibo: small village head Zack, welcome to exchange!)
