Python implementation of a decision tree

This article describes how to implement a decision tree in Python. It is shared here for your reference; the implementation details are as follows.

Advantages and disadvantages of the decision tree algorithm:

Advantages: low computational complexity, easy-to-interpret output, insensitivity to missing values, and the ability to handle irrelevant features.

Disadvantage: prone to overfitting.

Applicable data types: numeric and nominal.

Algorithm idea:

1. The overall idea of decision tree construction:

Put simply, a decision tree works like a chain of if-else statements: starting from the root, it keeps testing features until it reaches a leaf node that gives the answer. The difference is that we do not write these if-else rules by hand. What we need to do is provide a method by which the computer can derive the decision tree from the data. The key question for this method is how to pick the most valuable feature out of many, and in what order to test features from the root down to the leaves. Once that is settled, the tree can be constructed recursively.
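
For illustration only, this is the kind of nested if-else structure the construction procedure is meant to produce; it is the tree that the sample dataset used later in this article yields (test 'no surfacing' first, then 'flippers'; the strings at the leaves are class labels):

    # Target structure: a nested dictionary that encodes the if-else decisions
    myTree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}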

2. Information Gain

The principle behind splitting a dataset is to make disordered data more ordered. Because this concerns how much order or disorder there is in the information, we naturally turn to information entropy (another option is the Gini impurity). The Shannon entropy formula is:

    H = -sum over i of p(x_i) * log2 p(x_i)

where p(x_i) is the proportion of instances belonging to class x_i. The information gain of a split is then the entropy before the split minus the weighted entropy of the subsets after the split; the feature with the largest gain is chosen.
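
As a quick worked example (a minimal sketch; the calcShannonEnt function below computes the same thing), the entropy of the sample dataset's labels, two 'yes' and three 'no', comes out to about 0.971:

    from math import log

    # Label distribution of the sample dataset: 2 x 'yes', 3 x 'no'
    probs = [2 / 5.0, 3 / 5.0]
    entropy = -sum(p * log(p, 2) for p in probs)
    print(entropy)  # about 0.971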

Data requirements:

① The data must be a list of lists, and every instance (inner list) must have the same length.
② The last element of each instance must be the class label of that instance, as shown in the example below.
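
For instance, a single conforming instance from the sample dataset used in the code below looks like this (two feature values followed by the class label):

    [1, 1, 'yes']  # feature 0, feature 1, class label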

Functions:

calcShannonEnt(dataSet)
Calculates the Shannon entropy of a dataset in two steps: first count the frequency of each class label, then apply the entropy formula.

splitDataSet(dataSet, axis, value)
Splits the dataset: all instances with X[axis] == value are grouped together, and the resulting subset is returned with the axis attribute removed (it is no longer needed after the split).

chooseBestFeatureToSplit(dataSet)
Selects the best attribute to split on. The idea is simple: try splitting on each attribute and see which split gives the largest information gain. A set is used to extract the unique values of each attribute from the list, which is very fast. See the example after this list of functions.

majorityCnt(classList)
Because the recursive construction consumes one attribute per split, the attributes may be used up before every branch is purely one class. In that case the class of the node is decided by majority vote.

createTree(dataSet, labels)
Builds the decision tree recursively. The labels parameter holds the names of the features; it is only there to make the resulting tree easier to read and to use later.
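
A minimal sketch of what these helpers return on the sample dataset, assuming the function definitions from the code that follows:

    data, labels = createDataSet()

    # Keep only the instances with feature 0 == 1, and drop that feature
    print(splitDataSet(data, 0, 1))        # [[1, 'yes'], [1, 'yes'], [0, 'no']]

    # Feature 0 ('no surfacing') yields the larger information gain, so it is chosen first
    print(chooseBestFeatureToSplit(data))  # 0

    # Majority vote decides the class when the features are exhausted
    print(majorityCnt(['yes', 'no', 'no']))  # 'no'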

The code is as follows:
# coding=utf-8
from math import log
import time

def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels

# Calculate Shannon entropy
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # the last element of each instance is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

# Because recursive construction consumes one attribute per split, the attributes may be
# used up before every branch is pure. In that case the class of the node is decided
# by majority vote.
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    return max(classCount, key=classCount.get)  # label with the highest count

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):  # all classes are identical, stop splitting
        return classList[0]
    if len(dataSet[0]) == 1:  # all features have been used up
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]  # copy the label list so recursion does not modify it
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),
                                                  subLabels)
    return myTree

def main():
    data, label = createDataSet()
    t1 = time.time()
    myTree = createTree(data, label)
    t2 = time.time()
    print(myTree)
    print('execute for', t2 - t1)

if __name__ == '__main__':
    main()
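
The article stops at building the tree. As a follow-up, here is a minimal sketch of how the resulting nested-dictionary tree could be used to classify a new instance; the classify helper below is not part of the original code, it is only an illustration:

    def classify(inputTree, featLabels, testVec):
        # Walk the nested dictionary: test the feature at the root, follow the branch
        # matching the test vector, and recurse until a leaf (a label string) is reached.
        firstStr = list(inputTree.keys())[0]
        secondDict = inputTree[firstStr]
        featIndex = featLabels.index(firstStr)
        valueOfFeat = secondDict[testVec[featIndex]]
        if isinstance(valueOfFeat, dict):
            return classify(valueOfFeat, featLabels, testVec)
        return valueOfFeat

    data, labels = createDataSet()
    tree = createTree(data, labels[:])     # pass a copy, since createTree deletes from labels
    print(classify(tree, labels, [1, 0]))  # expected: 'no'
    print(classify(tree, labels, [1, 1]))  # expected: 'yes'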

I hope this article will help you with Python programming.
