1. Background
The decision tree algorithm is a classification algorithm that approximates a discrete-valued target function; it is simple and comparatively accurate. In December 2006, the IEEE International Conference on Data Mining (ICDM), an authoritative international academic conference, selected the top ten classical algorithms in the field of data mining, and the C4.5 algorithm ranked first. C4.5 is a classification decision tree algorithm in machine learning, and its core is the ID3 algorithm.
The main idea of the algorithm is to rank the features by how strongly they influence the target index, from high to low, and arrange them into a binary tree in that order, as shown in the figure below.
The problem is: when we have many features, which ones should be placed on the upper nodes of the tree, and which lower down? Intuitively, the upper nodes should hold the features that have the greatest impact on the target index. So how do we compare which features influence the target index more? This leads to the concept of information entropy.
Claude E. Shannon, one of the founders of information theory, defined information entropy in terms of the probabilities of discrete random events. Simply put, the greater the information entropy, the more mixed (disordered) the set is.
The formula for information entropy is H(X) = -Σ_i p(x_i) log2 p(x_i), where p(x_i) is the probability of the i-th outcome (see the Wikipedia article on entropy in information theory for details).
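Before applying it to a dataset, the formula can be checked on a couple of simple distributions. A minimal sketch in Python (the helper name `entropy` is our own, not from the article's code):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log(p, 2) for p in probs if p > 0)

# A fair coin is maximally uncertain: 1 bit.
print(entropy([0.5, 0.5]))  # 1.0
# A heavily skewed distribution is much less uncertain.
print(entropy([0.9, 0.1]))  # ≈ 0.469
```

Note that terms with p = 0 are skipped, since 0 * log2(0) is taken to be 0 by convention.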
Here we calculate the entropy of the target index, and use the difference in entropy (the information gain) to determine which features have the greatest influence on the target index.
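To make the gain idea concrete, here is a small, self-contained sketch (the function names and toy rows are our own illustration, not the article's code): it computes the entropy of the labels before a split, subtracts the weighted entropy of the subsets after splitting on one feature, and the difference is the information gain for that feature.

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log(c / n, 2) for c in counts.values())

def information_gain(rows, feature_index):
    """Entropy of the labels minus the weighted entropy after
    splitting on the given feature (last column is the label)."""
    labels = [row[-1] for row in rows]
    base = entropy(labels)
    # Group the labels by the value of the chosen feature.
    splits = {}
    for row in rows:
        splits.setdefault(row[feature_index], []).append(row[-1])
    # Weighted average entropy of the subsets after the split.
    remainder = sum(len(subset) / len(rows) * entropy(subset)
                    for subset in splits.values())
    return base - remainder

# Hypothetical toy data: [feature, label].
rows = [[1, 'bird'], [1, 'bird'], [0, 'fish'], [0, 'fish']]
print(information_gain(rows, 0))  # 1.0: this feature separates the classes perfectly
```

The feature with the largest gain is the one placed higher in the tree; a gain of 1.0 here means splitting on that feature removes all uncertainty about the label.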
2. Data sets
3. Code
(1) Part one: computing the entropy
This function counts how often each target label appears in the dataset and computes the information entropy from those frequencies.
import math

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    # Count how often each class label (the last column) appears.
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    # Accumulate -p * log2(p) over the label frequencies.
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * math.log(prob, 2)
    return shannonEnt
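Assuming a small, hypothetical dataset in the format the function expects (each row is a feature vector whose last element is the class label), usage could look like this; the corrected function is repeated here only so the snippet runs on its own:

```python
import math

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        labelCounts[currentLabel] = labelCounts.get(currentLabel, 0) + 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * math.log(prob, 2)
    return shannonEnt

# Hypothetical toy dataset: two features, last column is the label.
myDat = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
         [0, 1, 'no'],  [0, 0, 'no']]
print(calcShannonEnt(myDat))  # ≈ 0.971 bits for a 2-'yes' / 3-'no' split
```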