Machine Learning Algorithms in Python: Decision Trees (1): Partitioning a Dataset with Information Entropy


1. Background

The decision tree algorithm is a classification method that approximates discrete-valued target functions; it is relatively simple and quite accurate. In December 2006, the authoritative IEEE International Conference on Data Mining (ICDM) selected the top ten classical algorithms in the field of data mining, and the C4.5 algorithm ranked first. C4.5 is a classification decision tree algorithm in machine learning, and at its core is the ID3 algorithm.

The main idea of the algorithm is to rank the features of the dataset by how strongly they influence the target attribute, from high to low, and arrange them in that order into a binary tree, as shown in the following figure.

The problem now is: when we have many features, which ones should be placed at the upper nodes of the binary tree, and which below? Intuitively, the features at the upper nodes should be those with a significant impact on the target attribute. So how do we compare which features have a greater impact on the target attribute? This is where the concept of information entropy comes in.

Claude E. Shannon, one of the founders of information theory, defined information entropy in terms of the probabilities of occurrence of discrete random events. Put simply, the greater the information entropy, the more disordered the information set.

The formula for information entropy is H(X) = -Σ p(x) · log2 p(x), where the sum runs over every possible outcome x (the Wikipedia article on information entropy is worth studying).
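To make the formula concrete, here is a minimal sketch (my own illustration, not part of the original article) that evaluates it directly for a few probability distributions:

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping zero p."""
    return -sum(p * math.log(p, 2) for p in probabilities if p > 0)

# A fair coin is maximally uncertain for two outcomes: 1 bit of entropy.
print(entropy([0.5, 0.5]))   # 1.0
# A certain outcome carries no information: 0 bits.
print(entropy([1.0]))
# A biased coin lies strictly between the two extremes.
print(entropy([0.9, 0.1]))
```

Note that the more skewed the distribution, the lower the entropy, which matches the intuition above: a purer set is less disordered.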

Here we calculate the entropy of the target attribute, and then the difference in entropy before and after a split (the information gain), to determine which features have the greatest influence on the target attribute.

2. Data sets

3. Code

(1) Part one: computing the entropy

This function counts the distinct values of the target attribute and computes the information entropy from the frequency with which each value appears.

import math

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    # Count how often each class label (the last column of a record) occurs
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    # H = -sum(p * log2(p)) over all labels
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * math.log(prob, 2)
    return shannonEnt
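A quick sanity check of the function above. The definition is repeated here so the snippet runs on its own, and the toy dataset is my own illustration, not one supplied by the article:

```python
import math

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        labelCounts[currentLabel] = labelCounts.get(currentLabel, 0) + 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * math.log(prob, 2)
    return shannonEnt

# Hypothetical toy dataset: the last column is the class label.
dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
           [0, 1, 'no'], [0, 1, 'no']]
print(calcShannonEnt(dataSet))  # ~0.971: two classes with p = 0.4 and 0.6
```

Adding a third class label to the dataset would raise the result, since a more mixed set has higher entropy.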

