Dataset Division: Information Entropy


In the previous section we studied kNN. The biggest drawback of kNN is that it cannot reveal the internal meaning of the data, whereas a decision tree expresses what it has learned about a classification problem in a form that is very easy for people to understand.

There are many decision tree algorithms, including CART, ID3 and C4.5. ID3 and C4.5 are both based on information entropy and are the subject of this section.

1. Information Entropy

  Entropy originated in thermodynamics. In the context of the second law of thermodynamics, entropy measures the number of states a system can reach: the more states available to the system, the greater its entropy. Shannon introduced the concept of information entropy in his 1948 paper "A Mathematical Theory of Communication", and information theory has since grown into a discipline of its own.

Information entropy measures the uncertainty of a random variable, i.e. the expected amount of information needed to describe it. The larger the entropy, the more possible outcomes the variable has; the smaller the entropy, the less uncertain the variable is.

The information carried by a single outcome can be defined as follows. If the items to be classified may fall into several classes, the information of the symbol x_i is defined as

    I(x_i) = -log2 p(x_i)

where p(x_i) is the probability of choosing class x_i. For example, a class chosen with probability 1/2 carries -log2(1/2) = 1 bit of information.

To calculate the entropy, we take the expected value of this information over all possible classes:

    H = -sum_{i=1..n} p(x_i) * log2 p(x_i)

where n is the number of classes.
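
As a quick numerical check of this formula (using, purely as an illustration, a set of 5 samples of which 2 are labelled 'yes' and 3 'no', the same split as in the toy data set used later in this post):

from math import log

# entropy of a 5-sample set with 2 'yes' and 3 'no' labels
p_yes, p_no = 2 / 5.0, 3 / 5.0
H = -(p_yes * log(p_yes, 2) + p_no * log(p_no, 2))
print(H)  # roughly 0.9710 bits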

2. Calculating Information Entropy

  Here is a small example: classifying people as men or women based on two features, whether the person has an Adam's apple and whether the person has a beard.

We can use Python to calculate the information entropy. The program is as follows:

from math import log

# create a simple data set
def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels

# calculate the Shannon entropy of a data set
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for vec in dataSet:                      # count how often each class label occurs
        currentLabel = vec[-1]               # the class label is the last column
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:                  # sum over all possible classes
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt

# simple test
myDat, labels = createDataSet()
print(myDat)
print(calcShannonEnt(myDat))

The test results are as follows:

[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
0.970950594455
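
For intuition (this check is my own addition, not part of the original program), the entropy is 0 when every sample carries the same label and reaches its maximum of 1 bit when a binary label is split 50/50; the same function confirms this:

print(calcShannonEnt([[1, 'yes'], [1, 'yes']]))  # 0.0
print(calcShannonEnt([[1, 'yes'], [1, 'no']]))   # 1.0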

Once the entropy can be computed, the data set can be divided on the feature that produces the largest information gain, that is, the greatest reduction in entropy. A rough sketch of that step follows.
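
Below is a minimal sketch of that feature-selection step, in the style of the program above. The helper names splitDataSet and chooseBestFeatureToSplit are my own choice for this illustration, and the code assumes the calcShannonEnt function defined earlier:

# return the rows whose feature `axis` equals `value`, with that feature removed
def splitDataSet(dataSet, axis, value):
    subSet = []
    for vec in dataSet:
        if vec[axis] == value:
            subSet.append(vec[:axis] + vec[axis + 1:])
    return subSet

# pick the feature whose split gives the largest information gain
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1            # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestGain, bestFeature = 0.0, -1
    for i in range(numFeatures):
        values = set(vec[i] for vec in dataSet)  # all values this feature takes
        newEntropy = 0.0
        for value in values:
            subSet = splitDataSet(dataSet, i, value)
            prob = len(subSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subSet)   # weighted entropy after the split
        gain = baseEntropy - newEntropy          # information gain of splitting on feature i
        if gain > bestGain:
            bestGain, bestFeature = gain, i
    return bestFeature

print(chooseBestFeatureToSplit(myDat))  # 0 for the toy data set above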
