In the previous section we studied k-Nearest Neighbors (KNN). The biggest drawback of KNN is that it gives no insight into the underlying meaning of the data. The advantage of using decision trees for classification, on the other hand, is that the resulting model is very easy to understand.
There are many decision tree algorithms, including CART, ID3, and C4.5. ID3 and C4.5 are both based on information entropy and are what we will study today.
1. Information Entropy
Entropy originated in thermodynamics. According to the second law of thermodynamics, entropy measures the number of states a system can reach: the more states a system can reach, the greater its entropy. Shannon introduced the concept of information entropy in his 1948 paper "A Mathematical Theory of Communication", and information theory has since developed into a discipline of its own.
Information entropy measures the expected amount of information carried by a random variable. The larger the entropy of a variable, the more possible outcomes it has and the more uncertain it is; the smaller the entropy, the less information it carries.
The definition of information can be understood as follows. If the items to be classified may fall into multiple categories, then the information of the symbol x_i is defined as

I(x_i) = -log2 p(x_i)

where p(x_i) is the probability of selecting that category.
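As a quick sanity check (not from the original text, and the helper name selfInformation is my own), the self-information of an event with probability 0.5 is exactly 1 bit, and rarer events carry more information:

    from math import log

    # Self-information I(x) = -log2(p(x))
    def selfInformation(p):
        return -log(p, 2)

    print(selfInformation(0.5))   # 1.0 bit  -- a fair coin flip
    print(selfInformation(0.25))  # 2.0 bits -- a rarer event carries more information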
To calculate the entropy, we take the expected value of the information over all possible categories:

H = -Σ_{i=1}^{n} p(x_i) * log2 p(x_i)

where n is the number of categories.
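For a concrete hand calculation (the 2-to-3 split is chosen here to match the dataset used in the program further below, not taken from the original text), two classes with probabilities 2/5 and 3/5 give an entropy of about 0.971 bits:

    from math import log

    # Hand calculation of H for two classes with probabilities 2/5 and 3/5
    p_yes = 2.0 / 5
    p_no = 3.0 / 5
    entropy = -(p_yes * log(p_yes, 2) + p_no * log(p_no, 2))
    print(entropy)  # roughly 0.971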
2. Calculating Information Entropy
Here is a small example: classifying people as male or female using two features, whether the person has an Adam's apple and whether the person has a long beard.
We use Python to calculate the information entropy. The program is as follows:
    from math import log

    # Create a simple dataset
    def createDataSet():
        dataSet = [[1, 1, 'yes'],
                   [1, 1, 'yes'],
                   [1, 0, 'no'],
                   [0, 1, 'no'],
                   [0, 1, 'no']]
        labels = ['no surfacing', 'flippers']
        return dataSet, labels

    # Calculate the Shannon entropy of a dataset
    def calcShannonEnt(dataSet):
        numEntries = len(dataSet)
        labelCounts = {}
        for featVec in dataSet:
            currentLabel = featVec[-1]            # the class label is the last element
            if currentLabel not in labelCounts.keys():
                labelCounts[currentLabel] = 0     # create a dictionary entry for a new label
            labelCounts[currentLabel] += 1
        shannonEnt = 0.0
        for key in labelCounts:                   # sum over all possible categories
            prob = float(labelCounts[key]) / numEntries
            shannonEnt -= prob * log(prob, 2)
        return shannonEnt

    # Simple test
    myDat, labels = createDataSet()
    print(myDat)
    print(calcShannonEnt(myDat))
The test results are as follows:
    [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
    0.970950594455
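To see the earlier claim in action (the more possible categories, the higher the entropy), one optional experiment, not part of the original listing, is to overwrite one class label with a third value and recompute. The snippet assumes createDataSet() and calcShannonEnt() from the program above are already in scope:

    # Assuming createDataSet() and calcShannonEnt() from the program above are defined
    myDat, labels = createDataSet()
    myDat[0][-1] = 'maybe'        # introduce a third class label
    print(myDat)
    print(calcShannonEnt(myDat))  # roughly 1.37, higher than the 0.97 obtained above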
Once we can measure the entropy of a dataset, we can split the dataset on the feature that yields the largest information gain, i.e. the largest reduction in entropy.
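Although the details belong to the next step of the algorithm, here is a minimal sketch of what such a split might look like. The helper names splitDataSet and chooseBestFeatureToSplit are my own choices, not from this section's listing, and the code assumes createDataSet() and calcShannonEnt() from above are in scope:

    # A minimal sketch; assumes createDataSet() and calcShannonEnt() are defined above
    def splitDataSet(dataSet, axis, value):
        # Return the rows whose feature 'axis' equals 'value', with that feature removed
        retDataSet = []
        for featVec in dataSet:
            if featVec[axis] == value:
                retDataSet.append(featVec[:axis] + featVec[axis + 1:])
        return retDataSet

    def chooseBestFeatureToSplit(dataSet):
        # Pick the feature whose split yields the largest information gain
        numFeatures = len(dataSet[0]) - 1          # the last column is the class label
        baseEntropy = calcShannonEnt(dataSet)
        bestGain, bestFeature = 0.0, -1
        for i in range(numFeatures):
            values = set(example[i] for example in dataSet)
            newEntropy = 0.0
            for value in values:                   # weighted entropy after the split
                subSet = splitDataSet(dataSet, i, value)
                prob = float(len(subSet)) / len(dataSet)
                newEntropy += prob * calcShannonEnt(subSet)
            infoGain = baseEntropy - newEntropy
            if infoGain > bestGain:
                bestGain, bestFeature = infoGain, i
        return bestFeature

    print(chooseBestFeatureToSplit(createDataSet()[0]))  # prints 0 for this toy dataset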