I. Introduction
The k-nearest neighbor algorithm discussed earlier is the simplest and most effective algorithm for classifying data. It is an instance-based learner: to use it, we must have training samples close to the actual data. Moreover, k-NN must keep the entire training set, so a large training set requires a large amount of storage, and it must compute the distance to every example in the dataset, which is time consuming. In addition, it tells us nothing about the underlying structure of the data.
Another kind of classification algorithm is the decision tree. To classify an instance, a decision tree asks a sequence of questions to reach the final class. For example, given an animal, to judge what sort of animal it is we first ask "Is it a mammal?"; if not, "Is it a terrestrial animal?"; if not, "Is it an aerial animal?"; and so on, until the animal's class is determined.
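The question chain above is exactly the structure of a decision tree: each question is an internal node, and each answer routes the example toward a leaf holding a class label. A minimal sketch (the function name `classify_animal` and the returned labels are illustrative, not part of the original):

```python
def classify_animal(is_mammal, is_terrestrial, is_aerial):
    # Each question is one internal node of the tree;
    # the answers route the example down to a leaf (a class label).
    if is_mammal:
        return 'mammal'
    elif is_terrestrial:
        return 'terrestrial animal'
    elif is_aerial:
        return 'aerial animal'
    else:
        return 'other'

print(classify_animal(True, False, False))   # 'mammal'
print(classify_animal(False, False, True))   # 'aerial animal'
```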
II. Information and Entropy
Entropy measures the degree of disorder in a system: the greater the entropy, the more disordered the system. Classifying the data in a dataset is a process of reducing the dataset's entropy.
A decision tree algorithm is a process of repeatedly splitting a dataset. The principle for splitting is to make the disordered data more ordered. We treat the data as carrying useful information, and an effective way to organize that information is to use information theory.
Information gain is the change in information before and after splitting the dataset; the best choice is the feature that yields the highest information gain. So how do we calculate information gain? The measure of a set's information is called Shannon entropy, or simply entropy.
"If you don't understand what information gain and entropy are, don't worry: they were destined from the day they were born to confuse the world. After Claude Shannon founded information theory, John von Neumann suggested using the term 'entropy' precisely because nobody knew what it meant."
Entropy is defined as the expected value of the information. Before clarifying that concept, let us first define the information itself. If the instance to be classified may fall into one of multiple classes, the information of the symbol x_i is defined as

l(x_i) = -log2 p(x_i)

where p(x_i) is the probability of choosing class x_i.

The entropy is the information expected over all possible values of all classes:

H = -Σ p(x_i) log2 p(x_i), summed over i = 1, ..., n
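A quick numeric illustration of the two formulas (the coin example and the helper name `entropy` are ours, not from the original): a fair coin is maximally uncertain and has entropy exactly 1 bit, while a heavily biased coin is more predictable and has lower entropy.

```python
from math import log

# l(x_i) = -log2 p(x_i): rarer outcomes carry more information
p_heads = 0.9
info_heads = -log(p_heads, 2)   # ~0.152 bits for a very likely outcome

# H = -sum p(x_i) * log2 p(x_i)
def entropy(probs):
    return -sum(p * log(p, 2) for p in probs)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit (maximum uncertainty)
print(entropy([0.9, 0.1]))   # biased coin: ~0.469 bits (more predictable)
```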
Now let us apply these formulas to a concrete example.
III. A Simple Example
The following table contains five marine animals and two features: whether the animal can survive without surfacing, and whether it has flippers. The animals are divided into two classes, fish and non-fish. The question we want to answer is whether to split the data on the first feature or the second.
| # | Can survive without surfacing | Has flippers | Is a fish |
|---|-------------------------------|--------------|-----------|
| 1 | Yes | Yes | Yes |
| 2 | Yes | Yes | Yes |
| 3 | Yes | No  | No  |
| 4 | No  | Yes | No  |
| 5 | No  | Yes | No  |
The Python code for computing the Shannon entropy is given below for later use (all code in this article is written in Python).
```python
from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]          # the last item is the class label
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)   # H = -sum p * log2(p)
    return shannonEnt
```
If you know Python, the code is fairly simple, but we should first explain what kind of data the dataset holds and how it is structured. The following function generates the dataset, which should also clarify what `currentLabel = featVec[-1]` is doing in the code above.
```python
def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels
```
The data we deal with is a dataset like `dataSet`: each record is a list, and the last item of each record is that record's class label. Running `calcShannonEnt` on this dataset gives an entropy of about 0.9710.
The higher the entropy, the more mixed the data; and the more classes the data contains, the higher the entropy. We can observe this change directly by adding another class.
What do we do next? Remember the original question: should we split the data on the first feature or the second? The answer is whichever feature leaves the smaller entropy after splitting. We compute the entropy of each feature's split of the dataset, and then decide which feature gives the best split.
First, write a function that splits the dataset on a given feature:
```python
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            # keep the record but remove the feature at position `axis`
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
```
The code uses two Python list methods, extend() and append(). They look similar, but when the argument is itself a list they behave completely differently: extend() merges the elements into the list one by one, while append() adds the whole list as a single nested element. Don't worry if the code isn't immediately clear; run it first and get a feel for it.
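A self-contained sketch of both points (the snake_case names `split_data_set` and `my_dat` are ours; the split logic mirrors `splitDataSet` above): first the extend/append difference, then the splitter run on the marine-animal data.

```python
# extend() flattens the argument's elements into the list;
# append() adds the argument as one nested element.
a = [1, 2, 3]
a.extend([4, 5])   # a -> [1, 2, 3, 4, 5]
b = [1, 2, 3]
b.append([4, 5])   # b -> [1, 2, 3, [4, 5]]

def split_data_set(data_set, axis, value):
    # same logic as splitDataSet above
    ret = []
    for feat_vec in data_set:
        if feat_vec[axis] == value:
            reduced = feat_vec[:axis]
            reduced.extend(feat_vec[axis + 1:])
            ret.append(reduced)
    return ret

my_dat = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
          [0, 1, 'no'], [0, 1, 'no']]
print(split_data_set(my_dat, 0, 1))   # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(split_data_set(my_dat, 0, 0))   # [[1, 'no'], [1, 'no']]
```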
The last function computes the entropy of the split for every feature and then determines which feature gives the best split of the dataset:
```python
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1          # the last column is the label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy    # reduction in entropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
It can be seen that the best division is obtained by splitting on the first feature (index 0, 'no surfacing'): it leaves the least entropy after the split, i.e. it has the highest information gain.
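We can verify that claim by computing the two information gains by hand. A self-contained sketch (the helper names `entropy` and `gains` are ours; the arithmetic is exactly what `chooseBestFeatureToSplit` does on this data):

```python
from math import log

def entropy(labels):
    # Shannon entropy of a list of class labels
    n = len(labels)
    ent = 0.0
    for lab in set(labels):
        p = labels.count(lab) / float(n)
        ent -= p * log(p, 2)
    return ent

data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
        [0, 1, 'no'], [0, 1, 'no']]
base = entropy([row[-1] for row in data])   # ~0.9710

gains = []
for i in (0, 1):
    new_ent = 0.0
    for v in set(row[i] for row in data):
        subset = [row[-1] for row in data if row[i] == v]
        new_ent += len(subset) / 5.0 * entropy(subset)
    gains.append(base - new_ent)

# feature 0 gain ~0.4200 beats feature 1 gain ~0.1710,
# so feature 0 ('no surfacing') is the best split
print(gains)
```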