In the previous section we studied k-Nearest Neighbors (KNN). The biggest drawback of KNN is that it gives no insight into the underlying meaning of the data. The advantage of using decision trees for classification, on the other hand, is that the resulting model is very easy to understand.
There are many decision tree algorithms, including CART, ID3, and C4.5. ID3 and C4.5 are both based on information entropy and are what we will study today.
1. Information Entropy
Entropy originated in thermodynamics. According to the second law of thermodynamics, entropy measures the number of states a system can reach: the more states a system can reach, the greater its entropy. Shannon introduced the concept of information entropy in his 1948 paper "A Mathematical Theory of Communication", and information theory has since developed into a discipline of its own.
Information entropy measures the expected amount of information carried by a random variable. The larger the entropy of a variable, the more possible outcomes it has and the more uncertain it is; the smaller the entropy, the less information it carries.
The definition of information can be understood as follows. If the items to be classified may fall into multiple categories, then the information of the symbol x_i is defined as

I(x_i) = -log2 p(x_i)

where p(x_i) is the probability of selecting that category.
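As a quick sanity check (not from the original text, and the helper name selfInformation is my own), the self-information of an event with probability 0.5 is exactly 1 bit, and rarer events carry more information:

    from math import log

    # Self-information I(x) = -log2(p(x))
    def selfInformation(p):
        return -log(p, 2)

    print(selfInformation(0.5))   # 1.0 bit  -- a fair coin flip
    print(selfInformation(0.25))  # 2.0 bits -- a rarer event carries more information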
To calculate the entropy, we take the expected value of the information over all possible categories:

H = -Σ_{i=1}^{n} p(x_i) * log2 p(x_i)

where n is the number of categories.
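For a concrete hand calculation (the 2-to-3 split is chosen here to match the dataset used in the program further below, not taken from the original text), two classes with probabilities 2/5 and 3/5 give an entropy of about 0.971 bits:

    from math import log

    # Hand calculation of H for two classes with probabilities 2/5 and 3/5
    p_yes = 2.0 / 5
    p_no = 3.0 / 5
    entropy = -(p_yes * log(p_yes, 2) + p_no * log(p_no, 2))
    print(entropy)  # roughly 0.971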
2. Calculating Information Entropy
Here is a small example: classifying people as male or female using two features, whether the person has an Adam's apple and whether the person has a long beard.
We use Python to calculate the information entropy. The program is as follows:
    from math import log

    # Create a simple dataset
    def createDataSet():
        dataSet = [[1, 1, 'yes'],
                   [1, 1, 'yes'],
                   [1, 0, 'no'],
                   [0, 1, 'no'],
                   [0, 1, 'no']]
        labels = ['no surfacing', 'flippers']
        return dataSet, labels

    # Calculate the Shannon entropy of a dataset
    def calcShannonEnt(dataSet):
        numEntries = len(dataSet)
        labelCounts = {}
        for featVec in dataSet:
            currentLabel = featVec[-1]            # the class label is the last element
            if currentLabel not in labelCounts.keys():
                labelCounts[currentLabel] = 0     # create a dictionary entry for a new label
            labelCounts[currentLabel] += 1
        shannonEnt = 0.0
        for key in labelCounts:                   # sum over all possible categories
            prob = float(labelCounts[key]) / numEntries
            shannonEnt -= prob * log(prob, 2)
        return shannonEnt

    # Simple test
    myDat, labels = createDataSet()
    print(myDat)
    print(calcShannonEnt(myDat))
The test results are as follows:
    [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
    0.970950594455
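To see the earlier claim in action (the more possible categories, the higher the entropy), one optional experiment, not part of the original listing, is to overwrite one class label with a third value and recompute. The snippet assumes createDataSet() and calcShannonEnt() from the program above are already in scope:

    # Assuming createDataSet() and calcShannonEnt() from the program above are defined
    myDat, labels = createDataSet()
    myDat[0][-1] = 'maybe'        # introduce a third class label
    print(myDat)
    print(calcShannonEnt(myDat))  # roughly 1.37, higher than the 0.97 obtained above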
Once we can measure the entropy of a dataset, we can split the dataset on the feature that yields the largest information gain, i.e. the largest reduction in entropy.
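Although the details belong to the next step of the algorithm, here is a minimal sketch of what such a split might look like. The helper names splitDataSet and chooseBestFeatureToSplit are my own choices, not from this section's listing, and the code assumes createDataSet() and calcShannonEnt() from above are in scope:

    # A minimal sketch; assumes createDataSet() and calcShannonEnt() are defined above
    def splitDataSet(dataSet, axis, value):
        # Return the rows whose feature 'axis' equals 'value', with that feature removed
        retDataSet = []
        for featVec in dataSet:
            if featVec[axis] == value:
                retDataSet.append(featVec[:axis] + featVec[axis + 1:])
        return retDataSet

    def chooseBestFeatureToSplit(dataSet):
        # Pick the feature whose split yields the largest information gain
        numFeatures = len(dataSet[0]) - 1          # the last column is the class label
        baseEntropy = calcShannonEnt(dataSet)
        bestGain, bestFeature = 0.0, -1
        for i in range(numFeatures):
            values = set(example[i] for example in dataSet)
            newEntropy = 0.0
            for value in values:                   # weighted entropy after the split
                subSet = splitDataSet(dataSet, i, value)
                prob = float(len(subSet)) / len(dataSet)
                newEntropy += prob * calcShannonEnt(subSet)
            infoGain = baseEntropy - newEntropy
            if infoGain > bestGain:
                bestGain, bestFeature = infoGain, i
        return bestFeature

    print(chooseBestFeatureToSplit(createDataSet()[0]))  # prints 0 for this toy dataset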