I. Introduction
The k-nearest neighbor algorithm discussed earlier is the simplest and most effective algorithm for classifying data. It is an instance-based learner: to use it, we must have training samples close to the actual data. Moreover, k-NN must keep the entire training set, so a large training set requires a large amount of storage, and it must compute the distance to every example in the dataset, which is time consuming. In addition, it tells us nothing about the underlying structure of the data.
Another kind of classification algorithm is the decision tree. To classify an instance, a decision tree asks a sequence of questions to reach the final class. For example, given an animal, to judge what sort of animal it is we first ask "Is it a mammal?"; if not, "Is it a terrestrial animal?"; if not, "Is it an aerial animal?"; and so on, until the animal's class is determined.
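The question chain above is exactly the structure of a decision tree: each question is an internal node, and each answer routes the example toward a leaf holding a class label. A minimal sketch (the function name `classify_animal` and the returned labels are illustrative, not part of the original):

```python
def classify_animal(is_mammal, is_terrestrial, is_aerial):
    # Each question is one internal node of the tree;
    # the answers route the example down to a leaf (a class label).
    if is_mammal:
        return 'mammal'
    elif is_terrestrial:
        return 'terrestrial animal'
    elif is_aerial:
        return 'aerial animal'
    else:
        return 'other'

print(classify_animal(True, False, False))   # 'mammal'
print(classify_animal(False, False, True))   # 'aerial animal'
```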
II. Information and Entropy
Entropy measures the degree of disorder in a system: the greater the entropy, the more disordered the system. Classifying the data in a dataset is a process of reducing the dataset's entropy.
A decision tree algorithm is a process of repeatedly splitting a dataset. The principle for splitting is to make the disordered data more ordered. We treat the data as carrying useful information, and an effective way to organize that information is to use information theory.
Information gain is the change in information before and after splitting the dataset; the best choice is the feature that yields the highest information gain. So how do we calculate information gain? The measure of a set's information is called Shannon entropy, or simply entropy.
"If you don't understand what information gain and entropy are, don't worry: they were destined from the day they were born to confuse the world. After Claude Shannon founded information theory, John von Neumann suggested using the term 'entropy' precisely because nobody knew what it meant."
Entropy is defined as the expected value of the information. Before clarifying that concept, let us first define the information itself. If the instance to be classified may fall into one of multiple classes, the information of the symbol x_i is defined as

l(x_i) = -log2 p(x_i)

where p(x_i) is the probability of choosing class x_i.

The entropy is the information expected over all possible values of all classes:

H = -Σ p(x_i) log2 p(x_i), summed over i = 1, ..., n
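A quick numeric illustration of the two formulas (the coin example and the helper name `entropy` are ours, not from the original): a fair coin is maximally uncertain and has entropy exactly 1 bit, while a heavily biased coin is more predictable and has lower entropy.

```python
from math import log

# l(x_i) = -log2 p(x_i): rarer outcomes carry more information
p_heads = 0.9
info_heads = -log(p_heads, 2)   # ~0.152 bits for a very likely outcome

# H = -sum p(x_i) * log2 p(x_i)
def entropy(probs):
    return -sum(p * log(p, 2) for p in probs)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit (maximum uncertainty)
print(entropy([0.9, 0.1]))   # biased coin: ~0.469 bits (more predictable)
```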
Now let us apply these formulas to a concrete example.
III. A Simple Example
The following table contains five marine animals and two features: whether the animal can survive without surfacing, and whether it has flippers. The animals are divided into two classes, fish and non-fish. The question we want to answer is whether to split the data on the first feature or the second.
| # | Can survive without surfacing | Has flippers | Is a fish |
|---|-------------------------------|--------------|-----------|
| 1 | Yes | Yes | Yes |
| 2 | Yes | Yes | Yes |
| 3 | Yes | No  | No  |
| 4 | No  | Yes | No  |
| 5 | No  | Yes | No  |
The Python code for computing the Shannon entropy is given below for later use (all code in this article is written in Python).
```python
from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]          # the last item is the class label
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)   # H = -sum p * log2(p)
    return shannonEnt
```
If you know Python, the code is fairly simple, but we should first explain what kind of data the dataset holds and how it is structured. The following function generates the dataset, which should also clarify what `currentLabel = featVec[-1]` is doing in the code above.
```python
def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels
```
The data we deal with is a dataset like `dataSet`: each record is a list, and the last item of each record is that record's class label. Running `calcShannonEnt` on this dataset gives an entropy of about 0.9710.
The higher the entropy, the more mixed the data; and the more classes the data contains, the higher the entropy. We can observe this change directly by adding another class.
What do we do next? Remember the original question: should we split the data on the first feature or the second? The answer is whichever feature leaves the smaller entropy after splitting. We compute the entropy of each feature's split of the dataset, and then decide which feature gives the best split.
First, write a function that splits the dataset on a given feature:
```python
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            # keep the record but remove the feature at position `axis`
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
```
The code uses two Python list methods, extend() and append(). They look similar, but when the argument is itself a list they behave completely differently: extend() merges the elements into the list one by one, while append() adds the whole list as a single nested element. Don't worry if the code isn't immediately clear; run it first and get a feel for it.
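A self-contained sketch of both points (the snake_case names `split_data_set` and `my_dat` are ours; the split logic mirrors `splitDataSet` above): first the extend/append difference, then the splitter run on the marine-animal data.

```python
# extend() flattens the argument's elements into the list;
# append() adds the argument as one nested element.
a = [1, 2, 3]
a.extend([4, 5])   # a -> [1, 2, 3, 4, 5]
b = [1, 2, 3]
b.append([4, 5])   # b -> [1, 2, 3, [4, 5]]

def split_data_set(data_set, axis, value):
    # same logic as splitDataSet above
    ret = []
    for feat_vec in data_set:
        if feat_vec[axis] == value:
            reduced = feat_vec[:axis]
            reduced.extend(feat_vec[axis + 1:])
            ret.append(reduced)
    return ret

my_dat = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
          [0, 1, 'no'], [0, 1, 'no']]
print(split_data_set(my_dat, 0, 1))   # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(split_data_set(my_dat, 0, 0))   # [[1, 'no'], [1, 'no']]
```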
The last function computes the entropy of the split for every feature and then determines which feature gives the best split of the dataset:
```python
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1          # the last column is the label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy    # reduction in entropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
It can be seen that the best division is obtained by splitting on the first feature (index 0, 'no surfacing'): it leaves the least entropy after the split, i.e. it has the highest information gain.
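We can verify that claim by computing the two information gains by hand. A self-contained sketch (the helper names `entropy` and `gains` are ours; the arithmetic is exactly what `chooseBestFeatureToSplit` does on this data):

```python
from math import log

def entropy(labels):
    # Shannon entropy of a list of class labels
    n = len(labels)
    ent = 0.0
    for lab in set(labels):
        p = labels.count(lab) / float(n)
        ent -= p * log(p, 2)
    return ent

data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
        [0, 1, 'no'], [0, 1, 'no']]
base = entropy([row[-1] for row in data])   # ~0.9710

gains = []
for i in (0, 1):
    new_ent = 0.0
    for v in set(row[i] for row in data):
        subset = [row[-1] for row in data if row[i] == v]
        new_ent += len(subset) / 5.0 * entropy(subset)
    gains.append(base - new_ent)

# feature 0 gain ~0.4200 beats feature 1 gain ~0.1710,
# so feature 0 ('no surfacing') is the best split
print(gains)
```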