Information, information entropy, and information gain (IG)
The guiding principle when dividing a dataset is to make disordered data more ordered. One way to organize disordered data is to measure it with information theory, the branch of science that quantifies information.
The change in information before and after partitioning the dataset is called the information gain. Once we know how to compute it, we can evaluate the information gain obtained by splitting the dataset on each feature's values; the feature that yields the highest information gain is the best choice for the split.
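Written as a formula (not spelled out in the original text, but standard for ID3-style trees and exactly what chooseBestFeatToSplit below computes): if splitting dataset $D$ on feature $a$ with $V$ distinct values produces subsets $D^1, \dots, D^V$, then

$\mathrm{Gain}(D, a) = H(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} H(D^v)$

where $H(\cdot)$ is the entropy defined next.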
Definition of information and information entropy
If the dataset to be classified can be divided among multiple classes, the information of class $x_i$ is defined as:
$l(x_i) = -\log_2 p(x_i)$
where $p(x_i)$ is the proportion of samples belonging to class $x_i$.
To compute the entropy, we take the expectation of the information over all possible classes (by the expectation formula for discrete random variables):
$H = -\sum_{i=1}^{N} p(x_i) \log_2 p(x_i)$
This traverse-multiply-and-sum can be computed as an inner product between the vector of class proportions and its elementwise $\log_2$.
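A minimal sketch of that inner-product computation, assuming NumPy is available (the proportion vector p below is made up for illustration):

import numpy as np

# Hypothetical vector of class proportions p(x_i); entries sum to 1
p = np.array([0.5, 0.25, 0.25])

# Entropy as an inner product: H = -<p, log2(p)>
H = -p @ np.log2(p)
print(H)  # 1.5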
Information entropy measures the degree of disorder of information: the greater the entropy, the more disordered the data; when it equals 0, all samples belong to the same class and the data is completely ordered.
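As a quick check of both extremes: a dataset split evenly between two classes has $H = -(\frac{1}{2}\log_2\frac{1}{2} + \frac{1}{2}\log_2\frac{1}{2}) = 1$, the maximum for two classes, while a dataset whose samples all share one class has $H = -1 \cdot \log_2 1 = 0$.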
Properties of entropy: (1) non-negativity: since $0 < p(x_i) \le 1$, we have $\log_2 p(x_i) \le 0$, and therefore $H \ge 0$.
Computing a dataset's Shannon entropy and the best splitting feature
Compute the Shannon entropy of a dataset from its class labels:
from collections import Counter
from math import log

def calcShannonEnt(dataSet):
    # The class label is the last column of each sample
    classCnt = [sample[-1] for sample in dataSet]
    n = len(dataSet)
    classCnt = Counter(classCnt)
    ent = 0.
    for times in classCnt.values():
        # p(x_i) = times/n; accumulate -p(x_i) * log2(p(x_i))
        ent -= times / n * log(times / n, 2)
    return ent
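A quick usage sketch (myDat is a made-up toy dataset: two feature columns plus a class label in the last column):

myDat = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(calcShannonEnt(myDat))  # 0.9709..., since p(yes) = 2/5 and p(no) = 3/5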
To divide a dataset by a given feature (attribute column):
# The third parameter val is not specified manually:
# this function is not called directly from outside, but by other
# functions, which pass in val while iterating over the unique
# values of the attribute column
def splitDataSet(dataSet, axis, val):
    # Keep the samples whose value in column `axis` equals val,
    # stripping that column out of each kept sample
    splittedDataSet = []
    for sample in dataSet:
        if sample[axis] == val:
            splittedDataSet.append(sample[:axis] + sample[axis+1:])
    return splittedDataSet
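Continuing with the hypothetical myDat from above:

# Keep samples whose feature 0 equals 1, with that column removed
print(splitDataSet(myDat, 0, 1))  # [[1, 'yes'], [1, 'yes'], [0, 'no']]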
Choosing the best way to partition the dataset, i.e. finding the best attribute column, requires traversing the attribute columns and picking the one with the maximum information gain:
def chooseBestFeatToSplit(dataSet):
    baseEnt = calcShannonEnt(dataSet)
    bestInfoGain, bestFeat = 0., -1
    # The last column is the class label, so skip it
    for j in range(len(dataSet[0]) - 1):
        featCol = [sample[j] for sample in dataSet]
        uniqFeat = set(featCol)
        newEnt = 0.
        for val in uniqFeat:
            subDataSet = splitDataSet(dataSet, j, val)
            # Weighted sum of the subsets' entropies
            newEnt += len(subDataSet) / len(dataSet) * calcShannonEnt(subDataSet)
        infoGain = baseEnt - newEnt
        if bestInfoGain < infoGain:
            bestInfoGain = infoGain
            bestFeat = j
    return bestFeat
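On the hypothetical myDat above, splitting on feature 0 gives an information gain of about 0.420 versus about 0.171 for feature 1, so:

print(chooseBestFeatToSplit(myDat))  # 0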