Machine Learning Basics (III): Information, Information Entropy, and Information Gain

Source: Internet
Author: User

Key terms: information, information entropy, and information gain (IG).

The guiding rule when splitting a dataset is to make disordered data more ordered. One way to organize disordered data is to measure its information content using information theory, the field that quantifies information.

The change in information before and after a dataset is split is called the information gain. Once we know how to compute it, we can compute the information gain of splitting the dataset on each feature; the feature that yields the highest information gain is the best choice.

Definition of information and information entropy

If the samples to be classified may fall into several classes, the information of class $x_i$ is defined as:

$$l(x_i) = -\log_2 p(x_i)$$

where $p(x_i)$ is the proportion of samples that belong to this class.
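As a small illustration of this definition, the sketch below evaluates the information of a class for a few proportions. The function name `information` is a hypothetical helper introduced here, not part of the original code; it assumes the logarithm is base 2, so the result is in bits.

```python
from math import log2

def information(p):
    """Information, in bits, of a class whose sample proportion is p (0 < p <= 1)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -log2(p)

# The rarer the class, the more information an observation of it carries.
print(information(0.5))   # 1.0
print(information(0.25))  # 2.0
```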

To compute the entropy, we take the expected value of the information over all possible classes (using the expectation formula for a discrete random variable):

$$H = -\sum_{i=1}^{N} p(x_i) \log_2 p(x_i)$$

Because the formula multiplies term by term and then sums, the entropy can be computed as an inner product of the vector of proportions and the vector of their logarithms.
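The inner-product view can be sketched in plain Python as follows. The name `entropy` and the choice to skip zero proportions are assumptions made for this illustration; the sketch takes a list of class proportions that sum to 1.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits, written as an inner product of p and log2(p)."""
    # Classes with p == 0 contribute nothing to the sum, so they are skipped.
    p = [q for q in probs if q > 0]
    logs = [log2(q) for q in p]
    # Inner product: multiply element by element, then sum.
    return -sum(a * b for a, b in zip(p, logs))

print(entropy([0.5, 0.5]))  # 1.0
print(entropy([0.25] * 4))  # 2.0
```

A single class with proportion 1 gives an entropy of zero, while spreading the samples evenly over more classes increases it.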

Information entropy measures the degree of disorder in the data: the greater the entropy, the more disordered the data. An entropy of 0 means that all samples belong to the same class, i.e. the data is fully ordered.

Properties of entropy: (1) non-negativity: since $0 < p(x_i) \le 1$, we have $\log_2 p(x_i) \le 0$, so every term $-p(x_i) \log_2 p(x_i)$ is non-negative and $H \ge 0$.

Computing the Shannon entropy of a dataset and its best splitting feature

Compute the Shannon entropy of a dataset from its class labels:

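from collections import Counter
from math import log

def calcshannonent(dataset):
    # The class label is the last element of each sample
    classcnt = [sample[-1] for sample in dataset]
    n = len(dataset)
    # Count how many samples fall into each class
    classcnt = Counter(classcnt)
    ent = 0.
    for times in classcnt.values():
        # times / n is the class proportion p(x_i); true division (Python 3)
        ent -= times / n * log(times / n, 2)
    return ent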

To divide a dataset by a given feature (attribute column):

# The third parameter, val, is not specified by hand.
# This function is not meant to be called directly from outside; it is called by other functions,
# which pass in val while iterating over the unique values of the attribute column.
def splitdataset(dataset, axis, val):
    spliteddataset = []
    for sample in dataset:
        if sample[axis] == val:
            # Keep the matching sample, but drop the column used for the split
            spliteddataset.append(sample[:axis] + sample[axis+1:])
    return spliteddataset
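To make the effect of such a split concrete, here is a minimal sketch on a hypothetical toy dataset (two binary features followed by a class label). The helper `split_on_feature` and the data are illustrative assumptions; the helper is a compact stand-in so that the snippet runs on its own.

```python
def split_on_feature(dataset, axis, val):
    """Return the samples whose feature `axis` equals `val`, with that column removed."""
    return [s[:axis] + s[axis + 1:] for s in dataset if s[axis] == val]

# Hypothetical toy data: [feature 0, feature 1, class label]
toy = [
    [1, 1, "yes"],
    [1, 1, "yes"],
    [1, 0, "no"],
    [0, 1, "no"],
    [0, 1, "no"],
]

left = split_on_feature(toy, 0, 1)
right = split_on_feature(toy, 0, 0)
print(left)   # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(right)  # [[1, 'no'], [1, 'no']]
```

Dropping the split column means each subset only carries the features that are still available for later splits.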

Choosing the best way to split the dataset, i.e. finding the best attribute column, requires iterating over the attribute columns and picking the one with the maximum information gain:

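def choosebestfeattosplit(dataset):
    # Entropy of the whole dataset before any split
    baseent = calcshannonent(dataset)
    # -1 signals that no feature gave a positive information gain
    bestinfogain, bestfeat = 0., -1
    # The last column is the class label, so it is not a candidate feature
    for j in range(len(dataset[0]) - 1):
        featcol = [sample[j] for sample in dataset]
        uniqfeat = set(featcol)
        newent = 0.
        for val in uniqfeat:
            subdataset = splitdataset(dataset, j, val)
            # Weight each subset's entropy by its share of the samples
            newent += len(subdataset) / len(dataset) * calcshannonent(subdataset)
        infogain = baseent - newent
        if bestinfogain < infogain:
            bestinfogain = infogain
            bestfeat = j
    return bestfeat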
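The quantity being maximized is the entropy of the whole dataset minus the weighted sum of the entropies of the subsets produced by a split, where each weight is the subset's share of the samples. The sketch below works this out on a hypothetical toy dataset; `entropy_of`, `info_gain`, and the data are assumptions made for illustration, written compactly so that the snippet is self-contained.

```python
from collections import Counter
from math import log2

def entropy_of(dataset):
    """Shannon entropy (bits) of the class labels, assumed to be the last column."""
    n = len(dataset)
    counts = Counter(sample[-1] for sample in dataset)
    return -sum(c / n * log2(c / n) for c in counts.values())

def info_gain(dataset, j):
    """Information gain obtained by splitting `dataset` on feature column j."""
    n = len(dataset)
    weighted = 0.0
    for val in {sample[j] for sample in dataset}:
        subset = [s for s in dataset if s[j] == val]
        weighted += len(subset) / n * entropy_of(subset)
    return entropy_of(dataset) - weighted

# Hypothetical toy data: [feature 0, feature 1, class label]
toy = [
    [1, 1, "yes"],
    [1, 1, "yes"],
    [1, 0, "no"],
    [0, 1, "no"],
    [0, 1, "no"],
]

gains = [info_gain(toy, j) for j in range(len(toy[0]) - 1)]
best = max(range(len(gains)), key=gains.__getitem__)
print([round(g, 3) for g in gains])  # [0.42, 0.171]
print(best)                          # 0
```

On this data, splitting on feature 0 leaves one pure subset (all "no") and one mixed subset, which reduces the entropy more than splitting on feature 1, so feature 0 is selected.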
