Information, information entropy, and information gain (IG)
The guiding principle when dividing a dataset is to make disordered data more ordered. One way to organize disordered data is to measure it with information theory, the branch of science that quantifies information.
The change in information before and after partitioning the dataset is called the information gain. Once we know how to compute it, we can evaluate the information gain obtained by splitting the dataset on each feature's values; the feature that yields the highest information gain is the best choice for the split.
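Written as a formula (not spelled out in the original text, but standard for ID3-style trees and exactly what chooseBestFeatToSplit below computes): if splitting dataset $D$ on feature $a$ with $V$ distinct values produces subsets $D^1, \dots, D^V$, then

$\mathrm{Gain}(D, a) = H(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} H(D^v)$

where $H(\cdot)$ is the entropy defined next.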
Definition of information and information entropy
If the dataset to be classified can be divided among multiple classes, the information of class $x_i$ is defined as:
$l(x_i) = -\log_2 p(x_i)$
where $p(x_i)$ is the proportion of samples belonging to class $x_i$.
To compute the entropy, we take the expectation of the information over all possible classes (by the expectation formula for discrete random variables):
$H = -\sum_{i=1}^{N} p(x_i) \log_2 p(x_i)$
This traverse-multiply-and-sum can be computed as an inner product between the vector of class proportions and its elementwise $\log_2$.
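A minimal sketch of that inner-product computation, assuming NumPy is available (the proportion vector p below is made up for illustration):

import numpy as np

# Hypothetical vector of class proportions p(x_i); entries sum to 1
p = np.array([0.5, 0.25, 0.25])

# Entropy as an inner product: H = -<p, log2(p)>
H = -p @ np.log2(p)
print(H)  # 1.5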
Information entropy measures the degree of disorder of information: the greater the entropy, the more disordered the data; when it equals 0, all samples belong to the same class and the data is completely ordered.
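As a quick check of both extremes: a dataset split evenly between two classes has $H = -(\frac{1}{2}\log_2\frac{1}{2} + \frac{1}{2}\log_2\frac{1}{2}) = 1$, the maximum for two classes, while a dataset whose samples all share one class has $H = -1 \cdot \log_2 1 = 0$.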
Properties of entropy: (1) non-negativity: since $0 < p(x_i) \le 1$, we have $\log_2 p(x_i) \le 0$, and therefore $H \ge 0$.
Computing a dataset's Shannon entropy and the best splitting feature
Compute the Shannon entropy of a dataset from its class labels:
from collections import Counter
from math import log

def calcShannonEnt(dataSet):
    # The class label is the last column of each sample
    classCnt = [sample[-1] for sample in dataSet]
    n = len(dataSet)
    classCnt = Counter(classCnt)
    ent = 0.
    for times in classCnt.values():
        # p(x_i) = times/n; accumulate -p(x_i) * log2(p(x_i))
        ent -= times / n * log(times / n, 2)
    return ent
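A quick usage sketch (myDat is a made-up toy dataset: two feature columns plus a class label in the last column):

myDat = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(calcShannonEnt(myDat))  # 0.9709..., since p(yes) = 2/5 and p(no) = 3/5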
To divide a dataset by a given feature (attribute column):
# The third parameter val is not specified manually:
# this function is not called directly from outside, but by other
# functions, which pass in val while iterating over the unique
# values of the attribute column
def splitDataSet(dataSet, axis, val):
    # Keep the samples whose value in column `axis` equals val,
    # stripping that column out of each kept sample
    splittedDataSet = []
    for sample in dataSet:
        if sample[axis] == val:
            splittedDataSet.append(sample[:axis] + sample[axis+1:])
    return splittedDataSet
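Continuing with the hypothetical myDat from above:

# Keep samples whose feature 0 equals 1, with that column removed
print(splitDataSet(myDat, 0, 1))  # [[1, 'yes'], [1, 'yes'], [0, 'no']]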
Choosing the best way to partition the dataset, i.e. finding the best attribute column, requires traversing the attribute columns and picking the one with the maximum information gain:
def chooseBestFeatToSplit(dataSet):
    baseEnt = calcShannonEnt(dataSet)
    bestInfoGain, bestFeat = 0., -1
    # The last column is the class label, so skip it
    for j in range(len(dataSet[0]) - 1):
        featCol = [sample[j] for sample in dataSet]
        uniqFeat = set(featCol)
        newEnt = 0.
        for val in uniqFeat:
            subDataSet = splitDataSet(dataSet, j, val)
            # Weighted sum of the subsets' entropies
            newEnt += len(subDataSet) / len(dataSet) * calcShannonEnt(subDataSet)
        infoGain = baseEnt - newEnt
        if bestInfoGain < infoGain:
            bestInfoGain = infoGain
            bestFeat = j
    return bestFeat
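On the hypothetical myDat above, splitting on feature 0 gives an information gain of about 0.420 versus about 0.171 for feature 1, so:

print(chooseBestFeatToSplit(myDat))  # 0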