Feature Engineering: Information Entropy, Information Gain, Information Gain Ratio

Source: Internet
Author: User
Information Gain and Feature Engineering

Feature selection is one of the key components of feature engineering. It is an important data preprocessing step: the process of selecting a subset of relevant features from a given set of features.
Feature selection can be broken down into two parts, "subset search" and "subset evaluation". Put simply, we generate a candidate subset, evaluate how good it is, use the evaluation result to generate the next candidate subset, and evaluate that in turn. The process continues until no better candidate subset can be found.
Subset search methods include forward search, backward search, and bidirectional search.
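The generate-and-evaluate loop described above can be sketched as a greedy forward search. This is only a minimal illustration: the function name `forward_search`, the feature names `f1` to `f3`, and the toy scoring function with its made-up values are all assumptions introduced here, not part of the original article. Any subset evaluation measure, such as one based on information gain, could be passed in as `score`.

```python
from typing import Callable, FrozenSet, Iterable


def forward_search(features: Iterable[str],
                   score: Callable[[FrozenSet[str]], float]) -> FrozenSet[str]:
    """Greedy forward search: grow the subset by one feature per round."""
    remaining = set(features)
    selected: FrozenSet[str] = frozenset()
    best_score = score(selected)
    while remaining:
        # Generate candidates: the current subset plus one unused feature.
        candidates = [selected | {f} for f in sorted(remaining)]
        # Evaluate every candidate and keep the best one.
        best_candidate = max(candidates, key=score)
        candidate_score = score(best_candidate)
        # Stop as soon as no candidate beats the current subset.
        if candidate_score <= best_score:
            break
        selected, best_score = best_candidate, candidate_score
        remaining -= selected
    return selected


# Hypothetical toy evaluation: made-up usefulness values minus a size penalty.
usefulness = {"f1": 0.5, "f2": 0.3, "f3": 0.1}


def toy_score(subset: FrozenSet[str]) -> float:
    return sum(usefulness[f] for f in subset) - 0.2 * len(subset)


chosen = forward_search(usefulness, toy_score)
print(sorted(chosen))
```

Backward search would start from the full feature set and remove one feature per round instead; the stopping rule stays the same.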
A common way to evaluate a candidate subset is by information gain. This article introduces that approach and works through an example.

Data

Outlook   Temperature  Humidity  Windy  Play?
Sunny     Hot          High      False  No
Sunny     Hot          High      True   No
Overcast  Hot          High      False  Yes
Rain      Mild         High      False  Yes
Rain      Cool         Normal    False  Yes
Rain      Cool         Normal    True   No
Overcast  Cool         Normal    True   Yes
Sunny     Mild         High      False  No
Sunny     Cool         Normal    False  Yes
Rain      Mild         Normal    False  Yes
Sunny     Mild         Normal    True   Yes
Overcast  Mild         High      True   Yes
Overcast  Hot          Normal    False  Yes
Rain      Mild         High      True   No

Weather forecast data example

Information Entropy

The formula for information entropy is:
$$\mathrm{Ent}(X) = -\sum_{i=1}^{n} p(x_i) \log_b p(x_i)$$
Denote the data set by D. Of its 14 samples, 5 are labeled "No" and 9 are labeled "Yes", so the information entropy of the original data set is:
$$\mathrm{Ent}(D) = -\frac{5}{14}\log_2\frac{5}{14} - \frac{9}{14}\log_2\frac{9}{14} \approx 0.940$$
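As a quick check of the formula, the entropy of the Play? column can be computed directly. This is a minimal sketch; the helper name `entropy` is an assumption made here for illustration, and the label list is copied from the table in row order.

```python
import math
from collections import Counter


def entropy(labels, base=2):
    """Information entropy of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Sum -p * log_b(p) over the classes that actually occur.
    return -sum((c / total) * math.log(c / total, base)
                for c in counts.values())


# The Play? column of the weather table, in row order.
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

ent_d = entropy(play)
print(f"{ent_d:.3f}")  # prints 0.940
```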

Information Gain

The formula for information gain is:
$$\mathrm{Gain}(A) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^{v}|}{|D|} \mathrm{Ent}(D^{v})$$
Here feature A takes V distinct values, and D^v is the subset of D in which A takes its v-th value.
Suppose we select the Outlook feature. D is then split into three subsets according to the value of this feature:
(D1|outlook=sunny) (D2|outlo
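The split by Outlook can be worked through in code. The sketch below is an illustration only: the names `ROWS`, `COLUMNS`, `entropy`, and `information_gain` are assumptions introduced here, and the rows are copied from the weather table above. It groups the Play? labels by feature value, weights each subset's entropy by its share of D, and subtracts the total from Ent(D).

```python
import math
from collections import Counter, defaultdict

COLUMNS = ["Outlook", "Temperature", "Humidity", "Windy", "Play"]
ROWS = [
    ("Sunny", "Hot", "High", False, "No"),
    ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Rain", "Mild", "High", False, "Yes"),
    ("Rain", "Cool", "Normal", False, "Yes"),
    ("Rain", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Rain", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),
    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),
    ("Rain", "Mild", "High", True, "No"),
]


def entropy(labels):
    """Base-2 information entropy of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, feature_index, label_index=-1):
    """Gain(A) = Ent(D) - sum over v of |D^v| / |D| * Ent(D^v)."""
    labels = [row[label_index] for row in rows]
    # Group the labels by the value the chosen feature takes in each row.
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[feature_index]].append(row[label_index])
    weighted = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


gain_outlook = information_gain(ROWS, COLUMNS.index("Outlook"))
print(f"{gain_outlook:.3f}")  # prints 0.247

# The same function applies to the other features.
for name in COLUMNS[1:-1]:
    print(name, round(information_gain(ROWS, COLUMNS.index(name)), 3))
```

In this table the Sunny and Rain subsets each hold 5 samples with a 3 to 2 label split, while the 4 Overcast samples are all "Yes" and so contribute zero entropy.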
