Information Gain and Feature Engineering
One of the key components of feature engineering is feature selection, an important data-preprocessing step. The process of selecting a subset of relevant features from a given feature set is called feature selection.
Feature selection consists of two parts: "subset search" and "subset evaluation". Simply put, the process generates a candidate subset, evaluates how good it is, generates the next candidate subset based on that evaluation, and evaluates it in turn. This continues until no better candidate subset can be found.
Subset search methods include forward search, backward search, bidirectional search, and so on.
A common subset-evaluation criterion is information gain. This article introduces this approach and gives a worked example.
Data
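As a sketch, forward search can be written as a greedy loop: start from the empty subset and, at each round, add the single feature that most improves the evaluation score, stopping when no addition helps. The `evaluate` function here is a stand-in for whatever subset-evaluation criterion is used (e.g. information gain); the scoring lambda in the demo is purely hypothetical.

```python
def forward_search(features, evaluate):
    """Greedy forward subset search: repeatedly add the feature that most
    improves evaluate(subset); stop when no addition improves the score."""
    selected, best_score = [], float("-inf")
    while True:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        # score every one-feature extension of the current subset
        score, feat = max((evaluate(selected + [f]), f) for f in candidates)
        if score <= best_score:
            break  # no candidate improves on the current subset
        selected.append(feat)
        best_score = score
    return selected

features = ["Outlook", "Temperature", "Humidity", "Windy"]
# hypothetical score: reward a target pair, small penalty per extra feature
score = lambda s: len(set(s) & {"Outlook", "Humidity"}) - 0.01 * len(s)
print(forward_search(features, score))  # → ['Outlook', 'Humidity']
```

Backward search works the same way in reverse, starting from the full set and greedily removing features.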
Outlook | Temperature | Humidity | Windy | Play?
--- | --- | --- | --- | ---
Sunny | Hot | High | False | No
Sunny | Hot | High | True | No
Overcast | Hot | High | False | Yes
Rain | Mild | High | False | Yes
Rain | Cool | Normal | False | Yes
Rain | Cool | Normal | True | No
Overcast | Cool | Normal | True | Yes
Sunny | Mild | High | False | No
Sunny | Cool | Normal | False | Yes
Rain | Mild | Normal | False | Yes
Sunny | Mild | Normal | True | Yes
Overcast | Mild | High | True | Yes
Overcast | Hot | Normal | False | Yes
Rain | Mild | High | True | No
Weather forecast data example
Information Entropy
The formula for information entropy is:
Ent(X) = -\sum_{i=1}^{n} p(x_i) \log_b p(x_i)
Define the data set as D. With 9 "Yes" and 5 "No" examples, the information entropy of the original data set is:

Ent(D) = -\frac{5}{14} \log_2 \frac{5}{14} - \frac{9}{14} \log_2 \frac{9}{14} \approx 0.940
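This calculation can be checked with a short sketch: count the class labels of the Play? column and plug the frequencies into the entropy formula.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Play? column of the 14-row weather data set: 9 Yes, 5 No
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
print(f"{entropy(play):.3f}")  # → 0.940
```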
Information Gain
Formula for information gain:
Gain(A) = Ent(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} Ent(D^v)
Suppose we select the Outlook feature. D is then divided into three subsets according to the values this feature takes:
(D1|outlook=sunny) (D2|outlo