The story starts with a math problem in elementary school.
"Daddy, Why is the Panda 3 not 11"
"Honey, you haven't learned the binary, okay?"
The above story is purely fictitious; the real dialogue went more like this:
"Dad, why 3:4 little"
"Baby, just a few minutes to know. You see, pigs have a few. 3, birds have 1,2,3,4. 4. Do you see a bird more than a pig? So 3:4 small "
Why do we all use decimal? Because mathematics is a language for describing the world, and a language only works if we share it: if you count in decimal while I count in binary, we can't communicate, can we?
Decision trees most often use information gain as the criterion for feature selection, where information gain is the entropy before a split minus the entropy after it.
Why use entropy, of all things? Don't ask; that's what the scientists settled on, so we just use it.
But why does a larger information gain mean that more disorder has been removed? Let's verify it with some actual calculations.
Let's warm up with a simple, crude example: classify three balls. At a glance, the red ball clearly stands alone and the two black balls form a group. But what is the information gain of each concrete split?
First, the entropy before any split (I did these calculations in Excel):

E(three balls) = -1/3 * log(1/3, 2) - 2/3 * log(2/3, 2) = 0.918
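As a sanity check, the same calculation can be sketched in a few lines of Python (the helper name `entropy` is mine, not from the post):

```python
from math import log2

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# One red ball, two black balls:
print(round(entropy([1, 2]), 3))  # 0.918
```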
The first split: the red ball and one black ball in one group, the other black ball in a second group. In the mixed group, red and black each appear with probability 1/2; in the single-ball group, black appears with probability 1 and red with probability 0. The entropy after the split is the average of the two groups' entropies, weighted by group size:

E(red,black | black) = 2/3 * E(red,black) + 1/3 * E(black) = 2/3 * (-1/2 * log(1/2, 2) - 1/2 * log(1/2, 2)) + 1/3 * (-1 * log(1, 2)) = 0.667

The second split: the red ball in a group by itself. Now the probability of a black ball in the red group is 0, and the probability of a red ball in the black group is 0; this split is already "pure". We can still compute the entropy:

E(red | black,black) = 1/3 * E(red) + 2/3 * E(black,black) = 1/3 * (-1 * log(1, 2)) + 2/3 * (-1 * log(1, 2)) = 0

So

Information gain of the mixed split: G(red,black | black) = E(three balls) - E(red,black | black) = 0.918 - 0.667 = 0.252
Information gain of the clean split: G(red | black,black) = E(three balls) - E(red | black,black) = 0.918 - 0 = 0.918

The red | black,black split has the larger gain, so it is the better split.

Note: each group's entropy is weighted by the fraction of the samples it holds, and any 0 * log(0, 2) term is taken to be 0 (the limit of p * log2(p) as p approaches 0), even though log(0) itself is undefined.
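Both gains can be reproduced with a short Python sketch using the standard ID3 definition, in which each group's entropy is weighted by its share of the samples (so the gain is never negative); the helper names `entropy` and `info_gain` are my own:

```python
from math import log2

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, groups):
    """Parent entropy minus the size-weighted average entropy of the groups."""
    n = sum(parent_counts)
    after = sum(sum(g) / n * entropy(g) for g in groups)
    return entropy(parent_counts) - after

parent = [1, 2]           # [red, black] counts
mixed = [[1, 1], [0, 1]]  # {red, black} | {black}
pure = [[1, 0], [0, 2]]   # {red} | {black, black}
print(round(info_gain(parent, mixed), 3))  # 0.252
print(round(info_gain(parent, pure), 3))   # 0.918
```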
Warm-up over; now for something a little more complicated.
Throughout what follows, keep your eye on the two colors, red and blue: color is our classification target.
E(before split) = -4/6 * log(4/6, 2) - 2/6 * log(2/6, 2) = 0.918
If we split by shape, the result is this (again, only red and blue matter):

E(N1) = 0
E(N2) = -2/3 * log(2/3, 2) - 1/3 * log(1/3, 2) = 0.918

N1 holds 3 of the 6 figures and N2 holds the other 3, so:

Information gain G(shape) = 0.918 - (3/6 * 0 + 3/6 * 0.918) = 0.459

If we split by facial expression (again, only red and blue matter):

E(N1) = 0
E(N2) = -1/2 * log(1/2, 2) - 1/2 * log(1/2, 2) = 1

N1 holds 2 figures and N2 holds 4, so:

Information gain G(expression) = 0.918 - (2/6 * 0 + 4/6 * 1) = 0.252

If we split by line thickness, the result is this (again, only red and blue matter):

E(N1) = -1/2 * log(1/2, 2) - 1/2 * log(1/2, 2) = 1
E(N2) = -1/4 * log(1/4, 2) - 3/4 * log(3/4, 2) = 0.811

N1 holds 2 figures and N2 holds 4, so:

Information gain G(thickness) = 0.918 - (2/6 * 1 + 4/6 * 0.811) = 0.044

Shape yields the largest gain, so splitting by shape is preferred.
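The whole comparison can be put into code, using the size-weighted (ID3) definition of information gain; the [red, blue] counts per node are my reading of the entropies above, and the function names are my own:

```python
from math import log2

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(parent, groups):
    """Parent entropy minus the size-weighted average entropy of the groups."""
    n = sum(parent)
    return entropy(parent) - sum(sum(g) / n * entropy(g) for g in groups)

# [red, blue] counts per node, as implied by the entropies above
parent = [4, 2]
splits = {
    "shape":      [[3, 0], [1, 2]],
    "expression": [[2, 0], [2, 2]],
    "thickness":  [[1, 1], [3, 1]],
}
for name, groups in splits.items():
    print(name, round(info_gain(parent, groups), 3))
best = max(splits, key=lambda k: info_gain(parent, splits[k]))
print("best split:", best)  # shape
```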
You can also see that information gain favors splits that drive each node's entropy toward 0. In the extreme, a split that isolates every sample in its own node achieves entropy 0 everywhere, yet that is not necessarily the best choice.
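As a hypothetical illustration of that bias (my own example, not from the post): split the six figures on a unique ID, so every figure lands in its own node. Each node is trivially pure, so the gain equals the entire parent entropy, even though such a split tells us nothing about new samples:

```python
from math import log2

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(parent, groups):
    """Parent entropy minus the size-weighted average entropy of the groups."""
    n = sum(parent)
    return entropy(parent) - sum(sum(g) / n * entropy(g) for g in groups)

parent = [4, 2]  # [red, blue]
# Splitting on a unique ID: six singleton nodes, each trivially pure.
id_split = [[1, 0]] * 4 + [[0, 1]] * 2
print(round(info_gain(parent, id_split), 3))  # equals the parent entropy, 0.918
```

This bias is one reason C4.5 replaced raw information gain with the gain ratio, which penalizes splits that scatter the data into many small branches.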
After all, decimal doesn't fit every field either, right?
^_^
Decision trees: calculating entropy and information gain