Decision Tree: Entropy and the Calculation of Information Gain

Source: Internet
Author: User

The story starts with an elementary-school math problem.

"Dad, why is 3 not 11?"

"Honey, you haven't learned binary yet."

The story above is purely fictitious; the real dialogue went like this:

"Dad, why is 3 smaller than 4?"

"Baby, just count. Look, how many pigs are there? 3. And the birds: 1, 2, 3, 4. That's 4. See how there are more birds than pigs? So 3 is smaller than 4."

Why do we use decimal at all? Because decimal is a shared language for describing the world mathematically so that we can communicate. If you used decimal and I used binary, we couldn't communicate, could we?

Decision trees commonly use information gain as the criterion for feature selection, and information gain is simply the entropy before a split minus the entropy after it.

Why use entropy? Don't ask; the scientists chose it, and we just use it.

Why does a larger information gain mean a less chaotic split? Let's verify with a calculation.

Let's start with a simple, rough one:

We want to split three balls into groups. At a glance, the single red ball should go by itself and the black balls should form a group. But what does the information gain of each possible split say?

First, the entropy before splitting, calculated in Excel (LOG(x, 2) is log base 2):

E(three balls) = -1/3 * log(1/3, 2) - 2/3 * log(2/3, 2) ≈ 0.918
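This can be checked in a few lines of Python (a minimal sketch; the helper `entropy`, taking a list of class counts, is my own naming):

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    # Convention: 0 * log2(0) = 0, so classes with zero count are skipped.
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

print(entropy([1, 2]))  # one red ball, two black balls -> about 0.918
```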

The first split: one group containing the red ball and one black ball, and another group with the remaining black ball. In the mixed group, red and black each appear with probability 1/2. In the other group, black appears with probability 1 and red with probability 0.

So E(red,black | black) = E(red,black) + E(black) = -1/2 * log(1/2, 2) - 1/2 * log(1/2, 2) - 1 * log(1, 2) = 1

The second split puts the red ball in a group by itself: the probability of black in the red group is 0, and the probability of red in the black group is 0. This split is already "pure". We can still compute the entropy:

E(red | black,black) = E(red) + E(black,black) = -1 * log(1, 2) - 1 * log(1, 2) = 0

So

Mixed split: G(red,black | black) = E(three balls) - E(red,black | black) = 0.918 - 1 = -0.082

Separate split: G(red | black,black) = E(three balls) - E(red | black,black) = 0.918 - 0 = 0.918
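Both gains can be reproduced with a small Python helper (a sketch; `entropy` takes a list of class counts). One caveat worth hedging: this post sums the child entropies directly, whereas standard ID3 information gain weights each child's entropy by its fraction of the samples. That changes the numbers but not the ranking of the two splits here.

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

before = entropy([1, 2])  # one red, two black: about 0.918
# Split 1: {red, black} and {black}; child entropies summed unweighted, as in the text
gain_mixed = before - (entropy([1, 1]) + entropy([1]))
# Split 2: {red} and {black, black}
gain_pure = before - (entropy([1]) + entropy([2]))
print(round(gain_mixed, 3), round(gain_pure, 3))  # -0.082 0.918
```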

Therefore, the split with the larger gain, red alone versus the black group, is the better one.

Note: we take 0 * log(0, 2) = 0 by convention. log₂(0) itself is undefined, but x * log₂(x) → 0 as x → 0, so zero-probability classes contribute nothing.

Warm-up over; now for something a little more complicated:

Before starting, keep your eye on the two colors, red and blue: color is our classification target.

E(before split) = -4/6 * log(4/6, 2) - 2/6 * log(2/6, 2) ≈ 0.918

If we split by shape, the result is this (again, focus only on red vs. blue):

E (N1) = 0

E(N2) = -2/3 * log(2/3, 2) - 1/3 * log(1/3, 2) ≈ 0.918

Information gain G(shape) = 0.918 - 0.918 - 0 = 0

If we split by expression, the result is this (again, focus only on red vs. blue):

E (N1) = 0

E(N2) = -1/2 * log(1/2, 2) - 1/2 * log(1/2, 2) = 1

Information gain G(expression) = 0.918 - 1 - 0 = -0.082

If we split by line thickness, the result is this (again, focus only on red vs. blue):

E(N1) = -1/2 * log(1/2, 2) - 1/2 * log(1/2, 2) = 1

E(N2) = -1/4 * log(1/4, 2) - 3/4 * log(3/4, 2) ≈ 0.811

Information gain G(thickness) = 0.918 - 1 - 0.811 = -0.893

Therefore, the preferred split is by shape.
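All three candidates can be compared programmatically. A sketch, with two stated assumptions: the per-node class counts below are inferred from the entropies given above (the original figure is not reproduced here), and child entropies are summed unweighted, following the text's convention rather than standard ID3 weighting:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

before = entropy([4, 2])  # four of one color, two of the other

# Each split: before-split entropy minus the (unweighted) sum of child entropies.
gains = {
    "shape":      before - (entropy([3]) + entropy([2, 1])),      # 0
    "expression": before - (entropy([2]) + entropy([2, 2])),      # -0.082
    "thickness":  before - (entropy([1, 1]) + entropy([1, 3])),   # -0.893
}
print(max(gains, key=gains.get))  # shape
```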

It also shows that information gain tends to favor splits that produce zero-entropy (pure) nodes, but in extreme cases that is not necessarily the best choice.

Decimal doesn't suit every field either, right?

^_^

