Entropy and Gini index in decision tree

A decision tree is a very basic classification and regression method, but as noted in the previous post on learning-to-rank (RankNet to LambdaRank to LambdaMART), this basic algorithm is the foundation of many classic, complex, and efficient machine learning algorithms. There are already plenty of posts online explaining what a decision tree is, so this article does not revisit that topic. Instead, it discusses two important splitting criteria used in decision trees: entropy and the Gini index. Both are measures of the uncertainty of a random variable, so we first describe what the uncertainty of a random variable means.

1. Uncertainty of random variables

What is the uncertainty of a random variable? Consider an example: a class has 50 students, and each student owns exactly one smartphone. Question: if we pick a student at random, which brand of phone does he or she use? If every student in the class uses an Apple phone, the question is easy to answer; the random variable "the phone brand of a randomly chosen student from this class" is completely determined, and its uncertainty is 0. But if $\frac{1}{3}$ of the students use Xiaomi phones, $\frac{1}{3}$ use Apple phones, and the remaining $\frac{1}{3}$ use Huawei phones, the uncertainty of this variable is clearly higher. A further question is: under what circumstances is the uncertainty of the variable largest? Intuitively, if every student uses a different brand of phone, the uncertainty should be largest. Understanding the uncertainty of a random variable makes it easier to understand the definitions of entropy and the Gini index given below.

2. Definition of entropy and related proofs

In information theory and statistics, entropy is a fundamental concept that measures the uncertainty of a random variable. Let $X$ be a discrete random variable taking finitely many values, with probability distribution:

$$P(X=x_i)=p_i,\quad i=1,2,\dots,n$$

The entropy of the random variable $X$ is defined as:

$$H(X)=-\sum_{i=1}^{n}p_i\log{p_i}$$

For this to make sense, we define $0\log{0}=0$. Since the definition of entropy depends only on the distribution of $X$ and not on the values $X$ takes, we can also regard entropy as a function of the distribution:

$$H(p)=-\sum_{i=1}^{n}p_i\log{p_i}$$
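As a quick concrete illustration of this definition (a minimal sketch in Python; the `entropy` helper and the phone-brand example distributions are just illustrations, and the natural logarithm is used):

```python
import math

def entropy(p):
    """H(p) = -sum(p_i * log(p_i)), with 0*log(0) treated as 0 (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Every student uses an Apple phone: the outcome is certain, so the entropy is 0.
print(entropy([1.0]))
# One third Xiaomi, one third Apple, one third Huawei: entropy is log(3) ~ 1.0986.
print(entropy([1/3, 1/3, 1/3]))
```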

It was said above that the uniform distribution has the largest entropy, but that was only an intuition, not a proof. We now prove it with the method of Lagrange multipliers. Under the constraint $\sum_{i=1}^{n}p_i=1$, the Lagrangian of $H(p)$ can be written as:

$$L(p,\lambda)=-\sum_{i=1}^{n}p_i\log{p_i}+\lambda\left(\sum_{i=1}^{n}p_i-1\right)$$

Taking the partial derivative of $L(p,\lambda)$ with respect to each $p_i$ and setting it to zero gives:

$$\frac{\partial{L(p,\lambda)}}{\partial{p_i}}=-\ln{p_i}-1+\lambda=0,\quad i=1,2,\dots,n$$

From $-\ln{p_i}-1+\lambda=0$ we get $p_i=e^{\lambda-1}$, $i=1,2,\dots,n$.

So each $p_i$ depends only on the constant $\lambda$, which means all the $p_i$ must be equal; combined with the constraint $\sum_{i=1}^{n}p_i=1$ this gives $p_1=p_2=\dots=p_n=\frac{1}{n}$, at which point $H(p)$ attains its maximum value $\log{n}$. The range of entropy is therefore $[0,\log{n}]$.
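A quick numerical check of this result (a sketch; the use of NumPy and Dirichlet sampling to generate random distributions is my own choice, not part of the proof):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]                      # convention 0*log(0) = 0
    return -np.sum(p * np.log(p))     # natural log, so the maximum is log(n)

n = 5
print(entropy(np.full(n, 1 / n)), np.log(n))   # both ~1.6094: uniform attains log(n)

# Randomly drawn distributions over n outcomes never exceed log(n).
rng = np.random.default_rng(0)
samples = rng.dirichlet(np.ones(n), size=10_000)
print(max(entropy(p) for p in samples) <= np.log(n))   # True
```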

3. Definition of the Gini index and relevant proof

The Gini index is the criterion used by the classic CART decision tree to select the optimal splitting feature in classification problems. Suppose there are $K$ classes and the probability that a sample point belongs to class $k$ is $p_k$; the Gini index of this probability distribution is defined as:

$$G(p)=\sum_{k=1}^{K}p_k(1-p_k)=1-\sum_{k=1}^{K}p_k^2$$

where the probabilities satisfy the constraint $\sum_{k=1}^{K}p_k=1$.
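A minimal sketch of this definition in code (the `gini` helper and the example distributions are illustrative only):

```python
def gini(p):
    """G(p) = 1 - sum(p_k^2) for a probability distribution p."""
    return 1 - sum(pk ** 2 for pk in p)

print(gini([1.0]))               # 0.0: all mass on one class, no uncertainty
print(gini([0.5, 0.3, 0.2]))     # ~0.62: some uncertainty
print(gini([1/3, 1/3, 1/3]))     # ~0.667: three equally likely classes
```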

As mentioned above, the Gini index also describes the degree of uncertainty of a random variable, so it is natural to conjecture that $G(p)$ attains its maximum when $p_1=p_2=\dots=p_K=\frac{1}{K}$, i.e., when the random variable is most uncertain. How can this be proved? Two methods are given below.

Method 1: The Lagrange multiplier method can be used again. Under the constraint $\sum_{k=1}^{K}p_k=1$, the Lagrangian of $G(p)$ is:

$$L(p,\lambda)=1-\sum_{k=1}^{K}p_k^2+\lambda\left(\sum_{k=1}^{K}p_k-1\right)$$

Taking the partial derivative of $L(p,\lambda)$ with respect to each $p_i$ and setting it to zero gives:

$$\frac{\partial{L(p,\lambda)}}{\partial{p_i}}=-2p_i+\lambda=0,\quad i=1,2,\dots,K$$

From $-2p_i+\lambda=0$ we get $p_i=\frac{\lambda}{2}$, so each $p_i$ again depends only on the constant $\lambda$; combined with the constraint this gives $p_1=p_2=\dots=p_K=\frac{1}{K}$.

The range of $G(p)$ is therefore $[0,1-\frac{1}{K}]$.
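A numerical check consistent with this range (a sketch; again the Dirichlet sampling is only my way of generating random distributions):

```python
import numpy as np

def gini(p):
    return 1 - np.sum(p ** 2)

K = 4
print(gini(np.full(K, 1 / K)), 1 - 1 / K)   # both 0.75: uniform attains 1 - 1/K

# Randomly drawn distributions over K classes never exceed 1 - 1/K.
rng = np.random.default_rng(0)
samples = rng.dirichlet(np.ones(K), size=10_000)
print(max(gini(p) for p in samples) <= 1 - 1 / K)   # True
```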

Method 2: Construct two vectors in $K$-dimensional space, $P_1=[p_1,p_2,\dots,p_K]^T$ and $P_2=[\frac{1}{K},\frac{1}{K},\dots,\frac{1}{K}]^T$, and let $\theta$ be the angle between them. Then:

$$\cos\theta=\frac{P_1\cdot P_2}{|P_1|\,|P_2|}=\frac{[p_1,p_2,\dots,p_K]\cdot[\frac{1}{K},\frac{1}{K},\dots,\frac{1}{K}]}{\sqrt{p_1^2+p_2^2+\dots+p_K^2}\cdot\sqrt{\frac{1}{K^2}+\frac{1}{K^2}+\dots+\frac{1}{K^2}}}\leq 1$$

Since $\cos\theta\leq 1$ is equivalent to $P_1\cdot P_2\leq |P_1|\,|P_2|$, i.e. $\frac{1}{K}\sum_{k=1}^{K}p_k\leq\sqrt{\sum_{k=1}^{K}p_k^2}\cdot\frac{1}{\sqrt{K}}$, squaring both sides and rearranging gives:

$$\sum_{k=1}^{K}p_k^2\geq\frac{\left(\sum_{k=1}^{K}p_k\right)^2}{K}$$

So:

$$G(p)\leq 1-\frac{\left(\sum_{k=1}^{K}p_k\right)^2}{K}=1-\frac{1}{K}$$

Equality holds when $p_1=p_2=\dots=p_K=\frac{1}{K}$.
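To see the two measures side by side (a small illustrative sketch of my own, not part of the derivation): both entropy and the Gini index are 0 for a deterministic distribution, grow as the distribution spreads out, and peak at the uniform distribution.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def gini(p):
    return 1 - np.sum(p ** 2)

for p in [np.array([1.0, 0.0, 0.0]),      # deterministic: both measures are 0
          np.array([0.5, 0.3, 0.2]),      # intermediate uncertainty
          np.array([1/3, 1/3, 1/3])]:     # uniform: both measures are maximal
    print(p, "entropy =", round(entropy(p), 3), "Gini =", round(gini(p), 3))
```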
