Entropy and Gini index in decision tree

A decision tree is a very basic classification and regression method, but as noted in the previous post on learning-to-rank (RankNet to LambdaRank to LambdaMART), this basic algorithm is the foundation of many classic, complex, and efficient machine learning algorithms. There are already plenty of posts online explaining what a decision tree is, so this article does not revisit that topic. Instead, it discusses two important splitting criteria used in decision trees: entropy and the Gini index. Both are measures of the uncertainty of a random variable, so we first describe what the uncertainty of a random variable means.

1. Uncertainty of random variables

What is the uncertainty of a random variable? Consider an example: a class has 50 students, and each student owns exactly one smartphone. Question: if we pick a student at random, which brand of phone does he or she use? If every student in the class uses an Apple phone, the question is easy to answer; the random variable "the phone brand of a randomly chosen student from this class" is completely determined, and its uncertainty is 0. But if $\frac{1}{3}$ of the students use Xiaomi phones, $\frac{1}{3}$ use Apple phones, and the remaining $\frac{1}{3}$ use Huawei phones, the uncertainty of this variable is clearly higher. A further question is: under what circumstances is the uncertainty of the variable largest? Intuitively, if every student uses a different brand of phone, the uncertainty should be largest. Understanding the uncertainty of a random variable makes it easier to understand the definitions of entropy and the Gini index given below.

2. Definition of entropy and related proofs

In information theory and statistics, entropy is a fundamental concept that measures the uncertainty of a random variable. Let $X$ be a discrete random variable taking finitely many values, with probability distribution:

$$P(X=x_i)=p_i,\quad i=1,2,\dots,n$$

The entropy of the random variable $X$ is defined as:

$$H(X)=-\sum_{i=1}^{n}p_i\log{p_i}$$

For this to make sense, we define $0\log{0}=0$. Since the definition of entropy depends only on the distribution of $X$ and not on the values $X$ takes, we can also regard entropy as a function of the distribution:

$$H(p)=-\sum_{i=1}^{n}p_i\log{p_i}$$
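As a quick concrete illustration of this definition (a minimal sketch in Python; the `entropy` helper and the phone-brand example distributions are just illustrations, and the natural logarithm is used):

```python
import math

def entropy(p):
    """H(p) = -sum(p_i * log(p_i)), with 0*log(0) treated as 0 (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Every student uses an Apple phone: the outcome is certain, so the entropy is 0.
print(entropy([1.0]))
# One third Xiaomi, one third Apple, one third Huawei: entropy is log(3) ~ 1.0986.
print(entropy([1/3, 1/3, 1/3]))
```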

It was said above that the uniform distribution has the largest entropy, but that was only an intuition, not a proof. We now prove it with the method of Lagrange multipliers. Under the constraint $\sum_{i=1}^{n}p_i=1$, the Lagrangian of $H(p)$ can be written as:

$$L(p,\lambda)=-\sum_{i=1}^{n}p_i\log{p_i}+\lambda\left(\sum_{i=1}^{n}p_i-1\right)$$

Taking the partial derivative of $L(p,\lambda)$ with respect to each $p_i$ and setting it to zero gives:

$$\frac{\partial{L(p,\lambda)}}{\partial{p_i}}=-\ln{p_i}-1+\lambda=0,\quad i=1,2,\dots,n$$

From $-\ln{p_i}-1+\lambda=0$ we get $p_i=e^{\lambda-1}$, $i=1,2,\dots,n$.

So each $p_i$ depends only on the constant $\lambda$, which means all the $p_i$ must be equal; combined with the constraint $\sum_{i=1}^{n}p_i=1$ this gives $p_1=p_2=\dots=p_n=\frac{1}{n}$, at which point $H(p)$ attains its maximum value $\log{n}$. The range of entropy is therefore $[0,\log{n}]$.
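A quick numerical check of this result (a sketch; the use of NumPy and Dirichlet sampling to generate random distributions is my own choice, not part of the proof):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]                      # convention 0*log(0) = 0
    return -np.sum(p * np.log(p))     # natural log, so the maximum is log(n)

n = 5
print(entropy(np.full(n, 1 / n)), np.log(n))   # both ~1.6094: uniform attains log(n)

# Randomly drawn distributions over n outcomes never exceed log(n).
rng = np.random.default_rng(0)
samples = rng.dirichlet(np.ones(n), size=10_000)
print(max(entropy(p) for p in samples) <= np.log(n))   # True
```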

3. Definition of the Gini index and relevant proof

The Gini index is the criterion used by the classic CART decision tree to select the optimal splitting feature in classification problems. Suppose there are $K$ classes and the probability that a sample point belongs to class $k$ is $p_k$; the Gini index of this probability distribution is defined as:

$$G(p)=\sum_{k=1}^{K}p_k(1-p_k)=1-\sum_{k=1}^{K}p_k^2$$

where the probabilities satisfy the constraint $\sum_{k=1}^{K}p_k=1$.
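A minimal sketch of this definition in code (the `gini` helper and the example distributions are illustrative only):

```python
def gini(p):
    """G(p) = 1 - sum(p_k^2) for a probability distribution p."""
    return 1 - sum(pk ** 2 for pk in p)

print(gini([1.0]))               # 0.0: all mass on one class, no uncertainty
print(gini([0.5, 0.3, 0.2]))     # ~0.62: some uncertainty
print(gini([1/3, 1/3, 1/3]))     # ~0.667: three equally likely classes
```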

As mentioned above, the Gini index also describes the degree of uncertainty of a random variable, so it is natural to conjecture that $G(p)$ attains its maximum when $p_1=p_2=\dots=p_K=\frac{1}{K}$, i.e., when the random variable is most uncertain. How can this be proved? Two methods are given below.

Method 1: The Lagrange multiplier method can be used again. Under the constraint $\sum_{k=1}^{K}p_k=1$, the Lagrangian of $G(p)$ is:

$$L(p,\lambda)=1-\sum_{k=1}^{K}p_k^2+\lambda\left(\sum_{k=1}^{K}p_k-1\right)$$

Taking the partial derivative of $L(p,\lambda)$ with respect to each $p_i$ and setting it to zero gives:

$$\frac{\partial{L(p,\lambda)}}{\partial{p_i}}=-2p_i+\lambda=0,\quad i=1,2,\dots,K$$

From $-2p_i+\lambda=0$ we get $p_i=\frac{\lambda}{2}$, so each $p_i$ again depends only on the constant $\lambda$; combined with the constraint this gives $p_1=p_2=\dots=p_K=\frac{1}{K}$.

The range of $G(p)$ is therefore $[0,1-\frac{1}{K}]$.
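A numerical check consistent with this range (a sketch; again the Dirichlet sampling is only my way of generating random distributions):

```python
import numpy as np

def gini(p):
    return 1 - np.sum(p ** 2)

K = 4
print(gini(np.full(K, 1 / K)), 1 - 1 / K)   # both 0.75: uniform attains 1 - 1/K

# Randomly drawn distributions over K classes never exceed 1 - 1/K.
rng = np.random.default_rng(0)
samples = rng.dirichlet(np.ones(K), size=10_000)
print(max(gini(p) for p in samples) <= 1 - 1 / K)   # True
```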

Method 2: Construct two vectors in $K$-dimensional space, $P_1=[p_1,p_2,\dots,p_K]^T$ and $P_2=[\frac{1}{K},\frac{1}{K},\dots,\frac{1}{K}]^T$, and let $\theta$ be the angle between them. Then:

$$\cos\theta=\frac{P_1\cdot P_2}{|P_1|\,|P_2|}=\frac{[p_1,p_2,\dots,p_K]\cdot[\frac{1}{K},\frac{1}{K},\dots,\frac{1}{K}]}{\sqrt{p_1^2+p_2^2+\dots+p_K^2}\cdot\sqrt{\frac{1}{K^2}+\frac{1}{K^2}+\dots+\frac{1}{K^2}}}\leq 1$$

Since $\cos\theta\leq 1$ is equivalent to $P_1\cdot P_2\leq |P_1|\,|P_2|$, i.e. $\frac{1}{K}\sum_{k=1}^{K}p_k\leq\sqrt{\sum_{k=1}^{K}p_k^2}\cdot\frac{1}{\sqrt{K}}$, squaring both sides and rearranging gives:

$$\sum_{k=1}^{K}p_k^2\geq\frac{\left(\sum_{k=1}^{K}p_k\right)^2}{K}$$

So:

$$G(p)\leq 1-\frac{\left(\sum_{k=1}^{K}p_k\right)^2}{K}=1-\frac{1}{K}$$

Equality holds when $p_1=p_2=\dots=p_K=\frac{1}{K}$.
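To see the two measures side by side (a small illustrative sketch of my own, not part of the derivation): both entropy and the Gini index are 0 for a deterministic distribution, grow as the distribution spreads out, and peak at the uniform distribution.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def gini(p):
    return 1 - np.sum(p ** 2)

for p in [np.array([1.0, 0.0, 0.0]),      # deterministic: both measures are 0
          np.array([0.5, 0.3, 0.2]),      # intermediate uncertainty
          np.array([1/3, 1/3, 1/3])]:     # uniform: both measures are maximal
    print(p, "entropy =", round(entropy(p), 3), "Gini =", round(gini(p), 3))
```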
