Information Theory: Basic Knowledge

How much information do we receive when we observe the value of a discrete random variable X? The amount of information can be viewed as the degree of surprise on learning the value of X: being told that a highly improbable event has occurred conveys more information than being told that a very likely event has occurred. The amount of information therefore depends on the probability distribution p(x), so we look for a function h(x) of p(x) that models the information content. What functional form is suitable?
Suppose we observe two events x and y that are statistically independent. The information gained from observing both should be the sum of the information gained from each of them separately, that is,

$$h(x, y) = h(x) + h(y)$$

For two independent events x and y, the probabilities are related by

$$p(x, y) = p(x)\,p(y)$$

From these two relationships it follows that h(x) must be given by the logarithm of p(x).
We therefore have

$$h(x) = -\log_2 p(x)$$
The negative sign ensures that the information content is non-negative. Note that a low-probability event carries a high information content. The choice of base for the logarithm is arbitrary; information theory mostly uses base 2, in which case h(x) is measured in bits, the number of binary digits required to transmit the value.
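As a minimal sketch of this definition (the function name and example probabilities are illustrative, not from the original text), the information content of a single observation can be computed directly:

```python
import math

def info_content(p, base=2):
    """Self-information h(x) = -log p(x): rarer events are more surprising."""
    return -math.log(p, base)

print(info_content(0.5))   # 1.0 bit: a fair coin flip
print(info_content(0.01))  # ~6.64 bits: an unlikely event is far more informative
```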
Now suppose a sender wishes to transmit the value of this random variable to a receiver. The average amount of information transmitted is the expectation of h(x) with respect to the distribution p(x):

$$H[x] = -\sum_x p(x) \log_2 p(x)$$

This expression is called the entropy of the random variable x.

In machine learning, the natural logarithm is frequently used instead, in which case the entropy is measured in nats rather than bits:

$$H[x] = -\sum_x p(x) \ln p(x)$$

Whenever we encounter a value x for which p(x) = 0, we take p(x) ln p(x) = 0, since p ln p → 0 as p → 0; such values contribute nothing to the entropy.
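The entropy sum is straightforward to compute; here is a minimal sketch (the function name and example distributions are assumed for illustration) that handles the p(x) = 0 convention explicitly:

```python
import math

def entropy(probs, base=2):
    """Entropy H[x] = -sum_x p(x) log p(x) of a discrete distribution."""
    h = 0.0
    for p in probs:
        if p > 0:  # terms with p(x) = 0 contribute nothing, since p log p -> 0
            h -= p * math.log(p, base)
    return h

print(entropy([0.5, 0.5]))  # 1.0 bit: a fair coin is maximally uncertain
print(entropy([0.9, 0.1]))  # ~0.469 bits: a biased coin carries less information
print(entropy([1.0, 0.0]))  # 0.0 bits: a certain outcome carries none
```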

If the values of the variable are encoded and transmitted, we would like to assign short codes to high-probability values and longer codes to low-probability values, so that the average code length is minimized. The entropy is a lower bound on the average number of bits needed to transmit the value of the random variable; for the relationship between entropy and this minimum code length, see Shannon's noiseless coding theorem.
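To make the bound concrete, here is a small illustration (the symbols, probabilities, and code are assumed for the example, not taken from the original): when the probabilities are negative powers of 2, a prefix code with codeword lengths -log2 p(x) attains the entropy exactly:

```python
import math

# Four symbols whose probabilities are negative powers of two.
probs = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}
# A prefix code giving each symbol a codeword of length -log2 p(x).
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

H = -sum(p * math.log2(p) for p in probs.values())
avg_len = sum(probs[s] * len(code[s]) for s in probs)

print(H)        # 1.75 bits: the entropy
print(avg_len)  # 1.75 bits: this code meets the entropy lower bound exactly
```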

Entropy describes the average amount of information needed to specify the state of a random variable. To find the discrete distribution that maximizes the entropy, we introduce a Lagrange multiplier to enforce the normalization constraint on the probabilities:

$$\tilde{H} = -\sum_i p(x_i) \ln p(x_i) + \lambda\left(\sum_i p(x_i) - 1\right)$$

Maximizing gives p(x_i) = 1/M for all i, where M is the number of states of x, and the corresponding maximum entropy is H = ln M.
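A quick numerical check of this result (an illustrative sketch; the distributions are randomly generated, not from the text): no distribution over M states exceeds the entropy ln M of the uniform distribution:

```python
import math
import random

def entropy_nats(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

M = 4
print(entropy_nats([1.0 / M] * M))  # ln(4) ~ 1.386: the uniform distribution
print(math.log(M))

# Randomly drawn distributions over M states never exceed ln(M).
for _ in range(1000):
    weights = [random.random() for _ in range(M)]
    p = [w / sum(weights) for w in weights]
    assert entropy_nats(p) <= math.log(M) + 1e-12
```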

Now suppose we have a joint distribution p(x, y). If the value of x is already known, the additional information needed to specify the corresponding value of y is -ln p(y|x), so the average additional information needed to specify y is

$$H[y|x] = -\sum_x \sum_y p(x, y) \ln p(y|x)$$

which is called the conditional entropy of y given x. Using the product rule p(x, y) = p(y|x) p(x), we obtain

$$H[x, y] = H[y|x] + H[x]$$

that is, the information needed to describe x and y together is the information needed to describe x alone plus the additional information needed to specify y given x.
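A small numerical check of this chain rule (the joint distribution here is assumed for illustration, not taken from the text):

```python
import math

# An illustrative 2x2 joint distribution p(x, y).
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

# Marginal p(x).
px = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}

H_xy = -sum(p * math.log(p) for p in joint.values())
H_x = -sum(p * math.log(p) for p in px.values())
# H[y|x] = -sum_{x,y} p(x, y) ln p(y|x), with p(y|x) = p(x, y) / p(x).
H_y_given_x = -sum(p * math.log(p / px[x]) for (x, _), p in joint.items())

print(abs(H_xy - (H_y_given_x + H_x)) < 1e-12)  # True: H[x,y] = H[y|x] + H[x]
```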


Relative entropy and mutual information: consider some unknown distribution p(x), and suppose we have modeled it using an approximating distribution q(x). If we use q(x) to construct a coding scheme for transmitting values of x, then the average additional information (in nats) required to specify a value of x is

$$\mathrm{KL}(p \,\|\, q) = -\sum_x p(x) \ln q(x) - \left(-\sum_x p(x) \ln p(x)\right) = -\sum_x p(x) \ln \frac{q(x)}{p(x)}$$

This quantity is called the relative entropy, or Kullback-Leibler divergence, between the distributions p(x) and q(x), and it measures how different the two distributions are. Note that it is not symmetric,

$$\mathrm{KL}(p \,\|\, q) \neq \mathrm{KL}(q \,\|\, p)$$

and that KL(p‖q) ≥ 0, with equality if and only if p(x) = q(x).
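A direct implementation of the divergence (the function name and example distributions are assumed for illustration) showing both non-negativity and asymmetry:

```python
import math

def kl_divergence(p, q):
    """KL(p||q) = sum_x p(x) ln(p(x)/q(x)), in nats.

    Assumes q(x) > 0 wherever p(x) > 0; terms with p(x) = 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print(kl_divergence(p, q))  # ~0.0253: always >= 0
print(kl_divergence(q, p))  # ~0.0258: differs -- KL is not symmetric
print(kl_divergence(p, p))  # 0.0: equality holds iff p = q
```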


Now consider the joint distribution p(x, y) of two variables. If x and y are independent, then p(x, y) = p(x) p(y). If they are not independent, we can gauge how strongly they are related by measuring, with the KL divergence, how far the joint distribution is from the product of the marginals:

$$I[x, y] = \mathrm{KL}\big(p(x, y) \,\|\, p(x)\,p(y)\big) = -\sum_x \sum_y p(x, y) \ln \frac{p(x)\,p(y)}{p(x, y)}$$

This expression is called the mutual information of the variables x and y. From the properties of the KL divergence we know that I[x, y] ≥ 0, with equality if and only if x and y are independent. Using the sum and product rules of probability, the mutual information can be related to the conditional entropy:

$$I[x, y] = H[x] - H[x|y] = H[y] - H[y|x]$$

so the mutual information is the reduction in uncertainty about x obtained by learning the value of y (and vice versa).
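Finally, a numerical check of these identities (the joint distribution is assumed for illustration, as before):

```python
import math

# Illustrative 2x2 joint distribution p(x, y).
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

px = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}
py = {y: sum(p for (_, yi), p in joint.items() if yi == y) for y in (0, 1)}

# I[x, y] = sum_{x,y} p(x, y) ln( p(x, y) / (p(x) p(y)) )
I = sum(p * math.log(p / (px[x] * py[y])) for (x, y), p in joint.items())

H_x = -sum(p * math.log(p) for p in px.values())
# H[x|y] = -sum_{x,y} p(x, y) ln p(x|y), with p(x|y) = p(x, y) / p(y).
H_x_given_y = -sum(p * math.log(p / py[y]) for (_, y), p in joint.items())

print(I >= 0)                                # True: mutual information is non-negative
print(abs(I - (H_x - H_x_given_y)) < 1e-12)  # True: I[x,y] = H[x] - H[x|y]
```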
