Information Theory: Basic Knowledge

How much information do we receive when we observe the value of a discrete random variable X? The amount of information can be viewed as the degree of surprise on learning the value of X: being told that a highly improbable event has occurred conveys more information than being told that a very likely event has occurred. The amount of information therefore depends on the probability distribution p(x), so we look for a function h(x) of p(x) that models the information content. What functional form is suitable?
Suppose we observe two events x and y that are statistically independent. The information gained from observing both should be the sum of the information gained from each of them separately, that is,

$$h(x, y) = h(x) + h(y)$$

For two independent events x and y, the probabilities are related by

$$p(x, y) = p(x)\,p(y)$$

From these two relationships it follows that h(x) must be given by the logarithm of p(x).
We therefore have

$$h(x) = -\log_2 p(x)$$
The negative sign ensures that the information content is non-negative. Note that a low-probability event carries a high information content. The choice of base for the logarithm is arbitrary; information theory mostly uses base 2, in which case h(x) is measured in bits, the number of binary digits required to transmit the value.
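As a minimal sketch of this definition (the function name and example probabilities are illustrative, not from the original text), the information content of a single observation can be computed directly:

```python
import math

def info_content(p, base=2):
    """Self-information h(x) = -log p(x): rarer events are more surprising."""
    return -math.log(p, base)

print(info_content(0.5))   # 1.0 bit: a fair coin flip
print(info_content(0.01))  # ~6.64 bits: an unlikely event is far more informative
```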
Now suppose a sender wishes to transmit the value of this random variable to a receiver. The average amount of information transmitted is the expectation of h(x) with respect to the distribution p(x):

$$H[x] = -\sum_x p(x) \log_2 p(x)$$

This expression is called the entropy of the random variable x.

In machine learning, the natural logarithm is frequently used instead, in which case the entropy is measured in nats rather than bits:

$$H[x] = -\sum_x p(x) \ln p(x)$$

Whenever we encounter a value x for which p(x) = 0, we take p(x) ln p(x) = 0, since p ln p → 0 as p → 0; such values contribute nothing to the entropy.
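The entropy sum is straightforward to compute; here is a minimal sketch (the function name and example distributions are assumed for illustration) that handles the p(x) = 0 convention explicitly:

```python
import math

def entropy(probs, base=2):
    """Entropy H[x] = -sum_x p(x) log p(x) of a discrete distribution."""
    h = 0.0
    for p in probs:
        if p > 0:  # terms with p(x) = 0 contribute nothing, since p log p -> 0
            h -= p * math.log(p, base)
    return h

print(entropy([0.5, 0.5]))  # 1.0 bit: a fair coin is maximally uncertain
print(entropy([0.9, 0.1]))  # ~0.469 bits: a biased coin carries less information
print(entropy([1.0, 0.0]))  # 0.0 bits: a certain outcome carries none
```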

If the values of the variable are encoded and transmitted, we would like to assign short codes to high-probability values and longer codes to low-probability values, so that the average code length is minimized. The entropy is a lower bound on the average number of bits needed to transmit the value of the random variable; for the relationship between entropy and this minimum code length, see Shannon's noiseless coding theorem.
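To make the bound concrete, here is a small illustration (the symbols, probabilities, and code are assumed for the example, not taken from the original): when the probabilities are negative powers of 2, a prefix code with codeword lengths -log2 p(x) attains the entropy exactly:

```python
import math

# Four symbols whose probabilities are negative powers of two.
probs = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}
# A prefix code giving each symbol a codeword of length -log2 p(x).
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

H = -sum(p * math.log2(p) for p in probs.values())
avg_len = sum(probs[s] * len(code[s]) for s in probs)

print(H)        # 1.75 bits: the entropy
print(avg_len)  # 1.75 bits: this code meets the entropy lower bound exactly
```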

Entropy describes the average amount of information needed to specify the state of a random variable. To find the discrete distribution that maximizes the entropy, we introduce a Lagrange multiplier to enforce the normalization constraint on the probabilities:

$$\tilde{H} = -\sum_i p(x_i) \ln p(x_i) + \lambda\left(\sum_i p(x_i) - 1\right)$$

Maximizing gives p(x_i) = 1/M for all i, where M is the number of states of x, and the corresponding maximum entropy is H = ln M.
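A quick numerical check of this result (an illustrative sketch; the distributions are randomly generated, not from the text): no distribution over M states exceeds the entropy ln M of the uniform distribution:

```python
import math
import random

def entropy_nats(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

M = 4
print(entropy_nats([1.0 / M] * M))  # ln(4) ~ 1.386: the uniform distribution
print(math.log(M))

# Randomly drawn distributions over M states never exceed ln(M).
for _ in range(1000):
    weights = [random.random() for _ in range(M)]
    p = [w / sum(weights) for w in weights]
    assert entropy_nats(p) <= math.log(M) + 1e-12
```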

Now suppose we have a joint distribution p(x, y). If the value of x is already known, the additional information needed to specify the corresponding value of y is -ln p(y|x), so the average additional information needed to specify y is

$$H[y|x] = -\sum_x \sum_y p(x, y) \ln p(y|x)$$

which is called the conditional entropy of y given x. Using the product rule p(x, y) = p(y|x) p(x), we obtain

$$H[x, y] = H[y|x] + H[x]$$

that is, the information needed to describe x and y together is the information needed to describe x alone plus the additional information needed to specify y given x.
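A small numerical check of this chain rule (the joint distribution here is assumed for illustration, not taken from the text):

```python
import math

# An illustrative 2x2 joint distribution p(x, y).
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

# Marginal p(x).
px = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}

H_xy = -sum(p * math.log(p) for p in joint.values())
H_x = -sum(p * math.log(p) for p in px.values())
# H[y|x] = -sum_{x,y} p(x, y) ln p(y|x), with p(y|x) = p(x, y) / p(x).
H_y_given_x = -sum(p * math.log(p / px[x]) for (x, _), p in joint.items())

print(abs(H_xy - (H_y_given_x + H_x)) < 1e-12)  # True: H[x,y] = H[y|x] + H[x]
```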


Relative entropy and mutual information: consider some unknown distribution p(x), and suppose we have modeled it using an approximating distribution q(x). If we use q(x) to construct a coding scheme for transmitting values of x, then the average additional information (in nats) required to specify a value of x is

$$\mathrm{KL}(p \,\|\, q) = -\sum_x p(x) \ln q(x) - \left(-\sum_x p(x) \ln p(x)\right) = -\sum_x p(x) \ln \frac{q(x)}{p(x)}$$

This quantity is called the relative entropy, or Kullback-Leibler divergence, between the distributions p(x) and q(x), and it measures how different the two distributions are. Note that it is not symmetric,

$$\mathrm{KL}(p \,\|\, q) \neq \mathrm{KL}(q \,\|\, p)$$

and that KL(p‖q) ≥ 0, with equality if and only if p(x) = q(x).
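A direct implementation of the divergence (the function name and example distributions are assumed for illustration) showing both non-negativity and asymmetry:

```python
import math

def kl_divergence(p, q):
    """KL(p||q) = sum_x p(x) ln(p(x)/q(x)), in nats.

    Assumes q(x) > 0 wherever p(x) > 0; terms with p(x) = 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print(kl_divergence(p, q))  # ~0.0253: always >= 0
print(kl_divergence(q, p))  # ~0.0258: differs -- KL is not symmetric
print(kl_divergence(p, p))  # 0.0: equality holds iff p = q
```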


Now consider the joint distribution p(x, y) of two variables. If x and y are independent, then p(x, y) = p(x) p(y). If they are not independent, we can gauge how strongly they are related by measuring, with the KL divergence, how far the joint distribution is from the product of the marginals:

$$I[x, y] = \mathrm{KL}\big(p(x, y) \,\|\, p(x)\,p(y)\big) = -\sum_x \sum_y p(x, y) \ln \frac{p(x)\,p(y)}{p(x, y)}$$

This expression is called the mutual information of the variables x and y. From the properties of the KL divergence we know that I[x, y] ≥ 0, with equality if and only if x and y are independent. Using the sum and product rules of probability, the mutual information can be related to the conditional entropy:

$$I[x, y] = H[x] - H[x|y] = H[y] - H[y|x]$$

so the mutual information is the reduction in uncertainty about x obtained by learning the value of y (and vice versa).
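Finally, a numerical check of these identities (the joint distribution is assumed for illustration, as before):

```python
import math

# Illustrative 2x2 joint distribution p(x, y).
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

px = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}
py = {y: sum(p for (_, yi), p in joint.items() if yi == y) for y in (0, 1)}

# I[x, y] = sum_{x,y} p(x, y) ln( p(x, y) / (p(x) p(y)) )
I = sum(p * math.log(p / (px[x] * py[y])) for (x, y), p in joint.items())

H_x = -sum(p * math.log(p) for p in px.values())
# H[x|y] = -sum_{x,y} p(x, y) ln p(x|y), with p(x|y) = p(x, y) / p(y).
H_x_given_y = -sum(p * math.log(p / py[y]) for (_, y), p in joint.items())

print(I >= 0)                                # True: mutual information is non-negative
print(abs(I - (H_x - H_x_given_y)) < 1e-12)  # True: I[x,y] = H[x] - H[x|y]
```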
