"PRML Reading notes-chapter1-introduction" 1.6 Information theory


Entropy

Given a discrete random variable x, we want to quantify how much information is received when we observe each of its values. We denote this amount of information by h(x); it should be a monotonic function of the probability distribution p(x). When p(x) = 1, the event is certain to occur, so observing it gives us no information at all (because it was bound to happen, there is no surprise).

If x and y are independent, then:

p(x, y) = p(x) p(y),

and the information gained from observing both should be the sum of the information gained from each: h(x, y) = h(x) + h(y).

The relationship that satisfies both requirements is:

h(x) = -log2 p(x).

(When p(x) = 1 we get h(x) = 0; the minus sign ensures that h(x) is non-negative; the base 2 is an arbitrary choice, and any other positive base except 1 could be used, but base 2 gives units of bits.)

Therefore, averaging over all values of x, the entropy of the variable is:

H[x] = -Σ_x p(x) log2 p(x).

Note: when a value with p(x) = 0 is encountered, we take p(x) log2 p(x) = 0, since p ln p → 0 as p → 0.
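To make this concrete, here is a minimal Python sketch (my own illustration, not from the book) that computes the entropy of a discrete distribution from a list of probabilities, treating p = 0 terms as contributing 0 as noted above.

```python
from math import log2

def entropy(probs):
    """Entropy in bits: H = -sum_x p(x) * log2 p(x), with 0 * log2(0) taken as 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

# A fair coin carries 1 bit per outcome; a biased coin carries less;
# a certain event carries no information at all.
print(entropy([0.5, 0.5]))   # 1.0
print(entropy([0.9, 0.1]))   # ~0.469
print(entropy([1.0, 0.0]))   # 0.0
```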

Here is an intuitive explanation of information entropy:

——————————————————————————————— Information Entropy —————————————————————————————————————————————————

Information entropy is a measure of information. What is information? Intuitively, coming to know something we did not know is the process of acquiring information; the more unknown a thing is, the greater its uncertainty, the more information it should carry, and the greater the cost of transmitting or storing it. For example, the statement "the sun rises in the east" is a certainty; hearing it we gain no information, so its entropy is 0. In information theory, information is identified with uncertainty.

With this intuition in place, the idea can be made mathematical, and in mathematics uncertainty is expressed by probability. In the example above, when we discuss information we are essentially talking about the probability that an event occurs. Whenever we say an information entropy is large or small, we should be clear about which random variable the entropy refers to. An event with a high probability of occurring carries little information (low entropy); an event with a low probability of occurring carries a lot (high entropy).

For example, ask which of 32 teams won a championship. Let the random variable x be the champion team and assume each team wins with equal probability 1/32. The information contributed by a single team x1 is h(x1) = -log2 p(x1); summing over all teams, weighted by their probabilities, gives the entropy of x:

H[x] = -Σ_i p(xi) log2 p(xi) = -32 × (1/32) × log2 (1/32) = 5 bits.

In general the logarithm is taken with base 2, so the entropy gives the number of bits needed to describe x.

In short, information entropy lets us quantify how much information we have, so an abstract concept can be described quantitatively. When we talk about information entropy, we should first be clear which random variable the entropy refers to and what its sample space is, and then compute it with the tools of probability theory. Note that the amount of information is not necessarily related to its importance: information entropy gives a quantity, not a measure of how important the information is.

———————————————————————————————————————————————————————————————————————————————————

As an example:

Suppose you want to transmit the value of a discrete variable to someone else. The variable has 8 states, all equally likely, so its entropy is:

H[x] = -8 × (1/8) × log2 (1/8) = 3 bits.

Another example:

Now consider a variable with 8 states {a, b, c, d, e, f, g, h} whose probabilities are (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64). Its entropy is:

H[x] = -(1/2) log2 (1/2) - (1/4) log2 (1/4) - (1/8) log2 (1/8) - (1/16) log2 (1/16) - 4 × (1/64) log2 (1/64) = 2 bits.

From these examples we can see that the entropy of the non-uniform distribution is smaller than that of the uniform distribution. Why is that?

Because, if we want to transmit this variable to someone else, one option is to use a fixed 3-bit code for each of the 8 states, giving an average code length of 3 bits.

There is a shorter way: give the more probable states shorter codes and the less probable states longer ones, for example representing {a, b, c, d, e, f, g, h} by the strings 0, 10, 110, 1110, 111100, 111101, 111110, 111111.

The average code length is then 2 bits:

(1/2) × 1 + (1/4) × 2 + (1/8) × 3 + (1/16) × 4 + 4 × (1/64) × 6 = 2.

No code can be shorter than this while remaining unambiguous to decode: the noiseless coding theorem says that the entropy is a lower bound on the average number of bits needed to transmit the variable.

(Feeling the beauty of mathematics again)
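The 8-state coding argument can be checked numerically. The sketch below is my own illustration; the code words are the ones given in the example above.

```python
from math import log2

probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

entropy = -sum(p * log2(p) for p in probs)               # entropy of the distribution
avg_len = sum(p * len(c) for p, c in zip(probs, codes))  # expected code length

print(entropy)   # 2.0 bits
print(avg_len)   # 2.0 bits -- the code meets the entropy lower bound
```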

We can understand information entropy from another angle:

Suppose there are N identical objects, which are to be divided among a set of bins, with n_i objects in bin i. There are N ways to choose the first object, N-1 ways to choose the second, and so on, so there are N! ways to allocate all the objects. However, we do not wish to distinguish rearrangements of objects within the same bin; since the objects in bin i can be reordered in n_i! ways, the total number of distinct allocations, called the multiplicity, is:

W = N! / Π_i n_i!.

The entropy is defined as the logarithm of the multiplicity scaled by 1/N:

H = (1/N) ln W = (1/N) ln N! - (1/N) Σ_i ln n_i!.

Applying Stirling's approximation ln N! ≈ N ln N - N as N → ∞, with the fractions n_i / N held fixed, gives:

H = -lim_{N→∞} Σ_i (n_i / N) ln (n_i / N) = -Σ_i p_i ln p_i,

where p_i = lim n_i / N is the probability of an object landing in bin i.
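A quick numerical check of this limit (my own sketch): as N grows with the bin fractions held fixed, (1/N) ln W approaches -Σ_i p_i ln p_i.

```python
from math import lgamma, log

def log_multiplicity(counts):
    """ln W = ln N! - sum_i ln n_i!, using lgamma(n + 1) = ln n!."""
    n_total = sum(counts)
    return lgamma(n_total + 1) - sum(lgamma(n + 1) for n in counts)

fractions = [0.5, 0.3, 0.2]
target = -sum(p * log(p) for p in fractions)   # -sum_i p_i ln p_i, ~1.0297 nats

for n in (10, 100, 10_000, 1_000_000):
    counts = [round(n * f) for f in fractions]
    print(sum(counts), log_multiplicity(counts) / sum(counts), "->", target)
```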

A specific assignment of objects to bins is called a microstate;

the overall distribution of occupation numbers, expressed through the ratios n_i / N, is called a macrostate, and the multiplicity W is also known as the weight of the macrostate.

In terms of a discrete random variable X with states x_i and probabilities p(x_i) = p_i, the entropy is:

H[p] = -Σ_i p(x_i) ln p(x_i).

Distributions p(x_i) that are sharply concentrated around a few values have low entropy; distributions spread more evenly over many values have higher entropy.

Suppose the variable has M states, and we look for the distribution that maximizes the entropy. Since the probabilities must sum to one, we introduce a Lagrange multiplier and maximize:

H~ = -Σ_i p(x_i) ln p(x_i) + λ ( Σ_i p(x_i) - 1 ).

Setting the derivatives with respect to the p(x_i) to zero, we find that all of the p(x_i) are equal, p(x_i) = 1/M, and the corresponding entropy is:

H = ln M,

which is the maximum value attainable with M states.
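A small numerical check of this result (my own sketch): random distributions over M states never exceed the entropy ln M of the uniform distribution.

```python
import random
from math import log

M = 5
uniform_entropy = log(M)   # ln M, attained by p_i = 1/M

def entropy_nats(probs):
    return -sum(p * log(p) for p in probs if p > 0)

best = 0.0
for _ in range(10_000):
    w = [random.random() for _ in range(M)]
    total = sum(w)
    best = max(best, entropy_nats([x / total for x in w]))

print(uniform_entropy)   # ~1.609
print(best)              # close to, but never above, ln M
```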

To confirm that this stationary point is indeed a maximum, we take the second partial derivatives of the augmented entropy:

∂²H~ / ( ∂p(x_i) ∂p(x_j) ) = -I_ij / p(x_i),

where I_ij are the elements of the identity matrix; the Hessian is negative definite, so the stationary point is a maximum.

Now let x be a continuous variable. To extend the definition of entropy, divide x into bins of width Δ.

According to the mean value theorem, within each bin there is a value x_i such that:

∫_{iΔ}^{(i+1)Δ} p(x) dx = p(x_i) Δ.

The probability of x falling in bin i is therefore p(x_i) Δ, and the entropy of the resulting discrete distribution is:

H_Δ = -Σ_i p(x_i) Δ ln ( p(x_i) Δ ) = -Σ_i p(x_i) Δ ln p(x_i) - ln Δ,

where we have used Σ_i p(x_i) Δ = 1.

As Δ → 0, the second term -ln Δ diverges and is discarded, while the first term approaches an integral; this limit is called the differential entropy:

H[x] = -∫ p(x) ln p(x) dx.
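This discretization can be verified numerically. The sketch below (my own illustration, using a unit Gaussian for p(x)) shows the first term of H_Δ approaching the differential entropy as Δ shrinks.

```python
from math import pi, log, exp, sqrt

mu, sigma = 0.0, 1.0

def p(x):
    """Gaussian density with mean mu and standard deviation sigma."""
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / sqrt(2 * pi * sigma ** 2)

exact = 0.5 * (1 + log(2 * pi * sigma ** 2))   # differential entropy, ~1.4189 nats

for delta in (1.0, 0.1, 0.01):
    n_bins = round(20 / delta)                 # bins covering [-10, 10]
    centers = [-10 + (i + 0.5) * delta for i in range(n_bins)]
    first_term = -sum(p(x) * delta * log(p(x)) for x in centers)
    print(delta, first_term, "-> exact:", exact)
```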

For a continuous variable, which distribution maximizes the differential entropy?

First, we constrain the distribution to be normalized and to have a given mean and variance:

∫ p(x) dx = 1,  ∫ x p(x) dx = μ,  ∫ (x - μ)² p(x) dx = σ².

Maximizing the differential entropy subject to these constraints with Lagrange multipliers gives:

p(x) = exp( -1 + λ1 + λ2 x + λ3 (x - μ)² ).

Substituting back into the constraints and simplifying gives:

p(x) = ( 1 / √(2πσ²) ) exp( -(x - μ)² / (2σ²) ).

So the distribution that maximizes the differential entropy, for a given mean and variance, is the Gaussian.

Note that in this maximization we did not constrain p(x) to be non-negative; since the resulting distribution turns out to be non-negative anyway, such a constraint is not necessary.

Evaluating the differential entropy of the normal distribution gives:

H[x] = (1/2) ( 1 + ln (2πσ²) ).

We see that the entropy increases as the variance increases. Unlike the discrete entropy, the differential entropy can be negative: H[x] < 0 when σ² < 1/(2πe).
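A brief numerical illustration (my own sketch): the closed form grows with the variance, goes negative for small variance, and exceeds the differential entropy of a uniform distribution with the same variance, consistent with the Gaussian being the maximum-entropy distribution.

```python
from math import pi, log, sqrt

def gaussian_entropy(var):
    return 0.5 * (1 + log(2 * pi * var))

def uniform_entropy(var):
    # A uniform density on an interval of width w has variance w^2 / 12
    # and differential entropy ln w.
    return log(sqrt(12 * var))

for var in (0.01, 1.0, 100.0):
    print(var, gaussian_entropy(var), uniform_entropy(var))
# The entropy rises with the variance, is negative at var = 0.01,
# and the Gaussian value is always the larger of the two.
```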

Relative entropy and mutual information

Relative entropy, also known as the KL divergence (Kullback-Leibler divergence, abbreviated KLD), information divergence, or information gain, measures how much an approximating distribution q(x) differs from the true distribution p(x). It is defined as:

KL(p || q) = -∫ p(x) ln ( q(x) / p(x) ) dx = ∫ p(x) ln ( p(x) / q(x) ) dx.

The KL divergence is not symmetric: in general KL(p || q) ≠ KL(q || p). It is also non-negative, KL(p || q) ≥ 0, with equality if and only if p(x) = q(x).
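Both properties can be seen in a small numerical sketch (my own, for two arbitrary discrete distributions).

```python
from math import log

def kl(p, q):
    """KL(p || q) = sum_x p(x) * ln(p(x) / q(x)), in nats."""
    return sum(pv * log(pv / qv) for pv, qv in zip(p, q) if pv > 0)

p = [0.7, 0.2, 0.1]
q = [0.3, 0.3, 0.4]

print(kl(p, q))   # ~0.37, non-negative
print(kl(q, p))   # ~0.42, a different value: KL is not symmetric
print(kl(p, p))   # 0.0, zero exactly when the two distributions coincide
```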

—————————————————————————————————————————————————————————————————————————

Convex function: a function f(x) is convex if every chord lies on or above the function, i.e. for any x1, x2 and any 0 ≤ λ ≤ 1:

f(λ x1 + (1 - λ) x2) ≤ λ f(x1) + (1 - λ) f(x2).

Properties: by induction this extends to any convex combination of points,

f( Σ_i λ_i x_i ) ≤ Σ_i λ_i f(x_i),

where λ_i ≥ 0 and Σ_i λ_i = 1.

Jensen's inequality

Jensen's inequality is named after the Danish mathematician Johan Jensen. It relates the value of a convex function of an integral to the integral of the convex function. A simple corollary is that the secant line joining any two points on a convex function lies above the graph of the function between those points, namely:

f(λ x1 + (1 - λ) x2) ≤ λ f(x1) + (1 - λ) f(x2), for 0 ≤ λ ≤ 1.

In the probability-theory version, the measure is a probability measure and the function is replaced by a real-valued random variable X (in pure mathematics there is no essential difference between the two). In that setting, the integral of any function with respect to the probability measure is an expectation, and the inequality says that if φ is a convex function, then φ(E[X]) ≤ E[φ(X)].

Here E denotes expectation. For a continuous variable x with density p(x), Jensen's inequality reads:

f( ∫ x p(x) dx ) ≤ ∫ f(x) p(x) dx.

Applying Jensen's inequality to the relative entropy, using the convexity of -ln x, gives:

KL(p || q) = -∫ p(x) ln ( q(x) / p(x) ) dx ≥ -ln ∫ p(x) ( q(x) / p(x) ) dx = -ln ∫ q(x) dx = 0,

so the KL divergence is non-negative, with equality if and only if p(x) = q(x).

Relative entropy and likelihood function

Suppose the data come from an unknown true distribution p(x), which we want to approximate with a parametric model q(x|θ), using a set of N observations x1, ..., xN drawn from p(x). A natural approach is to take the KL divergence KL(p || q) as the error function and determine the optimal parameters θ by minimizing it. Since p(x) is unknown, the expectation over p(x) is approximated by an average over the observed samples:

KL(p || q) ≈ (1/N) Σ_n { -ln q(xn | θ) + ln p(xn) }.

Only the first term of this error function depends on θ, so differentiating with respect to θ shows that minimizing the KL divergence is equivalent to maximizing the likelihood function.
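A short sketch of this equivalence (my own illustration, with an assumed unit-variance Gaussian model q(x|θ)): the θ minimizing the sample average of -ln q(xn|θ), the only θ-dependent part of the KL estimate, is the maximum-likelihood estimate, here the sample mean.

```python
import random
from math import pi, log

random.seed(0)
data = [random.gauss(3.0, 1.0) for _ in range(1000)]   # draws from the unknown p(x)

def avg_neg_log_lik(theta, xs):
    """Average of -ln q(x_n | theta) for a unit-variance Gaussian model q."""
    return sum(0.5 * log(2 * pi) + 0.5 * (x - theta) ** 2 for x in xs) / len(xs)

# Grid search over theta: the minimiser of the theta-dependent part of the
# KL estimate coincides with the maximum-likelihood estimate (the sample mean).
thetas = [i / 100 for i in range(200, 401)]
best = min(thetas, key=lambda t: avg_neg_log_lik(t, data))
print(best, sum(data) / len(data))   # both close to 3.0
```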

Mutual information

Mutual information describes how much information two variables x and y carry about each other. It is defined as the KL divergence between the joint distribution p(x, y) and the product of the marginals p(x) p(y):

I[x, y] = KL( p(x, y) || p(x) p(y) ) = -∫∫ p(x, y) ln ( p(x) p(y) / p(x, y) ) dx dy.

By the non-negativity of relative entropy, the mutual information satisfies I[x, y] ≥ 0, with equality if and only if x and y are independent.

The mutual information can be seen as the reduction in uncertainty about one variable obtained by observing the other:

I[x, y] = H[x] - H[x|y] = H[y] - H[y|x].
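As a closing sketch (my own, using a small assumed 2 x 2 joint table), the mutual information computed as a KL divergence agrees with the computation as H[x] - H[x|y].

```python
from math import log

# Joint distribution p(x, y) over 2 x 2 states (rows index x, columns index y).
joint = [[0.4, 0.1],
         [0.1, 0.4]]

px = [sum(row) for row in joint]           # marginal p(x)
py = [sum(col) for col in zip(*joint)]     # marginal p(y)

# Mutual information as KL(p(x, y) || p(x) p(y)).
mi_kl = sum(joint[i][j] * log(joint[i][j] / (px[i] * py[j]))
            for i in range(2) for j in range(2))

# Mutual information as H[x] - H[x|y].
hx = -sum(p * log(p) for p in px)
hx_given_y = -sum(joint[i][j] * log(joint[i][j] / py[j])
                  for i in range(2) for j in range(2))

print(mi_kl, hx - hx_given_y)   # equal, ~0.193 nats
```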

"PRML Reading notes-chapter1-introduction" 1.6 Information theory
