Given a discrete random variable x, we want to quantify how much information is gained when we observe each of its values. This amount of information depends only on the probability distribution p(x), so we express it as a function h(x) of p(x). When p(x) = 1 the event is certain to occur, so observing it gives us no information at all, i.e. h(x) = 0 (there is no suspense, because it was bound to happen).
If x and y are independent, the information from observing both should be the sum of the two separate pieces, h(x, y) = h(x) + h(y), while their probabilities multiply, p(x, y) = p(x) p(y).
This forces h(x) to be a logarithm of p(x); the relationship between them is h(x) = -log2 p(x).
(Check: p(x) = 1 gives h(x) = 0. The minus sign ensures that h(x) is non-negative. The base 2 is an arbitrary choice; any positive base other than 1 would do.)
Therefore, averaging over all values of x, the entropy of the variable is H[x] = -Σ_x p(x) log2 p(x).
Note: whenever we encounter a value of x for which p(x) = 0, we take p(x) log2 p(x) = 0, since p log p → 0 as p → 0.
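As a quick sanity check of these definitions, here is a minimal Python sketch (the helper names self_information and entropy are mine, not from the text):

```python
import math

def self_information(p, base=2):
    """h(x) = log(1/p(x)): information gained from observing an outcome of probability p."""
    return math.log(1.0 / p, base)

def entropy(probs, base=2):
    """H[x] = sum_x p(x) log(1/p(x)); terms with p(x) = 0 contribute 0 by convention."""
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

print(self_information(1.0))    # 0.0 -> a certain event carries no information
print(self_information(0.25))   # 2.0 bits
print(entropy([0.5, 0.5]))      # 1.0 bit for a fair coin
print(entropy([1.0, 0.0]))      # 0.0 -> no uncertainty, no information
```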
Here is a more intuitive explanation of information entropy:
——————————————————————————————— Information Entropy —————————————————————————————————————————————————
Information entropy is a measure of information. But what is information? Intuitively, coming to know something we did not know before is the process of acquiring information. So, for an individual, the greater the unknown, the greater the uncertainty, the more information there is to gain, and the more it costs to transmit or store. For example, the statement "the sun rises in the east" describes a certainty; hearing it, we gain no information, so its entropy is 0. In information theory, information and uncertainty are equivalent.
With this intuition we can formalize the idea mathematically, and the mathematical language of uncertainty is probability. In the example above, when we discuss information we are really talking about the probability of an event occurring. When we say that information entropy is large or small, we should first be clear about which random variable the entropy refers to, so there is no confusion. If an event has a high probability of happening, its entropy is small; if it has a low probability of happening, its entropy is large.
For example, ask which of 32 teams won the championship. Let the random variable x be the championship team and assume every team is equally likely to win. The information associated with one team x1 is h(x1) = -log2 p(x1); weighting by probability and summing over all teams gives the entropy of x: H[x] = -Σ_i p(xi) log2 p(xi) = log2 32 = 5 bits.
By convention the logarithm is taken base 2, so the entropy is measured in bits, the number of binary digits needed to encode x.
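A short Python version of the championship example, assuming as above that all 32 teams are equally likely:

```python
import math

probs = [1 / 32] * 32                       # each team wins with probability 1/32
H = -sum(p * math.log2(p) for p in probs)
print(H)                                    # 5.0 bits: five yes/no questions pin down the champion
```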
In short, information entropy lets us quantify how much information we have, turning an abstract concept into a number. When we talk about information entropy, we should first clarify which random variable it refers to and what that variable's sample space is, and then use probability theory to compute it. Note that the amount of information is not necessarily related to its importance: information entropy gives only a quantitative value, not a judgment of how important the information is.
———————————————————————————————————————————————————————————————————————————————————
As an example:
Suppose we want to transmit the value of a discrete variable to someone else. The variable has 8 possible values, all equally likely, so its entropy is H[x] = -8 × (1/8) log2(1/8) = 3 bits.
Another example:
A variable has 8 possible states with probabilities (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64); its entropy is H[x] = -(1/2) log2(1/2) - (1/4) log2(1/4) - (1/8) log2(1/8) - (1/16) log2(1/16) - (4/64) log2(1/64) = 2 bits.
From these examples we can see that the entropy of the non-uniform distribution is smaller than that of the uniform distribution. Why is that?
Because, if we want to transmit this variable to someone else, one option is to use a fixed 3-bit code for each of the 8 values, which gives an average code length of 3 bits.
There is another way to do this:
Represent the eight states with the code words 0, 10, 110, 1110, 111100, 111101, 111110, 111111, giving shorter strings to the more probable states.
The average code length is then (1/2) × 1 + (1/4) × 2 + (1/8) × 3 + (1/16) × 4 + 4 × (1/64) × 6 = 2 bits.
There is no shorter encoding that avoids ambiguity: the entropy is a lower bound on the average number of bits needed to transmit the variable (the noiseless coding theorem).
(Feeling the beauty of mathematics again)
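Here is a small Python check of the two transmission examples, using the probabilities and code words assumed above:

```python
import math

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform case: 8 equally likely values need a fixed 3-bit code.
print(entropy_bits([1 / 8] * 8))                      # 3.0 bits

# Non-uniform case: shorter code words for the more probable values.
probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]
print(entropy_bits(probs))                            # 2.0 bits
print(sum(p * len(c) for p, c in zip(probs, codes)))  # 2.0 = average code length, matching the entropy
```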
We can understand information entropy from another angle:
Suppose there are N identical objects to be allocated among a set of boxes, with n_i objects ending up in box i. There are N ways to choose the first object, N-1 ways to choose the second, and so on, so there are N! ways to allocate all the objects. However, we do not wish to distinguish rearrangements of the objects within the same box; box i has n_i! such rearrangements, so the total number of distinct allocations, the multiplicity, is W = N! / Π_i n_i!.
The entropy is then defined as the logarithm of the multiplicity scaled by N: H = (1/N) ln W = (1/N) ln N! - (1/N) Σ_i ln n_i!. Applying Stirling's approximation, ln N! ≈ N ln N - N, and taking the limit N → ∞ with the ratios n_i/N held fixed, we get H = -lim_{N→∞} Σ_i (n_i/N) ln(n_i/N).
A particular assignment of the objects to the boxes is called a "microstate";
the overall distribution of occupation numbers n_i/N is called the "macrostate", and the multiplicity W is the weight of the macrostate.
Interpreting the boxes as the values x_i of a discrete random variable, the probability of the value x_i is p(x_i) = lim_{N→∞} n_i/N, so the entropy can be written H = -Σ_i p(x_i) ln p(x_i).
When the probability mass is concentrated on a few values, the entropy is low; distributions spread more evenly across many values have higher entropy.
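A numerical illustration of this argument, with some arbitrarily chosen occupation numbers for three boxes; it compares (1/N) ln W against -Σ_i p_i ln p_i:

```python
import math

counts = [500, 300, 200]        # arbitrary occupation numbers n_i
N = sum(counts)

# Multiplicity W = N! / (n_1! n_2! ...), computed via log-gamma to avoid huge integers.
ln_W = math.lgamma(N + 1) - sum(math.lgamma(n + 1) for n in counts)

p = [n / N for n in counts]
print(ln_W / N)                               # ~1.023
print(-sum(pi * math.log(pi) for pi in p))    # ~1.030, the Stirling limit -sum p ln p
```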
Now assume there are M states and maximize the entropy subject to the normalization constraint Σ_i p(x_i) = 1, using a Lagrange multiplier: H̃ = -Σ_i p(x_i) ln p(x_i) + λ (Σ_i p(x_i) - 1).
We find that the maximum is attained when all the probabilities are equal, p(x_i) = 1/M, at which point the entropy reaches its maximum value H = ln M.
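A quick numerical sanity check (NumPy assumed): no randomly drawn distribution over M states exceeds the entropy ln M of the uniform distribution.

```python
import numpy as np

M = 8
rng = np.random.default_rng(0)

def entropy_nats(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

print(entropy_nats(np.full(M, 1 / M)), np.log(M))   # both ~2.0794 = ln 8

for _ in range(1000):
    q = rng.dirichlet(np.ones(M))                   # a random distribution over M states
    assert entropy_nats(q) <= np.log(M) + 1e-12
```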
To verify that this stationary point is indeed a maximum, we take the second partial derivatives of H̃: ∂²H̃ / (∂p(x_i) ∂p(x_j)) = -I_ij / p_i, where I_ij is the identity matrix. Since this is negative definite, the stationary point is a maximum.
Now let x be a continuous variable and divide its range into bins of width Δ. According to the mean value theorem, in each bin i there exists a point x_i such that ∫ from iΔ to (i+1)Δ of p(x) dx = p(x_i) Δ. The probability of x falling into bin i is therefore p(x_i) Δ, and the resulting discrete entropy is H_Δ = -Σ_i p(x_i)Δ ln(p(x_i)Δ) = -Σ_i p(x_i)Δ ln p(x_i) - ln Δ, where we have used Σ_i p(x_i)Δ = 1.
As Δ → 0, the second term -ln Δ on the right-hand side diverges, so it is dropped; the limit of the first term is an integral, called the differential entropy: H[x] = -∫ p(x) ln p(x) dx.
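The following Python sketch (NumPy assumed) discretizes a standard Gaussian into bins of width Δ and shows that H_Δ + ln Δ approaches the differential entropy 0.5(1 + ln 2π) ≈ 1.4189 as Δ shrinks:

```python
import numpy as np

def gauss_pdf(x):
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)    # standard normal density

for delta in [1.0, 0.1, 0.01]:
    centers = np.arange(-10, 10, delta) + delta / 2    # bin mid-points stand in for the x_i
    mass = gauss_pdf(centers) * delta                  # p(x_i) * delta, the probability of each bin
    h_delta = -np.sum(mass * np.log(mass))             # discrete entropy H_delta
    print(delta, h_delta + np.log(delta))              # -> ~1.4189 as delta shrinks

print(0.5 * (1 + np.log(2 * np.pi)))                   # exact differential entropy of N(0, 1)
```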
For a continuous variable, which distribution maximizes the differential entropy?
First, p(x) must satisfy the following constraints (normalization plus, to make the problem well posed, a fixed mean μ and variance σ²): ∫ p(x) dx = 1, ∫ x p(x) dx = μ, ∫ (x - μ)² p(x) dx = σ².
Maximizing the differential entropy with a Lagrange multiplier for each constraint gives p(x) = exp(-1 + λ1 + λ2 x + λ3 (x - μ)²).
Substituting this back into the constraints and simplifying gives p(x) = (1 / √(2πσ²)) exp(-(x - μ)² / (2σ²)).
So the distribution that maximizes the differential entropy is the Gaussian.
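As an illustration (not from the text), we can compare the closed-form differential entropies, in nats, of a few distributions constrained to the same variance σ²; the uniform and Laplace formulas are standard results:

```python
import numpy as np

sigma = 1.0
h_gauss   = 0.5 * (1 + np.log(2 * np.pi * sigma ** 2))   # N(mu, sigma^2)
h_uniform = np.log(np.sqrt(12) * sigma)                   # uniform with variance sigma^2 (width sqrt(12)*sigma)
h_laplace = 1 + np.log(np.sqrt(2) * sigma)                # Laplace with variance sigma^2 (scale sigma/sqrt(2))
print(h_gauss, h_uniform, h_laplace)                      # ~1.419 > ~1.242, ~1.347: the Gaussian wins
```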
Note that in carrying out this maximization we did not constrain p(x) to be non-negative; since the resulting distribution turns out to be non-negative anyway, that constraint was not necessary.
Evaluating the differential entropy of the normal distribution gives H[x] = (1/2)(1 + ln(2πσ²)).
We see that the entropy increases as the variance σ² increases.
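A tiny numerical check of this expression for a few variances:

```python
import numpy as np

for var in [0.01, 0.1, 1.0, 10.0]:
    print(var, 0.5 * (1 + np.log(2 * np.pi * var)))
# The entropy grows with the variance, and unlike the discrete entropy it can be
# negative: here it drops below zero once sigma^2 < 1 / (2*pi*e) ~= 0.0585.
```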
Relative entropy and mutual information