http://blog.csdn.net/pipisorry/article/details/51695283
This article mainly covers: entropy, joint entropy, conditional entropy, relative entropy (KL divergence), cross-entropy, perplexity, and mutual information.
Entropy (information theory)
In information theory, entropy is the average amount of information contained in each received message; it is also called information entropy, source entropy, or average self-information. Here a message stands for an event, a sample, or a feature drawn from a distribution or data stream. (Entropy is best understood as a measure of uncertainty rather than of certainty: the more random a source is, the larger its entropy.)
Another way to characterize a source is by the probability distribution of its samples. The idea is that an unlikely event, when it does occur, provides more information. In information-theoretic terms, the higher the entropy, the more information can be transmitted; the lower the entropy, the less. For reasons explained below, it makes sense to define the information content of an event as the negative logarithm of its probability.
The information content of each event, together with the probability distribution over events, defines a random variable; the mean (expectation) of this random variable is the average amount of information produced by the distribution, i.e. the entropy. Using the logarithm of the probability as the measure of information is what makes it additive: tossing a fair coin once provides 1 bit of information, and tossing it m times provides m bits. More generally, log2(n) bits are needed to represent a variable that can take n different values.
In 1948, Claude Shannon carried the concept of entropy over from thermodynamics into information theory, which is why it is also called Shannon entropy.
Where the entropy formula comes from
Suppose an article is titled "What exactly do black holes eat", containing the words {black hole, exactly, eat what}, and we have to guess the category of the article from a single word. Which word gives us the most information? Clearly it is "black hole": the probability that this word appears in a document is very low, so once it does appear, the article is very likely about science. The other two words, "exactly" and "eat what", appear with high probability, so they give us little information.
How can we express the amount of information given by a word with a function h(x)? First, it must depend on p(x) and be negatively correlated with it. Second, if x and y are independent ("black hole" and "universe" are not independent: an article that mentions black holes is bound to mention the universe), i.e. p(x, y) = p(x) p(y), then the information they provide should add up, i.e. h(x, y) = h(x) + h(y). The only kind of function satisfying both conditions is the negative logarithm: h(x) = -log2 p(x).
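As a quick numeric check of these two requirements, here is a minimal sketch (the word probabilities are made up for illustration) showing that h(x) = -log2 p(x) is large for rare events, small for common ones, and additive for independent events:

import math

def h(p):
    # Self-information, in bits, of an event with probability p.
    return -math.log2(p)

p_black_hole, p_eat_what = 0.001, 0.5       # made-up word probabilities
print(h(p_black_hole))                      # rare word -> about 9.97 bits
print(h(p_eat_what))                        # common word -> 1 bit

# Independence means p(x, y) = p(x) * p(y), so the information adds up:
print(h(p_black_hole * p_eat_what))         # equals the sum below
print(h(p_black_hole) + h(p_eat_what))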
Now suppose a sender wants to transmit a long sequence of values generated by the random variable X to a receiver. The average amount of information the receiver obtains per value is the mathematical expectation: H(X) = E[h(X)] = -Σ_x p(x) log2 p(x).
This is the concept of entropy. Another important property is that the entropy equals the minimum average code length per symbol (Shannon's source coding theorem). If the true distribution p(x) is unknown and we use an approximation q(x) of it to encode the values of the random variable, the average code length is longer than the one obtained by encoding with the true p(x); the extra length is the KL divergence (called a divergence rather than a distance because it is not symmetric and does not satisfy the triangle inequality): D(p || q) = Σ_x p(x) log2 [ p(x) / q(x) ].
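The coding interpretation can be verified numerically. The following minimal sketch (both distributions are made up) uses the idealized per-symbol code length -log2 q(x) and shows that the average length under the true distribution p exceeds the entropy H(p) by exactly the KL divergence D(p || q):

import math

p = {"heads": 0.5, "tails": 0.5}            # assumed true distribution
q = {"heads": 0.75, "tails": 0.25}          # assumed approximate distribution used for coding

entropy      = sum(-px * math.log2(px) for px in p.values())           # H(p)
avg_code_len = sum(px * -math.log2(q[x]) for x, px in p.items())       # average length when coding with q
kl           = sum(px * math.log2(px / q[x]) for x, px in p.items())   # D(p || q)

print(entropy, avg_code_len, kl)
print(abs(avg_code_len - (entropy + kl)) < 1e-9)                       # extra length == KL divergence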
Calculation of entropy
If an ideal coin has an equal chance of landing heads or tails, the entropy of a coin toss reaches its maximum possible value: we have no idea what the result of the next toss will be, so every toss is completely unpredictable.
Therefore, for a fair coin the entropy of one toss is one bit: there are only two outcomes, heads or tails, which can be encoded as 0 and 1, and successive tosses are independent of each other. For n independent tosses the entropy is n bits, because the outcome can be represented as a bit stream of length n.
But if both sides of the coin are identical, the entropy of a coin toss is zero, because the result can be predicted with certainty. In the real world, the entropy of the data we collect lies somewhere between these two extremes.
A slightly more complicated example: suppose a random variable X takes three possible values with probabilities 1/2, 1/4 and 1/4. The optimal code assigns these values code lengths of 1, 2 and 2 bits, so the average code length is 1/2 · 1 + 1/4 · 2 + 1/4 · 2 = 3/2 bits. Its entropy is 3/2.
Entropy is therefore the mathematical expectation of the number of bits needed for each outcome of the random variable, weighted by the probability of that outcome.
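Both examples above can be checked directly; a minimal sketch (using the probabilities as reconstructed above):

import math

coin = [1/2, 1/2]                                 # fair coin
print(sum(-p * math.log2(p) for p in coin))       # 1.0 bit

three = [1/2, 1/4, 1/4]                           # three values with code lengths 1, 2, 2
print(sum(-p * math.log2(p) for p in three))      # 1.5 bits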
Entropy
The definition of entropy
Entropy is also known as self-information: it is the average amount of information the source X provides per symbol, regardless of which symbol is sent. Entropy can also be viewed as a measure of the uncertainty of a random variable: the larger the entropy, the greater the uncertainty and the harder it is to guess the variable's value correctly; and the more uncertain a random variable is, the more information is needed to determine its value.
Following Boltzmann's H-theorem, Shannon defined the entropy H (Greek capital letter eta) of a random variable X with possible values {x1, ..., xn} as: H(X) = E[I(X)] = E[-log_b P(X)].
Here P is the probability mass function of X, E is the expectation operator, and I(X) is the information content (also called the self-information) of X. I(X) is itself a random variable.
When the values are taken from a finite set, the entropy formula can be written as: H(X) = Σ_i P(x_i) I(x_i) = -Σ_i P(x_i) log_b P(x_i).
Note: the unit of entropy depends on the base b of the logarithm used in the definition; common units are the bit (also called the shannon, Sh), the nat, and the hartley (Hart). When b = 2 the unit is the bit, when b = e it is the nat, and when b = 10 it is the hartley.
When P(x_i) = 0 for some i, the corresponding summand 0 · log_b 0 is taken to be 0, which is consistent with the limit lim_{p->0+} p log p = 0.
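The finite-sample formula, including the 0 · log_b 0 = 0 convention, can be packaged as a small function. A minimal sketch (the demo distributions are made up):

import math

def entropy(probs, b=2):
    # H(X) = -sum_i P(x_i) * log_b P(x_i), with 0 * log_b 0 taken as 0.
    return -sum(p * math.log(p, b) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))     # uniform over 4 values -> 2 bits
print(entropy([0.5, 0.5, 0.0]))              # a zero-probability outcome contributes nothing -> 1 bit
print(entropy([0.5, 0.5], b=math.e))         # same distribution measured in nats -> ln 2, about 0.693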
Joint entropy
Joint entropy is the average amount of information needed to describe a pair of random variables: H(X, Y) = -Σ_x Σ_y p(x, y) log p(x, y). A small numerical sketch is given after the conditional-entropy definition below.
Conditional entropy
The conditional entropy of X given Y, where X and Y take the values x_i and y_j respectively, is defined as: H(X | Y) = -Σ_{i,j} p(x_i, y_j) log p(x_i | y_j).
Here p(x_i, y_j) is the probability that X = x_i and Y = y_j. This quantity should be understood as the amount of randomness remaining in the random variable X once the value of Y is known.
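A minimal sketch of both joint and conditional entropy, using a small made-up joint distribution p(x, y); it also checks the chain-rule identity H(X | Y) = H(X, Y) - H(Y):

import math
from collections import defaultdict

# Made-up joint distribution p(x, y).
joint = {("x1", "y1"): 0.25, ("x1", "y2"): 0.25,
         ("x2", "y1"): 0.40, ("x2", "y2"): 0.10}

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

joint_entropy = H(joint.values())                       # H(X, Y)

p_y = defaultdict(float)                                # marginal p(y)
for (_, y), p in joint.items():
    p_y[y] += p

# H(X | Y) = -sum_{i,j} p(x_i, y_j) * log2 p(x_i | y_j), where p(x | y) = p(x, y) / p(y)
cond_entropy = -sum(p * math.log2(p / p_y[y]) for (x, y), p in joint.items() if p > 0)

print(joint_entropy, cond_entropy)
print(abs(cond_entropy - (joint_entropy - H(p_y.values()))) < 1e-9)    # chain rule holds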
Relative entropy (Kullback-Leibler divergence, KL distance)
Relative entropy measures the gap between two probability distributions: D(p || q) = Σ_x p(x) log [ p(x) / q(x) ]. When the two distributions are identical, their relative entropy is 0; as the difference between them grows, the relative entropy grows.
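A minimal sketch of relative entropy over a finite alphabet (the two distributions are made up); it shows that D(p || p) = 0 and that the divergence is not symmetric. It assumes q(x) > 0 wherever p(x) > 0:

import math

def kl_divergence(p, q):
    # D(p || q) = sum_x p(x) * log2(p(x) / q(x)); assumes q(x) > 0 wherever p(x) > 0.
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.4, "b": 0.4, "c": 0.2}

print(kl_divergence(p, p))     # identical distributions -> 0
print(kl_divergence(p, q))     # > 0
print(kl_divergence(q, p))     # generally different from D(p || q): not symmetric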
Cross-entropy
If a random variable X ~ p(x) and q(x) is a model used to approximate p(x), then the cross-entropy between the random variable X and the model q is defined as: H(X, q) = H(X) + D(p || q) = -Σ_x p(x) log q(x).
Cross-entropy is used to measure the difference between an estimated model and the true probability distribution.
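A minimal sketch of using cross-entropy to compare two made-up candidate models, q1 and q2, against an assumed true distribution p; the closer model receives the smaller value, and the minimum is reached when the model equals p:

import math

def cross_entropy(p, q):
    # H(X, q) = -sum_x p(x) * log2 q(x); smaller means q is closer to p.
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)

p  = {"a": 0.6, "b": 0.3, "c": 0.1}      # assumed true distribution
q1 = {"a": 0.55, "b": 0.35, "c": 0.1}    # a close model
q2 = {"a": 0.2, "b": 0.3, "c": 0.5}      # a poor model

print(cross_entropy(p, p))     # minimum: equals the entropy H(X)
print(cross_entropy(p, q1))    # slightly larger
print(cross_entropy(p, q2))    # much larger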
Perplexity
In language modeling, perplexity is often used instead of cross-entropy to evaluate a model. Given a sample l_1 l_2 ... l_n of a language L, the perplexity of a model q is defined as PP_q = 2^{H(L, q)} ≈ [ q(l_1 l_2 ... l_n) ]^{-1/n}. The task of language-model design is then to find the model with the smallest perplexity, which is the model closest to the real language.
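A minimal sketch of evaluating a unigram language model by perplexity; the model q and the test sample are made up for illustration. It uses the relation perplexity = 2 raised to the per-word cross-entropy estimated on the sample:

import math

q = {"the": 0.4, "cat": 0.2, "sat": 0.2, "mat": 0.2}    # made-up unigram model
sample = ["the", "cat", "sat", "the", "mat"]            # made-up test sample of language L

# Per-word cross-entropy on the sample: -(1/n) * log2 q(l_1 ... l_n)
n = len(sample)
cross_entropy = -sum(math.log2(q[w]) for w in sample) / n

perplexity = 2 ** cross_entropy        # equivalently q(l_1 ... l_n) ** (-1/n)
print(cross_entropy, perplexity)       # the better the model, the lower both values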
Mutual information
If (X, Y) ~ p(x, y), the mutual information I(X; Y) between X and Y is defined as:
I(X; Y) = H(X) - H(X | Y)    (11)
Expanding with the definitions of H(X) and H(X | Y) gives: I(X; Y) = Σ_x Σ_y p(x, y) log [ p(x, y) / ( p(x) p(y) ) ].
Mutual information I(X; Y) is the reduction in uncertainty about X obtained by learning the value of Y; in other words, it measures how much information the value of Y reveals about X.
Note: the pointwise mutual information between two particular values, I(x; y) = log [ p(x, y) / ( p(x) p(y) ) ], can be positive, negative, or zero, while the average mutual information I(X; Y) defined above is always non-negative.
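A minimal sketch computing I(X; Y) from a small made-up joint distribution, both from the expanded definition and as H(X) + H(Y) - H(X, Y); it also checks the identity I(X; X) = H(X) used in the next subsection:

import math
from collections import defaultdict

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Made-up joint distribution p(x, y).
joint = {("x1", "y1"): 0.3, ("x1", "y2"): 0.2,
         ("x2", "y1"): 0.1, ("x2", "y2"): 0.4}

p_x, p_y = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    p_x[x] += p
    p_y[y] += p

# Definition: I(X; Y) = sum_{x,y} p(x, y) * log2( p(x, y) / (p(x) * p(y)) )
mi = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items() if p > 0)
print(mi, H(p_x.values()) + H(p_y.values()) - H(joint.values()))   # the two forms agree

# Self-information: for Y = X the joint probability is p(x, x) = p(x), so I(X; X) = H(X).
self_mi = sum(p * math.log2(p / (p * p)) for p in p_x.values() if p > 0)
print(self_mi, H(p_x.values()))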
The difference and relation between mutual information, conditional entropy and joint entropy
Since H(X | X) = 0, we have H(X) = H(X) - H(X | X) = I(X; X).
This explains, on the one hand, why entropy is also called self-information; on the other hand, it shows that the mutual information between two completely dependent variables is not a constant but depends on their entropy.
Relation to thermodynamic entropy
Physicists and chemists are more interested in the change in entropy as a system evolves spontaneously away from its initial state, in accordance with the second law of thermodynamics. In classical thermodynamics, entropy is defined in terms of macroscopic measurements of the system, whereas the probability distribution is at the core of the definition of information entropy.
Entropy Calculation Examples
Entropy Calculation Example 1
Entropy Calculation Example 2
Note that the marginal probabilities here are per syllable, and each is twice the probability of the corresponding character; the per-character probabilities are therefore half of the corresponding marginals, namely:
p: 1/16, t: 3/8, k: 1/16, a: 1/4, i: 1/8, u: 1/8
There are several ways to find the joint entropy; here we use the chain rule: H(X, Y) = H(X) + H(Y | X).
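A minimal sketch based only on the probabilities listed above; it assumes, as in the standard version of this example, that each syllable consists of a consonant (p, t, k) and a vowel (a, i, u). The joint syllable table from the original post's figure is not reproduced here, so the sketch computes the per-character entropy and the marginal (per-syllable) entropies and notes what the chain rule would still need:

import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Per-character probabilities listed above.
per_char = {"p": 1/16, "t": 3/8, "k": 1/16, "a": 1/4, "i": 1/8, "u": 1/8}
print(H(per_char.values()))                            # per-character entropy, in bits

# Per-syllable marginals: twice the per-character values, as noted above.
consonants = {"p": 1/8, "t": 3/4, "k": 1/8}
vowels     = {"a": 1/2, "i": 1/4, "u": 1/4}

# Chain rule: H(X, Y) = H(X) + H(Y | X). Computing H(Y | X) requires the joint
# syllable table (not reproduced here); if consonant and vowel were independent,
# the joint entropy would reach the upper bound H(X) + H(Y):
print(H(consonants.values()) + H(vowels.values()))     # upper bound on the joint entropy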
from:http://blog.csdn.net/pipisorry/article/details/51695283
Ref: [http://zh.wikipedia.org]
Entropy and mutual information