PRML Study Notes - Information Theory


These are notes on the introduction to information theory in <Pattern Recognition and Machine Learning> (PRML).

For a random variable x, how much information does it carry?

When we observe a specific value of x, how much information do we obtain?

The amount of information can be expressed as the "degree of surprise" on learning the value of x.

If we observe an unlikely event, the surprise is clearly high and the information gained is large; at the other extreme, if we are told that an event we already knew would happen has happened, we gain no information.

We look for a quantity h(x), depending on the probability distribution p(x), that expresses the amount of information gained when x is observed. If x and y are independent, the information gained by observing both of them together should equal the sum of the information gained by observing each separately; correspondingly, p(x, y) = p(x) p(y).

h(x, y) = h(x) + h(y)  when  p(x, y) = p(x) p(y)

This relationship implies that h(x) must be given by the logarithm of p(x): h(x) = -log2 p(x).

The base of the logarithm is arbitrary; base 2 is used here, which means the amount of information is measured in bits.

Consider a sender who wants to transmit the value of the random variable to a receiver. The average amount of information transmitted is the expectation of h(x) under p(x), i.e. the entropy: H[x] = -Σ_x p(x) log2 p(x).

Suppose a random variable x has eight states. If all eight are equally likely, H[x] = 3 bits. If instead the probabilities are (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64), then H[x] = 2 bits; the non-uniform distribution has lower entropy. The 3 bits correspond to using a fixed three-bit code for each state; when some states have small probability we can give them longer codewords and give the high-probability states shorter ones, obtaining the shortest average code length. This is the principle behind Huffman coding, and it is based on the amount of information.
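
As a quick sanity check of these two numbers, here is a minimal Python sketch (mine, not from the book; the function name entropy_bits is my own):

    import math

    def entropy_bits(probs):
        # Shannon entropy H[x] = -sum p * log2 p, treating 0 * log 0 as 0
        return -sum(p * math.log2(p) for p in probs if p > 0)

    uniform = [1/8] * 8
    skewed = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

    print(entropy_bits(uniform))  # 3.0 bits
    print(entropy_bits(skewed))   # 2.0 bits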

Another physical interpretation of entropy is as a degree of disorder: consider N identical objects allocated among a set of bins, take the logarithm of the number of ways of making that allocation (the multiplicity), and divide by N. As N tends to infinity, this quantity converges to the entropy formula above.
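
A compressed version of that argument (my notation, not the book's), with n_i objects in bin i and p_i = n_i / N:

    W = \frac{N!}{\prod_i n_i!}, \qquad
    H = \frac{1}{N}\ln W = \frac{1}{N}\Big(\ln N! - \sum_i \ln n_i!\Big)

Applying Stirling's approximation ln N! ≈ N ln N - N then gives

    H \;\to\; -\sum_i \frac{n_i}{N}\,\ln\frac{n_i}{N} \;=\; -\sum_i p_i \ln p_i
    \qquad (N \to \infty)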

Extension to Continuous Variables

From here on the natural logarithm is used. The continuous case is conceptually similar to the discrete one; for the limiting argument over the integral see p. 52 of PRML. The result is the differential entropy: H[x] = -∫ p(x) ln p(x) dx.

For a Gaussian distribution, substituting into this formula gives H[x] = (1/2) (1 + ln(2π σ^2)).

Therefore, the larger the variance, the lower and flatter the Gaussian (the more spread-out the distribution), and the higher the entropy, i.e. the more information.

High entropy corresponds to a flat, featureless distribution; low entropy corresponds to a distribution with sharp variation (peaks and valleys).
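
To illustrate both points numerically, here is a small sketch (mine, not from the book) that compares the closed form (1/2)(1 + ln(2π σ^2)) with a direct numerical integration of -p(x) ln p(x) for two variances:

    import math

    def gaussian_pdf(x, mu, sigma):
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

    def differential_entropy_numeric(mu, sigma, width=10.0, steps=100000):
        # crude midpoint integration of -p(x) ln p(x) over mu +/- width * sigma
        lo, hi = mu - width * sigma, mu + width * sigma
        dx = (hi - lo) / steps
        total = 0.0
        for i in range(steps):
            x = lo + (i + 0.5) * dx
            p = gaussian_pdf(x, mu, sigma)
            total -= p * math.log(p) * dx
        return total

    for sigma in (1.0, 3.0):
        closed_form = 0.5 * (1.0 + math.log(2.0 * math.pi * sigma ** 2))
        print(sigma, closed_form, differential_entropy_numeric(0.0, sigma))
    # the larger variance gives the larger differential entropy (in nats)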

Conditional Entropy

Consider a joint distribution p(x, y). If the value of x is already known, then the additional information needed, on average, to specify the corresponding value of y is the conditional entropy: H[y|x] = -∬ p(y, x) ln p(y|x) dy dx.

We can then conclude that H[x, y] = H[y|x] + H[x].

That is, the information needed to describe x and y together equals the information needed to describe x plus the additional information needed to describe y once x is known.
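
The derivation is one line, using the product rule p(x, y) = p(y|x) p(x) (my sketch of the step the text is describing):

    H[x, y] = -\iint p(x, y)\,\ln p(x, y)\,dx\,dy
            = -\iint p(x, y)\,\big(\ln p(y \mid x) + \ln p(x)\big)\,dx\,dy
            = H[y \mid x] + H[x]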

Correspondingly, by symmetry, H[x, y] = H[x|y] + H[y].

In addition, the same relations hold in the discrete case, with the integrals replaced by sums: H[y|x] = -Σ_x Σ_y p(x, y) ln p(y|x).

A discrete example:

x = subject, y = yes/no answer, with joint probabilities: (Math, yes) 1/4, (History, no) 1/4, (CS, yes) 1/4, (Math, no) 1/4.
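
A minimal sketch (mine) computing the entropies for this example and confirming H[x] + H[y|x] = H[x, y]:

    import math

    # joint distribution p(x, y): x = subject, y = yes/no answer
    joint = {
        ("math", "yes"): 0.25,
        ("history", "no"): 0.25,
        ("cs", "yes"): 0.25,
        ("math", "no"): 0.25,
    }

    def H(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # marginal p(x)
    px = {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p

    # conditional entropy H[y|x] = -sum_{x,y} p(x, y) log2 p(y|x)
    H_y_given_x = -sum(p * math.log2(p / px[x]) for (x, y), p in joint.items())

    print(H(px.values()))     # H[x]    = 1.5 bits
    print(H_y_given_x)        # H[y|x]  = 0.5 bits
    print(H(joint.values()))  # H[x, y] = 2.0 bits = 1.5 + 0.5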

Relative Entropy and Mutual Information

Return to the problem of transmitting a random variable. Suppose the true distribution p(x) is unknown, and we use a known distribution q(x) to approximate it. If we then code as efficiently as possible based on q(x), the additional amount of information needed, on average, to specify the value of x is called the relative entropy, or KL divergence: KL(p||q) = -∫ p(x) ln q(x) dx - ( -∫ p(x) ln p(x) dx ) = -∫ p(x) ln( q(x) / p(x) ) dx.

Using the inequality property of convex functions (Jensen's inequality, extended from the discrete sum to the continuous integral), it can be proved that KL(p||q) >= 0, with equality if and only if p = q.
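
A compressed version of that proof (my sketch): because -ln is convex, Jensen's inequality gives

    \mathrm{KL}(p\|q) = -\int p(x)\,\ln\frac{q(x)}{p(x)}\,dx
      \;\ge\; -\ln\int p(x)\,\frac{q(x)}{p(x)}\,dx
      \;=\; -\ln\int q(x)\,dx \;=\; -\ln 1 \;=\; 0

with equality if and only if p(x) = q(x) (almost) everywhere.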

Therefore, the KL divergence describes the relationship between two distributions: it is a measure of the dissimilarity of p and q, i.e. of how different the two distributions are. (I think this formula also shows that the code length implied by H[x] is optimal; any other encoding can only be longer on average.)
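
A small numerical illustration of that coding view (a sketch with made-up distributions, not from the book): the average code length when the code is optimized for q but the data follow p is the cross-entropy, which exceeds H[p] by exactly KL(p||q).

    import math

    def entropy(p):
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    def cross_entropy(p, q):
        # average bits per symbol when symbols occur with probabilities p
        # but code lengths are chosen as if the distribution were q
        return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.5, 0.25, 0.125, 0.125]   # true distribution
    q = [0.25, 0.25, 0.25, 0.25]    # assumed distribution

    print(entropy(p))                        # 1.75 bits: the optimal average length
    print(cross_entropy(p, q))               # 2.00 bits: average length with the mismatched code
    print(cross_entropy(p, q) - entropy(p))  # 0.25 bits: KL(p || q), always >= 0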

Mutual information is defined as the KL divergence between the joint distribution and the product of the marginals: I[x, y] = KL( p(x, y) || p(x) p(y) ) = -∬ p(x, y) ln( p(x) p(y) / p(x, y) ) dx dy.

If the two random variables are independent, then p(x, y) = p(x) p(y), so this value measures the degree of dependence between x and y: if it is 0 they are independent, and the larger the KL value, the stronger the dependence.

TODO (mutual information vs. correlation? A correlation coefficient of 0 only indicates the absence of linear dependence; is that the difference?)

Mutual information is related to conditional entropy by I[x, y] = H[x] - H[x|y] = H[y] - H[y|x]. That is, if x and y are independent, the amount of information in x after observing y is equal to the amount of information in x by itself (y gives no additional help),

in which case H[x] = H[x|y] and the mutual information is zero.
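
Reusing the math/history/CS example from above (a sketch, variable names mine), mutual information can be computed either as the KL divergence between the joint and the product of marginals or as H[x] - H[x|y]; the two agree:

    import math

    joint = {
        ("math", "yes"): 0.25,
        ("history", "no"): 0.25,
        ("cs", "yes"): 0.25,
        ("math", "no"): 0.25,
    }

    # marginals p(x) and p(y)
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p

    # I[x, y] = KL( p(x, y) || p(x) p(y) )
    I = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items())

    # cross-check: I[x, y] = H[x] - H[x|y]
    H_x = -sum(p * math.log2(p) for p in px.values())
    H_x_given_y = -sum(p * math.log2(p / py[y]) for (x, y), p in joint.items())

    print(I, H_x - H_x_given_y)  # both 0.5 bits; 0 would mean x and y are independent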

Applications of Mutual Information

1. Feature selection (this part is from <Introduction to Information Retrieval>, the chapter on text classification and Naive Bayes)

For text classification, if every word is used as a feature for training, the data may be too large, and unrepresentative words act as noise that hurts classification accuracy. How do we choose representative feature words? For example, for a category C we select K words as features; for the category "China" these might be China, Chinese, Beijing, Yuan, Shanghai, Hong Kong, Xinhua, .... The book introduces two approaches, mutual information and the chi-square test; here only the former is introduced.

The idea is to use mutual information to express how much information a word carries about the category. The information in the training set is: which words appear in each document, and which class each document belongs to.

If a word's distribution within this category is the same as its distribution over the whole document collection, then we consider its mutual information with the category to be 0. Conversely, mutual information is maximal when a word appears in a document if and only if the document belongs to this category.

For example, suppose there are 4 documents: C1, C2, B1, B2. The first two belong to class C, and the last two do not.

Case 1:

The word "C++" appears in, and only in, C1 and B1. Then the mutual information between "C++" and class C is 0.

Case 2:

The word "C++" appears in, and only in, C1 and C2. Then the mutual information between "C++" and class C reaches its maximum of 1 (for a two-state variable, the maximum of H[x] is 1 bit).

Case 3:

The word "C++" appears only in C1. The mutual information is then a value between 0 and 1.

Mutual information between the term and the class is defined as follows:

I(U; C) = Σ_{e_t in {1,0}} Σ_{e_c in {1,0}} P(U = e_t, C = e_c) log2 [ P(U = e_t, C = e_c) / ( P(U = e_t) P(C = e_c) ) ]

The random variable U takes the value e_t = 1 when the document contains the term t, and e_t = 0 when it does not.

The random variable C takes the value e_c = 1 when the document belongs to category c, and e_c = 0 otherwise.

Using maximum-likelihood estimates, the formula above can be written in terms of document counts. N is the total number of documents; N11 is the number of documents that contain term t and belong to class c, N10 the number that contain t but do not belong to c, N01 the number that do not contain t but belong to c, and N00 the number that do neither. With N1. = N11 + N10, N0. = N01 + N00, N.1 = N11 + N01 and N.0 = N10 + N00:

I(U; C) = (N11/N) log2( N·N11 / (N1.·N.1) ) + (N01/N) log2( N·N01 / (N0.·N.1) ) + (N10/N) log2( N·N10 / (N1.·N.0) ) + (N00/N) log2( N·N00 / (N0.·N.0) )
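
Here is a sketch implementing this count-based formula and applying it to the four-document example (the helper name mi_term_class and the 0 * log 0 = 0 convention are mine):

    import math

    def mi_term_class(n11, n10, n01, n00):
        # n11: docs containing the term and in the class
        # n10: containing the term, not in the class
        # n01: not containing the term, in the class
        # n00: not containing the term, not in the class
        n = n11 + n10 + n01 + n00
        n1_, n0_ = n11 + n10, n01 + n00   # docs with / without the term
        n_1, n_0 = n11 + n01, n10 + n00   # docs in / not in the class
        def term(nij, ni_, n_j):
            # treat 0 * log 0 as 0
            return 0.0 if nij == 0 else (nij / n) * math.log2(n * nij / (ni_ * n_j))
        return (term(n11, n1_, n_1) + term(n01, n0_, n_1) +
                term(n10, n1_, n_0) + term(n00, n0_, n_0))

    # Documents: C1, C2 in class C; B1, B2 not in class C.
    print(mi_term_class(1, 1, 1, 1))  # case 1: "C++" in C1 and B1 only -> 0.0
    print(mi_term_class(2, 0, 0, 2))  # case 2: "C++" in C1 and C2 only -> 1.0
    print(mi_term_class(1, 0, 1, 2))  # case 3: "C++" in C1 only        -> ~0.31
    print(mi_term_class(0, 1, 2, 1))  # "C++" in B1 only                -> ~0.31 as well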

For the above example

Case 1: N11 = N10 = N01 = N00 = 1, so each term is (1/4) log2( 4·1 / (2·2) ) = 0 and I(U; C) = 0.

Case 2: N11 = N00 = 2 and N10 = N01 = 0, so I(U; C) = (2/4) log2( 4·2 / (2·2) ) + (2/4) log2( 4·2 / (2·2) ) = 0.5 + 0.5 = 1.

Personally, I think the zero counts inside the logarithms above can either be handled by smoothing or by defining those terms to be 0, because

in the limit n -> 0 we have n log n -> 0. The value for case 2 can also be computed via the entropy decomposition, as follows:

H[C] = 1 // both states of C have probability 0.5

H[C|U] = 0.5 * H[C | U: e_t = 1] + 0.5 * H[C | U: e_t = 0] = 0 + 0 = 0 // given U, C has only one possible state

So I(U; C) = H[C] - H[C|U] = 1 - 0 = 1

Case 3: N11 = 1, N10 = 0, N01 = 1, N00 = 2, so I(U; C) = (1/4) log2( 4·1 / (1·2) ) + (1/4) log2( 4·1 / (3·2) ) + (2/4) log2( 4·2 / (3·2) ) ≈ 0.25 - 0.146 + 0.208 ≈ 0.31.

Note that if, analogously, "C++" appears in B1 and only in B1,

then the mutual information with class C is the same value as in case 3, but this actually indicates that "C++" does not belong to class C! That is, if a document contains "C++", it is probably not in class C. (Question: could this cause problems, i.e. side effects, in certain scenarios?)

I think this method could be applied to the search and click logs of B2C and C2C websites: treat the product as the document and the search query as the term, and compute the mutual information between a term and each category in the same way, in order to locate the category intent behind a user's search. As a next step I will run a test and paste the results, to see whether this computation is more reliable than a simple cosine-similarity calculation. (TODO: the side effect mentioned above should not be an issue here ~)

In addition, this seems somewhat similar to TF-IDF in spirit: words that appear throughout the whole collection, such as many stop words, carry little information about the class. However, the mutual information model above does not take into account how many times a word appears within a document.

2. Application in new word discovery

http://www.cnblogs.com/TtTiCk/archive/2008/06/25/1229480.html

Other materials

• Chi-square test

http://www.cnblogs.com/finallyliuyu/archive/2010/09/06/1819643.html

• Mutual information and correlation coefficient

http://controls.engin.umich.edu/wiki/index.php/Correlation_and_Mutual_Information#Sample_Correlation_Coefficient
