Chapter 6: Measurement and function of information


1. Information Entropy

We use a vivid example to illustrate this concept: guessing which team will win the World Cup. Assume the 32 teams are numbered from 1 to 32. I can ask, "Is the champion among teams 1-16?" If the answer is no, the champion must be among teams 17-32, and so on. By halving the candidates each time, we need at most log2 32 = 5 such yes/no questions to identify the champion. In fact we may not need all five, because strong teams such as Brazil, Germany, and Italy are far more likely to win than the others; if we put the few strong teams into one group when splitting, we can often finish in fewer than five questions. Shannon used the bit to measure the amount of information, and he pointed out that the exact amount of information here is:

H = -(p_1 \log_2 p_1 + p_2 \log_2 p_2 + \cdots + p_{32} \log_2 p_{32})        (6.1)

Here p_i is the probability that team i wins the championship, and H is called the information entropy, measured in bits. When all the p_i are equal, H reaches its maximum value of 5 bits; in every other case H is less than 5.
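For instance, in the equal-probability case every p_i = 1/32, so

H = -\sum_{i=1}^{32} \frac{1}{32} \log_2 \frac{1}{32} = \log_2 32 = 5 bits,

which matches the five yes/no questions of the guessing game.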

 

Let us now define entropy formally. For a discrete random variable X (for example, the identity of the World Cup champion) with probability distribution P(x), the entropy of X is:

H(X) = -\sum_{x \in X} P(x) \log_2 P(x)        (6.2)
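As a concrete illustration of formulas (6.1) and (6.2), here is a minimal Python sketch; the skewed probabilities below are made up for illustration:

    import math

    def entropy(probs, base=2.0):
        # Shannon entropy H = -sum p * log(p); base 2 gives the result in bits.
        return -sum(p * math.log(p, base) for p in probs if p > 0)

    # 32 equally likely teams: exactly 5 bits, the maximum.
    uniform = [1 / 32] * 32
    print(entropy(uniform))  # 5.0

    # A few strong teams dominate: the entropy drops below 5 bits.
    skewed = [0.2, 0.15, 0.15, 0.1] + [0.4 / 28] * 28
    print(entropy(skewed))   # about 4.1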

2. Functions of information

 

Leaving the anecdotes aside: any matter (say, a strategic decision somewhere) contains randomness, that is, uncertainty, which we denote by U. The only way to eliminate this uncertainty from the outside is to introduce information I, and the amount of information needed depends on the size of the uncertainty: only when I > U is the uncertainty removed completely. When I < U, the information eliminates part of the uncertainty, and the new uncertainty is

U' = U - I        (6.3)
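As a small worked example in the World Cup setting above: with 32 equally likely teams the uncertainty is U = log_2 32 = 5 bits; the answer to a single yes/no question such as "Is the champion among teams 1-16?" carries I = 1 bit, so the remaining uncertainty is U' = U - I = 4 bits, exactly the uncertainty of 16 equally likely teams.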

 

Web search is essentially the task of finding, among billions of web pages, the ones relevant to what the user typed. Billions of candidate pages represent a large uncertainty U, and our goal is to reduce U, that is, to eliminate as much of the uncertainty as possible. If the available information is not enough while the number of candidate pages is still huge, the correct approach is to mine new, hidden information, such as the quality of the web pages; if that is still not enough, ask the user. The wrong approach is to play numerical games with formulas over the keywords, or even to introduce artificial assumptions, which is no better than blind guessing.

 

The more relevant information we have, the less uncertainty remains about the random event. The information can be either direct or indirect. To account for such "related" information, we introduce the concept of conditional entropy.

 

Assume that X and Y are two random variables, and X is the one we want to know about. If we know its distribution P(x), then we know the entropy of X from formula (6.2). Suppose we also know something about Y: the probability that X and Y occur together (the joint probability) and the distribution of X for each value of Y (the conditional probability). Then the conditional entropy of X given Y is:

H(X \mid Y) = -\sum_{x \in X, y \in Y} P(x, y) \log_2 P(x \mid y)

It can be proved that H(X) >= H(X | Y): knowing Y never increases the uncertainty about X. Applied to statistical language models, the entropy of the unigram model is the largest and the conditional entropy of the trigram model, which conditions on the two preceding words, is the smallest; that is, the trigram model is the best of the three.
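A minimal sketch of this computation, using a made-up joint distribution of two binary variables (my own illustration, not from the original text):

    import math

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def conditional_entropy(joint):
        # H(X|Y) = -sum over (x, y) of P(x, y) * log2 P(x|y), with P(x|y) = P(x, y) / P(y).
        p_y = {}
        for (x, y), p in joint.items():
            p_y[y] = p_y.get(y, 0.0) + p
        return -sum(p * math.log2(p / p_y[y]) for (x, y), p in joint.items() if p > 0)

    # Toy joint distribution of two correlated binary variables X and Y.
    joint = {("a", "u"): 0.4, ("a", "v"): 0.1, ("b", "u"): 0.1, ("b", "v"): 0.4}
    print(entropy([0.5, 0.5]))         # H(X)   = 1.0 bit
    print(conditional_entropy(joint))  # H(X|Y) is about 0.72 bits, never more than H(X)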

 

In summary, the function of information is to eliminate uncertainty, and a large number of natural language processing problems come down to finding the relevant information.

 

3. Extended reading: Application of Information Theory in Information Processing

 

3.1 Mutual Information

 

Shannon proposed the concept of "mutual information" in information theory to quantify the correlation between two random events. For two random variables X and Y, their mutual information is defined as:

I(X; Y) = \sum_{x \in X, y \in Y} P(x, y) \log_2 \frac{P(x, y)}{P(x) P(y)}

The formula has a clear interpretation: mutual information is exactly the amount of uncertainty about X that is eliminated once Y is known, that is, I(X; Y) = H(X) - H(X | Y).

 

In natural language processing, P(x, y), P(x), and P(y) are all easy to estimate from a corpus by counting, so the mutual information is straightforward to compute.
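A sketch of that computation, reusing the toy joint distribution from the conditional-entropy example above (in practice the probabilities would come from corpus counts):

    import math

    def mutual_information(joint):
        # I(X;Y) = sum over (x, y) of P(x, y) * log2( P(x, y) / (P(x) * P(y)) ).
        p_x, p_y = {}, {}
        for (x, y), p in joint.items():
            p_x[x] = p_x.get(x, 0.0) + p
            p_y[y] = p_y.get(y, 0.0) + p
        return sum(p * math.log2(p / (p_x[x] * p_y[y]))
                   for (x, y), p in joint.items() if p > 0)

    joint = {("a", "u"): 0.4, ("a", "v"): 0.1, ("b", "u"): 0.1, ("b", "v"): 0.4}
    print(mutual_information(joint))  # about 0.28 bits, equal to H(X) - H(X|Y)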

 

Here is a practical example. In machine translation, the English word "Bush" may refer to US President Bush or to a bush, the plant. How do we translate it correctly? One could write a rule such as "when Bush follows the word president, treat it as a person's name," but then every other ambiguous word needs its own rules, and in the end countless rules would be required to constrain the translation results. Worse, such rules are not reliable: a president is not always a person, since many international organizations rotate their presidency among member countries. The better way is to use mutual information. When "Bush" means the president, it tends to appear together with words such as "president," "Washington," "the United States," and "the White House"; when it means the plant, it tends to appear with words such as "soil," "plant," and "wild." With these two groups of words, we can look at the context in which "Bush" occurs and decide which meaning is intended.
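A hypothetical sketch of that idea follows; the word lists and the mutual-information weights are invented for illustration and would, in a real system, be computed from a large corpus:

    # Words that (hypothetically) have high mutual information with each sense of "Bush",
    # with made-up scores standing in for corpus-derived mutual information.
    PRESIDENT_SENSE = {"president": 2.1, "washington": 1.7, "united": 1.5, "white": 1.4}
    PLANT_SENSE = {"soil": 1.9, "plant": 1.8, "wild": 1.6, "garden": 1.3}

    def disambiguate_bush(context_words):
        # Pick the sense whose associated words overlap most strongly with the context.
        score_person = sum(PRESIDENT_SENSE.get(w.lower(), 0.0) for w in context_words)
        score_plant = sum(PLANT_SENSE.get(w.lower(), 0.0) for w in context_words)
        return "President Bush" if score_person >= score_plant else "bush (the plant)"

    print(disambiguate_bush("the president met Bush at the White House in Washington".split()))
    print(disambiguate_bush("a wild bush grew in the dry soil".split()))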

 

3.2 Relative Entropy

 

Relative entropy, also known as the Kullback-Leibler divergence (and sometimes loosely called cross entropy), is likewise used to measure relatedness, but what it measures is how different two positive-valued functions are. It is defined as:

KL(f \| g) = \sum_{x \in X} f(x) \log_2 \frac{f(x)}{g(x)}

We do not need to worry about how this formula is derived or computed; computers handle that. We only need to remember three conclusions (illustrated in the sketch after the list):

1. For two completely identical functions, the relative entropy is 0.

2. The greater the relative entropy, the greater the difference between the two functions.

3. For probability distributions or probability density functions, whose values are non-negative, relative entropy measures the difference between the two random distributions.
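A minimal sketch verifying those three conclusions on made-up distributions:

    import math

    def relative_entropy(f, g):
        # KL(f || g) = sum over x of f(x) * log2( f(x) / g(x) ), over a shared support.
        return sum(p * math.log2(p / q) for p, q in zip(f, g) if p > 0)

    p = [0.5, 0.3, 0.2]
    q = [0.4, 0.4, 0.2]
    print(relative_entropy(p, p))                # 0.0, identical distributions
    print(relative_entropy(p, q))                # about 0.04, a small difference
    print(relative_entropy(p, [0.1, 0.1, 0.8]))  # about 1.24, a much larger difference

Note that relative entropy is not symmetric: KL(f || g) generally differs from KL(g || f).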

Relative entropy has many applications. For example, to decide whether two commonly used words are synonymous, we can compare their probability distributions (in terms of syntax and semantics) across different texts; or, to judge whether two articles cover similar content, we can compare the distributions of the words in them. From relative entropy we can also derive one of the most important concepts in information retrieval: TF-IDF.
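A minimal sketch of TF-IDF in its most common form, term frequency times inverse document frequency; the tiny corpus is made up for illustration:

    import math

    def tf_idf(term, doc_tokens, corpus):
        # Term frequency in this document, times log(total documents / documents containing the term).
        tf = doc_tokens.count(term) / len(doc_tokens)
        df = sum(1 for d in corpus if term in d)
        return tf * math.log(len(corpus) / df) if df else 0.0

    corpus = [["information", "entropy", "measures", "uncertainty"],
              ["entropy", "of", "a", "language", "model"],
              ["the", "world", "cup", "champion"]]
    doc = corpus[0]
    print(tf_idf("entropy", doc, corpus))      # appears in several documents, so lower weight
    print(tf_idf("uncertainty", doc, corpus))  # appears in only one document, so higher weight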

Information entropy is a measure of uncertainty, so it can be used directly to gauge whether a statistical language model is good. Because a language model relies on context, what we actually use is conditional entropy; and because the probability distribution estimated from the training corpus may deviate from the distribution of the text the model is applied to, relative entropy comes in as well. Building on these entropy concepts, Jelinek defined the notion of language-model perplexity, which is used to measure the quality of a language model.
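A minimal sketch of that measure; the per-word probabilities below are made up, whereas a real evaluation would take them from a language model run over held-out text:

    import math

    def perplexity(word_probs):
        # Perplexity = 2 raised to the average negative log2-probability assigned to each word.
        cross_entropy = -sum(math.log2(p) for p in word_probs) / len(word_probs)
        return 2 ** cross_entropy

    good_model = [0.2, 0.1, 0.25, 0.05, 0.15]
    weak_model = [0.02, 0.01, 0.05, 0.005, 0.02]
    print(perplexity(good_model))  # about 7.7: as if choosing among roughly 8 equally likely words
    print(perplexity(weak_model))  # about 63: far more uncertain, so a worse model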

 

 

 

 
