The Beauty of Mathematics, Part 4: How to Measure Information?

Author: Wu Jun, Google researcher

Google has always been committed to organizing the world's information and making it accessible and useful to everyone. So how should a piece of information be measured?

Information is a very abstract concept. We often say there is a lot of information, or very little information, but it is hard to say exactly how much there is. For example, how much information is contained in a 500,000-character Chinese book? It was not until 1948, when Shannon proposed the concept of "information entropy", that the problem of quantitatively measuring information was solved.

The amount of information in a message is directly related to its uncertainty. To understand something highly uncertain, or something we know nothing about, we need a great deal of information. Conversely, if we already understand something well, very little information is needed to clarify it. From this perspective, the amount of information can be regarded as equal to the amount of uncertainty.

How can we quantify the amount of information? Let's look at an example. The World Cup is coming up, and everyone wants to know who will be the champion. Suppose I missed the tournament and ask a viewer who knows the result, "Which team won the championship?" He does not want to tell me directly; he wants me to guess, and for each guess he charges me one yuan to tell me whether my guess is right. So how much would I have to pay him to find out who the champion is? I can number the teams from 1 to 32 and ask, "Is the champion among teams 1 to 16?" If he says yes, I ask, "Is the champion among teams 1 to 8?" If he says no, I naturally know the champion is among teams 9 to 16. In this way, after only five questions I know which team is the champion. So the message of who won the World Cup is worth only five yuan.
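To make the halving strategy concrete, here is a minimal Python sketch (my own illustration, not from the original article; the function name and team numbering are hypothetical) that counts how many yes/no questions the strategy needs:

```python
def questions_needed(num_teams: int, champion: int) -> int:
    """Count the yes/no questions needed to pin down the champion by halving."""
    low, high = 1, num_teams          # candidate team numbers, inclusive
    questions = 0
    while low < high:
        mid = (low + high) // 2
        questions += 1                # ask: "Is the champion in the range low..mid?"
        if champion <= mid:
            high = mid                # answer: yes -> keep the lower half
        else:
            low = mid + 1             # answer: no  -> keep the upper half
    return questions

# Whichever of the 32 teams wins, five questions always suffice.
assert all(questions_needed(32, team) == 5 for team in range(1, 33))
print(questions_needed(32, 7))        # -> 5
```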

Of course, Shannon did not measure information in money; he used the concept of the "bit". A bit is one binary digit, and a byte in a computer is eight bits. In the example above, the message carries five bits of information. (If one day sixty-four teams make it to the finals, the message "who won the World Cup" will carry six bits of information, because we would have to guess one more time.) Readers may have noticed that the number of bits of information is the base-2 logarithm of the number of possible outcomes. (log 32 = 5, log 64 = 6.)
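As a quick sanity check of that relationship (a tiny illustration of my own, using Python's standard math module):

```python
import math

# The number of yes/no questions equals the base-2 logarithm
# of the number of equally likely outcomes.
print(math.log2(32))   # -> 5.0 bits for 32 teams
print(math.log2(64))   # -> 6.0 bits for 64 teams
```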

Some readers may notice that we do not actually need five guesses to find the champion, because teams like Brazil, Germany, and Italy are far more likely to win than teams like Japan, the United States, and South Korea. Therefore, in the first guess we need not split the 32 teams into two groups of 16; instead we can put the few most likely teams in one group and all the others in the second group, and then ask whether the champion is among the favorites. We repeat this process, each time grouping the remaining candidates according to their probability of winning, until the champion is found. In this way we may find the answer in only three or four guesses. Therefore, when the teams are not all equally likely to win (their probabilities differ), the information in "who wins the World Cup" is less than five bits. Shannon pointed out that the precise amount of information should be

-(p1 * log p1 + p2 * log p2 + ... + p32 * log p32),

where p1, p2, ..., p32 are the probabilities of the 32 teams winning the championship. Shannon called this quantity "entropy", generally denoted by the symbol H and measured in bits. Interested readers can verify that when all 32 teams are equally likely to win, the corresponding entropy is exactly five bits. Readers with some mathematical background can also prove that the value of the formula above can never exceed five. For any random variable X (such as the identity of the champion team), its entropy is defined as

H(X) = - Σ P(x) * log P(x),

where the sum runs over all possible values x of X.

The greater the uncertainty of a variable, the greater its entropy, and the more information is needed to determine it.
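Here is a minimal Python sketch (my own illustration, not from the article; the uneven probabilities below are made up) of how the entropy formula behaves for the World Cup example:

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum(p * log2(p)), in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# All 32 teams equally likely: the entropy is exactly five bits.
uniform = [1 / 32] * 32
print(entropy(uniform))          # -> 5.0

# A made-up uneven distribution: a few favorites carry most of the probability,
# so the information in "who wins the World Cup" drops below five bits.
favorites = [0.25, 0.20, 0.15, 0.10] + [0.30 / 28] * 28
print(entropy(favorites))        # -> about 3.7 bits
```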

With the concept of entropy, we can answer the question raised at the beginning of this article: how much information, on average, does a 500,000-character Chinese book contain? We know there are about 7,000 commonly used Chinese characters (Level 1 and Level 2 of the national standard). If every character were equally likely, we would need about 13 bits (that is, a 13-digit binary number) to represent each one. But the usage of Chinese characters is very uneven: in practice, the most common 10% of characters account for more than 95% of running text. Therefore, even ignoring context and considering only each character's independent probability, the entropy per character is only about 8-9 bits. If context is taken into account as well, the entropy per character drops to only about 5 bits. So the information in a 500,000-character Chinese book is about 2.5 million bits. Compressed with a good algorithm, the whole book can be stored in a file of roughly 300 KB. If instead we store it directly in the two-byte national-standard encoding, it takes about 1 MB, roughly three times the size of the compressed file. The gap between these two figures is what information theory calls "redundancy". It should be pointed out that the 2.5 million bits here is an average; a book of the same length can contain much less information. If a book has a lot of repeated content, its information content is small and its redundancy is large.
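The arithmetic in the paragraph above can be checked with a few lines of Python (a back-of-the-envelope sketch of the estimates quoted in the text, not a measurement of any real book):

```python
import math

chars_in_book = 500_000   # length of the book, in characters
common_chars  = 7_000     # commonly used Chinese characters

# If all 7,000 characters were equally likely:
print(math.ceil(math.log2(common_chars)))   # -> 13 bits per character

# With context taken into account, about 5 bits per character:
bits_per_char = 5
total_bits = chars_in_book * bits_per_char
print(total_bits)                            # -> 2,500,000 bits
print(total_bits / 8 / 1024)                 # -> about 305 KB after ideal compression

# Stored directly in a two-byte national-standard encoding:
raw_bytes = chars_in_book * 2
print(raw_bytes / 1024 / 1024)               # -> roughly 1 MB
print(raw_bytes * 8 / total_bits)            # -> redundancy factor of about 3
```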

The redundancy of different languages varies considerably, and among all languages the redundancy of Chinese is relatively low. This is consistent with the common impression that "Chinese is the most concise language."

In the next installment, we will introduce applications of information entropy in information processing, along with two related concepts: mutual information and relative entropy.

Readers interested in the information entropy of Chinese can read "Language Information Entropy and the Complexity of Language Models", a paper that Professor Wang Zuoying and I published in Acta Electronica Sinica.
