Beauty of Mathematics, Series 7: The Application of Information Theory in Information Processing

Posted by: Wu Jun, Google researcher

We have already introduced information entropy, which is the foundation of information theory. This time we will talk about the application of information theory in natural language processing.

First, let us look at the relationship between information entropy and the language model. When we introduced the language model in Series 1, we did not discuss how to quantitatively measure how good a language model is. A reader might naturally think that, since a language model reduces errors in speech recognition and machine translation, one could simply plug it into a speech recognition system or a machine translation program and compare error rates: a better language model should produce a lower error rate. This idea is correct, and it is how speech recognition and machine translation are tested today. However, this kind of test is neither direct nor convenient for the people who develop language models, and it is hard to work backwards from an error rate to a quantitative measure of the model itself. In fact, when Frederick Jelinek was first studying language models, there was no usable speech recognition system in the world, let alone machine translation. We know that a language model uses context to predict the current word: the better the model, the more accurate the prediction, and the less uncertainty there is about the current word.

Information entropy is precisely a measure of uncertainty, so it can be used directly to measure the quality of a statistical language model. Jelinek derived from information entropy a concept called Perplexity to measure language models directly: the smaller a model's perplexity, the better the model. Dr. Kai-Fu Lee, when introducing the Sphinx speech recognition system he invented, mentioned that with no language model at all (i.e., a zero-gram model) the perplexity is 997; in other words, at each position in a sentence there are 997 equally plausible words. If a (bigram) language model only considers which words can follow which, without considering the probabilities of those pairings, the perplexity drops to 60. That is much better than using no language model, but still much worse than a bigram model that also uses the pairing probabilities, whose perplexity is only 20.
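
Below is a minimal sketch of how perplexity can be computed for a toy bigram model. The tiny corpus, the add-one smoothing, and the function name bigram_perplexity are illustrative assumptions of mine, not the setup behind the Sphinx numbers above; perplexity is simply 2 raised to the average per-word cross entropy.

import math
from collections import Counter

def bigram_perplexity(train_tokens, test_tokens):
    # Train a bigram model with add-one (Laplace) smoothing, then return
    # its perplexity on the test tokens: 2 ** (average per-word cross entropy).
    vocab = set(train_tokens) | set(test_tokens)
    unigram_counts = Counter(train_tokens)
    bigram_counts = Counter(zip(train_tokens, train_tokens[1:]))

    def prob(prev, word):
        # Smoothing keeps unseen pairs from driving the probability to zero.
        return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + len(vocab))

    pairs = list(zip(test_tokens, test_tokens[1:]))
    log_prob = sum(math.log2(prob(prev, word)) for prev, word in pairs)
    cross_entropy = -log_prob / len(pairs)
    return 2 ** cross_entropy

train = "the cat sat on the mat the cat ate the fish".split()
test = "the cat sat on the mat".split()
print(round(bigram_perplexity(train, test), 2))  # lower is better

A model that predicts the test text well assigns it high probability, so its cross entropy, and therefore its perplexity, is small.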

The two other important concepts in information theory, next only to entropy, are Mutual Information and Relative Entropy (Kullback-Leibler Divergence).

Mutual information extends the idea of information entropy: it measures how correlated two random events are. For example, the random event "it rains in Beijing today" is strongly correlated with the random variable "air humidity", but it has nothing to do with whether Yao Ming's Houston Rockets will beat the Bulls. Mutual information quantifies exactly this kind of correlation. In natural language processing, we often need to measure the relevance of linguistic phenomena. For instance, the hardest problem in machine translation is word-sense ambiguity. The word "Bush" can be the name of a US President or it can mean shrubs. (There is a joke that the name of the previous US presidential candidate, Kerry, was translated by some machine translation systems as a small Irish cow, because Kerry is also the name of a breed of Irish cattle.) So how can such a word be translated correctly? It is natural to think of grammar rules and sentence parsing, but so far no grammar has solved this problem well. The practical method is to use mutual information. The idea is roughly as follows: first, from a large amount of text, find the words with the largest mutual information with President Bush, such as president, United States, Congress, and Washington; then, with the same method, find the words with the largest mutual information with the shrub sense, such as soil, plants, and wildlife. With these two groups of words in hand, when translating "Bush" we simply check which group of related words appears in the context. This method was first proposed by Gale, Church, and Yarowsky.
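
The following sketch illustrates the idea under stated assumptions: a small helper scores a word pair by pointwise mutual information from co-occurrence counts, and two hypothetical word sets (standing in for the high-mutual-information neighbors of each sense) decide which translation of "Bush" to choose. The counts, word sets, and function names are mine for illustration; this is not Gale, Church, and Yarowsky's actual implementation.

import math

def pmi(joint_count, count_x, count_y, total):
    # Pointwise mutual information of two words from co-occurrence counts:
    # PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ).
    p_xy = joint_count / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical related-word sets, standing in for the words that the PMI step
# would find to co-occur most strongly with each sense of "Bush".
PRESIDENT_SENSE = {"president", "united", "states", "congress", "washington"}
SHRUB_SENSE = {"soil", "plant", "plants", "wildlife", "garden"}

def translate_bush(context_tokens):
    # Choose the sense whose related words appear more often in the context.
    president_hits = len(PRESIDENT_SENSE & set(context_tokens))
    shrub_hits = len(SHRUB_SENSE & set(context_tokens))
    return "President Bush" if president_hits >= shrub_hits else "bush (the plant)"

print(round(pmi(joint_count=40, count_x=50, count_y=60, total=1000), 2))  # strongly associated pair
print(translate_bush("the president addressed congress in washington".split()))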

At the time, Yarowsky was a Ph.D. student of Professor Mitch Marcus, the natural language processing master at the University of Pennsylvania, and he spent much of his time in the offices of Church and others at Bell Labs. Perhaps he was eager to graduate: with the help of Gale and others, he came up with the fastest and, as it turned out, best solution to this ambiguity problem in translation, which is the method described above. This seemingly simple method worked so well that it astonished his colleagues. Yarowsky took only three years to earn his doctorate from Marcus, while his fellow students took six years on average.

Another important concept in information theory is relative entropy, which in some literature is called cross entropy. In English it is named Kullback-Leibler Divergence, after its two originators, Kullback and Leibler. Relative entropy measures whether two positive-valued functions are similar: for two completely identical functions, the relative entropy is zero. In natural language processing, relative entropy can be used to judge whether two frequently used words are synonymous (in syntax and in semantics), or whether the contents of two articles are similar. Using relative entropy, we can also derive one of the most important concepts in information retrieval: term frequency / inverse document frequency (TF/IDF). In a later installment I will introduce how search results are ranked by relevance, and the concept of TF/IDF will be used there. Relative entropy and TF/IDF are also used in news classification.
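
As an illustration of the definition only (not the author's system), here is a minimal sketch that computes the relative entropy D(P || Q) between the word-frequency distributions of two documents; similar documents give a value near zero, while unrelated documents give a much larger one. The toy documents and the epsilon smoothing for unseen words are assumptions made for this example.

import math
from collections import Counter

def kl_divergence(p_counts, q_counts, epsilon=1e-9):
    # Relative entropy D(P || Q) = sum over words of p(w) * log2( p(w) / q(w) ).
    # It is zero when the two distributions are identical and grows as they diverge.
    # A tiny epsilon stands in for words unseen in Q so q(w) is never zero.
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    divergence = 0.0
    for w in vocab:
        p = p_counts.get(w, 0) / p_total
        q = q_counts.get(w, 0) / q_total or epsilon
        if p > 0:
            divergence += p * math.log2(p / q)
    return divergence

doc_a = Counter("the market rose as investors bought shares".split())
doc_b = Counter("shares rose and the market gained as investors bought more".split())
doc_c = Counter("the team won the match in the final minute".split())

print(round(kl_divergence(doc_a, doc_b), 3))  # smaller: similar topics
print(round(kl_divergence(doc_a, doc_c), 3))  # larger: different topics

Note that relative entropy is not symmetric: D(P || Q) and D(Q || P) are generally different, which is why it measures divergence rather than a true distance.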

If you are interested in information theory and have some mathematical background, you can read the monograph "Elements of Information Theory" by Professor Thomas Cover of Stanford University:
http://www.amazon.com/gp/product/0471062596/ref=nosim/103-7880775-7782209?n=283155
http://www.cnforyou.com/query/bookdetail1.asp?viBookCode=17909
Professor Cover is one of the most authoritative experts on information theory today.
