The Beauty of Mathematics, Series 1: Statistical Language Models

Last Update:2018-12-06 Source: Internet

Author: User


Author: Wu Jun

Http://www.google.com.hk/ggblog/googlechinablog/2006/04/blog-post_7327.html

After reading the first article, I decided to buy a book.

Preface

You may not believe it, but mathematics is the best tool for solving problems in information retrieval and natural language processing. It can describe the practical problems in these fields clearly and offer elegant solutions. Whenever people use mathematical tools to crack a language problem, they cannot help marveling at the beauty of mathematics. In this series we hope to introduce some of these mathematical tools and show how we use them to develop Google products.

Series 1: Statistical Language Models

Google's mission is to integrate the world's information, so we have long been committed to studying how to make machines better at understanding and processing information and language. For a long time, people have dreamed of machines that could take over the work of translating languages, recognizing speech, recognizing text (whether printed or handwritten), and automatically searching massive volumes of literature. All of this requires machines to understand language. Human language, however, is arguably the most complex and dynamic form of information. The obvious approach was to have machines imitate the way humans learn: learn the grammar of human language, parse sentences, and so on. Especially after Noam Chomsky, the greatest linguist of all time, introduced "formal languages", people grew even more convinced that grammar rules were the way to process text. Unfortunately, decades of work on rule-based computer processing of language produced almost no breakthroughs.

In fact, decades earlier, the mathematician and founder of information theory, Claude Shannon, had proposed using mathematical methods to handle natural language. Unfortunately, the computers of his day could not process information on the scale required, so his idea received little attention. Only in the early 1970s, with the arrival of fast computers built on large-scale integrated circuits, did Shannon's dream begin to come true.

The first person to successfully apply mathematical methods to natural language processing was Fred Jelinek, a master of speech and language processing. At the time, Jelinek was on sabbatical at IBM, where he led a group of outstanding scientists in using computers to tackle human language problems. The statistical language model was proposed during this period.

Here is an example. In many fields that involve natural language processing, such as machine translation, speech recognition, recognition of printed or handwritten text, spelling correction, Chinese character input, and document search, we need to know whether a sequence of words forms a sentence that people can understand before presenting it to the user. A simple statistical model can answer this question.

Let S denote a sequence of words w_1, w_2, ..., w_n arranged in a specific order; in other words, S can stand for a meaningful sentence made up of those words in that order. From one point of view, the machine's task is to determine how likely S is to appear in text, which in mathematical terms is the probability P(S). Using the formula for conditional probability, the probability of the sequence S equals the product of the conditional probabilities of its words, so P(S) can be expanded as:

P(S) = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) ... P(w_n | w_1 w_2 ... w_{n-1})
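
The expansion above follows from applying the definition of conditional probability repeatedly. As a short illustration (not part of the original article), here is the three-word case in LaTeX notation, with commas separating the conditioning words where the article simply writes them side by side:

```latex
\begin{align*}
P(w_1, w_2) &= P(w_1)\, P(w_2 \mid w_1) \\
P(w_1, w_2, w_3) &= P(w_1, w_2)\, P(w_3 \mid w_1, w_2) \\
                 &= P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2)
\end{align*}
```

Repeating the same step for each additional word gives the general n-word formula.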

Here P(w_1) is the probability that the first word w_1 appears, and P(w_2 | w_1) is the probability that the second word appears given the first. It is easy to see that the probability of the word w_n depends on all the words before it. Computationally this is intractable, because there are far too many possible word histories. We therefore assume that the probability of any word w_i depends only on the word w_{i-1} immediately before it (the Markov assumption), and the problem becomes much simpler. The probability of S is now:

P(S) = P(w_1) P(w_2 | w_1) P(w_3 | w_2) ... P(w_i | w_{i-1}) ...
(Of course, we can also assume that a word depends on the previous N-1 words; the model is then somewhat more complex.)
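
To make the Markov simplification concrete, here is a minimal Python sketch that scores a sentence with the two-word (bigram) formula above and compares how many conditional probabilities different history lengths would require. The tables `first_word_prob` and `bigram_prob`, the function names, the probability values, and the 10,000-word vocabulary are all assumptions made up for illustration; none of them come from the original article.

```python
# Hypothetical probabilities, invented purely for illustration.
first_word_prob = {"the": 0.5, "a": 0.5}          # P(w_1)
bigram_prob = {                                   # P(w_i | w_{i-1})
    ("the", "cat"): 0.4,
    ("the", "dog"): 0.6,
    ("cat", "sleeps"): 0.5,
    ("dog", "sleeps"): 0.25,
}


def sentence_probability(words):
    """Return P(S) = P(w_1) * product of P(w_i | w_{i-1}) for i >= 2."""
    if not words:
        return 0.0
    prob = first_word_prob.get(words[0], 0.0)
    # Each later word is conditioned only on the word just before it.
    for prev, cur in zip(words, words[1:]):
        prob *= bigram_prob.get((prev, cur), 0.0)
    return prob


def parameter_count(vocab_size, history_len):
    """Number of conditional probabilities needed when each word depends
    on the previous `history_len` words: vocab_size ** (history_len + 1)."""
    return vocab_size ** (history_len + 1)


print(sentence_probability(["the", "cat", "sleeps"]))  # 0.5 * 0.4 * 0.5
print(parameter_count(10_000, 1))   # one word of history: 10**8 entries
print(parameter_count(10_000, 4))   # four words of history: 10**20 entries
```

In this sketch any word pair missing from the table gets probability zero, which makes the whole sentence score zero. Practical systems handle unseen pairs with smoothing techniques, which the sketch leaves out.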

The next question is how to estimate P(w_i | w_{i-1}). Now that large amounts of machine-readable text are available, this is straightforward: count how many times the pair (w_{i-1}, w_i) appears adjacently in the text, count how many times w_{i-1} itself appears in the same text, and divide the first number by the second. This gives P(w_i | w_{i-1}) = P(w_{i-1}, w_i) / P(w_{i-1}).
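
The counting procedure just described can be sketched in a few lines of Python. The three-sentence corpus and the names `context_counts`, `bigram_counts`, and `estimate_bigram` are hypothetical, chosen only to illustrate the idea; a real model would be trained on a very large text collection.

```python
from collections import Counter

# A tiny made-up corpus, used only for demonstration.
corpus = [
    "the cat sleeps",
    "the dog sleeps",
    "the cat eats",
]

context_counts = Counter()   # how often each word is followed by another word
bigram_counts = Counter()    # how often each adjacent pair appears

for sentence in corpus:
    words = sentence.split()
    for prev, cur in zip(words, words[1:]):
        context_counts[prev] += 1
        bigram_counts[(prev, cur)] += 1


def estimate_bigram(prev, cur):
    """Estimate P(cur | prev) as count(prev, cur) / count(prev)."""
    if context_counts[prev] == 0:
        return 0.0  # prev never seen as a predecessor; no estimate available
    return bigram_counts[(prev, cur)] / context_counts[prev]


print(estimate_bigram("the", "cat"))     # 2 of the 3 words after "the" are "cat"
print(estimate_bigram("cat", "sleeps"))  # 1 of the 2 words after "cat" is "sleeps"
```

Dividing raw counts is equivalent to dividing the two probabilities in the formula, because the normalizing total is the same in the numerator and the denominator and cancels out.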

Many people may not believe that such a simple mathematical model can handle complicated problems such as speech recognition and machine translation. In fact, it is not only laypeople; even many linguists have questioned the effectiveness of this approach. Yet practice has shown that statistical language models are more effective than any known rule-based solution. In Google's Chinese-English automatic translation, for example, the most important component is the statistical language model. Last year, the US National Institute of Standards and Technology (NIST) ran an evaluation of machine translation systems, and Google's system was not only the best in the world but also scored far above every rule-based system.

By now, readers may have begun to sense the beauty of mathematics: it makes some complicated problems remarkably simple. Of course, many details must still be handled to build a good statistical language model. The contribution of Jelinek and his colleagues was to propose the statistical language model and to resolve all of those details effectively. More than a decade later, Kai-Fu Lee used a statistical language model to reduce a 997-word speech recognition problem to the equivalent of a 20-word recognition problem, achieving the first large-vocabulary, speaker-independent, continuous speech recognition.

I am a scientific researcher, and in my work I often marvel at how powerful the language of mathematics is in solving practical problems. I hope to convey some of that magic to you. Of course, in the final analysis, no matter how wonderful a scientific method or solution may be, its purpose is to serve people. I hope that Google will keep trying a little harder, so that users get a little more out of search.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
