N-gram Language Model

I. What is an N-gram?

Wikipedia's definition of N-gram:
An N-gram is a statistical language model used to predict the n-th item based on the previous (n-1) items. Depending on the application, an item can be a phoneme (speech recognition), a character (input methods), a word (word segmentation), or a base pair (genetic sequences). In general, N-gram models are built from large-scale text or speech corpora.
By convention, a 1-gram is called a unigram, a 2-gram a bigram, and a 3-gram a trigram. There are also four-grams, five-grams, and so on, but models with n > 5 are rarely used.

II. The theoretical basis of the N-gram model

The idea of the N-gram language model can be traced back to the work of Shannon, the father of information theory. He posed the question: given a string of letters, such as "for ex", what is the most likely next letter? From training corpus data we can obtain the probability distribution over the next letter by maximum likelihood estimation: the probability of a is 0.4, the probability of b is 0.0001, the probability of c is ..., and of course, don't forget the constraint that all of these probabilities must sum to 1.
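As a minimal sketch of this idea (the tiny corpus string and the context "for ex" below are made-up assumptions, not real training data), the next-letter distribution can be estimated directly from counts:

# Sketch: maximum likelihood estimate of P(next letter | "for ex").
# The toy corpus is an illustrative assumption, not real training data.
from collections import Counter

corpus = "for example, for excellence, for extra, for evermore"
context = "for ex"

# Count the letter that follows each occurrence of the context.
following = Counter(
    corpus[i + len(context)]
    for i in range(len(corpus) - len(context))
    if corpus[i:i + len(context)] == context
)

total = sum(following.values())
for letter, count in following.most_common():
    print(letter, count / total)  # the estimated probabilities sum to 1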
The probability formula of the N-gram model is derived from conditional probability and the chain (multiplication) rule:

P(a1, a2, ..., an) = P(a1) · P(a2 | a1) · P(a3 | a1, a2) · ... · P(an | a1, a2, ..., a(n-1))

For an application, suppose T is a word sequence a1, a2, a3, ..., an. Then

P(T) = P(a1 a2 a3 ... an) = P(a1) · P(a2 | a1) · P(a3 | a1 a2) · ... · P(an | a1 a2 ... a(n-1))

Computing this directly is intractable, so we introduce the Markov assumption: the probability of an item depends only on the m items immediately preceding it. When m = 0 this gives the unigram model; when m = 1, the bigram model.

Under the bigram model, for example,

P(T) = P(a1) · P(a2 | a1) · P(a3 | a2) · ... · P(an | a(n-1))

and each conditional probability P(an | a(n-1)) can be obtained by maximum likelihood estimation:

P(an | a(n-1)) = Count(a(n-1), an) / Count(a(n-1))
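As a minimal sketch of these formulas (the toy corpus below is a made-up illustration; a start marker <s> is used so that the first word is also scored as a conditional probability):

# Minimal bigram model: MLE estimation plus sentence probability.
# The toy corpus and the test sentence are illustrative assumptions.
from collections import Counter

corpus = [
    ["he", "studies", "biology"],
    ["he", "studies", "physics"],
    ["she", "studies", "biology"],
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence          # sentence-start marker
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """P(word | prev) = Count(prev, word) / Count(prev), by maximum likelihood."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(sentence):
    """P(T) under the first-order Markov assumption: product of bigram probabilities."""
    p = 1.0
    for prev, word in zip(["<s>"] + sentence, sentence):
        p *= bigram_prob(prev, word)
    return p

print(sentence_prob(["he", "studies", "biology"]))  # 2/3 * 1 * 2/3 ≈ 0.444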

III. What does N-gram data look like?

Strictly speaking, it is not quite rigorous to ask what an N-gram "looks like". It is just a language model; all that matters is that the required information is stored, and the exact format depends on the application. For example, the famous Google Books Ngram Viewer stores its n-gram data like this:

circumvallate   1978   335   91
circumvallate   1979   261   91

This is a fragment of 1-gram data; the first line says that the word "circumvallate" appeared 335 times in 1978, across 91 books. Of this metadata, only the frequency (335) is strictly necessary; other fields (such as part of speech) can be included according to the application's needs. Here is a 5-gram data fragment:

isas  1991  1   1   1
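As a hedged sketch of how a file in this format could be read (the file name is hypothetical, and the tab-separated layout ngram / year / match_count / volume_count is inferred from the fragments above):

# Sketch: collect per-year match counts for one n-gram from a Google-Books-style file.
# Assumed line layout: ngram <TAB> year <TAB> match_count <TAB> volume_count
from collections import defaultdict

def yearly_counts(path, target):
    totals = defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
            if ngram == target:
                totals[int(year)] += int(match_count)
    return dict(totals)

# Hypothetical usage:
# yearly_counts("googlebooks-eng-1gram.txt", "circumvallate")
# -> {1978: 335, 1979: 261, ...} for the fragment shown above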

Of course, other formats are possible. For example, HanLP's N-gram model is a bigram model:

—@北冰洋   2
—@卢森堡   1
—@周日    1
—@因特网   1
—@地     1
—@地域    1
—@塔斯社   9
—@尚义    12
—@巴     1
—@巴勒斯坦  1
—@拉法耶特  3
—@拍卖    1
—@昆明    1

Each row represents the frequency at which two adjacent words appear together (relative to the underlying corpus).

IV. What are N-grams used for?

4.1 Cultural studies

The N-gram model may look dull and cold, but the Google Books Ngram project has in fact spawned a new discipline, culturomics, which studies human behavior and cultural trends through digitized texts. You can look up the project for a detailed description; the book "Visualization of the Future" also describes it in detail.

The TED talk "What we learned from 5 million books" is also wonderful.

4.2 Word segmentation

4.3 Speech recognition

4.4 Input methods

This is something everyone uses every day. Type the pinyin "tashiyanjiushengwude" and the possible outputs include:

它实验救生无得
他实验就生物的
他是研究圣物的
他是研究生物的

Deciding which of these candidates best expresses what the user wanted to type (here, 他是研究生物的, "he studies biology") is exactly where the N-gram language model is used: each item is a candidate word for each pinyin syllable, and the model scores the alternatives. Do you remember Intelligent ABC? It is said to be the earliest input method to use N-grams.

But the Sogou input method later overtook it, using more advanced cloud computing techniques (the N-gram model's data volume is quite large, as discussed later).
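A minimal sketch of this ranking step (the toy counts below are made-up assumptions; a real input method would load counts on the scale discussed in the next section, for example from a HanLP-style "wordA@wordB frequency" table):

# Sketch: pick the most probable candidate sentence with a bigram model.
# Toy counts are illustrative assumptions; real systems use large n-gram tables.
import math

bigram_counts = {("他", "是"): 50, ("是", "研究"): 30, ("研究", "生物"): 20,
                 ("生物", "的"): 25, ("研究", "圣物"): 1, ("圣物", "的"): 1}
unigram_counts = {"他": 60, "是": 55, "研究": 35, "生物": 25, "圣物": 2, "的": 80,
                  "它": 10, "实验": 5, "救生": 1, "无得": 1, "就": 3}

def log_prob(words, alpha=1.0, vocab_size=100000):
    """Sum of log P(w_i | w_{i-1}) with add-alpha smoothing for unseen pairs.
    (The probability of the first word is ignored in this sketch.)"""
    total = 0.0
    for prev, word in zip(words, words[1:]):
        num = bigram_counts.get((prev, word), 0) + alpha
        den = unigram_counts.get(prev, 0) + alpha * vocab_size
        total += math.log(num / den)
    return total

candidates = [
    ["它", "实验", "救生", "无得"],
    ["他", "是", "研究", "圣物", "的"],
    ["他", "是", "研究", "生物", "的"],
]
best = max(candidates, key=log_prob)
print("".join(best))  # expected: 他是研究生物的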

4.5 Machine translation

V. More about N-gram data

Anyone who works with probability and statistics knows that the larger the corpus, the more useful the N-gram statistical language model becomes. Take the Google Books Ngram project: for the Chinese N-grams alone, covering 1551 to 2009, the overall size is as follows:

Year    N-gram Count   Book Page Count   Book Volume Count
...
1999    1046431040     8988394           9256
2000    1105382616     10068214          10504
2001    1017707579     8508116           9426
2002    1053775627     9676792           11116
2003    1003400478     9095202           10624
2004    1082612881     9079834           11200
2005    1326794771     10754207          13749
2006    1175160606     9381530           12030
2007    826433846      6121305           7291
2008    752279725      5463702           6436
2009    442976761      2460245           2557
Total   26859461025    252919372         302652

In total about 300,000 volumes were scanned, and the number of generated N-grams (from unigrams to 5-grams) exceeds 26.8 billion. The English N-grams number more than 468.4 billion:

Year    N-gram Count   Book Page Count   Book Volume Count
...
1999    9997156197     48914071          91983
2000    11190986329    54799233          103405
2001    11349375656    55886251          104147
2002    12519922882    62335467          117207
2003    13632028136    68561620          127066
2004    14705541576    73346714          139616
2005    14425183957    72756812          138132
2006    15310495914    77883896          148342
2007    16206118071    82969746          155472
2008    19482936409    108811006         206272
Total   468491999592   2441898561        4541627

N-gram data at this magnitude is a great technical challenge, for both storage and retrieval.
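A rough, assumption-laden back-of-envelope estimate of why this is hard (the 30 bytes per record is purely an assumed average):

# Rough storage estimate for the English Google Books n-gram table above.
entries = 468_491_999_592      # total English n-gram records from the table
bytes_per_record = 30          # assumed average: n-gram text + year + two counts
tib = entries * bytes_per_record / 1024**4
print(round(tib, 1), "TiB uncompressed")  # roughly 12-13 TiB under these assumptions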
The above is the Google Books N-gram data. Some years earlier, Google also released a Web-based 1T N-gram corpus, with the following size:

Number of tokens:     1,024,908,267,229
Number of sentences:  95,119,665,584
Number of unigrams:   13,588,391
Number of bigrams:    314,843,401
Number of trigrams:   977,069,902
Number of fourgrams:  1,313,818,354
Number of fivegrams:  1,176,470,663

That is about 95 billion sentences and over 1 trillion tokens, drawn from 2006 data alone.
Besides Google, Microsoft has also opened up petabyte-scale N-gram data through its Bing search (1 PB = 1 petabyte = 1024 TB = 1024 × 1024 GB); data at this scale can only live in cloud storage.

Resources:
Stanford University Natural Language Processing Open Course

N-gram Language Model
