First, what is N-gram?
Wikipedia's definition of N-gram:
An N-gram is a statistical language model used to predict the nth item from the previous (n-1) items. At the application level, an item can be a phoneme (speech recognition), a character (input methods), a word (word segmentation), or a base pair (genetic sequences). In general, N-gram models are generated from large-scale text or audio corpora.
By convention, a 1-gram is called a unigram, a 2-gram a bigram, and a 3-gram a trigram. There are also four-grams, five-grams, and so on, but applications with n > 5 are rare.
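For concreteness, here is a small Python sketch (my own illustration, not tied to any library) that extracts unigrams, bigrams, and trigrams from a token list:

```python
# Extract all n-grams (as tuples) from a list of tokens.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this is a simple example".split()
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```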
Second, the theoretical basis of n-gram
The idea of the N-gram language model can be traced back to the work of Shannon, the master of information theory. He posed a question: given a string of letters, such as "for ex", what is the most likely next letter? From training corpus data we can obtain the probability distribution of the next letter by maximum likelihood estimation: the probability of a is 0.4, the probability of b is 0.0001, the probability of c is ..., and of course one constraint must hold: all of these probabilities sum to 1.
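As a toy illustration of this next-letter estimation (the corpus string and the context below are made up purely for illustration), a minimal Python sketch:

```python
from collections import Counter

# Estimate the distribution of the next letter after a given context
# by maximum likelihood (relative frequency) over a tiny made-up corpus.
corpus = "for example for everyone for ever forex for each"
context = "for e"

next_letters = Counter(
    corpus[i + len(context)]
    for i in range(len(corpus) - len(context))
    if corpus[i:i + len(context)] == context
)
total = sum(next_letters.values())
for letter, count in next_letters.most_common():
    print(repr(letter), count / total)   # the probabilities sum to 1
```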
Derivation of the probability formula of the N-gram model. Start from the definition of conditional probability and the multiplication rule:

P(B|A) = P(A,B) / P(A), i.e. P(A,B) = P(A) P(B|A)

Applying the multiplication rule repeatedly, we get the chain rule:

P(A1 A2 ... An) = P(A1) P(A2|A1) P(A3|A1A2) ... P(An|A1A2...An-1)
For a concrete application, suppose T is a word sequence a1, a2, a3, ..., an. Then P(T) = P(a1 a2 a3 ... an) = P(a1) P(a2|a1) P(a3|a1a2) ... P(an|a1a2...an-1).
Computing this directly is very difficult, so we introduce the Markov assumption: the probability of an item depends only on the m items immediately before it. When m = 0 we get the unigram model; when m = 1, the bigram model.
With this assumption, P(T) becomes tractable. For example, with the bigram model, P(T) = P(a1) P(a2|a1) P(a3|a2) ... P(an|an-1).
The conditional probability P(an|an-1) can be obtained by maximum likelihood estimation: it equals Count(an-1, an) / Count(an-1).
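To make this concrete, here is a minimal Python sketch (using a tiny made-up corpus) that estimates the bigram probabilities by counting and then scores a sentence with the chain rule under the Markov assumption:

```python
from collections import Counter

# Toy corpus for illustration only.
corpus = [
    ["he", "studies", "biology"],
    ["he", "studies", "chemistry"],
    ["she", "studies", "biology"],
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    unigram_counts.update(sentence)
    bigram_counts.update(zip(sentence, sentence[1:]))

total_tokens = sum(unigram_counts.values())

def p_unigram(w):
    return unigram_counts[w] / total_tokens

def p_bigram(w, prev):
    # P(w | prev) = Count(prev, w) / Count(prev), the MLE estimate
    return bigram_counts[(prev, w)] / unigram_counts[prev]

def sentence_prob(words):
    # P(T) = P(a1) * P(a2|a1) * ... * P(an|an-1)
    prob = p_unigram(words[0])
    for prev, w in zip(words, words[1:]):
        prob *= p_bigram(w, prev)
    return prob

print(sentence_prob(["he", "studies", "biology"]))  # (2/9) * 1 * (2/3) ≈ 0.148
```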
Third, what does N-gram data look like?
Strictly speaking, asking what an N-gram "looks like" is not rigorous. It is just a language model; all that matters is storing the information it needs, and the concrete format is decided by the application. For example, the famous Google Books Ngram Viewer stores its n-gram data in this format:
circumvallate 1978 335 91
circumvallate 1979 261 91
This is a fragment of 1-gram data: the first line means that the word "circumvallate" appeared 335 times in 1978, across 91 books. Of these fields, only the frequency (335) is essential; the other metadata (the year, part of speech, and so on) can be kept or dropped according to the application's needs. Here is a 5-gram data fragment:
analysis is often described as 1991 1 1 1
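As a small, hypothetical helper (my own sketch, not Google's tooling), here is one way to parse a line in the four-field layout shown in the 1-gram sample above (ngram, year, match count, volume count); in the released files the fields are tab-separated:

```python
def parse_books_ngram_line(line: str):
    # Assumed layout: ngram <TAB> year <TAB> match_count <TAB> volume_count
    # (matching the 1-gram sample above; other releases carry extra fields).
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return ngram, int(year), int(match_count), int(volume_count)

print(parse_books_ngram_line("circumvallate\t1978\t335\t91"))
# -> ('circumvallate', 1978, 335, 91)
```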
Of course, other formats are possible. For example, HanLP's N-gram model is a bigram model:
—@北冰洋 2
—@卢森堡 1
—@周日 1
—@因特网 1
—@地 1
—@地域 1
—@塔斯社 9
—@尚义 12
—@巴 1
—@巴勒斯坦 1
—@拉法耶特 3
—@拍卖 1
—@昆明 1
Each row represents the frequency at which two adjacent words appear together (relative to the underlying corpus).
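As a minimal sketch (the file path below is a placeholder, and this is not HanLP's own loading code), such "word1@word2 frequency" lines can be read into a bigram count table like this:

```python
from collections import defaultdict

def load_bigram_counts(path):
    """Read lines of the form 'word1@word2 frequency' into a dict."""
    counts = defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            pair, freq = line.rsplit(maxsplit=1)   # split off the trailing count
            first, second = pair.split("@", 1)     # '@' separates the two words
            counts[(first, second)] += int(freq)
    return counts

# counts = load_bigram_counts("bigram.ngram.txt")  # placeholder path
# counts[("—", "塔斯社")] would be 9 for the fragment above
```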
Fourth, what is N-gram good for?
4.1 Cultural studies
The N-gram model may look dull and cold, but the Google Books Ngram project has actually spawned a new discipline, culturomics, which studies human behavior and cultural trends through digitized texts. See the description above for details; the book "Visualization of the Future" also covers it in depth.
The TED talk "What_we_learned_from_5_million_books" is also wonderful.
4.2 Word segmentation algorithms
4.3 Speech recognition
4.4 Input methods
Something everyone uses every day. Type the pinyin "tashiyanjiushengwude", and the possible outputs include:
它实验救生无得
他实验就生物的
他是研究圣物的
他是研究生物的
Which one is the meaning the user most likely wants to express? The last one, 他是研究生物的 ("he studies biology"). The technology behind this choice uses the N-gram language model, where each item is a candidate character for each pinyin syllable. Do you remember Intelligent ABC (智能ABC)? It is said to be the pioneer of using N-grams in input methods.
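As a toy sketch of the idea (the probabilities below are invented purely for illustration; this is not any real input method's code), ranking the candidates with a character bigram model looks like this:

```python
import math

# Hypothetical P(next_char | prev_char) values, made up for this example.
bigram_prob = {
    ("他", "是"): 0.20, ("是", "研"): 0.05, ("研", "究"): 0.90,
    ("究", "生"): 0.30, ("究", "圣"): 0.001, ("生", "物"): 0.40,
    ("圣", "物"): 0.01, ("物", "的"): 0.10,
    ("它", "实"): 0.001, ("实", "验"): 0.60, ("验", "救"): 0.0001,
    ("救", "生"): 0.05, ("生", "无"): 0.0001, ("无", "得"): 0.001,
    ("他", "实"): 0.01, ("验", "就"): 0.0005, ("就", "生"): 0.002,
}

def score(sentence, floor=1e-8):
    # Sum of log bigram probabilities; unseen bigrams get a small floor value.
    return sum(math.log(bigram_prob.get(pair, floor))
               for pair in zip(sentence, sentence[1:]))

candidates = ["它实验救生无得", "他实验就生物的", "他是研究圣物的", "他是研究生物的"]
print(max(candidates, key=score))  # -> 他是研究生物的
```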
But Sogou Pinyin later came from behind and overtook it, using more advanced cloud computing technology (the N-gram model's data volume is quite large, more on that later).
4.5 Machine translation
Fifth, more about N-gram
Anyone who knows probability and statistics understands that the larger the corpus, the more useful the N-gram statistical language model becomes. Take the Google Books Ngram project: for the Chinese N-grams alone, covering 1551 to 2009, the overall size is as follows:
Year   N-gram count   Book page count   Book volume count
… (earlier years omitted)
1999   1046431040     8988394           9256
2000   1105382616     10068214          10504
2001   1017707579     8508116           9426
2002   1053775627     9676792           11116
2003   1003400478     9095202           10624
2004   1082612881     9079834           11200
2005   1326794771     10754207          13749
2006   1175160606     9381530           12030
2007   826433846      6121305           7291
2008   752279725      5463702           6436
2009   442976761      2460245           2557
Total  26859461025    252919372         302652
In total, about 300,000 volumes were scanned, and the number of generated N-grams (from unigrams to 5-grams) exceeds 26.8 billion. The English N-grams number more than 468.4 billion:
Year   N-gram count    Book page count   Book volume count
… (earlier years omitted)
1999   9997156197      48914071          91983
2000   11190986329     54799233          103405
2001   11349375656     55886251          104147
2002   12519922882     62335467          117207
2003   13632028136     68561620          127066
2004   14705541576     73346714          139616
2005   14425183957     72756812          138132
2006   15310495914     77883896          148342
2007   16206118071     82969746          155472
2008   19482936409     108811006         206272
Total  468491999592    2441898561        4541627
N-gram data at this scale is a serious technical challenge, whether for storage or retrieval.
The above is the Google Books N-gram data. Some years earlier, Google had also released a Web-derived 1T N-gram dataset, with the following size:
Number of tokens: 1,024,908,267,229
Number of sentences: 95,119,665,584
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663
That is roughly 95 billion sentences and 1 trillion tokens, and it covers only data from the single year 2006.
In addition to Google, Microsoft has also opened up N-gram data at the PB level (1 PB = 1 petabyte = 1,024 TB) through its Bing search; data at this scale can only be hosted in cloud storage.
Resources:
Stanford University Natural Language Processing Open Course
N-gram Language Model