N-gram Statistical Language Model

1. Statistical language models
Natural language, from its creation, gradually evolved into a context-dependent way of expressing and transmitting information. The basic problem in making computers handle natural language is therefore to build a mathematical model of this context-dependent nature of natural language. In natural language processing, this mathematical model is called a statistical language model, and today it is the foundation of all natural language processing.

2. The n-gram model
The n-gram is a language model commonly used in large-vocabulary continuous speech recognition; for Chinese, it is called the Chinese Language Model (CLM). The Chinese language model exploits the collocation information between adjacent words in a context: when a string of consecutive, unsegmented pinyin, strokes, or digits representing letters or strokes needs to be converted into a Chinese character string (a sentence), the model can compute the sentence with the maximal probability and thereby perform the conversion automatically, without manual selection by the user, avoiding the ambiguity problem of many Chinese characters corresponding to the same pinyin (or stroke string, or digit string). The core idea behind Sogou Pinyin and Microsoft Pinyin is the n-gram model, with some linguistic rules added on top.

3. Using mathematical methods to describe the laws of language
Consider the sentence: "Ben Bernanke, the Fed's chairman, told the media yesterday that $700 billion of bailout money would be lent to hundreds of banks, insurers and car companies." (This sentence is fluent and its meaning is clear.) Change the order of some words, or replace some words, and it becomes: "Ben Bernanke's Federal Reserve chairman yesterday told the media to lend to hundreds of banks, insurers and car companies for $700 billion in bailout money." (The meaning is vague, though much of it can still be guessed.) But scramble it further: "The United main U.S. reserve transcript, the south will borrow days of the rescue of the media support funds of 70 yuan billion 00 dollars to Baijia silver Yasuyuki, auto insurance companies and." (The reader basically cannot understand it.) Before the 1970s, scientists tried to determine whether a word sequence was grammatically correct, but that approach went nowhere. Jelinek looked at the problem from another angle and solved it quite well with a simple statistical language model. Jelinek's starting point is simple:
Whether a sentence is reasonable depends on how likely it is to occur, and likelihood is measured by probability. The first sentence appears with the highest probability, so it is the most likely to be a well-formed sentence. A more general and rigorous description of this method is as follows. Assume S denotes a meaningful sentence consisting of a sequence of words w1, w2, ..., wn, where n is the length of the sentence. We want to know the likelihood of S appearing in text (a corpus), which is mathematically the probability P(S). We need a model to estimate this probability. Since S = w1, w2, ..., wn, we can expand P(S) as:

P(S) = P(w1, w2, ..., wn)

Using the formula for conditional probability, the probability of S appearing equals the product of the conditional probabilities of each word appearing, so P(w1, ..., wn) expands to:

P(S) = P(w1, w2, ..., wn) = P(w1) P(w2|w1) P(w3|w1, w2) ... P(wn|w1, w2, ..., wn-1)

where P(w1) is the probability that the first word w1 appears, P(w2|w1) is the probability that the second word appears given that the first word is known, and so on; the probability of the word wn depends on all the words before it. Additional background: a detailed explanation of conditional probability and Bayes' formula.
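As a concrete illustration of the chain-rule expansion above, the short sketch below multiplies the conditional probabilities for a three-word sentence. The numeric values are made up purely for illustration and do not come from any corpus.

```python
# Chain rule for a three-word sentence S = w1 w2 w3:
# P(S) = P(w1) * P(w2 | w1) * P(w3 | w1, w2).
# The values below are invented placeholders, not estimates from real data.
p_w1 = 0.05              # P(w1)
p_w2_given_w1 = 0.20     # P(w2 | w1)
p_w3_given_w1_w2 = 0.10  # P(w3 | w1, w2)

p_sentence = p_w1 * p_w2_given_w1 * p_w3_given_w1_w2
print(p_sentence)  # 0.001
```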
However, this method has two fatal defects: one is that the parameter space is too large (there are far too many conditional probabilities P(wn|w1, w2, ..., wn-1) to estimate), and the other is that data sparsity is severe.
An explanation of data sparsity: suppose the vocabulary contains 20,000 words. For a bigram model there are already 400 million possible 2-grams, and for a trigram model there are 8 trillion possible 3-grams! Most of these word combinations never appear in the corpus, so under maximum likelihood estimation their probability is 0. This causes a lot of trouble: when computing the probability of a sentence, if a single factor is 0 the probability of the whole sentence becomes 0, and the end result is that our model covers only a pathetically small number of sentences while most sentences are assigned probability 0. Therefore we apply data smoothing. Smoothing has two purposes: first, to make all the n-gram probabilities sum to 1; second, to make all n-gram probabilities nonzero. Data smoothing methods are described on page 33 of "The Beauty of Mathematics".
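The article defers the details of smoothing to "The Beauty of Mathematics". As a rough illustration of the idea only, here is a minimal sketch of add-one (Laplace) smoothing for a bigram model, one of the simplest smoothing methods and not necessarily the one described in the book; the function name and the tiny toy corpus are assumptions made here for illustration.

```python
from collections import Counter

def add_one_bigram_prob(w_prev, w, bigram_counts, unigram_counts, vocab_size):
    """Add-one (Laplace) smoothed estimate of P(w | w_prev).

    Every bigram count is incremented by 1, so no bigram is assigned
    probability zero, even if it never occurred in the corpus.
    """
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + vocab_size)

# Tiny toy corpus, purely for illustration.
tokens = "i want to eat chinese food".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)

print(add_one_bigram_prob("eat", "chinese", bigram_counts, unigram_counts, V))  # seen pair
print(add_one_bigram_prob("eat", "food", bigram_counts, unigram_counts, V))     # unseen, but non-zero
```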
4. Markov hypothesis
To solve the problem of the parameter space being too large, the Markov assumption is introduced: the probability of any word appearing depends only on the limited number of words appearing immediately before it. If the probability of a word appearing depends only on the single word before it, we call the model a bigram model. That is,

P(S) = P(w1, w2, ..., wn)
     = P(w1) P(w2|w1) P(w3|w1, w2) ... P(wn|w1, w2, ..., wn-1)
     ≈ P(w1) P(w2|w1) P(w3|w2) ... P(wi|wi-1) ... P(wn|wn-1)
If the appearance of a word depends only on the two words before it, we call the model a trigram model.
In practice the most commonly used models are the bigram and trigram, and they work well. Models of order four or higher are rarely used, because training them (estimating the parameters) requires a much larger corpus, data sparsity becomes severe, time complexity is high, and the gain in accuracy is small. More generally, one can assume that a word's appearance is determined by the previous N-1 words; the corresponding, slightly more complex model is called the N-gram model.
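To make the bigram approximation concrete, here is a small sketch that multiplies P(w1) by the conditional probabilities P(wi|wi-1). The function name is chosen here for illustration, and the lookup tables `unigram_prob` and `bigram_prob` are assumed to be supplied elsewhere, for example from the maximum likelihood estimates described in the next section (and smoothed, so that every looked-up probability exists and is nonzero).

```python
def bigram_sentence_prob(words, unigram_prob, bigram_prob):
    """P(S) ≈ P(w1) * P(w2|w1) * P(w3|w2) * ... * P(wn|wn-1).

    `unigram_prob` maps a word to P(word); `bigram_prob` maps a
    (previous_word, word) pair to P(word | previous_word).
    """
    prob = unigram_prob[words[0]]          # P(w1)
    for prev, cur in zip(words, words[1:]):
        prob *= bigram_prob[(prev, cur)]   # P(w_i | w_{i-1})
    return prob
```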
5. How to estimate the conditional probabilities
The derivation of conditional probability is explained in detail on page 30 of "The Beauty of Mathematics"; here a simple estimate is described. A simple estimation method is the maximum likelihood estimate (Maximum Likelihood Estimate), i.e. P(wn|w1, w2, ..., wn-1) = C(w1, w2, ..., wn) / C(w1, w2, ..., wn-1), where C(w1, w2, ..., wn) is the number of times the sequence w1, w2, ..., wn appears in the corpus. For the bigram model, P(wi|wi-1) = C(wi-1, wi) / C(wi-1). (Maximum likelihood estimation is a statistical method for finding the parameters of the probability density function underlying a sample set; see the linked explanation for details.)
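A minimal sketch of the maximum likelihood estimate for a bigram model, P(wi|wi-1) = C(wi-1, wi) / C(wi-1), follows. The function name and the one-sentence toy corpus are assumptions made here for illustration only.

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Estimate P(w | w_prev) = C(w_prev, w) / C(w_prev) by counting.

    `sentences` is an iterable of already-tokenised word lists; for
    Chinese text this presupposes word segmentation, as the next
    section notes.
    """
    unigram_counts = Counter()
    bigram_counts = Counter()
    for words in sentences:
        unigram_counts.update(words)
        bigram_counts.update(zip(words, words[1:]))
    return {
        (prev, w): count / unigram_counts[prev]
        for (prev, w), count in bigram_counts.items()
    }

probs = train_bigram_mle([["i", "want", "to", "eat", "chinese", "food"]])
print(probs[("want", "to")])  # 1.0 in this tiny toy corpus
```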
6. An example on a corpus
Note: this corpus is in English; for a Chinese corpus, sentences must first be segmented into words before further natural language processing can be done. In the training corpus we count the number of occurrences C(w1 w2 ... wn) and the number of occurrences C(w1 w2 ... wn-1).
Let's work through a bigram example. Suppose the corpus contains 13,748 words in total.
P(I want to eat Chinese food)
= P(I) * P(want|I) * P(to|want) * P(eat|to) * P(Chinese|eat) * P(food|Chinese)
= 0.25 * (1087/3437) * (786/1215) * (860/3256) * (19/938) * (120/213)
≈ 0.000154171
The probability of a scrambled ordering such as "want food eat with I to Chinese" is much lower than that of "I want to eat Chinese food", so the latter is the more reasonable sentence structure.
Note: P(want|I) = C(I want) / C(I) = 1087/3437.
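Purely as a sanity check of the arithmetic above, the following snippet reproduces the quoted value from the counts given in the example (P(I) = 0.25 is taken as given).

```python
# Reproduce the bigram computation above from the quoted counts.
p = 0.25 * (1087/3437) * (786/1215) * (860/3256) * (19/938) * (120/213)
print(p)  # ≈ 1.54e-04, the value quoted above
```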
Much of the material on the Internet omits Table 1 (words and their frequencies) and Table 2 (word sequences and their frequencies), which makes those articles unclear.
Issues such as higher-order language models, model training, the zero-probability problem and smoothing methods, and corpus selection are explained in detail in "The Beauty of Mathematics" and are not covered further here.