N-gram Language Model
Consider a speech recognition system. Suppose the user said the sentence "I have a gun". Because the candidates sound similar, the recognizer finds that the following sentences are all plausible transcriptions: 1. I have a gun. 2. I have a gull. 3. I have a gub. So the question is: which one is the right answer?
The general solution is to use statistical methods: of sentences 1, 2, and 3 above, work out which one has the highest probability of occurring in English, and return that one to the user. So how do we calculate the probability of a sentence? Bluntly put, by "counting". But even counting can be done in many different ways, the simplest of which is the following:
Given a corpus, count the number of sentences in it whose length is 4, and call it N; then count how many of those length-4 sentences are exactly "I have a gun", and call it N0. The probability of the sentence "I have a gun" is then N0/N. The probabilities of the other two sentences are calculated the same way.
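As a concrete illustration, here is a minimal Python sketch of this naive counting scheme. The four-sentence toy corpus and the function name `naive_sentence_prob` are hypothetical, purely for illustration:

```python
# A hypothetical toy corpus: each entry is one whitespace-tokenized sentence.
corpus = [
    "i have a gun",
    "i have a gun",
    "i have a gull",
    "she has a cat",
]

def naive_sentence_prob(sentence, corpus):
    # N  = number of corpus sentences with the same length as `sentence`
    # N0 = number of corpus sentences exactly equal to `sentence`
    length = len(sentence.split())
    same_length = [s for s in corpus if len(s.split()) == length]
    if not same_length:
        return 0.0
    return same_length.count(sentence) / len(same_length)

print(naive_sentence_prob("i have a gun", corpus))   # 2/4 = 0.5
print(naive_sentence_prob("i have a gull", corpus))  # 1/4 = 0.25
```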
This way of counting is logically sound, but language is flexible and the size of any corpus is limited, so a slightly longer sentence is quite likely not to appear in the corpus at all. Take the sentence "I am looking for a restaurant to eat breakfast": intuitively, shouldn't this sentence appear many times in a corpus? Yet if you enter it into Google's search box and click Search, you will find no exact match in the returned results. Therefore, we need a more effective way of "counting".
In order to make things clear, it is necessary to introduce some simple mathematical symbols.
1. Word sequence: w1, w2, w3, ..., wn
2. Chain rule: P(w1, w2, w3, ..., wn) = P(w1) P(w2|w1) P(w3|w1,w2) P(w4|w1,w2,w3) ... P(wn|w1,w2,...,wn-1)
Now, to calculate the probability of "I have a gun", that is, P(I, have, a, gun), the chain rule gives:
P(I, have, a, gun) = P(I) P(have|I) P(a|I, have) P(gun|I, have, a)
But things have not actually been simplified. For example, to calculate P(gun|I, have, a), expand it with the definition of conditional probability:
P(gun|I, have, a) = P(I, have, a, gun) / P(I, have, a)
See the problem? To compute P(gun|I, have, a), we first need P(I, have, a, gun) and P(I, have, a). But P(I, have, a, gun) is exactly the quantity we set out to calculate in the first place. We have gone around in a circle and ended up back where we started.
OK, let's take stock of the idea so far.
A sentence can be represented as a word sequence w1, w2, w3, ..., wn, and we want the probability of its occurrence, P(w1, w2, w3, ..., wn). This probability could be estimated directly by counting, but as shown above that works poorly, so we applied the chain rule and turned P(w1, w2, w3, ..., wn) into a product of factors: P(w1) P(w2|w1) P(w3|w1,w2) P(w4|w1,w2,w3) ... P(wn|w1,w2,...,wn-1). But after this transformation, the problem did not become any simpler. What to do?
This is where the N-gram model comes in handy.
For the 1-gram (unigram) model, it is assumed that P(wn|w1,w2,...,wn-1) ≈ P(wn)
For the 2-gram (bigram) model, it is assumed that P(wn|w1,w2,...,wn-1) ≈ P(wn|wn-1)
For the 3-gram (trigram) model, it is assumed that P(wn|w1,w2,...,wn-1) ≈ P(wn|wn-2, wn-1)
And so on: a k-gram model conditions each word on only the k-1 words that precede it.
So:
Under the 1-gram model:
P(w1, w2, w3, ..., wn) = P(w1) P(w2|w1) P(w3|w1,w2) P(w4|w1,w2,w3) ... P(wn|w1,w2,...,wn-1)
≈ P(w1) P(w2) P(w3) P(w4) ... P(wn)
Under the 2-gram model:
P(w1, w2, w3, ..., wn) = P(w1) P(w2|w1) P(w3|w1,w2) P(w4|w1,w2,w3) ... P(wn|w1,w2,...,wn-1)
≈ P(w1) P(w2|w1) P(w3|w2) P(w4|w3) ... P(wn|wn-1)
Under the 3-gram model:
P(w1, w2, w3, ..., wn) = P(w1) P(w2|w1) P(w3|w1,w2) P(w4|w1,w2,w3) ... P(wn|w1,w2,...,wn-1)
≈ P(w1) P(w2|w1) P(w3|w1,w2) P(w4|w2,w3) ... P(wn|wn-2,wn-1)
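Before plugging these approximations into the example, it helps to see what the n-grams of a sentence actually look like. A minimal sketch (the function name `ngrams` is just an illustrative choice):

```python
def ngrams(tokens, n):
    # All contiguous n-word windows of the token sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("i have a gun".split(), 2))
# [('i', 'have'), ('have', 'a'), ('a', 'gun')]
print(ngrams("i have a gun".split(), 3))
# [('i', 'have', 'a'), ('have', 'a', 'gun')]
```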
Assuming that we are using the 2-gram model, then:
P(I, have, a, gun) = P(I) P(have|I) P(a|have) P(gun|a)
Then we use the "counting" method to estimate P(I) and the three conditional probabilities:
P(I) = number of occurrences of "I" in the corpus / total number of words in the corpus
P(have|I) = number of occurrences of "I have" in the corpus / number of occurrences of "I" in the corpus
The other two conditional probabilities are estimated in the same way.
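Putting it all together, here is a minimal sketch of estimating these counts and scoring the candidate sentences under the 2-gram model. The three-sentence toy corpus is hypothetical, and real systems also add sentence-boundary markers and smoothing for unseen n-grams, both omitted here:

```python
from collections import Counter

corpus = [
    "i have a gun".split(),
    "i have a dog".split(),
    "she has a gull".split(),
]

# Count every word and every adjacent word pair in the corpus.
unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1])
                  for sent in corpus for i in range(len(sent) - 1))
total_words = sum(unigrams.values())

def bigram_sentence_prob(tokens):
    # P(w1) * P(w2|w1) * ... * P(wn|wn-1), where
    #   P(w1)      = count(w1) / total words
    #   P(wk|wk-1) = count("wk-1 wk") / count(wk-1)
    prob = unigrams[tokens[0]] / total_words
    for prev, cur in zip(tokens, tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob

print(bigram_sentence_prob("i have a gun".split()))  # nonzero: all bigrams seen
print(bigram_sentence_prob("i have a gub".split()))  # 0.0: "a gub" never occurs
```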
To sum up, this article is only a very brief introduction to the N-gram model; the goal is to be simple and easy to understand rather than rigorous. Interested readers can consult further references: the N-gram model is covered in any book on natural language processing.