Let wi be any word in the text. If the word wi-1 that precedes it is known, the conditional probability P(wi | wi-1) can be used to predict how likely wi is to appear. This is the core idea of the statistical language model. More generally, let the variable W denote any word sequence in the text, composed of n words in order, that is, W = w1w2...wn. The statistical language model gives the probability P(W) that the word sequence W appears in the text. Using the product rule of probability, P(W) can be expanded as:
P(W) = P(w1) P(w2|w1) P(w3|w1w2) ... P(wn|w1w2...wn-1)
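The chain-rule expansion above can be written directly as a short function. Note that `cond_prob` is only a placeholder for some trained model; the uniform model used below is a made-up stand-in just to keep the sketch runnable:

```python
from typing import Callable, Sequence, Tuple

def sentence_prob(words: Sequence[str],
                  cond_prob: Callable[[str, Tuple[str, ...]], float]) -> float:
    """Compute P(W) = P(w1) * P(w2|w1) * ... * P(wn|w1...wn-1)."""
    prob = 1.0
    for i, w in enumerate(words):
        history = tuple(words[:i])      # all words before wi
        prob *= cond_prob(w, history)
    return prob

# Toy illustration: a uniform model over a 4-word vocabulary,
# i.e. P(w|history) = 1/4 regardless of history (an assumption,
# not a trained model).
uniform = lambda w, history: 0.25
print(sentence_prob(["he", "is", "a", "student"], uniform))  # 0.25**4
```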
To predict the probability of a word wn this way, we would need to know the probabilities conditioned on all the words before it, which is computationally far too expensive. If we instead assume that the probability of any word wi depends only on the single word immediately before it (the Markov assumption), the problem is greatly simplified. The resulting language model is called a bigram model:
P(W) ≈ P(w1) P(w2|w1) P(w3|w2) ... P(wn|wn-1)
If the probability of any word wi depends only on the two words before it, the language model is called a trigram model:
P(W) ≈ P(w1) P(w2|w1) P(w3|w1w2) P(w4|w2w3) ... P(wn|wn-2wn-1)
In general, an n-gram model assumes that the probability of the current word depends only on the n-1 words before it. Crucially, these probability parameters can all be estimated from a large-scale corpus. For example, the trigram probability is
P(wi | wi-2wi-1) ≈ count(wi-2wi-1wi) / count(wi-2wi-1)
where count(...) denotes the number of times the given word sequence occurs in the entire corpus.
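Estimating these parameters really is just counting. A minimal sketch in Python (the toy corpus below is made up purely for illustration):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (stored as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical toy corpus standing in for a large-scale corpus.
corpus = "he is a student he is a doctor he is a student".split()

unigrams = ngram_counts(corpus, 1)
bigrams = ngram_counts(corpus, 2)

# MLE bigram probability: P(wi | wi-1) = count(wi-1 wi) / count(wi-1)
def bigram_prob(prev, word):
    return bigrams[(prev, word)] / unigrams[(prev,)]

print(bigram_prob("he", "is"))      # count(he is)/count(he) = 3/3 = 1.0
print(bigram_prob("a", "student"))  # count(a student)/count(a) = 2/3
```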
Here is a worked bigram example. Assume the corpus contains 15,000 words in total, and the words of the sentence "he is a computer doctoral student" occur the following numbers of times:
| Word | Count |
| --- | --- |
| he | 2500 |
| is | 3000 |
| computer | 100 |
| doctor | 85 |
| graduate student | 196 |
The bigram counts (row word wi-1 followed by column word wi) are shown in the following table:
| wi-1 \ wi | he | is | computer | doctor | graduate student |
| --- | --- | --- | --- | --- | --- |
| he | 6 | 1900 | 20 | 15 | 10 |
| is | 150 | 8 | 80 | 65 | 80 |
| computer | 0 | 300 | 1 | 50 | 100 |
| doctor | 5 | 50 | 5 | 2 | 110 |
| graduate student | 3 | 30 | 6 | 3 | 8 |
P(he is a computer doctoral student)
= P(he) P(is|he) P(computer|is) P(doctor|computer) P(graduate student|doctor)
= (2500/15000) × (1900/2500) × (80/3000) × (50/100) × (110/196)
≈ 0.00095
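The product can be checked by transcribing the table values (exactly as given above) into Python:

```python
# Unigram counts from the first table; 15,000 words in the corpus in total.
total = 15000
unigram = {"he": 2500, "is": 3000, "computer": 100,
           "doctor": 85, "graduate student": 196}

# Bigram counts (row = wi-1, column = wi) from the second table;
# only the entries this example needs are copied here.
bigram = {("he", "is"): 1900, ("is", "computer"): 80,
          ("computer", "doctor"): 50, ("doctor", "graduate student"): 110}

# P(he is a computer doctoral student), factor by factor as in the text
# (the last factor is divided by 196, following the text's arithmetic).
p = (unigram["he"] / total) \
    * (bigram[("he", "is")] / unigram["he"]) \
    * (bigram[("is", "computer")] / unigram["is"]) \
    * (bigram[("computer", "doctor")] / unigram["computer"]) \
    * (bigram[("doctor", "graduate student")] / unigram["graduate student"])

print(p)  # ≈ 0.00095
```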
A note on the data sparseness problem. Suppose the vocabulary contains 10,000 words. With a bigram model there are up to 10,000² possible bigrams; with a trigram model, up to 10,000³ possible trigrams. A great many of these word combinations never appear in the corpus, so their maximum-likelihood estimates come out as 0, which causes serious trouble: when computing the probability of a sentence, a single zero factor makes the probability of the entire sentence zero. The end result is a model that assigns nonzero probability to only a pitiful handful of sentences, while most sentences get probability 0. We therefore need data smoothing, which has two goals: first, the probabilities of all n-grams must sum to 1; second, no n-gram may have probability 0.
With the goals of data smoothing clear, how do we achieve them? This is what the Good-Turing estimate solves. Its principle: for an event that has never been observed, we cannot assume its probability of occurrence is zero, so a small share of the total probability mass is set aside for these unseen events. The total probability of the observed events then becomes less than 1, which means the probabilities of the observed events must be discounted. The less reliable a statistic is (i.e., the fewer times the event was observed), the more heavily it is discounted.
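A bare-bones sketch of the Good-Turing idea: the adjusted count is r* = (r+1) · N(r+1) / N(r), where N(r) is the number of distinct n-grams seen exactly r times, and the mass reserved for unseen events is N(1)/N. Real implementations additionally smooth the N(r) values, which this sketch does not:

```python
from collections import Counter

def good_turing_adjusted_counts(counts):
    """Good-Turing discounting: r* = (r+1) * N(r+1) / N(r),
    where N(r) is the number of distinct items seen exactly r times."""
    freq_of_freq = Counter(counts.values())          # N(r)
    adjusted = {}
    for item, r in counts.items():
        n_r, n_r1 = freq_of_freq[r], freq_of_freq.get(r + 1, 0)
        # When N(r+1) is 0 the raw formula breaks down; keep r unchanged.
        adjusted[item] = (r + 1) * n_r1 / n_r if n_r1 else r
    return adjusted

# Toy data: a seen 3 times, b and c twice, d and e once (9 observations).
counts = Counter(["a", "a", "a", "b", "b", "c", "c", "d", "e"])
total = sum(counts.values())

adjusted = good_turing_adjusted_counts(counts)
print(adjusted)

# Probability mass reserved for unseen events: N(1) / N = 2/9
unseen_mass = sum(1 for r in counts.values() if r == 1) / total
print(unseen_mass)
```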