Concept
Statistical language model: a mathematical model that describes the inherent regularities of natural language. It is widely used in natural language processing tasks such as speech recognition, machine translation, word segmentation, and part-of-speech tagging. Simply put, a language model is a model for computing the probability of a sentence.
That is, it computes P(w1, w2, w3, ..., wk). With a language model we can decide which word sequence is more likely, or, given several words, predict the most likely next word.
N-gram Language Model
- Briefly
In NLP, an n-gram model can be used to predict or evaluate whether a sentence is reasonable with respect to a given corpus. An n-gram model can also be used to measure the degree of difference between two strings, a common technique in fuzzy matching. It is widely used in machine translation, speech recognition, printed and handwritten character recognition, spelling correction, Chinese character (pinyin) input, and literature retrieval.
- Introducing the n-gram model
Suppose S denotes a meaningful sentence consisting of a sequence of words w1, w2, w3, ..., wn, where n is the length of the sentence. We want to know the likelihood that S appears in a text (corpus), which mathematically is its probability P(S):
P(S) = P(w1, w2, w3, ..., wn) = P(w1) P(w2|w1) P(w3|w1,w2) ... P(wn|w1,w2,...,wn-1)
However, this direct approach has two fatal defects:
- Too many parameters: the conditional probabilities P(wn|w1, w2, ..., wn-1) have far too many possible histories to estimate, so the model cannot be used in practice;
- Severe data sparsity: a great many word combinations never appear in the corpus, so their maximum likelihood estimates are 0. As a result, the model can handle only a pitifully small number of sentences, and most sentences are assigned a probability of 0.
- Markov hypothesis
To solve the problem of the parameter space being too large, the Markov hypothesis was proposed: the probability of a word appearing depends only on the limited one or several words that appear immediately before it.
If the appearance of a word depends only on the single word immediately before it, we call the model a bigram model (the n-gram model with n = 2):
P(S) = P(w1, w2, w3, ..., wn) = P(w1) P(w2|w1) P(w3|w1,w2) ... P(wn|w1,w2,...,wn-1) ≈ P(w1) P(w2|w1) P(w3|w2) ... P(wn|wn-1)
If the appearance of a word depends only on the two words immediately before it, we call the model a trigram model (n = 3):
P(S) = P(w1, w2, w3, ..., wn) = P(w1) P(w2|w1) P(w3|w1,w2) ... P(wn|w1,w2,...,wn-1) ≈ P(w1) P(w2|w1) P(w3|w1,w2) ... P(wn|wn-2,wn-1)
In general, the n-gram model assumes that the probability of the current word depends only on the n-1 words before it. These probability parameters can be estimated by counting over a large-scale corpus. Models with n greater than 4 are rarely used, because training them requires a much larger corpus, data sparsity becomes severe, time complexity is high, and accuracy improves little.
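As a minimal sketch of how these parameters can be estimated by counting (maximum likelihood), the Python snippet below builds a bigram model from a tiny hand-made corpus; the corpus, function names, and sentence-boundary tokens are illustrative assumptions, not part of any particular library.

```python
from collections import Counter

def train_bigram(sentences):
    """Estimate bigram probabilities P(w_i | w_{i-1}) by maximum-likelihood counting."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        unigram_counts.update(padded[:-1])                     # context counts
        bigram_counts.update(zip(padded[:-1], padded[1:]))     # (prev, word) counts
    return {(prev, w): c / unigram_counts[prev]
            for (prev, w), c in bigram_counts.items()}

def sentence_prob(model, words):
    """P(S) under the bigram (Markov) assumption; unseen bigrams give probability 0."""
    padded = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, w in zip(padded[:-1], padded[1:]):
        p *= model.get((prev, w), 0.0)
    return p

# Toy, already word-segmented corpus (hypothetical).
corpus = [["I", "just", "had", "dinner"],
          ["I", "had", "lunch"]]
model = train_bigram(corpus)
print(sentence_prob(model, ["I", "had", "dinner"]))  # 0.25 on this toy corpus
```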
- Data smoothing
Because of data sparsity, maximum likelihood is not a good way to estimate these parameters for language. The remedy is what we call "smoothing" techniques (see The Beauty of Mathematics).
Data smoothing has two goals:
- to make the probabilities of all n-grams sum to 1;
- to make sure no n-gram has probability 0.
The main strategy is to appropriately reduce the probability mass of events that do occur in the training samples and redistribute it to events that do not appear in the training corpus.
(Details of specific smoothing techniques are omitted here.)
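One of the simplest examples of this idea is add-one (Laplace) smoothing: add 1 to every count, so that probabilities still sum to 1 over the vocabulary and no n-gram ever has probability 0. A minimal sketch (the counts and vocabulary size below are made up for illustration):

```python
from collections import Counter

def laplace_bigram_prob(bigram_counts, unigram_counts, vocab_size, prev, w):
    """P(w | prev) with add-one smoothing: (count(prev, w) + 1) / (count(prev) + |V|)."""
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + vocab_size)

# Hypothetical counts extracted from some corpus.
unigram_counts = Counter({"the": 100, "cat": 10})
bigram_counts = Counter({("the", "cat"): 8})
V = 1000  # assumed vocabulary size

print(laplace_bigram_prob(bigram_counts, unigram_counts, V, "the", "cat"))  # seen bigram
print(laplace_bigram_prob(bigram_counts, unigram_counts, V, "the", "dog"))  # unseen, but still > 0
```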
N-gram Model Applications (brief)
- String distance defined with the n-gram model
The key to fuzzy matching is how to measure the "difference" between two very similar words (or strings), usually referred to as their "distance". Besides the edit distance between two strings (commonly computed with the Needleman-Wunsch or Smith-Waterman algorithm), we can also define an n-gram distance between them.
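One common way to define such an n-gram distance between strings s and t is |G(s)| + |G(t)| − 2·|G(s) ∩ G(t)|, where G(·) is the multiset of character n-grams. A minimal sketch under that definition (the example strings are only illustrative):

```python
from collections import Counter

def char_ngrams(s, n=2):
    """Multiset of character n-grams of s (bigrams by default)."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def ngram_distance(s, t, n=2):
    """|G(s)| + |G(t)| - 2 * |G(s) ∩ G(t)|: the fewer n-grams the strings share,
    the larger the distance (0 means identical n-gram multisets)."""
    gs, gt = char_ngrams(s, n), char_ngrams(t, n)
    common = sum((gs & gt).values())  # size of the multiset intersection
    return sum(gs.values()) + sum(gt.values()) - 2 * common

print(ngram_distance("Gorbachev", "Gorbechyov"))  # small distance for similar spellings
```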
- Using the n-gram model to evaluate whether a sentence is reasonable
From a statistical point of view, a sentence S in natural language can be formed from any string of words, but for most such strings the probability P(S) is tiny. For example:
S1 = 我刚吃过晚饭 ("I just had dinner")
S2 = 刚我过晚饭吃 (the same words in scrambled order)
Obviously, S1 is a fluent and meaningful Chinese sentence and S2 is not, so P(S1) > P(S2).
Another example: given the beginning of a sentence, we can often guess what the following words should be, such as:
她正在认真…… ("She is attentively ...")
Suppose we now have the following corpus, where <s1> and <s2> are sentence-start tags and </s2> and </s1> are sentence-end tags:
<s1><s2>yes no no no no yes</s2></s1>
<s1><s2>no no no yes yes yes no</s2></s1>
Our task below is to evaluate the probability of the following sentence:
<s1><s2>yes no no yes</s2></s1>
Using a trigram model, the relevant conditional probabilities estimated from this corpus are:
P(yes|<s1>,<s2>) = 1/2,  P(no|<s2>,yes) = 1,  P(no|yes,no) = 1/2,  P(yes|no,no) = 2/5,  P(</s2>|no,yes) = 1/2,  P(</s1>|yes,</s2>) = 1
So the probability we want is:
P = 1/2 × 1 × 1/2 × 2/5 × 1/2 × 1 = 0.05
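The same result can be reproduced by counting trigrams directly over the two training sentences; a minimal sketch (treating each tag as a token is an assumption made for illustration):

```python
from collections import Counter

corpus = [["<s1>", "<s2>", "yes", "no", "no", "no", "no", "yes", "</s2>", "</s1>"],
          ["<s1>", "<s2>", "no", "no", "no", "yes", "yes", "yes", "no", "</s2>", "</s1>"]]

trigram_counts = Counter()
context_counts = Counter()
for sent in corpus:
    for a, b, c in zip(sent, sent[1:], sent[2:]):
        trigram_counts[(a, b, c)] += 1
        context_counts[(a, b)] += 1

def p(w, ctx):
    """Maximum-likelihood trigram probability P(w | ctx)."""
    return trigram_counts[(*ctx, w)] / context_counts[ctx]

sentence = ["<s1>", "<s2>", "yes", "no", "no", "yes", "</s2>", "</s1>"]
prob = 1.0
for a, b, c in zip(sentence, sentence[1:], sentence[2:]):
    prob *= p(c, (a, b))
print(prob)  # ≈ 0.05, matching the hand calculation above
```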
- Text classification based on the n-gram model
As long as each class of corpus training their own language model, in essence, each category has a probability distribution, when a new text, as long as the language model according to their respective languages, the probability of the occurrence of this text in each language model, the probability of the text in which model, this text belongs to which category!
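A minimal sketch of this idea, using per-class unigram language models with add-one smoothing for brevity (the class names and training corpora are made up; a real system would use higher-order n-grams and much larger corpora):

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Per-class unigram counts; smoothing is applied at scoring time."""
    counts = Counter(w for s in sentences for w in s)
    return counts, sum(counts.values()), set(counts)

def log_prob(model, words, vocab_size):
    """Add-one-smoothed log-probability of the document under one class's model."""
    counts, total, _ = model
    return sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in words)

# Hypothetical training corpora for two classes.
sports = [["match", "team", "score"], ["team", "wins", "match"]]
finance = [["stock", "market", "price"], ["price", "rises", "stock"]]
models = {"sports": train_unigram(sports), "finance": train_unigram(finance)}
vocab_size = len(models["sports"][2] | models["finance"][2])

doc = ["team", "score", "rises"]
best = max(models, key=lambda c: log_prob(models[c], doc, vocab_size))
print(best)  # the class whose language model gives the document the highest probability
```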
- Application of n-grams in language identification
To decide which language a new document is written in, we first build an n-gram profile of the document and then compute the distance between this document profile and each language's profile, using the "out-of-place measure" between two profiles. The language whose profile has the shortest distance is taken to be the language of the document. A threshold can also be introduced: if even the shortest distance exceeds the threshold, the system reports that the document's language cannot be determined, rather than risk a wrong judgment.
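A minimal sketch of n-gram profiles and the out-of-place measure, in the spirit of Cavnar and Trenkle's method (the sample texts, profile size, and penalty for missing n-grams are illustrative assumptions):

```python
from collections import Counter

def profile(text, n_max=3, top_k=300):
    """Ranked list of the most frequent character 1..n_max-grams (an n-gram profile)."""
    grams = Counter()
    for n in range(1, n_max + 1):
        grams.update(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top_k)]

def out_of_place(doc_profile, lang_profile):
    """Sum, over the document's n-grams, of how far each one's rank is from its rank
    in the language profile; n-grams missing from the language profile get a fixed
    maximum penalty."""
    rank = {g: r for r, g in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(abs(r - rank[g]) if g in rank else max_penalty
               for r, g in enumerate(doc_profile))

# Hypothetical language profiles built from tiny sample texts.
lang_profiles = {"en": profile("the quick brown fox jumps over the lazy dog"),
                 "de": profile("der schnelle braune fuchs springt ueber den faulen hund")}
doc_profile = profile("the dog jumps over the fox")
# The predicted language is the one whose profile is closest to the document's.
print(min(lang_profiles, key=lambda lang: out_of_place(doc_profile, lang_profiles[lang])))
```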
- Speech recognition example
nixianzaizaiganshenme → 你现在在干什么？("What are you doing now?") or 你西安载感什么？(a meaningless string of the same syllables)
The two candidate transcriptions have exactly the same pronunciation. If we use a language model to compute the probability of each sentence, we find that
P("你"|"<s>","<s>") P("现在"|"<s>","你") P("在"|"你","现在") P("干什么"|"现在","在")
is much larger than
P("你"|"<s>","<s>") P("西安"|"<s>","你") P("载"|"你","西安") P("感"|"西安","载") P("什么"|"载","感")
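As a rough sketch of that decision, a decoder simply scores every candidate word sequence with the language model and keeps the best one; the log-probabilities below are made-up placeholders rather than values estimated from a real corpus:

```python
import math

# Hypothetical trigram log-probabilities for the two candidate segmentations;
# in practice these would come from a model trained on a large corpus.
candidates = {
    "你 现在 在 干什么": [math.log(0.2), math.log(0.3), math.log(0.4), math.log(0.3)],
    "你 西安 载 感 什么": [math.log(0.2), math.log(1e-5), math.log(1e-6),
                           math.log(1e-6), math.log(1e-5)],
}

def score(logprobs):
    """log P(S): the sum of the per-word log-probabilities along the sentence."""
    return sum(logprobs)

best = max(candidates, key=lambda s: score(candidates[s]))
print(best)  # the fluent reading wins under these placeholder probabilities
```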
Hidden Markov model
Concept