**Source: http://blog.csdn.net/lanxu_yy/article/details/29918015**

**Why do we need a language model?**
Imagine a speech-recognition scenario: a machine converts speech to text through some algorithm, and this process is clearly error-prone. For example, when a user says "recognize speech", the machine may correctly transcribe it as "recognize speech", but it may also mistakenly transcribe it as "wreck a nice beach". Lexical analysis alone cannot tell us which transcription is correct, and the computer does not understand grammar, so how should we handle this problem? A simple and practical approach is to use a statistical method (a Markov chain) to estimate the probability that each candidate transcription is correct.

**What is a language model?**
First, we define a finite dictionary V, e.g. V = {the, a, man, telescope, ...}. By concatenating words from the dictionary any number of times, we obtain an infinite set of strings S, which may contain:

1. the
2. a
3. the man
4. the man walks

...

Second, suppose we have a training dataset containing many articles. By counting how many times each sentence x appears in the dataset, C(x), and the total number of sentences in the dataset, N, we can estimate the frequency of each sentence: for x ∈ S, p(x) = C(x)/N, and clearly Σp(x) = 1.
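As a rough sketch of this counting (using a tiny made-up corpus in place of a real training set):

```python
from collections import Counter

# Toy "training set": each element is one sentence (a tuple of words).
corpus = [
    ("the", "man", "walks"),
    ("the", "man", "walks"),
    ("a", "man", "walks"),
    ("the", "dog", "barks"),
]

counts = Counter(corpus)  # C(x): occurrences of each sentence
N = len(corpus)           # total number of sentences

def p(sentence):
    """Empirical sentence frequency p(x) = C(x) / N."""
    return counts[sentence] / N

print(p(("the", "man", "walks")))  # 2/4 = 0.5
# The frequencies sum to 1 over the sentences seen in the corpus:
print(sum(counts[s] / N for s in counts))  # 1.0
```

Note that any sentence not in the corpus gets p(x) = 0 here, which is exactly the sparsity problem discussed next.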

In summary, we can identify a few problems:

1) The language model above exists only in theory: only when the training dataset is infinitely large do the frequencies measured on the dataset approach the true probabilities of the language;

2) For most sentences in S, p(x) will equal 0, so the model over S is extremely sparse and difficult to store.

**Markov chains**
Since the simple language model above is imperfect, we naturally look for other ways to build a language model; one of the better-known tools is the Markov chain. A sentence of length n can be represented by a sequence of random variables X1, X2, ..., Xn, where each xk ∈ V. Our goal is then to compute P(X1 = x1, X2 = x2, ..., Xn = xn).

By the chain rule, P(X1 = x1, X2 = x2, ..., Xn = xn) = P(X1 = x1) * P(X2 = x2 | X1 = x1) * P(X3 = x3 | X1 = x1, X2 = x2) * ... * P(Xn = xn | X1 = x1, X2 = x2, ..., Xn-1 = xn-1). When n is large, these conditional probabilities become extremely complex. Can we find an approximation that makes them easy to compute? The answer is yes: we assume that the random variable for each word depends only on the preceding k random variables.

- First-order Markov chains

In a first-order Markov chain we assume the random variable for each word depends only on the previous random variable, so the expression above simplifies to P(X1 = x1, X2 = x2, ..., Xn = xn) = P(X1 = x1) * P(X2 = x2 | X1 = x1) * P(X3 = x3 | X2 = x2) * ... * P(Xn = xn | Xn-1 = xn-1) = P(X1 = x1) * ∏P(Xk = xk | Xk-1 = xk-1)

- Second-order Markov chains

In a second-order Markov chain we assume the random variable for each word depends only on the previous two random variables, so the expression above simplifies to P(X1 = x1, X2 = x2, ..., Xn = xn) = P(X1 = x1) * P(X2 = x2 | X1 = x1) * P(X3 = x3 | X1 = x1, X2 = x2) * ... * P(Xn = xn | Xn-2 = xn-2, Xn-1 = xn-1) = P(X1 = x1) * P(X2 = x2 | X1 = x1) * ∏P(Xk = xk | Xk-2 = xk-2, Xk-1 = xk-1)

Usually the length n is not fixed, and for notational convenience we make two small adjustments: 1) add a start symbol "*" and define every sentence to begin with "*", i.e. x-1 = x0 = *; 2) add an end symbol "STOP" and define every sentence to end with "STOP". With these conventions, the expressions simplify to:

First-order Markov chain: P(X1 = x1, X2 = x2, ..., Xn = xn) = ∏P(Xk = xk | Xk-1 = xk-1)

Second-order Markov chain: P(X1 = x1, X2 = x2, ..., Xn = xn) = ∏P(Xk = xk | Xk-2 = xk-2, Xk-1 = xk-1)

**The second-order Markov language model**
With a second-order Markov chain, we can redefine the language model:

1) a finite dictionary V;

2) for each trigram (three consecutive words) a parameter q(w | u, v), where w ∈ V ∪ {STOP} and u, v ∈ V ∪ {*};

3) for any sentence x1, x2, ..., xn, where x-1 = x0 = *, xn = STOP, and xk ∈ V for k = 1, 2, ..., n-1, the probability of the sentence is P(x1, x2, ..., xn) = ∏q(xk | xk-2, xk-1).

For example, for the sentence "the dog barks STOP", we have: P(the dog barks STOP) = q(the | *, *) * q(dog | *, the) * q(barks | the, dog) * q(STOP | dog, barks)
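A minimal sketch of this factorization (the q table below is made up purely for illustration, not estimated from any corpus):

```python
# Hypothetical trigram parameters q(w | u, v); values are illustrative only.
q = {
    ("*", "*", "the"): 0.5,
    ("*", "the", "dog"): 0.4,
    ("the", "dog", "barks"): 0.3,
    ("dog", "barks", "STOP"): 0.6,
}

def sentence_prob(words):
    """P(x1..xn) = product over k of q(xk | xk-2, xk-1), with x-1 = x0 = '*'."""
    prob = 1.0
    u, v = "*", "*"
    for w in words:
        prob *= q.get((u, v, w), 0.0)  # unseen trigrams get probability 0
        u, v = v, w
    return prob

print(sentence_prob(["the", "dog", "barks", "STOP"]))  # 0.5*0.4*0.3*0.6 ≈ 0.036
```

Any sentence containing a trigram absent from the table gets probability 0, which motivates the smoothing discussed below.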

**Estimating the second-order Markov language model**
Estimating the second-order Markov language model looks like a simple counting problem: count the occurrences of each sequence of three consecutive words in the training dataset, C(u, v, w), and of each sequence of two consecutive words, C(u, v); then q(w | u, v) = C(u, v, w)/C(u, v).
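A rough sketch of this counting on a toy dataset, with sentences padded using the "*" and "STOP" symbols defined earlier:

```python
from collections import defaultdict

# Toy training sentences (illustrative only).
sentences = [
    ["the", "dog", "barks"],
    ["the", "dog", "walks"],
]

bigram_counts = defaultdict(int)   # C(u, v)
trigram_counts = defaultdict(int)  # C(u, v, w)

for s in sentences:
    padded = ["*", "*"] + s + ["STOP"]
    for i in range(2, len(padded)):
        u, v, w = padded[i - 2], padded[i - 1], padded[i]
        bigram_counts[(u, v)] += 1
        trigram_counts[(u, v, w)] += 1

def q(w, u, v):
    """Maximum-likelihood estimate q(w | u, v) = C(u, v, w) / C(u, v)."""
    if bigram_counts[(u, v)] == 0:
        return 0.0
    return trigram_counts[(u, v, w)] / bigram_counts[(u, v)]

print(q("barks", "the", "dog"))  # C(the, dog, barks) / C(the, dog) = 1/2
```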

So far the algorithm looks perfect: given a good training dataset, we can train the language model. But we have already mentioned one problem, namely the sparsity of the training data. Only with an infinitely large dataset can we guarantee that all possible sentences are covered by the language model; otherwise the model is unreliable. Worse, if even a single trigram (three consecutive words) has q(w | u, v) = 0, the probability of the whole sentence catastrophically collapses to 0.

In real life we cannot obtain an infinite training dataset, so we want an approximate method that keeps q(w | u, v) nonzero for every trigram (three consecutive words) while staying as close as possible to the true q(w | u, v). A common approach is to combine unigram (one-word), bigram (two-word), and trigram (three-word) estimates.

Unigram: q'(w) = C(w)/C(), where C() is the total number of words in the training set

Bigram: q'(w | v) = C(v, w)/C(v)

Trigram: q'(w | u, v) = C(u, v, w)/C(u, v)

We define q(w | u, v) = k1 * q'(w | u, v) + k2 * q'(w | v) + k3 * q'(w), where k1 + k2 + k3 = 1 and ki >= 0.
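A sketch of this linear interpolation, using hypothetical count tables and example weights (all values below are made up for illustration):

```python
# Hypothetical count tables; in practice these come from the training set.
unigram_counts = {"the": 4, "dog": 2, "barks": 1, "walks": 1}
bigram_counts = {("the", "dog"): 2, ("dog", "barks"): 1}
trigram_counts = {("the", "dog", "barks"): 1}
total_words = sum(unigram_counts.values())  # C() = 8

def ratio(num, den):
    """Return num/den, or 0 when the denominator count is 0."""
    return num / den if den else 0.0

def interpolated_q(w, u, v, k1=0.6, k2=0.3, k3=0.1):
    """q(w|u,v) = k1*q'(w|u,v) + k2*q'(w|v) + k3*q'(w)."""
    q_tri = ratio(trigram_counts.get((u, v, w), 0), bigram_counts.get((u, v), 0))
    q_bi = ratio(bigram_counts.get((v, w), 0), unigram_counts.get(v, 0))
    q_uni = unigram_counts.get(w, 0) / total_words
    return k1 * q_tri + k2 * q_bi + k3 * q_uni

# 0.6*(1/2) + 0.3*(1/2) + 0.1*(1/8) = 0.4625
print(interpolated_q("barks", "the", "dog"))
```

Because the unigram term is nonzero for every word in the vocabulary, the interpolated q(w | u, v) never collapses to 0 for an unseen trigram.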

First, we verify that Σq(w | u, v) = 1 (summing over w): Σq(w | u, v) = Σ[k1 * q'(w | u, v) + k2 * q'(w | v) + k3 * q'(w)] = k1 * Σq'(w | u, v) + k2 * Σq'(w | v) + k3 * Σq'(w) = k1 * 1 + k2 * 1 + k3 * 1 = k1 + k2 + k3 = 1.

Second, how should we determine k1, k2, and k3? Using a cross-entropy-style criterion, we can make q(w | u, v) as close as possible to the statistics of held-out data. Suppose C'(u, v, w) is the number of occurrences of each trigram (three consecutive words) counted on the held-out set, and define L(k1, k2, k3) = ΣC'(u, v, w) * log[q(w | u, v)]. The values of k1, k2, k3 at which L(k1, k2, k3) attains its maximum are the ones we choose.
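One simple (if crude) way to find these weights is a coarse grid search over the simplex; the held-out counts C'(u, v, w) and the q' estimates below are made-up toy values:

```python
import math

# Toy held-out trigram counts C'(u, v, w) and hypothetical precomputed
# q' estimates; all values here are illustrative, not from a real corpus.
held_out = {("the", "dog", "barks"): 3, ("the", "dog", "walks"): 1}
q_tri = {("the", "dog", "barks"): 0.7, ("the", "dog", "walks"): 0.1}
q_bi = {("dog", "barks"): 0.5, ("dog", "walks"): 0.2}
q_uni = {"barks": 0.2, "walks": 0.3}

def log_likelihood(k1, k2, k3):
    """L(k1,k2,k3) = sum over trigrams of C'(u,v,w) * log q(w|u,v)."""
    total = 0.0
    for (u, v, w), c in held_out.items():
        q = k1 * q_tri[(u, v, w)] + k2 * q_bi[(v, w)] + k3 * q_uni[w]
        total += c * math.log(q)
    return total

# Coarse grid search over the simplex k1 + k2 + k3 = 1, ki >= 0.
best, best_ll = None, float("-inf")
for i in range(21):
    for j in range(21 - i):
        k1, k2 = i / 20, j / 20
        k3 = 1.0 - k1 - k2
        ll = log_likelihood(k1, k2, k3)
        if ll > best_ll:
            best, best_ll = (k1, k2, k3), ll

print(best, best_ll)
```

In practice the weights are usually found with a proper optimizer (e.g. the EM algorithm) rather than a grid, but the objective being maximized is the same.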

**Evaluating the second-order Markov language model: perplexity**

Suppose we have a test dataset of M sentences, where each sentence si is assigned probability p(si) by the model, so the probability of the whole test set is ∏p(si). Taking logarithms gives log∏p(si) = Σlog[p(si)]. Perplexity is then defined as perplexity = 2^(-l), where l = (1/M) * Σlog2[p(si)] (compare the definition of entropy).
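A sketch of the perplexity computation, following the definition above (the sentence probabilities are made-up values, just to exercise the formula):

```python
import math

# Hypothetical model probabilities p(s_i) for M test sentences.
sentence_probs = [0.1, 0.02, 0.05, 0.2]
M = len(sentence_probs)

# l = (1/M) * sum of log2 p(s_i); perplexity = 2^(-l)
l = sum(math.log2(p) for p in sentence_probs) / M
perplexity = 2 ** (-l)
print(perplexity)
```

Equivalently, perplexity is the inverse geometric mean of the sentence probabilities, so lower perplexity means the model assigns higher probability to the test set.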

A few intuitive examples:

1) if q(w | u, v) = 1/M for every word, then perplexity = M;

2) on a dataset with |V| = 50000, a trigram model reaches perplexity = 74;

3) on a dataset with |V| = 50000, a bigram model reaches perplexity = 137;

4) on a dataset with |V| = 50000, a unigram model reaches perplexity = 955.
