Statistical Language Models

Source: Internet
Author: User
Learning notes from the Stanford open class on Natural Language Processing (https://class.coursera.org/nlp/), based mainly on the course handouts, with my own understanding added to deepen the learning impression.
Content outline:
1. N-gram introduction
2. Parameter estimation
3. Evaluation of language models
4. The data sparseness problem
5. Smoothing methods

N-gram Introduction
In many applications we need to compute the probability of a sentence: to decide whether a sentence is reasonable, we look at how probable it is. For example:

Machine translation: P(high winds tonite) > P(large winds tonite)
Spelling correction: for "The office is about fifteen minuets from my house", clearly P(about fifteen minutes from) > P(about fifteen minuets from)
Speech recognition: "I saw a van" and "eyes awe of an" sound similar, but P(I saw a van) >> P(eyes awe of an)
In each of these examples we need the probability of a sentence as the basis for judging whether it is reasonable. Stated formally, we want to compute the probability of a sentence or word sequence W:

P(W) = P(w1, w2, w3, w4, w5, ..., wn)

A related task is computing conditional probabilities such as P(w5 | w1, w2, w3, w4), the probability that w5 follows w1 w2 w3 w4, i.e. the probability of the next word. A model that computes P(W) or P(wn | w1, w2, ..., wn-1) is called a language model (LM).
So how do we compute P(W)? With the chain rule of probability, which is often used to decompose the joint probability of random variables into a product of conditionals:

P(x1, x2, ..., xn) = P(x1) P(x2 | x1) P(x3 | x1, x2) ... P(xn | x1, ..., xn-1)
Applying the chain rule, P(W) can be written as:

P(w1 w2 ... wn) = ∏_i P(wi | w1 w2 ... wi-1)
For example:

P("its water is so transparent") = P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so)

The question is then how to compute each of the probabilities above, such as P(transparent | its water is so). The most intuitive method is to count occurrences and divide:

P(transparent | its water is so) = Count(its water is so transparent) / Count(its water is so)
In fact we cannot compute the conditional probabilities this way, for two reasons:
1. The parameter space is too large. A language model's parameters are all of these conditional probabilities. To compute P(w5 | w1, w2, w3, w4) this way, each wi can take any value in the dictionary, whose size we write |V|, so this table alone has |V|^5 parameters, and that does not even count those for P(w4 | w1, w2, w3). Conditioning on the full history makes the number of parameters far too large to be useful.
2. The data is severely sparse. My understanding is that with counting like the above, a numerator such as "its water is so transparent" occurs very rarely in any text we can observe, so far too many conditional probabilities come out equal to 0, simply because we have not seen enough text to count.
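To make the count-and-divide estimate and its sparsity problem concrete, here is a minimal Python sketch (the function name and toy corpus are mine, not from the course): it estimates P(word | history) by scanning a token list for exact occurrences of the full history.

```python
def naive_cond_prob(tokens, history, word):
    """Estimate P(word | history) by counting full-history matches:
    Count(history + word) / Count(history)."""
    h = list(history)
    num = den = 0
    for i in range(len(tokens) - len(h)):
        if tokens[i:i + len(h)] == h:
            den += 1
            if tokens[i + len(h)] == word:
                num += 1
    return num / den if den else 0.0

corpus = "its water is so transparent that you can see the bottom".split()
print(naive_cond_prob(corpus, ["its", "water", "is", "so"], "transparent"))  # 1.0
print(naive_cond_prob(corpus, ["the", "water", "is", "so"], "transparent"))  # 0.0 - history never seen
```

Even in a very large corpus most four-word histories never occur, so most of these estimates collapse to 0, which is exactly the sparsity problem described above.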
The Markov assumption simplifies this calculation: it assumes the word wi depends only on the k words immediately before it, so the conditional probability above simplifies to:

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-k, ..., wi-1)
The calculation of P(W) then simplifies to:

P(w1 w2 ... wn) ≈ ∏_i P(wi | wi-k, ..., wi-1)
When k = 0, the corresponding model is called the unigram model: wi depends on the 0 words in front of it, i.e. on no word at all, and the words are mutually independent. P(W) is computed as:

P(w1 w2 ... wn) ≈ ∏_i P(wi)
When k = 1, the corresponding model is called the bigram model: wi depends on the one word in front of it, and P(W) is computed as:

P(w1 w2 ... wn) ≈ ∏_i P(wi | wi-1)
Similarly, k = 2 gives the trigram model, and then 4-grams, 5-grams, and so on; in general, when k = n-1 the model is an n-gram model, i.e. N-grams.
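To pin down the terminology, a small sketch (the function name is mine) that extracts the n-grams of a token sequence:

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("I saw a van".split(), 2))
# [('I', 'saw'), ('saw', 'a'), ('a', 'van')]
```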
In general, n-gram models have an inherent drawback, because language contains long-distance dependencies. Consider the following sentence:
"The computer which I had just put in the machine room on the fifth floor crashed."
If we want to predict the probability of the last word, crashed, a bigram model ties crashed only to floor, with which its real association is very weak; on the contrary, crashed is strongly related to the subject of the sentence, computer, but n-grams cannot capture that information.



Parameter Estimation
The conditional probabilities in the model are also called the model parameters, and the process of obtaining them is called training. With maximum likelihood estimation, the bigram conditional probabilities are computed as:

P(wi | wi-1) = C(wi-1, wi) / C(wi-1)
where C(wi-1, wi) is the number of times the pair wi-1 wi occurs and C(wi-1) is the number of times wi-1 occurs; C is the first letter of Count. Consider a small example text:

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
Here <s> and </s> represent the beginning and end of a sentence (the s stands for start). The bigram model computed from this text is as follows:

For example, P(I | <s>), the probability that a sentence begins with I, is calculated as:

P(I | <s>) = C(<s>, I) / C(<s>) = 2/3 ≈ 0.67

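The maximum-likelihood computation above can be sketched in a few lines of Python; the corpus is the same three-sentence example, and the variable names are my own:

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]
unigrams, bigrams = Counter(), Counter()
for line in corpus:
    toks = line.split()
    unigrams.update(toks[:-1])               # history counts C(w) (final </s> is never a history)
    bigrams.update(zip(toks, toks[1:]))      # pair counts C(prev, w)

def p(w, prev):
    """MLE estimate P(w | prev) = C(prev, w) / C(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(round(p("I", "<s>"), 2))   # 2 of the 3 sentences start with I -> 0.67
```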
Here is another example, this time from a larger corpus. To compute the parameters of the corresponding bigram model, i.e. P(wi | wi-1), we first count C(wi-1, wi), then count C(wi-1), and divide to get the conditional probabilities. Note that for C(wi-1, wi), wi-1 can take any of the values in the corpus dictionary (written |V|), and so can wi, so there are |V|^2 counts C(wi-1, wi) to compute. The results of the C(wi-1, wi) counts are as follows:


The results of the C(wi-1) counts are as follows:



The parameters of the bigram model are then computed as follows:


For example, P(want | i) = 0.33 is calculated as:

P(want | i) = C(i, want) / C(i) = 827 / 2533 ≈ 0.33
Once the bigram model for this corpus has been built, we can compute our goal, the probability of a sentence, for example:
P(<s> I want english food </s>) = P(I | <s>) × P(want | I) × P(english | want) × P(food | english) × P(</s> | food) = 0.000031
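As a sanity check on the quoted result, multiplying out the chain-rule factors does reproduce ≈0.000031. Note that the individual factor values below are assumed from the course slides; they are not reproduced in these notes:

```python
# Assumed per-bigram probabilities (from the course slides, not these notes):
factors = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,
    ("food", "</s>"): 0.68,
}
prob = 1.0
for f in factors.values():
    prob *= f                 # chain-rule product of bigram factors
print(f"{prob:.6f}")          # 0.000031
```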
Let's look at some of the practical information the bigram model actually captures:


In practice, probabilities are usually computed in log space, for two reasons:
1. To prevent underflow: if the sentence is very long, the final product becomes extremely small and can underflow. For example, a probability of 0.001 has a base-10 logarithm of -3, and a sum of such logarithms will not underflow.
2. Addition in log space replaces multiplication, since log(p1 p2) = log p1 + log p2, and on a computer addition is clearly faster than multiplication.
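A quick sketch of the underflow argument (the example numbers are mine): raising a small per-word probability to a large power underflows a double, while the corresponding log-space sum stays a modest number.

```python
import math

p, n = 0.001, 200            # e.g. a 200-word sentence, each word with probability 0.001
direct = p ** n              # 10**-600 is below the smallest positive double: underflows to 0.0
logspace = n * math.log10(p) # stays a modest number: -600.0
print(direct, logspace)
```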
For building n-gram models, the lecturer recommended the open source toolkit SRILM (http://www.speech.sri.com/projects/srilm/), and the open N-gram dataset: http://ngrams.googlelabs.com/



Evaluation of language models
Having built a language model, we generally estimate its parameters on a training set and measure its performance on a test set; so how do we judge how good a language model is? To compare two models A and B, one approach is extrinsic evaluation: plug A and B into a concrete task and measure each model's accuracy there. This is certainly the best method, but its drawback is that it is time-consuming; in practice it often takes far too long to get results. The other approach is the perplexity described below, though note that perplexity is not a good approximation of extrinsic evaluation, so it is generally used only in pilot experiments, i.e. small-scale preliminary studies that evaluate some aspect of performance.
Perplexity
The basic idea of the evaluation is that a model that assigns higher probability to the test set is better. The perplexity PP of a sentence W is defined as follows:

PP(W) = P(w1 w2 ... wn)^(-1/n)
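The definition can be sketched directly (the function name is mine), computed in log space to avoid the underflow discussed earlier:

```python
import math

def perplexity(word_probs):
    """PP(W) = P(w1..wN) ** (-1/N), computed via the average log probability."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# A model that assigns every word probability 1/10 has perplexity 10,
# regardless of sentence length:
print(perplexity([0.1] * 20))
```

Lower perplexity means the model assigns higher probability to the test set.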