http://52opencourse.com/111/ (Stanford University, "Language Models (Language Modeling)", Class IV of Natural Language Processing)

**I. Introduction of the Course**

Stanford University launched an online natural language processing course on Coursera in March 2012, taught by NLP heavyweights Dan Jurafsky and Chris Manning:

https://class.coursera.org/nlp/

The following are my study notes for this course, based mainly on the course ppt/pdf slides, supplemented by other reference materials and my own understanding and annotations. Everyone is welcome to study along at "I Love Open Courses".

Courseware summary: Stanford University Natural Language Processing open course courseware summary

**II. Language Models (Language Model)**

**1) N-gram Introduction**

In practical applications, we often need to solve problems of the following kind: how do we calculate the probability of a sentence? For example:

**Machine translation:** P(high winds tonite) > P(large winds tonite)
**Spelling correction:** P(about fifteen minutes from) > P(about fifteen minuets from)
**Speech recognition:** P(I saw a van) >> P(eyes awe of an)
**Pinyin-to-character conversion:** P(你现在干什么 | nixianzaiganshenme) > P(你西安在干什么 | nixianzaiganshenme), i.e. "what are you doing now" beats "what are you doing in Xi'an" for the same pinyin input
**Automatic summarization, question answering systems**, ...

The problems above can be formalized as follows:

P(S) = P(w1, w2, w3, w4, w5, ..., wn)

     = P(w1) P(w2|w1) P(w3|w1,w2) ... P(wn|w1,w2,...,wn-1)   // chain rule

P(S) is called the language model: the model used to calculate the probability of a sentence.

So how do we calculate P(wi|w1,w2,...,wi-1)? The simplest and most direct method is to count and divide:

P(wi|w1,w2,...,wi-1) = count(w1,w2,...,wi-1,wi) / count(w1,w2,...,wi-1)

However, this runs into two serious problems: the data is sparse and the parameter space is far too large, so it cannot be applied directly.

Instead we rely on the Markov assumption: the occurrence of the next word depends only on the one or few words immediately before it.

- Assuming the next word depends only on the single word before it, we have:

P(S) = P(w1) P(w2|w1) P(w3|w1,w2) ... P(wn|w1,w2,...,wn-1)

     = P(w1) P(w2|w1) P(w3|w2) ... P(wn|wn-1)   // bigram

- Assuming the next word depends only on the two words before it, we have:

P(S) = P(w1) P(w2|w1) P(w3|w1,w2) ... P(wn|w1,w2,...,wn-1)

     = P(w1) P(w2|w1) P(w3|w1,w2) ... P(wn|wn-2,wn-1)   // trigram

So when facing a practical problem, how do we choose the number of words to condition on, i.e. N?

- Larger N: the next word is constrained by more context information, giving greater **discriminative power**;
- Smaller N: each n-gram appears more often in the training corpus, giving more reliable statistics and higher **reliability**.

In theory, the larger N the better; in practice, trigrams are used the most. Nevertheless, as a rule of thumb: **if bigram solves the problem, never use trigram.**

**2) Constructing the language model**

Typically, the language model is constructed by maximum likelihood estimation (Maximum Likelihood Estimate), which gives the best fit to the training data. Taking bigram as an example, the formula is as follows:

P(wi|wi-1) = count(wi-1, wi) / count(wi-1)

For the given sentence set:

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

Part of the bigram language model is as follows.

The unigram counts C(wi) are as follows:

The bigram counts C(wi-1, wi) are as follows:

Then the bigram probabilities are:

So the probability of the sentence "**<s> I want english food </s>**" is:

P(<s> I want english food </s>)
= P(I|<s>)
  × P(want|I)
  × P(english|want)
  × P(food|english)
  × P(</s>|food)
= 0.000031

To avoid numerical underflow and improve performance, it is common to take logs and replace multiplication with addition:

log(p1 × p2 × p3 × p4) = log(p1) + log(p2) + log(p3) + log(p4)
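As a concrete sketch of the counting and the log trick (my own illustration using the toy corpus from this section, not code from the course), a bigram MLE model can be built in a few lines:

```python
import math
from collections import Counter

sentences = [
    "<s> i am sam </s>",
    "<s> sam i am </s>",
    "<s> i do not like green eggs and ham </s>",
]

# Collect unigram and bigram counts from the toy corpus
unigrams, bigrams = Counter(), Counter()
for s in sentences:
    toks = s.split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

def p_mle(w, prev):
    """P(w | prev) = C(prev, w) / C(prev)  (maximum likelihood estimate)."""
    return bigrams[(prev, w)] / unigrams[prev]

def sentence_logprob(sentence):
    """Sum of log probabilities: addition instead of multiplication."""
    toks = sentence.split()
    return sum(math.log(p_mle(w, prev)) for prev, w in zip(toks, toks[1:]))

print(p_mle("i", "<s>"))   # C(<s>, i)/C(<s>) = 2/3
print(p_mle("am", "i"))    # C(i, am)/C(i)   = 2/3
print(math.exp(sentence_logprob("<s> sam i am </s>")))  # 1/3 * 1/2 * 2/3 * 1/2 = 1/18
```

Estimating P(i|<s>) = 2/3 here matches the slide example; any bigram absent from the counts would get probability 0, which is exactly the sparseness problem addressed by smoothing below.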

Recommended open source language model tools:

- SRILM (http://www.speech.sri.com/projects/srilm/)
- IRSTLM (http://hlt.fbk.eu/en/irstlm)
- MITLM (http://code.google.com/p/mitlm/)
- BerkeleyLM (http://code.google.com/p/berkeleylm/)

Recommended Open source N-gram datasets:

- Google Web 1T 5-gram (http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html)

  - Total number of tokens: 1,306,807,412,486
  - Total number of sentences: 150,727,365,731
  - Total number of unigrams: 95,998,281
  - Total number of bigrams: 646,439,858
  - Total number of trigrams: 1,312,972,925
  - Total number of fourgrams: 1,396,154,236
  - Total number of fivegrams: 1,149,361,413
  - Total number of n-grams: 4,600,926,713

- Google Books N-grams (http://books.google.com/ngrams/)
- Chinese Web 5-gram (http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2010T06)

**3) Language Model evaluation**

After a language model has been constructed, how do we judge its quality? There are two main evaluation methods at present:

- Practical method: evaluate by looking at the model's performance in an actual application (such as spell checking or machine translation); the advantage is that it is intuitive and practical, the disadvantage is a lack of focus and objectivity;
- Theoretical method: perplexity, whose basic idea is that the language model which assigns the test set a higher probability is the better one. The formula is as follows:

PP(W) = P(w1, w2, ..., wN)^(-1/N)

As the formula shows, the smaller the perplexity, the larger the sentence probability and the better the language model. Using Wall Street Journal training data of 38 million words, n-gram language models were constructed; on a test set of 1.5 million words, the perplexities are shown in the following table:
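The perplexity computation above can be sketched in a few lines (my own illustration; the function name and its per-token probability input are assumptions, not code from the course). Working in log space keeps the product from underflowing:

```python
import math

def perplexity(token_probs):
    """Perplexity = P(w1..wN)^(-1/N), computed via log2 for numerical stability."""
    n = len(token_probs)
    log2_total = sum(math.log2(p) for p in token_probs)
    return 2 ** (-log2_total / n)

# A model assigning each of 10 test tokens probability 1/4 is exactly as
# "confused" as a fair 4-way choice at every step:
print(perplexity([0.25] * 10))  # 4.0
```

A lower value means the model spreads less probability mass over wrong continuations, matching the "smaller perplexity = better model" rule stated above.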

**4) Data sparseness and smoothing techniques**

The data sparseness problem inevitably arises when large-scale statistical methods meet a limited training corpus, producing zero-probability events, in keeping with the classic Zipf's law. For example, when IBM's Brown trained a trigram model on a 366M-word English corpus, 14.7% of the trigrams and 2.2% of the bigrams in the test corpus had never appeared in the training corpus.

Definition of the data sparseness problem: "The problem of data sparseness, also known as the zero-frequency problem, arises when analyses contain configurations that never occurred in the training corpus. Then it is not possible to estimate probabilities from observed frequencies, and some other estimation scheme that can generalize (from those configurations) from the training data has to be used." (Dagan)

Many attempts and efforts have been made to make the theoretical model practical, giving birth to a series of classic smoothing techniques. Their basic idea is to "slightly reduce the probability of the n-grams that do occur, so that the n-grams that do not occur receive a probability greater than 0", while guaranteeing that the probabilities still sum to 1 after smoothing. In detail:

**Add-one (Laplace) Smoothing**

The add-one smoothing method, also known as Laplace's law, guarantees that every n-gram is treated as appearing at least once in the training corpus. Taking bigram as an example, the formula is as follows:

P_add1(wi|wi-1) = (C(wi-1, wi) + 1) / (C(wi-1) + V)

where V is the size of the vocabulary, i.e. the number of distinct word types.

Continuing the example from the previous section, after add-one smoothing the counts C(wi-1, wi) are as follows:

Then the bigram probabilities are:

When V >> C(wi-1), i.e. when most n-grams do not appear in the training corpus (which is usually the case), the added counts of add-one smoothing crowd out the observed ones and the results are poor. Extensions of the method, such as Lidstone's law and the Jeffreys-Perks law, can mitigate this problem.
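The add-one formula can be sketched as follows (my own illustration on the toy corpus from section 2; the function name is made up):

```python
from collections import Counter

def make_addone_bigram(sentences):
    """Return an add-one-smoothed bigram probability function (a sketch)."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = s.split()
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    V = len(unigrams)  # V: vocabulary size (number of distinct word types)
    def p(w, prev):
        # P_add1(w | prev) = (C(prev, w) + 1) / (C(prev) + V)
        return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)
    return p

p = make_addone_bigram([
    "<s> i am sam </s>",
    "<s> sam i am </s>",
    "<s> i do not like green eggs and ham </s>",
])
print(p("i", "<s>"))  # seen bigram: (2+1)/(3+12) = 0.2
print(p("sam", "i"))  # unseen bigram, no longer zero: (0+1)/(3+12) = 1/15
```

Note how the seen bigram's probability drops from the MLE value 2/3 to 0.2: with V = 12 word types and only 3 occurrences of the context, the added mass dominates, which is exactly the weakness described above.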

**Good-Turing Smoothing**

The basic idea is to use frequency-of-frequency information to smooth the raw counts: an n-gram observed c times is given the adjusted count c*:

c* = (c + 1) × N(c+1) / N(c)

where N(c) is the number of n-gram types that occur exactly c times.

However, when N(c+1) = 0, or when N(c) < N(c+1), the adjusted model becomes worse, as shown in:

A direct improvement strategy is "do not smooth n-grams whose number of occurrences exceeds a certain threshold, with the threshold generally set to 8~10"; for other refinements, see "Simple Good-Turing".

**Linear Interpolation Smoothing**

Neither add-one nor Good-Turing smoothing distinguishes among the unseen n-grams: they all receive the same probability, which is inevitably unreasonable, since different events occur with different probabilities. This motivates linear interpolation smoothing, whose basic idea is to combine the higher-order model and the lower-order models in a linear combination: the higher-order n-gram model is linearly interpolated with the lower-order n-gram models. When there is not enough data to estimate the probabilities of the higher-order n-gram model, the lower-order models can often provide useful information. The formula is as follows:

P_interp(wn|wn-2, wn-1) = λ1 P(wn|wn-2, wn-1) + λ2 P(wn|wn-1) + λ3 P(wn), where λ1 + λ2 + λ3 = 1

An extension method (context-conditioned λs) is:

The λs can be estimated with the EM algorithm, in the following steps:

- First, set aside three kinds of data: training data, held-out data, and test data;
- Then, build the initial language model from the training data and pick initial values for the λs (for example, all equal);
- Finally, iteratively optimize the λs with the EM algorithm so as to maximize the probability of the held-out data (the expression below).
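The interpolation step itself is tiny; here is a sketch with fixed, untuned λ values standing in for EM-estimated ones (function name and the λ values are my own illustrative choices):

```python
def interp_trigram(p_tri, p_bi, p_uni, lambdas=(0.5, 0.3, 0.2)):
    """Linearly interpolate trigram, bigram, and unigram estimates.
    The lambdas must sum to 1 so the result stays a proper probability."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

# Even when the trigram estimate is 0 (an unseen trigram), the lower-order
# models keep the interpolated probability non-zero:
# 0.5*0.0 + 0.3*0.1 + 0.2*0.01 = 0.032
print(interp_trigram(0.0, 0.1, 0.01))
```

Unlike add-one, the unseen-trigram probability here depends on the bigram and unigram evidence, so different unseen events get different probabilities, which is the point made above.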

**Kneser-Ney Smoothing**
**Web-scale LMs**

For example, the Google n-gram corpus has a compressed file size of 27.9 GB, around 1 TB after decompression. Faced with corpus resources this large, pruning is generally needed before use to reduce the scale, for example using only n-grams whose frequency exceeds a threshold, filtering out higher-order n-grams (such as using only the n <= 3 resources), entropy-based pruning, and so on.

In addition, some storage optimizations are needed, such as storing the model in a trie data structure, using Bloom filters to assist queries, mapping strings to int types (based on Huffman coding, varint, etc.), and converting float/double to int (if a probability value keeps exactly 6 digits after the decimal point, multiplying by 10^6 turns the floating-point number into an integer).

In 2007, Brants et al. of Google proposed a smoothing technique for large-scale n-grams, "Stupid Backoff", with the following formula:

S(wi | wi-k+1 ... wi-1) = count(wi-k+1 ... wi) / count(wi-k+1 ... wi-1) if count(wi-k+1 ... wi) > 0, and α · S(wi | wi-k+2 ... wi-1) otherwise, with α = 0.4 and, at the unigram level, S(wi) = count(wi) / N.
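Under the assumption that n-gram counts of all orders are kept in one dict of tuples (my own representation, not Google's), the recursive Stupid Backoff score can be sketched as:

```python
ALPHA = 0.4  # back-off factor used by Brants et al.

def stupid_backoff(words, counts, total):
    """Score S for the last word of `words` given the preceding context.
    `counts` maps n-gram tuples of every order to frequencies (so any seen
    n-gram's prefix is also present); `total` is the corpus token count.
    Note: S is a score, not a normalized probability."""
    for k in range(len(words), 0, -1):
        tail = tuple(words[-k:])
        if counts.get(tail, 0) > 0:
            head = tail[:-1]
            denom = counts[head] if head else total
            # one factor of ALPHA for every back-off level actually taken
            return ALPHA ** (len(words) - k) * counts[tail] / denom
    return 0.0  # the word never occurred at all

counts = {("i",): 3, ("am",): 2, ("i", "am"): 2, ("sam",): 2}
print(stupid_backoff(["i", "am"], counts, total=16))   # seen bigram: 2/3
print(stupid_backoff(["he", "sam"], counts, total=16)) # back off: 0.4 * 2/16
```

Skipping normalization is exactly what makes the scheme "stupid" and cheap enough for web-scale counts.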

Data smoothing is an important method for constructing a highly robust language model, and its effect is tied to the size of the training corpus: the smaller the training corpus, the more effective the smoothing; the larger the training corpus, the less noticeable, even negligible, the effect, merely icing on the cake.

**5) Language model variants**

**Class-based N-gram Model**

This method builds the language model over word classes such as parts of speech to alleviate the data sparseness problem, and makes it easy to fuse in some grammatical information.

**Topic-based N-gram Model**

This method divides the training set into multiple subsets by topic and builds an n-gram language model for each subset, to solve the topic-adaptation problem of language models. The schema is as follows:

**Cache-based N-gram Model**

This method uses a cache of information from the recent history to compute the probabilities at the current moment, to solve the dynamic-adaptation problem of language models. It rests on two observations:

- People tend to use as few distinct words as possible in an article.
- If a word has been used, it will likely be used again in the future.

The schema is as follows:

**Guess:** this is the strategy adopted by today's intelligent pinyin input methods such as QQ, Sogou, and Google: the user's personal input log is used to build a cache-based language model, which re-weights the output of the general-purpose language model to make the input method personalized and intelligent. Thanks to the dynamic-adaptation module, the more the product is used the smarter it becomes, the smarter it becomes the more useful it is, and in the end it becomes addictive.
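A cache-based model can be sketched as a mixture of a static base model and a unigram cache of recently seen words (the class name, cache size, and λ weight here are my own illustrative choices, not taken from any input method):

```python
from collections import deque

class CacheLM:
    """Interpolate a static model with a unigram cache of recent words (a sketch)."""
    def __init__(self, base_prob, cache_size=200, lam=0.1):
        self.base_prob = base_prob            # callable: w -> P_base(w)
        self.cache = deque(maxlen=cache_size) # sliding window of recent words
        self.lam = lam                        # weight of the cache component

    def prob(self, w):
        p_cache = self.cache.count(w) / len(self.cache) if self.cache else 0.0
        return (1 - self.lam) * self.base_prob(w) + self.lam * p_cache

    def observe(self, w):
        """Call after each word the user actually produced."""
        self.cache.append(w)

lm = CacheLM(base_prob=lambda w: 0.001, lam=0.1)
before = lm.prob("kneser")
lm.observe("kneser")
after = lm.prob("kneser")
# a just-seen word becomes far more probable than the base model alone says
print(before, after)
```

The bounded deque makes the adaptation dynamic: once a word scrolls out of the window, its boost disappears and the model falls back to the general distribution.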

**Skipping N-gram model&trigger-based N-gram Model**

Both are characterized by their use of long-distance constraint relations.

**Exponential language models: the maximum entropy model (MaxEnt), the maximum-entropy Markov model (MEMM), and the conditional random field (CRF)**

Traditional n-gram language models consider only word-form features, with no knowledge at the part-of-speech or semantic level, and the classic smoothing techniques attack the data sparseness problem only from the statistical angle, without drawing on the linguistic regularities of syntax and semantics.

MaxEnt, MEMM, and CRF can integrate many kinds of knowledge sources and describe the characteristics of language sequences, and are better suited to solving sequence labeling problems.

**III. References**

- Lecture slides: Language Modeling
- http://en.wikipedia.org
- Guan Yi, A Basic Course in Statistical Natural Language Processing (PPT)
- Microsoft Pinyin Input Method team, Basic Concepts of Language Models
- Sho, An Introduction to Statistical Language Models
- Fandywang, Statistical Language Models
- Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Computer Speech and Language, 13:359-394, October 1999.
- Thorsten Brants et al. Large Language Models in Machine Translation.
- Gale & Sampson, Good-Turing Frequency Estimation Without Tears.
- Bill MacCartney, NLP Lunch Tutorial: Smoothing, 2005.

P.S.: Based on these notes, I organized a set of slides, shared at: http://vdisk.weibo.com/s/jzg7h

"Language Models (Language Modeling)", Stanford University Natural Language Processing, Lesson Four