Source: http://52opencourse.com/111/ — Stanford University, Language Model (Language Modeling), Lesson 4 of the Natural Language Processing course
I. Course Introduction
In March 2012, Stanford University launched an online natural language processing course on Coursera, taught by the NLP luminaries Dan Jurafsky and Chris Manning:
The following are my study notes for this course, based mainly on the course ppt/pdf and supplemented by other reference materials, with my own elaboration and annotations mixed in. Everyone is welcome to study along at 52opencourse ("I Love Open Courses").
Courseware index: summary of the courseware for the Stanford University Natural Language Processing open course
II. Language Models (Language Model)
1) N-gram Introduction
In practical applications, we often need to solve problems of this kind: how do we compute the probability of a sentence? For example:
The formal statement of the above problem is as follows:
P(S) = P(w1, w2, w3, w4, w5, ..., wn)
     = P(w1) P(w2|w1) P(w3|w1,w2) ... P(wn|w1,w2,...,wn-1)    // chain rule
P(S) is called the language model: the model used to compute the probability of a sentence.
So how do we compute P(wi|w1,w2,...,wi-1)? The simplest, most direct method is to count and divide:
P(wi|w1,w2,...,wi-1) = count(w1,w2,...,wi-1,wi) / count(w1,w2,...,wi-1)
However, this runs into two serious problems: data sparseness and an overly large parameter space, so it cannot be applied in practice.
Instead, we rely on the Markov assumption: the occurrence of the next word depends only on the one word, or the few words, immediately before it.
P(S) = P(w1) P(w2|w1) P(w3|w1,w2) ... P(wn|w1,w2,...,wn-1)
     ≈ P(w1) P(w2|w1) P(w3|w2) ... P(wn|wn-1)    // bigram
P(S) = P(w1) P(w2|w1) P(w3|w1,w2) ... P(wn|w1,w2,...,wn-1)
     ≈ P(w1) P(w2|w1) P(w3|w1,w2) ... P(wn|wn-2,wn-1)    // trigram
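As a sketch of the bigram approximation above, the sentence probability becomes a product of first-order conditionals. The conditional probabilities below are made-up toy values, purely for illustration:

```python
# Sketch: sentence probability under the bigram (first-order Markov) assumption.
# The probability values here are hypothetical, not from any real corpus.

def bigram_sentence_prob(words, p_bigram, p_start):
    """P(S) ~= P(w1) * product of P(wi | wi-1)."""
    prob = p_start[words[0]]
    for prev, cur in zip(words, words[1:]):
        prob *= p_bigram[(prev, cur)]
    return prob

# Toy parameters (hypothetical)
p_start = {"I": 0.5}
p_bigram = {("I", "want"): 0.3, ("want", "food"): 0.2}

print(bigram_sentence_prob(["I", "want", "food"], p_bigram, p_start))  # 0.5 * 0.3 * 0.2
```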
So, when facing a practical problem, how do we choose the number of conditioning words, i.e., n? Theoretically, the larger n, the better; empirically, trigrams are used the most; even so, in principle, if a bigram solves the problem, never use a trigram.
2) Constructing the Language Model
Typically, the language model is constructed by maximum likelihood estimation (Maximum Likelihood Estimate), which gives the best fit to the training data. Taking the bigram as an example:
P(wi|wi-1) = count(wi-1, wi) / count(wi-1)
Given the following set of sentences:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
Some of the resulting bigram estimates are as follows:
The unigram counts c(wi) are as follows:
The bigram counts c(wi-1, wi) are as follows:
The bigram probabilities are then:
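As a sketch, the maximum-likelihood bigram estimates for this toy corpus can be reproduced directly in code:

```python
from collections import Counter

# Sketch: maximum-likelihood bigram estimates from the toy corpus above.
sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram = Counter()
bigram = Counter()
for s in sentences:
    tokens = s.split()
    unigram.update(tokens)
    bigram.update(zip(tokens, tokens[1:]))

def p_mle(w, prev):
    """P(w | prev) = count(prev, w) / count(prev)."""
    return bigram[(prev, w)] / unigram[prev]

print(p_mle("I", "<s>"))   # 2/3: two of the three sentences start with "I"
print(p_mle("Sam", "am"))  # 1/2
print(p_mle("am", "I"))    # 2/3
```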
Thus, the probability of the sentence "<s> I want chinese food </s>" is:
P(<s> I want chinese food </s>) = P(I|<s>)
  × P(want|I)
  × P(chinese|want)
  × P(food|chinese)
  × P(</s>|food)
To avoid numerical underflow and improve performance, it is common to take logs and replace multiplication with addition:
log(p1 * p2 * p3 * p4) = log(p1) + log(p2) + log(p3) + log(p4)
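A small sketch of why this matters: multiplying many small probabilities underflows IEEE doubles to zero, while summing their logs stays well within range:

```python
import math

# Sketch: multiplying many small probabilities underflows; summing logs does not.
probs = [1e-5] * 100

direct = 1.0
for p in probs:
    direct *= p
# 1e-5 ** 100 == 1e-500, far below the smallest representable double,
# so the running product collapses to exactly 0.0.

log_prob = sum(math.log(p) for p in probs)

print(direct)    # 0.0
print(log_prob)  # 100 * log(1e-5), a perfectly ordinary number
```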
Recommended open-source language model tools:
Recommended open-source n-gram datasets:
Total number of tokens: 1,306,807,412,486
Total number of sentences: 150,727,365,731
Total number of unigrams: 95,998,281
Total number of bigrams: 646,439,858
Total number of trigrams: 1,312,972,925
Total number of fourgrams: 1,396,154,236
Total number of fivegrams: 1,149,361,413
Total number of n-grams: 4,600,926,713
3) Language Model evaluation
How do we judge the quality of a language model once it has been constructed? There are currently two main evaluation methods:
From the formula it follows that the smaller the perplexity, the larger the probability assigned to the sentence, and the better the language model. Using Wall Street Journal training data of 38 million words to build n-gram language models, with a test set of 1.5 million words, the perplexities are shown in the following table:
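The perplexity computation itself can be sketched as follows, using the standard definition PP(W) = P(w1..wN)^(-1/N) evaluated in log space for stability; the per-token probabilities below are toy values:

```python
import math

# Sketch: perplexity of a test sequence, PP(W) = P(w1..wN) ** (-1/N),
# computed from per-token log probabilities to avoid underflow.
# The probability values are hypothetical, not from a real model.

def perplexity(token_log_probs):
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# If a model assigns each of 4 tokens probability 0.1,
# its perplexity should come out to exactly 1 / 0.1 = 10.
log_probs = [math.log(0.1)] * 4
print(perplexity(log_probs))  # ~10.0
```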
4) Data Sparseness and Smoothing Techniques
The data sparseness problem inevitably arises between large-scale statistical methods and a limited training corpus, producing zero-probability estimates; this accords with the classic Zipf's law. For example, IBM trained a trigram model on a 366M-word English corpus (Brown); in the test corpus, 14.7% of the trigrams and 2.2% of the bigrams had never appeared in the training corpus.
Definition of the data sparseness problem: "The problem of data sparseness, also known as the zero-frequency problem, arises when analyses contain configurations that never occurred in the training corpus. Then it is not possible to estimate probabilities from observed frequencies, and some other estimation scheme that can generalize from the training data has to be used." — Dagan
Many attempts and efforts have been made to make the theoretical model practical, giving rise to a series of classical smoothing techniques. Their basic idea is to "lower the conditional probabilities of seen n-grams so that the conditional probabilities of unseen n-grams are no longer 0", while guaranteeing that the smoothed probabilities still sum to 1. In detail:
Add-one smoothing, also known as Laplace's law, adds 1 to every count, so that every n-gram over the training vocabulary effectively appears at least once. Taking the bigram as an example, the formula is:
P(wi|wi-1) = (count(wi-1, wi) + 1) / (count(wi-1) + V)
where V is the vocabulary size (the number of distinct words).
Continuing the example from the previous section, after add-one smoothing the counts c(wi-1, wi) become:
The bigram probabilities are then:
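Add-one smoothing for the same toy corpus can be sketched in a few lines; note how a previously unseen bigram now receives a small nonzero probability:

```python
from collections import Counter

# Sketch: add-one (Laplace) smoothed bigram estimates on the toy corpus above.
# P(w | prev) = (count(prev, w) + 1) / (count(prev) + V), V = vocabulary size.
sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram = Counter()
bigram = Counter()
for s in sentences:
    tokens = s.split()
    unigram.update(tokens)
    bigram.update(zip(tokens, tokens[1:]))

V = len(unigram)  # vocabulary size: number of distinct tokens (12 here)

def p_add_one(w, prev):
    return (bigram[(prev, w)] + 1) / (unigram[prev] + V)

# The unseen bigram (<s>, ham) now gets a small nonzero probability:
print(p_add_one("ham", "<s>"))  # (0 + 1) / (3 + 12)
print(p_add_one("I", "<s>"))    # (2 + 1) / (3 + 12)
```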
When V >> c(wi-1), i.e., when the vast majority of n-grams over the vocabulary never appear in the training corpus (which is usually the case), the added counts overwhelm the observed ones after add-one smoothing ("the guest upstages the host"), and the estimates become poor. Extensions of the method, such as Lidstone's law and the Jeffreys-Perks law, can mitigate this problem.
Good-Turing smoothing: the basic idea is to use frequency-of-frequency information to smooth the counts, adjusting an n-gram's count from c to c*:
c* = (c + 1) × N(c+1) / N(c)
where N(c) is the number of distinct n-grams occurring exactly c times.
However, when N(c+1) = 0, or when N(c) < N(c+1) (the frequency-of-frequency sequence is not monotonically decreasing), the model quality degrades, as shown in:
A direct improvement strategy is: "do not smooth n-grams whose count exceeds a certain threshold, with the threshold usually set to 8~10"; for other refinements see "Simple Good-Turing".
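The count adjustment c* = (c + 1) N(c+1) / N(c) can be sketched with made-up toy counts, which also makes the N(c+1) = 0 weakness visible:

```python
from collections import Counter

# Sketch: Good-Turing adjusted counts c* = (c + 1) * N(c+1) / N(c),
# where N(c) is the number of distinct n-grams seen exactly c times.
# The bigram counts below are made up purely for illustration.

counts = Counter({"a b": 1, "c d": 1, "e f": 1, "g h": 2, "i j": 2, "k l": 3})

# Frequency-of-frequencies table: here N(1)=3, N(2)=2, N(3)=1.
freq_of_freq = Counter(counts.values())

def good_turing(c):
    """Adjusted count; breaks down when N(c+1) == 0, a known weakness."""
    return (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]

print(good_turing(1))  # 2 * N(2) / N(1) = 2 * 2 / 3
print(good_turing(2))  # 3 * N(3) / N(2) = 3 * 1 / 2
print(good_turing(3))  # N(4) = 0, so the adjusted count collapses to 0
```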
Whether add-one or Good-Turing, these smoothing techniques treat all unseen n-grams as equal, which is inevitably unreasonable (different events have different probabilities). So we next introduce linear interpolation smoothing. Its basic idea is to combine the higher-order and lower-order models as a linear combination, interpolating the high-order n-gram model with lower-order n-gram models: when there is not enough data to estimate the probability of a high-order n-gram, the lower-order models can often supply useful information. The formula is as follows:
An extended method (with context-dependent interpolation weights) is:
The λ values can be estimated with the EM algorithm, with the following steps:
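With the weights fixed, the interpolated trigram probability is just a weighted sum; a minimal sketch, where the λ values and the component probabilities are hypothetical (in practice the λs are tuned on held-out data, e.g. with EM, and must sum to 1):

```python
# Sketch: linearly interpolated trigram probability with fixed weights.
# P(w | u, v) = l3 * P_tri(w|u,v) + l2 * P_bi(w|v) + l1 * P_uni(w)
# All numbers below are toy values for illustration.

def interpolate(p_tri, p_bi, p_uni, lambdas=(0.1, 0.3, 0.6)):
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9  # weights must sum to 1
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

# Even if the trigram estimate were 0 (unseen), the bigram and unigram
# components keep the interpolated probability nonzero.
print(interpolate(p_tri=0.5, p_bi=0.2, p_uni=0.01))   # 0.3 + 0.06 + 0.001
print(interpolate(p_tri=0.0, p_bi=0.2, p_uni=0.01))   # still > 0
```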
For example, the Google N-gram corpus is a 27.9 GB compressed file, around 1 TB after decompression. Faced with corpus resources this large, pruning is generally needed before use to reduce the scale, e.g. keeping only n-grams whose frequency exceeds a threshold, filtering out high-order n-grams (say, using only resources with n <= 3), entropy-based pruning, and so on.
In addition, some storage optimizations are needed, such as storing the model in a trie data structure, using a Bloom filter to assist queries, mapping strings to int (based on Huffman coding, varint, etc.), and converting float/double to int (if a probability value keeps exactly 6 digits after the decimal point, multiplying by 10^6 turns the floating-point number into an integer).
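The float-to-int trick mentioned above can be sketched in a couple of lines: fix 6 decimal digits, multiply by 10^6, and round, so the value is stored as a plain integer with a bounded quantization error:

```python
import math

# Sketch: storing a log probability as an int by fixing 6 decimal digits,
# as the text suggests: multiply by 10**6 and round.

def pack(log_prob):
    return round(log_prob * 10**6)   # a plain int, cheaper to store than a double

def unpack(packed):
    return packed / 10**6

lp = math.log(0.0375)                # some model score (toy value)
packed = pack(lp)
print(type(packed).__name__)                   # int
print(abs(unpack(packed) - lp) <= 5e-7)        # True: error is at most half a unit
```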
In 2007, Brants et al. of Google proposed a smoothing technique for large-scale n-grams, "Stupid Backoff", with the following formula:
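A minimal sketch of the scheme: use the relative frequency when the full n-gram was seen, otherwise back off to the shorter context scaled by a constant (0.4 in the paper); the resulting scores are not normalized probabilities. The tiny corpus below is a toy example:

```python
from collections import Counter

# Sketch: Stupid Backoff scoring (Brants et al., 2007).
# S(w | context) = count(context, w) / count(context) if that count > 0,
# otherwise alpha * S(w | shorter context). Scores need not sum to 1.

class StupidBackoff:
    def __init__(self, sentences, alpha=0.4):
        self.alpha = alpha
        self.counts = Counter()
        self.total = 0
        for s in sentences:
            tokens = s.split()
            self.total += len(tokens)
            for n in (1, 2, 3):  # collect unigram..trigram counts
                self.counts.update(
                    tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
                )

    def score(self, word, context=()):
        if not context:
            return self.counts[(word,)] / self.total  # unigram base case
        full = tuple(context) + (word,)
        if self.counts[full] > 0:
            return self.counts[full] / self.counts[tuple(context)]
        return self.alpha * self.score(word, context[1:])  # back off

lm = StupidBackoff(["<s> I am Sam </s>", "<s> Sam I am </s>"])
print(lm.score("am", ("<s>", "I")))  # trigram seen: 1/1 = 1.0
print(lm.score("Sam", ("I",)))       # unseen bigram: 0.4 * (2/10) = 0.08
```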
Data smoothing is an important means of building a highly robust language model, and its benefit depends on the size of the training corpus: the smaller the corpus, the more effective smoothing is; the larger the corpus, the less noticeable, even negligible, its effect becomes, merely icing on the cake.
5) Language model variants
This method builds the language model over part-of-speech classes to alleviate the data sparseness problem, and it can easily incorporate some grammatical information.
This method divides the training set into multiple subsets by topic and builds an n-gram language model for each subset, to address topic adaptation of the language model. The architecture is as follows:
This method uses cached information from earlier moments to compute the probability at the current moment, to address dynamic adaptation of the language model:
- People tend to use as few distinct words as possible in an article.
- If a word has been used, it will likely be used again later.
The architecture is as follows:
I guess this is the strategy adopted by today's smart Pinyin input methods such as QQ, Sogou, and Google: use the user's personal input log to build a cache-based language model that reweights the output of the general-purpose language model, making the input method personalized and intelligent. With the dynamic adaptation module in place, the more you use the product, the smarter it gets; and the more useful it gets, the more addictive it becomes.
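A cache-based adjustment of this kind can be sketched as a linear interpolation between a cache model over recent words and a static background model. Everything here (the λ weight, the background probabilities) is a made-up toy example, not how any particular input method actually works:

```python
from collections import Counter

# Sketch: a cache language model interpolated with a static background model.
# P(w) = lam * P_cache(w) + (1 - lam) * P_background(w),
# where the cache holds the user's recent word history. Toy values throughout.

class CacheLM:
    def __init__(self, p_background, lam=0.2):
        self.p_background = p_background  # hypothetical background unigram model
        self.lam = lam
        self.cache = Counter()

    def observe(self, word):
        """Record a word the user just produced."""
        self.cache[word] += 1

    def prob(self, word):
        cache_total = sum(self.cache.values())
        p_cache = self.cache[word] / cache_total if cache_total else 0.0
        return self.lam * p_cache + (1 - self.lam) * self.p_background.get(word, 0.0)

lm = CacheLM({"hello": 0.01, "world": 0.01})
before = lm.prob("world")
lm.observe("world")          # the user just typed "world"
after = lm.prob("world")
print(before < after)        # True: recently used words get boosted
```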
Both are characterized by modeling long-distance constraint relationships.
The traditional n-gram language model considers only word-form features, without knowledge at the part-of-speech or semantic level; and for the data sparseness problem, the classical smoothing techniques likewise attack it only from the statistical angle, without considering the linguistic roles of syntax and semantics.
MaxEnt, MEMM, and CRF can better integrate multiple knowledge sources and describe the features of language sequences, and are better suited to solving sequence labeling problems.
P.S.: Based on these notes, I put together a set of slides, shared at: http://vdisk.weibo.com/s/jzg7h
"Language Model (Language Modeling)", Stanford University Natural Language Processing, Lesson 4