MIT Natural Language Processing, Lecture 3: Probabilistic Language Modeling (Part 4)
Natural Language Processing: Probabilistic Language Modeling
Author: Regina Barzilay (MIT, EECS Department, November 15, 2004)
Translator: I Love Natural Language Processing (www.52nlp.cn, January 20, 2009)
IV. Smoothing Algorithms
a) Maximum Likelihood Estimation (MLE)
i. MLE makes the training data as probable as possible:
P_{ML}(w_i \mid w_{i-2}, w_{i-1}) = \frac{Count(w_{i-2}, w_{i-1}, w_i)}{Count(w_{i-2}, w_{i-1})}
1. For a vocabulary of size N, the model has N^3 parameters.
2. For N = 1,000, we would have to estimate 1,000^3 = 10^9 parameters.
3. Problem: how do we deal with unseen words (and unseen n-grams)?
ii. Data sparseness (sparsity)
1. The aggregate probability of unseen events constitutes a large fraction of the test data.
2. Brown et al. (1992): considering a 350 million word corpus of English, 14% of trigrams are unseen.
iii. Translator's note: a brief supplement on MLE
1. Maximum likelihood estimation is a statistical method for finding the parameters of a probability density function from a sample set. It was first used by the geneticist and statistician Sir Ronald Fisher between 1912 and 1922.
2. The Chinese rendering of "likelihood" is rather classical; in modern Chinese it simply means "possibility", so "maximum likelihood estimation" is easier to understand as "maximum possibility estimation".
3. MLE chooses the parameters that maximize the probability of the training corpus; it wastes no probability mass on events that do not appear in the training corpus.
4. However, MLE models are usually unsuitable as statistical language models for NLP, because assigning zero probability to unseen events is unacceptable.
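To make the estimator and its zero-probability problem concrete, here is a minimal sketch in Python (not from the courseware; the toy corpus and function names are hypothetical):

from collections import Counter

def mle_trigram_model(tokens):
    """Build an MLE trigram estimator P(w_i | w_{i-2}, w_{i-1}) from a token list."""
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    def prob(w2, w1, w):
        # Count(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1})
        denom = bigram_counts[(w2, w1)]
        if denom == 0:
            return 0.0  # unseen history: MLE is undefined; returned as 0 here
        return trigram_counts[(w2, w1, w)] / denom

    return prob

# Hypothetical toy corpus: any trigram not seen in it gets probability 0,
# which is exactly the problem smoothing is meant to fix.
p = mle_trigram_model("the dog runs and the dog barks".split())
print(p("the", "dog", "runs"))    # 0.5 (seen)
print(p("the", "dog", "sleeps"))  # 0.0 (unseen)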
b) How do we estimate the probability of unseen elements?
i. Discounting
1. Laplace (add-one) smoothing
2. Good-Turing discounting
ii. Linear interpolation
iii. Katz back-off
c) Add-one (Laplace) smoothing
i. The simplest discounting technique:
P_{+1}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + |V|}
where |V| is the vocabulary size, i.e. the number of word types in the corpus.
(Translator's note: the MIT courseware seems to have an error here, which I have corrected.)
ii. This is the Bayesian estimator that assumes a uniform prior on events.
iii. Problem: too much probability mass goes to unseen events.
iv. Example:
Assume |V| = 10,000 (word types) and S = 1,000,000 (tokens):
P_{MLE}(ball \mid kick\ a) = \frac{Count(kick\ a\ ball)}{Count(kick\ a)} = \frac{9}{10} = 0.9
P_{+1}(ball \mid kick\ a) = \frac{Count(kick\ a\ ball) + 1}{Count(kick\ a) + |V|} = \frac{9 + 1}{10 + 10{,}000} \approx 9 \times 10^{-4}
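A minimal Python sketch of add-one smoothing, written with a delta parameter so the same code also covers the add-epsilon (Lidstone) variant mentioned below; the function and toy counts are illustrative, not from the courseware:

from collections import Counter

def add_delta_prob(ngram_counts, context_counts, vocab_size, context, w, delta=1.0):
    """Add-delta smoothed P(w | context); delta=1.0 gives Laplace (add-one) smoothing."""
    return (ngram_counts[(context, w)] + delta) / (context_counts[context] + delta * vocab_size)

# Toy counts mirroring the "kick a ball" example above.
ngram_counts = Counter({(("kick", "a"), "ball"): 9})
context_counts = Counter({("kick", "a"): 10})
V = 10_000

print(add_delta_prob(ngram_counts, context_counts, V, ("kick", "a"), "ball"))              # ~1e-3, versus 0.9 for MLE
print(add_delta_prob(ngram_counts, context_counts, V, ("kick", "a"), "ball", delta=0.01))  # add-epsilon (Lidstone): ~0.08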
v. Weaknesses of Laplace smoothing
1. For sparse distributions, Laplace's law gives too much of the probability space to unseen events.
2. It is worse than other smoothing methods at predicting the actual probabilities of bigrams.
3. It is more reasonable to use add-epsilon smoothing (Lidstone's law), i.e. adding a small constant instead of 1 (cf. the delta parameter in the sketch above).
To be continued: Part 5
Appendix: the course materials and courseware PDFs can be downloaded from the MIT course page:
http://people.csail.mit.edu/regina/6881/
Note: This article is translated and released in accordance with the MIT OpenCourseWare Creative Commons terms. Please credit the source "I Love Natural Language Processing" (www.52nlp.cn) when reprinting.
From: http://www.52nlp.cn/mit-nlp-third-lesson-probabilistic-language-modeling-fourth-part/
MIT Natural Language Processing, Lecture 3: Probabilistic Language Modeling (Part 5)
Natural Language Processing: Probabilistic Language Modeling
Author: Regina Barzilay (MIT, EECS Department, November 15, 2004)
Translator: I Love Natural Language Processing (www.52nlp.cn, February 10, 2009)
V. Good-Turing Discounting
a) How likely are we to see a new word type in the future? Use the things seen once to estimate the probability of unseen things.
i. n_r: the number of elements (n-grams) occurring with frequency r, for r > 0
ii. n_0: the size of the total lexicon minus the size of the observed lexicon, i.e. the number of n-grams with count 0
iii. For elements with frequency r, the modified count is:
r^* = (r + 1) \frac{n_{r+1}}{n_r}
b) Additional notes on Good-Turing discounting (translator's note):
i. Good (1953) first described the Good-Turing algorithm; the original idea came from Turing.
ii. The basic idea of Good-Turing smoothing is to use the counts of n-grams seen more often to re-estimate the probability mass assigned to n-grams with zero or low counts.
c) Good-Turing discounting: intuition
i. Goal: estimate how often a word with r counts in the training data occurs in a test set of equal size.
ii. We use deleted estimation:
1. Delete one word at a time.
2. If the word "test" occurs r + 1 times in the complete data set:
- it occurs r times in the "training" set;
- add one count to the words with r counts.
iii. The total count placed in the bucket for r-count words is:
n_{r+1} \cdot (r + 1)
iv. The average count is:
avg\text{-}count(r\text{-}count\ words) = \frac{(r + 1)\, n_{r+1}}{n_r}
d) Good-Turing discounting (cont.):
i. In Good-Turing, the total probability assigned to all unobserved events is equal to n_1/N, where N is the size of the training set. This is the same mass that the relative frequency formula would assign to singleton events.
e) Example (Good-Turing)
Training sample of 22,000,000 words (Church & Gale, 1991):
r    N_r               r* (held-out)    r* (Good-Turing)
0    74,671,100,000    0.00027          0.00027
1    2,018,046         0.448            0.446
2    449,721           1.25             1.26
3    188,933           2.24             2.24
4    105,668           3.23             3.24
5    68,379            4.21             4.22
6    48,190            5.23             5.19
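A minimal Python sketch (not from the courseware) that recomputes the Good-Turing column from the N_r column above via r^* = (r + 1) n_{r+1} / n_r, plus the total probability mass n_1/N reserved for unseen events:

# Frequency-of-frequency counts N_r copied from the Church & Gale table above.
n = {0: 74_671_100_000, 1: 2_018_046, 2: 449_721, 3: 188_933,
     4: 105_668, 5: 68_379, 6: 48_190}

def good_turing_count(r, n_r):
    """Good-Turing adjusted count r* = (r + 1) * N_{r+1} / N_r."""
    return (r + 1) * n_r[r + 1] / n_r[r]

for r in range(1, 6):
    print(r, round(good_turing_count(r, n), 3))   # 0.446, 1.26, 2.237, 3.235, 4.228

# Total probability mass reserved for unseen events is n_1 / N,
# taking N = 22,000,000 as the training-sample size quoted above.
print(round(n[1] / 22_000_000, 3))                # ~0.092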
f) Additional notes (translator's note):
i. According to Zipf's law, n_r is large for small r and small for large r; for the n-gram with the highest count, n_{r+1} = 0 and therefore r^* = 0!
ii. So for n-grams that occur many times, the GT estimate is inaccurate, whereas the MLE estimate is relatively accurate and can be used directly for them. GT estimation is generally applied only to n-grams that occur at most k times, with k small (e.g. k < 10).
iii. Otherwise, imagine the discount turning the "rich" (the highest-count n-grams) into the "middle class": even the discounting law does not dare to bully the rich, so the most frequent n-grams are left undiscounted.
To be continued: Part 6
Appendix: the course materials and courseware PDFs can be downloaded from the MIT course page:
http://people.csail.mit.edu/regina/6881/
Note: This article is translated and released in accordance with the MIT OpenCourseWare Creative Commons terms. Please credit the source "I Love Natural Language Processing" (www.52nlp.cn) when reprinting.
From: http://www.52nlp.cn/mit-nlp-third-lesson-probabilistic-language-modeling-fifth-part/
MIT Natural Language Processing, Lecture 3: Probabilistic Language Modeling (Part 6)
Natural Language Processing: Probabilistic Language Modeling
Author: Regina Barzilay (MIT, EECS Department, November 15, 2004)
Translator: I Love Natural Language Processing (www.52nlp.cn, February 16, 2009)
VI. Interpolation and Back-off
a) The bias-variance trade-off
i. Unsmoothed trigram estimate:
P_{ML}(w_i \mid w_{i-2}, w_{i-1}) = \frac{Count(w_{i-2}, w_{i-1}, w_i)}{Count(w_{i-2}, w_{i-1})}
ii. Unsmoothed bigram estimate:
P_{ML}(w_i \mid w_{i-1}) = \frac{Count(w_{i-1}, w_i)}{Count(w_{i-1})}
iii. Unsmoothed unigram estimate:
P_{ML}(w_i) = \frac{Count(w_i)}{\sum_j Count(w_j)}
iv. How close are these different estimates to the "true" probability P(w_i \mid w_{i-2}, w_{i-1})?
b) Interpolation
i. One way of solving the sparseness problem in a trigram model is to mix it with bigram and unigram models, which suffer less from data sparseness.
ii. The weights can be set using the expectation-maximization (EM) algorithm or another numerical optimization technique.
iii. Linear interpolation:
\hat{P}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 P_{ML}(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 P_{ML}(w_i \mid w_{i-1}) + \lambda_3 P_{ML}(w_i)
where \lambda_1 + \lambda_2 + \lambda_3 = 1 and \lambda_i \geq 0 for all i.
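A minimal Python sketch (not from the courseware) of the linearly interpolated estimate; p_tri, p_bi and p_uni stand for the MLE trigram, bigram and unigram estimators defined above, and the default lambda values are placeholders:

def interpolated_prob(w2, w1, w, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """Linearly interpolated estimate of P(w | w2, w1).

    p_tri(w2, w1, w), p_bi(w1, w) and p_uni(w) are assumed to return MLE
    probabilities; in practice the lambdas are tuned on held-out data
    (see the parameter-estimation sketch below).
    """
    l1, l2, l3 = lambdas
    assert min(lambdas) >= 0 and abs(sum(lambdas) - 1.0) < 1e-9
    return l1 * p_tri(w2, w1, w) + l2 * p_bi(w1, w) + l3 * p_uni(w)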
iv. Parameter estimation
1. Hold out part of the training set as "validation" data.
2. Define Count_2(w_1, w_2, w_3) as the number of times the trigram (w_1, w_2, w_3) is seen in the validation set.
3. Choose \lambda_i to maximize:
L(\lambda_1, \lambda_2, \lambda_3) = \sum_{(w_1, w_2, w_3) \in \Upsilon} Count_2(w_1, w_2, w_3) \log \hat{P}(w_3 \mid w_1, w_2)
where \lambda_1 + \lambda_2 + \lambda_3 = 1 and \lambda_i \geq 0 for all i.
(Translator's note: the remaining details of parameter estimation involve many formulas and are omitted here; please see the original courseware.)
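As a rough illustration (a simple grid search rather than the EM procedure the courseware refers to), here is a Python sketch that picks the lambdas maximizing the validation log-likelihood L(\lambda_1, \lambda_2, \lambda_3); the counts dictionary and the three estimator functions are assumed to exist:

import math
from itertools import product

def choose_lambdas(validation_trigram_counts, p_tri, p_bi, p_uni, step=0.1):
    """Grid search over (l1, l2, l3) with l1 + l2 + l3 = 1, maximizing
    the sum over (w1, w2, w3) of Count_2(w1, w2, w3) * log P_hat(w3 | w1, w2)."""
    best, best_ll = None, float("-inf")
    grid = [i * step for i in range(int(round(1 / step)) + 1)]
    for l1, l2 in product(grid, repeat=2):
        l3 = 1.0 - l1 - l2
        if l3 < -1e-9:
            continue                      # outside the simplex
        ll = 0.0
        for (w1, w2, w3), c in validation_trigram_counts.items():
            p = l1 * p_tri(w1, w2, w3) + l2 * p_bi(w2, w3) + l3 * p_uni(w3)
            if p <= 0.0:                  # log(0): this lambda setting is infeasible
                ll = float("-inf")
                break
            ll += c * math.log(p)
        if ll > best_ll:
            best, best_ll = (l1, l2, max(l3, 0.0)), ll
    return best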
c) Katz back-off models (bigrams):
i. Define two sets:
A(w_{i-1}) = \{ w : Count(w_{i-1}, w) > 0 \}
B(w_{i-1}) = \{ w : Count(w_{i-1}, w) = 0 \}
ii. A bigram model:
P_K(w_i \mid w_{i-1}) =
\begin{cases}
\frac{Count^*(w_{i-1}, w_i)}{Count(w_{i-1})} & \text{if } w_i \in A(w_{i-1}) \\
\alpha(w_{i-1}) \frac{P_{ML}(w_i)}{\sum_{w \in B(w_{i-1})} P_{ML}(w)} & \text{if } w_i \in B(w_{i-1})
\end{cases}
where
\alpha(w_{i-1}) = 1 - \sum_{w \in A(w_{i-1})} \frac{Count^*(w_{i-1}, w)}{Count(w_{i-1})}
iii. Definitions of Count^*
1. Katz uses the Good-Turing method for Count(x) < 5, and sets Count^*(x) = Count(x) for Count(x) >= 5.
2. The "Kneser-Ney" method:
Count^*(x) = Count(x) - D, where D = \frac{n_1}{n_1 + n_2}
n_1 is the number of elements with frequency 1
n_2 is the number of elements with frequency 2
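A minimal Python sketch (not from the courseware) of the Katz back-off bigram probability defined above; the discounted counts are assumed to be precomputed (e.g. with Good-Turing for counts below 5), and all helper names are illustrative:

def katz_bigram_prob(w_prev, w, bigram_counts, count_star, unigram_prob, vocab):
    """Katz back-off estimate P_K(w | w_prev).

    bigram_counts[(w_prev, v)]  -- raw bigram counts
    count_star[(w_prev, v)]     -- discounted counts Count^* for seen bigrams
    unigram_prob(v)             -- MLE unigram probability P_ML(v)
    vocab                       -- the set of word types
    """
    history_count = sum(bigram_counts.get((w_prev, v), 0) for v in vocab)  # Count(w_prev)
    seen = {v for v in vocab if bigram_counts.get((w_prev, v), 0) > 0}     # the set A(w_prev)

    if w in seen:
        return count_star[(w_prev, w)] / history_count

    # Mass freed up by discounting the seen bigrams: alpha(w_prev).
    alpha = 1.0 - sum(count_star[(w_prev, v)] for v in seen) / history_count
    # Redistribute it over the unseen words B(w_prev) in proportion to P_ML(w).
    unseen_mass = sum(unigram_prob(v) for v in vocab if v not in seen)
    return alpha * unigram_prob(w) / unseen_mass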
VII. Overview
a) Weaknesses of n-gram models
i. Any ideas?
Short-range
Mid-range
Long-range
b) More refined models
i. Class-based models
ii. Structural models
iii. Topical and long-range models
c) Summary
i. Start with a vocabulary
ii. Select a type of model
iii. Estimate parameters
d) Toolkits:
i. CMU-Cambridge Language Modeling Toolkit:
Http://mi.eng.cam.ac.uk/~prc14/toolkit.html
ii. SRILM - The SRI Language Modeling Toolkit:
http://www.speech.sri.com/projects/srilm/
This concludes the third lecture.
Next: Lecture 4, Tagging.
Appendix: the course materials and courseware PDFs can be downloaded from the MIT course page:
http://people.csail.mit.edu/regina/6881/
Note: This article is translated and released in accordance with the MIT OpenCourseWare Creative Commons terms. Please credit the source "I Love Natural Language Processing" (www.52nlp.cn) when reprinting.
From: http://www.52nlp.cn/mit-nlp-third-lesson-probabilistic-language-modeling-sixth-part/