First, a brief introduction

a) Predicting string probabilities
I. Which string is more likely, i.e. more consistent with English syntax?
1. grill doctoral candidates
2. grill doctoral updates
(example from Lee 1997)
II. A method for assigning probabilities to strings is called a language model.

b) Motivation
I. Speech recognition, spelling correction, optical character recognition, and other applications.
II. Let e be the physical evidence (e.g. the acoustic signal or the scanned image); we need to determine whether the string w is the message encoded by e.
III. Use Bayes' rule:
P(w|e) = P_LM(w) × P(e|w) / P(e)
where P_LM(w) is the language model probability.
IV. P_LM(w) provides the information needed for disambiguation (especially when the physical evidence alone is not sufficient to disambiguate).

c) How to compute it?
I. Naive approach:
1. Use the maximum likelihood estimate: the number of occurrences of the string in a corpus S, normalized by the corpus size:
P_MLE(grill doctorate candidates) = count(grill doctorate candidates) / |S|
(A small sketch of this estimator follows at the end of this section.)
2. For unseen events, P_MLE = 0: the data sparseness problem ("dreadful behavior in the presence of data sparseness").

d) Two famous sentences
I. "It's fair to assume that neither sentence 'Colorless green ideas sleep furiously' nor 'Furiously sleep ideas green colorless' ... has ever occurred ... Hence, in any statistical model ... these sentences would be ruled out on identical grounds as equally 'remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not." [Chomsky 1957]
II. Note: this is from page 9 of Chomsky's Syntactic Structures. Neither of the following two sentences has ever appeared in an English discourse, and from a statistical point of view both are equally "remote" from English, yet only sentence (1) is grammatical:
1) Colorless green ideas sleep furiously.
2) Furiously sleep ideas green colorless.
Whether they have "never appeared in English" and are "statistically equally remote" depends on how you look at it: if we count the specific word strings themselves, sentence (1) by now surely has a higher frequency than sentence (2) and has indeed appeared in English text.
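To make the naive MLE estimate in c) concrete, here is a minimal Python sketch; the toy corpus and the function name p_mle are illustrative assumptions, not part of the courseware. It normalizes the count of a whole string by the corpus size |S| and shows how any unseen string immediately receives probability zero.

```python
from collections import Counter

# Hypothetical toy corpus S, one string per element (assumption for illustration).
S = [
    "grill doctorate candidates",
    "cook professors",
    "ask professors",
    "grill doctorate candidates",
]

counts = Counter(S)

def p_mle(string: str) -> float:
    """Naive MLE: count of the whole string in the corpus, normalized by |S|."""
    return counts[string] / len(S)

print(p_mle("grill doctorate candidates"))  # 0.5: seen twice in a corpus of four strings
print(p_mle("grill doctorate updates"))     # 0.0: unseen, so the naive estimate rules it out entirely
```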
Second, constructing the language model

a) The language modeling problem
I. Start with some vocabulary:
V = {the, a, doctorate, candidate, professors, grill, cook, ask, ...}
II. We are given a training sample of strings over V:
grill doctorate candidates.
cook professors.
ask professors.
...
III. Assumption: the training sample is generated by some hidden distribution P.
IV. Goal: learn a probability distribution P' that is as close as possible to P, with
Σ_x P'(x) = 1 (summing over all strings x over V), P'(x) ≥ 0
e.g. P'(candidates) = 10^{-5}, P'(ask candidates) = 10^{-8}

b) Deriving the language model
I. Assign a probability to every word sequence w_1 w_2 ... w_n.
II. Apply the chain rule:
1. P(w_1 w_2 ... w_n) = P(w_1 | S) × P(w_2 | S, w_1) × P(w_3 | S, w_1, w_2) × ... × P(E | S, w_1, w_2, ..., w_n)
where S and E are the sentence start and end symbols.
2. History-based models: we predict future events from past events.
3. How much of the context do we need to take into account?

c) Markov assumption
I. For arbitrarily long word sequences, P(w_i | w_{i-n} ... w_{i-1}) is difficult to estimate.
II. Markov assumption: the i-th word w_i depends only on the preceding n words.
III. Trigram model (a second-order Markov model):
1. P(w_i | S, w_1, w_2, ..., w_{i-1}) = P(w_i | w_{i-2}, w_{i-1})
2. P(w_1 w_2 ... w_n) = P(w_1 | S) × P(w_2 | S, w_1) × P(w_3 | w_1, w_2) × ... × P(E | w_{n-1}, w_n)

d) A computational model of language
I. A useful conceptual and practical device: the "coin toss" model:
1. Generate sentences with a random algorithm: the generator can be in one of several "states"; flip a coin to decide the next state; flip another coin to decide which letter or word to output.
II. Shannon: "The states will correspond to the 'residue of influence' from preceding letters."

e) Word-based approximations
Note: the following sentences were generated by models trained on Shakespeare; see also Jurafsky & Martin's "Speech and Language Processing".
I. Unigram approximation (note: the MIT courseware labels this a "first-order approximation", which appears to be a mistake):
1. To him swallowed confess hear both. Which. Of save
2. on trail for are ay device and rote life have
3. Every enter now severally so, let
4. Hill he late speaks; or! a more to leg less first you
5. enter
II. Trigram approximation (note: the courseware labels this a "third-order approximation", which likewise appears to be a mistake):
1. King Henry. What! I'll go seek the traitor Gloucester.
2. Exeunt some of the watch. A great banquet serv'd in;
3. Would you tell me how I am?
4. It cannot be but so.

Third, evaluation of language models

a) Evaluating a language model
I. We have n test strings: s_1, s_2, ..., s_n.
II. Consider the probability of these strings under our model,
∏_{i=1}^{n} P(s_i)
or the log probability:
log ∏_{i=1}^{n} P(s_i) = Σ_{i=1}^{n} log P(s_i)
III. Perplexity:
perplexity = 2^{-x}, where x = (1/W) Σ_{i=1}^{n} log P(s_i)
and W is the total number of words in the test data. (A sketch combining the trigram model above with this perplexity computation follows at the end of this section.)
IV. Perplexity is a measure of the effective "branching factor":
1. Suppose we have a vocabulary V of size N, and the model predicts P(w) = 1/N for all words w in V.
V. What is the perplexity then? perplexity = 2^{-x} with x = log(1/N), so perplexity = N.
VI. Estimate of human performance (Shannon, 1951):
1. Shannon game: humans guess the next letter in a text.
2. PP = 142 (1.3 bits/letter), uncased, open vocabulary.
VII. Estimate of a trigram language model (Brown et al. 1992): PP = 790 (1.75 bits/letter), cased, open vocabulary.
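The following is a minimal sketch (not from the courseware) that ties the last two sections together: it builds an MLE trigram model from a hypothetical training sample, scores test sentences with the chain rule under the Markov assumption, and reports perplexity 2^{-x} with x = (1/W) Σ log2 P(s_i). The start/stop padding and the toy data are assumptions for illustration; note how a single unseen trigram drives the perplexity to infinity, which motivates the smoothing methods discussed next.

```python
import math
from collections import Counter

START, STOP = "<s>", "</s>"

def trigrams(sentence):
    """Yield (w_{i-2}, w_{i-1}, w_i) tuples, padding with start/stop symbols."""
    words = [START, START] + sentence.split() + [STOP]
    for i in range(2, len(words)):
        yield words[i - 2], words[i - 1], words[i]

# Hypothetical training sample (assumption for illustration).
train = ["grill doctorate candidates", "cook professors", "ask professors"]

tri_counts, bi_counts = Counter(), Counter()
for s in train:
    for u, v, w in trigrams(s):
        tri_counts[(u, v, w)] += 1
        bi_counts[(u, v)] += 1

def p_ml(u, v, w):
    """MLE trigram probability count(u,v,w) / count(u,v); zero for unseen events."""
    return tri_counts[(u, v, w)] / bi_counts[(u, v)] if bi_counts[(u, v)] else 0.0

def perplexity(test_sentences):
    """perplexity = 2^{-x}, with x = (1/W) * sum of log2 sentence probabilities."""
    log_prob, W = 0.0, 0
    for s in test_sentences:
        W += len(s.split()) + 1            # convention: count the STOP symbol as a predicted token
        for u, v, w in trigrams(s):
            p = p_ml(u, v, w)
            if p == 0.0:                   # unseen trigram: MLE assigns zero probability
                return float("inf")
            log_prob += math.log2(p)
    return 2 ** (-log_prob / W)

print(perplexity(["ask professors"]))      # finite: every trigram occurred in training
print(perplexity(["grill professors"]))    # inf: one unseen trigram ruins the whole estimate
```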
Fourth, smoothing algorithms

a) Maximum likelihood estimate
I. MLE makes the training data as probable as possible:
P_ML(w_i | w_{i-2}, w_{i-1}) = count(w_{i-2}, w_{i-1}, w_i) / count(w_{i-2}, w_{i-1})
1. For a vocabulary of size N, we will have N^3 parameters in the model.
2. For N = 1,000, we would have to estimate 1,000^3 = 10^9 parameters.
3. Problem: how do we deal with unseen n-grams?
II. Data sparseness (sparsity):
1. The aggregate probability of unseen events constitutes a large fraction of the test data.
2. Brown et al. (1992): considering a 350-million-word English corpus, 14% of trigrams are unseen.
III. Note: a brief supplement on MLE:
1. Maximum likelihood estimation is a statistical method for finding the parameters of the probability density function assumed to underlie a given sample. The approach was first developed by the geneticist and statistician Sir Ronald Fisher between 1912 and 1922.
2. The Chinese term for "likelihood" (似然) is a rather classical translation; in modern Chinese it simply means "possibility", so "maximum likelihood estimation" may be easier to understand as "maximum possibility estimation".
3. The parameters chosen by MLE give the training corpus the highest possible probability; MLE wastes no probability mass on events that do not occur in the training corpus.
4. However, an MLE probability model is usually unsuitable for statistical language modeling in NLP, because zero probabilities appear, and that cannot be allowed.

b) How do we estimate the probability of unseen elements?
I. Discounting:
1. Laplace add-one smoothing
2. Good-Turing discounting
II. Linear interpolation
III. Katz back-off

c) Add-one (Laplace) smoothing
I. The simplest discounting technique:
P(w_i | w_{i-1}) = (C(w_{i-1}, w_i) + 1) / (C(w_{i-1}) + |V|)
where |V| is the vocabulary size, i.e. the number of word "types" in the corpus.
Note: the MIT courseware appears to have an error here, which I have corrected.
II. This is the Bayesian estimator that assumes a uniform prior over events.
III. Problem: it gives too much probability mass to unseen events.
IV. Example: assume |V| = 10,000 (word types) and S = 1,000,000 (tokens):
P_MLE(ball | kick a) = count(kick a ball) / count(kick a) = 9/10 = 0.9
P_+1(ball | kick a) = (count(kick a ball) + 1) / (count(kick a) + |V|) = (9 + 1)/(10 + 10,000) ≈ 10^{-3}
(A small sketch of this example follows at the end of this section.)
V. Weaknesses of Laplace:
1. For sparse distributions, Laplace's law gives far too much of the probability space to unseen events.
2. It predicts the actual probabilities of bigrams quite poorly compared with other smoothing methods.
3. Using add-epsilon smoothing (adding ε < 1 instead of 1) is somewhat more reasonable.
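A small sketch reproducing the add-one numbers from the example in c) IV (the counts and |V| = 10,000 are the assumed figures from that example, not real corpus statistics): with count(kick a ball) = 9 and count(kick a) = 10, the MLE estimate of 0.9 collapses to about 0.001, and each of the roughly 10,000 unseen successors of "kick a" receives about 10^{-4}.

```python
def p_laplace(c_ngram: int, c_history: int, vocab_size: int) -> float:
    """Add-one (Laplace) smoothing: P(w | h) = (c(h, w) + 1) / (c(h) + |V|)."""
    return (c_ngram + 1) / (c_history + vocab_size)

V = 10_000                      # assumed number of word types, as in the example above

print(9 / 10)                   # P_MLE(ball | kick a) = 0.9
print(p_laplace(9, 10, V))      # P_+1(ball | kick a) ~= 0.001: the seen event loses most of its mass
print(p_laplace(0, 10, V))      # ~1e-4 for each of the ~10,000 unseen successors of "kick a"
```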
Fifth, Good-Turing discounting

a) How likely are we to see a new word in the future? Estimate the probability of unseen events from the observed events.
I. n_r: the number of elements (n-grams) that occur with frequency r, for r > 0.
II. n_0: the total vocabulary size minus the number of observed types, i.e. the number of n-grams with count 0.
III. For elements with frequency r, the adjusted count is:
r* = (r + 1) × n_{r+1} / n_r
(A small sketch of this formula follows at the end of this section.)

b) A supplementary note on Good-Turing discounting:
I. Good (1953) first described the Good-Turing algorithm; the original idea came from Turing.
II. The basic idea of Good-Turing smoothing is to use the number of n-grams observed with higher counts to re-estimate the amount of probability mass to assign to n-grams with zero or low counts.

c) Good-Turing discounting: intuition
I. Goal: estimate how often a word with r counts in the training data occurs in a test set of equal size.
II. We use deleted estimation:
1. Delete one word at a time.
2. If the word "test" occurs r + 1 times in the complete data set: it occurs r times in the "training" set, so add one count to the bucket of words with r counts.
III. The total count placed in the bucket for r-count words is: n_{r+1} × (r + 1).
IV. The average count is: (average count of r-count words) = n_{r+1} × (r + 1) / n_r.

d) Good-Turing discounting (continued):
I. In Good-Turing, the total probability assigned to all unseen events equals n_1/N, where N is the size of the training set. This is the same probability mass that a relative-frequency formula would assign to singleton events.

e) Example: Good-Turing on a training sample of 22,000,000 words (Church & Gale 1991):

r    N_r               heldout    r*
0    74,671,100,000    0.00027    0.00027
1    2,018,046         0.448      0.446
2    449,721           1.25       1.26
3    188,933           2.24       2.24
4    105,668           3.23       3.24
5    68,379            4.21       4.22
6    48,190            5.19       5.23

f) Additional notes:
I. According to Zipf's law, n_r is large for small r and small for large r; for the n-gram with the largest count, n_{r+1} = 0 and hence r* = 0!
II. Therefore, for high-count n-grams the GT estimate is inaccurate, while the MLE estimate is relatively reliable, so MLE can be used for them directly. GT estimation is generally applied only to n-grams that occur at most k times (k < 10 or so). In other words, if discounting is "robbing the rich to help the poor", the ones actually being robbed are only the middle class; even the discounting method does not dare to touch the truly rich, the highest-count n-grams (few though they are).
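A minimal sketch of the adjusted-count formula r* = (r + 1) × n_{r+1} / n_r; the toy bigram counts are assumptions for illustration only. It also prints n_1/N, the total probability mass Good-Turing reserves for unseen events, and it exhibits the r* = 0 problem for the highest observed count mentioned in the notes above.

```python
from collections import Counter

def good_turing_adjusted_counts(freq_of_freqs):
    """Adjusted counts r* = (r + 1) * n_{r+1} / n_r for each observed frequency r.

    Only meaningful where n_{r+1} > 0; in practice GT is applied for small r
    (say r < k with k around 5-10) and raw MLE counts are kept for frequent n-grams.
    """
    return {r: (r + 1) * freq_of_freqs.get(r + 1, 0) / n_r
            for r, n_r in freq_of_freqs.items()}

# Hypothetical bigram counts (assumption for illustration only).
bigram_counts = Counter({("ask", "professors"): 3, ("cook", "professors"): 1,
                         ("grill", "doctorate"): 1, ("doctorate", "candidates"): 2})

n_r = Counter(bigram_counts.values())   # n_r: how many bigram types occur exactly r times
N = sum(bigram_counts.values())         # size of the training sample (in bigram tokens)

print(good_turing_adjusted_counts(n_r)) # r=1 -> 1.0, r=2 -> 3.0, r=3 -> 0.0 (the top count gets r* = 0)
print(n_r[1] / N)                       # n_1/N: probability mass reserved for unseen bigrams
```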
Sixth, interpolation and back-off

a) The bias-variance trade-off
I. Unsmoothed trigram estimate:
P_ML(w_i | w_{i-2}, w_{i-1}) = count(w_{i-2} w_{i-1} w_i) / count(w_{i-2} w_{i-1})
II. Unsmoothed bigram estimate:
P_ML(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1})
III. Unsmoothed unigram estimate:
P_ML(w_i) = count(w_i) / Σ_j count(w_j)
IV. How close are these different estimates to the "true" probability P(w_i | w_{i-2}, w_{i-1})?

b) Interpolation
I. One way of addressing the sparseness in a trigram model is to mix it with bigram and unigram models, which suffer less from data sparseness.
II. The weights can be set using the expectation-maximization (EM) algorithm or another numerical optimization technique.
III. Linear interpolation:
P̂(w_i | w_{i-2}, w_{i-1}) = λ_1 × P_ML(w_i | w_{i-2}, w_{i-1}) + λ_2 × P_ML(w_i | w_{i-1}) + λ_3 × P_ML(w_i)
where λ_1 + λ_2 + λ_3 = 1 and λ_i ≥ 0 for all i. (A small sketch follows at the end of this section.)
IV. Parameter estimation:
1. Hold out part of the training set as "validation" data.
2. Define count_2(w_1, w_2, w_3) to be the number of times the trigram (w_1, w_2, w_3) is seen in the validation set.
3. Choose the λ_i to maximize:
L(λ_1, λ_2, λ_3) = Σ_{(w_1, w_2, w_3)} count_2(w_1, w_2, w_3) × log P̂(w_3 | w_1, w_2)
subject to λ_1 + λ_2 + λ_3 = 1 and λ_i ≥ 0 for all i.
Note: for the rest of the parameter-estimation material, which is formula-heavy, please refer to the original courseware.

c) Katz back-off models (bigrams):
I. Define two sets:
A(w_{i-1}) = {w : count(w_{i-1}, w) > 0}
B(w_{i-1}) = {w : count(w_{i-1}, w) = 0}
II. A bigram model:
P_K(w_i | w_{i-1}) =
  count*(w_{i-1}, w_i) / count(w_{i-1})                          if w_i ∈ A(w_{i-1})
  α(w_{i-1}) × P_ML(w_i) / Σ_{w ∈ B(w_{i-1})} P_ML(w)            if w_i ∈ B(w_{i-1})
where
α(w_{i-1}) = 1 − Σ_{w ∈ A(w_{i-1})} count*(w_{i-1}, w) / count(w_{i-1})
III. Definitions of count*:
1. Katz uses the Good-Turing method for count(x) < 5, and sets count*(x) = count(x) for count(x) ≥ 5.
2. The "Kneser-Ney" method (absolute discounting): count*(x) = count(x) − D, where D = n_1 / (n_1 + n_2); n_1 is the number of elements with frequency 1 and n_2 is the number of elements with frequency 2.
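A minimal sketch of linear interpolation; the toy counts and the fixed weights λ = (0.6, 0.3, 0.1) are assumptions for illustration, whereas in practice the weights would be tuned on validation data as described above. It shows how a trigram that never occurred in training still receives a non-zero probability from the bigram and unigram terms.

```python
from collections import Counter

# Hypothetical training counts (assumptions for illustration).
trigrams = Counter({("grill", "doctorate", "candidates"): 2})
bigrams = Counter({("grill", "doctorate"): 2, ("doctorate", "candidates"): 2,
                   ("doctorate", "updates"): 1})
unigrams = Counter({"grill": 2, "doctorate": 3, "candidates": 2, "updates": 1})
total_tokens = sum(unigrams.values())

def p_ml_tri(u, v, w):
    return trigrams[(u, v, w)] / bigrams[(u, v)] if bigrams[(u, v)] else 0.0

def p_ml_bi(v, w):
    return bigrams[(v, w)] / unigrams[v] if unigrams[v] else 0.0

def p_ml_uni(w):
    return unigrams[w] / total_tokens

def p_interp(u, v, w, lambdas=(0.6, 0.3, 0.1)):
    """Linear interpolation of trigram, bigram and unigram MLE estimates.

    The weights must sum to 1 and be non-negative; here they are fixed,
    but normally they are estimated on held-out data (e.g. with EM).
    """
    l1, l2, l3 = lambdas
    return l1 * p_ml_tri(u, v, w) + l2 * p_ml_bi(v, w) + l3 * p_ml_uni(w)

print(p_interp("grill", "doctorate", "candidates"))  # mixes all three estimates
print(p_interp("grill", "doctorate", "updates"))     # > 0 even though the trigram was never seen
```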
Seventh, overview

a) Weaknesses of n-gram models
I. Any ideas? Short-range, mid-range, long-range (dependencies).

b) More refined models
I. Class-based models
II. Structural models
III. Topical and long-range models

c) Summary
I. Start with a vocabulary
II. Select a type of model
III. Estimate the parameters

d) Toolkits:
I. CMU-Cambridge language modeling toolkit: http://mi.eng.cam.ac.uk/~prc14/toolkit.html
II. SRILM - The SRI Language Modeling Toolkit: http://www.speech.sri.com/projects/srilm/

MIT Natural Language Processing, Lecture 3: Probabilistic Language Modeling