SRILM - Language Model - N-gram Basic Introduction


From: http://hi.baidu.com/isswangqing/item/1b8e3ad096c286be32db9033

N-gram is a commonly used language model. It is based on the assumption that the appearance of the n-th word depends only on the preceding n-1 words and is unrelated to any other word. The probability of a whole sentence is then the product of the conditional probabilities of its words, and these probabilities can be estimated by counting how often sequences of n words appear together in the corpus.

If W = w1 w2 ... wn is a word sequence of length n, the probability of W is:

P(W) = P(w1) × P(w2 | w1) × P(w3 | w1 w2) × ... × P(wn | w1 w2 ... wn-1)

For example, P ("John Read a book") = P ("John") × P ("Read | John") × P ("A | John Read ") × P ("Book | John Read ")

In the n-gram model, the value of N has a great impact on the algorithm performance and time-space overhead.

When n is large:
1. More context information is available, and the contexts are more discriminative.
2. However, the number of parameters is large, the computational cost is high, a large training corpus is needed, and the parameter estimates are unreliable.

When n is small:
1. Less context information is available, and the contexts are less discriminative.
2. However, the number of parameters is small, the computational cost is low, the training corpus does not need to be large, and the parameter estimates are reliable.

Generally, n = 2 (bigram) and n = 3 (trigram) are suitable.

Assume that the training corpus is as follows:
<BOS> John read Moby Dick <EOS>
<BOS> Mary read a different book <EOS>
<BOS> She read a book by Cher <EOS>

<BOS> and <EOS> are sentence boundary markers, indicating the beginning and end of a sentence. With n = 2, the bigram probabilities estimated from this corpus that we will need are:

P(John | <BOS>) = 1/3, P(read | John) = 1, P(a | read) = 2/3, P(book | a) = 1/2, P(<EOS> | book) = 1/2

Calculate the probability of the sentence "John read a book":

P(John read a book) = P(John | <BOS>) × P(read | John) × P(a | read) × P(book | a) × P(<EOS> | book) = 1/3 × 1 × 2/3 × 1/2 × 1/2 ≈ 0.06
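The same calculation can be reproduced with a short script. The following is a minimal sketch (plain Python, not SRILM itself) that builds maximum-likelihood bigram estimates from the three-sentence corpus above and scores the example sentence; the function and variable names are chosen only for illustration.

```python
from collections import Counter

# Toy corpus from the text, with explicit sentence-boundary markers.
corpus = [
    "<BOS> John read Moby Dick <EOS>",
    "<BOS> Mary read a different book <EOS>",
    "<BOS> She read a book by Cher <EOS>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood estimate P(word | prev) = C(prev word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

sentence = ["<BOS>", "John", "read", "a", "book", "<EOS>"]
prob = 1.0
for prev, word in zip(sentence, sentence[1:]):
    prob *= bigram_prob(prev, word)

print(prob)  # 1/3 * 1 * 2/3 * 1/2 * 1/2 = 0.0555... ~= 0.06
```

The printed value is 1/18 ≈ 0.056, which rounds to the 0.06 given above.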

 

 

From: http://blog.sina.com.cn/s/blog_623e3c050100m31g.html

Statistical Language Model

Suppose a sentence s can be expressed as a word sequence s = w1 w2 ... wn. The language model gives the probability P(s) of the sentence s:

P(s) = P(w1) × P(w2 | w1) × P(w3 | w1 w2) × ... × P(wn | w1 w2 ... wn-1)

Computing this probability directly is intractable, because the number of distinct histories w1 w2 ... wi-1 is enormous. The solution is to map each history w1 w2 ... wi-1 to an equivalence class S(w1 w2 ... wi-1), where the number of equivalence classes is much smaller than the number of distinct histories, that is, to assume:

P(wi | w1 w2 ... wi-1) ≈ P(wi | S(w1 w2 ... wi-1))

N-gram model

When two histories share the same most recent n-1 words (or characters), they are mapped to the same equivalence class. The resulting model is called the n-gram model, and it is an (n-1)-th order Markov chain. The value of n cannot be too large, otherwise the computation is still too expensive. By maximum likelihood estimation, the parameters of the language model are:

P(wi | wi-n+1 ... wi-1) = C(wi-n+1 ... wi-1 wi) / C(wi-n+1 ... wi-1)

where C(wi-n+1 ... wi-1 wi) denotes the number of times the word sequence wi-n+1 ... wi-1 wi appears in the training data.
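To make the counting concrete, here is a hedged sketch (plain Python, not SRILM's implementation) that generalizes the earlier bigram example to an arbitrary n, estimating P(wi | wi-n+1 ... wi-1) as a ratio of n-gram to history counts:

```python
from collections import Counter

def ngram_mle(sentences, n):
    """Return a function giving P(word | history) by maximum likelihood,
    where history is a tuple of the previous n-1 tokens."""
    ngram_counts = Counter()
    history_counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            ngram = tuple(tokens[i:i + n])
            ngram_counts[ngram] += 1
            history_counts[ngram[:-1]] += 1

    def prob(history, word):
        # C(w_{i-n+1} ... w_i) / C(w_{i-n+1} ... w_{i-1})
        if history_counts[history] == 0:
            return 0.0
        return ngram_counts[history + (word,)] / history_counts[history]

    return prob

# Example: trigram estimate P(book | read a) from the toy corpus above.
sents = [s.split() for s in ("<BOS> John read Moby Dick <EOS>",
                             "<BOS> Mary read a different book <EOS>",
                             "<BOS> She read a book by Cher <EOS>")]
p = ngram_mle(sents, 3)
print(p(("read", "a"), "book"))  # 0.5: "read a" occurs twice, once followed by "book"
```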

Introduction to Smoothing Techniques

Traditional estimation methods require, for N independent observations of a random variable, that the sample size N satisfy:

N > K

where K is the number of distinct values the random variable can take.

Real language models often cannot meet this requirement. For example, suppose part-of-speech tagging uses 140 possible tags and a trigram tag model in which the current tag depends on the two preceding tags; the random variable is then a triple of tags, so

K = 140 × 140 × 140 = 2,744,000

With a training set of only about 100,000 words, that is

N ≈ 10^5 << K

which shows that the training data is far from sufficient.

Assume k denotes an event and N(k) denotes the observed frequency of event k. The maximum likelihood method uses the relative frequency as the probability estimate of event k:

P(k) = N(k) / N

In a language model, a large number of events have N(k) = 0 in the training corpus, which clearly does not reflect the real situation. This is called the data sparseness problem. Zero-valued probability estimates make the language model unusable: if a zero probability appears as a factor, the whole product becomes 0, and its logarithm cannot be taken.

Counting equivalence classes

By symmetry, events should not be distinguished by anything other than their number of occurrences; that is, all events k with the same count r = N(k) (the number of occurrences of an event is called its count) should receive the same probability estimate. Events with the same count are called count-equivalent, and the equivalence class they form is written G_r. For the count equivalence class with count r, let n_r denote the number of members of the class, p_r the probability of a single event in the class, and R the largest possible count. Then

n_0·p_0 + n_1·p_1 + ... + n_R·p_R = 1
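To make the notation concrete, here is a small hedged sketch that computes the count-of-counts n_r (the sizes of the count equivalence classes) from a list of event counts; the sample counts and variable names are purely illustrative.

```python
from collections import Counter

# Hypothetical event counts N(k), e.g. bigram counts from some small sample.
event_counts = [3, 1, 1, 2, 1, 5, 2, 1, 1, 3]

# n_r = number of events whose count equals r (the size of the class G_r).
count_of_counts = Counter(event_counts)

N = sum(event_counts)  # total number of observations
for r in sorted(count_of_counts):
    print(f"r = {r}: n_r = {count_of_counts[r]}")
print("N =", N)
```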
Cross-validation
Cross-validation (held-out estimation) divides the training sample into M parts; one part is held out and the remaining parts are used for training. The probabilities p_r of the count equivalence classes are estimated on the training part and tested on the held-out part. Let C_r denote the number of observations in the held-out part of events that belong to the count equivalence class with count r in the training part. The p_r are then estimated by maximum likelihood on the held-out part, that is, by maximizing its log-likelihood.
Solving this constrained maximization with a Lagrange multiplier, i.e. taking the partial derivative with respect to each p_r and setting it to zero, gives the held-out (cross-validation) estimate:

p_r = C_r / (n_r · N')

where N' is the number of observations in the held-out part. If the training part itself is used as the held-out part, this reduces to the ordinary maximum likelihood estimate p_r = r / N.
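The derivation behind this estimate is short; the following reconstructs the standard held-out argument (the symbol λ for the multiplier and N' for the held-out size are notation introduced here, not taken from the original):

```latex
\begin{aligned}
&\text{maximize } \mathcal{L} = \sum_{r} C_r \log p_r
 \quad\text{subject to}\quad \sum_{r} n_r\, p_r = 1 \\[4pt]
&\frac{\partial}{\partial p_r}\Big(\mathcal{L} - \lambda \sum_{r} n_r\, p_r\Big)
 = \frac{C_r}{p_r} - \lambda\, n_r = 0
 \;\;\Rightarrow\;\; p_r = \frac{C_r}{\lambda\, n_r} \\[4pt]
&\sum_r n_r\, p_r = \frac{1}{\lambda}\sum_r C_r = 1
 \;\;\Rightarrow\;\; \lambda = \sum_r C_r = N'
 \;\;\Rightarrow\;\; p_r = \frac{C_r}{n_r\, N'}
\end{aligned}
```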
Leave-one-out estimation
The leave-one-out method is an extension of cross-validation. The basic idea is to split the N given samples into N-1 samples used as the training part and 1 sample used as the held-out part, and to repeat this N times so that every sample serves once as the held-out sample. Its advantage is that it makes full use of the given data: for each of the N observations, leaving it out simulates the situation in which that event has not been observed. The maximum likelihood estimate of p_r under leave-one-out is obtained in the same way as the held-out estimate above.
Turing-Good Formula
Because n_r·p_r is negligible compared with 1, the leave-one-out estimate can be approximated as:

p_r ≈ (r + 1) · n_{r+1} / (n_r · N)

The leave-one-out estimate in effect uses the events with count r = 1 to simulate the unseen events, which gives the following estimate of the total probability of unseen events:

n_0 · p_0 = n_1 / N

This is the famous Turing-Good formula.
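A minimal sketch of this estimate in code (assuming simple Turing-Good counts, with a naive fallback when n_{r+1} = 0; practical toolkits such as SRILM smooth the n_r values instead, and the names here are illustrative only):

```python
from collections import Counter

def turing_good(event_counts):
    """Turing-Good estimate.

    event_counts: mapping event -> observed count N(k).
    An event seen r times gets the adjusted count r* = (r + 1) * n_{r+1} / n_r
    and the probability r* / N; the total mass of unseen events is n_1 / N.
    """
    N = sum(event_counts.values())
    n = Counter(event_counts.values())          # n[r] = number of events with count r

    probs = {}
    for event, r in event_counts.items():
        if n.get(r + 1, 0) > 0:
            r_star = (r + 1) * n[r + 1] / n[r]  # discounted count
        else:
            r_star = r                          # naive fallback when n_{r+1} = 0
        probs[event] = r_star / N

    unseen_mass = n.get(1, 0) / N               # probability reserved for unseen events
    return probs, unseen_mass

# Example with a handful of bigram counts.
counts = {"read a": 2, "a book": 1, "John read": 1, "Mary read": 1, "book by": 1}
p, p0 = turing_good(counts)
print(p0)  # 4/6 of the mass is reserved for unseen bigrams in this tiny sample
```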

Null equivalence class

The leave-one-out estimate requires that n_r is not 0. In practical problems this requirement generally fails once r exceeds about 5; that is, some of the count equivalence classes G_1, ..., G_R are empty. Sort the non-empty classes by count: let r(l) be the l-th non-zero count and n_{r(l)} the number of events in that class, and use the next non-empty equivalence class G_{r(l+1)} in place of the possibly empty class G_{r(l)+1}. The leave-one-out estimation formula is modified accordingly.

No probability is estimated for an empty equivalence class, because an empty equivalence class does not correspond to any actual event.

Advantages, disadvantages and applicability of Turing-Good estimation

Disadvantages: (1) It cannot guarantee the "orderliness" of the probability estimates, i.e. that an event with more occurrences receives a higher probability than an event with fewer occurrences. (2) p_r can deviate considerably from r/N; a good estimate should guarantee p_r <= r/N.
Advantage: it is the basis of other smoothing techniques.
Applicability: estimating small-count events with 0 < r < 6.

Constrained leave-one-out estimation

Monotonicity constraint: p_{r-1} <= p_r. Discount constraint: p_r <= r/N. The constrained leave-one-out estimate requires the count estimate r* = p_r · N to lie between the absolute frequencies closest to it:

r - 1 <= r* <= r

Under this constraint, the monotonicity constraint is automatically satisfied. Computation: compute the p_r, check whether each p_r satisfies the constraints; if not, clip it to the violated upper or lower bound, recompute the normalization, and iterate until every p_r satisfies the constraints.

Discount models

Katz pointed out that the Turing-Good formula essentially discounts the counts of the events observed in the model and redistributes the discounted probability mass over the n_0 unseen events. Following this idea, the estimate can be written in the form:

p_r = (r - d_r) / N

where d_r is a discount function applied to events with count r.

Absolute discount model

If the discount function is defined as d_r = b, where b is a constant greater than 0, the total probability of unseen events is:

n_0 · p_0 = b · (n_1 + n_2 + ... + n_R) / N

and the estimate corresponding to the absolute discount model is:

p_r = (r - b) / N, for r >= 1

Linear discount model

If the discount function is defined as d_r = a · r, where a is a constant greater than 0, the total probability of unseen events is:

n_0 · p_0 = a · (1·n_1 + 2·n_2 + ... + R·n_R) / N = a

and the estimate corresponding to the linear discount model is:

p_r = (1 - a) · r / N, for r >= 1

If a = n_1/N, then n_0·p_0 = n_1/N, which coincides with the Turing-Good estimate.
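As an illustration of the discount idea (a hedged sketch, not Katz's full back-off model), the following applies absolute discounting with a constant b to a set of counts and redistributes the collected mass uniformly over the unseen events; the value of b and the example counts are illustrative choices only.

```python
def absolute_discount(event_counts, unseen_events, b=0.5):
    """Absolute discounting: p(k) = (N(k) - b) / N for seen events; the
    collected mass b * (number of seen events) / N is shared uniformly
    among the events that were never observed."""
    N = sum(event_counts.values())
    probs = {k: (c - b) / N for k, c in event_counts.items()}

    discounted_mass = b * len(event_counts) / N
    for k in unseen_events:
        probs[k] = discounted_mass / len(unseen_events)
    return probs

counts = {"read a": 2, "a book": 1, "John read": 1}
unseen = ["a Cher", "book read"]   # events with count 0
p = absolute_discount(counts, unseen, b=0.5)
print(sum(p.values()))             # 1.0: the distribution still sums to one
```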
Deleted interpolation

The basic idea is that, because an n-gram is more likely to have been observed in the training data than an (n+1)-gram, the lower-order n-gram estimates are used to help estimate the (n+1)-gram probabilities. For a trigram model, for example:

P(w3 | w1 w2) = λ1 · f(w3 | w1 w2) + λ2 · f(w3 | w2) + λ3 · f(w3)

where λ1 + λ2 + λ3 = 1, each λi >= 0, and f(·) denotes the relative-frequency (maximum likelihood) estimate computed from the training data.

Determining the parameters λ: the training data is divided into two parts; one part is used to estimate the relative frequencies f(wi | w1 w2 ... wi-1), and the other part is used to compute the parameters λ, chosen so that the resulting language model has the lowest perplexity.
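A minimal sketch of the interpolation step (assuming the λ weights have already been chosen on the held-out part; the helper name and the fixed example weights are illustrative, not a prescribed algorithm):

```python
def interpolated_trigram(f3, f2, f1, lambdas):
    """P(w3 | w1 w2) = l1*f(w3|w1 w2) + l2*f(w3|w2) + l3*f(w3),
    where lambdas = (l1, l2, l3) and l1 + l2 + l3 = 1."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9, "interpolation weights must sum to 1"
    return l1 * f3 + l2 * f2 + l3 * f1

# Example: the trigram itself was never seen (f3 = 0), but the bigram and
# unigram were; the interpolated estimate is still non-zero.
p = interpolated_trigram(f3=0.0, f2=0.5, f1=0.1, lambdas=(0.6, 0.3, 0.1))
print(p)  # 0.16
```

In practice the λ weights themselves would be estimated on the held-out data (for example by an EM-style reweighting) rather than fixed by hand, so that the resulting model minimizes perplexity as described above.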

 

 
