An Intuitive Understanding of the N-gram Language Model


N-gram Language Model

Consider a speech recognition system. Suppose the user says the sentence "I have a gun". Because the words sound alike, the recognizer finds several possible candidate transcriptions: (1) I have a gun. (2) I have a gull. (3) I have a gub. So the question is: which one is the right answer?

The usual solution is statistical: compare the probabilities with which sentences 1, 2, and 3 occur in English, and return to the user the sentence with the highest probability. So how do you compute the probability of a sentence occurring? Bluntly put, by "counting". But even counting can be done in many different ways; the simplest is the following:

Given a corpus, count the number of sentences in it whose length is 4; call this N. Then count how many of those length-4 sentences are exactly "I have a gun"; call this N0. The probability of the sentence "I have a gun" is then estimated as N0/N. The probabilities of the other two sentences are computed in the same way.
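To make this concrete, here is a minimal Python sketch of this whole-sentence counting estimator. The tiny corpus below is hypothetical, made up purely for illustration:

    # A minimal sketch of the whole-sentence counting estimator.
    # The corpus is a hypothetical toy example.
    corpus = [
        "i have a gun",
        "i have a gun",
        "i have a dog",
        "she has a cat",
    ]

    def sentence_probability(sentence, corpus):
        # N = number of corpus sentences with the same length as `sentence`
        length = len(sentence.split())
        same_length = [s for s in corpus if len(s.split()) == length]
        if not same_length:
            return 0.0
        # N0 = number of exact matches of `sentence` among them
        n0 = sum(1 for s in same_length if s == sentence)
        return n0 / len(same_length)  # N0 / N

    print(sentence_probability("i have a gun", corpus))  # 2/4 = 0.5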

This way of counting is logically sound, but because language is flexible and any corpus is finite, a slightly longer sentence probably never appears in the corpus at all. Take the sentence "I am looking for a restaurant to eat breakfast": intuitively, this sentence should appear many times in a corpus, right? But if you enter it into Google's search box and click Search, you will find no exact match in the returned results. We therefore need a more effective way of "counting".

To make things precise, we need to introduce a few simple mathematical symbols.

1. Word sequence: w1, w2, w3, ..., wn

2. Chain rule: P(w1, w2, w3, ..., wn) = P(w1) P(w2|w1) P(w3|w1w2) P(w4|w1w2w3) ... P(wn|w1w2...wn-1)

Now, we want to compute the probability of "I have a gun", that is, P(I, have, a, gun). By the chain rule:

P(I, have, a, gun) = P(I) P(have|I) P(a|I, have) P(gun|I, have, a)
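The decomposition itself is purely mechanical, as this small illustrative Python sketch shows (the function name is made up for this article):

    def chain_rule_factors(words):
        # Build the symbolic factors P(w1), P(w2|w1), P(w3|w1,w2), ...
        factors = []
        for i, w in enumerate(words):
            history = ",".join(words[:i])
            factors.append(f"P({w}|{history})" if history else f"P({w})")
        return factors

    print(" * ".join(chain_rule_factors(["I", "have", "a", "gun"])))
    # P(I) * P(have|I) * P(a|I,have) * P(gun|I,have,a)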

But things have not gotten any simpler. For example, to compute P(gun|I, have, a), expand it with the conditional probability formula:

P(gun|I, have, a) = P(I, have, a, gun) / P(I, have, a)

Notice anything? To compute P(gun|I, have, a), we first need P(I, have, a, gun) and P(I, have, a). But wait: isn't P(I, have, a, gun) exactly the value we set out to compute in the first place? So we have gone in a circle and ended up back where we started.

OK, now let's tidy up the idea.

A sentence can be represented as a word sequence w1, w2, w3, ..., wn. We want to compute the probability of this sentence occurring, that is, P(w1, w2, w3, ..., wn). This probability could be estimated directly by counting, but as we saw, that works poorly, so we use the chain rule to turn P(w1, w2, w3, ..., wn) into a product of factors: P(w1) P(w2|w1) P(w3|w1w2) P(w4|w1w2w3) ... P(wn|w1w2...wn-1). But the transformation did not make the problem any simpler. What now?

This is where the N-gram model comes in handy.

An n-gram model makes a simplifying assumption: each word depends only on the previous n-1 words.

For a bigram (2-gram) model, we assume P(wn|w1w2...wn-1) ≈ P(wn|wn-1)

For a trigram (3-gram) model, we assume P(wn|w1w2...wn-1) ≈ P(wn|wn-2wn-1)

For a 4-gram model, we assume P(wn|w1w2...wn-1) ≈ P(wn|wn-3wn-2wn-1)

And so on.

So:

Under the bigram model:

P(w1, w2, w3, ..., wn) = P(w1) P(w2|w1) P(w3|w1w2) P(w4|w1w2w3) ... P(wn|w1w2...wn-1)

≈ P(w1) P(w2|w1) P(w3|w2) P(w4|w3) ... P(wn|wn-1)

Under the trigram model:

P(w1, w2, w3, ..., wn) = P(w1) P(w2|w1) P(w3|w1w2) P(w4|w1w2w3) ... P(wn|w1w2...wn-1)

≈ P(w1) P(w2|w1) P(w3|w1w2) P(w4|w2w3) ... P(wn|wn-2wn-1)

Under the 4-gram model:

P(w1, w2, w3, ..., wn) = P(w1) P(w2|w1) P(w3|w1w2) P(w4|w1w2w3) ... P(wn|w1w2...wn-1)

≈ P(w1) P(w2|w1) P(w3|w1w2) P(w4|w1w2w3) P(w5|w2w3w4) ... P(wn|wn-3wn-2wn-1)
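The pattern in all three cases is the same: keep at most the last n-1 words of the history. Here is a minimal Python sketch of that truncation; the cond_prob argument is a hypothetical placeholder for whatever conditional probability estimate you have:

    def ngram_sentence_probability(words, n, cond_prob):
        # Approximate P(w1,...,wN) by conditioning each word on at most
        # the previous n-1 words. cond_prob(word, history) -> float is
        # assumed to be supplied by the caller.
        prob = 1.0
        for i, w in enumerate(words):
            history = tuple(words[max(0, i - (n - 1)):i])  # truncated history
            prob *= cond_prob(w, history)
        return prob

    # Dummy usage: a fake model that assigns 0.1 to every word.
    print(ngram_sentence_probability("i have a gun".split(), 2,
                                     lambda w, history: 0.1))  # 0.1**4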

Assuming that we are using the bigram model, then:

P(I, have, a, gun) = P(I) P(have|I) P(a|have) P(gun|a)

Then we use the "counting" method to estimate P(I) and the other three conditional probabilities:

P(I) = number of occurrences of "I" in the corpus / total number of words in the corpus

P(have|I) = number of occurrences of the word pair "I have" in the corpus / number of occurrences of "I" in the corpus

P(a|have) and P(gun|a) are estimated in the same way.
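Putting everything together, here is a minimal end-to-end sketch: it estimates the bigram probabilities by counting over a hypothetical toy corpus and then scores the three candidate transcriptions. A real system would use a far larger corpus and smoothing for unseen word pairs:

    from collections import Counter

    # Hypothetical toy corpus, invented for illustration.
    corpus_sentences = [
        "i have a gun",
        "i have a dog",
        "they have a gun",
        "i have a gull",
    ]

    unigrams = Counter()
    bigrams = Counter()
    for sentence in corpus_sentences:
        words = sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))

    total_words = sum(unigrams.values())

    def p_word(w):
        # P(w) = count(w) / total number of words in the corpus
        return unigrams[w] / total_words

    def p_cond(w, prev):
        # P(w|prev) = count("prev w") / count("prev"); 0 if prev is unseen
        return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

    def bigram_probability(sentence):
        words = sentence.split()
        prob = p_word(words[0])
        for prev, w in zip(words, words[1:]):
            prob *= p_cond(w, prev)
        return prob

    for candidate in ["i have a gun", "i have a gull", "i have a gub"]:
        print(candidate, bigram_probability(candidate))
    # "i have a gun" scores highest; "i have a gub" gets probability 0
    # because the pair "a gub" never occurs in the corpus.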

To sum up, this article is only a very simple introduction to N-grams; the goal is to be easy to understand rather than rigorous. Interested readers can consult further references: N-grams are covered in any book on natural language processing.
