Let wi be any word in the text. If the word wi-1 that precedes it is known, the conditional probability P(wi | wi-1) can be used to predict how likely wi is to appear. This is the core idea of the statistical language model. More generally, let the variable W denote any word sequence in the text, composed of n words in order, that is, W = w1w2...wn. The statistical language model gives the probability P(W) that the word sequence W appears in the text. Using the product rule of probability, P(W) can be expanded as:
P(W) = P(w1) P(w2|w1) P(w3|w1w2) ... P(wn|w1w2...wn-1)
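The chain-rule expansion above can be written directly as a short function. Note that `cond_prob` is only a placeholder for some trained model; the uniform model used below is a made-up stand-in just to keep the sketch runnable:

```python
from typing import Callable, Sequence, Tuple

def sentence_prob(words: Sequence[str],
                  cond_prob: Callable[[str, Tuple[str, ...]], float]) -> float:
    """Compute P(W) = P(w1) * P(w2|w1) * ... * P(wn|w1...wn-1)."""
    prob = 1.0
    for i, w in enumerate(words):
        history = tuple(words[:i])      # all words before wi
        prob *= cond_prob(w, history)
    return prob

# Toy illustration: a uniform model over a 4-word vocabulary,
# i.e. P(w|history) = 1/4 regardless of history (an assumption,
# not a trained model).
uniform = lambda w, history: 0.25
print(sentence_prob(["he", "is", "a", "student"], uniform))  # 0.25**4
```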
To predict the probability of a word wn this way, we would need to know the probabilities conditioned on all the words before it, which is computationally far too expensive. If we instead assume that the probability of any word wi depends only on the single word immediately before it (the Markov assumption), the problem is greatly simplified. The resulting language model is called a bigram model:
P(W) ≈ P(w1) P(w2|w1) P(w3|w2) ... P(wn|wn-1)
If the probability of any word wi depends only on the two words before it, the language model is called a trigram model:
P(W) ≈ P(w1) P(w2|w1) P(w3|w1w2) P(w4|w2w3) ... P(wn|wn-2wn-1)
In general, an n-gram model assumes that the probability of the current word depends only on the n-1 words before it. Crucially, these probability parameters can all be estimated from a large-scale corpus. For example, the trigram probability is
P(wi | wi-2wi-1) ≈ count(wi-2wi-1wi) / count(wi-2wi-1)
where count(...) denotes the number of times the given word sequence occurs in the entire corpus.
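Estimating these parameters really is just counting. A minimal sketch in Python (the toy corpus below is made up purely for illustration):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (stored as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical toy corpus standing in for a large-scale corpus.
corpus = "he is a student he is a doctor he is a student".split()

unigrams = ngram_counts(corpus, 1)
bigrams = ngram_counts(corpus, 2)

# MLE bigram probability: P(wi | wi-1) = count(wi-1 wi) / count(wi-1)
def bigram_prob(prev, word):
    return bigrams[(prev, word)] / unigrams[(prev,)]

print(bigram_prob("he", "is"))      # count(he is)/count(he) = 3/3 = 1.0
print(bigram_prob("a", "student"))  # count(a student)/count(a) = 2/3
```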
Here is a worked bigram example. Assume the corpus contains 15,000 words in total, and the words of the sentence "he is a computer doctoral student" occur the following numbers of times:
| Word | Count |
| --- | --- |
| he | 2500 |
| is | 3000 |
| computer | 100 |
| doctor | 85 |
| graduate student | 196 |
The bigram counts (row word wi-1 followed by column word wi) are shown in the following table:
| wi-1 \ wi | he | is | computer | doctor | graduate student |
| --- | --- | --- | --- | --- | --- |
| he | 6 | 1900 | 20 | 15 | 10 |
| is | 150 | 8 | 80 | 65 | 80 |
| computer | 0 | 300 | 1 | 50 | 100 |
| doctor | 5 | 50 | 5 | 2 | 110 |
| graduate student | 3 | 30 | 6 | 3 | 8 |
P(he is a computer doctoral student)
= P(he) P(is|he) P(computer|is) P(doctor|computer) P(graduate student|doctor)
= (2500/15000) × (1900/2500) × (80/3000) × (50/100) × (110/196)
≈ 0.00095
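The product can be checked by transcribing the table values (exactly as given above) into Python:

```python
# Unigram counts from the first table; 15,000 words in the corpus in total.
total = 15000
unigram = {"he": 2500, "is": 3000, "computer": 100,
           "doctor": 85, "graduate student": 196}

# Bigram counts (row = wi-1, column = wi) from the second table;
# only the entries this example needs are copied here.
bigram = {("he", "is"): 1900, ("is", "computer"): 80,
          ("computer", "doctor"): 50, ("doctor", "graduate student"): 110}

# P(he is a computer doctoral student), factor by factor as in the text
# (the last factor is divided by 196, following the text's arithmetic).
p = (unigram["he"] / total) \
    * (bigram[("he", "is")] / unigram["he"]) \
    * (bigram[("is", "computer")] / unigram["is"]) \
    * (bigram[("computer", "doctor")] / unigram["computer"]) \
    * (bigram[("doctor", "graduate student")] / unigram["graduate student"])

print(p)  # ≈ 0.00095
```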
A note on the data sparseness problem. Suppose the vocabulary contains 10,000 words. With a bigram model there are up to 10,000² possible bigrams; with a trigram model, up to 10,000³ possible trigrams. A great many of these word combinations never appear in the corpus, so their maximum-likelihood estimates come out as 0, which causes serious trouble: when computing the probability of a sentence, a single zero factor makes the probability of the entire sentence zero. The end result is a model that assigns nonzero probability to only a pitiful handful of sentences, while most sentences get probability 0. We therefore need data smoothing, which has two goals: first, the probabilities of all n-grams must sum to 1; second, no n-gram may have probability 0.
With the goals of data smoothing clear, how do we achieve them? This is what the Good-Turing estimate solves. Its principle: for an event that has never been observed, we cannot assume its probability of occurrence is zero, so a small share of the total probability mass is set aside for these unseen events. The total probability of the observed events then becomes less than 1, which means the probabilities of the observed events must be discounted. The less reliable a statistic is (i.e., the fewer times the event was observed), the more heavily it is discounted.
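A bare-bones sketch of the Good-Turing idea: the adjusted count is r* = (r+1) · N(r+1) / N(r), where N(r) is the number of distinct n-grams seen exactly r times, and the mass reserved for unseen events is N(1)/N. Real implementations additionally smooth the N(r) values, which this sketch does not:

```python
from collections import Counter

def good_turing_adjusted_counts(counts):
    """Good-Turing discounting: r* = (r+1) * N(r+1) / N(r),
    where N(r) is the number of distinct items seen exactly r times."""
    freq_of_freq = Counter(counts.values())          # N(r)
    adjusted = {}
    for item, r in counts.items():
        n_r, n_r1 = freq_of_freq[r], freq_of_freq.get(r + 1, 0)
        # When N(r+1) is 0 the raw formula breaks down; keep r unchanged.
        adjusted[item] = (r + 1) * n_r1 / n_r if n_r1 else r
    return adjusted

# Toy data: a seen 3 times, b and c twice, d and e once (9 observations).
counts = Counter(["a", "a", "a", "b", "b", "c", "c", "d", "e"])
total = sum(counts.values())

adjusted = good_turing_adjusted_counts(counts)
print(adjusted)

# Probability mass reserved for unseen events: N(1) / N = 2/9
unseen_mass = sum(1 for r in counts.values() if r == 1) / total
print(unseen_mass)
```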