Natural Language Processing (3): The n-gram Model


Let wi be any word in the text. If the word wi-1 that immediately precedes it is known, we can use the conditional probability P(wi | wi-1) to predict how likely wi is to appear. This is the basic idea of the statistical language model. In general, let the variable W denote any word sequence in the text consisting of n words in order, that is, W = w1 w2 ... wn. A statistical language model gives the probability P(W) that the word sequence W appears in the text. Using the chain rule of probability, P(W) can be expanded:

P(W) = P(w1) P(w2 | w1) P(w3 | w1 w2) ... P(wn | w1 w2 ... wn-1)

To predict the probability of the word wn this way, we must condition on all the words before it, which is computationally impractical. If we assume instead that the probability of any word wi depends only on the single word immediately before it (the Markov assumption), the problem is greatly simplified. The resulting language model is called a bigram model:

P(W) ≈ P(w1) P(w2 | w1) P(w3 | w2) ... P(wn | wn-1)

If the probability of any word wi depends only on the two words immediately before it, the language model is called a trigram model:

P(W) ≈ P(w1) P(w2 | w1) P(w3 | w1 w2) P(w4 | w2 w3) ... P(wn | wn-2 wn-1)

In general, an n-gram model assumes that the probability of the current word depends only on the n-1 words before it. What makes the model practical is that these probability parameters can be estimated from a large-scale corpus by counting. For example, the trigram probability is estimated as

P(wi | wi-2 wi-1) ≈ count(wi-2 wi-1 wi) / count(wi-2 wi-1)

where count(...) denotes the number of times the given word sequence occurs in the entire corpus.
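
As a concrete illustration of this counting scheme, here is a minimal Python sketch (an addition, not part of the original article); the toy corpus and function names are invented for demonstration:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-word sequence in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy corpus, invented for demonstration; a real model needs a large corpus.
corpus = "he is a doctor he is a student he is a computer doctor".split()

bigrams = ngram_counts(corpus, 2)
trigrams = ngram_counts(corpus, 3)

def p_trigram(w1, w2, w3):
    """MLE trigram probability: count(w1 w2 w3) / count(w1 w2)."""
    if bigrams[(w1, w2)] == 0:
        return 0.0
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

print(p_trigram("he", "is", "a"))      # 1.0: "he is" is always followed by "a" here
print(p_trigram("is", "a", "doctor"))  # 1/3: "is a" is followed by "doctor" once out of 3 times
```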

 

Here is an example using a bigram model. Assume the corpus contains 15,000 words in total, and the sentence to be scored is the five-word sequence "He / is / computer / doctor / graduate student" (word for word, "He is a computer doctoral student").

The following table lists the number of times each of these words appears in the corpus:

Word                Count
He                   2500
is                   3000
computer              100
doctor                 85
graduate student      196

The bigram counts (the number of times the row word is immediately followed by the column word) are shown in the following table:

 

                    He    is    computer  doctor  graduate student
He                   6  1900          20      15                10
is                 150     8          80      65                80
computer             0   300           1      50               100
doctor               5    50           5       2               110
graduate student     3    30           6       3                 8

 

P(He is computer doctor graduate student)

= P(He) × P(is | He) × P(computer | is) × P(doctor | computer) × P(graduate student | doctor)

= (2500/15000) × (1900/2500) × (80/3000) × (50/100) × (110/196)

≈ 0.00095
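
The arithmetic above can be checked mechanically. The following Python snippet (an addition, not from the original article) simply transcribes the product as exact fractions:

```python
from fractions import Fraction

# Each factor below transcribes the worked product in the article.
p = (Fraction(2500, 15000)    # P(He)
     * Fraction(1900, 2500)   # P(is | He)
     * Fraction(80, 3000)     # P(computer | is)
     * Fraction(50, 100)      # P(doctor | computer)
     * Fraction(110, 196))    # P(graduate student | doctor)

print(float(p))  # ~0.000948
```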

 

A note on the data sparseness problem. Suppose the vocabulary contains 10,000 words. With a bigram model there are 10000^2 possible bigrams; with a trigram model there are 10000^3 possible trigrams. Most of these word combinations never appear in the corpus, so the probabilities obtained by maximum likelihood estimation are 0. This causes serious trouble: when computing the probability of a sentence, a single zero factor makes the probability of the entire sentence 0. The end result is that our model can assign nonzero probability to only a pitiful handful of sentences, while most sentences get probability 0. We therefore need data smoothing. Data smoothing has two goals: first, the probabilities of all n-grams should sum to 1; second, no n-gram probability should be 0.
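
To make the zero-probability problem and the effect of smoothing concrete, here is a minimal Python sketch using add-one (Laplace) smoothing, the simplest smoothing method; it is shown only as an illustration and is not the Good-Turing method discussed next. The toy corpus is invented:

```python
from collections import Counter

# Toy corpus, invented for demonstration.
corpus = "he is a doctor he is a graduate student".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size

def p_mle(prev, cur):
    # Maximum likelihood estimate: any bigram absent from the corpus gets 0.
    return bigrams[(prev, cur)] / unigrams[prev]

def p_laplace(prev, cur):
    # Add-one smoothing: pretend every possible bigram occurred once more, so
    # no probability is 0, and dividing by count(prev) + V keeps each conditional
    # distribution summing to 1 (ignoring sentence-boundary effects).
    return (bigrams[(prev, cur)] + 1) / (unigrams[prev] + V)

print(p_mle("he", "doctor"))      # 0.0 -> any sentence containing "he doctor" scores 0
print(p_laplace("he", "doctor"))  # 1/8 = 0.125 -> small but nonzero
```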

The goals of data smoothing are clear; how can we achieve them? This is the problem the Good-Turing estimate was designed to solve. The principle of Good-Turing estimation is this: for an event that has never been seen, we cannot assume its probability of occurrence is zero, so we set aside a very small portion of the total probability mass for these unseen events. As a result, the total probability of the events we have seen must become less than 1, which means the probabilities of the observed events must be discounted. The less frequently an event has been observed, the less trustworthy its count is, and the more heavily its probability is discounted.
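
Below is a minimal sketch of Good-Turing count re-estimation under its standard formulation, where an event seen r times receives the adjusted count r* = (r + 1) * N[r+1] / N[r] and the mass N[1] / total is reserved for unseen events. The toy data and the fallback behavior when N[r+1] = 0 are my own assumptions, not from the article:

```python
from collections import Counter

def good_turing(counts):
    """Good-Turing re-estimation: an event seen r times is given the
    adjusted count r* = (r + 1) * N[r + 1] / N[r], where N[r] is the
    number of distinct events seen exactly r times.  The probability
    mass N[1] / total is set aside for events never seen at all."""
    n = Counter(counts.values())          # N[r]: frequency of frequencies
    total = sum(counts.values())
    adjusted = {}
    for event, r in counts.items():
        if n[r + 1] > 0:
            adjusted[event] = (r + 1) * n[r + 1] / n[r]
        else:
            adjusted[event] = r           # N[r+1] = 0: too sparse to discount (assumption)
    return adjusted, n[1] / total         # adjusted counts, unseen-event mass

# Tiny invented example; real corpora give much denser N[r] statistics.
counts = Counter("the cat sat on the mat the cat".split())
adjusted, p_unseen = good_turing(counts)
print(adjusted)   # words seen once are discounted to 2 * N[2] / N[1] = 2/3
print(p_unseen)   # 3/8 of the probability mass is reserved for unseen words
```

On a real corpus N[r] decreases smoothly as r grows, so observed events are genuinely discounted; on toy data like the above the N[r] statistics are noisy (the count for "cat" is not discounted here), and practical implementations smooth N[r] first.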
