[IR Course notes] Statistical language model


Basic idea

1. Each document has a single subject (topic).

2. A topic is characterized by how words are distributed in documents on that topic.

3. A term that appears often in a document on a topic is also frequent in that topic.

4. A term that appears rarely in a document on a topic is also infrequent in that topic.

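5. Thus, the probability of a word w under the language model MD of document D can be approximated by its relative frequency in D:

P(w | MD) = tf(w, D) / |D|

where tf(w, D) is the number of times w occurs in D and |D| is the total number of terms in D.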
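As a minimal sketch of this estimate, the snippet below builds a unigram model from term counts, assuming the relative-frequency form used in the worked example of these notes (occurrences of a word divided by document length). The function name `unigram_model` and the toy document are hypothetical, chosen only for illustration.

```python
from collections import Counter
from fractions import Fraction

def unigram_model(tokens):
    """Relative-frequency estimate: P(w | MD) = tf(w, D) / |D|."""
    counts = Counter(tokens)
    total = len(tokens)
    # Fractions keep the probabilities exact, as in the hand calculation.
    return {w: Fraction(c, total) for w, c in counts.items()}

# Hypothetical toy document, already tokenized.
doc = ["a", "b", "a", "c"]
model = unigram_model(doc)
print(model["a"])  # 1/2
```

Words that never occur in the document are simply absent from the dictionary, which is the zero-frequency problem discussed later in the notes.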

Ranking

Given a query q, how are documents ranked under the statistical language model? There are three ranking methods:

1. Query-likelihood

2. Document-likelihood

3. Divergence of query and document models

Query q = (q1, q2, ..., qk); MD denotes the statistical language model of document D.

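1. Query-likelihood

Documents are ranked by the probability that the document model generates the query, treating the query terms as independent:

P(q | MD) = P(q1 | MD) * P(q2 | MD) * ... * P(qk | MD)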

Example:

Q = "people create"

D1 = "in the long history of the Chinese people's hard work to explore the courage to create the Chinese people love Peace" (D1 is counted as 18 terms, in which "people" occurs twice and "create" once)

P("people" | MD1) = 2/18, P("create" | MD1) = 1/18

P(q | MD1) = P("people" | MD1) * P("create" | MD1) = 2/18 * 1/18 = 1/162
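The calculation above can be sketched in a few lines. The token list below is a hypothetical 18-term tokenization of the English rendering of D1, chosen only so that the counts match the example ("people" twice, "create" once, 18 terms in total); it is not the tokenization used in the original notes. The function name `query_likelihood` is likewise an assumption.

```python
from collections import Counter
from fractions import Fraction

def query_likelihood(query_tokens, doc_tokens):
    """P(q | MD) as a product of unsmoothed relative frequencies."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    score = Fraction(1)
    for term in query_tokens:
        # A term missing from the document contributes 0 to the product.
        score *= Fraction(counts[term], total)
    return score

# Hypothetical tokenization with exactly 18 terms (an assumption).
d1 = ("in the long history of the chinese people hard work explore "
      "courage create the chinese people love peace").split()

score = query_likelihood(["people", "create"], d1)
print(len(d1), score)  # 18 1/162
```

Because the estimate is unsmoothed, a single query term that is absent from the document drives the whole score to 0.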

2. Document-likelihood

Problems:

A. Document lengths vary greatly, so scores are hard to compare.

B. Many of the words in a document do not appear in the query, which causes a zero-frequency problem.

C. Meaningless spam pages can appear in the results.

Ways to solve these problems:

3. Divergence of query and document models

In the divergence formula, w ranges over the words that appear in both Q and D. The value can be interpreted as the number of bits required to encode D using Q.
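The exact formula is not preserved in these notes, so the following is only a sketch under an assumption: it uses the standard Kullback-Leibler divergence KL(MQ || MD) in bits, summed over words that have nonzero probability in both models, in line with the remark that w ranges over words shared by Q and D. The function name `kl_divergence` and the toy probability tables are hypothetical.

```python
import math

def kl_divergence(p_query, p_doc):
    """KL(MQ || MD) in bits, restricted to words present in both models."""
    return sum(
        pq * math.log2(pq / p_doc[w])
        for w, pq in p_query.items()
        if pq > 0 and p_doc.get(w, 0) > 0
    )

# Hypothetical query and document models (probabilities, not counts).
p_q = {"people": 0.5, "create": 0.5}
p_d1 = {"people": 0.5, "create": 0.25, "peace": 0.25}
p_d2 = {"people": 0.25, "create": 0.25, "peace": 0.5}

kl_d1 = kl_divergence(p_q, p_d1)
kl_d2 = kl_divergence(p_q, p_d2)
print(kl_d1, kl_d2)  # 0.5 1.0
```

A smaller divergence means the document model is closer to the query model, so under this sketch D1 would rank above D2.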

Zero-frequency problem

Workarounds:

1. Laplace smoothing: Add 1 to the frequency of every term.

2. Lidstone correction: Add a very small value ε to the frequency of every term.

3. Absolute discounting: Subtract a small value ε from the frequency of every term whose frequency is not 0, then distribute the subtracted mass evenly among the terms whose frequency is 0.
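The three workarounds can be sketched as follows. This is an illustration under stated assumptions, not the course's reference implementation: the vocabulary is assumed to be known in advance and to contain every document term, 0 < ε < 1, and absolute discounting is taken in its usual form, where the mass removed from seen terms is shared equally by unseen terms (frequency 0). All function names are hypothetical.

```python
from collections import Counter

def lidstone(tokens, vocab, eps):
    """Add eps to every term frequency, then renormalize."""
    counts = Counter(tokens)
    denom = len(tokens) + eps * len(vocab)
    return {w: (counts[w] + eps) / denom for w in vocab}

def laplace(tokens, vocab):
    """Laplace smoothing is the Lidstone correction with eps = 1."""
    return lidstone(tokens, vocab, 1.0)

def absolute_discounting(tokens, vocab, eps):
    """Take eps from each seen term and share it among unseen terms."""
    counts = Counter(tokens)
    total = len(tokens)
    seen = [w for w in vocab if counts[w] > 0]
    unseen = [w for w in vocab if counts[w] == 0]
    if not unseen:
        # Nothing to redistribute to, so keep the unsmoothed estimate.
        return {w: counts[w] / total for w in vocab}
    bonus = eps * len(seen) / len(unseen)
    return {
        w: ((counts[w] - eps) if counts[w] > 0 else bonus) / total
        for w in vocab
    }

# Hypothetical toy data: "c" and "d" never occur in the document.
tokens = ["a", "a", "b"]
vocab = ["a", "b", "c", "d"]

p_laplace = laplace(tokens, vocab)
p_lidstone = lidstone(tokens, vocab, 0.01)
p_discount = absolute_discounting(tokens, vocab, 0.5)
print(p_laplace["c"], p_discount["c"])
```

In all three cases the smoothed probabilities still sum to 1 over the vocabulary, and no term is left with probability 0.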
