Basic idea
1. Each document is assumed to be generated by a single topic.
2. A topic is characterized by how words are distributed in documents about that topic.
3. A term that appears frequently in a document about a topic also has high probability under that topic's model.
4. A term that appears rarely in a document about a topic also has low probability under that topic's model.
5. Thus the word probabilities can be approximated by maximum likelihood: P(w | M_D) = tf(w, D) / |D|, where tf(w, D) is the number of times w occurs in D and |D| is the document length.
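As a small sketch (not part of the notes themselves), the maximum-likelihood estimate above can be computed directly from term counts:

```python
from collections import Counter

def mle_unigram(doc_tokens):
    """Maximum-likelihood unigram model: P(w | M_D) = tf(w, D) / |D|."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {w: c / total for w, c in counts.items()}

# Toy 6-token document: "people" occurs twice, so P("people" | M_D) = 2/6.
model = mle_unigram("the people love the people peace".split())
```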
Ranking
When a query q is given, how do we rank documents under a statistical language model? There are three ranking methods:
1. Query likelihood
2. Document likelihood
3. Divergence of the query and document models
Let the query be q = (q1, q2, ..., qk), and let M_D denote the language model estimated from document D.
1. Query likelihood: rank documents by P(q | M_D) = P(q1 | M_D) × ... × P(qk | M_D), assuming the query terms are generated independently from the document model.
Example:
Q = "people create"
D1 = "In the long history, the Chinese people have worked hard, explored, and had the courage to create; the Chinese people love peace" (18 terms under the original Chinese tokenization, with "people" occurring twice and "create" once)
P("people" | M_D1) = 2/18, P("create" | M_D1) = 1/18
P(q | M_D1) = P("people" | M_D1) × P("create" | M_D1) = 2/18 × 1/18
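A minimal sketch of this scoring; the 18-token document is stood in for by a hypothetical token list with the same counts:

```python
from collections import Counter

def query_likelihood(query_tokens, doc_tokens):
    """Score a document by P(q | M_D) = prod_i P(q_i | M_D) under the MLE model."""
    counts = Counter(doc_tokens)
    n = len(doc_tokens)
    score = 1.0
    for q in query_tokens:
        score *= counts[q] / n  # zero if q never occurs in D
    return score

# Hypothetical 18-token stand-in for D1: "people" twice, "create" once.
d1 = ["people"] * 2 + ["create"] + ["filler"] * 15
score = query_likelihood(["people", "create"], d1)
# score == (2/18) * (1/18)
```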
2. Document likelihood: rank documents by P(D | M_Q), the probability of the document under a model estimated from the query.
Problems:
A. Document lengths vary greatly, so the scores are hard to compare.
B. Many words in the document do not appear in the query, so a zero-frequency problem arises.
C. Meaningless spam ("cheat") pages can score highly.
Ways to solve these problems: the zero-frequency issue, in particular, can be addressed with the smoothing methods described in the zero-frequency section below.
3. Divergence of the query and document models: rank by KL(M_Q || M_D) = Σ_w P(w | M_Q) log( P(w | M_Q) / P(w | M_D) ), where a smaller divergence means a better match.
In the formula above, w ranges over the words that appear in both Q and D; intuitively, the divergence is the number of extra bits required to encode D using Q's model.
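One way to sketch this divergence ranking in Python; the Lidstone smoothing and the ε value here are assumptions added to avoid log(0), not part of the notes:

```python
import math
from collections import Counter

def unigram(tokens, vocab, eps=0.1):
    """Lidstone-smoothed unigram model over a shared vocabulary (assumed setup)."""
    counts = Counter(tokens)
    total = len(tokens) + eps * len(vocab)
    return {w: (counts[w] + eps) / total for w in vocab}

def kl_divergence(p, q):
    """KL(P || Q) = sum_w P(w) log2(P(w) / Q(w)): extra bits to encode
    samples from P using a code optimized for Q."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p)

q_tokens = "people create".split()
d_tokens = "the people work hard to create and the people love peace".split()
vocab = set(q_tokens) | set(d_tokens)
m_q, m_d = unigram(q_tokens, vocab), unigram(d_tokens, vocab)
div = kl_divergence(m_q, m_d)  # smaller divergence -> better match
```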
Zero-frequency problem
Workarounds:
1. Laplace smoothing: add 1 to every term's count.
2. Lidstone correction: add a very small value ε to every term's count.
3. Absolute discounting: subtract a small value δ from every nonzero count, then distribute the freed probability mass evenly among the words with zero count.
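The three smoothing schemes can be sketched as follows; δ = 0.1 and the toy vocabulary (with "create" as the unseen word) are illustrative assumptions:

```python
from collections import Counter

def laplace(counts, vocab):
    """Laplace smoothing: add 1 to every term's count."""
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def lidstone(counts, vocab, eps=0.01):
    """Lidstone correction: add a small epsilon instead of 1."""
    total = sum(counts.values()) + eps * len(vocab)
    return {w: (counts[w] + eps) / total for w in vocab}

def absolute_discount(counts, vocab, delta=0.1):
    """Absolute discounting: subtract delta from each nonzero count and
    spread the freed mass evenly over the zero-count (unseen) words."""
    seen = [w for w in vocab if counts[w] > 0]
    unseen = [w for w in vocab if counts[w] == 0]
    total = sum(counts.values())
    freed = delta * len(seen)
    probs = {w: (counts[w] - delta) / total for w in seen}
    for w in unseen:
        probs[w] = freed / (len(unseen) * total)
    return probs

counts = Counter("the people love the people peace".split())
vocab = set(counts) | {"create"}   # "create" is unseen in the document
p = absolute_discount(counts, vocab)
# probabilities still sum to 1, and "create" now has nonzero mass
```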
[IR Course notes] Statistical language model