Notation:
R: the set of relevant documents
NR: the set of non-relevant documents
Q: the user query
D_j: document j
The 1/0 loss case
PRP (probability ranking principle): use a probabilistic model to estimate the probability that each document is relevant to the information need, then rank the results by that probability.
Under Bayes-optimal decision making, which minimizes the expected loss, a document D is returned only if it is more likely to be relevant than non-relevant:
P(R | D) > P(NR | D)
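The probability ranking principle with retrieval costs
A document D is returned when the expected cost of returning it is lower than the expected cost of withholding it:
C_rr P(R | D) + C_rn P(NR | D) < C_nr P(R | D) + C_nn P(NR | D)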
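The cost-based rule above can be sketched as a small decision function. This is only an illustration: it reads C_xy as the cost of deciding x (r: return, n: do not return) when the document is actually y (r: relevant, n: non-relevant), which is an assumption about the subscripts, and the function name `should_return` and the default costs are hypothetical. With the default 1/0 costs the rule reduces to P(R | D) > P(NR | D).

```python
def should_return(p_rel, c_rr=0.0, c_rn=1.0, c_nr=1.0, c_nn=0.0):
    """Return True when returning the document has the lower expected cost.

    p_rel is P(R | D); P(NR | D) is taken to be 1 - p_rel.
    Cost names follow the assumed reading C_<decision><truth>.
    """
    p_nonrel = 1.0 - p_rel
    cost_return = c_rr * p_rel + c_rn * p_nonrel      # left-hand side
    cost_withhold = c_nr * p_rel + c_nn * p_nonrel    # right-hand side
    return cost_return < cost_withhold


# Default costs are the 1/0 loss case.
print(should_return(0.7))             # likely relevant
print(should_return(0.3))             # likely non-relevant
print(should_return(0.3, c_nr=5.0))   # missing a relevant document is costly
```

Raising `c_nr` makes the rule return documents even when their probability of relevance is below 0.5, because missing a relevant document becomes more expensive than showing a non-relevant one.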
How to calculate probabilities
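Document D can be represented as a binary vector (d_1, d_2, ..., d_n), where d_i = 1 if term i appears in D and d_i = 0 otherwise.
p_i = P(d_i = 1 | R), 1 - p_i = P(d_i = 0 | R)
q_i = P(d_i = 1 | NR), 1 - q_i = P(d_i = 0 | NR)
Assuming the terms are independent given relevance, taking the logarithm of the likelihood ratio P(D | R) / P(D | NR) turns the product over terms into a sum over terms.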
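A sketch of the standard derivation, using the p_i and q_i defined above and assuming that terms are conditionally independent given relevance (the usual binary independence assumption):

```latex
\begin{align*}
P(D \mid R) &= \prod_{i=1}^{n} p_i^{d_i} (1 - p_i)^{1 - d_i}, \qquad
P(D \mid NR) = \prod_{i=1}^{n} q_i^{d_i} (1 - q_i)^{1 - d_i} \\
\log \frac{P(D \mid R)}{P(D \mid NR)}
  &= \sum_{i=1}^{n} d_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}
   + \sum_{i=1}^{n} \log \frac{1 - p_i}{1 - q_i}
\end{align*}
```

The second sum does not depend on D, so it can be dropped when ranking. What remains is a sum of per-term weights log [p_i (1 - q_i) / (q_i (1 - p_i))] over the terms that appear in the document.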
How to get the initial estimates for R and NR
p_i = c, where c is a constant, usually 0.5
q_i = n_i / N, where n_i is the number of documents in which term d_i appears and N is the total number of documents in the collection.
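The initial estimates can be turned into per-term weights as in the minimal sketch below. It assumes the log-odds weight log [p_i (1 - q_i) / (q_i (1 - p_i))] from the binary independence model; the function name, the dictionary layout, and the sample counts are hypothetical. Terms that appear in no document or in every document are skipped, since q_i would be 0 or 1 and the weight is undefined without smoothing.

```python
import math


def initial_term_weights(doc_freq, num_docs, c=0.5):
    """Compute log-odds term weights from p_i = c and q_i = n_i / N.

    doc_freq maps each term to n_i, the number of documents containing it.
    """
    weights = {}
    for term, n_i in doc_freq.items():
        if n_i <= 0 or n_i >= num_docs:
            continue  # q_i would be 0 or 1; weight undefined here
        p_i = c
        q_i = n_i / num_docs
        weights[term] = math.log(p_i * (1 - q_i) / (q_i * (1 - p_i)))
    return weights


# Hypothetical collection of 100 documents.
weights = initial_term_weights({"retrieval": 10, "the": 90, "a": 100}, 100)
for term, w in weights.items():
    print(term, round(w, 3))
```

With c = 0.5 the weight simplifies to log [(N - n_i) / n_i], so rare terms get large positive weights and very common terms get negative ones, much like an inverse document frequency.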
Improving the estimates:
For a query Q, the initial estimates for R and NR produce a ranking, from which the top K results are taken. These K results are then added to the set R. At this point, the probabilities are calculated as:
p_i = P(d_i = 1 | R) = s_i / t
q_i = P(d_i = 1 | NR) = (n_i - s_i) / (N - t)
where t is the number of documents in R and s_i is the number of those t documents that contain term d_i.
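Smoothing
Adding 0.5 to each count keeps the estimates away from 0 and 1, where the log weight is undefined:
p_i = (s_i + 0.5) / (t + 1)
q_i = (n_i - s_i + 0.5) / (N - t + 1)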
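The smoothed feedback estimates can be computed directly from the four counts. The sketch below is illustrative only; the function name and the sample counts are hypothetical.

```python
def smoothed_estimates(s_i, n_i, t, num_docs):
    """Return (p_i, q_i) using the 0.5-smoothed feedback formulas.

    s_i: documents in R that contain the term
    n_i: documents in the whole collection that contain the term
    t: number of documents in R
    num_docs: total number of documents N
    """
    p_i = (s_i + 0.5) / (t + 1)
    q_i = (n_i - s_i + 0.5) / (num_docs - t + 1)
    return p_i, q_i


# Hypothetical counts: 5 feedback documents out of 100, term in 3 of them.
print(smoothed_estimates(s_i=3, n_i=10, t=5, num_docs=100))
# A term absent from every feedback document still gets p_i > 0.
print(smoothed_estimates(s_i=0, n_i=10, t=5, num_docs=100))
```

Without smoothing, the second call would give p_i = 0 and the log weight for that term would be undefined.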
Weighting
Replace d_i with w_i · d_i, where d_i is 1 if the term appears in the document and 0 if it does not, and w_i is the weight of that term.
BM25 is one such weighting method.
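As an illustration of one such weighting, a minimal sketch of the commonly cited Okapi BM25 form follows. The parameter defaults k1 = 1.2 and b = 0.75, the IDF variant log [(N - n_i + 0.5) / (n_i + 0.5)], and all function and argument names are assumptions chosen for this example, not details taken from the note.

```python
import math


def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len,
               doc_freq, num_docs, k1=1.2, b=0.75):
    """Score one document against a query with a common BM25 variant.

    doc_tf maps terms to their frequency in the document;
    doc_freq maps terms to n_i, the number of documents containing them.
    """
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        n_i = doc_freq[term]
        idf = math.log((num_docs - n_i + 0.5) / (n_i + 0.5))
        # Length normalization: longer-than-average documents are penalized.
        norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1) / norm
    return score


# Hypothetical document of average length containing "retrieval" once.
print(bm25_score(["retrieval"], {"retrieval": 1}, 50, 50,
                 {"retrieval": 10}, 100))
```

The IDF factor here has the same shape as the smoothed weight above with no feedback documents, while the term-frequency factor replaces the binary d_i with a saturating function of how often the term occurs.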
[IR Course note] Probabilistic retrieval model