Address: https://en.wikipedia.org/wiki/Okapi_BM25
In information retrieval, Okapi BM25 (BM is an abbreviation of best matching) is a ranking function used by search engines to rank matching documents according to their relevance to a given search query. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. Robertson, Karen Spärck Jones, and others.
The name of the actual ranking function is BM25. To set the right context, however, it is usually referred to as "Okapi BM25", since the Okapi information retrieval system, implemented at London's City University in the 1980s and 1990s, was the first system to implement this function.
BM25, and its newer variants, e.g. BM25F (a version of BM25 that can take document structure and anchor text into account), represent state-of-the-art TF-IDF-like retrieval functions used in document retrieval, such as web search.
The ranking function
BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of the inter-relationship between the query terms within a document (e.g., their relative proximity). It is not a single function, but actually a whole family of scoring functions, with slightly different components and parameters. One of the most prominent instantiations of the function is as follows.
Given a query Q, containing keywords q_1, \ldots, q_n, the BM25 score of a document D is:

\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}

where f(q_i, D) is q_i's term frequency in the document D, |D| is the length of the document D in words, and avgdl is the average document length in the text collection from which documents are drawn. k_1 and b are free parameters, usually chosen, in absence of an advanced optimization, as k_1 \in [1.2, 2.0] and b = 0.75.[1] \text{IDF}(q_i) is the IDF (inverse document frequency) weight of the query term q_i. It is usually computed as:

\text{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}

where N is the total number of documents in the collection, and n(q_i) is the number of documents containing q_i.
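As a concrete sketch, the scoring function above can be written in Python. This is an illustrative toy implementation (function and variable names are my own, not from the Okapi system), using the classic, possibly negative IDF and representing each document as a list of tokens:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with the BM25 formula."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for q in query_terms:
        n_q = sum(1 for d in corpus if q in d)           # n(q): docs containing q
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5))    # classic (possibly negative) IDF
        f = doc_terms.count(q)                           # f(q, D): term frequency
        denom = f + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score
```

For example, with a three-document corpus, a document containing a rare query term scores higher than one that does not contain it at all.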
There are several interpretations for IDF and slight variations on its formula. In the original BM25 derivation, the IDF component is derived from the Binary Independence Model.
Note that the above formula for IDF has a potentially major drawback when used for terms appearing in more than half of the corpus documents: these terms' IDF is negative, so for two almost-identical documents, one containing the term and one not, the latter may receive a larger score. In other words, terms appearing in more than half of the corpus provide negative contributions to the final document score. This is often an undesirable behavior, so many real-world applications deal with this IDF formula in a different way:
- Each summand can be given a floor of 0, to trim out common terms;
- The IDF function can be given a floor of a constant, to avoid common terms being ignored entirely;
- The IDF function can be replaced with a similarly shaped one that is non-negative, or strictly positive, to avoid terms being ignored entirely.
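The workarounds above can be sketched as alternative IDF functions. This is a minimal illustration (function names are my own); the strictly positive variant shown follows the commonly used shape log(1 + (N - n + 0.5)/(n + 0.5)):

```python
import math

def idf_classic(N, n_q):
    """Classic BM25 IDF; negative when n_q > N / 2."""
    return math.log((N - n_q + 0.5) / (n_q + 0.5))

def idf_floored(N, n_q, floor=0.0):
    """Floor the IDF at a constant (a floor of 0 trims out common terms)."""
    return max(floor, idf_classic(N, n_q))

def idf_positive(N, n_q):
    """A similarly shaped, strictly positive variant, so no term is ignored."""
    return math.log(1 + (N - n_q + 0.5) / (n_q + 0.5))
```

For a term occurring in 80 of 100 documents, the classic IDF is negative, the floored variant returns 0, and the strictly positive variant still yields a small positive weight.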
IDF information theoretic interpretation
Here is an interpretation from information theory. Suppose a query term q appears in n(q) documents. Then a randomly picked document D will contain the term with probability \frac{n(q)}{N} (where N is again the cardinality of the set of documents in the collection). Therefore, the information content of the message "D contains q" is:

-\log \frac{n(q)}{N} = \log \frac{N}{n(q)}
Now suppose we have two query terms q_1 and q_2. If the two terms occur in documents entirely independently of each other, then the probability of seeing both q_1 and q_2 in a randomly picked document D is:

\frac{n(q_1)}{N} \cdot \frac{n(q_2)}{N}
and the information content of such an event is:

\log \frac{N}{n(q_1)} + \log \frac{N}{n(q_2)} = \sum_{i=1}^{2} \log \frac{N}{n(q_i)}
With a small variation, this is exactly what is expressed by the IDF component of BM25.
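The additivity of information content under independence can be checked numerically; this is a minimal sketch with made-up document counts:

```python
import math

N = 1000             # documents in the collection
n_q1, n_q2 = 50, 20  # documents containing q1 and q2, respectively

# Information content of "D contains q": -log P = log(N / n(q))
info_q1 = math.log(N / n_q1)
info_q2 = math.log(N / n_q2)

# Under independence, P(both) = (n_q1 / N) * (n_q2 / N), so the information
# content of seeing both terms is the sum of the individual contents.
info_both = -math.log((n_q1 / N) * (n_q2 / N))
```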
Modifications
- At the extreme values of the coefficient b, BM25 turns into the ranking functions known as BM11 (for b = 1) and BM15 (for b = 0).[2]
- BM25F[3] is a modification of BM25 in which the document is considered to be composed of several fields (such as headlines, main text, anchor text) with possibly different degrees of importance.
- BM25+[4] is an extension of BM25. BM25+ was developed to address one deficiency of the standard BM25, in which the component of term frequency normalization by document length is not properly lower-bounded; as a result of this deficiency, long documents which do match the query term can often be scored unfairly by BM25 as having a similar relevancy to shorter documents that do not contain the query term at all. The scoring formula of BM25+ has only one additional free parameter δ (with a default value of δ = 1.0 in the absence of training data) as compared with BM25:

\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \left[ \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)} + \delta \right]
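The effect of the δ shift can be illustrated with a per-term scoring sketch (my own illustrative function, not a reference implementation; the sum is taken only over terms actually present in the document):

```python
import math

def bm25_plus_term(f, idf, doc_len, avgdl, k1=1.5, b=0.75, delta=1.0):
    """Per-term BM25+ contribution: delta lower-bounds the length-normalized
    term-frequency component for terms that do occur in the document."""
    if f == 0:
        return 0.0  # only terms occurring in the document contribute
    tf_norm = f * (k1 + 1) / (f + k1 * (1 - b + b * doc_len / avgdl))
    return idf * (tf_norm + delta)
```

With δ = 1.0, a very long document that matches the term still contributes at least idf · δ, whereas with δ = 0 (plain BM25) its contribution shrinks toward zero as document length grows.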