Address: https://en.wikipedia.org/wiki/Okapi_BM25
In information retrieval, Okapi BM25 (BM stands for Best Matching) is a ranking function used by search engines to rank matching documents according to their relevance to a given search query. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. Robertson, Karen Spärck Jones, and others. The name of the actual ranking function is BM25.
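As a concrete reference for the function described above, here is a minimal sketch of the classic BM25 scoring formula in Python. It is a toy illustration, not the exact variant shipped by any particular engine; the defaults k1=1.5 and b=0.75 are just common choices, and the "+1" inside the logarithm is the smoothing some engines use to keep IDF non-negative.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avgdl, k1=1.5, b=0.75):
    """Score one document against a query with the classic BM25 formula.

    doc_freqs: mapping term -> number of documents containing the term
    num_docs:  total number of documents in the collection
    avgdl:     average document length (in terms) over the collection
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        n_t = doc_freqs.get(term, 0)
        # IDF with the usual +0.5 smoothing; the extra +1 keeps it non-negative.
        idf = math.log((num_docs - n_t + 0.5) / (n_t + 0.5) + 1)
        f = tf[term]
        # Term-frequency saturation and document-length normalization.
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * doc_len / avgdl))
    return score

# Tiny usage example with a made-up three-document collection.
docs = [["the", "cat", "sat"], ["the", "dog", "barked"], ["cat", "and", "dog"]]
df = Counter(t for d in docs for t in set(d))
avgdl = sum(len(d) for d in docs) / len(docs)
print(bm25_score(["cat"], docs[0], df, len(docs), avgdl))
```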
Optimizing vertical search results involves both controlling which results are returned and tuning how they are ranked, and ranking is the most critical part. This article walks through the evolution of ranking models for vertical search and finally derives the BM25 ranking model. It then shows how to modify Lucene's scoring source code, and the next article will look at the currently popular machine-learned ranking in vertical search.
before (of course, sometimes it is also related to the document creation time).
There are many ways to calculate the relevance between words, but we should start with the simplest, statistics-based method. This method does not need to understand the language itself; it determines a "relevance score" from word usage and matching, weighted by how common specific words are in the documents.
This algorithm does not care whether words are nouns or verbs, nor about the meaning of the words. The only thing it cares about is which words are common and which are rare. If a search query contains both common and rare words, documents that contain the rare words should score higher.
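To make the "common versus rare words" idea concrete, here is a minimal TF-IDF-style scoring sketch in Python. It is my own toy illustration of this weighting, not code from the articles quoted here.

```python
import math
from collections import Counter

def tfidf_scores(query_terms, docs):
    """Score each document: terms frequent in the document count more,
    and terms rare across the whole collection count more."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = sum(tf[t] * math.log(n / (1 + df[t])) for t in query_terms)
        scores.append(score)
    return scores

docs = [["rare", "word", "here"], ["common", "common", "word"], ["common", "word"]]
print(tfidf_scores(["rare", "common"], docs))
```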
Algorithm and principle of English word segmentation
Formulas for calculating document relevance:
TF-IDF: http://lutaf.com/210.htm
BM25: http://lutaf.com/211.htm
Word segmentation quality is extremely important for relevance calculations based on word frequency. In English (and other Western languages) the basic unit of the language is the word, so segmentation is particularly easy and takes only three steps:
Split the text into words on spaces and punctuation (see the tokenization sketch below)
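As a rough illustration of that first step, a minimal Python tokenizer might look like this; the regular expression and the lowercasing are my own choices, not prescribed by the article:

```python
import re

def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^0-9a-zA-Z]+", text.lower()) if t]

print(tokenize("Okapi BM25: a ranking function, used by search-engines."))
# ['okapi', 'bm25', 'a', 'ranking', 'function', 'used', 'by', 'search', 'engines']
```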
The full name of the BM25 algorithm is Okapi BM25. It is an extension of the binary independence model and can be used to rank search results by relevance. BM25 is the default relevance algorithm in Sphinx, and from Lucene 4.0 onward you can also choose BM25 (the default there is TF-IDF). If you are using S
an exact match for the query phrase (that is, the document directly contains the phrase), the phrase score of the document reaches its maximum possible value, which equals the number of words in the query.
The statistical score is based on the classic BM25 function, which considers only word frequencies. If a word is rare across the entire database (that is, a low-frequency word in the document set) or is mentioned frequently in a specific document (that is, a high-frequency word within that document), the document receives a higher score.
easily extended toward semantic relevance. For example, adding more semantic features, such as a BM25 feature computed over PLSA topics and a word2vec similarity feature (or extended relevance signals, such as expanding a word with the abstracts of Baidu search results), increases the contribution of semantic features.
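As one hedged illustration of such a word2vec similarity feature, the sketch below uses gensim with a pre-trained vector file; the file path is purely a placeholder and the aggregation (best match per query term, then averaged) is my own choice:

```python
from gensim.models import KeyedVectors

# Placeholder path; any word2vec-format vector file would do.
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

def query_doc_similarity(query_terms, doc_terms):
    """Average, over query terms, of the best cosine similarity to any document term."""
    sims = []
    for q in query_terms:
        if q not in vectors:
            continue
        best = max((vectors.similarity(q, d) for d in doc_terms if d in vectors), default=0.0)
        sims.append(best)
    return sum(sims) / len(sims) if sims else 0.0
```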
Relevance is also the cornerstone of all search problems, but it is used in different ways in different systems. In general search
Xapian Study Notes 3: sorting by relevance
In Xapian, matching documents are sorted in descending order of relevance. When two documents have the same relevance, they are sorted in ascending order of document ID. You can call enquire.set_docid_order(Enquire::DESCENDING) to switch that tie-break to descending order, or enquire.set_docid_order(Enquire::DONT_CARE) if you do not care about document ID order; of course, this sorting can also be done by other rules, or by co
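A minimal sketch of this with the Xapian Python bindings, assuming the xapian module is installed; the database path and the query term are placeholders:

```python
import xapian

db = xapian.Database("path/to/db")            # placeholder database path
enquire = xapian.Enquire(db)
enquire.set_query(xapian.Query("bm25"))

# Tie-break equal-relevance documents by descending docid instead of ascending,
# or tell Xapian we do not care about docid order at all.
enquire.set_docid_order(xapian.Enquire.DESCENDING)
# enquire.set_docid_order(xapian.Enquire.DONT_CARE)

matches = enquire.get_mset(0, 10)             # top 10 results
for m in matches:
    print(m.rank, m.percent, m.docid)
```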
Similarity: specifies the document scoring model. There are two options:
default: the TF/IDF algorithm used by Elasticsearch and Lucene by default;
BM25: the Okapi BM25 algorithm;
These are the commonly used ones; anything not covered here can be found in the official documentation (a configuration sketch follows below).
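For illustration, a hedged sketch of switching a field to BM25 with the Python Elasticsearch client; the address, index name, and field name are placeholders, and note that in recent Elasticsearch versions BM25 is already the default while the old TF/IDF similarity is exposed as "classic":

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")    # placeholder address

# Create an index whose "body" field is scored with Okapi BM25.
es.indices.create(
    index="articles",                           # placeholder index name
    body={
        "mappings": {
            "properties": {
                "body": {"type": "text", "similarity": "BM25"}
            }
        }
    },
)
```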
IV. Data types for fields
The previous article introduced some simple data types, known in the official documentation as the c
A description of the Solr similarity algorithm
Solr 4 and earlier versions use the VSM (vector space model) to calculate similarity (the score) by default. Later versions default to Okapi BM25, an extension of the binary independence model, which belongs to the probabilistic models. Retrieval models are usually divided into:
Binary model
Vector space Model (VSM)
TF-IDF
Keyword-based search
Probabilistic models
Okapi
function called BM25, which produces values between 0 and 1 based on the frequency of the keyword in the document (higher frequency gives a higher weight) and its frequency in the entire index (lower frequency gives a higher weight).
However, there may be times when you need to change the weighting method, or skip weight calculation entirely to improve performance and sort the result set by other means. This can be achieved by setting
) importance. Relevance refers to whether a returned result is related to the input query, which is one of the basic problems of a search engine; the current algorithms include BM25 and the vector space model. Elasticsearch supports both, and commercial search engines generally use the BM25 algorithm. The BM25 algorithm calculates the relevance of each
unindexed keyword
The ICU tokenizer is removed; it is unclear whether it will be supported in the future ...
The compress=, uncompress=, and languageid= options are removed, and there are no alternative features available
SELECT statement
The query syntax on the right-hand side of the MATCH operator is more explicit, eliminating ambiguity
The docid alias is no longer supported; use rowid instead
The left-hand side of the MATCH operator must be the table name; column names are no longer supported (see the sketch below)
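A minimal sketch of these FTS5 conventions using Python's built-in sqlite3 module, assuming the SQLite build has FTS5 compiled in; the table and column names are placeholders:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 virtual table: the whole table is the MATCH target, not individual columns.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany(
    "INSERT INTO docs (title, body) VALUES (?, ?)",
    [("bm25 intro", "okapi bm25 is a ranking function"),
     ("unrelated", "nothing to see here")],
)

# Left side of MATCH is the table name; rowid (not docid) identifies rows.
# FTS5's built-in bm25() function returns lower values for better matches,
# so ordering by it ascending puts the most relevant rows first.
rows = conn.execute(
    "SELECT rowid, title, bm25(docs) AS score FROM docs WHERE docs MATCH ? ORDER BY score",
    ("bm25",),
).fetchall()
print(rows)
```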
time has been exceeded, the local search query will be stopped. Note that if a search queries multiple local indexes, the limit applies to each of those indexes independently.
function SetMatchMode ($mode)
Sets the matching mode for full-text queries; see the description of matching modes in section 4.1. The parameter must be a constant corresponding to a known mode.
Warning: (PHP only) the matching mode constants must not be enclosed in quotation marks, which would pass a string instead of a constant
Address: http://terrier.org/docs/v3.5/dfr_description.html
The divergence from randomness (DFR) paradigm is a generalisation of one of the very first models of information retrieval, Harter's 2-Poisson indexing model [1]. The 2-Poisson model is based on the hypothesis that the level of treatment of the informative words is witnessed by an elite set of documents, in which these words occur to a relatively greater extent than in the rest of the documents. On the other hand, there are words, whic
to be predicted.
LTR methods generally fall into three types: the single-document approach (pointwise), the document-pair approach (pairwise), and the document-list approach (listwise).
1. Pointwise
In the pointwise approach the unit of processing is a single document; after converting each document into a feature vector, the ranking problem is turned into an ordinary classification or regression problem in machine learning. Taking multi-class classification as an example: Table 2-1 is a manually a
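As a hedged sketch of the pointwise idea (not the example from Table 2-1, which is truncated here): treat each (query, document) feature vector with a relevance label as one training sample, fit an ordinary regressor, then rank documents by the predicted score. The feature names and numbers below are made up for illustration, and scikit-learn is assumed to be available.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row: features of one (query, document) pair, e.g. [bm25_score, title_match, pagerank].
X_train = np.array([
    [12.3, 1.0, 0.8],
    [ 4.1, 0.0, 0.5],
    [ 9.7, 1.0, 0.2],
    [ 1.2, 0.0, 0.1],
])
# Relevance labels (e.g. 0 = irrelevant ... 2 = highly relevant).
y_train = np.array([2, 0, 1, 0])

model = LinearRegression().fit(X_train, y_train)

# Rank candidate documents for a new query by predicted relevance, highest first.
candidates = np.array([[8.5, 1.0, 0.4], [2.0, 0.0, 0.9], [11.0, 0.0, 0.3]])
order = np.argsort(-model.predict(candidates))
print(order)
```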
, "threads, threads_mintue ".
Note: multiple indexes are separated with a comma "," and must match the index names in the Sphinx configuration file.
4. Set the full-text index name
Enter the full-text primary index name and full-text incremental index name in the Sphinx configuration, for example, "posts, posts_mintue ".
5. Set the maximum search time
Enter the maximum search time, in milliseconds. The parameter must be a non-negative integer. The default value is 0, which means no limit.
and BM25 score, and combines the two.
* SPH_RANK_BM25: statistical relevance mode, using only the BM25 score (as most full-text search engines do). This mode is faster, but it may reduce result quality for queries that contain multiple words.
* SPH_RANK_NONE: disables ranking; this is the fastest mode and is effectively equivalent to a Boolean search. All
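For illustration, a sketch of setting these options with the classic sphinxapi Python client, assuming sphinxapi.py is available and searchd runs on its default port; the index names follow the example above:

```python
import sphinxapi

client = sphinxapi.SphinxClient()
client.SetServer("localhost", 9312)             # default searchd port
client.SetMaxQueryTime(3000)                    # stop local searches after 3000 ms (0 = no limit)
client.SetMatchMode(sphinxapi.SPH_MATCH_EXTENDED2)
client.SetRankingMode(sphinxapi.SPH_RANK_BM25)  # pure BM25 scoring; SPH_RANK_NONE disables scoring

# Query the main and delta full-text indexes together.
result = client.Query("okapi bm25", "posts, posts_mintue")
if result:
    for match in result["matches"]:
        print(match["id"], match["weight"])
else:
    print(client.GetLastError())
```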