A description of Solr's similarity algorithms
Solr 4 and earlier versions use the VSM (vector space model) to calculate similarity (the score) by default. Later versions (the default changed in Solr 6) use Okapi BM25, an extension of the binary independence model that belongs to the probabilistic family.
Retrieval models are usually divided into:
- Binary model
- Vector space model (VSM)
- TF-IDF
- Keyword-based search
- Probabilistic models
- Machine learning models
The similarity element
The <similarity> element declares the similarity calculation model and can be customized by the user. For example:

    <similarity class="solr.DFRSimilarityFactory">
      <str name="basicModel">P</str>
      <str name="afterEffect">L</str>
      <str name="normalization">H2</str>
      <float name="c">7</float>
    </similarity>

The element can also be declared inside a specific field type, so that the similarity applies only to fields of that type.
VSM
The score formula for VSM is as follows:
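As a reference point, the practical scoring function documented for Lucene's `TFIDFSimilarity` (the classic TF-IDF implementation) has the shape below. Treating it as the formula meant here is an assumption. The `coord` and `queryNorm` factors belong to older Lucene versions and were dropped in later releases.

$$
\mathrm{score}(q, d) = \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \Big( \mathrm{tf}(t \text{ in } d) \cdot \mathrm{idf}(t)^2 \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(t, d) \Big)
$$

In that documentation, $\mathrm{tf}(t \text{ in } d) = \sqrt{\mathrm{frequency}}$ and $\mathrm{idf}(t) = 1 + \log\big(\mathrm{numDocs} / (\mathrm{docFreq} + 1)\big)$, so a term counts for more when it is frequent in the document and rare in the index.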
Okapi BM25
https://events.static.linuxfound.org/sites/events/files/slides/bm25.pdf
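$$
\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}(t \text{ in } d) \cdot (k + 1)}{\mathrm{tf}(t \text{ in } d) + k \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}
$$

Where:
- t = term; d = document; q = query; i = index
- $\mathrm{tf}(t \text{ in } d)$ = numTermOccurrencesInDocument
- $\mathrm{idf}(t) = 1 + \log\big(\mathrm{numDocs} / (\mathrm{docFreq} + 1)\big)$
- $|d| = \sum_{t \in d} 1$
- $\mathrm{avgdl} = \big(\sum_{d \in i} |d|\big) / \big(\sum_{d \in i} 1\big)$
- k = free parameter, usually ~1.2 to 2.0. Increases the term frequency saturation point.
- b = free parameter, usually ~0.75. Increases the impact of document length normalization.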
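To make the roles of k and b concrete, here is a minimal Python sketch of the BM25 formula exactly as written above. The function name `bm25_score` and the toy `index` are hypothetical, documents are assumed to be pre-tokenized lists of terms, and the idf is the one given in this text, which may differ from what a particular Lucene version computes internally.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc, index, k=1.2, b=0.75):
    """Score one tokenized document against a query with the formula above.

    query_terms: list of query terms
    doc:         list of terms in the document being scored
    index:       list of all documents (each a list of terms)
    """
    num_docs = len(index)
    # avgdl: average document length over the whole index
    avgdl = sum(len(d) for d in index) / num_docs
    term_freqs = Counter(doc)

    score = 0.0
    for t in query_terms:
        freq = term_freqs[t]
        if freq == 0:
            continue  # a term absent from the document contributes nothing
        doc_freq = sum(1 for d in index if t in d)
        idf = 1 + math.log(num_docs / (doc_freq + 1))
        # Length normalization: long documents are penalized in proportion to b
        length_norm = k * (1 - b + b * len(doc) / avgdl)
        score += idf * (freq * (k + 1)) / (freq + length_norm)
    return score


# Hypothetical toy index of three tokenized documents
index = [
    ["solr", "uses", "bm25"],
    ["solr", "solr", "solr", "similarity"],
    ["lucene", "index"],
]

for doc in index:
    print(doc, bm25_score(["solr"], doc, index))
```

The second document repeats the query term three times but does not score three times as high as the first: the per-term contribution is bounded by idf(t) · (k + 1), which is the saturation effect that k controls.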
## Learning to Rank (LTR)
Solr also supports LTR.
This part requires a machine learning background. Without one, the best option is to read the documentation and try it out. I am in that position myself, so I will skip it for now (-_-).
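For orientation only, here is a minimal Python sketch of how a rerank request could be assembled, assuming the `rq={!ltr ...}` parameter described in the Solr LTR guide. The helper `ltr_rerank_params` and the model name `myModel` are hypothetical, and the request only works if a feature store and a model with that name have already been uploaded to Solr.

```python
from urllib.parse import urlencode, parse_qs


def ltr_rerank_params(query, model, rerank_docs=100):
    """Build query parameters that ask Solr to rerank the top hits with an LTR model."""
    return {
        "q": query,
        # Rerank the top `rerank_docs` results using the named (hypothetical) model
        "rq": f"{{!ltr model={model} reRankDocs={rerank_docs}}}",
        "fl": "id,score",
    }


params = ltr_rerank_params("test", "myModel")
query_string = urlencode(params)  # ready to append to a collection's /query URL
print(query_string)
```

The first-pass score still comes from the configured similarity (for example BM25). The LTR model only reorders the top `reRankDocs` results.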
See the following documents for details:
https://lucene.apache.org/solr/guide/6_6/learning-to-rank.html
https://www.microsoft.com/en-us/research/project/mslr/
https://events.static.linuxfound.org/sites/events/files/slides/bm25.pdf
http://opensourceconnections.com/blog/2014/12/08/title-search-when-relevancy-is-only-skin-deep/
https://lucene.apache.org/solr/guide/6_6/relevance.html