The scoring mechanism of Lucene

Elasticsearch is built on Lucene, so its scoring mechanism is also Lucene's. The score expresses the relevance between the search phrase and each document in the index.
If we do not intervene in the scoring algorithm, Lucene computes a relevance score between the query and every document for each search.
Lucene's scoring is generally good at putting the results that best match the user's needs at the front.
Of course, sometimes we may want a fully custom scoring algorithm that has nothing to do with Lucene's; more often, though, we just tune Lucene's algorithm to our own needs.

Lucene's scoring formula

Lucene's scoring is known as the TF/IDF algorithm, essentially a term-frequency algorithm.
When the index is built, every document is split into terms according to the analyzer's dictionary; when searching, the query phrase is segmented into terms the same way.
TF is the number of times a term appears in a document (term frequency); IDF reflects how many documents a term appears in (inverse document frequency).

Put simply, Lucene segments the search phrase into terms and computes a TF/IDF-based score for each term against each document in the index.
The per-term scores are then summed to give the document's score for this search.
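
Written out in full, this is Lucene's classic practical scoring function (as documented for Lucene's TFIDFSimilarity):

score(q,d) = coord(q,d) · queryNorm(q) · Σ_{t in q} [ tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) ]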

This scoring formula is made up of six parts:

    • coord(q,d): a score factor based on how many of the query's terms appear in the document; the more query terms a document contains, the higher its match.
    • queryNorm(q): a normalization factor for the query.
    • tf(t in d): the frequency of term t in document d; the concrete value is the square root of the raw occurrence count.
    • idf(t): the inverse document frequency, derived from docFreq, the number of documents in which term t appears.
    • t.getBoost(): the query-time boost of term t.
    • norm(t,d): a length-dependent weighting factor.

coord(q,d)

The calculation formula for this scoring factor is:

public float coord(int overlap, int maxOverlap) {
    return overlap / (float) maxOverlap;
}
    • overlap: the number of query terms that hit in the document
    • maxOverlap: the total number of terms in the query

For example, search for "中文 book" against a document containing "a chinese book".
The overlap for this document is 1 (only "book" matches), and maxOverlap is 2 (the query has two terms, "中文" and "book").
So the coord value of this document for this search is 1/2 = 0.5.
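
A one-line sanity check, assuming the coord method above is in scope:

float c = coord(1, 2);
System.out.println(c); // 0.5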

queryNorm(q)

This factor has the same value for every document in a query, so it does not affect the sort order; its purpose is to normalize scores so that they are comparable across queries.

public float queryNorm(float sumOfSquaredWeights) {
    return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
}
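
sumOfSquaredWeights is accumulated from the weights of the query's terms. A minimal sketch, not the actual Lucene source: for a single-term query, the normalization value is (idf · boost)², so queryNorm effectively cancels the query-side idf factor. The idf value here is the one computed in the idf section below.

float idf = 1.4054651f;  // idf of "chinese", from the idf example below
float boost = 1.0f;      // default query-time boost
float sumOfSquaredWeights = (idf * boost) * (idf * boost);
float queryNorm = (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
System.out.println(queryNorm); // ~0.7115, identical for every document of this query
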
tf(t in d)

The number of occurrences of term t in document d; the score contribution is the square root of this frequency.

public float tf(float freq) {
    return (float) Math.sqrt(freq);
}

For example, given a document "this was book on chinese book" and the search term "book": freq for this document is 2, so the tf value is the square root of 2, i.e. 1.4142135.
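
The same check in plain Java:

float tf = (float) Math.sqrt(2);
System.out.println(tf); // 1.4142135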

idf(t)

public float idf(long docFreq, long numDocs) {
    return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
}

The two parameters are explained below:

    • docFreq: the number of documents the term appears in, i.e. how many documents match the search term
    • numDocs: the total number of documents in the index

When I examined this with ES, I ran into a problem: numDocs did not match the actual number of documents. It turns out that numDocs here refers to the document count within a single shard, not the document count across all shards.
So when using ES to study this formula, it is best to set the number of shards to 1.

For example, suppose there are three documents:

    • this is about 中文
    • this is about chinese
    • this is about japan

The term I'm searching for is "chinese". For the second document, docFreq is 1, because only one document matches the search, and numDocs is 3. The idf value is therefore calculated as:

(float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0) = ln(3/(1+1)) + 1 = ln(1.5) + 1 = 0.40546510810816 + 1 = 1.40546510810816
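
The same calculation in plain Java:

long docFreq = 1, numDocs = 3;
double idf = Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
System.out.println(idf); // 1.4054651081081644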

t.getBoost()

The query-time boost of term t. This is a direct way to influence the score: for example, if I want matches on "chinese" to weigh more, I can set its boost to 2.
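
A minimal sketch of this using the classic Lucene (4.x) API, where Query.setBoost still existed; in Elasticsearch the same effect is achieved with the boost parameter of the query DSL:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class BoostExample {
    public static BooleanQuery buildQuery() {
        // Weight matches on "chinese" twice as heavily as matches on "book".
        TermQuery chinese = new TermQuery(new Term("content", "chinese"));
        chinese.setBoost(2.0f); // t.getBoost() for this term is now 2
        BooleanQuery query = new BooleanQuery();
        query.add(chinese, Occur.SHOULD);
        query.add(new TermQuery(new Term("content", "book")), Occur.SHOULD);
        return query;
    }
}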

norm(t,d)

This is the length-dependent weighting factor. Its intent is that, among documents that match equally well, shorter ones are ranked ahead.
For example, two documents:

    • Chinese
    • Chinese book

When I search for "Chinese", the first document is ranked ahead of the second, because it is closer to an "exact match".

public float lengthNorm(FieldInvertState state) {
    final int numTerms;
    if (discountOverlaps)
        numTerms = state.getLength() - state.getNumOverlap();
    else
        numTerms = state.getLength();
    return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));
}

The doc.getBoost() here represents the weight of the document and f.getBoost() the weight of the field; if both are set to 1, then norm(t,d) has the same value as lengthNorm.

For example, I now have a document:

    • Chinese book

The search term is "Chinese", so numTerms is 2 and lengthNorm is 1/sqrt(2) = 0.7071068.

Unfortunately, if you use explain to check in ES, you will find that lengthNorm shows as only 0.625.
The official explanation is a precision issue: norms are compressed when stored and decompressed at query time, and this round trip is lossy, i.e. decode(encode(0.7071)) = 0.625.
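
A minimal sketch of that lossy round trip, assuming Lucene 4.x on the classpath, where DefaultSimilarity encodes norms into a single byte via SmallFloat:

import org.apache.lucene.util.SmallFloat;

public class NormEncodingDemo {
    public static void main(String[] args) {
        float lengthNorm = (float) (1.0 / Math.sqrt(2));      // 0.70710677
        byte encoded = SmallFloat.floatToByte315(lengthNorm); // stored as one byte
        float decoded = SmallFloat.byte315ToFloat(encoded);
        System.out.println(decoded); // 0.625 -- the value explain reports
    }
}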

Example

In ES you can use the _explain API to view the scoring explanation.

For example, now my document is:

    • Chinese book

The search query is:

{  "query": {    "match": {      "content": "chinese"    }  }}

The result returned by explain is:

{    "_index ":"Scoretest", "_type ":"Test", "_id ":"2", "Matched ":True, "Explanation ": {"Value ":0.8784157, "Description ":"Weight (Content:chinese in 1) [perfieldsimilarity], result of:", "Details ": [{"Value ":0.8784157, "description": details ": [{" value ": 1," description ": " TF (freq=1.0), with freq of: "," Span class= "hljs-attr" >details ": [{" value ": 1," description":  "termfreq=1.0"}]}, {"value": 1.4054651, "description": value ": 0.625,"  Description ": " Fieldnorm (Doc=1) "}]}"}}       

You can see that this document scored 0.8784157:

    • tf(t in d): 1
    • idf: ln(3/(1+1)) + 1 = 1.4054651
    • norm(t,d): decode(encode(1/sqrt(2))) = 0.625
    • score: 1 × 1.4054651 × 0.625 = 0.8784157
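
As a final check, the pieces multiply back to the reported score. For a single query term with the default boost, queryNorm cancels the query-side idf factor, leaving tf · idf · fieldNorm:

float tf = (float) Math.sqrt(1);          // termFreq = 1
float idf = 1.4054651f;                   // ln(3/2) + 1
float fieldNorm = 0.625f;                 // decode(encode(1/sqrt(2)))
System.out.println(tf * idf * fieldNorm); // 0.8784157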
