Lucene Scoring Mechanism

Source: Internet
Author: User
Tags: square root, idf

You can use the Searcher.explain(Query query, int doc) method to view the detailed composition of a document's score.

In Lucene, the score is computed as tf * idf * boost * lengthNorm, where:

tf: the square root of the number of times the query term appears in the document.
idf: the inverse document frequency. When every matching document has the same idf (as in the example below), it contributes nothing to the relative ranking.
boost: a weighting factor set through the setBoost method. It can be set on both a Field and a Document, and the two values take effect together.
lengthNorm: determined by the length of the searched field; the longer the field, the lower the score.
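These per-factor formulas match Lucene's classic DefaultSimilarity. A minimal arithmetic sketch (the helper methods below are illustrative, not the actual Similarity API):

```java
public class ScoreSketch {
    // tf and idf as in classic Lucene's DefaultSimilarity:
    // tf = sqrt(termFreq), idf = ln(numDocs / (docFreq + 1)) + 1
    static float tf(int termFreq) {
        return (float) Math.sqrt(termFreq);
    }

    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log((double) numDocs / (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // Values taken from the first explain output in this article:
        // "bc" occurs twice in the doc, 3 of 3 docs contain it, fieldNorm = 0.625
        float score = tf(2) * idf(3, 3) * 1.0f /* boost */ * 0.625f /* lengthNorm */;
        System.out.println(score);  // ~0.6296, matching the explain output
    }
}
```

Note that idf(3, 3) evaluates to 0.71231794, exactly the value printed by explain() in the runs below.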

So the factor we can control programmatically is the boost value.

Another question: why is the maximum score after a query always 1.0?
Because when the highest raw score exceeds 1.0, Lucene uses it as a denominator: every document's score is divided by that maximum to produce the final score.
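That normalization step can be sketched as follows (an assumption based on the behavior described above, not the actual Hits source):

```java
public class ScoreNormalizer {
    // If the top raw score exceeds 1.0, divide every score by it;
    // otherwise leave the scores unchanged.
    static float[] normalize(float[] rawScores) {
        float max = 0f;
        for (float s : rawScores) max = Math.max(max, s);
        float denom = (max > 1.0f) ? max : 1.0f;
        float[] out = new float[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) out[i] = rawScores[i] / denom;
        return out;
    }

    public static void main(String[] args) {
        // Raw fieldWeight scores from the doubly-boosted run later in this article
        float[] scores = normalize(new float[] {1.7807949f, 0.629606f, 0.35615897f});
        for (float s : scores) System.out.println(s);
    }
}
```

Dividing 0.629606 and 0.35615897 by 1.7807949 reproduces the final scores 0.35355335 and 0.19999999 reported at the end of the article.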

The code and running result are described as follows:

Java code:
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class ScoreSortTest {
    public final static String INDEX_STORE_PATH = "index";

    public static void main(String[] args) throws Exception {
        // Build a fresh index with three one-field documents
        IndexWriter writer = new IndexWriter(INDEX_STORE_PATH,
                new StandardAnalyzer(), true);
        writer.setUseCompoundFile(false);
        Document doc1 = new Document();
        Document doc2 = new Document();
        Document doc3 = new Document();
        Field f1 = new Field("bookname", "bc", Field.Store.YES, Field.Index.TOKENIZED);
        Field f2 = new Field("bookname", "AB bc", Field.Store.YES, Field.Index.TOKENIZED);
        Field f3 = new Field("bookname", "AB bc cd", Field.Store.YES, Field.Index.TOKENIZED);
        doc1.add(f1);
        doc2.add(f2);
        doc3.add(f3);
        writer.addDocument(doc1);
        writer.addDocument(doc2);
        writer.addDocument(doc3);
        writer.close();

        // Search for "bc" with a query boost and print each hit's explanation
        IndexSearcher searcher = new IndexSearcher(INDEX_STORE_PATH);
        TermQuery q = new TermQuery(new Term("bookname", "bc"));
        q.setBoost(2f);
        Hits hits = searcher.search(q);
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.print(doc.get("bookname") + "\t\t");
            System.out.println(hits.score(i));
            System.out.println(searcher.explain(q, hits.id(i)));
        }
    }
}


Running result:

bc 0.629606
0.629606 = (MATCH) fieldWeight (bookname: bc in 0), product:
1.4142135 = tf (termFreq (bookname: bc) = 2)
0.71231794 = idf (docFreq = 3, numDocs = 3)
0.625 = fieldNorm (field = bookname, doc = 0)

AB bc 0.4451987
0.4451987 = (MATCH) fieldWeight (bookname: bc in 1), product:
1.0 = tf (termFreq (bookname: bc) = 1)
0.71231794 = idf (docFreq = 3, numDocs = 3)
0.625 = fieldNorm (field = bookname, doc = 1)

AB bc cd 0.35615897
0.35615897 = (MATCH) fieldWeight (bookname: bc in 2), product:
1.0 = tf (termFreq (bookname: bc) = 1)
0.71231794 = idf (docFreq = 3, numDocs = 3)
0.5 = fieldNorm (field = bookname, doc = 2)


We can see from the results:
"bc" appears twice in the first document, so its tf is the square root of 2, i.e. 1.4142135. It appears once in each of the other two documents, so their tf is 1.0.
All three documents have the same idf value, 0.71231794.
By default the boost is 1.0, so fieldNorm is just the lengthNorm value. The first two documents both get 0.625, while the last gets 0.5 because the field is longer.

Now add a boost to the f2 field: f2.setBoost(2.0f);
The running result is:

AB bc 0.8903974
0.8903974 = (MATCH) fieldWeight (bookname: bc in 1), product:
1.0 = tf (termFreq (bookname: bc) = 1)
0.71231794 = idf (docFreq = 3, numDocs = 3)
1.25 = fieldNorm (field = bookname, doc = 1)


The fieldNorm value went from 0.625 to 1.25, i.e. it was multiplied by the boost of 2.0.

Next, also add a boost to the second document: doc2.setBoost(2.0f);
The running result is:

AB bc 1.0
1.7807949 = (MATCH) fieldWeight (bookname: bc in 1), product:
1.0 = tf (termFreq (bookname: bc) = 1)
0.71231794 = idf (docFreq = 3, numDocs = 3)
2.5 = fieldNorm (field = bookname, doc = 1)


fieldNorm was multiplied by 2 again (1.25 to 2.5), which shows that the Document boost and the Field boost are multiplied together.
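The multiplication can be checked directly (a sketch using the values from the explain outputs above, not the actual norm-encoding code):

```java
public class FieldNormCheck {
    public static void main(String[] args) {
        // The stored norm is the lengthNorm times every applicable boost
        float lengthNorm = 0.625f;  // from the unboosted explain output
        float fieldBoost = 2.0f;    // f2.setBoost(2.0f)
        float docBoost = 2.0f;      // doc2.setBoost(2.0f)
        float fieldNorm = lengthNorm * fieldBoost * docBoost;
        System.out.println(fieldNorm);  // 2.5, matching the explain output
    }
}
```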

Because this document's raw score, 1.7807949, exceeds 1.0, the final scores of the other two documents are divided by that value, giving:

bc 0.35355335
AB bc cd 0.19999999
