Lucene Scoring Mechanism

Source: Internet
Author: User
Tags: square root, idf

You can use the Searcher.explain(Query query, int doc) method to view the detailed composition of a document's score.

In Lucene, the score is computed as tf * idf * boost * lengthNorm, where:

tf: the square root of the number of times the query term appears in the document.
idf: the inverse document frequency. When every matching document has the same idf (as in the example below), it contributes nothing to the relative ranking.
boost: a weighting factor set through the setBoost method. It can be set on both a Field and a Document, and the two values take effect together.
lengthNorm: determined by the length of the searched field; the longer the field, the lower the score.
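These per-factor formulas match Lucene's classic DefaultSimilarity. A minimal arithmetic sketch (the helper methods below are illustrative, not the actual Similarity API):

```java
public class ScoreSketch {
    // tf and idf as in classic Lucene's DefaultSimilarity:
    // tf = sqrt(termFreq), idf = ln(numDocs / (docFreq + 1)) + 1
    static float tf(int termFreq) {
        return (float) Math.sqrt(termFreq);
    }

    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log((double) numDocs / (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // Values taken from the first explain output in this article:
        // "bc" occurs twice in the doc, 3 of 3 docs contain it, fieldNorm = 0.625
        float score = tf(2) * idf(3, 3) * 1.0f /* boost */ * 0.625f /* lengthNorm */;
        System.out.println(score);  // ~0.6296, matching the explain output
    }
}
```

Note that idf(3, 3) evaluates to 0.71231794, exactly the value printed by explain() in the runs below.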

So the factor we can control programmatically is the boost value.

Another question: why is the maximum score after a query always 1.0?
Because when the highest raw score exceeds 1.0, Lucene uses it as a denominator: every document's score is divided by that maximum to produce the final score.
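That normalization step can be sketched as follows (an assumption based on the behavior described above, not the actual Hits source):

```java
public class ScoreNormalizer {
    // If the top raw score exceeds 1.0, divide every score by it;
    // otherwise leave the scores unchanged.
    static float[] normalize(float[] rawScores) {
        float max = 0f;
        for (float s : rawScores) max = Math.max(max, s);
        float denom = (max > 1.0f) ? max : 1.0f;
        float[] out = new float[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) out[i] = rawScores[i] / denom;
        return out;
    }

    public static void main(String[] args) {
        // Raw fieldWeight scores from the doubly-boosted run later in this article
        float[] scores = normalize(new float[] {1.7807949f, 0.629606f, 0.35615897f});
        for (float s : scores) System.out.println(s);
    }
}
```

Dividing 0.629606 and 0.35615897 by 1.7807949 reproduces the final scores 0.35355335 and 0.19999999 reported at the end of the article.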

The code and running result are described as follows:

Java code:
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class ScoreSortTest {
    public final static String INDEX_STORE_PATH = "index";

    public static void main(String[] args) throws Exception {
        // Build a fresh index with three one-field documents
        IndexWriter writer = new IndexWriter(INDEX_STORE_PATH,
                new StandardAnalyzer(), true);
        writer.setUseCompoundFile(false);
        Document doc1 = new Document();
        Document doc2 = new Document();
        Document doc3 = new Document();
        Field f1 = new Field("bookname", "bc", Field.Store.YES, Field.Index.TOKENIZED);
        Field f2 = new Field("bookname", "AB bc", Field.Store.YES, Field.Index.TOKENIZED);
        Field f3 = new Field("bookname", "AB bc cd", Field.Store.YES, Field.Index.TOKENIZED);
        doc1.add(f1);
        doc2.add(f2);
        doc3.add(f3);
        writer.addDocument(doc1);
        writer.addDocument(doc2);
        writer.addDocument(doc3);
        writer.close();

        // Search for "bc" with a query boost and print each hit's explanation
        IndexSearcher searcher = new IndexSearcher(INDEX_STORE_PATH);
        TermQuery q = new TermQuery(new Term("bookname", "bc"));
        q.setBoost(2f);
        Hits hits = searcher.search(q);
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.print(doc.get("bookname") + "\t\t");
            System.out.println(hits.score(i));
            System.out.println(searcher.explain(q, hits.id(i)));
        }
    }
}


Running result:

bc 0.629606
0.629606 = (MATCH) fieldWeight (bookname: bc in 0), product:
1.4142135 = tf (termFreq (bookname: bc) = 2)
0.71231794 = idf (docFreq = 3, numDocs = 3)
0.625 = fieldNorm (field = bookname, doc = 0)

AB bc 0.4451987
0.4451987 = (MATCH) fieldWeight (bookname: bc in 1), product:
1.0 = tf (termFreq (bookname: bc) = 1)
0.71231794 = idf (docFreq = 3, numDocs = 3)
0.625 = fieldNorm (field = bookname, doc = 1)

AB bc cd 0.35615897
0.35615897 = (MATCH) fieldWeight (bookname: bc in 2), product:
1.0 = tf (termFreq (bookname: bc) = 1)
0.71231794 = idf (docFreq = 3, numDocs = 3)
0.5 = fieldNorm (field = bookname, doc = 2)


We can see from the results:
"bc" appears twice in the first document, so its tf is the square root of 2, i.e. 1.4142135. It appears once in each of the other two documents, so their tf is 1.0.
All three documents have the same idf value, 0.71231794.
By default the boost is 1.0, so fieldNorm is just the lengthNorm value. The first two documents both get 0.625, while the last gets 0.5 because the field is longer.

Now add a boost to the f2 field: f2.setBoost(2.0f);
The running result is:

AB bc 0.8903974
0.8903974 = (MATCH) fieldWeight (bookname: bc in 1), product:
1.0 = tf (termFreq (bookname: bc) = 1)
0.71231794 = idf (docFreq = 3, numDocs = 3)
1.25 = fieldNorm (field = bookname, doc = 1)


The fieldNorm value went from 0.625 to 1.25, i.e. it was multiplied by the boost of 2.0.

Next, also add a boost to the second document: doc2.setBoost(2.0f);
The running result is:

AB bc 1.0
1.7807949 = (MATCH) fieldWeight (bookname: bc in 1), product:
1.0 = tf (termFreq (bookname: bc) = 1)
0.71231794 = idf (docFreq = 3, numDocs = 3)
2.5 = fieldNorm (field = bookname, doc = 1)


fieldNorm was multiplied by 2 again (1.25 to 2.5), which shows that the Document boost and the Field boost are multiplied together.
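The multiplication can be checked directly (a sketch using the values from the explain outputs above, not the actual norm-encoding code):

```java
public class FieldNormCheck {
    public static void main(String[] args) {
        // The stored norm is the lengthNorm times every applicable boost
        float lengthNorm = 0.625f;  // from the unboosted explain output
        float fieldBoost = 2.0f;    // f2.setBoost(2.0f)
        float docBoost = 2.0f;      // doc2.setBoost(2.0f)
        float fieldNorm = lengthNorm * fieldBoost * docBoost;
        System.out.println(fieldNorm);  // 2.5, matching the explain output
    }
}
```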

Because this document's raw score, 1.7807949, exceeds 1.0, the final scores of the other two documents are divided by that value, giving:

bc 0.35355335
AB bc cd 0.19999999
