Lucene scoring mechanism

Original source: http://blog.chenlb.com/2009/08/lucene-scoring-architecture.html

The Lucene scoring mechanism is at the core of Lucene's reputation. It hides many complicated details from the user, which makes Lucene easy to use. In my opinion, though, if you want to tune scores (or build your own sort order) for your application, a thorough understanding of the Lucene scoring mechanism is very important.

Lucene scoring combines the vector space model and the Boolean model of information retrieval.

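First, look at Lucene's scoring formula (documented in the Similarity class):

$$\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \Big( \mathrm{tf}(t \text{ in } d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \Big)$$

where: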

  1. tf(t in d) is the term frequency factor, based on the number of times term t occurs in document d (frequency). The default implementation is:
    $$\mathrm{tf}(t \text{ in } d) = \mathrm{frequency}^{1/2}$$
  2. idf(t) is the inverse document frequency factor. The document frequency docFreq is the number of documents in which term t appears. The smaller docFreq is, the higher idf is (rare terms are worth more); within a single query, though, idf(t) is the same for every document. The default implementation is:
    $$\mathrm{idf}(t) = 1 + \log\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq} + 1}\right)$$
  3. coord(q,d) is a score factor based on how many of the query terms appear in the document. The more query terms a document contains, the better it matches the query. The default is the fraction of the query terms that appear in the document.
  4. queryNorm(q) normalizes the query so that scores from different queries can be compared. This factor does not affect the ranking of documents, because the same factor is applied to every document. The default implementation is:
    $$\mathrm{queryNorm}(q) = \mathrm{queryNorm}(\mathrm{sumOfSquaredWeights}) = \frac{1}{\mathrm{sumOfSquaredWeights}^{1/2}}$$

    The sum of the squared weights of the query terms (sumOfSquaredWeights) is computed by the Weight class. For example, a BooleanQuery computes:

    $$\mathrm{sumOfSquaredWeights} = q.\mathrm{getBoost}()^2 \cdot \sum_{t \in q} \big( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \big)^2$$
  5. t.getBoost() is the boost of term t in the query, set either in the query text at search time (for example: java^1.2) or by the program through setBoost().
  6. norm(t,d) encapsulates several boost and length factors that are applied at index time:
      • Document boost - the boost of the document, set by calling doc.setBoost() before indexing
      • Field boost - the boost of the field, set by calling field.setBoost() before indexing
      • lengthNorm(field) - computed from the number of tokens in the field when the document is indexed, by Similarity.lengthNorm; the shorter the field, the higher the score.
    All of these factors are multiplied together to give the norm value. If the document has several fields with the same name, their boosts are multiplied together:

    $$\mathrm{norm}(t,d) = doc.\mathrm{getBoost}() \cdot \mathrm{lengthNorm}(\mathrm{field}) \cdot \prod_{\text{field } f \text{ in } d \text{ named as } t} f.\mathrm{getBoost}()$$

    At index time, the norm value is encoded into a single byte and stored in the index. At search time, the norm value is decoded from the index back into a float; the encode/decode functions are provided by Similarity. According to the official documentation, the round trip is not reversible because of the loss of precision, for example: decode(encode(0.89)) = 0.75.
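The factors above can be put together in a few lines of plain Java. This is only a minimal sketch that does not use the Lucene library: the class name `ScoreFormulaSketch`, the method signatures, and every number in `main` are made up for illustration, and `log` is assumed to be the natural logarithm.

```java
// Stand-alone sketch of the default scoring factors; it does not depend on Lucene.
class ScoreFormulaSketch {

    // tf(t in d) = frequency^(1/2)
    static double tf(int frequency) {
        return Math.sqrt(frequency);
    }

    // idf(t) = 1 + log(numDocs / (docFreq + 1)), natural logarithm assumed
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // coord(q,d) = fraction of the query terms that appear in the document
    static double coord(int matchedTerms, int queryTerms) {
        return (double) matchedTerms / queryTerms;
    }

    // queryNorm(q) = 1 / sumOfSquaredWeights^(1/2)
    static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    // Contribution of one term: tf * idf^2 * boost * norm
    static double termScore(int frequency, double idf, double boost, double norm) {
        return tf(frequency) * idf * idf * boost * norm;
    }

    public static void main(String[] args) {
        // Hypothetical index of 1000 documents and a two-term query, all boosts 1.
        int numDocs = 1000;
        double idfA = idf(9, numDocs);
        double idfB = idf(99, numDocs);
        double sumOfSquaredWeights = idfA * idfA + idfB * idfB;

        // Both terms match the document; 0.5 is an arbitrary norm(t,d) value.
        double score = coord(2, 2) * queryNorm(sumOfSquaredWeights)
                * (termScore(4, idfA, 1.0, 0.5) + termScore(1, idfB, 1.0, 0.5));
        System.out.println("score = " + score);
    }
}
```

Because queryNorm and coord multiply the whole sum, changing them rescales the score of a document without changing which term contributes most.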

Computing this score involves several core classes/interfaces: Similarity, Query, Weight, Scorer, and Searcher. They, or their subclasses, carry out the scoring. First look at their class diagram:

[Figure: UML class diagram of Lucene search scoring]

During a search, scoring proceeds as follows:

    1. Create a Query object and pass it to a Searcher, typically an IndexSearcher.
    2. The Searcher creates the corresponding Weight (the internal representation of the Query) from the Query, and the Weight creates the corresponding Scorer.
    3. The Searcher creates a HitCollector and passes it to the Scorer; the Scorer finds the matching documents, computes their scores, and writes them to the HitCollector.
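The three steps above can be sketched with a few toy interfaces. The interfaces below are simplified, hypothetical stand-ins that only borrow the names of the Lucene classes (the real methods take a Searcher or an IndexReader), and `constantQuery` is an invented helper that matches a fixed list of document IDs.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the Query -> Weight -> Scorer -> HitCollector flow; not the Lucene API.
class SearchFlowSketch {

    interface Scorer { boolean next(); int doc(); float score(); }
    interface Weight { Scorer scorer(); }
    interface Query { Weight createWeight(); }
    interface HitCollector { void collect(int doc, float score); }

    // Steps 2 and 3: derive a Weight, obtain a Scorer, feed every match to the collector.
    static void search(Query query, HitCollector collector) {
        Weight weight = query.createWeight();
        Scorer scorer = weight.scorer();
        while (scorer.next()) {
            collector.collect(scorer.doc(), scorer.score());
        }
    }

    // Hypothetical query that matches the given document IDs with a constant score.
    static Query constantQuery(final int[] docs, final float score) {
        final Weight weight = () -> new Scorer() {
            private int pos = -1;
            public boolean next() { return ++pos < docs.length; }
            public int doc() { return docs[pos]; }
            public float score() { return score; }
        };
        return () -> weight;
    }

    // Step 1 plus collection: build a query, run it, and record "doc:score" strings.
    static List<String> run() {
        final List<String> hits = new ArrayList<>();
        search(constantQuery(new int[] {2, 5, 9}, 1.5f),
               (doc, score) -> hits.add(doc + ":" + score));
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Each call to `scorer()` returns a fresh iterator, so the same toy query can be searched more than once.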

Query, Weight, and Scorer are very closely related, especially Query and Weight. A Weight computes the query weights and creates the Scorer. To keep a Query reusable, its internal features are factored out into a Weight, and the subclasses carry out the relevant parts of the score calculation.

Any Searcher-dependent state is stored in the Weight implementation, not in the Query, so that the Query can be reused.

The life cycle of a Weight:

    1. The Weight is created by the top-level Query: Query.createWeight(Searcher) creates the Weight that the Searcher will use.
    2. Weight.sumOfSquaredWeights() is called, and its result is passed to Similarity.queryNorm(float) to compute the query normalization factor.
    3. The query normalization factor is passed to Weight.normalize(float); at this point the weighting is complete.
    4. A Scorer is created.

Customizing the score calculation

You can implement your own Similarity to replace the default one. This is limited to reprocessing the factor values that Scorer and Weight have already computed. For more control over scoring, you can implement your own set of Query, Weight, and Scorer.

    • Query is an abstraction of the user's information need
    • Weight is an abstraction of the internal representation of a Query
    • Scorer abstracts the common scoring functionality: it computes scores and can explain how a score was obtained (explanation).
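As a rough illustration of the first option, replacing the Similarity, the sketch below swaps only the tf factor while everything else stays the same. `DefaultSim` and `BinaryTfSim` are hypothetical classes written for this example, not Lucene classes, and `termFactor` stands in for the part of the formula that a Weight and Scorer would compute from the Similarity's values.

```java
// Hypothetical, simplified Similarity classes; they do not extend anything from Lucene.
class CustomSimilaritySketch {

    // Stand-in for the default behaviour described earlier: tf = sqrt(freq), idf = 1 + log(...).
    static class DefaultSim {
        double tf(double frequency) {
            return Math.sqrt(frequency);
        }

        double idf(int docFreq, int numDocs) {
            return 1.0 + Math.log((double) numDocs / (docFreq + 1));
        }
    }

    // Custom variant: a term counts once, no matter how often it repeats in the document.
    static class BinaryTfSim extends DefaultSim {
        @Override
        double tf(double frequency) {
            return frequency > 0 ? 1.0 : 0.0;
        }
    }

    // tf * idf^2 for one term, using whichever similarity is plugged in.
    static double termFactor(DefaultSim sim, double frequency, int docFreq, int numDocs) {
        double idf = sim.idf(docFreq, numDocs);
        return sim.tf(frequency) * idf * idf;
    }

    public static void main(String[] args) {
        // Same term statistics, two similarities.
        System.out.println(termFactor(new DefaultSim(), 9, 0, 1));
        System.out.println(termFactor(new BinaryTfSim(), 9, 0, 1));
    }
}
```

The custom class only changes the value of a factor; how the factors are combined is still decided elsewhere, which is the limitation the text describes.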

Methods that a Query subclass implements:

    1. createWeight(Searcher searcher) -- A Weight is the internal representation of a Query, so every Query must provide a Weight implementation; this method creates the Weight object that corresponds to the Query.
    2. rewrite(IndexReader reader) -- Rewrites the query into primitive queries; the primitive queries include TermQuery, BooleanQuery ...

Methods of the Weight interface:

    1. Weight#getQuery() -- returns the Query that this Weight represents.
    2. Weight#getValue() -- the weight of the Query, for example: the value of TermQuery.TermWeight = idf^2 * boost * queryNorm
    3. Weight#sumOfSquaredWeights() -- the sum of the squared weights of the query terms, e.g. for TermWeight = (idf * boost)^2
    4. Weight#normalize(float) -- assigns the query normalization factor; normalized values allow scores from different queries to be compared
    5. Weight#scorer(IndexReader) -- creates the Scorer for the query, which scores the documents that the query matches.
    6. Weight#explain(IndexReader, int) -- explains in detail how the score was obtained for the specified document.
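The Weight life cycle and the two example formulas above (value = idf^2 * boost * queryNorm and sumOfSquaredWeights = (idf * boost)^2) can be traced with a toy class. The `TermWeight` below is a simplified, hypothetical stand-in rather than Lucene's inner class, `prepare` is an invented helper, and the query boost is assumed to be 1.

```java
// Toy walk-through of sumOfSquaredWeights -> queryNorm -> normalize -> getValue.
class WeightLifecycleSketch {

    static class TermWeight {
        private final double idf;
        private final double boost;
        private double value;

        TermWeight(double idf, double boost) {
            this.idf = idf;
            this.boost = boost;
        }

        // (idf * boost)^2
        double sumOfSquaredWeights() {
            double w = idf * boost;
            return w * w;
        }

        // Receives the query normalization factor; value = idf^2 * boost * queryNorm
        void normalize(double queryNorm) {
            value = idf * idf * boost * queryNorm;
        }

        double getValue() {
            return value;
        }
    }

    // queryNorm = 1 / sqrt(sumOfSquaredWeights)
    static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    // Sum the squared weights of all terms, derive the norm, then normalize each weight.
    static double[] prepare(TermWeight... weights) {
        double sum = 0.0;
        for (TermWeight w : weights) {
            sum += w.sumOfSquaredWeights();
        }
        double norm = queryNorm(sum);
        double[] values = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            weights[i].normalize(norm);
            values[i] = weights[i].getValue();
        }
        return values;
    }

    public static void main(String[] args) {
        // Made-up idf and boost values for a two-term query.
        double[] values = prepare(new TermWeight(3.0, 1.0), new TermWeight(2.0, 2.0));
        System.out.println(values[0] + " " + values[1]);
    }
}
```

With these made-up inputs the squared weights are 9 and 16, so the normalization factor is 1/5 and the two values come out near 1.8 and 1.6.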

Methods that Scorer subclasses implement:

    1. Scorer#next() -- advances to the next matching document; returns true if there is one.
    2. Scorer#doc() -- returns the ID of the currently matched document; it is only valid after next() has been called.
    3. Scorer#score() -- returns the score of the current document. An implementation can compute it in any appropriate way; for example, TermScorer returns tf * Weight.getValue() * fieldNorm
    4. Scorer#skipTo(int) -- skips to the first matching document whose ID is greater than or equal to the given value. In many cases skipTo is faster than looping over the result set with next().
    5. Scorer#explain(int) -- gives the details of how the score was computed.
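A scorer of this shape can be sketched over an in-memory posting list. `PostingsScorerSketch` is a hypothetical class that keeps document IDs, term frequencies, and field norms in arrays. Its `score()` follows the tf * Weight.getValue() * fieldNorm pattern mentioned above, with tf assumed to be the square root of the frequency. Its `skipTo` is written as a loop over `next()`, which shows the contract but not the speed-up a real index can provide.

```java
// Toy scorer over sorted in-memory postings; it does not extend Lucene's Scorer.
class PostingsScorerSketch {
    private final int[] docs;        // matching document IDs, ascending
    private final int[] freqs;       // term frequency in each matching document
    private final float[] fieldNorms; // decoded norm of each matching document
    private final float weightValue; // what Weight.getValue() would return
    private int pos = -1;

    PostingsScorerSketch(int[] docs, int[] freqs, float[] fieldNorms, float weightValue) {
        this.docs = docs;
        this.freqs = freqs;
        this.fieldNorms = fieldNorms;
        this.weightValue = weightValue;
    }

    // Advance to the next matching document; false when the postings are exhausted.
    boolean next() {
        return ++pos < docs.length;
    }

    // ID of the current match; only valid after next() or skipTo() returned true.
    int doc() {
        return docs[pos];
    }

    // tf * weight value * fieldNorm for the current document
    float score() {
        return (float) Math.sqrt(freqs[pos]) * weightValue * fieldNorms[pos];
    }

    // Move forward at least once, stopping at the first document with ID >= target.
    boolean skipTo(int target) {
        do {
            if (!next()) {
                return false;
            }
        } while (target > doc());
        return true;
    }

    public static void main(String[] args) {
        PostingsScorerSketch scorer = new PostingsScorerSketch(
                new int[] {1, 4, 7, 10}, new int[] {1, 4, 9, 16},
                new float[] {1f, 0.5f, 0.5f, 0.25f}, 2f);
        while (scorer.next()) {
            System.out.println(scorer.doc() + " -> " + scorer.score());
        }
    }
}
```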

Before implementing your own set of Query, Weight, and Scorer, it is best to study TermQuery, TermWeight, and TermScorer.

When none of the queries built into Lucene fits your needs (including differences in scoring details), a custom query can help.

Important references:

    • http://lucene.apache.org/java/2_4_1/scoring.html
    • http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/search/package-summary.html#scoring
    • http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/search/Weight.html
