Lucene scoring mechanism

Last Update:2014-10-16 Source: Internet

Author: User


Original source: http://blog.chenlb.com/2009/08/lucene-scoring-architecture.html

Lucene's scoring mechanism is at the core of Lucene's reputation. It hides many complicated details from the user, which makes Lucene easy to use. In my view, however, if you want to adjust scoring (or build a custom sort order) for your own application, a thorough understanding of the scoring mechanism is essential.

Lucene scoring combines the vector space model and the Boolean model of information retrieval.

First, look at Lucene's scoring formula (described in the Similarity class):

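$$\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \left( \mathrm{tf}(t \text{ in } d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \right)$$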

where:

tf(t in d) correlates with the term frequency, that is, the number of times term t occurs in document d. The default implementation is:

$$\mathrm{tf}(t \text{ in } d) = \mathrm{frequency}^{1/2}$$

idf(t) correlates with the inverse document frequency. The document frequency, docFreq, is the number of documents in which term t appears. The smaller docFreq is, the higher idf is (rarer terms are worth more), but within one query the value for a given term is the same for every document. The default implementation is:

$$\mathrm{idf}(t) = 1 + \log\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq} + 1}\right)$$
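The two default formulas above can be tried out with a few lines of Java. This is a minimal standalone sketch, not Lucene's own classes: the class name TfIdfSketch and its method signatures are made up for illustration, and it assumes that "log" in the idf formula is the natural logarithm.

```java
// Standalone sketch of the default tf and idf formulas; not Lucene's own classes.
class TfIdfSketch {

    // tf(t in d) = frequency^(1/2)
    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    // idf(t) = 1 + log(numDocs / (docFreq + 1)), assuming the natural logarithm
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        // A term that occurs 4 times in a document.
        System.out.println("tf(4) = " + tf(4));
        // A rare term (9 of 1000 documents) versus a common one (499 of 1000).
        System.out.println("idf(rare) = " + idf(9, 1000));
        System.out.println("idf(common) = " + idf(499, 1000));
    }
}
```

The rare term receives the larger idf, which is the "rarer terms are worth more" behaviour described above.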

coord(q,d) is a score factor based on how many of the query terms appear in the document. The more query terms a document contains, the better it matches the query. The default is the fraction of query terms that appear in the document.

queryNorm(q) normalizes the query so that scores from different queries can be compared. This factor does not affect the ranking of documents, because the same factor is applied to every document. The default is:

$$\mathrm{queryNorm}(q) = \mathrm{queryNorm}(\mathrm{sumOfSquaredWeights}) = \frac{1}{\mathrm{sumOfSquaredWeights}^{1/2}}$$

The sum of the squared weights of the query terms (sumOfSquaredWeights) is computed by the Weight class. For example, BooleanQuery computes:

$$\mathrm{sumOfSquaredWeights} = q.\mathrm{getBoost}()^2 \cdot \sum_{t \in q} \left( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \right)^2$$
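As a rough illustration of the two formulas above, the following sketch computes sumOfSquaredWeights for a Boolean-style query and derives queryNorm from it. The class QueryNormSketch and its array-based parameters are assumptions made for this example, not part of Lucene.

```java
// Standalone sketch of the query normalization formulas; not Lucene's own classes.
class QueryNormSketch {

    // sumOfSquaredWeights = q.getBoost()^2 * sum over t in q of (idf(t) * t.getBoost())^2
    static double sumOfSquaredWeights(double queryBoost, double[] idfs, double[] termBoosts) {
        double sum = 0.0;
        for (int i = 0; i < idfs.length; i++) {
            double w = idfs[i] * termBoosts[i];
            sum += w * w;
        }
        return queryBoost * queryBoost * sum;
    }

    // queryNorm = 1 / sumOfSquaredWeights^(1/2)
    static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    public static void main(String[] args) {
        // Two query terms with idf 3 and 4, all boosts left at 1.
        double sum = sumOfSquaredWeights(1.0, new double[] {3.0, 4.0}, new double[] {1.0, 1.0});
        System.out.println("sumOfSquaredWeights = " + sum);
        System.out.println("queryNorm = " + queryNorm(sum));
    }
}
```

Because queryNorm depends only on the query, multiplying every document's score by it leaves the ranking unchanged.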

t.getBoost() is the search-time boost of term t in the query, either given in the query text (for example: java^1.2) or set programmatically with setBoost().
norm(t,d) encapsulates several index-time boost and length factors: the document boost, the field boost, and lengthNorm(field).
All of these factors are multiplied together to produce the norm value, and if the document has several fields with the same name, their boosts are multiplied together as well:

$$\mathrm{norm}(t,d) = \mathrm{doc.getBoost}() \cdot \mathrm{lengthNorm}(\mathrm{field}) \cdot \prod_{\text{field } f \text{ in } d \text{ named as } t} f.\mathrm{getBoost}()$$

At index time, the norm value is encoded into a single byte and stored in the index. At search time, the norm value in the index is decoded back into a float. Both encoding and decoding are provided by Similarity. According to the official documentation, the process is not reversible because of precision loss; for example, decode(encode(0.89)) = 0.75.
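To see how the factors defined above combine, here is a minimal sketch of the whole scoring formula. It takes every factor as an already computed number; the class ScoreFormulaSketch and its parallel-array parameters are hypothetical and only illustrate the arithmetic.

```java
// Standalone sketch of score(q,d); every factor is passed in precomputed.
class ScoreFormulaSketch {

    // score = coord * queryNorm * sum over matched terms of (tf * idf^2 * boost * norm)
    static double score(double coord, double queryNorm,
                        double[] tf, double[] idf, double[] boost, double[] norm) {
        double sum = 0.0;
        for (int i = 0; i < tf.length; i++) {
            sum += tf[i] * idf[i] * idf[i] * boost[i] * norm[i];
        }
        return coord * queryNorm * sum;
    }

    public static void main(String[] args) {
        // One of two query terms matches: coord = 0.5.
        // The matched term has tf = 2, idf = 3, boost = 1, norm = 0.5.
        double s = score(0.5, 0.2,
                new double[] {2.0}, new double[] {3.0},
                new double[] {1.0}, new double[] {0.5});
        System.out.println("score = " + s);
    }
}
```

Note that idf enters squared: once through the query weight and once through the document side of the formula.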

Computing this score involves several core classes and interfaces: Similarity, Query, Weight, Scorer, and Searcher. They, or their subclasses, carry out the scoring. First, look at their class diagram:

[Figure: UML class diagram of Lucene search scoring]

During a search, scoring proceeds as follows:

Create a Query object and pass it to a Searcher, for example an IndexSearcher.
The Searcher creates the corresponding Weight (the internal representation of the Query) from the Query, and the Weight creates the corresponding Scorer.
The Searcher creates a HitCollector and passes it to the Scorer. The Scorer finds the matching documents, computes their scores, and writes them to the HitCollector.
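The flow above can be mimicked with a toy model. The interfaces below only imitate the roles of Query, Weight, Scorer, and HitCollector; they are hypothetical simplifications, not Lucene's API (for instance, the toy Weight receives an array of matching document IDs instead of an index reader, and the toy query gives every match the same score).

```java
// Toy model of the search flow; these types only mimic the roles of
// Lucene's Query, Weight, Scorer and HitCollector and are not its API.
class SearchFlowSketch {

    interface Collector { void collect(int doc, float score); }

    interface Scorer { boolean next(); int doc(); float score(); }

    interface Weight { Scorer scorer(int[] matchingDocs); }

    interface Query { Weight createWeight(); }

    // Trivial query: every matching document gets the same constant score.
    static Query constantQuery(final float constantScore) {
        return new Query() {
            public Weight createWeight() {
                return new Weight() {
                    public Scorer scorer(final int[] docs) {
                        return new Scorer() {
                            private int pos = -1;
                            public boolean next() { return ++pos < docs.length; }
                            public int doc() { return docs[pos]; }
                            public float score() { return constantScore; }
                        };
                    }
                };
            }
        };
    }

    // Plays the role of the Searcher: Query -> Weight -> Scorer -> Collector.
    static void search(Query query, int[] matchingDocs, Collector collector) {
        Weight weight = query.createWeight();         // Weight is derived from the Query
        Scorer scorer = weight.scorer(matchingDocs);  // Weight creates the Scorer
        while (scorer.next()) {                       // Scorer walks the matching documents
            collector.collect(scorer.doc(), scorer.score());
        }
    }

    public static void main(String[] args) {
        search(constantQuery(2.0f), new int[] {1, 5, 9},
                (doc, score) -> System.out.println("doc " + doc + " score " + score));
    }
}
```

The point of the sketch is the division of labour: the query object itself holds no search state, so the same query could be handed to another searcher.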

Query, Weight, and Scorer are very closely related, especially Query and Weight. Weight computes the query weights and creates the Scorer. So that a Query can be reused, its internal state is abstracted into a Weight, and the subclasses carry out the relevant parts of the score computation.

Any Searcher-dependent state is stored in the Weight implementation, not in the Query, so that the Query can be reused.

Weight life cycle (in order of use):

The Weight is created by the top-level Query: Query.createWeight(Searcher) creates the Weight for the Searcher to use.
Weight.sumOfSquaredWeights() is called, and its result is passed to Similarity.queryNorm(float) to compute the query normalization factor.
The query normalization factor is passed to Weight.normalize(float), at which point the weight computation is complete.
A Scorer is created.
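The life cycle can be traced with a small term-weight sketch. ToyTermWeight below is a hypothetical stand-in that follows the formulas quoted in this article ((idf * boost)^2 for the sum of squared weights and idf^2 * boost * queryNorm for the final value); it is not Lucene's TermWeight class.

```java
// Toy walk through the Weight life cycle for a single-term query.
class WeightLifecycleSketch {

    // Hypothetical stand-in for a term weight; not Lucene's class.
    static final class ToyTermWeight {
        private final float idf;
        private final float boost;
        private float queryWeight;
        private float value;

        ToyTermWeight(float idf, float boost) {
            this.idf = idf;
            this.boost = boost;
        }

        // (idf * boost)^2
        float sumOfSquaredWeights() {
            queryWeight = idf * boost;
            return queryWeight * queryWeight;
        }

        // After normalization: value = idf^2 * boost * queryNorm
        void normalize(float queryNorm) {
            queryWeight *= queryNorm;
            value = queryWeight * idf;
        }

        float getValue() {
            return value;
        }
    }

    // queryNorm = 1 / sumOfSquaredWeights^(1/2)
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    // Runs the life cycle in order and returns the final weight value.
    static float finalValue(float idf, float boost) {
        ToyTermWeight weight = new ToyTermWeight(idf, boost); // created for the query
        float sum = weight.sumOfSquaredWeights();             // sum of squared weights
        weight.normalize(queryNorm(sum));                     // normalization completes the weight
        return weight.getValue();                             // ready to be used by a scorer
    }

    public static void main(String[] args) {
        System.out.println(finalValue(2.0f, 3.0f));
    }
}
```

For a query with a single term, the boost cancels against queryNorm, so the final value comes out as approximately idf.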

Customizing the score calculation

You can implement your own Similarity to replace the default. However, a Similarity can only reprocess the factor values that Scorer and Weight have already computed. For more control over scoring, implement your own set of Query, Weight, and Scorer.
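As a sketch of what replacing one factor looks like, the example below overrides tf so that repeated occurrences of a term no longer raise the score. Both classes are hypothetical stand-ins written for this illustration; in a real application the subclass would extend Lucene's own Similarity implementation instead.

```java
// Toy stand-ins for a default Similarity and a custom replacement.
class CustomSimilaritySketch {

    // Hypothetical base class holding the default tf formula from this article.
    static class ToyDefaultSimilarity {
        double tf(double freq) {
            return Math.sqrt(freq);
        }
    }

    // Custom variant: a term counts once, no matter how often it repeats.
    static class BinaryTfSimilarity extends ToyDefaultSimilarity {
        @Override
        double tf(double freq) {
            return freq > 0 ? 1.0 : 0.0;
        }
    }

    public static void main(String[] args) {
        ToyDefaultSimilarity standard = new ToyDefaultSimilarity();
        ToyDefaultSimilarity custom = new BinaryTfSimilarity();
        System.out.println("default tf(9) = " + standard.tf(9));
        System.out.println("custom tf(9) = " + custom.tf(9));
    }
}
```

Only the factor changes here; how the factors are combined stays with the Weight and Scorer, which is the limitation noted above.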

Query is an abstraction of the user's information need.
Weight is an abstraction of the internal representation of a Query.
Scorer abstracts the common scoring functionality: it provides the ability to compute a score and to explain it.

Methods that a Query subclass implements:

createWeight(Searcher searcher) -- a Weight is the internal representation of a Query, so every Query must implement a Weight. This method creates the Weight object that corresponds to the Query.
rewrite(IndexReader reader) -- rewrites the query into primitive queries. Primitive queries include TermQuery, BooleanQuery, ...

Weight interface methods:

Weight#getQuery() -- returns the Query that this Weight represents.
Weight#getValue() -- the weight of the Query; for example, the value of TermQuery.TermWeight = idf^2 * boost * queryNorm.
Weight#sumOfSquaredWeights() -- the sum of the squared weights of the query terms; for example, for TermWeight it is (idf * boost)^2.
Weight#normalize(float) -- receives the query normalization factor; normalized query values make scores from different queries comparable.
Weight#scorer(IndexReader) -- creates a Scorer for the query, which scores the documents that the query matches.
Weight#explain(IndexReader, int) -- explains in detail how the score was computed for the given document.

Methods that a Scorer subclass implements:

Scorer#next() -- advances to the next matching document and returns true if there is one.
Scorer#doc() -- returns the ID of the current matching document. It is valid only after next() has been called.
Scorer#score() -- returns the score of the current document. The implementation may compute it in any appropriate way; for example, TermScorer returns tf * Weight.getValue() * fieldNorm.
Scorer#skipTo(int) -- skips to the first matching document whose ID is greater than or equal to the given value. In many cases, skipTo is faster and more efficient than looping over the result set with next().
Scorer#explain(int) -- gives the details of how the score was computed.
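The iteration methods above can be illustrated with a toy scorer over a sorted array of document IDs. ToyTermScorer is a hypothetical class written for this example; it scores with the tf * Weight.getValue() * fieldNorm product mentioned above, and its skipTo simply loops over next(), which shows the contract rather than the speed advantage.

```java
// Toy scorer over a sorted list of matching document IDs; not Lucene's TermScorer.
class ScorerIterationSketch {

    static final class ToyTermScorer {
        private final int[] docs;          // sorted IDs of matching documents
        private final float[] tfs;         // tf value for each matching document
        private final float[] fieldNorms;  // field norm for each matching document
        private final float weightValue;   // stands in for Weight.getValue()
        private int pos = -1;

        ToyTermScorer(int[] docs, float[] tfs, float[] fieldNorms, float weightValue) {
            this.docs = docs;
            this.tfs = tfs;
            this.fieldNorms = fieldNorms;
            this.weightValue = weightValue;
        }

        // Advances to the next matching document; false when there is none left.
        boolean next() {
            return ++pos < docs.length;
        }

        // ID of the current match; only valid after next() returned true.
        int doc() {
            return docs[pos];
        }

        // tf * Weight.getValue() * fieldNorm for the current match.
        float score() {
            return tfs[pos] * weightValue * fieldNorms[pos];
        }

        // Moves to the first later match whose ID is >= target.
        boolean skipTo(int target) {
            do {
                if (!next()) {
                    return false;
                }
            } while (docs[pos] < target);
            return true;
        }
    }

    public static void main(String[] args) {
        ToyTermScorer scorer = new ToyTermScorer(
                new int[] {2, 5, 8, 13},
                new float[] {1f, 2f, 3f, 4f},
                new float[] {0.5f, 0.5f, 0.5f, 0.5f},
                2.0f);
        while (scorer.next()) {
            System.out.println("doc " + scorer.doc() + " score " + scorer.score());
        }
    }
}
```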

To implement your own set of Query, Weight, and Scorer, it is best to study TermQuery, TermWeight, and TermScorer first.

When none of Lucene's built-in queries meets your needs (including differences in scoring details), a custom query can help.

Important references:

http://lucene.apache.org/java/2_4_1/scoring.html
http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/search/package-summary.html#scoring
http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/search/Weight.html
