Lucene relevance scoring formula

score_d = sum_t( tf_q * idf_t / norm_q  *  tf_d * idf_t / norm_d_t  *  boost_t ) * coord_q_d

Note:

score_d: the score of document d

sum_t: sum over all terms t in the query

tf_q: the square root of the number of times term t appears in the query string q

tf_d: the square root of the number of times term t appears in document d

numDocs: the total number of documents in the index (i.e. the documents that could receive a non-zero score)

docFreq_t: the number of documents containing term t

idf_t: log( numDocs / (docFreq_t + 1) ) + 1.0

norm_q: sqrt( sum_t( (tf_q * idf_t)^2 ) )

norm_d_t: the square root of the number of tokens in document d in the same field as term t

boost_t: the boost factor of term t, usually 1.0

coord_q_d: the number of query terms found in document d, divided by the total number of terms in query q
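
Putting the notation above together, a minimal Java sketch of this simplified formula might look like the following. The TermStat holder and its field names are made up for illustration; in Lucene these statistics come from the index and the parsed query, not from this class:

```java
// Hypothetical per-term statistics for one matched term (not Lucene API).
class TermStat {
    int freqInQuery;   // occurrences of the term in the query string q
    int freqInDoc;     // occurrences of the term in document d
    int docFreq;       // docFreq_t: documents containing the term
    int fieldLength;   // tokens in the field of d in which the term occurs
    float boost = 1.0f;
}

class SimplifiedScorer {
    // score_d = sum_t( tf_q*idf_t/norm_q * tf_d*idf_t/norm_d_t * boost_t ) * coord_q_d
    static float score(TermStat[] matched, int numDocs, int queryTermCount) {
        // norm_q = sqrt( sum_t( (tf_q * idf_t)^2 ) )
        double sumSquares = 0;
        for (TermStat t : matched) {
            double w = Math.sqrt(t.freqInQuery) * idf(t.docFreq, numDocs);
            sumSquares += w * w;
        }
        double normQ = Math.sqrt(sumSquares);

        double score = 0;
        for (TermStat t : matched) {
            double idf = idf(t.docFreq, numDocs);
            double queryWeight = Math.sqrt(t.freqInQuery) * idf / normQ;
            double docWeight = Math.sqrt(t.freqInDoc) * idf / Math.sqrt(t.fieldLength);
            score += queryWeight * docWeight * t.boost;
        }
        double coord = (double) matched.length / queryTermCount;   // coord_q_d
        return (float) (score * coord);
    }

    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;   // idf_t
    }
}
```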

The official explanation is as follows:
score(q,d) = coord(q,d) · queryNorm(q) · Σ_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )
Where

tf(t in d) correlates to the term's frequency, defined as the number of times term t appears in the currently scored document d. Documents that have more occurrences of a given term receive a higher score. The default computation of tf(t in d) in DefaultSimilarity is:

tf(t in d) = sqrt(frequency)
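
As a rough standalone sketch (not the Lucene class itself), the default tf amounts to:

```java
class TfSketch {
    // Default tf described above: square root of the term's in-document frequency.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }
    // tf(1) = 1.0, tf(4) = 2.0, tf(9) = 3.0: extra occurrences help, with diminishing returns.
}
```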

idf(t) stands for inverse document frequency. This value correlates to the inverse of docFreq (the number of documents in which the term t appears). This means rarer terms give a higher contribution to the total score. The default computation of idf(t) in DefaultSimilarity is:

idf(t) = 1 + log( numDocs / (docFreq + 1) )
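
In plain Java, the same computation would look roughly like this:

```java
class IdfSketch {
    // Default idf described above: rarer terms (smaller docFreq) get a larger weight.
    static float idf(long docFreq, long numDocs) {
        return (float) (1.0 + Math.log((double) numDocs / (docFreq + 1)));
    }
    // For an index of 1,000 documents: idf(5, 1000) ≈ 6.1 while idf(500, 1000) ≈ 1.7.
}
```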

coord(q,d) is a score factor based on how many of the query terms are found in the specified document. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms. This is a search-time factor computed by the Similarity in effect at search time.
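
A minimal sketch of this factor, assuming overlap is the number of query terms that match the document and maxOverlap is the total number of query terms:

```java
class CoordSketch {
    // coord(q,d): reward documents that match more of the query's terms.
    static float coord(int overlap, int maxOverlap) {
        return (float) overlap / maxOverlap;
    }
    // A document matching 2 terms of a 3-term query gets coord = 2/3 ≈ 0.67.
}
```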

queryNorm(q) is a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable. This is a search-time factor computed by the Similarity in effect at search time. The default computation in DefaultSimilarity is:

queryNorm(q) = queryNorm(sumOfSquaredWeights) = 1 / sqrt(sumOfSquaredWeights)
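
As a sketch of that default computation:

```java
class QueryNormSketch {
    // Default queryNorm described above: does not change ranking, only the score scale.
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }
}
```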

The sum of squared weights (of the query terms) is computed by the query's Weight object. For example, a BooleanQuery computes this value as:

sumOfSquaredWeights = q.getBoost()² · Σ_{t in q} ( idf(t) · t.getBoost() )²
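
A sketch of that sum, assuming the per-term idf values and boosts have already been gathered into parallel arrays:

```java
class SumOfSquaredWeightsSketch {
    // sumOfSquaredWeights = q.getBoost()^2 * Σ_t ( idf(t) * t.getBoost() )^2
    static float sumOfSquaredWeights(float queryBoost, float[] idf, float[] termBoost) {
        float sum = 0f;
        for (int i = 0; i < idf.length; i++) {
            float w = idf[i] * termBoost[i];
            sum += w * w;
        }
        return queryBoost * queryBoost * sum;
    }
}
```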

t.getBoost() is a search-time boost of term t in the query q, as specified in the query text (see query syntax), or as set by application calls to setBoost(). Notice that there is really no direct API for accessing the boost of one term in a multi-term query; rather, multi-term queries are represented in a query as multiple TermQuery objects, so the boost of a term in the query is accessible by calling its sub-query's getBoost().
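
For example, in older Lucene releases (before Query.setBoost() was removed around 7.0) a single term clause can be boosted like this; the field and term here are made up for illustration:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;

public class BoostExample {
    public static void main(String[] args) {
        // Assumes a pre-7.0 Lucene API where Query.setBoost() still exists;
        // in query syntax the equivalent is writing the term as  title:lucene^2
        TermQuery clause = new TermQuery(new Term("title", "lucene"));
        clause.setBoost(2.0f);   // this term contributes twice its normal weight
        System.out.println(clause.getBoost());
    }
}
```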

norm(t,d) encapsulates a few (indexing time) boost and length factors:

Document boost - set by calling doc.setBoost() before adding the document to the index.

Field boost - set by calling field.setBoost() before adding the field to a document.

lengthNorm(field) - computed when the document is added to the index, in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score. lengthNorm is computed by the Similarity class in effect at indexing time.

When a document is added to the index, all the above factors are multiplied. If the document has multiple fields with the same name, all their boosts are multiplied together:

norm(t,d) = doc.getBoost() · lengthNorm(field) · Π f.getBoost()   (product over every field f in d named as t)
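
A sketch of how these index-time factors combine, assuming the classic default lengthNorm of 1 / sqrt(number of tokens in the field):

```java
class NormSketch {
    // norm(t,d) = doc.getBoost() * lengthNorm(field) * product of same-named field boosts
    static float norm(float docBoost, float[] fieldBoosts, int fieldTokenCount) {
        float boost = docBoost;
        for (float b : fieldBoosts) {      // boosts of all fields in d named as t multiply together
            boost *= b;
        }
        float lengthNorm = (float) (1.0 / Math.sqrt(fieldTokenCount));   // shorter fields score higher
        return boost * lengthNorm;
    }
}
```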

However, the resulting norm value is encoded as a single byte before being stored. At search time, the norm byte value is read from the index directory and decoded back into a float norm value. This encoding/decoding, while reducing index size, comes with the price of precision loss: it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75. Also notice that search time is too late to modify this norm part of the scoring, e.g. by using a different Similarity at search time.
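
In older Lucene releases (roughly 4.x to 6.x) this one-byte encoding is exposed through the SmallFloat utility, so the precision loss can be observed directly; the exact decoded value depends on the release:

```java
import org.apache.lucene.util.SmallFloat;

public class NormPrecision {
    public static void main(String[] args) {
        // Encode a norm of 0.89 into a single byte, then decode it again.
        byte encoded = SmallFloat.floatToByte315(0.89f);
        float decoded = SmallFloat.byte315ToFloat(encoded);
        System.out.println(decoded);   // prints a nearby representable value, not 0.89
    }
}
```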

 
