Lucene relevance scoring formula

score_d = sum_t( tf_q * idf_t / norm_q  *  tf_d * idf_t / norm_d_t  *  boost_t ) * coord_q_d

Note:

score_d: the score of document d

sum_t: sum over all terms t in the query

tf_q: the square root of the number of times term t appears in the query string q

tf_d: the square root of the number of times term t appears in document d

numDocs: the total number of documents in the index (i.e. the documents that could receive a non-zero score)

docFreq_t: the number of documents containing term t

idf_t: log( numDocs / (docFreq_t + 1) ) + 1.0

norm_q: sqrt( sum_t( (tf_q * idf_t)^2 ) )

norm_d_t: the square root of the number of tokens in document d in the same field as term t

boost_t: the boost factor of term t, usually 1.0

coord_q_d: the number of query terms found in document d, divided by the total number of terms in query q
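
Putting the notation above together, a minimal Java sketch of this simplified formula might look like the following. The TermStat holder and its field names are made up for illustration; in Lucene these statistics come from the index and the parsed query, not from this class:

```java
// Hypothetical per-term statistics for one matched term (not Lucene API).
class TermStat {
    int freqInQuery;   // occurrences of the term in the query string q
    int freqInDoc;     // occurrences of the term in document d
    int docFreq;       // docFreq_t: documents containing the term
    int fieldLength;   // tokens in the field of d in which the term occurs
    float boost = 1.0f;
}

class SimplifiedScorer {
    // score_d = sum_t( tf_q*idf_t/norm_q * tf_d*idf_t/norm_d_t * boost_t ) * coord_q_d
    static float score(TermStat[] matched, int numDocs, int queryTermCount) {
        // norm_q = sqrt( sum_t( (tf_q * idf_t)^2 ) )
        double sumSquares = 0;
        for (TermStat t : matched) {
            double w = Math.sqrt(t.freqInQuery) * idf(t.docFreq, numDocs);
            sumSquares += w * w;
        }
        double normQ = Math.sqrt(sumSquares);

        double score = 0;
        for (TermStat t : matched) {
            double idf = idf(t.docFreq, numDocs);
            double queryWeight = Math.sqrt(t.freqInQuery) * idf / normQ;
            double docWeight = Math.sqrt(t.freqInDoc) * idf / Math.sqrt(t.fieldLength);
            score += queryWeight * docWeight * t.boost;
        }
        double coord = (double) matched.length / queryTermCount;   // coord_q_d
        return (float) (score * coord);
    }

    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;   // idf_t
    }
}
```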

The official explanation is as follows:
score(q,d) = coord(q,d) · queryNorm(q) · Σ_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )
Where

tf(t in d) correlates to the term's frequency, defined as the number of times term t appears in the currently scored document d. Documents that have more occurrences of a given term receive a higher score. The default computation of tf(t in d) in DefaultSimilarity is:

tf(t in d) = sqrt(frequency)
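
As a rough standalone sketch (not the Lucene class itself), the default tf amounts to:

```java
class TfSketch {
    // Default tf described above: square root of the term's in-document frequency.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }
    // tf(1) = 1.0, tf(4) = 2.0, tf(9) = 3.0: extra occurrences help, with diminishing returns.
}
```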

idf(t) stands for inverse document frequency. This value correlates to the inverse of docFreq (the number of documents in which the term t appears). This means rarer terms give a higher contribution to the total score. The default computation of idf(t) in DefaultSimilarity is:

idf(t) = 1 + log( numDocs / (docFreq + 1) )
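
In plain Java, the same computation would look roughly like this:

```java
class IdfSketch {
    // Default idf described above: rarer terms (smaller docFreq) get a larger weight.
    static float idf(long docFreq, long numDocs) {
        return (float) (1.0 + Math.log((double) numDocs / (docFreq + 1)));
    }
    // For an index of 1,000 documents: idf(5, 1000) ≈ 6.1 while idf(500, 1000) ≈ 1.7.
}
```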

coord(q,d) is a score factor based on how many of the query terms are found in the specified document. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms. This is a search-time factor computed by the Similarity in effect at search time.
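
A minimal sketch of this factor, assuming overlap is the number of query terms that match the document and maxOverlap is the total number of query terms:

```java
class CoordSketch {
    // coord(q,d): reward documents that match more of the query's terms.
    static float coord(int overlap, int maxOverlap) {
        return (float) overlap / maxOverlap;
    }
    // A document matching 2 terms of a 3-term query gets coord = 2/3 ≈ 0.67.
}
```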

queryNorm(q) is a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable. This is a search-time factor computed by the Similarity in effect at search time. The default computation in DefaultSimilarity is:

queryNorm(q) = queryNorm(sumOfSquaredWeights) = 1 / sqrt(sumOfSquaredWeights)
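
As a sketch of that default computation:

```java
class QueryNormSketch {
    // Default queryNorm described above: does not change ranking, only the score scale.
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }
}
```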

The sum of squared weights (of the query terms) is computed by the query's Weight object. For example, a BooleanQuery computes this value as:

sumOfSquaredWeights = q.getBoost()² · Σ_{t in q} ( idf(t) · t.getBoost() )²
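
A sketch of that sum, assuming the per-term idf values and boosts have already been gathered into parallel arrays:

```java
class SumOfSquaredWeightsSketch {
    // sumOfSquaredWeights = q.getBoost()^2 * Σ_t ( idf(t) * t.getBoost() )^2
    static float sumOfSquaredWeights(float queryBoost, float[] idf, float[] termBoost) {
        float sum = 0f;
        for (int i = 0; i < idf.length; i++) {
            float w = idf[i] * termBoost[i];
            sum += w * w;
        }
        return queryBoost * queryBoost * sum;
    }
}
```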

t.getBoost() is a search-time boost of term t in the query q, as specified in the query text (see query syntax), or as set by application calls to setBoost(). Notice that there is really no direct API for accessing the boost of one term in a multi-term query; rather, multi-term queries are represented in a query as multiple TermQuery objects, so the boost of a term in the query is accessible by calling its sub-query's getBoost().
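
For example, in older Lucene releases (before Query.setBoost() was removed around 7.0) a single term clause can be boosted like this; the field and term here are made up for illustration:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;

public class BoostExample {
    public static void main(String[] args) {
        // Assumes a pre-7.0 Lucene API where Query.setBoost() still exists;
        // in query syntax the equivalent is writing the term as  title:lucene^2
        TermQuery clause = new TermQuery(new Term("title", "lucene"));
        clause.setBoost(2.0f);   // this term contributes twice its normal weight
        System.out.println(clause.getBoost());
    }
}
```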

norm(t,d) encapsulates a few (indexing time) boost and length factors:

Document boost - set by calling doc.setBoost() before adding the document to the index.

Field boost - set by calling field.setBoost() before adding the field to a document.

lengthNorm(field) - computed when the document is added to the index, in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score. lengthNorm is computed by the Similarity class in effect at indexing time.

When a document is added to the index, all the above factors are multiplied. If the document has multiple fields with the same name, all their boosts are multiplied together:

norm(t,d) = doc.getBoost() · lengthNorm(field) · Π f.getBoost()   (product over every field f in d named as t)
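
A sketch of how these index-time factors combine, assuming the classic default lengthNorm of 1 / sqrt(number of tokens in the field):

```java
class NormSketch {
    // norm(t,d) = doc.getBoost() * lengthNorm(field) * product of same-named field boosts
    static float norm(float docBoost, float[] fieldBoosts, int fieldTokenCount) {
        float boost = docBoost;
        for (float b : fieldBoosts) {      // boosts of all fields in d named as t multiply together
            boost *= b;
        }
        float lengthNorm = (float) (1.0 / Math.sqrt(fieldTokenCount));   // shorter fields score higher
        return boost * lengthNorm;
    }
}
```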

However, the resulting norm value is encoded as a single byte before being stored. At search time, the norm byte value is read from the index directory and decoded back into a float norm value. This encoding/decoding, while reducing index size, comes with the price of precision loss: it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75. Also notice that search time is too late to modify this norm part of the scoring, e.g. by using a different Similarity at search time.
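
In older Lucene releases (roughly 4.x to 6.x) this one-byte encoding is exposed through the SmallFloat utility, so the precision loss can be observed directly; the exact decoded value depends on the release:

```java
import org.apache.lucene.util.SmallFloat;

public class NormPrecision {
    public static void main(String[] args) {
        // Encode a norm of 0.89 into a single byte, then decode it again.
        byte encoded = SmallFloat.floatToByte315(0.89f);
        float decoded = SmallFloat.byte315ToFloat(encoded);
        System.out.println(decoded);   // prints a nearby representable value, not 0.89
    }
}
```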

 
