Relevance scoring in Lucene

Source: Internet
Author: User
Tags modulus idf

Official documentation: http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

term: not a bare key. A term is a field-key pair, that is, a key within a specified field.

Factors that affect scoring

coord: the number of query terms the document matches (the number of distinct terms, not the number of occurrences).
term.tf: the frequency of the term in the corresponding field.
term.idf: the inverse document frequency of the term; it falls as the number of documents containing the term grows.
query.boost: the weight of the query (set at search time).
term.boost: the weight of the term within the query (set at search time).
doc.boost: the weight of the doc (fixed when the doc is written to the index).
field.boost: the weight of the field the term belongs to (fixed when the doc is written to the index).
field.norm: the length of the field the term belongs to (the score is inversely related to the length).

Lucene's scoring formula

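$$\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \Big( \mathrm{tf}(t \text{ in } d) \cdot \mathrm{idf}(t)^{2} \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \Big)$$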
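To make the shape of the formula concrete, here is a minimal Java sketch that evaluates it from per-term factors that have already been computed. The class name ScoreFormulaSketch, the TermFactors holder and the sample numbers are hypothetical and exist only for illustration; this is not Lucene's own scorer.

```java
// A minimal sketch of the scoring formula. All names here are hypothetical.
final class ScoreFormulaSketch {

    /** The four per-term factors for one query term matched in one document. */
    static final class TermFactors {
        final float tf;    // tf(t in d)
        final float idf;   // idf(t)
        final float boost; // t.getBoost()
        final float norm;  // norm(t, d)

        TermFactors(float tf, float idf, float boost, float norm) {
            this.tf = tf;
            this.idf = idf;
            this.boost = boost;
            this.norm = norm;
        }
    }

    /** coord * queryNorm * sum over query terms of (tf * idf^2 * boost * norm). */
    static float score(float coord, float queryNorm, TermFactors[] terms) {
        float sum = 0f;
        for (TermFactors t : terms) {
            sum += t.tf * t.idf * t.idf * t.boost * t.norm;
        }
        return coord * queryNorm * sum;
    }

    public static void main(String[] args) {
        TermFactors[] terms = {
            new TermFactors(2f, 1.5f, 1f, 0.5f), // contributes 2 * 1.5^2 * 1 * 0.5 = 2.25
            new TermFactors(1f, 2f, 1f, 0.5f)    // contributes 1 * 2^2 * 1 * 0.5 = 2.0
        };
        System.out.println(score(1f, 0.4f, terms)); // about 0.4 * 4.25 = 1.7
    }
}
```

Note that idf enters squared while every other factor enters once; the derivation that follows explains where the square comes from.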

This formula is obtained by simplifying the basic vector space model (cosine similarity) formula and then adding the various boost factors.

First, here is how relevance maps onto vectors in a space.

1 Each document is made up of a number of distinct terms. If each term stands for an independent piece of meaning, then in a multidimensional space each term is a direction vector.

2 The overall meaning of a document is also a vector. Its direction and modulus are determined by all the term vectors in the document; in other words, the overall vector is the sum of all the term vectors.

3 If the modulus of one term vector is much larger than that of the others, it pulls the direction of the total vector toward itself (for example, with v1 + v2 = v3, if the modulus of v1 is much larger than that of v2, the direction of v3 approaches the direction of v1). So the modulus of each term vector affects the direction of the total vector, that is, the overall meaning the document expresses. If the modulus of a term vector is given by the TF and IDF of that term, this establishes a mathematical relationship between each term's TF and IDF and the direction of the document vector.

4 The resulting model is that the TF and IDF of the terms in a document jointly determine the document's overall meaning.

The vector of a doc can be expressed as

Vd = (d_t1_tf * d_t1_idf, d_t2_tf * d_t2_idf, d_t3_tf * d_t3_idf, ..., d_tn_tf * d_tn_idf)

d_tn_tf is the TF of the nth term in the doc

d_tn_idf is the IDF of the nth term in the doc

A query can also be thought of as a short doc, so the vector of a query can be expressed as

Vq = (q_t1_tf * q_t1_idf, q_t2_tf * q_t2_idf, q_t3_tf * q_t3_idf, ..., q_tn_tf * q_tn_idf)

q_tn_tf is the TF of the nth term in the query

q_tn_idf is the IDF of the nth term in the query
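The two vectors can be built and compared in a few lines. The sketch below is a plain-Java illustration under stated assumptions: the class TfIdfCosineSketch, its method names and the sample TF and IDF values are all made up for the example, and nothing here calls Lucene.

```java
// A minimal sketch: build tf-idf vectors for a doc and a query, then take the cosine.
// All names and sample values are hypothetical.
final class TfIdfCosineSketch {

    /** Builds (t1_tf * t1_idf, ..., tn_tf * tn_idf). */
    static double[] tfIdfVector(double[] tf, double[] idf) {
        double[] v = new double[tf.length];
        for (int i = 0; i < tf.length; i++) {
            v[i] = tf[i] * idf[i];
        }
        return v;
    }

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Modulus (Euclidean length) of a vector. */
    static double length(double[] v) {
        return Math.sqrt(dot(v, v));
    }

    /** Cosine of the angle between two non-zero vectors. */
    static double cosine(double[] a, double[] b) {
        return dot(a, b) / (length(a) * length(b));
    }

    public static void main(String[] args) {
        double[] idf = {1.0, 2.0, 3.0};     // one idf per term, shared by doc and query
        double[] docTf = {3.0, 1.0, 0.0};   // d_t1_tf, d_t2_tf, d_t3_tf
        double[] queryTf = {1.0, 1.0, 0.0}; // q_t1_tf, q_t2_tf, q_t3_tf

        double[] vd = tfIdfVector(docTf, idf);   // (3, 2, 0)
        double[] vq = tfIdfVector(queryTf, idf); // (1, 2, 0)
        System.out.println(cosine(vq, vd));
    }
}
```

With these sample values the dot product is 3 * 1 + 2 * 2 = 7 and the two lengths are sqrt(13) and sqrt(5), so the cosine is 7 / sqrt(65), roughly 0.868.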

From points 1 to 3 above, Vd and Vq represent the overall meaning of the doc and of the query respectively. The smaller the angle between the two vectors, the closer their directions and the more closely related the meanings they express, so the relevance is higher. We therefore use the cosine of the angle as the relevance score: the larger the cosine, the smaller the angle and the higher the relevance.

$$\cos\theta = \frac{V(q) \cdot V(d)}{|V(q)| \, |V(d)|}$$

The next step is to convert this formula into Lucene's standard form.

Numerator:

Vq · Vd = (d_t1_tf * d_t1_idf, ..., d_tn_tf * d_tn_idf) · (q_t1_tf * q_t1_idf, ..., q_tn_tf * q_tn_idf)
= d_t1_tf * d_t1_idf * q_t1_tf * q_t1_idf + ... + d_tn_tf * d_tn_idf * q_tn_tf * q_tn_idf

1. Because a query is a lookup statement, in most cases each term appears only once in the query, so q_tn_tf = 1.

2. d_tn_idf and q_tn_idf both express the importance of the term across the whole document collection, so the two values are the same; call it tn_idf.

The numerator therefore simplifies to

Vq · Vd = d_t1_tf * t1_idf * t1_idf + ... + d_tn_tf * tn_idf * tn_idf

That is,

$$\sum_{1 \to n} \left( \mathit{d\_tn\_tf} \cdot \mathit{tn\_idf}^{2} \right)$$
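The simplification of the numerator can be checked numerically. In the sketch below (class name, method names and sample values are hypothetical), fullDot computes the dot product term by term, and simplified computes the sum of d_tn_tf * tn_idf^2. The arrays hold only the terms that occur in the query; when every query TF is 1 and doc and query share one IDF per term, the two agree.

```java
// A minimal sketch comparing the full dot product with its simplified form.
// All names and sample values are hypothetical.
final class NumeratorSketch {

    /** Sum of (d_tf * idf) * (q_tf * idf) over the query terms. */
    static double fullDot(double[] docTf, double[] queryTf, double[] idf) {
        double sum = 0.0;
        for (int i = 0; i < idf.length; i++) {
            sum += (docTf[i] * idf[i]) * (queryTf[i] * idf[i]);
        }
        return sum;
    }

    /** Sum of d_tf * idf^2 over the query terms, assuming every query TF is 1. */
    static double simplified(double[] docTf, double[] idf) {
        double sum = 0.0;
        for (int i = 0; i < idf.length; i++) {
            sum += docTf[i] * idf[i] * idf[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] docTf = {3.0, 1.0, 2.0};
        double[] queryTf = {1.0, 1.0, 1.0}; // each term appears once in the query
        double[] idf = {1.5, 2.0, 0.5};

        System.out.println(fullDot(docTf, queryTf, idf)); // 6.75 + 4.0 + 0.5 = 11.25
        System.out.println(simplified(docTf, idf));       // the same value
    }
}
```

This is where the squared idf in Lucene's formula comes from: one factor of idf from the doc vector and one from the query vector.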

And then the denominator.

$$|V(q)| = \left( (\mathit{q\_t1\_tf} \cdot \mathit{q\_t1\_idf})^{2} + \dots + (\mathit{q\_tn\_tf} \cdot \mathit{q\_tn\_idf})^{2} \right)^{1/2}$$

Because q_tn_tf can be taken as 1, as noted above,

$$|V(q)| = \left( \mathit{q\_t1\_idf}^{2} + \dots + \mathit{q\_tn\_idf}^{2} \right)^{1/2}$$

That is,

$$\left( \sum_{1 \to n} \mathit{q\_tn\_idf}^{2} \right)^{1/2}$$
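As a small illustration, the query length can be computed directly from the IDF values, and dividing by it is the same as multiplying by 1 / sqrt(sum of squares), which is the shape of the queryNorm function listed at the end of this article. The sketch assumes every boost is 1, so that each term's weight is just its idf; the class name and the sample values are hypothetical.

```java
// A minimal sketch of |V(q)| and its reciprocal, assuming all boosts are 1.
// All names and sample values are hypothetical.
final class QueryLengthSketch {

    /** |V(q)| = sqrt(q_t1_idf^2 + ... + q_tn_idf^2), with every q_tn_tf taken as 1. */
    static double queryLength(double[] idf) {
        double sumOfSquares = 0.0;
        for (double v : idf) {
            sumOfSquares += v * v;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Same shape as queryNorm: 1 / sqrt(sumOfSquaredWeights). */
    static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    public static void main(String[] args) {
        double[] idf = {3.0, 4.0};
        System.out.println(queryLength(idf)); // sqrt(9 + 16) = 5.0
        System.out.println(queryNorm(25.0));  // 1 / 5 = 0.2
    }
}
```

Because |V(q)| is the same for every document, it does not change the ranking for a given query; it only rescales the scores.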

|V(d)| is the total length (modulus) of the doc vector.

Because Vq · Vd = d_t1_tf * t1_idf * t1_idf + ... + d_tn_tf * tn_idf * tn_idf, setting |V(q)| aside for the moment gives

$$\frac{V(q) \cdot V(d)}{|V(d)|} = \left( \mathit{d\_t1\_tf} \cdot \mathit{t1\_idf}^{2} + \dots + \mathit{d\_tn\_tf} \cdot \mathit{tn\_idf}^{2} \right) \cdot \frac{1}{|V(d)|} = \frac{\mathit{d\_t1\_tf} \cdot \mathit{t1\_idf}^{2}}{|V(d)|} + \dots + \frac{\mathit{d\_tn\_tf} \cdot \mathit{tn\_idf}^{2}}{|V(d)|}$$

In this expression, the numerator of each summand is the relevance score of one term in its field, combining the term's frequency in that field (TF) with its importance across the whole collection (IDF). Every summand is divided by the same |V(d)| (the total length of the doc), and this shared divisor makes the relevance values less accurate.

Document lengths in an index vary, and for any term the TF tends to be larger in a long document, so long documents score higher, which is unfair to short ones. To take an extreme example: in a 10-million-word work, "Lucene" appears 11 times; in a short document of 12 words, "Lucene" appears 10 times. If length is not taken into account, the long work gets the higher score, yet the short document is clearly the one that is really about "Lucene".

So the divisor should not be |V(d)| (the total length of the doc). It should instead be based on the total number of terms in the field that each numerator's term belongs to, i.e. "num of terms in field f".

This is lengthNorm in Lucene. Its value is 1.0 / Math.sqrt("num of terms in field f").
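The extreme example above can be worked through with this lengthNorm. The sketch below uses the raw term frequency, as the derivation does at this point, and multiplies it by 1 / sqrt(number of terms in the field). The class and method names are hypothetical.

```java
// A minimal sketch of length normalization applied to the example in the text.
// All names are hypothetical.
final class LengthNormSketch {

    /** lengthNorm = 1 / sqrt(number of terms in the field). */
    static double lengthNorm(long numTermsInField) {
        return 1.0 / Math.sqrt(numTermsInField);
    }

    /** Raw term frequency scaled by the length norm of its field. */
    static double normalized(double termFreq, long numTermsInField) {
        return termFreq * lengthNorm(numTermsInField);
    }

    public static void main(String[] args) {
        // "Lucene" appears 11 times in a 10-million-word work
        double longDoc = normalized(11.0, 10_000_000L);
        // "Lucene" appears 10 times in a 12-word document
        double shortDoc = normalized(10.0, 12L);

        System.out.println("long document:  raw tf = 11, normalized = " + longDoc);
        System.out.println("short document: raw tf = 10, normalized = " + shortDoc);
    }
}
```

On raw frequency the long work wins (11 against 10). After normalization the long work drops to roughly 0.0035 while the short document stays near 2.89, so the short document ranks first.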

So the formula above becomes

$$\sum_{1 \to n} \frac{\mathit{d\_tn\_tf} \cdot \mathit{tn\_idf}^{2}}{\left( \text{num of terms in field f} \right)^{1/2}}$$

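Adding back the denominator |V(q)|:

$$\frac{1}{|V(q)|} \sum_{1 \to n} \frac{\mathit{d\_tn\_tf} \cdot \mathit{tn\_idf}^{2}}{\left( \text{num of terms in field f} \right)^{1/2}}$$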

This is the prototype formula: the pure vector formula with Lucene's modifications to some of its factors. Multiplying in the various boosts gives the general formula in the documentation.
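The correspondence between the pieces of the prototype and the factors of the documented formula can be laid out side by side. This is a reading of the derivation in this article rather than a statement from the Lucene documentation; note also that the default tf listed at the end of this article is the square root of the raw frequency, not the raw frequency itself.

```latex
\begin{aligned}
\frac{1}{|V(q)|} \;&\longleftrightarrow\; \mathrm{queryNorm}(q) \\
\mathit{d\_tn\_tf} \;&\longleftrightarrow\; \mathrm{tf}(t \text{ in } d) \\
\mathit{tn\_idf}^{2} \;&\longleftrightarrow\; \mathrm{idf}(t)^{2} \\
\frac{1}{\left( \text{num of terms in field f} \right)^{1/2}} \;&\longleftrightarrow\; \text{lengthNorm, folded into } \mathrm{norm}(t,d) \\
\text{boosts} \;&\longleftrightarrow\; t.\mathrm{getBoost}() \text{ and the boost inside } \mathrm{norm}(t,d)
\end{aligned}
```

The only factor with no counterpart in the prototype is coord(q,d), which is multiplied in on top of the vector formula.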

To sum up, Lucene starts from the vector formula and simplifies or modifies several of its factors:

1 q_tn_tf (the frequency of a term in the query) is taken to be 1.

2 d_tn_idf and q_tn_idf both express the importance of the term across the whole collection, so the two values are treated as the same.

3 Under the vector formula, the relevance score of each term in its field is divided by |V(d)| (the doc length). This does not match actual relevance (for the reasons explained above), so |V(d)| is removed and replaced by lengthNorm (based on the total number of terms in the field that each numerator's term belongs to).

Concrete implementation of the factors in the formula

// Fraction of the query terms that the document matched.
public float coord(int overlap, int maxOverlap) {
  return overlap / (float) maxOverlap;
}

/** Implemented as <code>1/sqrt(sumOfSquaredWeights)</code>. */
@Override
public float queryNorm(float sumOfSquaredWeights) {
  return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
}

// Field boost times 1/sqrt(number of terms in the field).
public float lengthNorm(FieldInvertState state) {
  final int numTerms;
  if (discountOverlaps)
    numTerms = state.getLength() - state.getNumOverlap();
  else
    numTerms = state.getLength();
  return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));
}

// Square root of the raw term frequency.
public float tf(float freq) {
  return (float) Math.sqrt(freq);
}

/** The default implementation returns <code>1</code> */
@Override
public float scorePayload(int doc, int start, int end, BytesRef payload) {
  return 1;
}

/** Implemented as <code>log(numDocs/(docFreq+1)) + 1</code>. */
@Override
public float idf(long docFreq, long numDocs) {
  return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
}
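To see these factors work together, the sketch below copies the arithmetic of the functions above into a standalone class and scores one single-field doc against a query. It is a simplified illustration, not Lucene's scorer: the class name, the score method and its parameters are hypothetical, every boost is assumed to be 1, the query is assumed to be non-empty, and the norm is used at full precision without any index-time encoding.

```java
// A standalone sketch that reuses the arithmetic of the functions listed above.
// The class, the score method and its parameters are hypothetical; all boosts are 1.
final class DefaultFactorsSketch {

    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /**
     * freqs[i]    frequency of query term i in the doc's field (0 if absent)
     * docFreqs[i] number of docs containing query term i
     */
    static float score(int[] freqs, long[] docFreqs, long numDocs, int fieldLength) {
        int overlap = 0;
        float sumOfSquaredWeights = 0f;
        float sum = 0f;
        float norm = lengthNorm(fieldLength);
        for (int i = 0; i < freqs.length; i++) {
            float termIdf = idf(docFreqs[i], numDocs);
            sumOfSquaredWeights += termIdf * termIdf; // query weight of a term is its idf
            if (freqs[i] > 0) {
                overlap++;
                sum += tf(freqs[i]) * termIdf * termIdf * norm;
            }
        }
        return coord(overlap, freqs.length) * queryNorm(sumOfSquaredWeights) * sum;
    }

    public static void main(String[] args) {
        int[] freqs = {3, 0};        // first query term appears 3 times, second is absent
        long[] docFreqs = {10L, 200L};
        System.out.println(score(freqs, docFreqs, 1000L, 12));
        System.out.println(score(freqs, docFreqs, 1000L, 1200)); // same matches, longer field
    }
}
```

Running main should print a higher score for the 12-term field than for the 1200-term field, since only the length norm differs between the two calls.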

 
