Conclusion 6: mathematical derivation of Lucene scoring formula

Source: Internet
Author: User
Tags: idf

Before walking through Lucene's search process, we need to take the Lucene scoring formula apart and describe the meaning of each part, because a very important step in Lucene's search process is computing each part of the score step by step.

Lucene's scoring formula is very complex, as follows:
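As documented in Lucene's Similarity class, the practical scoring formula is:

```latex
\mathrm{score}(q,d) = \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot
\sum_{t\,\in\,q}\Big(\mathrm{tf}(t\ \mathrm{in}\ d)\cdot\mathrm{idf}(t)^{2}\cdot
t.\mathrm{getBoost}()\cdot\mathrm{norm}(t,d)\Big)
```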

Before derivation, we will introduce the meaning of each part one by one:

  • t: Term. A Term here carries field information; that is, title:hello and content:hello are different Terms.
  • coord(q,d): a query may contain several terms, and a document may contain several of them. This factor rewards documents that match more of the query terms: the more query terms a document contains, the higher its score.
  • queryNorm(q): a normalization factor computed from the sum of squared weights of the query terms. It does not affect ranking within a query; it only makes scores comparable across different queries. The formula is as follows:

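In Lucene's default implementation, queryNorm is computed as:

```latex
\mathrm{queryNorm}(q) = \frac{1}{\sqrt{\mathrm{sumOfSquaredWeights}}},\qquad
\mathrm{sumOfSquaredWeights} = q.\mathrm{getBoost}()^{2}\cdot
\sum_{t\,\in\,q}\big(\mathrm{idf}(t)\cdot t.\mathrm{getBoost}()\big)^{2}
```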
  • tf(t in d): the frequency of term t in document d.
  • idf(t): the inverse document frequency, based on how many documents term t appears in.
  • norm(t,d): a normalization factor that combines three parameters:
    • Document boost: the larger the value, the more important the document is.
    • Field boost: the larger the value, the more important the field is.
    • lengthNorm(field) = 1.0 / Math.sqrt(numTerms): the more terms a field contains (that is, the longer the document), the smaller this value; the shorter the document, the larger the value.

  • Various boost values:
    • t.getBoost(): the weight of each term in the query. You can make a term more important in a query, e.g. common^4 hello.
    • d.getBoost(): the document boost, written to the .nrm file during indexing; it marks some documents as more important than others.
    • f.getBoost(): the field boost, written to the .nrm file during indexing; it marks some fields as more important than others.
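At index time these factors are folded into a single norm value; in Lucene's DefaultSimilarity this is:

```latex
\mathrm{norm}(t,d) = d.\mathrm{getBoost}()\cdot\mathrm{lengthNorm}(\mathrm{field})\cdot
\prod_{\mathrm{field}\ f\ \mathrm{in}\ d} f.\mathrm{getBoost}()
```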

The parts above are described in detail in Lucene's documentation and have been elaborated in many articles. For how to adjust them in order to influence a document's score, see "Questions about Lucene (4): four ways to affect Lucene's scoring."

But why must these parts be combined in exactly this way? How do we arrive at such a complicated formula? Let's derive it.

First, substituting the parts above into score(q,d) yields a very complex formula. Let's ignore all the boosts, since they are manual adjustments, and omit coord as well, since it is irrelevant to the principle the formula expresses. This gives the following formula:
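With boosts set to one and coord dropped, and with queryNorm and lengthNorm written out explicitly, the formula reduces to:

```latex
\mathrm{score}(q,d) = \frac{\sum_{t\,\in\,q}\mathrm{tf}(t,d)\cdot\mathrm{idf}(t)^{2}}
{\sqrt{\sum_{t\,\in\,q}\mathrm{idf}(t)^{2}}\;\cdot\;\sqrt{\mathrm{numTerms}(d)}}
```

The rest of this article derives exactly this expression from the vector space model.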

Next, recall the description from the Lucene learning summary: Lucene's scoring mechanism uses the vector space model.

We regard a document as a sequence of words (Terms), where each word (Term) has a weight (term weight); different Terms affect the document's relevance score according to their weights.

So we regard all the term weights in this document as a vector.

Document = {term1, term2, ..., termN}

Document Vector = {weight1, weight2, ..., weightN}

Similarly, we regard the query as a short document and represent it as a vector too:

Query = {term1, term2, ..., termN}

Query Vector = {weight1, weight2, ..., weightN}

We place all the searched document vectors and the query vector in an n-dimensional space, where each term is one dimension.

The smaller the angle between two vectors, the greater the correlation.

Therefore we use the cosine of the angle as the relevance score: the smaller the angle, the larger the cosine, the higher the score, and the greater the relevance.

The cosine formula is as follows:
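Writing Vq for the query vector and Vd for the document vector (both defined below), the standard cosine is:

```latex
\cos\theta = \frac{V_q\cdot V_d}{|V_q|\,|V_d|}
= \frac{\sum_{i=1}^{n} w(t_i,q)\cdot w(t_i,d)}
{\sqrt{\sum_{i=1}^{n} w(t_i,q)^{2}}\;\sqrt{\sum_{i=1}^{n} w(t_i,d)^{2}}}
```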


Let's assume that:

The query vector is Vq = <w(t1,q), w(t2,q), ..., w(tn,q)>

The document vector is Vd = <w(t1,d), w(t2,d), ..., w(tn,d)>

The dimension of the space is n, the size of the union of the terms in the query and in the document. When a term does not appear in the query, w(t,q) is zero; when a term does not appear in the document, w(t,d) is zero.

w stands for weight, and it is generally computed as tf * idf.

First we compute the numerator of the cosine formula, the dot product of the two vectors:

Vq · Vd = w(t1,q) * w(t1,d) + w(t2,q) * w(t2,d) + ... + w(tn,q) * w(tn,d)

Substituting the formula for w:

Vq · Vd = tf(t1,q) * idf(t1,q) * tf(t1,d) * idf(t1,d) + tf(t2,q) * idf(t2,q) * tf(t2,d) * idf(t2,d) + ... + tf(tn,q) * idf(tn,q) * tf(tn,d) * idf(tn,d)

There are three points to note:

  • Because this is a dot product, a term ti contributes a non-zero product only when it appears in both the query and the document; for a term that appears only in the query or only in the document, one of the two factors is zero.
  • Users rarely repeat the same word in a query, so we assume tf(t,q) = 1.
  • idf is based on the number of documents in which a term appears, and the query can itself be counted as one small document. So idf(t,q) and idf(t,d) differ at most by this one extra document; when the index contains enough documents, the tiny query document can be ignored. Therefore, assume idf(t,q) = idf(t,d) = idf(t).

Based on these three points, the dot product becomes:

Vq · Vd = tf(t1,d) * idf(t1) * idf(t1) + tf(t2,d) * idf(t2) * idf(t2) + ... + tf(tn,d) * idf(tn) * idf(tn)

So the cosine formula becomes:
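Substituting the dot product into the numerator:

```latex
\mathrm{score}(q,d) = \frac{\sum_{t\,\in\,q}\mathrm{tf}(t,d)\cdot\mathrm{idf}(t)^{2}}{|V_q|\,|V_d|}
```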

Next comes the length of the query vector, the first factor in the denominator.

As discussed above, tf in the query is 1 and idf ignores the query, which gives the following formula:
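With tf(t,q) = 1, the query vector length is:

```latex
|V_q| = \sqrt{\sum_{t\,\in\,q}\mathrm{idf}(t)^{2}}
```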

So the cosine formula becomes:
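Namely:

```latex
\mathrm{score}(q,d) = \frac{\sum_{t\,\in\,q}\mathrm{tf}(t,d)\cdot\mathrm{idf}(t)^{2}}
{\sqrt{\sum_{t\,\in\,q}\mathrm{idf}(t)^{2}}\;\cdot\;|V_d|}
```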

Next comes the document length. Under the standard cosine formula, the document length would be computed as follows:
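With w = tf * idf, the standard cosine length of the document vector is:

```latex
|V_d| = \sqrt{\sum_{t\,\in\,d} w(t,d)^{2}}
= \sqrt{\sum_{t\,\in\,d}\big(\mathrm{tf}(t,d)\cdot\mathrm{idf}(t)\big)^{2}}
```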

A point worth discussing here: why divide the score by the document length at all?

Because documents in the index have different lengths, and for any given term, tf in a long document tends to be much larger, long documents would score higher, which is unfair to short documents. As an extreme example, suppose the word "lucene" appears 11 times in a 10-million-word article and 10 times in a 12-word snippet. Without length normalization the long article would score higher, yet the short document is clearly the one that really focuses on lucene.

However, eliminating the effect of document length entirely, as the standard cosine formula does, is unfair to long documents (after all, they contain more information) and tends to return short documents first, which makes search results look bad in practical applications.

Therefore Lucene exposes lengthNorm through the Similarity interface, so you can override the lengthNorm formula for your application's needs. For example, suppose I am building a search system for economics papers and find after some research that most of them are between 8,000 and 10,000 words long. Then lengthNorm could follow an inverted-parabola model in which documents of 8,000 to 10,000 words score highest and shorter or longer documents score lower, so that lengthNorm returns the most relevant results to users.

By default Lucene uses DefaultSimilarity, which holds that when computing a document's vector length, each term's weight should not be taken into account; instead every weight is treated as one.

From the definition of Term we know that a Term carries field information; that is, title:hello and content:hello are different Terms. In other words, a given Term can appear in only one field of a document.

Therefore, the formula for document length is:
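Since every weight is taken as one, the length reduces to the square root of the number of terms in the field:

```latex
|V_d| = \sqrt{\sum_{t\,\in\,d} 1^{2}} = \sqrt{\mathrm{numTerms}(d)}
```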

Cosine formula:
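Putting the numerator and both vector lengths together:

```latex
\mathrm{score}(q,d) = \frac{\sum_{t\,\in\,q}\mathrm{tf}(t,d)\cdot\mathrm{idf}(t)^{2}}
{\sqrt{\sum_{t\,\in\,q}\mathrm{idf}(t)^{2}}\;\cdot\;\sqrt{\mathrm{numTerms}(d)}}
```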

Adding back the various boosts and coord, we arrive at Lucene's scoring formula.
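To make the derivation concrete, here is a minimal Python sketch of the simplified score, with boosts and coord omitted. This is an illustration, not Lucene's actual code: documents are plain lists of terms, and tf/idf follow DefaultSimilarity's defaults (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1))).

```python
import math

def score(query_terms, doc_terms, num_docs, doc_freq):
    """Simplified Lucene score: the cosine derived above, boosts and coord omitted.

    query_terms: list of query terms (tf in the query assumed to be 1)
    doc_terms:   list of terms in the document's field
    num_docs:    total number of documents in the index
    doc_freq:    dict mapping term -> number of documents containing it
    """
    def idf(t):
        # DefaultSimilarity-style idf: 1 + ln(numDocs / (docFreq + 1))
        return 1.0 + math.log(num_docs / (doc_freq.get(t, 0) + 1))

    def tf(t):
        # DefaultSimilarity-style tf: sqrt(raw frequency in the document)
        return math.sqrt(doc_terms.count(t))

    # Numerator: Vq . Vd with w(t,q) = idf(t) and w(t,d) = tf(t,d) * idf(t),
    # so each matching term contributes tf * idf^2.
    dot = sum(tf(t) * idf(t) ** 2 for t in set(query_terms))
    # 1 / |Vq|: queryNorm, the inverse length of the query vector.
    query_norm = 1.0 / math.sqrt(sum(idf(t) ** 2 for t in set(query_terms)))
    # 1 / |Vd|: lengthNorm, with every term weight taken as one.
    length_norm = 1.0 / math.sqrt(len(doc_terms))
    return dot * query_norm * length_norm
```

A document that contains none of the query terms gets a zero dot product and thus a zero score, matching the intersection argument made earlier in the derivation.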
