[Elasticsearch] Controlling Relevance (II): Lucene's Practical Scoring Function and Query-Time Boosting

Source: Internet
Author: User

This chapter is translated from the Controlling Relevance chapter of the official Elasticsearch guide.


Lucene's Practical Scoring Function

For multiterm queries, Lucene combines the Boolean model, TF/IDF, and the vector space model into a single mechanism that collects matching documents and scores them as it goes.

A multiterm query such as the following:

GET /my_index/doc/_search
{
  "query": {
    "match": {
      "text": "quick fox"
    }
  }
}

is rewritten internally as follows:

GET /my_index/doc/_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "text": "quick" }},
        { "term": { "text": "fox" }}
      ]
    }
  }
}

The bool query implements the Boolean model. In this example, it returns only documents that contain the term quick, the term fox, or both.
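The matching step of the Boolean model can be illustrated with a minimal sketch. The function name `matching_docs`, the sample documents, and the whitespace tokenization are all assumptions made for illustration; Lucene itself consults an inverted index rather than scanning text.

```python
def matching_docs(docs, should_terms):
    """Return the ids of documents that contain at least one should term.

    Tokenization here is a naive lowercase whitespace split, which is an
    assumption for illustration and not what an Elasticsearch analyzer does.
    """
    wanted = set(should_terms)
    return [doc_id for doc_id, text in docs.items()
            if wanted & set(text.lower().split())]


# Hypothetical sample documents.
docs = {
    1: "The quick brown fox",
    2: "A lazy fox",
    3: "A quick rabbit",
    4: "A sleepy dog",
}

print(matching_docs(docs, ["quick", "fox"]))
```

Document 4 contains neither term, so it is the only one left out. No scoring has happened yet at this stage.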

As soon as a document matches a query, Lucene calculates its score for that query by combining the scores of each matching term. The formula used for scoring is called the practical scoring function. It looks intimidating, but don't be put off: you already know most of its components. It introduces a few new elements, which we discuss next.

score(q,d)  =                      (1)
            queryNorm(q)           (2)
          · coord(q,d)             (3)
          · ∑ (                    (4)
                tf(t in d)         (5)
              · idf(t)²            (6)
              · t.getBoost()       (7)
              · norm(t,d)          (8)
            ) (t in q)

The numbered lines mean the following:

    1. score(q,d) is the relevance score of document d for query q.
    2. queryNorm(q) is the query normalization factor (new).
    3. coord(q,d) is the coordination factor (new).
    4. The sum of the weights for each term t in the query q for document d.
    5. tf(t in d) is the term frequency of term t in document d.
    6. idf(t) is the inverse document frequency of term t.
    7. t.getBoost() is the boost that has been applied to the query (new).
    8. norm(t,d) is the field-length norm, combined with the index-time field-level boost, if any (new).

You should already recognize score, tf, and idf. The queryNorm, coord, t.getBoost, and norm are new.

We will talk about query-time boosting later in this chapter. First, let's explain query normalization, coordination, and index-time field-level boosting.
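To make the structure of the formula concrete, here is a minimal sketch that evaluates it for hand-supplied numbers. The function `practical_score` and the dictionary layout are hypothetical. The sketch does not compute tf, idf, or norm from an index the way Lucene does; it only shows how the factors multiply and sum.

```python
def practical_score(query_terms, doc_stats, query_norm, coord):
    """Evaluate score(q,d) from per-term statistics supplied by the caller.

    doc_stats maps each term found in the document to a dict holding the
    already-computed tf, idf, boost, and norm values (illustrative layout).
    """
    total = 0.0
    for term in query_terms:
        stats = doc_stats.get(term)
        if stats is None:
            continue  # a term missing from the document contributes nothing
        total += (stats["tf"]
                  * stats["idf"] ** 2
                  * stats["boost"]
                  * stats["norm"])
    return query_norm * coord * total


# Made-up statistics, chosen so the arithmetic is easy to follow.
stats = {
    "quick": {"tf": 1.0, "idf": 2.0, "boost": 1.0, "norm": 0.5},
    "fox":   {"tf": 2.0, "idf": 1.0, "boost": 2.0, "norm": 0.5},
}

# quick: 1 * 2^2 * 1 * 0.5 = 2.0; fox: 2 * 1^2 * 2 * 0.5 = 2.0
print(practical_score(["quick", "fox"], stats, query_norm=0.5, coord=1.0))
```

With these inputs the per-term sum is 4.0, and multiplying by the query norm of 0.5 and a coord of 1.0 gives 2.0.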

Query Normalization Factor

The query normalization factor (queryNorm) is an attempt to normalize a query so that the results of one query can be compared with the results of another.

TIP

Although the purpose of the query norm is to make results from different queries comparable, it does not work very well. The only purpose of the relevance _score is to sort the results of the current query in the correct order. You should not try to compare the relevance scores of different queries.

This factor is calculated at the beginning of the query. The actual calculation depends on the queries involved, but a typical implementation is as follows:

queryNorm = 1 / √sumOfSquaredWeights

The sumOfSquaredWeights is calculated by squaring the IDF of each term in the query and adding the squares together.
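As a small worked example, the typical implementation can be sketched as follows. This assumes, as the name suggests, that sumOfSquaredWeights is the sum of the squared IDF values of the query terms; the function name `query_norm` and the IDF values are made up for illustration.

```python
import math


def query_norm(idfs):
    """Return 1 / sqrt(sumOfSquaredWeights) for the given per-term IDFs."""
    sum_of_squared_weights = sum(idf ** 2 for idf in idfs)
    return 1.0 / math.sqrt(sum_of_squared_weights)


# Two hypothetical terms with IDF 3.0 and 4.0:
# sumOfSquaredWeights = 9 + 16 = 25, so queryNorm = 1 / 5.
print(query_norm([3.0, 4.0]))
```

The value depends only on the query terms, never on a particular document.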

TIP

The same query normalization factor is applied to every document, and you have no way of changing it. For all practical purposes, it can be ignored.

Query Coordination

The coordination factor (coord) is used to reward documents that contain a higher percentage of the query terms. The more query terms that appear in the document, the greater the chance that the document is a good match for the query.

Imagine that we have a query for quick brown fox, and that the weight of each term is 1.5. Without the coordination factor, the score would simply be the sum of the weights of the terms in a document. For instance:

    • Document with fox: score 1.5
    • Document with quick fox: score 3.0
    • Document with quick brown fox: score 4.5

The coordination factor multiplies the score by the number of matching terms in the document, and divides it by the total number of terms in the query. With the coordination factor, the scores would be as follows:

    • Document with fox: score 1.5 * 1/3 = 0.5
    • Document with quick fox: score 3.0 * 2/3 = 2.0
    • Document with quick brown fox: score 4.5 * 3/3 = 4.5

As a result, the document that contains all three terms scores much higher than the document that contains only two of them.
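The arithmetic in this example can be reproduced with a short sketch. The helper names `coord` and `coordinated_score` are hypothetical, and the uniform term weight of 1.5 is the simplification used in the text, not how Lucene weights real terms.

```python
def coord(matching_terms, total_query_terms):
    """Fraction of the query terms that the document contains."""
    return matching_terms / total_query_terms


def coordinated_score(term_weight, matching_terms, total_query_terms):
    """Sum of equal term weights, scaled by the coordination factor."""
    raw_score = term_weight * matching_terms
    return raw_score * coord(matching_terms, total_query_terms)


# One, two, and three of the terms in "quick brown fox" matched.
for matched in (1, 2, 3):
    print(matched, round(coordinated_score(1.5, matched, 3), 2))
```

Because the raw score already grows with the number of matching terms, multiplying by coord makes the reward for matching more terms steeper than linear.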

Remember that the query for quick brown fox is rewritten into a bool query like this:

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "text": "quick" }},
        { "term": { "text": "brown" }},
        { "term": { "text": "fox" }}
      ]
    }
  }
}

The bool query uses query coordination by default for all should clauses, but it does allow you to disable it. Why might you want to do this? Usually, the answer is that you don't. Query coordination is usually a good thing. When you use a bool query to wrap several high-level queries such as match, it makes sense to leave coordination enabled. The more clauses that match, the higher the degree of overlap between your search request and the documents that are returned.

However, in some advanced use cases, it can make sense to disable coordination. Imagine that you are searching for the synonyms jump, leap, and hop. You don't care how many of these synonyms are present, because they all represent the same concept. In fact, only one of them is likely to be present. This is a good case for disabling the coordination factor:

GET /_search
{
  "query": {
    "bool": {
      "disable_coord": true,
      "should": [
        { "term": { "text": "jump" }},
        { "term": { "text": "hop" }},
        { "term": { "text": "leap" }}
      ]
    }
  }
}

When you use synonyms (see Synonyms), this is exactly what happens internally: the rewritten query disables coordination for the synonyms. Most use cases for disabling coordination are handled automatically; you don't need to worry about it.

Index-Time Field-Level Boosting

We will talk about boosting a field (making it more important than other fields) at query time in the section on query-time boosting. It is also possible to apply a boost to a field at index time. In fact, this boost is applied to every term in the field, rather than to the field itself.

To store this boost value in the index without using more space than necessary, the field-level index-time boost is combined with the field-length norm and stored in the index as a single byte. This is the value returned by norm(t,d) in the preceding formula.

Warning

We strongly recommend against using field-level index-time boosts, for the following reasons:

  • Combining the boost with the field-length norm in a single byte means that the field-length norm loses precision. The result is that Elasticsearch can no longer distinguish between a field containing three words and a field containing five words.
  • To change an index-time boost, you have to reindex all of your documents. A query-time boost, on the other hand, can be changed with every query.
  • If a field with an index-time boost has multiple values, the boost is multiplied by itself for every value, dramatically increasing the weight of that field.

Query-time boosting is a much simpler, cleaner, and more flexible option.

With query normalization, coordination, and index-time boosting out of the way, we can now move on to the most useful tool for influencing the relevance calculation: query-time boosting.

Query-Time Boosting

In the section on prioritizing clauses, we explained how to use the boost parameter at search time to give one query clause more weight than another. For instance:

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": { "query": "quick brown fox", "boost": 2 }}},
        { "match": { "content": "quick brown fox" }}
      ]
    }
  }
}

Query-time boosting is the main tool for tuning relevance. Any type of query accepts a boost parameter. Setting a boost of 2 doesn't simply double the final _score; the boost value that is actually applied goes through normalization and some internal optimization. However, it does imply that a clause with a boost of 2 is twice as important as a clause with a boost of 1.

In practice, there is no formula for deciding on the "correct" boost value for a particular query clause. It is a matter of trial and error. Remember that boost is just one of the factors involved in the relevance score; it has to compete with the other factors. In the preceding example, the title field probably already has a "natural" boost over the content field thanks to the field-length norm (titles are usually shorter than the related content). So don't blindly boost a field just because you think it should be boosted. Apply a boost, check the results, and then adjust it.

Boosting an Index

When searching across multiple indices, you can boost an entire index over the others with the indices_boost parameter. The following example gives more weight to documents from the more recent indices:

GET /docs_2014_*/_search
{
  "indices_boost": {
    "docs_2014_10": 3,
    "docs_2014_09": 2
  },
  "query": {
    "match": {
      "text": "quick brown fox"
    }
  }
}

This multi-index search covers all indices whose names begin with docs_2014_. Documents in the docs_2014_10 index are boosted by 3, documents in docs_2014_09 are boosted by 2, and documents in any other matching index have the default boost of 1.
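The per-index boost lookup described here can be sketched as a simple mapping with a default. The function `index_boost` is hypothetical and only models which boost value an index receives; it does not model how Elasticsearch folds that value into the final score.

```python
# Mirrors the indices_boost section of the request above.
indices_boost = {
    "docs_2014_10": 3,
    "docs_2014_09": 2,
}


def index_boost(index_name, boosts, default=1):
    """Return the boost configured for an index, or the default of 1."""
    return boosts.get(index_name, default)


for name in ("docs_2014_10", "docs_2014_09", "docs_2014_08"):
    print(name, index_boost(name, indices_boost))
```

Any index matched by the docs_2014_* pattern but not listed, such as the hypothetical docs_2014_08, falls back to the default.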

t.getBoost()

These boost values are represented in Lucene's practical scoring function by the t.getBoost() element. Boosts are not applied at the level at which they appear in the query DSL. Instead, any boost values are combined and passed down to the individual terms. The t.getBoost() method returns any boost value applied to the term itself or to any of the queries higher up the chain.
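The idea of combining boosts and passing them down to the terms can be sketched with a toy query tree. The tree layout (`boost`, `term`, `clauses` keys) and the function `term_boosts` are hypothetical, not the query DSL or a Lucene API, and the sketch assumes that boosts combine by multiplication.

```python
def term_boosts(node, inherited=1.0):
    """Walk a toy query tree and return the effective boost of each term.

    Assumption: a node's boost multiplies the boost inherited from its
    parents. If the same term occurs twice, the later occurrence wins,
    which is a simplification made for this sketch.
    """
    boost = inherited * node.get("boost", 1.0)
    if "term" in node:
        return {node["term"]: boost}
    result = {}
    for child in node.get("clauses", []):
        result.update(term_boosts(child, boost))
    return result


# A compound query boosted by 2 that wraps one boosted and one plain term.
tree = {
    "boost": 2.0,
    "clauses": [
        {"term": "quick", "boost": 3.0},
        {"term": "fox"},
    ],
}

print(term_boosts(tree))
```

In this sketch the term quick ends up with an effective boost of 6.0 and fox with 2.0, even though fox itself was never boosted.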

TIP

In fact, reading the output of the explain API is a little more complex than that. You won't see the boost value or t.getBoost() mentioned in the explanation at all. Instead, the boost is rolled into the queryNorm that is applied to a particular term. Although we said that the queryNorm is the same for every term, you will see that the queryNorm for a boosted term is higher than the queryNorm for an unboosted term.
