[Elasticsearch] Controlling Relevance (2): Lucene's Practical Scoring Function and Query-Time Boosting

Source: Internet
Author: User
Tags idf

Practical Scoring Function in Lucene

 

For multiterm queries, Lucene takes the Boolean model, TF/IDF, and the vector space model and combines them into a single efficient package that collects matching documents and scores them as it goes.

Take a multiterm query like the following:

GET /my_index/doc/_search
{
  "query": {
    "match": {
      "text": "quick fox"
    }
  }
}

Internally, it is rewritten as follows:

GET /my_index/doc/_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "text": "quick" }},
        { "term": { "text": "fox"   }}
      ]
    }
  }
}

The bool query implements the Boolean model. In this example, it returns only documents that contain the term quick, the term fox, or both.

Once a document matches a query, Lucene calculates its score for that query, combining the scores of each matching term. The formula used for scoring is called the Practical Scoring Function. It looks intimidating, but don't be put off: most of its components are already familiar. It introduces a few new elements, which are discussed next.

1  score(q,d)  =
2      queryNorm(q)
3    · coord(q,d)
4    · ∑ (
5          tf(t in d)
6        · idf(t)²
7        · t.getBoost()
8        · norm(t,d)
9      ) (t in q)

The meaning of each numbered line is as follows:

  1. score(q,d) is the relevance score of document d for query q.
  2. queryNorm(q) is the query normalization factor (new).
  3. coord(q,d) is the coordination factor (new).
  4. The sum of the weights for each term t in the query q for document d.
  5. tf(t in d) is the term frequency for term t in document d.
  6. idf(t) is the inverse document frequency for term t.
  7. t.getBoost() is the boost that has been applied to the query (new).
  8. norm(t,d) is the field-length norm, combined with the index-time field-level boost, if any (new).

You should already recognize score, tf, and idf. The queryNorm, coord, t.getBoost, and norm elements are new.
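To see how the pieces of the formula fit together, here is a minimal Python sketch. It is not Lucene's code: the function names are hypothetical, and the definitions of tf, idf, and norm (square root of the frequency, 1 + ln(numDocs / (docFreq + 1)), and 1 / √numTerms) are assumed from the classic TF/IDF similarity rather than stated in this section. For simplicity, query terms are assumed to be distinct and boosts are left out of queryNorm.

```python
import math

def tf(freq):
    # Assumed classic TF/IDF: square root of the term's frequency in the field.
    return math.sqrt(freq)

def idf(num_docs, doc_freq):
    # Assumed classic TF/IDF: 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def field_norm(num_terms):
    # Assumed field-length norm: 1 / sqrt(number of terms in the field).
    return 1.0 / math.sqrt(num_terms)

def practical_score(query_terms, doc_terms, doc_freq, num_docs, boosts=None):
    """Score one document field (a list of terms) against a list of distinct query terms."""
    boosts = boosts or {}
    idfs = {t: idf(num_docs, doc_freq.get(t, 0)) for t in query_terms}
    matched = [t for t in query_terms if t in doc_terms]
    if not matched:
        return 0.0  # the Boolean model excludes documents with no matching term

    query_norm = 1.0 / math.sqrt(sum(w * w for w in idfs.values()))  # line 2
    coord = len(matched) / len(query_terms)                          # line 3
    norm = field_norm(len(doc_terms))                                # line 8
    total = sum(                                                     # line 4
        tf(doc_terms.count(t))      # line 5
        * idfs[t] ** 2              # line 6
        * boosts.get(t, 1.0)        # line 7
        * norm
        for t in matched
    )
    return query_norm * coord * total

# Hypothetical corpus statistics: 3 documents, "quick" in 2 of them, "fox" in 1.
doc_freq = {"quick": 2, "fox": 1}
print(practical_score(["quick", "fox"], ["quick", "fox"], doc_freq, num_docs=3))
print(practical_score(["quick", "fox"], ["quick", "brown", "dog"], doc_freq, num_docs=3))
```

With these made-up statistics, the document that matches both terms scores higher than the longer document that matches only one.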

We will talk about query-time boosting later in this chapter, but first let's get query normalization, coordination, and index-time field-level boosting out of the way.

Query Normalization Factor

The query normalization factor (queryNorm) is an attempt to normalize a query so that the results from one query can be compared with the results of another.

TIP

Even though the intent of the query norm is to make results from different queries comparable, it doesn't work very well. The only purpose of the relevance score is to sort the results of the current query in the correct order. You should not try to compare the relevance scores from different queries.

This factor is calculated at the beginning of the query. The actual calculation depends on the queries involved, but a typical implementation is as follows:

queryNorm = 1 / √sumOfSquaredWeights

sumOfSquaredWeights is calculated by adding together the IDF of each term in the query, squared.

TIP

The same query normalization factor is applied to every document, and you have no way of changing it. For all intents and purposes, it can be ignored.
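As a small illustration of both points, the sketch below computes queryNorm from a list of IDF values and checks that multiplying every document's score by it leaves the ranking unchanged. The function names, the IDF values, and the raw scores are made up for the example.

```python
import math

def query_norm(idfs):
    """queryNorm = 1 / sqrt(sumOfSquaredWeights), taking each weight to be a term's IDF."""
    sum_of_squared_weights = sum(w * w for w in idfs)
    return 1.0 / math.sqrt(sum_of_squared_weights)

# Hypothetical IDF values for a two-term query.
norm = query_norm([3.0, 4.0])  # sumOfSquaredWeights = 9 + 16 = 25

# Hypothetical unnormalized scores for three documents.
raw_scores = {"doc_a": 7.5, "doc_b": 2.0, "doc_c": 4.0}
normalized = {doc: score * norm for doc, score in raw_scores.items()}

def ranking(scores):
    # Document IDs ordered from highest to lowest score.
    return sorted(scores, key=scores.get, reverse=True)

# Every score is scaled by the same positive factor, so the order is preserved.
print(ranking(raw_scores) == ranking(normalized))
```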

Query Coordination

The coordination factor (coord) is used to reward documents that contain a higher percentage of the query terms. The more query terms that appear in the document, the greater the chances that the document is a good match for the query.

Imagine that we have a query for quick brown fox, and that the weight of each term is 1.5. Without the coordination factor, the score would just be the sum of the weights of the terms in a document. For instance:

• Document with fox -> score: 1.5
• Document with quick fox -> score: 3.0
• Document with quick brown fox -> score: 4.5

The coordination factor multiplies the score by the number of matching terms in the document, and divides it by the total number of terms in the query. With the coordination factor, the scores are as follows:

• Document with fox -> score: 1.5 * 1 / 3 = 0.5
• Document with quick fox -> score: 3.0 * 2 / 3 = 2.0
• Document with quick brown fox -> score: 4.5 * 3 / 3 = 4.5

With the coordination factor, the document that contains all three terms scores much higher than the document that contains only two of them.
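The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the calculation as described here; the function name is hypothetical, and the fixed weight of 1.5 per term comes from the example.

```python
def score_with_coord(doc_terms, query_terms, term_weight=1.5):
    """Sum the weights of the matching terms, then apply matched / total query terms."""
    matched = [t for t in query_terms if t in doc_terms]
    raw_score = term_weight * len(matched)  # score without coordination
    return raw_score * len(matched) / len(query_terms)

query = ["quick", "brown", "fox"]
for doc in (["fox"], ["quick", "fox"], ["quick", "brown", "fox"]):
    print(" ".join(doc), "->", score_with_coord(doc, query))
```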

Remember that the query for quick brown fox is rewritten into a bool query like this:

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "text": "quick" }},
        { "term": { "text": "brown" }},
        { "term": { "text": "fox"   }}
      ]
    }
  }
}

The bool query uses query coordination by default for all query clauses, but it does allow you to disable it. Why might you want to do this? Usually, the answer is that you don't. Query coordination is usually a good thing. When you use a bool query to wrap several high-level queries such as match, it also makes sense to leave coordination enabled: the more clauses that match, the better the document fits your search request.

However, in some advanced use cases, it may make sense to disable coordination. Imagine that you are searching for the synonyms jump, leap, and hop. You don't care how many of these synonyms are present, because they all represent the same concept. In fact, only one of them is likely to be present. This is a good case for disabling the coordination factor:

GET /_search
{
  "query": {
    "bool": {
      "disable_coord": true,
      "should": [
        { "term": { "text": "jump" }},
        { "term": { "text": "hop"  }},
        { "term": { "text": "leap" }}
      ]
    }
  }
}

When you use synonyms (see Synonyms), this is exactly what happens internally: the rewritten query disables coordination for the synonyms. Most use cases for disabling coordination are handled automatically; you don't need to worry about it.

Index-Time Field-Level Boosting

We will talk about boosting a field (making it more important than other fields) at query time in Query-Time Boosting. It is also possible to apply a boost to a field at index time. In practice, this boost is applied to every term in the field rather than to the field itself.

To store the boost value in the index using as little space as possible, the field-level index-time boost is combined with the field-length norm and stored in the index as a single byte. This is the value returned by norm(t,d) in the preceding formula.

Warning

We strongly recommend against using field-level index-time boosts, for the following reasons:

• Combining the boost with the field-length norm and storing it in a single byte means that the field-length norm loses precision. The result is that Elasticsearch cannot distinguish between a field containing three words and a field containing five words.
• To change an index-time boost, you have to reindex all of your documents. A query-time boost, on the other hand, can be changed with every query.
• If a field with an index-time boost has multiple values, the boost is multiplied by itself for every value, dramatically increasing the weight of that field.

Query-time boosting is a much simpler, cleaner, and more flexible option.

With query normalization, coordination, and index-time boosting out of the way, we can now move on to the most useful tool for influencing the relevance calculation: query-time boosting.

Query-Time Boosting

In Prioritizing Clauses, we explained how to use the boost parameter at search time to give one query clause more importance than another. For example:

GET /_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              "query": "quick brown fox",
              "boost": 2
            }
          }
        },
        {
          "match": {
            "content": "quick brown fox"
          }
        }
      ]
    }
  }
}

Query-time boosting is the main tool for tuning relevance. Any type of query accepts a boost parameter. Setting a boost of 2 doesn't simply double the final _score; the boost value that is actually applied goes through normalization and some internal optimization. However, it does imply that a clause with a boost of 2 is twice as important as a clause with a boost of 1.
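To make the point about normalization concrete, the following sketch compares two clauses with and without a boost on the first one. It assumes that each clause's boost enters sumOfSquaredWeights as (idf · boost)², which is how the classic TF/IDF similarity is usually described but is not spelled out in this section. tf and the field-length norm are fixed at 1, and all names and numbers are illustrative only.

```python
import math

def clause_contributions(idfs, boosts):
    """Per-clause contribution idf^2 * boost * queryNorm, taking tf and norm as 1."""
    # Assumption: boosts are folded into the weights that queryNorm normalizes.
    sum_of_squared_weights = sum((i * b) ** 2 for i, b in zip(idfs, boosts))
    query_norm = 1.0 / math.sqrt(sum_of_squared_weights)
    return [i * i * b * query_norm for i, b in zip(idfs, boosts)]

unboosted = clause_contributions([1.0, 1.0], [1.0, 1.0])
boosted = clause_contributions([1.0, 1.0], [2.0, 1.0])

# The boosted clause counts twice as much as the other clause ...
print(boosted[0] / boosted[1])
# ... but neither its own contribution nor the total is simply doubled.
print(boosted[0] / unboosted[0], sum(boosted) / sum(unboosted))
```

Under these assumptions the boost changes the balance between clauses rather than scaling the final score, which is why boost values are best tuned by comparing result orderings, not absolute scores.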

In practice, there is no formula for deciding on the correct boost value for a particular query clause. It is a matter of trial and error. Remember that boost is just one of the factors involved in the relevance score; it has to compete with the other factors. For instance, in the preceding example, the title field probably already has a natural boost over the content field thanks to the field-length norm (titles are usually shorter than the related content), so don't blindly boost a field just because you think it should be boosted. Apply a boost, check the results, and adjust.

Boosting an Index

When searching across multiple indices, you can boost an entire index over the others with the indices_boost parameter. In the following example, documents from the more recent indices are given more weight:

GET /docs_2014_*/_search
{
  "indices_boost": {
    "docs_2014_10": 3,
    "docs_2014_09": 2
  },
  "query": {
    "match": {
      "text": "quick brown fox"
    }
  }
}

This multi-index search covers all indices beginning with docs_2014_. Documents in the docs_2014_10 index are boosted by 3, those in docs_2014_09 by 2, and documents in any other matching index have the default boost of 1.

t.getBoost()

These boost values are represented in the Practical Scoring Function by the t.getBoost() element. Boosts are not applied at the level at which they appear in the query DSL. Instead, any boost values are combined and passed down to the individual terms. The t.getBoost() method returns any boost value applied to the term itself or to any of the queries higher up the chain.
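One way to picture this is to walk the query tree from the top and carry the accumulated boost down to each term. The sketch below does that for a toy dictionary representation of a query. The representation and the function name are invented for illustration, and combining nested boosts by multiplication is an assumption, since the text only says that they are combined.

```python
def effective_term_boosts(node, inherited=1.0):
    """Return (term, boost) pairs, each term carrying the boosts of its ancestors."""
    # Assumption: a nested boost multiplies whatever was inherited from above.
    boost = inherited * node.get("boost", 1.0)
    if "term" in node:
        return [(node["term"], boost)]
    pairs = []
    for clause in node.get("should", []):
        pairs.extend(effective_term_boosts(clause, boost))
    return pairs

# Hypothetical query: a bool clause boosted by 2 that wraps two term clauses.
query = {
    "boost": 2,
    "should": [
        {"term": "quick", "boost": 3},
        {"term": "fox"},
    ],
}
print(effective_term_boosts(query))
```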

TIP

In fact, reading the explain output is a little more complex than that. You won't see the boost value or t.getBoost() mentioned in the explanation at all. Instead, the boost is rolled into the queryNorm that is applied to a particular term. Although we said that queryNorm is the same for every term, you will see that the queryNorm for a boosted term is higher than the queryNorm for an unboosted term.
