[Elasticsearch] Proximity matching (2): multivalue fields, closer is better, and proximity for relevance

Source: Internet
Author: User

Multivalue Fields

Using phrase matching on multi-value fields produces odd behavior:

PUT /my_index/groups/1
{
    "names": [ "John Abraham", "Lincoln Smith" ]
}

Run a phrase query for Abraham Lincoln:

GET /my_index/groups/_search
{
    "query": {
        "match_phrase": {
            "names": "Abraham Lincoln"
        }
    }
}

Surprisingly, the document above matches the query, even though Abraham and Lincoln belong to two different people in the names array. The reason for this behavior is the way arrays are indexed in ES.

When John Abraham is analyzed, it produces the following terms:

  • Position 1: john
  • Position 2: abraham

When Lincoln Smith is analyzed, it produces:

  • Position 3: lincoln
  • Position 4: smith

In other words, ES produces exactly the same list of terms for the array above as it would for the single string John Abraham Lincoln Smith. Our query looks for abraham immediately followed by lincoln, and those two terms do exist in the index at adjacent positions, so the query matches.
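The position bookkeeping described above can be mimicked in a few lines of Python. This is only a sketch under stated assumptions: the helper names index_positions and phrase_matches are hypothetical, the analyzer is approximated by lowercasing and splitting on whitespace, and positions start at 1 to match the lists above.

```python
def index_positions(values):
    """Number tokens with one running counter across all array elements."""
    positions = {}
    pos = 0
    for value in values:
        for token in value.lower().split():  # crude stand-in for the analyzer
            pos += 1
            positions.setdefault(token, []).append(pos)
    return positions


def phrase_matches(positions, phrase):
    """Return True if the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    return any(
        all(start + i in positions.get(term, []) for i, term in enumerate(terms))
        for start in positions.get(terms[0], [])
    )


names = ["John Abraham", "Lincoln Smith"]
positions = index_positions(names)
print(positions)
print(phrase_matches(positions, "Abraham Lincoln"))
```

Because the counter is never reset between array elements, abraham and lincoln end up at neighboring positions, which is why the phrase check succeeds.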

Fortunately, there is a simple way to avoid this situation: the position_offset_gap parameter (renamed position_increment_gap in Elasticsearch 2.0, where it defaults to 100), which is configured in the field mapping:

DELETE /my_index/groups/

PUT /my_index/_mapping/groups
{
    "properties": {
        "names": {
            "type":                "string",
            "position_offset_gap": 100
        }
    }
}

The position_offset_gap setting tells ES to increase the current term position by the specified value for each new element in the array. So when we reindex the array of names above, the terms are indexed at the following positions:

  • Position 1: john
  • Position 2: abraham
  • Position 103: lincoln
  • Position 104: smith

Now our phrase query no longer matches this document, because abraham and lincoln are separated by a gap of 100 positions. You would have to add a slop value of 100 for this document to match.
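To make the effect of the gap concrete, here is a small Python sketch. It is not how ES is implemented: index_positions and required_slop are hypothetical helpers, tokenization is simplified to lowercasing and whitespace splitting, and only in-order matches of a two-term phrase are considered.

```python
def index_positions(values, gap=0):
    """Number tokens across array elements, skipping `gap` positions
    before each element after the first."""
    positions = []
    pos = 0
    for i, value in enumerate(values):
        if i > 0:
            pos += gap
        for token in value.lower().split():
            pos += 1
            positions.append((pos, token))
    return positions


def required_slop(positions, first, second):
    """Smallest slop that lets `first second` match in order, or None."""
    firsts = [p for p, t in positions if t == first]
    seconds = [p for p, t in positions if t == second]
    gaps = [s - f - 1 for f in firsts for s in seconds if s > f]
    return min(gaps) if gaps else None


names = ["John Abraham", "Lincoln Smith"]
print(index_positions(names, gap=100))
print(required_slop(index_positions(names), "abraham", "lincoln"))
print(required_slop(index_positions(names, gap=100), "abraham", "lincoln"))
```

Without a gap the two terms are adjacent and no slop is needed; with a gap of 100 the sketch reports that a slop of 100 is required, in line with the explanation above.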



Closer Is Better

Whereas a phrase query simply excludes documents that do not contain the exact query phrase, a proximity query (a phrase query with a slop value greater than 0) incorporates the proximity of the query terms into the final relevance score. By setting a high slop value such as 50 or 100, you can exclude documents in which the words are really too far apart, while giving a higher score to documents in which the words are closer together.

The following proximity query for quick dog matches both documents that contain the words quick and dog, but gives a higher score to the document in which the two words are nearer to each other:

POST /my_index/my_type/_search
{
   "query": {
      "match_phrase": {
         "title": {
            "query": "quick dog",
            "slop":  50
         }
      }
   }
}

{
  "hits": [
     {
        "_id":      "3",
        "_score":   0.75,
        "_source": {
           "title": "The quick brown fox jumps over the quick dog"
        }
     },
     {
        "_id":      "2",
        "_score":   0.28347334,
        "_source": {
           "title": "The quick brown fox jumps over the lazy dog"
        }
     }
  ]
}
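The ranking above follows from how far apart the query terms are in each title. The Python sketch below measures that distance for the two documents. The function closest_gap is a hypothetical helper, tokenization is simplified to whitespace splitting, and the 1 / (gap + 1) weight is an illustrative assumption that is not meant to reproduce the _score values shown in the response.

```python
def closest_gap(title, first, second):
    """Fewest intervening tokens between `first` and a later `second`,
    or None if the pair never occurs in that order."""
    tokens = title.lower().split()
    firsts = [i for i, t in enumerate(tokens) if t == first]
    seconds = [i for i, t in enumerate(tokens) if t == second]
    gaps = [s - f - 1 for f in firsts for s in seconds if s > f]
    return min(gaps) if gaps else None


titles = {
    "3": "The quick brown fox jumps over the quick dog",
    "2": "The quick brown fox jumps over the lazy dog",
}

for doc_id, title in titles.items():
    gap = closest_gap(title, "quick", "dog")
    # Assumed illustrative weighting: closer terms give a larger weight.
    print(doc_id, gap, round(1 / (gap + 1), 3))
```

Document 3 contains quick directly before dog, so its gap is 0, while document 2 has six words between the two terms and therefore receives a much smaller weight in this sketch.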


Using Proximity to Improve Relevance

Although proximity queries are useful, the requirement that all terms must be present in the document makes them overly strict. This is the same issue we discussed in the Controlling Precision section of the Full-Text Search chapter: if six out of seven terms match, the document is probably relevant enough to be worth showing to the user, but the match_phrase query would exclude it.

Instead of using proximity matching as an absolute requirement, we can use it as a signal: one of potentially many queries, each of which contributes to the overall score for each document (see Most Fields).

The fact that we want to add together the scores from multiple queries implies that we should combine them with the bool query.

We can use a simple match query as a must clause. This query decides which documents are included in the result set, and the minimum_should_match parameter can be used to trim the long tail of less relevant results. We can then add other, more specific queries as should clauses. Every should clause that a document matches increases its relevance score.

GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "title": {
            "query":                "quick brown fox",
            "minimum_should_match": "30%"
          }
        }
      },
      "should": {
        "match_phrase": {
          "title": {
            "query": "quick brown fox",
            "slop":  50
          }
        }
      }
    }
  }
}

Of course, we can add other queries to the should clause, each one targeting a specific aspect of relevance.
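If the request body is assembled in application code, the must/should pattern can be wrapped in a small helper. The following Python sketch only builds the JSON body and does not talk to a cluster. The function name proximity_boosted_query and its parameters are hypothetical, and the should clauses are emitted as a list so that additional signals can be appended.

```python
import json


def proximity_boosted_query(field, text, minimum_should_match="30%",
                            slop=50, extra_should=()):
    """Build a bool query: a lenient match decides inclusion (must),
    while a sloppy phrase match only adds to the score (should)."""
    should = [{"match_phrase": {field: {"query": text, "slop": slop}}}]
    should.extend(extra_should)  # optional extra relevance signals
    return {
        "query": {
            "bool": {
                "must": {
                    "match": {
                        field: {
                            "query": text,
                            "minimum_should_match": minimum_should_match,
                        }
                    }
                },
                "should": should,
            }
        }
    }


body = proximity_boosted_query("title", "quick brown fox")
print(json.dumps(body, indent=2))
```

Keeping the phrase query in should rather than must means a document missing one of the terms can still be returned, and proximity only influences ranking.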

