[Elasticsearch] Proximity Matching (I): Phrase Matching and the slop Parameter

Source: Internet
Author: User

This article is translated from the proximity matching chapter of the official Elasticsearch guide.


Proximity Matching

Standard full-text search with TF/IDF treats a document, or at least each field in a document, as a "big bag of words". The match query can tell us whether that bag contains our search terms, but that is only part of the picture. It cannot tell us anything about the relationship between the words.

Consider the differences between the following sentences:

    • Sue ate the alligator.
    • The alligator ate Sue.
    • Sue never goes anywhere without her alligator-skin purse.

A match query for sue alligator matches all three of these documents, but it does not tell us whether the two words belong to the same part of the text or express a single idea together.
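To make the "bag of words" limitation concrete, here is a minimal Python sketch. It is not how Elasticsearch is implemented; the helper names `bag_of_words` and `contains_all_terms` are hypothetical, and the tokenizer is a crude stand-in for a real analyzer. It only shows that once word order is discarded, the three sentences become indistinguishable for this query.

```python
import re

def bag_of_words(text):
    """Lowercase the text and return its set of alphanumeric tokens.
    Word order and word positions are thrown away on purpose."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def contains_all_terms(query, text):
    """True if every query term appears somewhere in the document's bag."""
    return bag_of_words(query) <= bag_of_words(text)

docs = [
    "Sue ate the alligator.",
    "The alligator ate Sue.",
    "Sue never goes anywhere without her alligator-skin purse.",
]

for doc in docs:
    # Every document passes, even though the meanings differ.
    print(contains_all_terms("sue alligator", doc), "-", doc)
```

Even this stricter check, which requires every term to be present, cannot tell the three sentences apart.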

Understanding how words relate to each other is a hard problem, and we cannot solve it just by switching to another type of query. We can, however, at least use the distance between words as a hint about how closely they are related.

Real documents may be much longer than the examples above: Sue and alligator may be separated by several paragraphs. We may still want to include such documents in the results, but we want to give a higher relevance score to documents in which Sue and alligator appear close together.

This is the job of phrase matching, or proximity matching.
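As a rough illustration of "distance between words", the following Python sketch measures the smallest gap, in token positions, between two terms. The function names are hypothetical and the tokenizer is simplified; this is not Elasticsearch's scoring, only the raw distance that a proximity-aware score could be based on.

```python
import re

def tokenize(text):
    """Very simple tokenizer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())

def min_distance(text, term_a, term_b):
    """Smallest gap in token positions between term_a and term_b,
    or None if either term is missing from the text."""
    tokens = tokenize(text)
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)

print(min_distance("Sue ate the alligator.", "sue", "alligator"))
print(min_distance("Sue never goes anywhere without her alligator-skin purse.",
                   "sue", "alligator"))
```

A smaller distance suggests a closer relationship, which is the intuition behind ranking such documents higher.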

TIP

In this chapter, we will keep using the same sample documents that we used for the match query.



Phrase Matching

Just as the match query is the first thing to reach for in full-text search, the match_phrase query is the first thing to reach for when you need to find words that appear next to each other:

    GET /my_index/my_type/_search
    {
        "query": {
            "match_phrase": {
                "title": "quick brown fox"
            }
        }
    }

Like the match query, the match_phrase query first analyzes the query string to produce a list of terms. It then searches for all of those terms, but keeps only the documents that contain all of them, in adjacent positions. A query for the phrase quick fox would not match any of our documents, because no document contains the term quick immediately followed by the term fox.

TIP

The match_phrase query can also be written as a match query of type phrase:

    "match": {
        "title": {
            "query": "quick brown fox",
            "type": "phrase"
        }
    }

Term Positions

When a string is analyzed, the analyzer returns not only a list of terms, but also the position, or order, of each term in the original string:

    GET /_analyze?analyzer=standard
    Quick brown fox

The following results are returned:

    {
       "tokens": [
          { "token": "quick", "start_offset": 0,  "end_offset": 5,  "type": "<ALPHANUM>", "position": 1 },
          { "token": "brown", "start_offset": 6,  "end_offset": 11, "type": "<ALPHANUM>", "position": 2 },
          { "token": "fox",   "start_offset": 12, "end_offset": 15, "type": "<ALPHANUM>", "position": 3 }
       ]
    }

Positions can be stored in the inverted index. Position-aware queries such as match_phrase can then use them to match only documents that contain all of the words in exactly the specified order, with no other words in between.
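The shape of that output can be imitated with a few lines of Python. This is only a sketch: the `analyze` function below is a hypothetical, heavily simplified stand-in for the standard analyzer (it splits on non-alphanumeric characters and lowercases), and it numbers positions from 1 purely to mirror the response shown above.

```python
import re

def analyze(text):
    """Return one dict per token with its character offsets and its
    1-based position in the token stream (simplified analyzer)."""
    tokens = []
    for position, match in enumerate(re.finditer(r"[A-Za-z0-9]+", text), start=1):
        tokens.append({
            "token": match.group().lower(),   # lowercased term
            "start_offset": match.start(),    # index of first character
            "end_offset": match.end(),        # index just past the last character
            "position": position,             # order of the term in the text
        })
    return tokens

for token in analyze("Quick brown fox"):
    print(token)
```

The offsets describe where the term sits in the original string, while the position records its order among the terms; phrase queries rely on the position.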

What Is a Phrase

For a document to be considered a match for the phrase "quick brown fox", the following must all be true:

    • quick, brown, and fox must all appear in the field.
    • The position of brown must be 1 greater than the position of quick.
    • The position of fox must be 2 greater than the position of quick.

If any of these conditions is not met, the document is not considered a match.
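The three conditions above can be sketched as a small position check in Python. This is an illustration of the rule only, not of how Elasticsearch or Lucene executes the query; `positions_by_term` and `phrase_match` are hypothetical helper names and the tokenizer is simplified.

```python
import re

def positions_by_term(text):
    """Map each term to the list of positions where it occurs
    (a toy version of positional data in an inverted index)."""
    index = {}
    for pos, term in enumerate(re.findall(r"[a-z0-9]+", text.lower()), start=1):
        index.setdefault(term, []).append(pos)
    return index

def phrase_match(phrase, text):
    """True if the phrase terms occur in the text at consecutive positions,
    in the same order as in the phrase."""
    terms = re.findall(r"[a-z0-9]+", phrase.lower())
    index = positions_by_term(text)
    # Condition 1: every term must appear in the field.
    if not terms or any(term not in index for term in terms):
        return False
    # Conditions 2 and 3: term number k must sit k positions after the first term.
    for start in index[terms[0]]:
        if all(start + offset in index[term] for offset, term in enumerate(terms)):
            return True
    return False

doc = "The quick brown fox jumps over the lazy dog"
print(phrase_match("quick brown fox", doc))  # all three conditions hold
print(phrase_match("quick fox", doc))        # fox is 2 positions after quick, not 1
```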

TIP

Internally, the match_phrase query uses the low-level span query family to perform position-aware matching. Span queries are term-level queries, so they have no analysis phase; they search directly for the exact terms specified.

Fortunately, most users never need to use span queries directly, because the match_phrase query is usually good enough. However, some specialized fields, such as patent search, use these low-level queries to perform very specific, carefully constructed positional searches.



Mixing It Up

Requiring an exact phrase match may be too strict. Perhaps we would like a document containing "quick brown fox" to match a query for "quick fox", even though the positions are not exactly the same.

We can use the slop parameter in phrase matching to introduce some flexibility:

    GET /my_index/my_type/_search
    {
        "query": {
            "match_phrase": {
                "title": {
                    "query": "quick fox",
                    "slop": 1
                }
            }
        }
    }

The slop parameter tells the match_phrase query how far apart the query terms are allowed to be while still treating the document as a match. By "how far apart" we mean: how many times do you need to move a term to make the query and the document match?

Let's start with a simple example. For the query quick fox to match a document containing quick brown fox, we need a slop of 1:

                Pos 1         Pos 2         Pos 3
    -----------------------------------------------
    Doc:        quick         brown         fox
    -----------------------------------------------
    Query:      quick         fox
    Slop 1:     quick                 ->    fox

Although all of the words must still be present in a phrase match that uses slop, they do not have to appear in the same order. With a large enough slop value, the words can appear in any order.

For the query fox quick to match our document, we need a slop of 3:

                Pos 1         Pos 2         Pos 3
    -----------------------------------------------
    Doc:        quick         brown         fox
    -----------------------------------------------
    Query:      fox           quick
    Slop 1:     fox|quick
    Slop 2:     quick         fox
    Slop 3:     quick                       fox
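One way to reason about the slop needed is to measure how far each query term is shifted relative to its place in the query, and take the spread between the largest and smallest shift. The Python sketch below does this. It is a simplified model that reproduces the two examples above, not a description of Lucene's actual implementation; `required_slop` is a hypothetical helper, and repeated query terms are not handled specially.

```python
import re
from itertools import product

def tokenize(text):
    """Very simple tokenizer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())

def required_slop(phrase, text):
    """Smallest slop under which the phrase matches the text in this
    simplified model, or None if a query term is missing from the text."""
    terms = tokenize(phrase)
    doc = tokenize(text)
    # For each query term, every position where it occurs in the document.
    occurrences = [[i for i, t in enumerate(doc) if t == term] for term in terms]
    if not terms or any(not occ for occ in occurrences):
        return None
    best = None
    for combo in product(*occurrences):
        # Shift of each term: document position minus position in the query.
        shifts = [doc_pos - query_pos for query_pos, doc_pos in enumerate(combo)]
        cost = max(shifts) - min(shifts)
        if best is None or cost < best:
            best = cost
    return best

print(required_slop("quick fox", "quick brown fox"))  # 1
print(required_slop("fox quick", "quick brown fox"))  # 3
```

An exact phrase needs a slop of 0 in this model, because every term is shifted by the same amount.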


