[Elasticsearch] Proximity Matching (1): Phrase Matching and the slop Parameter

Source: Internet
Author: User

This article is translated from the proximity matching chapter of the official Elasticsearch guide.


Proximity Matching

A standard full-text search using TF/IDF treats a document, or at least each field within a document, as a "big bag of words". The match query can tell us whether that bag contains our search terms, but that is only part of the story. It can't tell us anything about the relationship between the words.

Consider the differences between the following sentences:

    • Sue ate the alligator.
    • The alligator ate Sue.
    • Sue never goes anywhere without her alligator-skin purse.

A match query for sue alligator matches all three of the documents above, but it doesn't tell us whether the two words belong to the same part of the text, or whether together they express a single idea.

Understanding how words relate to each other is a complex problem, and we can't solve it with just one more type of query, but we can at least infer a possible relationship from how close the words are to each other.

Real documents may be much longer than the examples above: Sue and alligator may be separated by several paragraphs. Perhaps we still want to return such documents, but give a higher relevance score to documents in which the two words appear closer together.

This is the domain of phrase matching, or proximity matching.
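The "bag of words" behavior can be sketched in a few lines of Python. This is only an illustration: the `tokenize` and `bag_match` helpers are hypothetical, the regular expression is a rough stand-in for a real analyzer, and the check here requires both words to be present (the match query by default needs only one of them, so this is the stricter test).

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumerics (a rough stand-in for an analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_match(query, doc):
    """True if every query term occurs somewhere in the document, in any order."""
    bag = set(tokenize(doc))
    return all(term in bag for term in tokenize(query))

docs = [
    "Sue ate the alligator.",
    "The alligator ate Sue.",
    "Sue never goes anywhere without her alligator-skin purse.",
]

matches = [bag_match("sue alligator", d) for d in docs]
print(matches)  # [True, True, True]
```

All three sentences match, even though only the first two put Sue and the alligator in any direct relationship.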

TIP

In this chapter, we will continue to use the same sample documents that we used for the match query.



Phrase Matching

Just as the match query is the first choice for standard full-text search, the match_phrase query is the one to reach for when you need to find words that are next to each other:

    GET /my_index/my_type/_search
    {
        "query": {
            "match_phrase": {
                "title": "quick brown fox"
            }
        }
    }

Like the match query, the match_phrase query first analyzes the query string to produce a list of terms. It then searches for all of the terms, but keeps only documents that contain every search term in the same positions relative to each other. A query for the phrase quick fox would not match any of our documents, because no document contains the term quick immediately followed by fox.

TIP

The match_phrase query can also be written as a match query of type phrase:

    "match": {
        "title": {
            "query": "quick brown fox",
            "type":  "phrase"
        }
    }


Term Positions

When a string is analyzed, the analyzer returns not only a list of terms, but also the position, or order, of each term in the original string:

    GET /_analyze?analyzer=standard
    Quick brown fox

The following results are returned:

    {
       "tokens": [
          {
             "token": "quick",
             "start_offset": 0,
             "end_offset": 5,
             "type": "<ALPHANUM>",
             "position": 1
          },
          {
             "token": "brown",
             "start_offset": 6,
             "end_offset": 11,
             "type": "<ALPHANUM>",
             "position": 2
          },
          {
             "token": "fox",
             "start_offset": 12,
             "end_offset": 15,
             "type": "<ALPHANUM>",
             "position": 3
          }
       ]
    }

Positions can be stored in the inverted index. Position-aware queries such as match_phrase can then use them to match only documents that contain all of the words in exactly the order specified, with no other words in between.
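To make the idea of storing positions concrete, here is a minimal sketch of a positional inverted index in Python. It is not how Elasticsearch is implemented: the `analyze` and `build_index` helpers and the second sample document are made up for illustration, and the regular expression only approximates the standard analyzer.

```python
import re
from collections import defaultdict

def analyze(text):
    """Return (term, position) pairs; positions start at 1, as in the output above."""
    terms = re.findall(r"[a-z0-9]+", text.lower())
    return [(term, pos) for pos, term in enumerate(terms, start=1)]

def build_index(docs):
    """Map each term to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, pos in analyze(text):
            index[term].setdefault(doc_id, []).append(pos)
    return index

index = build_index({1: "Quick brown fox", 2: "The fox is quick"})
print(index["fox"])    # {1: [3], 2: [2]}
print(index["quick"])  # {1: [1], 2: [4]}
```

With positions recorded next to each document ID, a query can check not just that quick and fox both occur in a document, but also where they occur relative to each other.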

What Is a Phrase

For documents that match the phrase "quick brown fox", the following conditions must be true:

    • quick, brown, and fox must all appear in the field.
    • The position of brown must be 1 greater than the position of quick.
    • The position of fox must be 2 greater than the position of quick.

If any one of these conditions is not met, the document is not considered a match.
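The conditions above translate almost directly into code. The following Python sketch checks them against a single field value; `positions` and `phrase_match` are hypothetical helper names, and the tokenization is a simplification of real analysis.

```python
import re

def positions(text):
    """Map each term to the list of positions where it occurs (starting at 1)."""
    result = {}
    for pos, term in enumerate(re.findall(r"[a-z0-9]+", text.lower()), start=1):
        result.setdefault(term, []).append(pos)
    return result

def phrase_match(phrase, text):
    """True if the phrase terms occur in the text at consecutive positions."""
    terms = re.findall(r"[a-z0-9]+", phrase.lower())
    if not terms:
        return False
    doc = positions(text)
    # Condition 1: every term must appear in the field.
    if any(term not in doc for term in terms):
        return False
    # Conditions 2 and 3: term i must sit exactly i positions after the first term.
    return any(
        all(start + i in doc[term] for i, term in enumerate(terms))
        for start in doc[terms[0]]
    )

title = "The quick brown fox"
print(phrase_match("quick brown fox", title))  # True
print(phrase_match("quick fox", title))        # False
print(phrase_match("fox brown quick", title))  # False
```

The second call fails because fox is two positions after quick rather than one, and the third fails because the order is reversed.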

TIP

Internally, the match_phrase query uses the low-level span query family to perform position-aware matching. Span queries are term-level queries, so they have no analysis phase; they search directly for the exact terms specified.

Fortunately, most users never need to use span queries directly, because the match_phrase query is usually good enough. However, certain specialized fields, such as patent search, use these low-level queries to run very specific, carefully constructed positional searches.



Mixing It Up

Requiring an exact-phrase match may be too strict. Perhaps we would like a document containing "quick brown fox" to match a query for "quick fox", even though the positions are not exactly the same.

We can introduce a degree of flexibility into phrase matching by using the slop parameter:

    GET /my_index/my_type/_search
    {
        "query": {
            "match_phrase": {
                "title": {
                    "query": "quick fox",
                    "slop":  1
                }
            }
        }
    }

The slop parameter tells the match_phrase query how far apart terms are allowed to be while the document is still considered a match. "How far apart" means: how many times do you need to move a term to make the query and the document match?

Let's illustrate this with a simple example. For the query quick fox to match a document containing quick brown fox, we need a slop of just 1:

                Pos 1       Pos 2       Pos 3
    -----------------------------------------------
    Doc:        quick       brown       fox
    -----------------------------------------------
    Query:      quick       fox
    Slop 1:     quick                   fox

Although all of the words need to be present in a phrase match that uses slop, they do not need to appear in the same order. With a large enough slop value, the words can appear in any order.

For the query fox quick to match our document, we need a slop of 3:

                Pos 1       Pos 2       Pos 3
    -----------------------------------------------
    Doc:        quick       brown       fox
    -----------------------------------------------
    Query:      fox         quick
    Slop 1:     fox|quick
    Slop 2:     quick       fox
    Slop 3:     quick                   fox
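One way to reason about the two diagrams above is to measure, for each query term, how far its position in the document is from where the query order would place it, and then take the spread of those shifts. The Python sketch below uses that model. It is an assumption made for illustration that happens to reproduce both examples, not a description of the algorithm Elasticsearch runs; `required_slop` is a hypothetical helper, and repeated query terms are not handled carefully.

```python
import re
from itertools import product

def terms_of(text):
    """Lowercase and split on non-alphanumerics (a rough stand-in for an analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def required_slop(query, text):
    """Smallest slop at which the query would match under this model, or None."""
    query_terms = terms_of(query)
    doc_positions = {}
    for pos, term in enumerate(terms_of(text)):
        doc_positions.setdefault(term, []).append(pos)
    # Every query term must still be present in the document.
    if not query_terms or any(t not in doc_positions for t in query_terms):
        return None
    best = None
    # Try every way of picking one document position per query term.
    for combo in product(*(doc_positions[t] for t in query_terms)):
        # Shift of each term relative to its place in the query order.
        shifts = [doc_pos - i for i, doc_pos in enumerate(combo)]
        spread = max(shifts) - min(shifts)
        if best is None or spread < best:
            best = spread
    return best

doc = "quick brown fox"
print(required_slop("quick brown fox", doc))  # 0
print(required_slop("quick fox", doc))        # 1
print(required_slop("fox quick", doc))        # 3
```

Under this model an exact phrase needs no slop, skipping one word costs 1, and swapping two words that are also separated by another word costs 3, which agrees with the diagrams.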


