Partial Matching
A keen observer will have noticed that every query presented so far operates on whole terms. The smallest unit that can match is a single term: you can find only terms that exist in the inverted index.
But what if you want to match part of a term rather than the whole term? Partial matching lets users specify a portion of the term they are looking for and finds any words that contain that fragment.
The need to match part of a term is less common in the full-text search world than you might think. If you come from an SQL background, you have probably implemented a simple full-text search at some point with a statement like this:
WHERE text LIKE '%quick%'
  AND text LIKE '%brown%'
  AND text LIKE '%fox%'
Of course, with Elasticsearch the analysis process and the inverted index let us avoid this kind of brute-force technique. To match both "fox" and "foxes", we can simply apply a stemmer and index the stem. There is no need to match partial terms.
Even so, partial matching is useful in some situations. Common use cases include:
- Matching postcodes, product serial numbers, or other not_analyzed values that start with a particular prefix, match a wildcard pattern, or even match a regular expression.
- Search-as-you-type: displaying the most likely results before the user has finished typing the search terms.
- Matching in languages such as German or Dutch, which contain long compound words such as Weltgesundheitsorganisation (World Health Organization).
We will start by examining prefix matching on exact-value not_analyzed fields, and then introduce some related matching techniques.
Postcodes and Structured Data
We will use United Kingdom postcodes to illustrate how to use partial matching with structured data. UK postcodes have a well-defined structure. For example, the postcode W1V 3DG can be broken down as follows:
- W1V: This outer part identifies the postal area and district:
  - W indicates the area, using one or two letters.
  - 1V indicates the district, using one or two digits, possibly followed by a letter.
- 3DG: This inner part identifies a street or building:
  - 3 indicates the sector, using one digit.
  - DG indicates the unit, using two letters.
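The breakdown above can be sketched as a small parser. This is only an illustration: the regular expression is a simplified assumption based on the four parts just described, not the full official postcode specification, and the names POSTCODE_RE and split_postcode are hypothetical.

```python
import re

# Assumed simplified pattern: area (1-2 letters), district (1-2 digits plus
# an optional letter), sector (1 digit), unit (2 letters).
POSTCODE_RE = re.compile(
    r"^(?P<area>[A-Z]{1,2})"
    r"(?P<district>[0-9]{1,2}[A-Z]?)"
    r"\s*"
    r"(?P<sector>[0-9])"
    r"(?P<unit>[A-Z]{2})$"
)


def split_postcode(postcode):
    """Return the four parts of a postcode as a dict, or None if it does not fit."""
    match = POSTCODE_RE.match(postcode.strip().upper())
    return match.groupdict() if match else None


print(split_postcode("W1V 3DG"))
# {'area': 'W', 'district': '1V', 'sector': '3', 'unit': 'DG'}
```

Because each part sits at a fixed position from the left, a prefix such as W1 selects a whole district, which is what makes prefix matching a natural fit for this kind of data.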
Suppose we index postcodes as exact-value not_analyzed fields. We could create the index as follows:
PUT /my_index
{
    "mappings": {
        "address": {
            "properties": {
                "postcode": {
                    "type":  "string",
                    "index": "not_analyzed"
                }
            }
        }
    }
}
Then index some postcodes:
PUT /my_index/address/1
{ "postcode": "W1V 3DG" }

PUT /my_index/address/2
{ "postcode": "W2F 8HW" }

PUT /my_index/address/3
{ "postcode": "W1F 7HW" }

PUT /my_index/address/4
{ "postcode": "WC1N 1LZ" }

PUT /my_index/address/5
{ "postcode": "SW5 0BE" }
Now our data is ready.
Prefix Query
To find all postcodes that begin with W1, we can use a simple prefix query:
GET /my_index/address/_search
{
    "query": {
        "prefix": {
            "postcode": "W1"
        }
    }
}
The prefix query is a low-level query that works at the term level. It does not analyze the query string before searching. It assumes that you have passed it the exact prefix you want to find.
TIP
By default, the prefix query does no relevance scoring. It simply finds matching documents and gives each of them a score of 1. In practice, it behaves more like a filter than a query. The only practical difference between the prefix query and the prefix filter is that the filter can be cached.
Previously, we said that "you can find only terms that exist in the inverted index", but we have done nothing special to index these postcodes; each postcode is simply indexed as an exact value. So how does the prefix query work?
Remember that the inverted index consists of a sorted list of unique terms (in this case, postcodes). For each term, it lists the IDs of the documents that contain that term. For our sample documents, the inverted index looks like this:
Term:          Doc IDs:
-------------------------
"SW5 0BE"    |  5
"W1F 7HW"    |  3
"W1V 3DG"    |  1
"W2F 8HW"    |  2
"WC1N 1LZ"   |  4
-------------------------
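To make the structure concrete, here is a minimal Python sketch that builds the same sorted term list from the five sample documents. It is a toy model only: the names docs and build_inverted_index are hypothetical, and a real Lucene index stores its terms in a far more compact form.

```python
from collections import defaultdict

# The five sample documents: document ID -> exact (not_analyzed) postcode value.
docs = {1: "W1V 3DG", 2: "W2F 8HW", 3: "W1F 7HW", 4: "WC1N 1LZ", 5: "SW5 0BE"}


def build_inverted_index(documents):
    """Map each unique term to the IDs of documents containing it, sorted by term."""
    postings = defaultdict(list)
    for doc_id, value in sorted(documents.items()):
        postings[value].append(doc_id)
    # Sorting by term is what later allows a prefix to be located quickly.
    return sorted(postings.items())


for term, doc_ids in build_inverted_index(docs):
    print(f"{term!r:12} | {doc_ids}")
```

The terms come out in the same order as in the listing above, because plain string sorting places S before W and the digits 1 and 2 before the letter C.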
To support prefix matching, the query performs the following steps:
1. Skips through the term list to find the first term that begins with W1.
2. Collects the associated document IDs.
3. Moves to the next term.
4. If that term also begins with W1, repeats from step 2; otherwise, the query is finished.
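The steps above can be sketched in Python over the sample index. This is an illustration under assumptions, not how Elasticsearch or Lucene is actually implemented: the name prefix_search is hypothetical, and a binary search (bisect_left) stands in for locating the first candidate term in the sorted list.

```python
from bisect import bisect_left

# Sorted (term, document IDs) pairs for the five sample postcodes.
INDEX = [
    ("SW5 0BE", [5]),
    ("W1F 7HW", [3]),
    ("W1V 3DG", [1]),
    ("W2F 8HW", [2]),
    ("WC1N 1LZ", [4]),
]


def prefix_search(index, prefix):
    """Collect the document IDs of every term that starts with the prefix."""
    terms = [term for term, _ in index]
    # Find the position of the first term that could start with the prefix.
    pos = bisect_left(terms, prefix)
    doc_ids = []
    # Keep going while the current term still starts with the prefix.
    while pos < len(index) and index[pos][0].startswith(prefix):
        doc_ids.extend(index[pos][1])  # collect the associated document IDs
        pos += 1                       # move to the next term
    return doc_ids


print(prefix_search(INDEX, "W1"))  # [3, 1]
```

Note that the loop touches every matching term once, so the cost grows with the number of terms that share the prefix, not with the number of results you want to show.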
Although this works well for our small example, imagine that the inverted index contained 1 million postcodes beginning with W1. The prefix query would need to visit all 1 million terms to calculate the result.
The shorter the prefix, the more terms need to be visited. If we searched for the prefix W instead of W1, we might match as many as 10 million terms.
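The effect of prefix length can be seen even on the five sample postcodes. The short sketch below counts how many terms a prefix would force the query to visit; the function name terms_visited is hypothetical, and the counts are for the sample data only.

```python
# Sorted unique terms from the sample inverted index.
TERMS = ["SW5 0BE", "W1F 7HW", "W1V 3DG", "W2F 8HW", "WC1N 1LZ"]


def terms_visited(terms, prefix):
    """Count the terms that start with the prefix and so must be visited."""
    return sum(1 for term in terms if term.startswith(prefix))


for prefix in ("W", "W1", "W1V"):
    print(prefix, terms_visited(TERMS, prefix))
# W 4
# W1 2
# W1V 1
```

Each extra character in the prefix narrows the range of terms, which is why longer prefixes are cheaper to evaluate.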
Attention
The prefix query and filter are useful for ad hoc prefix matching, but they should be used with care. They can be used freely on fields with a small number of terms, but they scale poorly and can put your cluster under a lot of strain. You can limit their impact on the cluster by using a longer prefix, which reduces the number of terms that need to be visited.
Later in this chapter, we will describe an index-time solution that makes prefix matching much more efficient. But first, let's take a look at two related queries: the wildcard and regexp queries.