Field-centric Queries
All three of the problems described above stem from most_fields being field-centric rather than term-centric: it looks for the most matching fields, when what we are really interested in is the most matching terms.
NOTE
The best_fields type is also field-centric and suffers from similar problems.
First, let's see why these problems exist and how to solve them.
Issue 1: Matching the same word in multiple fields
Consider how the most_fields query is executed: Elasticsearch generates a separate match query for each field and wraps them all in an outer bool query.
We can see this by passing the query through the validate-query API:
GET /_validate/query?explain
{ "query" : { "multi_match" : { "query" : "Poland Street w1v" , "type" : "most_fields" , "fields" : [ "street" , "city" , "country" , "postcode" ] } } }
This yields the following explanation:
(street:poland street:street street:w1v) (city:poland city:street city:w1v) (country:poland country:street country:w1v) (postcode:poland postcode:street postcode:w1v)
As you can see, a document that matches only poland in two fields could score higher than a document that matches both poland and street in a single field.
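To make the expansion concrete, here is a minimal Python sketch that builds roughly the same shape of bool query that a most_fields multi_match is rewritten into. The helper name expand_most_fields is hypothetical, and the real rewrite happens inside Elasticsearch; the sketch only mirrors the structure visible in the explanation above.

```python
import json


def expand_most_fields(query_text, fields):
    """Approximate the rewrite of a most_fields multi_match:
    one match clause per field, combined as should clauses."""
    return {
        "bool": {
            "should": [{"match": {field: query_text}} for field in fields]
        }
    }


# Field names are taken from the address example in the text.
expanded = expand_most_fields(
    "Poland Street w1v", ["street", "city", "country", "postcode"]
)
print(json.dumps(expanded, indent=2))
```

Each should clause is evaluated against its own field, which is why a word repeated across fields can add up to as much as two different words found in one field.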
Issue 2: Trimming the long tail
In Controlling Precision, we discussed using the and operator or the minimum_should_match parameter to trim the long tail of barely relevant results. Perhaps we could try this:
{ "query" : { "multi_match" : { "query" : "Poland Street w1v" , "type" : "most_fields" , "operator" : "and" , "fields" : [ "street" , "city" , "country" , "postcode" ] } } }
However, with best_fields or most_fields, these parameters are passed down to each generated match query. The validate-query API explains the query as follows:
(+street:poland +street:street +street:w1v) (+city:poland +city:street +city:w1v) (+country:poland +country:street +country:w1v) (+postcode:poland +postcode:street +postcode:w1v)
In other words, with the and operator, all of the words must appear in the same field, which is clearly wrong! It is unlikely that any document would match this query.
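The difference between the two interpretations of "and" can be illustrated with a small Python sketch. It does not use Elasticsearch at all: the whitespace tokenizer, the sample address document, and both function names are assumptions made purely for illustration.

```python
def tokens(text):
    """Very rough stand-in for an analyzer: lowercase and split on whitespace."""
    return set(text.lower().split())


def field_centric_and(doc, terms):
    # Every term must appear within one single field.
    return any(terms <= tokens(value) for value in doc.values())


def term_centric_and(doc, terms):
    # Every term must appear somewhere, in any field.
    all_tokens = set().union(*(tokens(value) for value in doc.values()))
    return terms <= all_tokens


# Hypothetical address document; the words are spread across fields.
address = {
    "street": "5 Poland Street",
    "city": "London",
    "country": "United Kingdom",
    "postcode": "W1V 3DG",
}
query_terms = tokens("Poland Street w1v")

print(field_centric_and(address, query_terms))
print(term_centric_and(address, query_terms))
```

The field-centric check fails because no single field holds all three words, while the term-centric check succeeds because each word is found in some field.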
Issue 3: Term frequencies
In What Is Relevance?, we explained that TF/IDF is the default similarity algorithm used to calculate the relevance score for each term:
Term frequency
The more often a term appears in a field of a single document, the more relevant the document.
Inverse document frequency
The more often a term appears in that field across all documents in the index, the less relevant the term.
When searching across multiple fields, TF/IDF can produce some surprising results.
Consider searching for "Peter Smith" using the first_name and last_name fields. Peter is a common first name and Smith is a common last name, so both have low IDFs. But what if the index also contains a person named Smith Williams? Smith is very uncommon as a first name, so its IDF in that field will be high!
A simple query like the following may well rank Smith Williams above Peter Smith, even though Peter Smith is clearly the better match:
{ "query" : { "multi_match" : { "query" : "Peter Smith" , "type" : "most_fields" , "fields" : [ "*_name" ] } }}
The high IDF of smith in the first_name field can overwhelm the two low IDFs of peter in the first_name field and smith in the last_name field.
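A rough worked example shows how this can happen. The Python sketch below assumes the classic Lucene-style formula idf = 1 + ln(numDocs / (docFreq + 1)) and uses made-up document frequencies; it ignores term frequency, field-length normalization, and every other scoring factor, so it illustrates the effect rather than reproducing real Elasticsearch scores.

```python
import math


def idf(num_docs, doc_freq):
    # Classic Lucene-style IDF (an assumption; other similarities differ).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


NUM_DOCS = 1000  # hypothetical index size

# Hypothetical per-field document frequencies.
doc_freq = {
    ("first_name", "peter"): 100,  # common first name
    ("first_name", "smith"): 1,    # very rare first name
    ("last_name", "smith"): 200,   # common last name
}


def toy_score(doc, query_terms):
    """Sum the IDF of every (field, term) pair that matches the document."""
    total = 0.0
    for field, value in doc.items():
        for term in query_terms:
            if term in value.lower().split():
                total += idf(NUM_DOCS, doc_freq[(field, term)])
    return total


terms = ["peter", "smith"]
peter_smith = {"first_name": "Peter", "last_name": "Smith"}
smith_williams = {"first_name": "Smith", "last_name": "Williams"}

print(round(toy_score(peter_smith, terms), 2))
print(round(toy_score(smith_williams, terms), 2))
```

With these invented counts, the single rare match on Smith Williams outweighs the two common matches on Peter Smith.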
Solution
These problems exist only because we are dealing with multiple fields. If we combined all of these fields into a single field, the problems would disappear. We could do this by adding a full_name field to the person document:
{ "first_name" : "Peter" , "last_name" : "Smith" , "full_name" : "Peter Smith" }
When we query only the full_name field:
- Documents that match more of the words rank above documents that merely repeat the same word.
- The minimum_should_match and operator parameters work as expected.
- The inverse document frequencies of first and last names are combined, so it no longer matters whether Smith is a first name or a last name.
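As a minimal sketch of the combined-field idea, the Python snippet below adds full_name on the client side before a document would be indexed, then ranks two sample people by how many query words appear in that one field. The helper names, the sample people, and the word-count ranking are all assumptions for illustration; real scoring is done by Elasticsearch.

```python
def with_full_name(person):
    """Return a copy of the document with a combined full_name field."""
    combined = dict(person)
    combined["full_name"] = " ".join(
        person[f] for f in ("first_name", "last_name") if person.get(f)
    )
    return combined


def matching_terms(doc, query):
    """Count the distinct query words found in the single full_name field."""
    return len(set(query.lower().split()) & set(doc["full_name"].lower().split()))


# Hypothetical sample people.
people = [
    {"first_name": "Smith", "last_name": "Williams"},
    {"first_name": "Peter", "last_name": "Smith"},
]
docs = [with_full_name(p) for p in people]

ranked = sorted(docs, key=lambda d: matching_terms(d, "Peter Smith"), reverse=True)
print([d["full_name"] for d in ranked])
```

Because both words are looked up in the same field, the document matching both of them comes first regardless of which original field each word came from.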
Although this approach works, we would rather not store redundant data. Instead, Elasticsearch offers two solutions, one at index time and one at search time, which are discussed in the next section.