Field-Centric Queries
All three of the problems mentioned above stem from the fact that most_fields is field-centric rather than term-centric: it looks for the most matching fields, when what we are really interested in is the most matching terms.
NOTE
The best_fields type is also field-centric, so it suffers from similar problems.
First, let's see why these problems exist and how to solve them.
Problem 1: Matching the same word in multiple fields
Consider how the most_fields query is executed: Elasticsearch generates a separate match query for each field and wraps them all in a bool query.
We can see this by passing the query through the validate-query API:
GET /_validate/query?explain
{
  "query": {
    "multi_match": {
      "query":  "Poland Street W1V",
      "type":   "most_fields",
      "fields": [ "street", "city", "country", "postcode" ]
    }
  }
}
It produces the following explanation:
(street:poland   street:street   street:w1v)
(city:poland     city:street     city:w1v)
(country:poland  country:street  country:w1v)
(postcode:poland postcode:street postcode:w1v)
As you can see, a document that matches only poland in two fields could score higher than a document that matches both poland and street in a single field.
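The effect can be illustrated with a toy model. The sketch below is not Elasticsearch's scoring; it simply counts one point per matching (field, term) clause, mirroring the shape of the explanation above, and compares that with counting each query term once. The two sample documents and the function names are made up for illustration.

```python
# Toy model of field-centric vs. term-centric matching.
# This is NOT real Elasticsearch scoring; every matching clause is worth 1.
QUERY_TERMS = ["poland", "street", "w1v"]
FIELDS = ["street", "city", "country", "postcode"]


def tokens(text):
    """Lowercase and split on whitespace (a stand-in for a real analyzer)."""
    return set(text.lower().split())


def field_centric_score(doc):
    """One clause per (field, term) pair; each matching clause adds 1."""
    return sum(
        1
        for field in FIELDS
        for term in QUERY_TERMS
        if term in tokens(doc.get(field, ""))
    )


def term_centric_score(doc):
    """Each query term counts once, however many fields contain it."""
    all_tokens = set()
    for field in FIELDS:
        all_tokens |= tokens(doc.get(field, ""))
    return sum(1 for term in QUERY_TERMS if term in all_tokens)


# Hypothetical documents: doc_a matches two different words in one field,
# doc_b matches the same word in two fields.
doc_a = {"street": "Poland Street", "city": "London"}
doc_b = {"street": "Poland Road", "country": "Poland"}

print(field_centric_score(doc_a), field_centric_score(doc_b))
print(term_centric_score(doc_a), term_centric_score(doc_b))
```

In this simplified model the field-centric count cannot tell the two documents apart, so any per-field weighting can push doc_b ahead, whereas the term-centric count prefers doc_a because it matches more distinct words.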
Problem 2: Trimming the long tail
In the Controlling Precision section, we discussed using the and operator or the minimum_should_match parameter to trim the long tail of barely relevant documents:
{
  "query": {
    "multi_match": {
      "query":    "Poland Street W1V",
      "type":     "most_fields",
      "operator": "and",
      "fields":   [ "street", "city", "country", "postcode" ]
    }
  }
}
However, with best_fields or most_fields, these parameters are passed down to each generated match query. The explanation for this query (obtained through the validate-query API) is as follows:
(+street:poland   +street:street   +street:w1v)
(+city:poland     +city:street     +city:w1v)
(+country:poland  +country:street  +country:w1v)
(+postcode:poland +postcode:street +postcode:w1v)
In other words, with the and operator, all of the words must appear in the same field, which is clearly wrong. There may not be any matching documents at all.
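A minimal sketch of why this goes wrong, assuming a simple whitespace tokenizer and a hypothetical address document. It contrasts the per-field interpretation of and shown in the explanation above with the term-centric interpretation we actually want; neither function is part of any Elasticsearch API.

```python
QUERY_TERMS = ["poland", "street", "w1v"]
FIELDS = ["street", "city", "country", "postcode"]


def tokens(text):
    """Lowercase and split on whitespace (a stand-in for a real analyzer)."""
    return set(text.lower().split())


def matches_per_field_and(doc):
    """Field-centric: some single field must contain every query term."""
    return any(
        all(term in tokens(doc.get(field, "")) for term in QUERY_TERMS)
        for field in FIELDS
    )


def matches_across_fields_and(doc):
    """Term-centric: every query term must appear in at least one field."""
    return all(
        any(term in tokens(doc.get(field, "")) for field in FIELDS)
        for term in QUERY_TERMS
    )


# Hypothetical document that is obviously the address being searched for.
address = {
    "street": "Poland Street",
    "city": "London",
    "country": "United Kingdom",
    "postcode": "W1V",
}

print(matches_per_field_and(address))      # no single field holds all three words
print(matches_across_fields_and(address))  # every word is found somewhere
```

The document contains every query word, yet the per-field check rejects it because no single field holds all three.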
Problem 3: Term frequencies
In What Is Relevance?, we explained that TF/IDF is the default similarity algorithm used to calculate the relevance score for each term:
Term frequency
The more often a term appears in a field within a single document, the more relevant that document is.
Inverse document frequency
The more often a term appears in a field across all documents in the index, the less relevant that term is.
When searching through multiple fields, TF/IDF will produce some surprising results.
Consider searching for "Peter Smith" using the first_name and last_name fields. Peter is a common first name and Smith is a common last name, so both have a low IDF. But what if the index also contains a person named Smith Williams? Smith is very rare as a first name, so it has a very high IDF there.
A simple query like the following may rank Smith Williams above Peter Smith, even though Peter Smith is clearly the better match:
{
  "query": {
    "multi_match": {
      "query":  "Peter Smith",
      "type":   "most_fields",
      "fields": [ "*_name" ]
    }
  }
}
The high IDF of smith in the first_name field can overwhelm the two low IDFs of peter in the first_name field and smith in the last_name field.
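The arithmetic can be checked on a small made-up index. The sketch below assumes the classic Lucene-style formula idf = 1 + ln(N / (df + 1)), computed per field, and scores a document by simply summing the IDF of each matching (field, term) pair. Real scoring also involves term frequency, field-length norms, and other factors, so treat the numbers as illustrative only. The 100-person index is hypothetical.

```python
import math

# Hypothetical index of 100 people as (first_name, last_name) pairs:
# "peter" is a common first name, "smith" a common last name,
# and exactly one person has "smith" as a first name.
people = [("peter", "smith")]
people += [("peter", f"surname{i}") for i in range(49)]
people += [(f"name{i}", "smith") for i in range(49)]
people += [("smith", "williams")]


def idf(term, field_index):
    """Per-field IDF, assuming the classic form 1 + ln(N / (df + 1))."""
    n_docs = len(people)
    df = sum(1 for person in people if person[field_index] == term)
    return 1 + math.log(n_docs / (df + 1))


def most_fields_score(person, query_terms):
    """Sum the IDF of every (field, term) match; TF is 1 in this toy data."""
    score = 0.0
    for field_index in (0, 1):  # 0 = first_name, 1 = last_name
        for term in query_terms:
            if person[field_index] == term:
                score += idf(term, field_index)
    return score


query = ["peter", "smith"]
peter_smith = most_fields_score(("peter", "smith"), query)
smith_williams = most_fields_score(("smith", "williams"), query)

print(f"Peter Smith:    {peter_smith:.3f}")
print(f"Smith Williams: {smith_williams:.3f}")
```

Under these assumptions the single rare match on first_name outweighs two common matches, so Smith Williams ends up ranked first.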
Solution
These problems exist only because we are dealing with multiple fields. If we combined all of these fields into a single field, the problems would disappear. We could achieve this by adding a full_name field to the person document:
{
  "first_name": "Peter",
  "last_name":  "Smith",
  "full_name":  "Peter Smith"
}
When we query only the full_name field:
- Documents that match more words beat documents that repeat the same word.
- The minimum_should_match and operator parameters work as expected.
- The inverse document frequencies of first_name and last_name are combined, so it no longer matters whether smith appears as a first name or a last name.
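As a rough sketch of the approach described above, the helpers below add the combined field to a person document before it is indexed and build a plain match query against that single field. The function names are hypothetical, and sending the document and the query to a cluster is left out.

```python
def with_full_name(person):
    """Return a copy of the document with a combined full_name field added."""
    doc = dict(person)
    doc["full_name"] = f"{person['first_name']} {person['last_name']}"
    return doc


def full_name_query(text):
    """Build a match query on the single combined field.

    With only one field involved, the and operator requires every word
    to appear somewhere in the person's full name.
    """
    return {
        "query": {
            "match": {
                "full_name": {"query": text, "operator": "and"}
            }
        }
    }


doc = with_full_name({"first_name": "Peter", "last_name": "Smith"})
body = full_name_query("Peter Smith")
print(doc)
print(body)
```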
This approach works, but we do not want to store redundant data. Instead, Elasticsearch offers two solutions, one at index time and one at search time, which are discussed in the next section.