[Elasticsearch] Multi-field search (V): Field-centric queries

Last Update:2014-12-11 Source: Internet

Author: User



Field-centric queries

The three problems mentioned above all stem from most_fields being field-centric rather than term-centric: it looks for the most matching fields, when what we are really interested in is the most matching terms.

NOTE

best_fields is also field-centric, so it has similar problems.

First, let's see why these problems exist and how to solve them.

Problem 1: Matching the same word in multiple fields

Consider how the most_fields query is executed: Elasticsearch generates a match query for each field and wraps them all in a bool query.

We can see this by passing the query to the validate-query API:

GET /_validate/query?explain
{
  "query": {
    "multi_match": {
      "query":  "Poland Street W1V",
      "type":   "most_fields",
      "fields": [ "street", "city", "country", "postcode" ]
    }
  }
}

It produces the following explanation:

(street:poland street:street street:w1v) (city:poland city:street city:w1v) (country:poland country:street country:w1v) (postcode:poland postcode:street postcode:w1v)

You can see that a document matching poland in two fields could score higher than a document matching both poland and street in a single field.
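As a rough illustration of this expansion, the Python sketch below builds a bool query with one match clause per field and renders it in the style of the explanation shown above. The helper names (analyze, expand_most_fields, explain) and the lowercase-and-split "analyzer" are assumptions made for this sketch; they are not part of Elasticsearch.

```python
def analyze(text):
    # Stand-in for a real analyzer (assumption): lowercase, split on whitespace.
    return text.lower().split()


def expand_most_fields(query, fields):
    """Build a bool query holding one match clause per field."""
    return {
        "bool": {
            "should": [{"match": {field: query}} for field in fields]
        }
    }


def explain(bool_query):
    """Render the bool query as field:term groups, one group per field."""
    groups = []
    for clause in bool_query["bool"]["should"]:
        field, text = next(iter(clause["match"].items()))
        terms = " ".join(f"{field}:{term}" for term in analyze(text))
        groups.append(f"({terms})")
    return " ".join(groups)


fields = ["street", "city", "country", "postcode"]
expanded = expand_most_fields("Poland Street W1V", fields)
print(explain(expanded))
```

Each parenthesized group is scored on its own field, which is why a term repeated across fields can count for as much as two different terms found in one field.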

Problem 2: Trimming the long tail

In the section on controlling precision, we discussed how to use the and operator and the minimum_should_match parameter to trim the long tail of barely relevant results:

{
  "query": {
    "multi_match": {
      "query":    "Poland Street W1V",
      "type":     "most_fields",
      "operator": "and",
      "fields":   [ "street", "city", "country", "postcode" ]
    }
  }
}

However, with best_fields or most_fields, these parameters are passed down to each generated match query. The explanation for this query (from the validate-query API) is as follows:

(+street:poland +street:street +street:w1v) (+city:poland +city:street +city:w1v) (+country:poland +country:street +country:w1v) (+postcode:poland +postcode:street +postcode:w1v)

In other words, with the and operator, all of the words must appear in the same field, which is clearly wrong! There may be no documents at all that match this query.
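To make the difference concrete, here is a minimal Python sketch that compares the two interpretations of the and operator on a single hypothetical address document. The document contents and the function names are invented for illustration, and the whitespace tokenizer is only a stand-in for a real analyzer.

```python
def analyze(text):
    # Stand-in for a real analyzer (assumption): lowercase, split on whitespace.
    return text.lower().split()


def field_centric_and(doc, query, fields):
    """Match only if some single field contains every query term."""
    terms = analyze(query)
    return any(
        all(term in analyze(doc.get(field, "")) for term in terms)
        for field in fields
    )


def term_centric_and(doc, query, fields):
    """Match if every query term appears in at least one of the fields."""
    terms = analyze(query)
    return all(
        any(term in analyze(doc.get(field, "")) for field in fields)
        for term in terms
    )


# Hypothetical document, used only for this example.
address = {
    "street": "5 Poland Street",
    "city": "London",
    "country": "United Kingdom",
    "postcode": "W1V 3DG",
}
fields = ["street", "city", "country", "postcode"]

print(field_centric_and(address, "Poland Street W1V", fields))
print(term_centric_and(address, "Poland Street W1V", fields))
```

The field-centric check rejects the document because no single field holds all three terms, while the term-centric check accepts it because every term is found somewhere in the address.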

Problem 3: Term frequencies

In the section What is relevance?, we explained that TF/IDF is the similarity algorithm used by default to calculate the relevance score for each term:

Term frequency

The more often a term appears in a field in a single document, the more relevant the document is.

Inverse document frequency

The more often a term appears in that field across all documents in the index, the less relevant the term is.

When searching through multiple fields, TF/IDF produces some surprising results.

Consider searching for "Peter Smith" using the first_name and last_name fields. Peter is a common first name and Smith is a common last name, so both have low IDFs. But what if the index also contains a person named Smith Williams? Smith is very rare as a first name, so its IDF in that field will be very high!

A simple query like the one below may rank Smith Williams above Peter Smith, even though Peter Smith is clearly the better match:

{
  "query": {
    "multi_match": {
      "query":  "Peter Smith",
      "type":   "most_fields",
      "fields": [ "*_name" ]
    }
  }
}

The high IDF of smith in the first_name field can overwhelm the two low IDFs of peter in the first_name field and smith in the last_name field.
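The effect can be reproduced with a toy calculation. The Python sketch below uses a hypothetical 100-person index and a deliberately simplified score: it sums a classic Lucene-style IDF, 1 + ln(N / (df + 1)), for every field/term match and ignores term frequency, field norms, and coordination. Those simplifications, the data, and the function names are all assumptions of this sketch, so the numbers will not match real Elasticsearch scores.

```python
import math

# Hypothetical toy index: Peter Smith is common, Smith Williams is unique.
people = (
    [{"first_name": "Peter", "last_name": "Smith"}] * 40
    + [{"first_name": "Mary", "last_name": "Jones"}] * 59
    + [{"first_name": "Smith", "last_name": "Williams"}]
)


def idf(term, field, docs):
    # Per-field inverse document frequency: 1 + ln(N / (df + 1)).
    df = sum(1 for doc in docs if term in doc[field].lower().split())
    return 1 + math.log(len(docs) / (df + 1))


def most_fields_score(doc, query, fields, docs):
    # Simplified score (assumption): add up the IDF of each field/term match.
    score = 0.0
    for field in fields:
        for term in query.lower().split():
            if term in doc[field].lower().split():
                score += idf(term, field, docs)
    return score


fields = ["first_name", "last_name"]
peter_smith = most_fields_score(people[0], "Peter Smith", fields, people)
smith_williams = most_fields_score(people[-1], "Peter Smith", fields, people)

print(f"Peter Smith:    {peter_smith:.2f}")
print(f"Smith Williams: {smith_williams:.2f}")
```

In this toy index the single rare match (smith as a first name) outweighs the two common matches, so Smith Williams comes out ahead.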

Solution

These problems exist only because we are dealing with multiple fields. If we combined all of these fields into a single field, the problems would disappear. We could do this by adding a full_name field to the person document:

{ "first_name" : "Peter" , "last_name" : "Smith" , "full_name" : "Peter Smith" }

When we query only the full_name field:

Documents that match more of the query words outrank documents that repeat a single word.
The minimum_should_match and operator parameters work as expected.
The inverse document frequencies of first_name and last_name are merged, so it no longer matters whether Smith appears as a first name or a last name.
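The last point can be checked with a small toy calculation. The Python sketch below adds a combined full_name field to a hypothetical 100-person index and scores a "Peter Smith" query against that one field only, summing a classic Lucene-style IDF of 1 + ln(N / (df + 1)) per matching term. The data, the helper names, and the simplified score (no term frequency, norms, or coordination) are assumptions of this sketch.

```python
import math

# Hypothetical toy index: Peter Smith is common, Smith Williams is unique.
people = (
    [{"first_name": "Peter", "last_name": "Smith"}] * 40
    + [{"first_name": "Mary", "last_name": "Jones"}] * 59
    + [{"first_name": "Smith", "last_name": "Williams"}]
)


def with_full_name(doc):
    # Copy the document and add the combined full_name field.
    return {**doc, "full_name": f"{doc['first_name']} {doc['last_name']}"}


docs = [with_full_name(person) for person in people]


def idf(term, field, docs):
    # Inverse document frequency within one field: 1 + ln(N / (df + 1)).
    df = sum(1 for doc in docs if term in doc[field].lower().split())
    return 1 + math.log(len(docs) / (df + 1))


def single_field_score(doc, query, field, docs):
    # Simplified score (assumption): add up the IDF of each matching term.
    tokens = doc[field].lower().split()
    return sum(idf(t, field, docs) for t in query.lower().split() if t in tokens)


peter_smith = single_field_score(docs[0], "Peter Smith", "full_name", docs)
smith_williams = single_field_score(docs[-1], "Peter Smith", "full_name", docs)

print(f"Peter Smith:    {peter_smith:.2f}")
print(f"Smith Williams: {smith_williams:.2f}")
```

Because smith now has one document frequency for the whole name, the document that matches both query terms ranks first.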

Although this approach works, we would rather not store redundant data. Instead, Elasticsearch offers two solutions, one at index time and one at search time. They are discussed in the next section.
