[Elasticsearch] Multi-field search (5): Field-centric queries

Source: Internet
Author: User
Tags idf

Field-centric queries

All three of the problems mentioned above stem from the fact that most_fields is field-centric rather than term-centric: it looks for the most matching fields, when what we are really interested in is the most matching terms.

NOTE

The best_fields type is also field-centric, so it suffers from similar problems.

First, let's see why these problems exist, and then how to solve them.

Problem 1: Matching the same word in multiple fields

Consider how the most_fields query is executed: ES will generate a match query for each field and include them in a bool query.

We can see this by passing the query through the validate-query API:

GET /_validate/query?explain
{
  "query": {
    "multi_match": {
      "query":   "Poland Street W1V",
      "type":    "most_fields",
      "fields":  [ "street", "city", "country", "postcode" ]
    }
  }
}

It produces the following explanation:

(street:poland   street:street   street:w1v)
(city:poland     city:street     city:w1v)
(country:poland  country:street  country:w1v)
(postcode:poland postcode:street postcode:w1v)

You can see that a document matching poland in two fields could score higher than a document matching both poland and street in a single field.
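
The effect can be mimicked outside Elasticsearch. The following Python sketch is a toy model, not Elasticsearch's actual scoring: the helper names (field_centric_score, term_centric_score) and the two sample documents are hypothetical, and each "score" is only a count. It compares how many fields contain at least one query term, which is what the per-field expansion rewards, with how many distinct query terms a document contains.

```python
# Toy model only: real Elasticsearch scores use TF/IDF, not plain counts.
QUERY_TERMS = ["poland", "street", "w1v"]
FIELDS = ["street", "city", "country", "postcode"]


def tokens(text):
    """Lowercase and split on whitespace, a stand-in for a real analyzer."""
    return text.lower().split()


def field_centric_score(doc):
    """Count the fields that match at least one query term (one clause per field)."""
    return sum(
        1
        for field in FIELDS
        if any(term in tokens(doc.get(field, "")) for term in QUERY_TERMS)
    )


def term_centric_score(doc):
    """Count the distinct query terms found anywhere in the document."""
    all_tokens = {tok for field in FIELDS for tok in tokens(doc.get(field, ""))}
    return sum(1 for term in QUERY_TERMS if term in all_tokens)


# Hypothetical documents: doc_a repeats "poland" in two fields,
# doc_b has both "poland" and "street" in a single field.
doc_a = {"street": "Poland Road", "country": "Poland"}
doc_b = {"street": "Poland Street", "city": "London"}

print(field_centric_score(doc_a), field_centric_score(doc_b))  # 2 1
print(term_centric_score(doc_a), term_centric_score(doc_b))    # 1 2
```

Under the field-centric count doc_a wins, even though doc_b matches more of the words the user typed.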

Problem 2: Trimming the long tail

In the Controlling Precision section, we discussed how to use the and operator and the minimum_should_match parameter to reduce the number of documents with low relevance:

{
    "query": {
        "multi_match": {
            "query":       "Poland Street W1V",
            "type":        "most_fields",
            "operator":    "and",
            "fields":      [ "street", "city", "country", "postcode" ]
        }
    }
}

However, with best_fields or most_fields, these parameters are passed down to each generated match query. The validate-query API shows how the query is interpreted:

(+street:poland   +street:street   +street:w1v)
(+city:poland     +city:street     +city:w1v)
(+country:poland  +country:street  +country:w1v)
(+postcode:poland +postcode:street +postcode:w1v)

In other words, when the and operator is used, all words must appear in the same field, which is clearly wrong! There may not be any documents that match this query.
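
The difference between the two readings of and can be sketched in a few lines of Python. This is an illustration under simplifying assumptions (whitespace tokenization, a hypothetical address document, made-up function names), not the way Elasticsearch evaluates queries internally.

```python
QUERY_TERMS = ["poland", "street", "w1v"]
FIELDS = ["street", "city", "country", "postcode"]


def tokens(text):
    """Lowercase and split on whitespace, a stand-in for a real analyzer."""
    return text.lower().split()


def matches_and_per_field(doc):
    """Field-centric: some single field must contain every query term."""
    return any(
        all(term in tokens(doc.get(field, "")) for term in QUERY_TERMS)
        for field in FIELDS
    )


def matches_and_across_fields(doc):
    """Term-centric: every query term must appear in at least one field."""
    return all(
        any(term in tokens(doc.get(field, "")) for field in FIELDS)
        for term in QUERY_TERMS
    )


# Hypothetical document that a user searching for "Poland Street W1V" expects to find.
address = {
    "street": "Poland Street",
    "city": "London",
    "country": "United Kingdom",
    "postcode": "W1V",
}

print(matches_and_per_field(address))      # False
print(matches_and_across_fields(address))  # True
```

The per-field check rejects the address because no single field holds all three words, which is the behavior the generated query above describes.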

Problem 3: Term frequencies

In What is Relevance, we explained that TF/IDF, the default similarity algorithm, is used to calculate the relevance score of each term:

Term frequency

The more often a term appears in a field in a single document, the more relevant the document.

Inverse document frequency

The more often a term appears in a field across all documents in the index, the less relevant the term.

When searching through multiple fields, TF/IDF will produce some surprising results.

Consider searching for "Peter Smith" using the first_name and last_name fields. Peter is a common first name and Smith is a common last name, so both have low IDFs. But what if there is another person in the index named Smith Williams? Smith is very rare as a first name, so it has a high IDF!

A simple query like the following may rank Smith Williams above Peter Smith, even though Peter Smith is clearly the better match:

{
    "query": {
        "multi_match": {
            "query":       "Peter Smith",
            "type":        "most_fields",
            "fields":      [ "*_name" ]
        }
    }
}

The high IDF of smith in the first_name field can overwhelm the two low IDFs of peter in the first_name field and smith in the last_name field.
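
A small numeric sketch makes the imbalance concrete. It assumes the classic Lucene-style formula idf = 1 + ln(numDocs / (docFreq + 1)), a hypothetical index of 100 people, and a toy score that simply sums the IDF of each matching field/term pair; term frequency, field-length norms, and every other factor of the real score are ignored.

```python
import math

# Hypothetical index: 30 people called Peter Smith, 69 called John Jones,
# and a single Smith Williams.
corpus = (
    [{"first_name": "peter", "last_name": "smith"}] * 30
    + [{"first_name": "john", "last_name": "jones"}] * 69
    + [{"first_name": "smith", "last_name": "williams"}]
)


def idf(field, term):
    """Per-field IDF, assuming 1 + ln(num_docs / (doc_freq + 1))."""
    doc_freq = sum(1 for doc in corpus if doc[field] == term)
    return 1 + math.log(len(corpus) / (doc_freq + 1))


def toy_score(doc, query_terms, fields):
    """Sum the IDF of every (field, term) pair that matches; ignores TF and norms."""
    return sum(
        idf(field, term)
        for field in fields
        for term in query_terms
        if doc[field] == term
    )


query = ["peter", "smith"]
fields = ["first_name", "last_name"]

peter_smith = {"first_name": "peter", "last_name": "smith"}
smith_williams = {"first_name": "smith", "last_name": "williams"}

print(round(toy_score(peter_smith, query, fields), 2))     # 4.34
print(round(toy_score(smith_williams, query, fields), 2))  # 4.91
```

In this toy index the single rare match on first_name outweighs two common matches, so Smith Williams comes out ahead.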

Solution

These problems only exist because we are dealing with multiple fields. If we combined all of these fields into a single field, the problems would disappear. We could achieve this by adding a full_name field to the person document:

{
    "first_name":  "Peter",
    "last_name":   "Smith",
    "full_name":   "Peter Smith"
}

When we only query the full_name field:

  • Documents that match more words rank above documents that repeat the same word.
  • The minimum_should_match and operator parameters work as expected.
  • The inverse document frequencies of first_name and last_name are merged, so it no longer matters whether smith appears as a first name or as a last name.

Although this approach works, we would rather not store redundant data. Elasticsearch therefore offers two solutions, one at index time and one at search time. They are discussed in the next section.
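
The third point can be checked with the same kind of toy model. The sketch below again assumes the classic Lucene-style formula idf = 1 + ln(numDocs / (docFreq + 1)) and a hypothetical index of 100 people; the only change is that both names are merged into one full_name field before document frequencies are counted.

```python
import math

# Hypothetical people: 30 Peter Smiths, 69 John Joneses, one Smith Williams.
people = (
    [("peter", "smith")] * 30
    + [("john", "jones")] * 69
    + [("smith", "williams")]
)
# Index-time step: merge both names into a single full_name token list.
corpus = [{"full_name": [first, last]} for first, last in people]


def idf(term):
    """IDF over the combined field, assuming 1 + ln(num_docs / (doc_freq + 1))."""
    doc_freq = sum(1 for doc in corpus if term in doc["full_name"])
    return 1 + math.log(len(corpus) / (doc_freq + 1))


def toy_score(doc, query_terms):
    """Sum the IDF of every query term found in full_name; ignores TF and norms."""
    return sum(idf(term) for term in query_terms if term in doc["full_name"])


query = ["peter", "smith"]
peter_smith = {"full_name": ["peter", "smith"]}
smith_williams = {"full_name": ["smith", "williams"]}

print(round(toy_score(peter_smith, query), 2))     # 4.31
print(round(toy_score(smith_williams, query), 2))  # 2.14
```

With a single document frequency for smith, the document that matches both words ranks first.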
