[Elasticsearch] Controlling Relevance (4): Ignoring TF/IDF

Source: Internet
Author: User
Tags idf

This chapter is translated from the Controlling Relevance chapter of the official Elasticsearch guide.


Ignoring TF/IDF

Sometimes we don't care about TF/IDF. All we want to know is whether a particular word appears in a field. For example, suppose we are searching for a resort and want it to have as many of the following features as possible:

    • WiFi
    • Garden
    • Pool

A resort document looks something like this:

    { "description": "..." }

We could use a simple match query:

    GET /_search
    {
      "query": {
        "match": {
          "description": "wifi garden pool"
        }
      }
    }

However, this isn't really full-text search. Here, TF/IDF just gets in the way. We don't care whether wifi is a common term, or how often it appears in a document. All we care about is whether it appears at all. In fact, we just want to rank the resorts by the number of features they have: the more, the better. A feature that is present should score 1; a feature that is absent should score 0.
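The scoring we are after can be sketched outside Elasticsearch in a few lines of Python. This is only an illustration of the intended behaviour, not how Elasticsearch computes scores; the function name `feature_score` and the naive whitespace tokenization are assumptions made for the example.

```python
def feature_score(description, wanted):
    """Return 1 point for each wanted feature that appears in the text."""
    # Naive tokenization: lowercase and split on whitespace (an assumption;
    # a real analyzer would also handle punctuation, stemming, and so on).
    tokens = set(description.lower().split())
    return sum(1 for feature in wanted if feature.lower() in tokens)


wanted = ["wifi", "garden", "pool"]
print(feature_score("Cottage with garden and free wifi", wanted))    # 2
print(feature_score("House with a pool and a second pool", wanted))  # 1
```

Note that the second description mentions the pool twice but still earns only one point: how often a term appears makes no difference, only whether it appears.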

constant_score Query

First, we introduce the constant_score query. It can wrap either a query or a filter, and it assigns a relevance score of 1 to every matching document, regardless of TF/IDF:

    GET /_search
    {
      "query": {
        "bool": {
          "should": [
            { "constant_score": {
              "query": { "match": { "description": "wifi" }}
            }},
            { "constant_score": {
              "query": { "match": { "description": "garden" }}
            }},
            { "constant_score": {
              "query": { "match": { "description": "pool" }}
            }}
          ]
        }
      }
    }

Perhaps not all features are equally important; some are worth more than others. If the most important feature is the pool, we can boost that clause so that it counts for more:

    GET /_search
    {
      "query": {
        "bool": {
          "should": [
            { "constant_score": {
              "query": { "match": { "description": "wifi" }}
            }},
            { "constant_score": {
              "query": { "match": { "description": "garden" }}
            }},
            { "constant_score": {
              "boost": 2,
              "query": { "match": { "description": "pool" }}
            }}
          ]
        }
      }
    }
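When the list of features comes from user input, a request body like this is usually generated rather than written by hand. The following is a minimal sketch of such a generator; `build_feature_query` is a hypothetical helper, not part of any Elasticsearch client, and it mirrors the request format used in this article (with `query` inside `constant_score`), which may differ from what your Elasticsearch version expects.

```python
import json


def build_feature_query(field, boosts):
    """Build a bool query with one constant_score clause per feature.

    `boosts` maps each feature term to its boost (1 means no extra weight).
    """
    should = []
    for term, boost in boosts.items():
        clause = {"query": {"match": {field: term}}}
        if boost != 1:
            # Only emit "boost" when it differs from the default of 1.
            clause["boost"] = boost
        should.append({"constant_score": clause})
    return {"query": {"bool": {"should": should}}}


body = build_feature_query("description", {"wifi": 1, "garden": 1, "pool": 2})
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `GET /_search` request.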

NOTE

The final score for each result is not simply the sum of the scores of all matching clauses. The coordination factor and the query normalization factor are still taken into account.
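To see why this matters, here is a simplified sketch of how the coordination factor changes the totals. It assumes the classic Lucene scoring model, in which the coordination factor is the fraction of query clauses that matched. Query normalization is left out because it is the same for every document in a given query and so does not change the ranking. The function name `bool_should_score` is made up for this example.

```python
def bool_should_score(clause_scores, total_clauses):
    """Sum the matching clause scores, then apply the coordination factor.

    clause_scores: scores of the should clauses that matched the document.
    total_clauses: number of should clauses in the query.
    """
    coord = len(clause_scores) / total_clauses
    return sum(clause_scores) * coord


# wifi (1) and garden (1) matched, pool did not
print(bool_should_score([1, 1], 3))
# only pool (boost 2) matched
print(bool_should_score([2], 3))
```

Both documents have clause scores that sum to 2, but under this model the one matching two clauses ends up with twice the score of the one matching only the boosted pool clause.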

We could improve our resort documents by adding a not_analyzed features field:

    { "features": [ "wifi", "pool", "garden" ] }

By default, a not_analyzed field has field-length norms disabled and its index_options set to docs, which disables term frequencies. But the problem remains: the inverse document frequency of each term is still taken into account.
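The effect of the remaining factor can be illustrated with the inverse document frequency formula from Lucene's classic TF/IDF similarity, `1 + ln(numDocs / (docFreq + 1))`. The document counts below are made up for illustration, and `classic_idf` is just a name chosen for this sketch.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Inverse document frequency as in Lucene's classic TF/IDF similarity."""
    return 1 + math.log(num_docs / (doc_freq + 1))


num_docs = 1000
# Hypothetical document frequencies, made up for illustration.
print(classic_idf(900, num_docs))  # a common feature, e.g. wifi
print(classic_idf(50, num_docs))   # a rarer feature, e.g. pool
```

Even with norms and term frequencies out of the picture, the rarer term gets a noticeably higher weight, so a plain match on this field would still not give each feature the same score.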

We can use the same constant_score approach as before:

    GET /_search
    {
      "query": {
        "bool": {
          "should": [
            { "constant_score": {
              "query": { "match": { "features": "wifi" }}
            }},
            { "constant_score": {
              "query": { "match": { "features": "garden" }}
            }},
            { "constant_score": {
              "boost": 2,
              "query": { "match": { "features": "pool" }}
            }}
          ]
        }
      }
    }

Really, each of these features should be treated as a filter. A resort either has a feature or it doesn't, so a filter seems like a natural fit. On top of that, if we use filters, we also benefit from filter caching.

The problem is that filters don't compute relevance scores. What we need is a way to bridge the gap between filters and queries. The function_score query does exactly this, and more.

