This chapter is translated from the "Controlling Relevance" chapter of the official Elasticsearch guide.
Ignoring TF/IDF
Sometimes we don't care about TF/IDF. All we want to know is whether a particular word appears in a field. For example, suppose we are searching for a resort and want it to have as many of the following selling points as possible:
- WiFi
- Garden
- Pool
A document describing a resort looks something like this:
{ "description": "..." }
You can use a simple match query:
GET /_search
{ "query": { "match": { "description": "wifi garden pool" }}}
What we need, however, is not really full-text search. Here TF/IDF just gets in the way. We don't care whether wifi is a common term or how often it appears in a document. All we care about is whether it appears at all. In fact, we just want to rank these resorts by the number of selling points they have: the more the better. If a selling point is present, it scores 1; if it is absent, it scores 0.
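The ranking we are after can be sketched in a few lines of Python. This is only an illustration of the intended 1-or-0 behavior, not how Elasticsearch computes scores; the function name presence_score and the sample descriptions are invented for the example.

```python
def presence_score(description, selling_points):
    """Count how many selling points appear in the description.

    Each selling point contributes 1 if present and 0 if absent,
    no matter how often it occurs or how common it is elsewhere.
    """
    # Crude stand-in for text analysis: lowercase and split on whitespace.
    tokens = set(description.lower().split())
    return sum(1 for point in selling_points if point in tokens)


wanted = ["wifi", "garden", "pool"]

# Hypothetical sample descriptions, invented for this sketch.
resorts = {
    "a": "Quiet house with wifi and a large garden",
    "b": "Pool pool pool and more pool",
    "c": "City flat with wifi garden and pool",
}

# Rank resorts by the number of distinct selling points they mention.
ranked = sorted(
    resorts,
    key=lambda name: presence_score(resorts[name], wanted),
    reverse=True,
)
print(ranked)
```

Resort "b" mentions the pool four times but still counts only one selling point, so it ranks last; repetition earns nothing under this model.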
constant_score Query
First, let's introduce the constant_score query. It can wrap either a query or a filter, and it assigns a relevance score of 1 to every matching document, regardless of TF/IDF:
GET /_search
{ "query": { "bool": { "should": [
  { "constant_score": { "query": { "match": { "description": "wifi" }}}},
  { "constant_score": { "query": { "match": { "description": "garden" }}}},
  { "constant_score": { "query": { "match": { "description": "pool" }}}}
]}}}
Not all selling points are necessarily equally important; some are worth more to the user than others. If the pool matters most, we can boost that clause so that it counts for more:
GET /_search
{ "query": { "bool": { "should": [
  { "constant_score": { "query": { "match": { "description": "wifi" }}}},
  { "constant_score": { "query": { "match": { "description": "garden" }}}},
  { "constant_score": { "boost": 2, "query": { "match": { "description": "pool" }}}}
]}}}
NOTE
The final score for each result is not simply the sum of the scores of all matching clauses. The coordination factor and the query normalization factor are still taken into account.
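To see why the final score is not just a sum, here is a rough Python model of how a bool query combines constant_score clauses under the classic TF/IDF similarity. It assumes that each matching clause contributes its boost and that the sum is multiplied by the coordination factor (matching clauses divided by total clauses). It leaves out the query normalization factor, which is the same for every document in a given query and so does not change the ordering. The function and variable names are hypothetical.

```python
def bool_should_score(doc_terms, clauses):
    """Approximate score of a bool/should query over constant_score clauses.

    clauses is a list of (term, boost) pairs. Query normalization is
    deliberately omitted: it scales every document by the same amount.
    """
    matched = [boost for term, boost in clauses if term in doc_terms]
    if not matched:
        return 0.0
    coord = len(matched) / len(clauses)  # coordination factor
    return coord * sum(matched)


clauses = [("wifi", 1.0), ("garden", 1.0), ("pool", 2.0)]

wifi_and_garden = bool_should_score({"wifi", "garden"}, clauses)
pool_only = bool_should_score({"pool"}, clauses)
all_three = bool_should_score({"wifi", "garden", "pool"}, clauses)

print(wifi_and_garden, pool_only, all_three)
```

A plain sum would give the first two resorts the same score of 2. With the coordination factor applied, the resort matching two clauses (about 1.33) outranks the one matching only the boosted pool clause (about 0.67).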
We can improve the resort documents by adding a not_analyzed features field:
{ "features": [ "wifi", "pool", "garden" ] }
By default, a not_analyzed field has field-length norms disabled and its index_options set to docs, which disables term frequencies. But the problem remains: the inverse document frequency of each term is still taken into account.
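To illustrate why inverse document frequency still skews the results, consider the idf formula used by Lucene's classic TF/IDF similarity, idf = 1 + ln(numDocs / (docFreq + 1)). The Python sketch below applies it to made-up document counts; the numbers are assumptions chosen only to show the effect.

```python
import math


def idf(num_docs, doc_freq):
    """Inverse document frequency as in Lucene's classic similarity."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical counts: 1,000 resorts, most with wifi, few with a pool.
num_docs = 1000
doc_freq = {"wifi": 900, "garden": 400, "pool": 50}

# Rarer features receive a larger weight, even with norms and
# term frequencies disabled on the field.
for feature, freq in doc_freq.items():
    print(feature, round(idf(num_docs, freq), 2))
```

Even though every feature is either present or absent, the rarer pool term would weigh more than wifi simply because fewer documents contain it, which is not the behavior we want here.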
We can take the same approach as before, using constant_score queries inside a bool query:
GET /_search
{ "query": { "bool": { "should": [
  { "constant_score": { "query": { "match": { "features": "wifi" }}}},
  { "constant_score": { "query": { "match": { "features": "garden" }}}},
  { "constant_score": { "boost": 2, "query": { "match": { "features": "pool" }}}}
]}}}
Really, though, each selling point should be treated as a filter. A resort either has a selling point or it does not, so a filter is the natural fit. On top of that, filters let us benefit from filter caching.
The obstacle is that filters do not compute a relevance score. What we need is a bridge between filters and queries. The function_score query provides exactly that, and a good deal more.