Adapted from Elasticsearch: The Definitive Guide.
I. Sorting and relevance

By default, results are sorted by relevance: the more relevant a document, the higher it ranks. In this chapter we explain what relevance means and how it is calculated. Before that, though, let's look at how the sort parameter is used.
Sorting

For results to be sorted by relevance, we need a relevance value. In Elasticsearch, the relevance score is returned as a floating-point number in the _score field, so by default results are returned in descending order of _score.
Sometimes, though, there is no meaningful relevance score. For example, the following query returns all tweets whose user_id field contains the value 1:

GET /_search
{
    "query": {
        "filtered": {
            "filter": { "term": { "user_id": 1 } }
        }
    }
}
Filters have no bearing on _score, and the implicit match_all query simply sets _score to a neutral value of 1 for every document. In other words, all documents are considered equally relevant.
Sorting by field values

In the following example, results are sorted by time, probably the most common case: newest documents first. We use the sort parameter:

GET /_search
{
    "query": {
        "filtered": {
            "filter": { "term": { "user_id": 1 } }
        }
    },
    "sort": { "date": { "order": "desc" } }
}
You will notice two differences in the results:

"hits": {
    "total": 6,
    "max_score": null,              <1>
    "hits": [
        {
            "_index": "us",
            "_type": "tweet",
            "_id": "+",
            "_score": null,         <1>
            "_source": {
                "date": "2014-09-24",
                ...
            },
            "sort": [ 1411516800000 ]   <2>
        },
        ...
}

<1> The _score and max_score fields are both null: the _score is not calculated, because it is not being used for sorting.
<2> A sort element is added to each result, containing the value(s) used for sorting. Here, the date field has been converted internally to milliseconds: the long value 1411516800000 is the date string 2014-09-24 00:00:00 UTC expressed in milliseconds.
Calculating _score is relatively expensive, and its usual purpose is sorting; since we are not sorting by relevance here, there is no need to compute it. If you want scores to be calculated anyway, set the track_scores parameter to true.
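For example, to keep the date sort above while still having scores calculated, the request might look like this (a sketch reusing the earlier query):

```json
GET /_search
{
    "track_scores": true,
    "query": {
        "filtered": {
            "filter": { "term": { "user_id": 1 } }
        }
    },
    "sort": { "date": { "order": "desc" } }
}
```

With track_scores enabled, each hit's _score is populated even though the result order is still determined by date.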
Default ordering

As a shortcut, you can specify just the name of the field to sort on:

"sort": "number_of_children"

Field values are sorted in ascending order by default, while _score defaults to descending order.
Multilevel sorting

Suppose we want to combine a full-text query with a filter, and sort the matching results first by date and then by _score:

GET /_search
{
    "query": {
        "filtered": {
            "query":  { "match": { "tweet": "manage text search" } },
            "filter": { "term":  { "user_id": 2 } }
        }
    },
    "sort": [
        { "date":   { "order": "desc" } },
        { "_score": { "order": "desc" } }
    ]
}
The order of the sort criteria is important. Results are sorted by the first criterion; documents that share the same value for the first field are then sorted by the second criterion, and so on.

Multilevel sorting doesn't have to involve _score: you could sort on several different fields, on geo-distance, or on a custom calculated value.
Query-string sorting

Query-string searches also support custom sorting, using the sort parameter in the query string:

GET /_search?sort=date:desc&sort=_score&q=search
Sorting on multivalue fields

When sorting on a field with more than one value, remember that the values have no intrinsic order: a multivalue field is just a bag of values. Which one do you want to sort on?

For numbers and dates, you can reduce a multivalue field to a single value by using the min, max, avg, or sum sort modes. For instance, to sort on the earliest date in each dates field:

"sort": {
    "dates": {
        "order": "asc",
        "mode":  "min"
    }
}
String sorting and multifields

(Translator's note: a multivalue field means that a single field in an ES index can hold several values; it may be tokenized by one or more analyzers, or left as its original, unanalyzed value.)

A string field that has been processed by an analyzer is called an analyzed field (translator's note: a field that has been tokenized; by default, every string field written to ES is analyzed). Because analysis produces multiple tokens, an analyzed string field is effectively a multivalue field, and sorting on it seldom gives the result you want. For example, analyzing the string "fine old art" produces three terms. We would probably want to sort on the first term alphabetically, then on the second term when the first terms are equal, and so on, but unfortunately Elasticsearch does not have this information at sort time.
You could, of course, use the min and max sort modes (min is the default), but the sort would then be based on either art or old, neither of which is what we want.
To make a string field sortable, it should contain only one term: the entire not_analyzed string (the original string, untouched by any analyzer). But of course we still need the analyzed version of the field for full-text search.
Indexing the same string in two separate fields would needlessly duplicate data in _source. What we really want is to index the same field in two different ways within a single field entry. We just need to change the mapping, using the fields parameter, which is available on all core field types. The original mapping looks like this:
"tweet": { "type": "string", "Analyzer": "中文版"}
The mapping, updated to use a multifield, looks like this:

"tweet": {                          <1>
    "type":     "string",
    "analyzer": "english",
    "fields": {
        "raw": {                    <2>
            "type":  "string",
            "index": "not_analyzed"
        }
    }
}

<1> The main tweet field is analyzed, exactly as before, and is used for full-text search.
<2> The new tweet.raw subfield is not_analyzed, so it holds the original string for sorting.
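Putting the multifield mapping into a complete request might look like this (a sketch: the index name my_index is illustrative, and the syntax matches the ES 1.x era this guide describes):

```json
PUT /my_index
{
    "mappings": {
        "tweet": {
            "properties": {
                "tweet": {
                    "type":     "string",
                    "analyzer": "english",
                    "fields": {
                        "raw": {
                            "type":  "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            }
        }
    }
}
```

Note that mappings for existing fields cannot be changed in place; in practice this means creating a new index with the desired mapping and reindexing the data into it.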
Now, after reindexing our data, we can use the tweet field for full-text search and the tweet.raw field for sorting:

GET /_search
{
    "query": {
        "match": { "tweet": "elasticsearch" }
    },
    "sort": "tweet.raw"
}
Warning: sorting on an analyzed field can consume a lot of memory. See the fielddata section below for more information.
III. What is relevance?

We've mentioned that, by default, results are returned in descending order of relevance. But what is relevance, and how is it calculated?

Each document has a relevance score, represented by a positive floating-point number in the _score field: the higher the _score, the more relevant the document.
A query clause generates a _score for each document, and how that score is calculated depends on the type of query clause, because different clauses serve different purposes: a fuzzy query calculates how similar the matched term is to the spelling of the search term; a terms query calculates the percentage of the specified terms that were found. But what we usually mean by relevance in full-text search is the similarity between the contents of a field and the query keywords.

The standard similarity algorithm in Elasticsearch is TF/IDF (term frequency/inverse document frequency), which takes the following factors into account:
- Term frequency: How often does the term appear in this field of this document? The more often, the higher the relevance. Five occurrences of a term in one field make the document more relevant than a single occurrence.
- Inverse document frequency: How often does the term appear across all documents in the index? The more often, the lower the relevance. A term that appears in most documents carries less weight than one that appears in only a few; this measures the overall importance of a term.
- Field-length norm: How long is the field? The longer the field, the lower the relevance. A term appearing in a short title field carries more weight than the same term appearing in a long content field.
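As a rough sketch, the three factors above correspond to the following formulas in Lucene's classic TF/IDF similarity (the full scoring formula adds further boost and normalization terms, omitted here):

```latex
\mathrm{tf}(t, d) = \sqrt{\mathrm{freq}(t, d)}
\qquad
\mathrm{idf}(t) = 1 + \ln\!\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq}(t) + 1}\right)
\qquad
\mathrm{norm}(f, d) = \frac{1}{\sqrt{\mathrm{numTerms}(f, d)}}
```

For a single-term query, the per-field weight reported in the score explanation later in this chapter is essentially the product of these three values.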
An individual query clause may combine the TF/IDF score with other factors, such as the proximity of the terms in a phrase query, or the similarity of the terms in a fuzzy query.

Relevance is not only for full-text search, though. It also applies to yes/no clauses: the more clauses that match, the higher the relevance score.

When multiple query clauses are combined into a compound query such as bool, the score calculated for each clause is combined into the overall relevance score of the document.
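For example, in a bool query like the following sketch (the search terms are illustrative), documents matching both should clauses receive a higher combined _score than documents matching only one:

```json
GET /_search
{
    "query": {
        "bool": {
            "should": [
                { "match": { "tweet": "honeymoon" } },
                { "match": { "tweet": "elasticsearch" } }
            ]
        }
    }
}
```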
Understanding the score

When debugging a complex query, it can be hard to understand how the _score was calculated. To help with this, Elasticsearch can attach an explanation to every result: set the explain parameter to true.

GET /_search?explain     <1>
{
    "query": { "match": { "tweet": "honeymoon" } }
}
<1> The explain parameter adds an explanation of the _score calculation to every result.
Adding explain produces a lot of extra output for every matched document, which can look overwhelming, but it is worth taking the time to understand it. Don't worry if it doesn't all make sense now; you can refer back to this section when you need it.

First, let's look at the metadata that a normal query returns:

{
    "_index": "us",
    "_type":  "tweet",
    "_id":    "a",
    "_score": 0.076713204,
    "_source": { ... trimmed ... }
}
It is helpful that the output also tells us which node and which shard the document came from, because term and document frequencies are calculated per shard, not per index:

"_shard": 1,
"_node":  "mzivycsqswcg_m_zffss9q",
Then comes the _explanation element for each result, which tells you which calculation was used and shows its result along with further details:

"_explanation": {                                   <1>
   "description": "weight(tweet:honeymoon in 0) [PerFieldSimilarity], result of:",
   "value":       0.076713204,
   "details": [
      {
         "description": "fieldWeight in 0, product of:",
         "value":       0.076713204,
         "details": [
            {                                       <2>
               "description": "tf(freq=1.0), with freq of:",
               "value":       1,
               "details": [
                  { "description": "termFreq=1.0", "value": 1 }
               ]
            },
            {                                       <3>
               "description": "idf(docFreq=1, maxDocs=1)",
               "value":       0.30685282
            },
            {                                       <4>
               "description": "fieldNorm(doc=0)",
               "value":       0.25
            }
         ]
      }
   ]
}
<1> Summary of the score calculation for honeymoon
<2> Term frequency
<3> Inverse document frequency
<4> Field-length norm
Important: producing the explain output is expensive. Use it only as a debugging tool; never enable it in production.
The first part is the summary of the calculation. It tells us that it has calculated the weight, the TF/IDF, of the term honeymoon in the field tweet for document 0. (Here, 0 is an internal document ID; it has no meaning for us and can be ignored.)

It then breaks down how the weight was calculated:
Term frequency:
    How many times did the term honeymoon appear in the tweet field of this document?
Inverse document frequency:
    The ratio of how often the term honeymoon appears in the tweet field of this document to how often it appears across all other documents in the index.
Field-length norm:
    How long is the content of the tweet field in this document? The longer the field, the smaller this number.
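Plugging the numbers from the explanation output into the product named in its summary line ("fieldWeight ... product of"), the three component values multiply out to the reported score:

```latex
\mathrm{tf} \times \mathrm{idf} \times \mathrm{fieldNorm}
= 1.0 \times 0.30685282 \times 0.25
\approx 0.076713204
```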
The explanation for a complex query can be very long, but it contains the same elements described above. With this information we can see why one search result ranks the way it does.

Tip: the explain output is hard to read in JSON, but much easier when formatted as YAML; just add format=yaml to the request.
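For example, reusing the honeymoon query from above:

```json
GET /_search?explain&format=yaml
{
    "query": { "match": { "tweet": "honeymoon" } }
}
```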
Understanding why a document matched

While the explain option adds an explanation to every result, the explain API tells you why one particular document matched a query, or, more usefully, why it did not match. The request path is /index/type/id/_explain, as follows:

GET /us/tweet/12/_explain
{
   "query": {
      "filtered": {
         "filter": { "term":  { "user_id": 2 } },
         "query":  { "match": { "tweet": "honeymoon" } }
      }
   }
}
Along with the full explanation we saw above, we also get a description like this:

"failure to match filter: cache(user_id:[2 TO 2])"

In other words, our user_id filter clause is preventing the document from matching.
IV. Fielddata

The purpose of this chapter is to peek behind the curtain at how Elasticsearch works internally. We don't introduce any new knowledge here; fielddata is something we will come back to repeatedly, but it is not something you need to work with directly.

When you sort on a field, Elasticsearch needs access to the value of that field for every matching document. The inverted index performs very well when searching, but it is not the ideal structure for sorting:

- When searching, we use a term to find the documents that contain it.
- When sorting, we need to look at the field's value in every document, and we need those values in order.
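As an illustrative sketch (the document IDs and terms here are made up), the two structures answer opposite questions:

```json
{
    "inverted_index": {
        "brown": ["Doc_1", "Doc_2"],
        "dog":   ["Doc_1"],
        "fox":   ["Doc_2"]
    },
    "fielddata": {
        "Doc_1": ["brown", "dog"],
        "Doc_2": ["brown", "fox"]
    }
}
```

Searching uses the first mapping (term to documents); sorting needs the second (document to terms), which is exactly what fielddata provides.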
To improve sorting efficiency, Elasticsearch loads the values of the field for all documents into memory. This in-memory structure is known as fielddata.

Important: Elasticsearch loads into memory not just the values of the documents that matched your query, but the values of every document in the index, regardless of type.

It loads all the values because un-inverting the index from disk on every request would be very slow. Although this request may need the values of only some documents, the next request will likely need others, so it pays to load all of the field's values into memory at once and keep them there.
Fielddata is used in several scenarios in Elasticsearch:

- Sorting on a field
- Aggregations on a field
- Certain filters, such as geolocation filters
- Scripts that refer to field values

There is no doubt that this can consume a lot of memory, especially for string fields: a string field may contain many distinct values, such as message bodies. Fortunately, insufficient memory is a problem that can be solved by scaling horizontally: simply add more nodes to the cluster.

For now, you only need to know what fielddata is and that it is loaded into memory all at once. Later we will look at how much memory fielddata consumes, how to limit the memory available to Elasticsearch, and how to preload fielddata to improve the user experience.
Elasticsearch (7): Sorting