ElasticSearch (7)-Sort


Quoted from the Elasticsearch Definitive Guide.

I. Sorting and relevance

By default, result sets are sorted by relevance: the higher the relevance, the higher the ranking. In this chapter we will explain what relevance is and how it is calculated. Before that, let's look at how the sort parameter is used.

Sorting method

In order for results to be sorted by relevance, we need a relevance value. In Elasticsearch query results, the relevance score is given as a floating-point number in the _score field, so by default the result set is sorted by _score in descending order.

Even so, sometimes there is no meaningful relevance score. For example, the following query returns all tweets whose user_id field contains the value 1:

GET /_search
{
    "query": {
        "filtered": {
            "filter": {
                "term": { "user_id": 1 }
            }
        }
    }
}

A filter clause has no bearing on _score, and the implied match_all query sets the _score of every document to 1. In other words, all documents are considered equally relevant.
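
To make the implied query visible, the request above can be written with an explicit match_all clause; this is only a sketch of the equivalent form described in the paragraph above:

GET /_search
{
    "query": {
        "filtered": {
            "query":  { "match_all": {} },
            "filter": {
                "term": { "user_id": 1 }
            }
        }
    }
}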

Sorting by field values

In the following example, the result set is sorted by time, which is also the most common case: the latest documents come first. We use the sort parameter to do this:

GET /_search
{
    "query": {
        "filtered": {
            "filter": { "term": { "user_id": 1 } }
        }
    },
    "sort": { "date": { "order": "desc" } }
}

You will notice two differences in the results:

"hits": {    "total":           6,    "Max_score":       <1>    "hits": [{        "_index":      "Us",        "_type":       "tweet",        "_id":         "+",        "_score":      null, <1>        "_source":     {             "date":    "2014-09-24",             ...        },        "sort":        <2>    },    ...}

<1> The _score is not calculated, because it is not being used for sorting. <2> The value of the date field, converted to milliseconds, is returned as the sort key.

First, a sort field is added to each result, and the values it contains are used for sorting.

In this example, the date field was internally converted to milliseconds: the long value 1411516800000 is equivalent to the date string 2014-09-24 00:00:00 UTC.

Second, both the _score and max_score fields are null.

Computing _score has a cost, and it is mainly useful for sorting by relevance; when we are not sorting by relevance, there is no need to calculate it. If you want to force the score to be calculated anyway, you can set track_scores to true.
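
As a small sketch of what that might look like, track_scores is added alongside the date sort used above; everything else is unchanged:

GET /_search
{
    "track_scores": true,
    "query": {
        "filtered": {
            "filter": { "term": { "user_id": 1 } }
        }
    },
    "sort": { "date": { "order": "desc" } }
}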

Default sort order

You can specify only the name of the field you want to sort:

"Sort": "Number_of_children"

Field values are sorted in ascending order by default, while sorting on _score defaults to descending order.
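
In other words, the shorthand above should behave like the fully spelled-out form below (ascending order is assumed here because it is the default for field sorts):

"sort": { "number_of_children": { "order": "asc" } }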

Multilevel sorting

Suppose we want to combine a query with a filter and sort the matching results first by date and then by _score:

GET/_search {"Query": {"filtered": {"Query": {"                Match": {                    "tweet": "Manage Text Search"                }< c7/>},            "filter": {"term                ": {                    "user_id": 2}}}    ,    "sort": [        {            " Date ": {                " order ":" desc "            }        },        {            " _score ": {                " order ":" desc "            }        }    ]}

The order of the sort clauses is important. Results are sorted by the first criterion; documents that have the same value for the first field are then sorted by the second criterion, and so on.

Multilevel sorting does not have to include _score.

You can sort on several different fields, such as geographic distance or a custom computed value, as in the sketch below.
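
For example, a distance-based sort might look roughly like this sketch; the location field name and the coordinates are hypothetical, and it assumes the documents contain a geo_point field:

"sort": [
    {
        "_geo_distance": {
            "location": { "lat": 40.715, "lon": -74.011 },
            "order":    "asc",
            "unit":     "km"
        }
    }
]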

Sorting with query-string parameters

Query-string searches also support custom sorting, via the sort parameter in the query string:

GET /_search?sort=date:desc&sort=_score&q=search

Sorting on multi-value fields

When sorting on a field that has multiple values, there is no intrinsic order to those values: a multi-value field is just a collection of values, so which one should be used for sorting?

For numbers and dates, you can reduce the multiple values to a single value by using the min, max, avg, or sum sort modes.

For example, you can sort on the earliest date in the dates field with:

"Sort": {"dates": {"Order": "ASC", "mode": "Min"}}

II. String sorting on multi-value fields

Translator's note: a multi-value field here means that the same field in the ES index can be indexed in more than one way: it can be analyzed with one or more analyzers for tokenized search and sorting, or left without an analyzer to keep the original value.

String fields processed by an analyzer are called analyzed fields (translator's note: fields that have been tokenized; by default, all string fields written to ES are analyzed). An analyzed string field is effectively a multi-value field, and sorting on it rarely gives you the result you want. For example, if you analyze the string "fine old art", it produces three terms. Suppose we want to sort by the first letter of the first word, then by the first letter of the second word when the first words are equal, and so on; unfortunately Elasticsearch cannot do this kind of sorting.

You can of course use the min or max mode (min is the default), but then the sort is based on art or old, which is not what we expected.

For a string field to be sortable, it should contain exactly one term: the complete not_analyzed string (the original value, not tokenized by an analyzer).

At the same time, we still need the analyzed version of the field for full-text search.

Storing the same string twice in _source, in two completely separate fields, would be an unnecessary waste of resources. What we really want is to index the same field in two different ways, and that only requires a change to the mapping, using the fields parameter that is available on all core field types. For example, suppose our original mapping is as follows:

"tweet": {    "type":     "string",    "Analyzer": "中文版"}

The updated multi-field mapping looks like this:

"tweet": {<1>    "type":     "string",    "Analyzer": "中文版",    "fields": {        "raw": {<2 >            "type":  "string",            "index": "Not_analyzed"}}    }

<1> The main tweet field is analyzed, exactly as before, and used for full-text search.

<2> The new tweet.raw sub-field is indexed as not_analyzed.

Now, after reindexing, we can use the tweet field for full-text search and the tweet.raw field for sorting:

GET /_search
{
    "query": {
        "match": { "tweet": "elasticsearch" }
    },
    "sort": "tweet.raw"
}

Warning:

Sorting on an analyzed field can consume a lot of memory. For more information, see the "Data fields" section below.

III. Introduction to relevance

We've said that, by default, results are returned in descending order of relevance. But what is relevance? How is it calculated?

Each document has a relevance score, represented by the positive floating-point field _score: the higher the _score, the more relevant the document.

Each query clause generates a _score for every document it matches. How that score is calculated depends on the type of query clause, because different clauses serve different purposes: a fuzzy query scores by how similar the spelling of a found term is to the search keyword, while another type of clause may score by the percentage of the search keywords that were found. What we usually mean by relevance in full-text search, though, is how similar the contents of a field are to the query keywords.

The standard similarity algorithm in Elasticsearch is TF/IDF, that is, term frequency / inverse document frequency, which takes the following into account:

    • Term frequency :: How often does the term appear in this field? The more often, the more relevant. A field in which the term appears five times is more relevant than one in which it appears only once.
    • Inverse document frequency :: How often does the term appear across all documents in the index? The more often, the less relevant. A term that appears in most documents carries less weight than one that appears in only a few; this reflects the overall importance of the term.
    • Field-length norm :: How long is the field? The longer the field, the less relevant the match. A term that appears in a short title field carries more weight than the same term in a long content field.

An individual query may combine the TF/IDF score with other factors, such as the proximity of the terms in a phrase query, or the similarity of the terms in a fuzzy query.
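
As a rough sketch, assuming the classic Lucene practical scoring components (which the explain output later in this article appears to follow), the three factors can be written as:

tf(t, d)   = sqrt( freq(t, d) )
idf(t)     = 1 + ln( maxDocs / (docFreq(t) + 1) )
norm(f, d) = 1 / sqrt( numTermsInField(f, d) )

For instance, with docFreq = 1 and maxDocs = 1, as in the explain example below, idf = 1 + ln(1/2) ≈ 0.3069, which matches the value 0.30685282 reported there.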

Relevance is not exclusive to full-text search. It also applies to yes|no clauses: the more clauses that match, the higher the relevance score.

When multiple query clauses are combined into a single compound query, such as a bool query, the score from each clause is combined into the overall relevance score.

Understanding the score

When debugging a complex query, it can be hard to understand how the relevance _score was calculated. Elasticsearch supports an explain parameter on queries; setting explain to true returns more detailed scoring information.

GET /_search?explain <1>
{
    "query": { "match": { "tweet": "honeymoon" } }
}

<1> The explain parameter adds an explanation of how the _score was calculated to every result.

Adding the explain parameter produces a lot of extra output for every matched document, but it is worth taking the time to understand it. Don't worry if it doesn't all make sense now; you can come back to it when you need it. Let's work through the output piece by piece.

First, let's look at the meta-data returned by the normal query:

{    "_index":      "Us",    "_type":       "tweet",    "_id":         "a",    "_score":      0.076713204 ,    "_source":     {... trimmed ...},}

The explain output adds information about which shard and node the document came from, which is useful to know because term and document frequencies are calculated per shard, not per index:

    "_shard":      1,    "_node":       "mzivycsqswcg_m_zffss9q",

Each entry then contains an _explanation element, which tells you which calculation was used, the result of that calculation, and other details:

"_explanation": {<1>"description": "Weight (tweet:honeymoon in 0) [perfieldsimilarity], result of:", "value": 0.076713         204, "Details": [{"description": "Fieldweight in 0, product of:", "value": 0.076713204, "Details": [{<2>"description": "TF (freq=1.0), with freq of:", "Value": 1, "Details": [               {"description": "termfreq=1.0", "Value": 1} ]            },            {<3>"description": "IDF (Docfreq=1, Maxdocs=1)", "Value": 0.30685282}, {<4>"description": "Fieldnorm (doc=0)", "value": 0.25,}]}]}

<1> Summary of the relevance score calculation for the term honeymoon

<2> Term frequency

<3> Inverse document frequency

<4> Field-length norm

Important:

Generating explain output is expensive. Use it only as a debugging tool; never enable it in production.

The first part is a summary of the calculation: it tells us that the score for the term "honeymoon" in the tweet field was calculated using term frequency / inverse document frequency, or TF/IDF. (The document 0 here is an internal document ID; it is irrelevant to us and can be ignored.)

It then breaks down how the weight was calculated:

Term frequency:

How many times the term `honeymoon` appears in the `tweet` field of this document.

Inverse document frequency:

How often the term `honeymoon` appears in the `tweet` field across all documents in the index.

Field-length norm:

The length of the `tweet` field in this document: the longer the field, the smaller this value.
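
Putting the numbers from the explain output together (the description "fieldWeight ... product of" indicates that the weight is simply the product of the three components): 1.0 × 0.30685282 × 0.25 ≈ 0.076713204, which is exactly the _score reported for this document.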

The explanation for a complex query is correspondingly more complex, but it contains the same kinds of elements as the example above. With this output we can see exactly how the ranking of the search results was produced.

Tip:

The JSON form of the explain output is hard to read; it is much more readable as YAML. Just add format=yaml to the request, as in the sketch below.
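
For example, a sketch combining the earlier explain request with this parameter:

GET /_search?explain&format=yaml
{
    "query": { "match": { "tweet": "honeymoon" } }
}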

Explain API: why a document was (or was not) matched

When the explain option is used against a specific document, it tells you why that document matched, or why it did not match.

The request is made against the path /index/type/id/_explain, as follows:

GET /us/tweet/12/_explain
{
    "query": {
        "filtered": {
            "filter": { "term":  { "user_id": 2 } },
            "query":  { "match": { "tweet": "honeymoon" } }
        }
    }
}

In addition to the full explanation we saw above, the response also contains a description like this:

"Failure to match Filter:cache (user_id:[2 to 2])"

In other words, our user_id filter clause is preventing the document from matching.

IV. Data fields (fielddata)

The purpose of this section is to describe some of Elasticsearch's internal workings. We don't introduce any new functionality here; fielddata is simply something we will come back to repeatedly, so it is worth knowing what it is, even though you rarely work with it directly.

When you sort on a field, Elasticsearch needs to look up the value of that field for every matching document. The inverted index is excellent for searching, but it is not an ideal structure for sorting:

    • When searching, we start from a term and need to find the documents that contain it.

    • When sorting, we start from a document and need to look up the values of one of its fields, in order.

To make sorting efficient, Elasticsearch loads all of the values for the field you are sorting on into memory. This in-memory structure is referred to as fielddata, the "data fields" of this section's title.

Important: Elasticsearch loads into memory the field values from every document in the index, across all types, not just from the documents that match the current query.

All of the values are loaded into memory because un-inverting the index from disk on demand would be too slow. Although this request may need values from only some of the documents, the next request is likely to need others, so it makes sense to load all of the field values into memory at once.

Fielddata is typically used in Elasticsearch in the following scenarios (a short sketch of the first two follows this list):

    • Sorting on a field
    • Aggregating on a field
    • Certain filters, such as geo filters
    • Script calculations that reference field values
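
As a minimal sketch of the first two scenarios, reusing field names from earlier in this article (the aggregation name tweets_per_user is made up for illustration), both the sort and the aggregation below would cause the values of tweet.raw and user_id to be loaded into fielddata:

GET /_search
{
    "query": { "match_all": {} },
    "sort": "tweet.raw",
    "aggs": {
        "tweets_per_user": {
            "terms": { "field": "user_id" }
        }
    }
}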

There is no doubt that this can consume a lot of memory, especially for string fields, which may contain many distinct values (think of the body of a message). Fortunately, insufficient memory can be addressed by scaling horizontally: we can simply add more nodes to the cluster.

For now, you just need to know what fielddata is and that it lives in memory. Later we will look at how much memory fielddata consumes, how to limit the amount of memory Elasticsearch can use for it, and how to preload fielddata to improve the user experience.

