The coordinating node first determines which documents actually need to be fetched. For example, if the query specifies { "from": 90, "size": 10 }, the first 90 results are discarded, and only the documents represented by the remaining 10 results need to be fetched. These documents may come from one or more shards.
The coordinating node builds a multi-get request for each shard that holds one of the target documents, and sends these requests to the shard copies that handled the query phase.
Each shard loads the document bodies, that is, the _source field, and, if required, enriches the results with metadata and search snippet highlighting. Once the coordinating node has received all the results, it assembles them into a single response and returns it to the client.
Deep Pagination
The query-then-fetch process supports pagination through the from and size parameters, but within limits. Remember that each shard must build a local priority queue of length from + size, all of which must be returned to the coordinating node. The coordinating node then has to sort through number_of_shards * (from + size) documents in order to find the correct size documents for the final result.
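To make that cost concrete, here is a minimal Python sketch (illustrative only; the shard contents and scores are invented) of what the coordinating node has to do: each shard returns its local top from + size entries, and the node merges and sorts all of them just to produce a single page.

```python
import heapq

def coordinate_page(shard_results, from_, size):
    """Merge per-shard priority queues the way a coordinating node would.

    shard_results: one list of (score, doc_id) pairs per shard, each
    already limited to that shard's local top (from_ + size) entries.
    """
    # The node must sort number_of_shards * (from_ + size) entries...
    merged = heapq.merge(*[sorted(r, reverse=True) for r in shard_results],
                         reverse=True)
    all_hits = list(merged)
    # ...only to discard the first `from_` and keep `size` of them.
    return all_hits[from_:from_ + size]

# Three shards, each returning its local top (90 + 10) = 100 entries.
shards = [[(float(i), "shard%d_doc%d" % (s, i)) for i in range(100, 0, -1)]
          for s in range(3)]
page = coordinate_page(shards, from_=90, size=10)
print(len(page))  # 10 documents kept out of 300 that were merged and sorted
```

Note that the merged total (300 entries here) grows with both the page depth and the shard count, which is exactly why deep pages get expensive.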
Depending on the size of your documents, the number of shards, and the hardware you are using, paging 10,000 to 50,000 results (1,000 to 5,000 pages) deep should be perfectly feasible. But with big enough from values, the sorting process becomes very heavy indeed, consuming large amounts of CPU, memory, and bandwidth. For this reason, we strongly advise against deep paging.
In practice, deep pages are rarely useful anyway. Users usually change their search criteria after browsing two or three pages of results. The real culprits are usually web crawlers that tirelessly fetch page after page, which can bring your servers to their knees.
If you really do need to fetch large numbers of documents from the cluster, you can do so efficiently by using the scan search type, which disables sorting. We discuss this later, in the Scan and Scroll section.
Search Options
A few optional query-string parameters can influence the search process:
Preference
The preference parameter allows you to control which shards or nodes are used to handle the search request. It accepts values such as _primary, _primary_first, _local, _only_node:xyz, _prefer_node:xyz, and _shards:2,3. The meanings of these values are explained in detail in the preference search documentation.
However, the most generally useful value is some arbitrary string, which is used to avoid the bouncing results problem.
Bouncing Results
Imagine, for example, that you are sorting your results by a timestamp field, and two documents have the same timestamp. Because search requests are round-robined between the available shard copies, the two documents may be returned in a different order depending on which copy handles the request; for instance, the primary shard and a replica shard may process them in different orders.
This is the bouncing results problem: every time the user refreshes the page, the results appear in a different order.
This problem can be avoided by always using the same shards for the same user: set the preference parameter to an arbitrary string, such as the user's session ID.
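The effect of the preference string can be sketched in plain Python. This is a toy model, not Elasticsearch's actual routing code: round-robin alternates between copies whose tie-breaking differs, while hashing a stable session string always selects the same copy.

```python
import itertools
import zlib

# Two copies of the same shard return tied-timestamp docs in different orders.
primary = ["doc_a", "doc_b"]
replica = ["doc_b", "doc_a"]
copies = [primary, replica]

# Round-robin: two consecutive refreshes may see different orders.
rr = itertools.cycle(copies)
first, second = next(rr), next(rr)
print(first == second)  # False: the results "bounce"

# With a stable preference string, the same copy is chosen every time.
def pick_copy(preference):
    return copies[zlib.crc32(preference.encode()) % len(copies)]

print(pick_copy("session_42") == pick_copy("session_42"))  # True
```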
Timeout
By default, the coordinating node waits for responses from all shards. If one node is having trouble, it can slow down all search requests.
The timeout parameter tells the coordinating node how long it should wait before giving up. If it gives up, it simply returns the results it already has. Returning partial results is at least better than returning nothing at all.
The response to a search request indicates whether the search timed out and how many shards responded successfully:
...
"timed_out":     true,
"_shards": {
    "total":      5,
    "successful": 4,
    "failed":     1
},
...
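When using a timeout, it is worth checking these fields in client code before trusting the hit counts. A small sketch (the response dict below simply mirrors the fragment above; how you obtain it depends on your client):

```python
def summarize_shard_health(response):
    """Report whether a search response is partial and why."""
    shards = response["_shards"]
    partial = response.get("timed_out", False) or shards["failed"] > 0
    return {
        "partial": partial,
        "responded": "%d/%d shards" % (shards["successful"], shards["total"]),
    }

# A response shaped like the fragment above.
resp = {"timed_out": True,
        "_shards": {"total": 5, "successful": 4, "failed": 1}}
print(summarize_shard_health(resp))
# {'partial': True, 'responded': '4/5 shards'}
```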
Routing
In the Routing a Document to a Shard section of the Distributed Document Store chapter, we explained that a custom routing parameter can be provided at index time to ensure that all related documents, such as documents belonging to the same user, are stored on the same shard. At search time, you can specify one or more routing values to limit the search to the relevant shard(s):
GET /_search?routing=user_1,user2
This technique comes in handy when designing very large search systems. We discuss it in detail in Designing for Scale.
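The reason routing can narrow the search is that a document's shard is derived from its routing value. A simplified sketch of the idea (Elasticsearch actually uses a murmur3-based hash internally; crc32 here is only for illustration):

```python
import zlib

NUMBER_OF_PRIMARY_SHARDS = 5  # fixed when the index is created

def shard_for(routing_value):
    # shard = hash(routing) % number_of_primary_shards
    return zlib.crc32(routing_value.encode()) % NUMBER_OF_PRIMARY_SHARDS

# All documents indexed with the same routing value land on the same
# shard, so a search with ?routing=user_1 needs to consult only that shard.
print(shard_for("user_1") == shard_for("user_1"))  # True
print(0 <= shard_for("user_1") < NUMBER_OF_PRIMARY_SHARDS)  # True
```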
Search Type
While query_then_fetch is the default search type, other search types can be specified for particular purposes, for example:
GET /_search?search_type=count
Count
The count search type runs only the query phase. You can use it when you don't need the actual search results, just the number of matching documents or the aggregation results.
Query_and_fetch
The query_and_fetch search type combines the query and fetch phases into a single step. It is an internal optimization used when a search request targets only a single shard, for example when a routing value has been specified. Although you can choose to use this search type manually, doing so is almost never useful.
Dfs_query_then_fetch and dfs_query_and_fetch
The dfs search types have a pre-query phase that fetches the term frequencies from all involved shards in order to calculate global term frequencies. We discuss this further in the Relevance Is Broken section.
Scan
The scan search type is used in conjunction with the scroll API to efficiently retrieve large numbers of results. It does this by disabling sorting. It is discussed in the next section.
Scan and Scroll
The scan search type and the scroll API are used together to retrieve large numbers of documents from Elasticsearch efficiently, without the penalties of deep pagination.
Scroll
A scrolled search lets us run an initial search and then keep pulling batches of results from Elasticsearch until there are no more results left. It is a bit like a cursor in a traditional database.
A scrolled search takes a snapshot in time: it doesn't see any changes made to the index after the initial search request. It does this by keeping the old data files around, so that it can preserve its view of the index as it was at the time the search started.
Scan
The costly part of deep pagination is the global sorting of results; if we disable sorting, we can return all matching documents quite cheaply. To do this, we use the scan search type. It tells Elasticsearch to do no sorting, but to just return the next batch of results from every shard that still has results to return.
To use scan and scroll, we perform an initial search request with the search type set to scan, and pass a scroll parameter telling Elasticsearch how long it should keep the scroll open:
GET /old_index/_search?search_type=scan&scroll=1m
{
    "query": { "match_all": {}},
    "size":  1000
}
This request keeps the scroll open for one minute.
The response to this request doesn't include any hits, but it does include a _scroll_id, which is a long base-64 encoded string. Now we can pass the _scroll_id to the _search/scroll endpoint to retrieve the first batch of results:
GET /_search/scroll?scroll=1m
c2Nhbjs1OzExODpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzExOTpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzExNjpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzExNzpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzEyMDpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzE7dG90YWxfaGl0czoxOw==
This request keeps the scroll open for another minute. The _scroll_id can be passed in the request body, in the URL, or as a query-string parameter.
Note that we again specify ?scroll=1m. The scroll expiry time is refreshed every time we run a scroll request, so it needs to be only long enough to process the current batch of results, not all of the matching documents.
The response to this scroll request includes the first batch of results. Although we specified a size of 1,000, we may actually receive more documents: the size is applied to each shard, so each batch can return up to size * number_of_primary_shards documents.
NOTE
The scroll request also returns a new _scroll_id. Every time we make the next scroll request, we must pass the _scroll_id returned by the previous scroll request.
When no more hits are returned, we have processed all of the matching documents.
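Putting it together, a scan-and-scroll loop looks roughly like this. This is only a sketch against a stand-in client: FakeClient below is an invented class that simulates the protocol (each scroll call returns the next batch and a fresh _scroll_id, and an empty batch signals completion); a real client would issue the HTTP requests shown earlier.

```python
class FakeClient:
    """Simulates scan-and-scroll: batches come back until exhausted."""
    def __init__(self, docs, batch_size):
        self.batches = [docs[i:i + batch_size]
                        for i in range(0, len(docs), batch_size)]
        self.calls = 0

    def scan(self, scroll="1m", size=1000):
        # The initial scan request returns no hits, just a scroll id.
        return {"_scroll_id": "scroll-0", "hits": []}

    def scroll(self, scroll_id, scroll="1m"):
        batch = self.batches[self.calls] if self.calls < len(self.batches) else []
        self.calls += 1
        # Each scroll response carries a NEW _scroll_id for the next call.
        return {"_scroll_id": "scroll-%d" % self.calls, "hits": batch}

client = FakeClient(docs=list(range(2500)), batch_size=1000)
resp = client.scan(scroll="1m", size=1000)
collected = []
while True:
    resp = client.scroll(resp["_scroll_id"], scroll="1m")  # always the latest id
    if not resp["hits"]:        # empty batch: all documents processed
        break
    collected.extend(resp["hits"])
print(len(collected))  # 2500
```

The key points the sketch illustrates: always pass the most recently returned _scroll_id, refresh the scroll timeout on every request, and stop when a batch comes back empty.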
TIP
Some of the official Elasticsearch clients provide scan-and-scroll helpers that encapsulate this functionality.