How distributed search is performed
Before we go any further, let's take a detour and look at how search is executed in a distributed environment. It is more complicated than the basic create-read-update-delete (CRUD) requests we discussed previously.
Note:
The information in this chapter is for interest only; you do not need to understand and remember all of the details here in order to use Elasticsearch. Read it to gain an understanding of how the system works and to know where this information lives for future reference, but don't get lost in the details.
A CRUD operation deals with a single document, whose uniqueness is determined by the combination of _index, _type, and routing value (which by default is the document's _id). This means that we know exactly which shard in the cluster holds that document.
Because we don't know which documents will match the query (they could be on any shard in the cluster), a search requires a more complicated execution model: it has to query a copy of every shard in the index (or indices) we are interested in, to find out whether it holds any matching documents.
However, finding all of the matching documents is only half the story. Before the search API can return a page of results, the results from the individual shards must be combined into a single, sorted list. For this reason, search is executed in a two-phase process called query then fetch.
First, the query phase
As the query phase begins, the query is broadcast to a copy (primary or replica) of every shard in the index. Each shard executes the search locally and builds a priority queue of matching documents.
Priority queue
A priority queue is just a sorted list that holds the top-n matching documents. The size of the priority queue is determined by the paging parameters from and size. For example, the following search request would require a priority queue big enough to hold 100 documents:

GET /_search
{ "from": 90, "size": 10 }
This process is illustrated in Figure 1.
Figure 1 Distributed search query phase
The query phase consists of the following three steps:
- The client sends a search request to Node 3, which creates an empty priority queue of size from + size.
- Node 3 forwards the search request to a primary or replica copy of every shard in the index. Each shard executes the query locally and fills a local sorted priority queue of size from + size with its results.
- Each shard returns the doc IDs and sort values of all the documents in its priority queue to the coordinating node, Node 3, which merges these values into its own priority queue to produce a globally sorted result set.
When a search request is sent to a node, that node becomes the coordinating node. Its job is to broadcast the search request to all relevant shards and to gather their responses into a globally sorted result set that is returned to the client.
The first step is to broadcast the request to a copy of every shard in the index. Just like document GET requests, search requests can be handled by a primary shard or by any of its replicas. This is how more replicas (combined with more hardware) increase search throughput. For subsequent requests, the coordinating node round-robins through all of the shard copies to spread the load.
Each shard executes the query locally and builds a sorted priority queue of length from + size; in other words, each shard on its own produces enough results to satisfy the global search request. The shard then returns a lightweight list of results to the coordinating node, containing just the document IDs and any values required for sorting, such as the _score.
The coordinating node merges these shard-level results into its own sorted priority queue, which represents the globally sorted result set. At that point, the query phase is over.
The whole process is similar to a merge sort: results are sorted in groups and then merged together, which makes it a natural fit for this distributed scenario.
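This merge step can be sketched in a few lines of Python. The shard data, scores, and function name below are invented for illustration; this is a simplified model of the coordinating node, not Elasticsearch's actual implementation. Each shard contributes its own sorted top from + size list of (score, doc ID) pairs, and the coordinating node merges them into one globally sorted queue.

```python
import heapq

def merge_shard_results(shard_results, from_, size):
    """Simulate the coordinating node: merge per-shard sorted result
    lists (highest score first) into one global top-(from + size) list."""
    # heapq.merge needs a common ascending sort order; negate the score
    # so that higher-scoring hits come first.
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
    # Keep only the from + size entries that the global queue can hold.
    return list(merged)[: from_ + size]

# Each shard returns its local top-3 (score, doc_id) pairs, sorted descending.
shard_1 = [(9.2, "doc_a"), (5.5, "doc_b"), (1.1, "doc_c")]
shard_2 = [(8.7, "doc_d"), (7.0, "doc_e"), (0.4, "doc_f")]

print(merge_shard_results([shard_1, shard_2], from_=0, size=3))
# -> [(9.2, 'doc_a'), (8.7, 'doc_d'), (7.0, 'doc_e')]
```

Because every per-shard list is already sorted, the merge itself is linear in the total number of entries, which is exactly why the merge-sort analogy fits.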
Note
An index can consist of one or more primary shards, so a search request against a single index already needs to be able to combine the results from multiple shards. A search against multiple indices or all indices works in exactly the same way; there are simply more shards involved.
Second, the retrieval phase
The query phase identifies the documents that satisfy the search request, but we still need to retrieve the documents themselves. This is the job of the retrieval phase, shown in Figure 2.
Figure 2 The distributed search retrieval phase
The retrieval phase consists of the following steps:
- The coordinating node identifies which documents need to be fetched and issues a multi-GET request to the relevant shards.
- Each shard loads the documents and enriches them if required, then returns the documents to the coordinating node.
- Once all of the documents have been fetched, the coordinating node returns the results to the client.
The coordinating node first decides which documents actually need to be fetched. For example, if the query specified { "from": 90, "size": 10 }, the first 90 results are discarded and only the next 10 need to be fetched. These documents may come from one, several, or all of the shards involved in the original query.
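As a toy illustration of this selection step (the names and data below are invented, not Elasticsearch internals), here is how the coordinating node's globally sorted queue of from + size entries is cut down to the size document IDs that actually get fetched:

```python
def docs_to_fetch(global_queue, from_, size):
    """Fetch-phase selection: discard the first `from_` entries of the
    globally sorted queue and keep the doc IDs of the next `size`."""
    return [doc_id for _score, doc_id in global_queue[from_ : from_ + size]]

# A toy globally sorted queue of (score, doc_id) pairs for from=3, size=2.
queue = [(9.0, "d1"), (8.0, "d2"), (7.0, "d3"), (6.0, "d4"), (5.0, "d5")]
print(docs_to_fetch(queue, from_=3, size=2))  # -> ['d4', 'd5']
```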
The coordinating node builds a multi-get request for every shard that holds a relevant document and sends it to the same shard copy that handled the query phase.
The shard loads the document bodies, the _source field, and, if requested, enriches the results with metadata and search-snippet highlighting. Once the coordinating node has received all of the results, it assembles them into a single response and returns it to the client.
Deep pagination
The query-then-fetch process supports pagination via the from and size parameters, but within limits. Remember that each shard must build a priority queue of length from + size, all of which is passed back to the coordinating node. That means the coordinating node has to sort through number_of_shards * (from + size) documents just to find the size documents that will actually be returned.
Depending on the number of documents, the number of shards, and the hardware involved, paging 10,000 to 50,000 results deep (1,000 to 5,000 pages) may be perfectly feasible. But for large enough from values, the sorting process becomes very heavy indeed, consuming huge amounts of CPU, memory, and bandwidth. For this reason, deep paging is strongly discouraged.
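The arithmetic behind this warning is easy to check. The helper below is a back-of-the-envelope sketch (an invented function, not an Elasticsearch API) of how many entries the coordinating node has to sort through:

```python
def coordinating_node_sort_load(num_shards, from_, size):
    """Entries the coordinating node must sort through: each shard
    sends back a priority queue of length from + size."""
    return num_shards * (from_ + size)

# Page 1 on a 5-shard index is cheap:
print(coordinating_node_sort_load(5, from_=0, size=10))       # -> 50
# Page 1001 makes the node sort through 50,050 entries to return 10 docs:
print(coordinating_node_sort_load(5, from_=10_000, size=10))  # -> 50050
```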
In practice, "deep pagers" are a tiny minority anyway. A typical user stops after two or three pages and changes the search criteria instead. The exceptions are usually bots or web crawlers, which keep fetching page after page until the server is driven to the edge of collapse.
If you really do need to fetch large numbers of documents from the cluster, you can do so efficiently by disabling sorting with the scan search type, which is discussed in a later section.
Third, search options
A few optional query-string parameters can influence the search process.
1. Preference (preference)
The preference parameter lets you control which shards or nodes handle a search request. It accepts values such as _primary, _primary_first, _local, _only_node:xyz, _prefer_node:xyz, and _shards:2,3, which are explained in detail in the search preference documentation.
However, the most useful value is usually an arbitrary random string, which avoids the bouncing results problem.
Bouncing results
Imagine that you are sorting your results by a timestamp field, and two documents have the same timestamp. Because search requests are round-robined between all available shard copies, the two documents may be returned in one order by the primary shard and in a different order by a replica.
This is known as the bouncing results problem: every time the user refreshes the page, the order of the results changes. The way to avoid it is to always use the same shards for the same user, which you can do by setting the preference parameter to an arbitrary string such as the user's session ID.
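The idea can be illustrated with a simplified model (the function and names below are invented; this is not Elasticsearch's real copy-selection algorithm): hashing a stable preference string always selects the same shard copy, so the same user always sees tied results in the same order.

```python
import hashlib

def pick_shard_copy(preference, copies):
    """Deterministically map a preference string to one shard copy by
    hashing it. Illustrative model only, not Elasticsearch internals."""
    digest = hashlib.md5(preference.encode("utf-8")).hexdigest()
    return copies[int(digest, 16) % len(copies)]

copies = ["primary", "replica_1", "replica_2"]
session_id = "user-session-4711"

# Every request carrying the same session ID lands on the same copy,
# so documents with equal sort values are always returned in one order.
chosen = pick_shard_copy(session_id, copies)
assert all(pick_shard_copy(session_id, copies) == chosen for _ in range(10))
```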
2. Timeout (timeout)
Typically, the coordinating node waits until it has received a response from every shard. If a single node is having trouble, it slows down the entire search request.
The timeout parameter tells the coordinating node how long to wait before giving up and returning the results it has collected so far. Returning partial results is better than returning nothing at all.
The response to a search request indicates whether the search timed out and how many shards responded successfully:
... " Timed_out ": True, (1) " _shards ": { " total ": 5, " successful ": 4, " failed ": 1 (2) }, ...
(1) The search request timed out.
(2) One shard out of five did not respond within the timeout period.
If all copies of a shard fail for some other reason, perhaps because of a hardware failure, this is also reflected in the _shards section of the response.
3. Routing (routing)
In the routing values section, we explained how a custom routing parameter can be provided at index time to ensure that all related documents (such as the documents belonging to a single user) are stored on the same shard. At search time, instead of searching every shard of the index, you can specify one or more routing values to limit the search to just those shards:
GET /_search?routing=user_1,user2
This technique comes in handy when designing very large search systems. We discuss it in detail in the chapter on scaling.
4. search_type (search type)
Although query_then_fetch is the default search type, other search types can be specified for particular purposes, for example:
GET /_search?search_type=count
count (counting)
The count search type has only a query phase. Use it when you don't need the search results themselves, only the number of documents that match the query.
query_and_fetch (query and fetch)
The query_and_fetch search type combines the query and fetch phases into a single step. It is an internal optimization that can be used when a search request targets only a single shard, for example when a routing value has been specified. Although you can choose this search type manually, there is essentially no reason to do so.
dfs_query_then_fetch and dfs_query_and_fetch
The dfs search types have a pre-query phase that fetches the term frequencies from all of the shards involved in order to calculate global term frequencies. We discuss this further in "Relevance is broken".
scan (scanning)
The scan search type is used together with the scroll API to retrieve large numbers of results efficiently. It works by disabling sorting. We discuss it in the next section, "Scan and scroll".
Fourth, scan and scroll
The scan search type is used together with the scroll API to retrieve large numbers of results from Elasticsearch efficiently, without paying the penalty of deep paging.
Scroll (scrolling)
A scrolled search lets us run an initial search and then keep pulling batches of results from Elasticsearch until there are none left, a bit like a cursor in a traditional database.
A scrolled search takes a snapshot in time: it does not see any changes made to the index after the initial search request. It does this by keeping the old data files around, so it can preserve its view of the index as it was at the moment the search started.
Scan (scanning)
The most expensive part of deep paging is the global sorting of results, but if we disable sorting we can retrieve all of the documents quite cheaply. This is what the scan search type does: it tells Elasticsearch to do no sorting and simply to return a batch of results from each shard for as long as the shards still have results to return.
To use scan-and-scroll, a search request sets search_type to scan and passes a scroll parameter that tells Elasticsearch how long the scroll should be kept open:

GET /old_index/_search?search_type=scan&scroll=1m     (1)
{
    "query": { "match_all": {} },
    "size":  1000
}

(1) Keep the scroll open for 1 minute.
The response to this request doesn't include any hits, but it does include a Base-64-encoded _scroll_id string. Now we can pass the _scroll_id to the _search/scroll endpoint to fetch the first batch of results:
GET /_search/scroll?scroll=1m     (1)
c2nhbjs1ozexodprnv9ay1vyuvm4u0nmd2pjwlj3ywlbozexotprnv9ay1vyuvm4u0nmd2pjwlj3ywlbozexnjprnv9ay1vyuvm4u0nmd2pjwlj3ywlbozexnzprnv9ay1vyuvm4u0nmd2pjwlj3ywlbozeymdprnv9ay1vyuvm4u0nmd2pjwlj3ywlboze7dg90ywxfagl0czoxow==     (2)

(1) Keep the scroll open for another minute.
(2) The _scroll_id can be passed in the request body, in the URL, or as a query parameter.
Note that we specify ?scroll=1m again. The scroll expiry time is refreshed every time we run a scroll request, so it only needs to be long enough to process the current batch of results, not all of the documents that match the query.
The response to this scroll request includes the first batch of results. Although we specified a size of 1000, we may receive many more documents than that. When scanning, size is applied to each shard individually, so each batch can contain a maximum of size * number_of_primary_shards documents.
Note:
The scroll request also returns a new _scroll_id. Every time we make the next scroll request, we must pass in the _scroll_id returned by the previous request.
When no more hits are returned, we know that all of the matching documents have been processed.
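The request loop described above can be sketched with a toy, in-memory stand-in for the scroll endpoint (the class and names below are invented for illustration; a real client would issue HTTP requests instead). The key pattern is the same: keep passing the newest _scroll_id back, and stop when a batch comes back empty.

```python
class FakeScrollingIndex:
    """Toy stand-in for Elasticsearch's scroll endpoint: each call
    returns the next batch of hits plus a fresh scroll ID."""

    def __init__(self, docs, batch_size):
        self.docs = docs              # the point-in-time snapshot
        self.batch_size = batch_size
        self.offsets = {}             # scroll_id -> position in snapshot
        self.counter = 0

    def start_scan(self):
        # The initial scan request returns no hits, only a scroll ID.
        return self._new_scroll_id(0)

    def scroll(self, scroll_id):
        offset = self.offsets.pop(scroll_id)
        batch = self.docs[offset : offset + self.batch_size]
        return batch, self._new_scroll_id(offset + len(batch))

    def _new_scroll_id(self, offset):
        self.counter += 1
        scroll_id = f"scroll-{self.counter}"
        self.offsets[scroll_id] = offset
        return scroll_id

index = FakeScrollingIndex(docs=list(range(25)), batch_size=10)
scroll_id = index.start_scan()
collected = []
while True:
    hits, scroll_id = index.scroll(scroll_id)  # always pass the newest ID
    if not hits:                               # empty batch: all done
        break
    collected.extend(hits)
print(len(collected))  # -> 25
```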
Tip:
Some of the official Elasticsearch clients provide a scan-and-scroll helper that wraps this functionality in a simple interface.
ElasticSearch (8)-Distributed search