[Elasticsearch] distributed search

Source: Internet
Author: User

Distributed search

This article is a translation of the Distributed Search Execution chapter of the official Elasticsearch guide.

Before moving on, let's look at how search is executed in a distributed environment. The process is more complex than the basic CRUD operations discussed in the Distributed Document Store chapter.

A CRUD operation deals with a single document, identified by a unique combination of `_index`, `_type`, and routing value (which defaults to the document's `_id`). This means that we know exactly which shard in the cluster holds that document.

The search execution model is much more complex, however, because we cannot know in advance which documents will match: they could be on any shard in the cluster. A search request therefore has to consult a copy of every shard in the index to find out whether it has any matching documents.

Finding all the matching documents, however, is only half the story. The results from the individual shards must be merged into a single sorted list before the search API can return them to the client. For this reason, search is executed as a two-phase process: query then fetch.


Query Phase

In the initial query phase, the query is broadcast to a shard copy (a primary or replica shard) of every shard in the index. Each shard executes the search locally and places the matching documents into a priority queue.

Priority queue

A priority queue is just a sorted list that holds the top N matching documents. Its size depends on the paging parameters `from` and `size`. For example, the following search request would need a priority queue big enough to hold 100 documents:

GET /_search
{
    "from": 90,
    "size": 10
}

The query phase consists of the following three steps:

  1. The client sends a search request to Node 3, which creates an empty priority queue of size `from + size`.
  2. Node 3 forwards the search request to a primary or replica copy of every shard in the index. Each shard executes the query locally and adds its results to a local sorted priority queue of size `from + size`.
  3. Each shard returns the IDs and sort values of the documents in its priority queue to the coordinating node, Node 3, which merges them into its own priority queue to produce the globally sorted result list.
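
To make step 3 concrete, here is a minimal Python sketch (illustrative only, with invented shard results; not Elasticsearch source code) of how a coordinating node can merge the pre-sorted per-shard result lists into a global top `from + size` list:

```python
import heapq

def merge_shard_results(shard_results, from_, size):
    """Merge per-shard (doc_id, score) lists, each already sorted by
    descending score, into the global top (from_ + size) document IDs."""
    top_n = from_ + size
    # heapq.merge lazily merges the pre-sorted per-shard lists; the key
    # negates the score so that higher scores come first.
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[1])
    return [doc_id for doc_id, _score in merged][:top_n]

# Hypothetical lightweight results (ID, _score) from three shards:
shard_1 = [("a", 0.9), ("b", 0.4)]
shard_2 = [("c", 0.8), ("d", 0.7)]
shard_3 = [("e", 0.5)]

print(merge_shard_results([shard_1, shard_2, shard_3], from_=0, size=3))
# ['a', 'c', 'd']
```

Note that each shard only ships IDs and sort values at this stage; the documents themselves are retrieved later, in the fetch phase.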

When a search request is sent to a node, that node becomes the coordinating node. Its job is to broadcast the search request to a copy of every shard involved and then merge their intermediate results into a globally sorted list that can be returned to the client.

The first step is to broadcast the request to a shard copy (primary or replica) of every shard in the index. Just like GET requests for individual documents, search requests can be handled by a primary shard or by any of its replicas. This is why adding replicas (on more hardware) increases search throughput: the coordinating node distributes the load by sending requests to all shard copies in round-robin fashion.

Each shard executes the query locally and builds a sorted priority queue of length `from + size`; in other words, each shard produces enough results to satisfy the global search request on its own. It returns a lightweight list of results to the coordinating node, containing only the document IDs and any values required for sorting, such as `_score`.

The coordinating node merges these shard-level results into its own globally sorted priority queue. At this point, the query phase is complete.

Multi-index Search

An index can consist of one or more primary shards, so even a search against a single index must merge results from multiple shards. Searching multiple indices, or all indices, works in exactly the same way; there are simply more shards involved.


Fetch Phase

The query phase identifies the documents that satisfy the search request, but we still need to retrieve the documents themselves. That is the job of the fetch phase, which consists of the following steps:

  1. The coordinating node identifies which documents need to be fetched and issues a multi-GET request to the relevant shards.
  2. Each shard loads the documents, enriches them if required, and returns them to the coordinating node.
  3. Once all documents have been fetched, the coordinating node returns the results to the client.

The coordinating node first decides which documents actually need to be fetched. For example, if the query specified `{ "from": 90, "size": 10 }`, the first 90 results are discarded, and only the documents for the remaining 10 results need to be fetched. These documents may come from one shard or from several.

The coordinating node builds a multi-GET request for each shard that holds one of the target documents and sends it to the same shard copies that handled the query phase.

Each shard loads the document bodies (the `_source` field) and, if requested, enriches the results with metadata and search snippet highlighting. Once the coordinating node has received all the results, it assembles them into a single response and returns it to the client.
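
As a rough illustration of this selection step (Python, with hypothetical document and shard names), the coordinating node keeps only results `from` through `from + size` and groups their IDs by shard for the multi-GET requests:

```python
from collections import defaultdict

def plan_fetch(global_results, from_, size):
    """global_results: globally sorted (doc_id, shard) pairs from the
    query phase. Returns {shard: [doc_ids]} for the multi-GET requests."""
    page = global_results[from_:from_ + size]  # the first `from_` hits are discarded
    requests = defaultdict(list)
    for doc_id, shard in page:
        requests[shard].append(doc_id)
    return dict(requests)

# 100 hypothetical query-phase results spread across two shards:
results = [(f"doc_{i}", "shard_0" if i % 2 == 0 else "shard_1")
           for i in range(100)]

# With from=90, size=10, only 10 of the 100 ranked hits are fetched.
print(plan_fetch(results, from_=90, size=10))
```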

Deep Pagination

The query-then-fetch process supports pagination via the `from` and `size` parameters, but within limits. Remember that each shard must build a priority queue of `from + size` entries, all of which are sent back to the coordinating node. The coordinating node then has to sort through `number_of_shards * (from + size)` documents just to find the `size` documents that make up the requested page.
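
A quick back-of-the-envelope calculation in Python, using example numbers, shows how fast that cost grows:

```python
def docs_sorted_by_coordinator(number_of_shards, from_, size):
    # Each shard returns a priority queue of from_ + size entries,
    # and the coordinating node must sort through all of them.
    return number_of_shards * (from_ + size)

# Page 1 of a 5-shard index: cheap.
print(docs_sorted_by_coordinator(5, from_=0, size=10))       # 50
# Page 1001 of the same index: 50,050 entries sorted per request.
print(docs_sorted_by_coordinator(5, from_=10_000, size=10))  # 50050
```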

Depending on the size of your documents, the number of shards, and the hardware you are using, paging 10,000 to 50,000 results deep (1,000 to 5,000 pages) should be feasible. With big enough `from` values, however, the sorting process becomes very expensive in CPU, memory, and bandwidth. For this reason, we strongly advise against deep paging.

In practice, deep page requests rarely come from humans anyway: a user typically refines the search criteria after browsing two or three pages. The usual culprits are web crawlers that tirelessly fetch page after page, and they can bring your servers to their knees.

If you really do need to fetch a large number of documents from the cluster, you can do so efficiently by disabling sorting with the scan search type, which we discuss later in the Scan and Scroll section.


Search Options

A few optional query-string parameters can influence the search process:

Preference

The `preference` parameter lets you control which shards or nodes handle a search request. It accepts values such as `_primary`, `_primary_first`, `_local`, `_only_node:xyz`, `_prefer_node:xyz`, and `_shards:2,3`. Their meanings are explained in detail in the preference documentation.

The most generally useful value, however, is an arbitrary string, used to avoid the bouncing results problem.

Bouncing Results

Imagine that you are sorting results by a `timestamp` field and two documents have the same timestamp. Because search requests are round-robined between the available shard copies, the two documents may be returned in a different order depending on which copy serves the request; for example, the primary shard and a replica shard may order them differently.

This is the bouncing results problem: every time the user refreshes the page, the results appear in a different order.

The problem can be avoided by always routing the same user to the same shards: set the `preference` parameter to an arbitrary string, such as the user's session ID.
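
The idea can be sketched as follows (Python; the hash-based copy selection is purely illustrative and is not Elasticsearch's actual algorithm): without a preference, the copies rotate round-robin, while a fixed preference string always maps to the same copy:

```python
import hashlib
import itertools

copies = ["primary", "replica_1", "replica_2"]
round_robin = itertools.cycle(copies)

def pick_copy(preference=None):
    """Round-robin by default; deterministic when a preference is given."""
    if preference is None:
        return next(round_robin)
    # Hash the preference string to a stable copy index (illustrative only).
    digest = hashlib.md5(preference.encode()).digest()
    return copies[digest[0] % len(copies)]

# Without a preference, the serving copy changes from request to request:
print([pick_copy() for _ in range(4)])
# With the user's session ID, the same copy is chosen every time,
# so the result order cannot bounce:
print({pick_copy("session-42ab") for _ in range(4)})
```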

Timeout

By default, the coordinating node waits for a response from every shard. If a single node is in trouble, it can slow down the response to every search request.

The `timeout` parameter tells the coordinating node how long to wait before giving up. If it gives up, it simply returns the results it has gathered so far; returning partial results is usually better than returning nothing at all.

The response to a search request indicates whether the search timed out and how many shards responded successfully:

      ...    "timed_out":     true,      "_shards": {       "total":      5,       "successful": 4,       "failed":     1     },...
      Routing

In the Routing a Document to a Shard section of the Distributed Document Store chapter, we explained that a custom `routing` parameter can be provided at index time to ensure that related documents, such as all the documents belonging to a single user, are stored on one shard. At search time, you can specify one or more `routing` values to limit the search to those shards:

GET /_search?routing=user_1,user2

This technique is useful when designing very large search systems; we cover it in detail in the Designing for Scale chapter.
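
The reason this works is that Elasticsearch stores a document on the shard given by `hash(routing) % number_of_primary_shards`, so a search that supplies the same routing values only needs to visit those shards. A rough Python illustration, using CRC32 as a stand-in for Elasticsearch's real routing hash:

```python
import zlib

def shard_for(routing_value, number_of_primary_shards=5):
    # Stand-in for Elasticsearch's murmur3-based routing hash.
    return zlib.crc32(routing_value.encode()) % number_of_primary_shards

# Every document indexed with routing=user_1 lands on the same shard,
# so a search with ?routing=user_1 only has to query that one shard.
print(shard_for("user_1"))
print(shard_for("user_1") == shard_for("user_1"))  # deterministic: True
```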

search_type

While `query_then_fetch` is the default search type, other search types can be specified for particular purposes, for example:

GET /_search?search_type=count

count

The `count` search type runs only the query phase. Use it when you do not need the search results themselves, only the number of matching documents or the aggregation results.

query_and_fetch

The `query_and_fetch` search type combines the query and fetch phases into a single step. It is an internal optimization used when a search request targets a single shard, for example when a `routing` value has been specified. Although you can choose this search type manually, doing so is almost never useful.

dfs_query_then_fetch and dfs_query_and_fetch

The `dfs` search types add a pre-query phase that fetches the term frequencies from the relevant shards in order to compute global term frequencies. We discuss this further in the Relevance Is Broken section.

scan

The `scan` search type is used together with the `scroll` API to retrieve large numbers of results efficiently. It works by disabling sorting and is discussed in the next section.


Scan and scroll

The `scan` search type and the `scroll` API are used together to retrieve large numbers of documents from Elasticsearch efficiently, without the penalties of deep pagination.

Scroll

A scrolled search lets us run an initial search and then keep pulling batches of results from Elasticsearch until none remain, somewhat like a cursor in a traditional database.

A scrolled search takes a snapshot in time: it does not see any changes made to the index after the initial search request. It achieves this by keeping the old data files around, preserving a view of the index as it looked when the search began.

Scan

The costliest part of deep pagination is the global sorting of results, but if we disable sorting we can return all matching documents quite cheaply. This is what the `scan` search type does: it tells Elasticsearch to skip sorting and simply return the next batch of results from every shard that still has results.

To use scan and scroll, we set the search type to `scan` and pass a `scroll` parameter telling Elasticsearch how long the scroll should stay open:

GET /old_index/_search?search_type=scan&scroll=1m
{
    "query": { "match_all": {}},
    "size":  1000
}

The request above keeps the scroll open for one minute.

The response to this request contains no hits, but it does include a `_scroll_id`, a long Base64-encoded string. We can now pass the `_scroll_id` to the `_search/scroll` endpoint to retrieve the first batch of results:

GET /_search/scroll?scroll=1m
c2Nhbjs1OzExODpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzExOTpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzExNjpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzExNzpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzEyMDpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzE7dG90YWxfaGl0czoxOw==

This request keeps the scroll alive for another minute. The `_scroll_id` can be passed in the request body, in the URL, or as a query-string parameter.

Note that we specify `?scroll=1m` again. The scroll expiry time is refreshed every time a scroll request runs, so it only needs to be long enough to process the current batch of results, not all of the matching documents.

The response to this scroll request contains the first batch of results. Although we specified a `size` of 1000, we may receive many more documents than that: the `size` applies to each shard, so each batch can return up to `size * number_of_primary_shards` documents.

NOTE

Every scroll request returns a new `_scroll_id`, and each subsequent scroll request must pass the most recent one.

When a scroll request returns no more hits, all the matching documents have been processed.
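
The whole scan-and-scroll loop can be sketched in Python as follows, with a simulated `next_batch` function standing in for the real `_search/scroll` HTTP round-trip and invented document IDs:

```python
def make_fake_scroll(batches):
    """Simulates _search/scroll: each call returns (new_scroll_id, hits)."""
    state = {"cursor": 0}
    def next_batch(scroll_id):
        i = state["cursor"]
        state["cursor"] += 1
        hits = batches[i] if i < len(batches) else []
        return f"scroll_{i + 1}", hits
    return next_batch

next_batch = make_fake_scroll([["doc_1", "doc_2"], ["doc_3"]])

processed = []
scroll_id = "scroll_0"   # from the initial search_type=scan request
while True:
    # Always pass the most recent _scroll_id; each response returns a new one.
    scroll_id, hits = next_batch(scroll_id)
    if not hits:         # an empty batch means all matching docs are processed
        break
    processed.extend(hits)

print(processed)         # ['doc_1', 'doc_2', 'doc_3']
```

The official clients' scan-and-scroll helpers wrap exactly this kind of loop for you.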

TIP

The official Elasticsearch clients provide scan-and-scroll helpers that wrap this functionality.

