Elasticsearch Routing documents to shards

Source: Internet
Author: User

Routing Documents to shards

When you index a document, it is stored on a single primary shard. How does Elasticsearch know which shard the document belongs to? When you create a new document, how does it know if it should be stored on Shard 1 or Shard 2?

The process cannot be random because we will retrieve the document in the future. In fact, it is determined by a simple algorithm:

shard = hash(routing) % number_of_primary_shards

The routing value is an arbitrary string. By default it is the document's _id, but it can also be set to a custom value. The routing string is passed through a hash function to produce a number, which is then divided by the number of primary shards to obtain a remainder. The remainder always falls in the range 0 to number_of_primary_shards - 1, and it is the number of the shard on which the document resides.
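
The formula can be sketched in a few lines of Python. This is only an illustration: `shard_for` is a hypothetical helper, and `zlib.crc32` is a stand-in for the hash function Elasticsearch uses internally, so the shard numbers it yields will not match those of a real cluster.

```python
import zlib


def shard_for(routing: str, number_of_primary_shards: int) -> int:
    """Return the shard number for a routing string."""
    # Stand-in hash (assumption): Elasticsearch uses its own hash function.
    hashed = zlib.crc32(routing.encode("utf-8"))
    # The remainder is always in the range 0 .. number_of_primary_shards - 1.
    return hashed % number_of_primary_shards


for doc_id in ("1", "2", "user123"):
    print(doc_id, "->", shard_for(doc_id, 5))
```

The same routing string always produces the same shard number, which is what makes later retrieval possible.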

This also explains why the number of primary shards can only be set when the index is created and cannot be changed afterwards: if the number of primary shards changed, all previous routing values would become invalid and existing documents could no longer be found.
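
A small sketch, under the same assumptions (a hypothetical `shard_for` helper with `zlib.crc32` standing in for the real hash), shows why the shard count cannot change: with a different divisor, most documents would be looked up on a shard other than the one that stores them.

```python
import zlib


def shard_for(routing, number_of_primary_shards):
    # Stand-in hash (assumption), not the function Elasticsearch uses.
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


def moved_documents(doc_ids, old_shards, new_shards):
    """Return the IDs whose computed shard differs between the two shard counts."""
    return [
        doc_id
        for doc_id in doc_ids
        if shard_for(doc_id, old_shards) != shard_for(doc_id, new_shards)
    ]


doc_ids = [f"doc-{i}" for i in range(1000)]
moved = moved_documents(doc_ids, old_shards=5, new_shards=6)
print(f"{len(moved)} of {len(doc_ids)} documents would be looked up on the wrong shard")
```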

Sometimes users think that a fixed number of primary shards can make subsequent expansions difficult. In reality, some techniques can make scaling easier when you need it.

All document APIs (get, index, delete, bulk, update, and mget) accept a routing parameter, which can be used to customize the mapping of documents to shards. A custom routing value ensures that all related documents, such as all documents belonging to the same user, are stored on the same shard. We'll explain why you would want to do this in the Extensions section.

Reference: http://es.xiaoleilu.com/040_Distributed_CRUD/05_Routing.html

Why would we need custom routing? The default routing scheme meets our needs in many cases: it distributes data evenly, it is transparent to us, and most of the time performance is not a problem. Once we understand the characteristics of our data better, however, custom routing may give us better performance.

In general, how Elasticsearch distributes data across shards, and which shard stores which document, is not important, because a query is sent to every shard anyway. The only key point is that the algorithm distributes the data evenly across the shards.

Deleting or updating a document is a little more complicated, but it is not a big problem either. As long as the sharding algorithm always generates the same value for the same document identifier, Elasticsearch knows how to find the correct shard when it processes the document.

Still, would it not be better to choose the shard for a document in a smarter way? For example, we could store books of a particular type on a specific shard, so that a search for that type of book avoids the other shards and also avoids merging results from multiple shards. This is where routing comes in. Routing gives Elasticsearch the information it needs to determine which shards to use for storage and for querying. The same routing value always maps to the same shard. In other words: "By using user-supplied routing values, you can do directed storage and directed search."
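
As a rough illustration of directed storage, the sketch below places the same documents twice: once routed by `_id` and once routed by a customer field. The field names, the shard count, and the `zlib.crc32` stand-in hash are all assumptions made for the example.

```python
import zlib
from collections import defaultdict

NUMBER_OF_PRIMARY_SHARDS = 5  # assumed value for the example


def shard_for(routing):
    # Stand-in hash (assumption), not the function Elasticsearch uses.
    return zlib.crc32(routing.encode("utf-8")) % NUMBER_OF_PRIMARY_SHARDS


def place(documents, use_custom_routing):
    """Map each customer to the set of shards that hold their documents."""
    shards_by_customer = defaultdict(set)
    for doc in documents:
        routing = doc["customer"] if use_custom_routing else doc["_id"]
        shards_by_customer[doc["customer"]].add(shard_for(routing))
    return shards_by_customer


documents = [{"_id": f"order-{i}", "customer": f"user{i % 3}"} for i in range(60)]
default_placement = place(documents, use_custom_routing=False)
custom_placement = place(documents, use_custom_routing=True)

for customer in sorted(custom_placement):
    print(customer, "by _id:", sorted(default_placement[customer]),
          "by customer:", sorted(custom_placement[customer]))
```

With custom routing, each customer's documents end up on exactly one shard, so a search for that customer only needs to visit that shard.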

Suppose you have an index with 100 shards. What happens when a search request is executed on the cluster?

1. The search request is sent to a node.

2. The node that receives the request broadcasts the query to every shard of the index (either the primary or a replica of each shard).

3. Each shard executes the search query and returns its results.

4. The receiving node merges and sorts the results and returns them to the user.
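
The four steps can be modelled with an in-memory sketch. It is a toy model under stated assumptions: each shard is a plain list of documents, every document carries a precomputed `score`, and `search_all_shards` is a hypothetical name, not an Elasticsearch API.

```python
import heapq
from itertools import islice


def search_all_shards(shards, matches, size):
    """Broadcast a query to every shard and merge the sorted partial results."""
    partial_results = []
    for shard in shards:
        # Step 3: each shard runs the query and returns its own top hits.
        hits = sorted(
            (doc for doc in shard if matches(doc)),
            key=lambda doc: doc["score"],
            reverse=True,
        )
        partial_results.append(hits[:size])
    # Step 4: the receiving node merges the sorted lists and keeps the best hits.
    merged = heapq.merge(*partial_results, key=lambda doc: doc["score"], reverse=True)
    return list(islice(merged, size))


shards = [
    [{"id": "a", "user": "user1", "score": 0.9}, {"id": "b", "user": "user2", "score": 0.4}],
    [{"id": "c", "user": "user1", "score": 0.7}],
    [{"id": "d", "user": "user1", "score": 0.2}, {"id": "e", "user": "user3", "score": 0.8}],
]
top = search_all_shards(shards, lambda doc: doc["user"] == "user1", size=2)
print([doc["id"] for doc in top])
```

Every shard does work even though only some of them hold matching documents, which is the cost that custom routing tries to avoid.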


By default, Elasticsearch uses the ID of the document to distribute documents evenly across all shards. (The ID is similar to an auto-increment ID in a relational database; if you do not specify one, Elasticsearch generates a random value.) As a result, Elasticsearch cannot tell in advance which shards hold the documents that match a search, so it must broadcast the request to all 100 shards. This also explains why the number of primary shards is fixed when the index is created and cannot be changed afterwards: if the number of shards changed, all previous routing values would become invalid and the documents could no longer be found.

The original query: "Please tell me how many documents USER1 has in total."

The query after using custom routing (on the user ID): "Please tell me how many documents USER1 has. They are on the third shard, so the other shards do not need to be scanned."
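
The difference between the two queries comes down to how many shards must be visited. A minimal sketch, assuming a hypothetical `shards_to_query` helper and `zlib.crc32` as a stand-in hash:

```python
import zlib


def shard_for(routing, number_of_primary_shards):
    # Stand-in hash (assumption), not the function Elasticsearch uses.
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


def shards_to_query(number_of_primary_shards, routing=None):
    """Return the shard numbers a search has to visit."""
    if routing is None:
        # No routing value: the query is broadcast to every shard.
        return list(range(number_of_primary_shards))
    # With a routing value, only the shard it maps to is searched.
    return [shard_for(routing, number_of_primary_shards)]


print(len(shards_to_query(100)))           # broadcast search
print(len(shards_to_query(100, "USER1")))  # routed search
```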


Specifying a custom route

All document APIs (get, index, delete, update, and mget) accept a routing parameter, which can be used to define a custom mapping of documents to shards. A custom routing value ensures that related documents, for example all documents belonging to the same user, are stored on the same shard.


The first and more straightforward method is to specify the routing parameter directly in the URL of the request:

    curl -XPOST 'http://localhost:9200/store/order?routing=user123' -d '
    {
        "productName": "Sample",
        "CustomerID": "user123"
    }'

This places all documents with the same CustomerID on the same shard, because the CustomerID value is passed as the routing value.
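
The same request can be built from Python with the standard library alone. The sketch below only constructs the request object and does not send it; `build_index_request` is a hypothetical helper, and the `Content-Type` header is an assumption that the curl command above does not include.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_index_request(base_url, index, doc_type, document, routing):
    """Build (but do not send) an index request that carries a routing value."""
    query = urlencode({"routing": routing})
    url = f"{base_url}/{quote(index)}/{quote(doc_type)}?{query}"
    body = json.dumps(document).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},  # assumed header
    )


request = build_index_request(
    "http://localhost:9200",
    "store",
    "order",
    {"productName": "Sample", "CustomerID": "user123"},
    routing="user123",
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it to a running cluster.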

The second method is to extract the corresponding route values directly from the document:

    curl -XPUT 'http://localhost:9200/store/order/_mapping' -d '
    {
        "order": {
            "_routing": {
                "required": true,
                "path": "CustomerID"
            }
        }
    }'

This method has the same effect as the first, but note that it is less efficient: the first method takes the routing value directly from the request parameter, whereas the second must first read the document and extract the routing value from it.
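
The extra work in the second method can be seen in a small sketch. `routing_from_request` is a hypothetical helper that simplifies the idea and does not reproduce how Elasticsearch actually resolves conflicts between the two methods.

```python
import json


def routing_from_request(raw_body, url_routing=None, path="CustomerID"):
    """Return the routing value, parsing the body only when the URL has none."""
    if url_routing is not None:
        # Method 1: the value is already in the request parameters.
        return url_routing
    # Method 2: the document must be parsed before the value is known.
    document = json.loads(raw_body)
    if path not in document:
        raise ValueError(f"routing is required but field {path!r} is missing")
    return str(document[path])


body = b'{"productName": "Sample", "CustomerID": "user123"}'
print(routing_from_request(body, url_routing="user123"))
print(routing_from_request(body))
```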

Query using the routing mechanism

The query using the routing mechanism is also very straightforward, just specify the corresponding route value in the query:

    curl -XGET 'http://localhost:9200/store/order/_search?routing=user123' -d '
    {
        "query": {
            "filtered": {
                "query": {
                    "match_all": {}
                },
                "filter": {
                    "term": {
                        "UserID": "user123"
                    }
                }
            }
        }
    }'

By specifying the routing value, we go directly to the shard that holds the documents of user123, without sending the request to every shard of the index. This greatly reduces the waste of system resources.

Of course, you can also specify multiple route values at the same time, and the method is obvious, just specify multiple route values in the query parameters:

    curl -XGET 'http://localhost:9200/forum/posts/_search?routing=Admin,Moderator' -d '{}'
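
With several routing values, the search visits the shard of each value. A minimal sketch, again assuming `zlib.crc32` as a stand-in hash and a hypothetical `shards_for_routing_param` helper:

```python
import zlib


def shard_for(routing, number_of_primary_shards):
    # Stand-in hash (assumption), not the function Elasticsearch uses.
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


def shards_for_routing_param(routing_param, number_of_primary_shards):
    """Return the sorted shard numbers for a comma-separated routing parameter."""
    values = [value for value in routing_param.split(",") if value]
    # A set is used because two values may map to the same shard.
    return sorted({shard_for(value, number_of_primary_shards) for value in values})


shards = shards_for_routing_param("Admin,Moderator", 100)
print(shards)
```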

A summary of the routing mechanism

In fact, routing is always at work even when it is not specified explicitly; the default routing value is simply the ID of the document. The need for custom routing is driven mainly by business requirements. Default routing (with automatically generated IDs) in effect assigns each document to a random shard, whereas custom routing values are tied to the business data. This can cause problems. For example, user123 may have hundreds of thousands of documents while most other users have only a few. The shard that holds user123 then grows much larger than the others, which also makes data migration harder, and the effect is more pronounced when several such users land on the same shard. Whether to use custom routing depends on the actual application scenario.
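
The skew described above can be estimated before choosing a routing key. The sketch below uses made-up document counts and `zlib.crc32` as a stand-in hash; `docs_per_shard` is a hypothetical helper.

```python
import zlib
from collections import Counter


def shard_for(routing, number_of_primary_shards):
    # Stand-in hash (assumption), not the function Elasticsearch uses.
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


def docs_per_shard(doc_counts_by_user, number_of_primary_shards):
    """Count how many documents each shard holds when routing by user."""
    sizes = Counter({shard: 0 for shard in range(number_of_primary_shards)})
    for user, count in doc_counts_by_user.items():
        sizes[shard_for(user, number_of_primary_shards)] += count
    return sizes


# Made-up data: one heavy user and 1000 light users.
doc_counts = {"user123": 300_000}
doc_counts.update({f"small-user-{i}": 5 for i in range(1000)})

sizes = docs_per_shard(doc_counts, 5)
hot_shard = shard_for("user123", 5)
print("hot shard:", hot_shard, "sizes:", dict(sizes))
```

The shard that receives user123 holds at least 300,000 documents, while the other shards share the remaining 5,000 between them.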

Reference: http://blog.csdn.net/cnweike/article/details/38531997
