[MongoDB translation] Selecting an appropriate shard key

Source: Internet
Author: User

It is important to select an appropriate shard key for a collection. Once the collection has grown very large, changing the shard key later is very difficult. If you have any questions, ask for help on the forum or on IRC.

 

Sample document

  {
    server: "ny153.example.com",
    application: "apache",
    time: "2011-01-02T21:21:56.249Z",
    level: "ERROR",
    msg: "something is broken"
  }

Cardinality

All data in a collection is split into multiple chunks. Each chunk holds the documents whose shard key values fall within a specific range. Documents that share the same shard key value cannot be separated, so a poorly chosen shard key can produce a large chunk that cannot be split.

If you select:

  {server: 1}

as the shard key, all data for one server ends up in a single chunk. It is easy to imagine the data for one server exceeding 64 MB (the default chunk size). If the shard key is instead:

  {server: 1, time: 1}

the data for a single server can be split down to the millisecond. As long as a single server does not produce 200 MB of data per millisecond, there will be no chunk that cannot be split.

Keeping chunks at a reasonable size is very important: it lets data be distributed evenly across the cluster and keeps the cost of moving a chunk low.
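
To make the cardinality point concrete, here is a minimal sketch in plain JavaScript (it does not call any MongoDB API). It groups documents by shard key value and reports the size of the largest group, on the assumption that documents with an identical shard key value always stay in one chunk. The function name `largestUnsplittableGroup`, the per-document `bytes` field, and the sample data are all hypothetical.

```javascript
// Hypothetical helper: documents that share the exact same shard key value
// must live in the same chunk, so the largest such group is a lower bound
// on the size of the biggest chunk that can never be split.
function largestUnsplittableGroup(docs, keyFields) {
  const keyOf = (doc) => JSON.stringify(keyFields.map((f) => doc[f]));
  const bytesPerKey = new Map();
  for (const doc of docs) {
    const k = keyOf(doc);
    bytesPerKey.set(k, (bytesPerKey.get(k) || 0) + doc.bytes);
  }
  return Math.max(...bytesPerKey.values());
}

// Illustrative data: one server logging 1,000 entries of 100 bytes each,
// one entry per millisecond.
const logs = [];
for (let i = 0; i < 1000; i++) {
  logs.push({ server: "ny153.example.com", time: 1293999716249 + i, bytes: 100 });
}

const byServer = largestUnsplittableGroup(logs, ["server"]);
const byServerTime = largestUnsplittableGroup(logs, ["server", "time"]);
console.log(byServer);     // 100000: every entry shares one key value
console.log(byServerTime); // 100: each millisecond is its own key value
```

With the server-only key the whole log is one indivisible group, while adding the time field reduces the largest group to a single entry.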

 

Write scaling

One of the main reasons for using sharding is to distribute write operations. To achieve this, writes should be spread across as many different chunks as possible.

Using the preceding example again, if you select:

  {time: 1}

as the shard key, all write operations are concentrated in the newest chunk. If the shard key is instead:

  {server: 1, application: 1, time: 1}

then each server:application pair is written to a different place. If there are 100 server:application pairs and 10 shard servers, each shard server receives about 1/10 of the write operations.
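
The difference between the two keys can be sketched with a small simulation in plain JavaScript. It assumes range-based chunks whose split points were chosen from older data. The functions `compareKeys`, `chunkFor`, and `writesPerChunk`, the split points, and the workload are hypothetical and only illustrate the idea; they are not how mongos is implemented.

```javascript
// Compare two compound key values (arrays) field by field.
function compareKeys(a, b) {
  for (let i = 0; i < a.length; i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return 0;
}

// Return the index of the chunk that owns `key`. Chunk i covers
// [splitPoints[i-1], splitPoints[i]); the last chunk is open-ended.
function chunkFor(key, splitPoints) {
  let i = 0;
  while (i < splitPoints.length && compareKeys(key, splitPoints[i]) >= 0) i++;
  return i;
}

// Count how many of the incoming writes land in each chunk.
function writesPerChunk(keys, splitPoints) {
  const counts = new Array(splitPoints.length + 1).fill(0);
  for (const key of keys) counts[chunkFor(key, splitPoints)]++;
  return counts;
}

// Illustrative workload: 4 servers each write one entry per tick, with
// timestamps later than anything already stored.
const servers = ["a", "b", "c", "d"];
const writes = [];
for (let t = 1000; t < 1010; t++) {
  for (const s of servers) writes.push({ server: s, application: "apache", time: t });
}

// Shard key {time: 1}: every new timestamp sorts after the last split point.
const timeCounts = writesPerChunk(
  writes.map((w) => [w.time]),
  [[250], [500], [750]]
);

// Shard key {server: 1, application: 1, time: 1}: split points fall
// between servers, so each server writes to its own chunk.
const compoundCounts = writesPerChunk(
  writes.map((w) => [w.server, w.application, w.time]),
  [["b", "", 0], ["c", "", 0], ["d", "", 0]]
);

console.log(timeCounts);     // [ 0, 0, 0, 40 ]
console.log(compoundCounts); // [ 10, 10, 10, 10 ]
```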

Note that the leading part of an ObjectId is generated from the current time, so using an ObjectId as the shard key is equivalent to using the time value directly.
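
As a small illustration of why an ObjectId behaves like a timestamp, the sketch below reads the creation time out of the ObjectId string used in this article's examples. It relies on the ObjectId layout in which the first 4 bytes are seconds since the Unix epoch. The function name `objectIdSeconds` is hypothetical.

```javascript
// The first 4 bytes (8 hex characters) of an ObjectId are a big-endian
// count of seconds since the Unix epoch.
function objectIdSeconds(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected a 24-character hex ObjectId string");
  }
  return parseInt(hex.slice(0, 8), 16);
}

const seconds = objectIdSeconds("4d084f78a4c8707815a601d7");
const created = new Date(seconds * 1000).toISOString();
console.log(seconds); // 1292390264
console.log(created); // 2010-12-15T05:17:44.000Z
```

Because newer ObjectIds carry later timestamps in their leading bytes, they sort after older ones, which is what sends new documents to the newest chunk.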

 

Query isolation

Another consideration is how many shards any given query has to be sent to. Ideally, mongos routes a query directly to the mongod that holds the expected data. If you know which conditions most of your queries use, using those fields as the shard key can improve efficiency.

A query still works even if its conditions do not include the shard key. However, because mongos does not know which shard holds the expected data, it has to send the request to all shards, which increases response time, network traffic, and server load.
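
The routing decision described above can be sketched roughly as follows. This is a simplified model in plain JavaScript that only handles equality conditions on a single-field shard key. The chunk table, shard names, and the function `shardsForQuery` are hypothetical and do not reflect the actual mongos internals.

```javascript
// Hypothetical chunk table for shard key {server: 1}: each entry says
// which shard owns the half-open key range [min, max).
const chunkTable = [
  { min: "", max: "h", shard: "shard0" },
  { min: "h", max: "p", shard: "shard1" },
  { min: "p", max: "\uffff", shard: "shard2" },
];

// Return the shards a query must be sent to. If the query fixes the shard
// key field to a value, only the owning shard is needed; otherwise every
// shard has to be asked.
function shardsForQuery(query, shardKeyField, chunks) {
  const allShards = [...new Set(chunks.map((c) => c.shard))];
  if (!(shardKeyField in query)) return allShards;
  const value = query[shardKeyField];
  return chunks
    .filter((c) => c.min <= value && value < c.max)
    .map((c) => c.shard);
}

const targeted = shardsForQuery({ server: "ny153.example.com" }, "server", chunkTable);
const broadcast = shardsForQuery({ level: "ERROR" }, "server", chunkTable);
console.log(targeted);  // [ 'shard1' ]
console.log(broadcast); // [ 'shard0', 'shard1', 'shard2' ]
```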

 

Sort

If a query includes a sort, the request is distributed to the shards in the same way as a query without a sort. Each shard executes the query and sorts the results locally (using an index if one is available). mongos then merges the sorted results returned by the shards and returns the merged data to the client. In this way, mongos needs to do only a small amount of work and uses only a small amount of RAM.
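
The merge step is cheap because each shard's results arrive already sorted, so only the head of each list needs to be compared. A minimal sketch of such a merge in plain JavaScript is shown below. The function `mergeSorted` and the sample data are hypothetical, and the real mongos streams results rather than holding complete arrays.

```javascript
// Merge result lists that are each already sorted by `field` (ascending).
// Only the current head of each list is compared at every step.
function mergeSorted(shardResults, field) {
  const positions = shardResults.map(() => 0);
  const merged = [];
  for (;;) {
    let best = -1;
    for (let s = 0; s < shardResults.length; s++) {
      const pos = positions[s];
      if (pos >= shardResults[s].length) continue; // this list is used up
      if (best === -1 ||
          shardResults[s][pos][field] < shardResults[best][positions[best]][field]) {
        best = s;
      }
    }
    if (best === -1) return merged; // every list is exhausted
    merged.push(shardResults[best][positions[best]++]);
  }
}

const fromShards = [
  [{ time: 1 }, { time: 4 }, { time: 9 }],
  [{ time: 2 }, { time: 3 }],
  [{ time: 5 }],
];
const mergedTimes = mergeSorted(fromShards, "time").map((d) => d.time);
console.log(mergedTimes); // [ 1, 2, 3, 4, 5, 9 ]
```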

Reliability

An important aspect of sharding is how much of the system is affected if an entire shard becomes unreachable (even though a reliable replica set is used).

For example, if you have a system similar to Twitter, a comment record looks something like this:

  {
    _id: ObjectId("4d084f78a4c8707815a601d7"),
    user_id: 42,
    time: "2011-01-02T21:21:56.249Z",
    comment: "I am happily using MongoDB"
  }

Because the system is very write-heavy, you want to distribute write operations across all servers, so you need to use "_id" or "user_id" as the shard key. "_id" gives better granularity and spreads writes more widely. However, if a shard goes down, almost all users are affected (each loses access to some of their data). If you use "user_id" as the shard key, only a fraction of users are affected (20% in a cluster of five shards), although those users can no longer see any of their data.
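
The trade-off can be illustrated with a toy model. The sketch below places documents by a simple modulo on the key, which is an assumption made to keep the example short (real chunks are range-based), and uses sequential numbers in place of ObjectIds. The function `usersAffected` and the data set are hypothetical.

```javascript
// Simplification: documents are placed by a modulo on the key instead of
// by range-based chunks, which is enough to show how far an outage reaches.
function usersAffected(comments, keyField, shardCount, downShard) {
  const affected = new Set();
  for (const c of comments) {
    if (c[keyField] % shardCount === downShard) affected.add(c.user_id);
  }
  return affected.size;
}

// Illustrative data: 100 users with 10 comments each; sequential numeric
// ids stand in for ObjectIds.
const comments = [];
let nextId = 0;
for (let user = 0; user < 100; user++) {
  for (let n = 0; n < 10; n++) {
    comments.push({ _id: nextId++, user_id: user });
  }
}

// Shard 0 of 5 is unreachable.
const hitById = usersAffected(comments, "_id", 5, 0);
const hitByUser = usersAffected(comments, "user_id", 5, 0);
console.log(hitById);   // 100: every user has some comments on the lost shard
console.log(hitByUser); // 20: one fifth of users lose all of their comments
```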

 

Index optimization

As described in the earlier chapter on indexes, performance is often better when reads and updates touch only a portion of an index, because the "active" portion can stay in RAM most of the time. The shard keys described above distribute write operations across shards, but on each mongod the writes are still spread across the whole index. As an alternative, using some form of the timestamp as a prefix of the shard key can help, because it reduces the size of the frequently accessed part of the index.

For example, if you have an image storage system, an image record looks something like this:

  {
    _id: ObjectId("4d084f78a4c8707815a601d7"),
    user_id: 42,
    title: "sunset at the beach",
    upload_time: "2011-01-02T21:21:56.249Z",
    data: ...,
  }

You can replace the default _id with a custom _id that contains the month of the upload time followed by a unique identifier (such as an ObjectId or the MD5 of the data). The new record looks something like this:

  {
    _id: "2011-01_4d084f78a4c8707815a601d7",
    user_id: 42,
    title: "sunset at the beach",
    upload_time: "2011-01-02T21:21:56.249Z",
    data: ...,
  }

This value serves as both the shard key and the _id used to access a document. It distributes write operations well across all shards, and it reduces the size of the index that most queries need to access.

Further notes:

  • At the beginning of each month, only one shard is accessed until the balancer starts to split chunks. To avoid this potential performance drop and the resulting data migration, we recommend adding a range value in front of the time (for example, a range of 5, or larger, if you have five servers).
  • Further, you can include the user id in the image id so that all documents of the same user are stored on the same shard. For example: "2011-01_42_4d084f78a4c8707815a601d7"
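
A custom id of this kind could be assembled along the following lines. This is a sketch under assumptions: the underscore-separated layout (month, user id, unique id), the function names `buildImageId` and `buildBucketedImageId`, and the choice to derive the range prefix from the user id are all hypothetical illustrations, not something prescribed by MongoDB.

```javascript
// Build a custom _id of the form "<YYYY-MM>_<userId>_<uniqueId>".
// The underscore-separated layout is an assumption for illustration.
function buildImageId(uploadTime, userId, uniqueId) {
  const month = new Date(uploadTime).toISOString().slice(0, 7); // "YYYY-MM" in UTC
  return `${month}_${userId}_${uniqueId}`;
}

// Optional variant for the first note: prefix a small bucket number so a
// new month does not start on a single shard. Deriving the bucket from the
// user id is an assumption; it keeps one user's images in one bucket.
function buildBucketedImageId(uploadTime, userId, uniqueId, bucketCount) {
  const bucket = userId % bucketCount;
  return `${bucket}_${buildImageId(uploadTime, userId, uniqueId)}`;
}

const imageId = buildImageId(
  "2011-01-02T21:21:56.249Z",
  42,
  "4d084f78a4c8707815a601d7"
);
const bucketedId = buildBucketedImageId(
  "2011-01-02T21:21:56.249Z",
  42,
  "4d084f78a4c8707815a601d7",
  5
);
console.log(imageId);    // 2011-01_42_4d084f78a4c8707815a601d7
console.log(bucketedId); // 2_2011-01_42_4d084f78a4c8707815a601d7
```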

 

GridFS

Depending on your needs, there are several ways to shard GridFS. One approach based on the existing indexes is as follows:

  • The "files" collection is not sharded. All file records are stored on one shard. We strongly recommend making that shard highly reliable (using a replica set with at least three nodes).
  • The "chunks" collection should be sharded on the index "files_id: 1". The existing "files_id, n" index created by the driver cannot be used for sharding on "files_id" (this is a limitation that will be fixed in the future), so you need to create a separate index on "files_id". Sharding on "files_id" ensures that all chunks of a file are stored on the same shard, so that the "filemd5" command can work. Run the commands:
    > db.fs.chunks.ensureIndex({files_id: 1});
    > db.runCommand({shardcollection: "test.fs.chunks", key: {files_id: 1}})
    {"collectionsharded": "test.fs.chunks", "ok": 1}

By default, files_id is an ObjectId. It is therefore monotonically increasing, and all new GridFS chunks are sent to the same shard. If your write load is too heavy for a single server to handle, consider using a different shard key or a different value for the file's _id.
