Shard key selection in MongoDB

Source: Internet
Author: User
Tags: mongodb driver

To partition a collection stored in MongoDB, you need to choose a shard key. The choice of shard key directly determines whether data is distributed evenly across the cluster and whether the cluster performs well. So which fields should serve as the shard key? The considerations below can help.

The following log document is used as an example:

{
  server: "ny153.example.com",
  application: "Apache",
  time: "2011-01-02T21:21:56.249Z",
  level: "error",
  msg: "something is broken"
}

Cardinality

All data in a sharded collection is stored in chunks. Each chunk holds the documents whose shard key values fall within one contiguous range. Choosing a shard key with enough distinct values matters: if too many documents share the same key value, a chunk that has grown too large cannot be split.

Take the log document above as an example. If {server: 1} is chosen as the shard key, all logs from one server fall into the same chunk, and it is easy to imagine the logs of a single server exceeding 200 MB (the default chunk size when this was written; the default differs between MongoDB versions). If the shard key is {server: 1, time: 1} instead, the logs of a single server can be split down to the millisecond, so in practice no chunk becomes unsplittable.

Keeping chunks at a reasonable size is important: only then can data be distributed evenly, and only then does moving a chunk between shards stay cheap.
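The splitting problem can be illustrated with a toy model. This is only a sketch, not MongoDB's actual split logic: a chunk is modelled as a sorted list of shard key values, and a split is only possible at a boundary between two different values. The function name findSplitPoint and the "server|time" string encoding of the compound key are assumptions made for this example.

```javascript
// Toy model: a chunk is a sorted array of shard key values (as strings).
// A split point must separate two different key values, so a chunk that
// holds one repeated value has no valid split point.
function findSplitPoint(sortedKeys) {
  const mid = Math.floor(sortedKeys.length / 2);
  // Search outward from the middle for a position where the value changes.
  for (let offset = 0; offset < sortedKeys.length; offset++) {
    for (const i of [mid + offset, mid - offset]) {
      if (i > 0 && i < sortedKeys.length && sortedKeys[i] !== sortedKeys[i - 1]) {
        return sortedKeys[i]; // first key of the new right-hand chunk
      }
    }
  }
  return null; // every document has the same key: the chunk cannot be split
}

// Shard key {server: 1}: every log line from one server has the same key.
const byServer = Array(6).fill("ny153.example.com");

// Shard key {server: 1, time: 1}: the timestamp makes each key distinct.
const byServerAndTime = [0, 1, 2, 3, 4, 5].map(
  (ms) => `ny153.example.com|2011-01-02T21:21:56.24${ms}Z`
);

console.log(findSplitPoint(byServer));        // null, no split possible
console.log(findSplitPoint(byServerAndTime)); // a key near the middle of the chunk
```

With the single-field key the function finds nothing to split on, however many documents the chunk holds; with the compound key any position is a valid boundary.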

Scalable write operations

One of the main reasons to use sharding is to distribute writes. To achieve this, writes should be spread across as many chunks as possible.

With the log example above, choosing {time: 1} as the shard key sends every write to the newest chunk, creating a hotspot. If {server: 1, application: 1, time: 1} is chosen instead, the logs of each server/application pair are written to a different place. For example, with 100 server/application pairs and 10 shards, each shard takes roughly 1/10 of the writes.
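The hotspot effect can be reproduced with a small simulation. This is a sketch under simplifying assumptions: keys are compared as strings, the split points are invented, and the helper names chunkIndexFor and countWritesPerChunk are hypothetical rather than part of any MongoDB API.

```javascript
// Toy router: chunk i covers keys in [splitPoints[i - 1], splitPoints[i]).
function chunkIndexFor(key, splitPoints) {
  let i = 0;
  while (i < splitPoints.length && key >= splitPoints[i]) i++;
  return i;
}

function countWritesPerChunk(keys, splitPoints) {
  const counts = new Array(splitPoints.length + 1).fill(0);
  for (const key of keys) counts[chunkIndexFor(key, splitPoints)]++;
  return counts;
}

// Hypothetical incoming log lines: 4 servers, all writing at about the same time.
const servers = ["ny151", "ny152", "ny153", "ny154"];
const writes = [];
for (let n = 0; n < 100; n++) {
  writes.push({
    server: servers[n % 4],
    application: "Apache",
    time: `2011-01-02T21:21:56.${String(n).padStart(3, "0")}Z`,
  });
}

// Shard key {time: 1}: the existing chunks were split on older timestamps.
const timeSplits = ["2010-07-01", "2010-10-01", "2011-01-01"];
const byTime = countWritesPerChunk(writes.map((w) => w.time), timeSplits);

// Shard key {server: 1, application: 1, time: 1}: chunks split per server.
const compoundSplits = ["ny152", "ny153", "ny154"];
const byCompound = countWritesPerChunk(
  writes.map((w) => `${w.server}|${w.application}|${w.time}`),
  compoundSplits
);

console.log(byTime);     // counts 0, 0, 0, 100: every write hits the newest chunk
console.log(byCompound); // counts 25, 25, 25, 25: writes are spread evenly
```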

Query isolation

The other thing to consider is how many shards must take part in serving a query. Ideally, mongos routes a query directly to a single mongod, and that mongod holds all the data the query needs. So if you know that the most common queries all filter on server, putting server first in the shard key makes the whole cluster more efficient.

Any query can be executed no matter what the shard key is. However, if mongos cannot tell which shard holds the requested data, it sends the query to every shard, then gathers the results and returns them. This increases response time, network traffic, and load on shards that hold no matching data.
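The difference between a targeted query and a broadcast one can be sketched as follows. The chunk ranges, the shard names, and the function shardsForQuery are hypothetical, and the model only handles an equality match on the leading shard key field; real mongos routing covers many more cases.

```javascript
// Toy model of mongos routing. Each chunk owns a range of the leading
// shard key field and lives on one shard.
const chunks = [
  { min: "", max: "ny150", shard: "shard0" },
  { min: "ny150", max: "ny160", shard: "shard1" },
  { min: "ny160", max: "\uffff", shard: "shard2" },
];

// Returns the shards that must run the query.
// leadingField is the first field of the shard key, e.g. "server".
function shardsForQuery(query, leadingField) {
  const value = query[leadingField];
  if (value === undefined) {
    // The query does not mention the leading field: ask every shard.
    return [...new Set(chunks.map((c) => c.shard))];
  }
  // Equality on the leading field: only the owning chunk's shard is needed.
  return chunks
    .filter((c) => value >= c.min && value < c.max)
    .map((c) => c.shard);
}

const targeted = shardsForQuery({ server: "ny153.example.com", level: "error" }, "server");
const broadcast = shardsForQuery({ level: "error" }, "server");

console.log(targeted);  // only shard1
console.log(broadcast); // shard0, shard1 and shard2
```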

Sort

When a query calls sort() on the leftmost field of the shard key, mongos can query the minimum number of shards, following the order of the shard key, and return the results to the caller. This costs the least time and resources.

Conversely, if sort() uses a field that is not the leftmost part of the shard key, mongos has to send the query to every shard in parallel and merge the results returned by each shard before returning them to the requester. This puts extra load on mongos.
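The extra merge step can be pictured with a short sketch. It assumes each shard returns its own results already sorted by the sort field; mergeSorted is a hypothetical helper for illustration, not the code mongos runs.

```javascript
// Merge per-shard result lists, each already sorted by `field`,
// into one globally sorted list.
function mergeSorted(perShardResults, field) {
  const positions = perShardResults.map(() => 0);
  const merged = [];
  for (;;) {
    let best = -1;
    for (let s = 0; s < perShardResults.length; s++) {
      const doc = perShardResults[s][positions[s]];
      if (doc === undefined) continue; // this shard's results are exhausted
      if (best === -1 || doc[field] < perShardResults[best][positions[best]][field]) {
        best = s;
      }
    }
    if (best === -1) return merged; // all shards exhausted
    merged.push(perShardResults[best][positions[best]++]);
  }
}

// Shard key starts with server, but the query sorts by time,
// so both shards hold documents from the requested time range.
const shardA = [
  { server: "ny151", time: "2011-01-02T21:21:56.100Z" },
  { server: "ny151", time: "2011-01-02T21:21:56.400Z" },
];
const shardB = [
  { server: "ny153", time: "2011-01-02T21:21:56.200Z" },
  { server: "ny153", time: "2011-01-02T21:21:56.300Z" },
];

const sortedByTime = mergeSorted([shardA, shardB], "time");
console.log(sortedByTime.map((d) => d.time));
```

Every shard contributes to every page of results here, which is the extra work the paragraph above describes.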

Reliability

An important consideration when choosing a shard key is how much of the system is affected if an entire shard becomes unavailable (which can happen even when each shard is a seemingly reliable replica set).

Assume a system similar to Twitter, where a comment record looks like this:

{
  _id: ObjectId("4d084f78a4c8707815a601d7"),
  user_id: 42,
  time: "2011-01-02T21:21:56.249Z",
  comment: "I am happily using MongoDB"
}

Because this system is very write-heavy, you need to spread writes across all servers. In that case, either _id or user_id could serve as the shard key. Using _id spreads writes at the finest granularity, but when a shard goes down almost every user is affected, because each of them loses part of their data. If user_id is the shard key, only a fraction of users are affected (about 20% with five shards), but those users can no longer see any of their data.
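The 20% figure can be checked with a small simulation. It is a sketch with invented numbers (100 users, 20 comments each) and a stand-in placement rule: a document goes to shard key % shardCount, whereas a real cluster assigns ranges of keys to shards. The helper affectedUsers is hypothetical.

```javascript
const shardCount = 5;
const users = 100;
const commentsPerUser = 20;

// Counts the users who have at least one comment on the shard that is down.
// shardOf maps a document to a shard number (toy placement rule).
function affectedUsers(shardOf, downShard) {
  const affected = new Set();
  let commentId = 0;
  for (let userId = 0; userId < users; userId++) {
    for (let c = 0; c < commentsPerUser; c++, commentId++) {
      if (shardOf({ _id: commentId, user_id: userId }) === downShard) {
        affected.add(userId);
      }
    }
  }
  return affected.size;
}

// Shard key _id: each user's comments are scattered over all shards.
const byId = affectedUsers((doc) => doc._id % shardCount, 0);

// Shard key user_id: each user's comments live on exactly one shard.
const byUserId = affectedUsers((doc) => doc.user_id % shardCount, 0);

console.log(byId);     // 100: every user loses some comments
console.log(byUserId); // 20: one user in five loses all comments
```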

Index optimization

As mentioned in other chapters, performance is usually better when only part of an index is read or updated, because that "active" part can stay in memory most of the time. The shard key choices described so far aim at distributing data evenly across the cluster, which also spreads activity evenly across the index of each mongod. By contrast, putting a timestamp at the start of the shard key helps confine the commonly used index entries to a smaller portion of the index.

Assume there is an image storage system. The image record format is similar to the following:

{
  _id: ObjectId("4d084f78a4c8707815a601d7"),
  user_id: 42,
  title: "sunset at the beach",
  upload_time: "2011-01-02T21:21:56.249Z",
  data: ...,
}

Alternatively, you can construct a custom ID that combines a time prefix taken from the image's upload time (such as the month) with a unique identifier (ObjectId, MD5, etc.). The record then looks like the following:

{
  _id: "2011-01-02_4d084f78a4c8707815a601d7",
  user_id: 42,
  title: "sunset at the beach",
  upload_time: "2011-01-02T21:21:56.249Z",
  data: ...,
}

This custom ID is used as the shard key and is also the ID used to access the document. As a result, data is still distributed evenly across nodes, while most queries touch only a small portion of the index.

Going further:

At the start of each month, only one server receives the new data at first. As the volume grows, the balancer starts splitting and migrating chunks. To avoid this potential inefficiency and the migrations, it is wise to create the ranges in advance (for example, five ranges if there are five servers).

You can refine this further by including user_id in the image ID, so that all of a user's documents for a given period are stored together. For example, use "2011-01-02_42_4d084f78a4c8707815a601d7" as the image ID.
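Building such IDs could look like the sketch below. The function name buildImageId is hypothetical, and the prefix length is an assumption: the example IDs in this article carry a full date, so the sketch takes the first 10 characters of the timestamp, while a month-only prefix would take the first 7.

```javascript
// Builds "<time prefix>_<unique id>" or, when a user ID is given,
// "<time prefix>_<user id>_<unique id>".
function buildImageId(uploadTime, uniqueId, userId) {
  // "2011-01-02T21:21:56.249Z" -> "2011-01-02"; use slice(0, 7) for "2011-01".
  const prefix = uploadTime.slice(0, 10);
  const parts = userId === undefined
    ? [prefix, uniqueId]
    : [prefix, String(userId), uniqueId];
  return parts.join("_");
}

const uploadTime = "2011-01-02T21:21:56.249Z";
const uniqueId = "4d084f78a4c8707815a601d7";

console.log(buildImageId(uploadTime, uniqueId));
// 2011-01-02_4d084f78a4c8707815a601d7
console.log(buildImageId(uploadTime, uniqueId, 42));
// 2011-01-02_42_4d084f78a4c8707815a601d7
```

Because the prefix sorts chronologically as a string, IDs for the same period end up next to each other in the index, and with the user ID included, a user's images within that period are adjacent as well.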

GridFS

GridFS can be sharded in several ways, depending on your needs. One common approach, based on the pre-existing indexes, is:

1) Do not shard the "files" collection, so that all file records live on one shard. It is recommended to make that shard highly resilient (a replica set of at least three nodes).

2) Shard the "chunks" collection on the index "files_id: 1". The "files_id, n" index created by the MongoDB drivers cannot be used as the shard key (a sharding limitation that, at the time of writing, was expected to be fixed later), so you have to create a separate index on "files_id". The reason for using "files_id" as the shard key is that all chunks of a given file then live on the same shard, which is safer and allows the "filemd5" command (required by certain drivers) to run.

Run the following commands:

> db.fs.chunks.ensureIndex({files_id: 1});

> db.runCommand({shardcollection: "test.fs.chunks", key: {files_id: 1}})

{"collectionsharded": "test.fs.chunks", "ok": 1}

Because files_id is an ObjectId by default, its values increase monotonically. As a result, all new GridFS chunks are written to a single shard. If the write load is too high for one shard, you need to shard on a different key or use a different kind of value for _id (and therefore files_id).

The factors to consider when choosing a shard key pull against each other to some degree, and no key satisfies all of them. In practice you have to weigh them against your own requirements and accept some trade-offs. There is no universal sharding scheme; the requirements decide.
