Selecting a Shard Key in MongoDB


Reprinted from http://www.cnblogs.com/spnt/

Once the entire MongoDB sharded architecture has been deployed, the real test of the architect begins: how to select the shard key.

An inappropriate shard key can bring your entire application down as traffic grows. A good shard key, by contrast, creates a healthy ecosystem: you add or remove servers as needed, and MongoDB keeps the system running correctly.

Let's first look at some inappropriate shard keys.

1. Low-cardinality shard keys

Suppose we have an application that stores user information. Each document has a continent field recording the user's region, with possible values Africa, Antarctica, Asia, Australia, Europe, North America, and South America. Because we have a data center on every continent and want to serve each user's data from the nearest one, we decide to shard on this field.

A collection starts as a single initial chunk (-∞, ∞) on one shard in one data center. All inserts and reads hit this chunk. Once it grows large enough, it is split into two chunks, (-∞, Europe) and [Europe, ∞): all documents from Africa, Antarctica, Asia, and Australia go to the first chunk, and the rest go to the second. As more documents are added, the collection eventually ends up with seven chunks, as shown below:

(-∞, Antarctica)

[Antarctica, Asia)

[Asia, Australia)

[Australia, Europe)

[Europe, North America)

[North America, South America)

[South America, ∞)

What then?

MongoDB cannot split these chunks any further, yet they keep growing. There is no problem for now, but when a server's disk fills up there is nothing you can do except buy a bigger disk, and unfortunately disks have limits too.

Because the number of distinct key values is limited, this kind of key is called a low-cardinality shard key. If you choose one, you will eventually end up with a pile of huge chunks that can be neither moved nor split. You can imagine what maintenance and scaling feel like at that point.

If you need to query heavily on a low-cardinality field, use a compound shard key instead: a shard key spanning two fields, where the second field has many distinct values that MongoDB can use for splitting. For example, since most queries involve time, the time field makes a good second field to spread the load.
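As a rough illustration (plain Python with made-up documents, not real MongoDB API calls), here is why adding a second field restores cardinality:

```python
# Sketch: why a compound key restores cardinality (illustrative data only).
docs = [
    {"continent": "Asia", "ts": 1}, {"continent": "Asia", "ts": 2},
    {"continent": "Europe", "ts": 1}, {"continent": "Europe", "ts": 3},
]

# A single low-cardinality field yields very few distinct shard-key values,
# so MongoDB quickly runs out of split points.
single = {d["continent"] for d in docs}

# Adding a second, high-cardinality field multiplies the number of distinct
# key values, giving MongoDB room to keep splitting chunks.
compound = {(d["continent"], d["ts"]) for d in docs}

print(len(single), len(compound))
```

With only four documents the compound key already has twice as many distinct values as the single field; with real timestamps the multiplier is effectively unbounded.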

2. Ascending shard keys

Reading data from RAM is faster than reading it from disk, so the goal is to keep as much of the working set in memory as possible, and if certain data is always accessed together, we want to keep it together. For most applications, new data is accessed most frequently, so fields such as timestamps or ObjectIds are often tempting shard keys. However, this does not work as well as expected.

For example, suppose we have a microblogging service similar to Weibo. Each document contains a message, a sender, and a send time, and we shard on the time field. Let's see how MongoDB behaves.

At first everything goes into one big chunk (-∞, ∞), which then splits, say into (-∞, 1294516901) and [1294516901, ∞) (using timestamps). Because chunks are split at the midpoint of the shard-key range, from the moment of the split all new data is inserted into the second chunk and none into the first. Once the second chunk fills up, it splits into two, but again only the last chunk receives inserts. This pattern continues, producing a single, concentrated hot spot, and the pressure on that shard at peak hours is obvious.
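The hot-spot effect can be sketched in a few lines of Python (the split points and timestamps are invented; this models only the chunk-routing logic, not MongoDB itself):

```python
import bisect

# Sketch: with an ever-increasing shard key, every insert lands in the
# last (max-key) chunk. Chunk boundaries are the lower bounds of ranges.
split_points = [float("-inf"), 1294516901, 1294520000]  # hypothetical splits

def chunk_for(key):
    # A key belongs to the chunk whose lower bound is the greatest one <= key.
    return bisect.bisect_right(split_points, key) - 1

# Inserts carry "now" timestamps, which only ever grow.
new_keys = [1294530000 + i for i in range(1000)]
hits = [chunk_for(k) for k in new_keys]

# Every one of the 1000 writes hits the final chunk: a single hot shard.
print(set(hits))
```

However many times the last chunk splits, the same routing rule sends all subsequent inserts to the new last chunk, so the hot spot never disperses.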

3. Random shard keys

Sometimes, to avoid hot spots, people shard on a random field. Such a key works fine at first, but as the data grows it becomes slower and slower.

For example, suppose we store photo thumbnails in a sharded collection. Each document contains the photo's binary data, the MD5 hash of that data, and a description, and we decide to shard on the MD5 hash.

As the collection grows, we end up with chunks evenly distributed across the shards, which looks fine. Now suppose the system is very busy: a chunk on shard 2 fills up and splits, and the config server notices that shard 2 has 10 more chunks than shard 1 and decides to even things out. MongoDB must then load 5 chunks' worth of data into memory and send it to shard 1. Because the documents are in random shard-key order, they are generally not already in memory, so MongoDB puts more pressure on RAM and triggers a large amount of disk I/O.
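A small Python sketch (hashing integers as stand-ins for photo data, with the first hex digit as a crude stand-in for 16 chunk ranges) shows why a migrated chunk has no locality:

```python
import hashlib

# Sketch: MD5-based keys scatter uniformly, so any one chunk holds
# documents inserted at widely scattered times -- poor locality when
# a chunk has to be migrated.
keys = [hashlib.md5(str(i).encode()).hexdigest() for i in range(10_000)]

# Bucket keys by first hex digit (a crude stand-in for 16 chunks).
buckets = {}
for insert_order, key in enumerate(keys):
    buckets.setdefault(key[0], []).append(insert_order)

# Each "chunk" spans nearly the whole insertion history, so migrating
# one means random reads all over the data files, not a sequential scan.
spreads = [max(v) - min(v) for v in buckets.values()]
print(min(spreads))
```

Even the tightest bucket covers thousands of positions in the insertion order, which is exactly why moving such a chunk churns the cache and the disk.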

In addition, the shard key must be indexed. If you choose a random key that your queries never use, you are effectively wasting an index, and since every extra index slows down writes, you want to keep the number of indexes down.

So what kind of shard key is a good one?

From the analysis above, a good shard key should provide good data locality, but not so much locality that it creates hot spots.

1. Coarsely ascending key plus search key

Many applications access new data more frequently, so we want data to be roughly ordered by time, yet still evenly distributed. That way the data being read and written stays in memory, while the load is spread evenly across the cluster.

We can achieve this with a compound shard key such as {coarselyAscending: 1, search: 1}, where each value of the coarsely ascending field corresponds to a few dozen to a few hundred chunks, and the search field is one the application commonly queries on.

For example, suppose an analytics application regularly lets users access the past month's data, and we want that data to stay cheap to access. We could use {month: 1, user: 1}. Let's walk through how it runs.

It starts with one big chunk ((-∞, -∞), (∞, ∞)). When that fills up, MongoDB splits it into two, for example:

((-∞, -∞), ("2012-07", "Susan"))

[("2012-07", "Susan"), (∞, ∞))

Assuming it is still July, writes are distributed evenly between the two chunks: every document with a username below Susan goes to the first chunk, and every document with a username above Susan goes to the second. The whole ecosystem stays healthy. When August arrives, MongoDB starts creating new chunks, and the write load remains balanced (not perfectly balanced at every instant; there is a smoothing process). Once a month's data is no longer being accessed, it gradually drops out of memory and stops consuming resources.
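A toy Python model of the {month: 1, user: 1} key makes the routing concrete (the empty string stands in for MinKey, "9999-99" for MaxKey on the month field, and the names and boundaries are invented):

```python
# Toy model of compound-key chunk routing for {month: 1, user: 1}.
# Tuple comparison mirrors compound shard-key ordering.
chunks = [
    (("", ""), ("2012-07", "Susan")),        # chunk 0
    (("2012-07", "Susan"), ("9999-99", "")),  # chunk 1
]

def chunk_for(key):
    # Route a (month, user) key to the chunk whose [lo, hi) range holds it.
    for i, (lo, hi) in enumerate(chunks):
        if lo <= key < hi:
            return i
    raise ValueError("no chunk covers key")

# Within July, writes spread across both chunks by username:
assert chunk_for(("2012-07", "Alice")) == 0
assert chunk_for(("2012-07", "Tom")) == 1
# When August begins, new writes land in the last chunk, which will soon
# split again on username, keeping the distribution balanced.
assert chunk_for(("2012-08", "Alice")) == 1
```

The month field gives the rough time ordering; the user field supplies the split points that keep any single month's writes from piling onto one chunk.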

 


Of course, every application and scenario is different, and the above are just the basics; the actual choice of shard key must fit your own program. Consider the following questions when choosing one:

1. What do the write operations look like? How large are they?

2. How much data does the system write per hour? Per day? At peak?

3. Which fields are random, and which are ascending?

4. What do the read operations look like? Which data do users access?

5. Is the data indexed? Should it be?

6. What is the total data volume?

In general, you need to understand your data before performing sharding.

 

The content above draws on the book "Deep Learning MongoDB". To learn more about MongoDB optimization, the book is well worth reading; it also lists 50 tips for MongoDB optimization, maintenance, and more.
