1. Selecting the Shard Key
Choosing a good shard key is critical. A bad shard key can ruin your application immediately, or once traffic grows large, or it can lurk quietly and then suddenly ruin your application later.
On the other hand, if you choose a good shard key, MongoDB will keep operating correctly for as long as the application runs, provided you add servers promptly when you notice traffic increasing.
As we learned earlier, the shard key determines how data is distributed across the cluster, so you want a shard key that both spreads reads and writes across the shards and keeps data that is used together on the same shard. These seemingly contradictory goals are often achievable in practice.
We will first pick apart a few examples of bad shard keys, then work through a few better ones. The MongoDB wiki also has a very good page on shard key selection that is worth reading.
1.1 Low-Cardinality Shard Keys
Some people don't really understand or trust how MongoDB automatically distributes data, so they reason along these lines: "I have 4 shards, so I will use a field with 4 possible values as my shard key." This is a very bad idea.
Suppose we have an application that stores user information. Each document has a continent field representing the user's region, whose value can be "Africa", "Antarctica", "Asia", "Australia", "Europe", "North America", or "South America". Since we have a data center on every continent (well, perhaps not Antarctica) and want to serve each user's data from the data center where that user lives, we decide to shard on this field.
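To make the setup concrete, here is a minimal mongo-shell sketch of that decision; the database and collection names (geo.users) and the sample document are assumptions for illustration, not from the original text:

    // Enable sharding on the (hypothetical) database, then shard on continent.
    sh.enableSharding("geo")
    sh.shardCollection("geo.users", { continent: 1 })
    // A typical user document:
    db.users.insert({ name: "jane", continent: "Europe" })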
The collection starts as a single initial chunk covering the range (-∞, +∞), and all inserts and reads fall on this one chunk. Once it grows large enough, it is split into two chunks with the ranges (-∞, "Europe") and ["Europe", +∞). All documents from Africa, Antarctica, Asia, and Australia go into the first chunk, and all documents from Europe, North America, and South America go into the second. As more documents are added, the collection eventually ends up as 7 chunks, as shown here:
(Figure: one chunk per continent: (-∞, "Antarctica"), ["Antarctica", "Asia"), ["Asia", "Australia"), ["Australia", "Europe"), ["Europe", "North America"), ["North America", "South America"), ["South America", +∞))
And then what?
MongoDB cannot split these chunks any further; they can only grow larger and larger. That is fine for a while, but when a server's disk space starts to run out, there is nothing you can do about it other than buy a bigger disk.
Because such a shard key has only a limited number of possible values, it is called a low-cardinality shard key. If you choose a key with very low cardinality, you will end up with a handful of enormous chunks that cannot be split and cannot be moved, and they will make you thoroughly miserable.
If you are doing this in order to allocate data manually, don't use MongoDB's built-in sharding mechanism; it will fight you to the end. You can, of course, still shard a collection by hand: write your own router and direct reads and writes to whichever server you like. It is just easier to choose a good shard key and let MongoDB do it for you.
This rule applies to any key with a limited number of values. Remember: if a key has n values in a collection, there can be only n chunks, and therefore only n shards. If you are tempted to use a low-cardinality shard key because you run many queries on that field, use a compound shard key (a shard key containing two fields) instead, and make sure the second field has a great many distinct values that MongoDB can use for splitting.
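Under the same hypothetical names as above, a sketch of what that compound key would look like had we chosen it from the start; pairing continent with a high-cardinality second field such as the user name keeps the chunks splittable:

    // Compound shard key: continent groups the data,
    // name gives MongoDB many split points within each continent.
    sh.shardCollection("geo.users", { continent: 1, name: 1 })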
If a collection has a lifecycle (for example, a new collection is created every week, and you know that a week's worth of data comes nowhere near the maximum capacity of any one shard), such a shard key can be an acceptable choice for it.
This example is not only about choosing a low-cardinality shard key; it is also an attempt to bolt data-center awareness onto MongoDB's sharding mechanism. As of this writing, sharding does not support data-center awareness; if you are interested, you can view or vote on the relevant issue.

The real problem here is that this approach does not scale. What if your application suddenly becomes popular in Japan? You would want to add a second shard to handle the Asian traffic. But how would you migrate the data? A chunk that has grown to several gigabytes cannot be migrated, and it cannot be split, because the entire chunk holds a single shard key value. And because a shard key value cannot be updated in place, moving all documents onto a more specific shard key value means deleting each document, changing the value, and re-inserting it, which is not a quick operation on a large database.

The best you could do is start inserting new documents with "Asia,Japan" instead of plain "Asia". That leaves a batch of old documents whose shard key value should be "Asia,Japan" but is still "Asia", so the application logic has to support both cases. Furthermore, once you have these finer-grained chunks, you cannot guarantee that MongoDB will place them where you want (unless you turn off the balancer and handle everything manually).

Data-center awareness is important for large applications, and it has a high priority for the MongoDB developers; until it arrives, a low-cardinality shard key is never a good substitute.
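To illustrate the "support both cases" point, here is a hypothetical query that matches both the old- and new-style shard key values (field and values assumed from the example above):

    // Old documents still carry continent: "Asia"; new ones use "Asia,Japan".
    // Application queries must account for both forms.
    db.users.find({ continent: { $in: ["Asia", "Asia,Japan"] } })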
1.2 Ascending Shard Keys

Reading data from RAM is faster than reading it from disk, so the goal is to have as many accesses as possible hit data that is already in memory; if certain data is always accessed together, we want the shard key to keep it together. For most applications, new data is accessed more often than old data, so people tend to try fields such as timestamps or ObjectIds as the shard key. This does not work the way they expect.

For example, say we have a microblog-like service in which each document contains a short message, a sender, and a send time, and we shard on the send time field, whose value is the number of seconds elapsed since the epoch.

As always, we start with one chunk, (-∞, +∞). All inserts fall on this chunk until it splits into two. The split happens at the midpoint of the chunk's shard key values, so at the moment of the split, any new timestamp is almost certainly greater than that median. This means all inserts fall on the second chunk and none hit the first. Once the second chunk fills up, it splits into two more chunks, say [1294516901, 1294930163) and [1294930163, +∞); but since it is now later than 1294930163, every new insert is added to the [1294930163, +∞) chunk. The pattern continues: all data is always added to the last chunk, which means all data is always added to one shard. This shard key creates a single, undistributable hotspot.
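A minimal mongo-shell sketch of this anti-pattern; the database and collection names (blog.posts) are made up for illustration:

    // Shard a (hypothetical) posts collection on an ascending field.
    sh.enableSharding("blog")
    sh.shardCollection("blog.posts", { sendTime: 1 })
    // Every new post's sendTime exceeds every existing value, so every
    // insert lands in the final chunk [<latest split point>, +∞).
    db.posts.insert({ sender: "jane", message: "hi",
                      sendTime: Math.floor(Date.now() / 1000) })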
This rule applies to anything with an ascending value, not just timestamps. Other examples include ObjectIds, dates, and auto-incrementing primary keys: you will face this problem with any key whose value trends toward infinity.
Basically, this kind of shard key is always a bad idea, because it creates a hotspot. If traffic is light enough that one shard can absorb all reads and writes, it works for now; but if you hit a traffic spike or the application becomes popular, it stops working and is difficult to fix. Unless you know exactly what you are doing, do not use an ascending shard key. Better shard keys exist; this one should be avoided.
1.3 Random Shard Keys

Sometimes, to avoid hotspots, people shard on a field with random values. This kind of shard key works well at first, but as the data grows it becomes slower and slower.

Suppose we store photo thumbnails in a sharded collection. Each document contains the photo's binary data, an MD5 hash of that data, a description, the time it was taken, and the photographer, and we decide to shard on the MD5 hash.

As the collection grows, we end up with chunks evenly distributed across the shards. So far so good. Now suppose we are very busy and a chunk on shard 2 fills up and splits. The config server notices that shard 2 has 10 more chunks than shard 1 and decides to even things out, so MongoDB must load 5 randomly chosen chunks' worth of data and send it to shard 1. Because the data sequence is random, this data will generally not be in memory, so MongoDB puts more pressure on RAM and triggers a great deal of disk IO.

In addition, there must be an index on the shard key, so if you choose a random key you never query on, you are essentially wasting an index; and since every additional index makes writes slower, it is important to keep the number of indexes as low as possible.
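A sketch of the random-key setup (the photos.thumbnails names are assumptions):

    // Shard thumbnails on the MD5 hash of the image bytes.
    sh.enableSharding("photos")
    sh.shardCollection("photos.thumbnails", { md5: 1 })
    // Inserts spread evenly, but any chunk a migration touches is
    // effectively random, so it is unlikely to already be in RAM.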
1.4 Good Shard Keys

What we really need is a scheme that takes the application's access pattern into account. If the application regularly works with 25GB of data, we want all splits and accesses to happen within that 25GB, not across randomly accessed data that constantly forces new data to be paged from disk into memory. So we want a key with good data locality, but not so much locality that it creates hotspots.
- Coarsely ascending key plus search key
Many applications access new data more often than old data, so we want the data to be roughly ordered by time yet still evenly distributed. That way we keep the data we are reading and writing in memory while spreading the load evenly across the cluster. For example, take an analytics application whose users regularly access the past month's data; we want to keep that data cheap to reach. So we shard on {month: 1, user: 1}, where month is a coarse-grained ascending field (it takes a new, larger value each month) and user is the second field, because we often query for a particular user's data.
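A sketch of this compound key; the analytics.events names, the month string format, and the sample document are assumptions for illustration:

    // Coarsely ascending month plus a frequently queried user field.
    sh.enableSharding("analytics")
    sh.shardCollection("analytics.events", { month: 1, user: 1 })
    // Example document, and a query the router can send to few chunks:
    db.events.insert({ month: "2011-01", user: "jane", pageviews: 42 })
    db.events.find({ month: "2011-01", user: "jane" })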
Can the search field also be ascending? No. If it is, the shard key degrades into an ordinary ascending shard key, and you face the same hotspot problem. What should the search field be, then? It should be something the application can actually query on: user information (as in the example above), a filename field, or a GUID. It should be non-ascending, randomly distributed, and of reasonably high cardinality.