In MongoDB (version 3.2.9), distributing data means splitting a collection's data into chunks and spreading those chunks across different shards. There are two main ways to distribute data: balanced distribution based on the number of chunks, and directed distribution based on shard key ranges. MongoDB's built-in balancer splits and moves chunks, automatically keeping the chunks evenly distributed across the shards. Note that the balancer only guarantees that each shard holds roughly the same number of chunks; it does not guarantee that each shard holds roughly the same number of documents.
First, distributing data evenly by chunk count
Balanced distribution is implemented automatically by MongoDB, which makes the sharding layout transparent to the application, simplifies system administration, and makes it easy to add and remove shards in the sharded cluster. It is carried out by MongoDB's built-in balancer, which balances the data according to an indexed field of the collection called the shard key. Shard keys generally fall into three types: ascending shard keys, random shard keys, and group-based shard keys.
A chunk is a group of documents that are contiguous on the indexed shard key field; each chunk covers a certain range of shard key values. The default chunk size is 64MB. Some chunks are very large and contain many documents, but in MongoDB's view they are still just one chunk each, no different from an empty chunk containing no documents at all. Balanced distribution only ensures that each shard holds roughly the same number of chunks, so the choice of shard key directly determines how well the sharding works.
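As a quick sketch of how a collection gets a shard key (the database, collection, and field names below are illustrative, not from the text):
sh.enableSharding("db_name")                         // sharding must be enabled on the database first
use db_name
db.orders.createIndex({ user_id: 1 })                // the shard key must be backed by an index
sh.shardCollection("db_name.orders", { user_id: 1 }) // user_id is now the shard key of orders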
For example: a MongoDB sharded cluster has 3 shards: shard1, shard2, shard3. The minimum shard key value is $minKey and the maximum is $maxKey. The chunk whose range ends at $minKey is the lowest chunk, and the chunk whose range ends at $maxKey is the highest chunk.
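The chunk ranges and their owning shards can be inspected from a mongos in the config database (a sketch; "db_name.foo" is an illustrative namespace matching the example below):
use config
db.chunks.find({ ns: "db_name.foo" }, { min: 1, max: 1, shard: 1 })  // one document per chunk
sh.status() prints the same layout in summary form.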
1, Ascending shard keys
An ascending shard key is a field that grows steadily over time, such as a date field or the _id field. Suppose the shard key is the _id field, the collection foo contains 10 documents, and each shard holds one chunk: chunk1: $minKey-3, chunk2: 4-8, chunk3: 9-$maxKey.
The disadvantage of an ascending shard key is that every newly inserted document goes into the highest chunk, so all write requests are routed to the same shard. The highest chunk keeps growing, is split repeatedly, and its pieces are moved to other shards, so writes are unbalanced and chunk migration adds extra disk writes. The advantage is that range reads on the shard key perform well.
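As a sketch, sharding the foo collection above on its _id field gives an ascending shard key ("db_name" is an illustrative database name):
sh.shardCollection("db_name.foo", { _id: 1 })  // _id grows monotonically, so every insert lands in the highest chunk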
2, Random shard keys
A random shard key is a key whose values do not grow steadily but are irregular. Because writes are distributed randomly, the shards grow at roughly the same rate, which reduces the number of chunk migrations. The disadvantage is that the write location is random, and if a hashed index is used to generate the random values, range queries become slow.
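One way to get random-looking values is a hashed shard key, sketched below on illustrative names; MongoDB hashes the field's value, so consecutive user_id values land in different chunks:
use db_name
db.foo2.createIndex({ user_id: "hashed" })                 // the hashed index supplies the random-looking values
sh.shardCollection("db_name.foo2", { user_id: "hashed" })  // writes spread evenly; range queries on user_id hit every shard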
3, Group-based shard keys
A group-based shard key is a compound shard key on two fields: the first field is used for grouping and should preferably have low cardinality (cardinality being the number of distinct values in a field, or their proportion), and the second field is used for increments and should preferably be a steadily increasing field. This shard key strategy works best, because it enables reads and writes on multiple hotspots at once.
A single mongod is most efficient when it handles ascending writes, because data only needs to be appended at the end of the collection. With a group-based shard key, the groups are distributed across the sharded cluster and each shard holds only a small number of chunks, so writes are spread over all the shards in the cluster while each individual shard still reads and writes its data in ascending order. If a shard ends up with too many groups, however, writes to it degenerate into random writes, which is not good.
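A minimal sketch of a group-based shard key (the names are illustrative): the low-cardinality customer_id field groups the data, and the ascending _id field keeps writes within each group sequential:
sh.shardCollection("db_name.logs", { customer_id: 1, _id: 1 })  // groups spread across shards, ascending writes within a group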
Second, directed distribution by shard key range
If you want the chunks in a specific shard key range to live on a particular shard, you can add a tag to the shard and then assign a shard key range to that tag. Any document that falls within the tag's shard key range will then be directed to that specific shard.
1, Assign a tag to a shard
sh.addShardTag("shard1", "shard_tag1")
sh.addShardTag("shard2", "shard_tag2")
sh.addShardTag("shard3", "shard_tag2")
2, Assign a shard key range to the tag
sh.addTagRange("db_name.collection_name", { field: "min_value" }, { field: "max_value" }, "shard_tag")
A shard can carry any number of tags. MongoDB's balancer moves chunks so that every chunk in a tagged shard key range ends up on a shard that carries the matching tag.
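For example, a sketch that pins one year of data to shard1 (the namespace, field, and tag name are made up for illustration):
sh.addShardTag("shard1", "year_2016")
sh.addTagRange("db_name.logs", { created: ISODate("2016-01-01") }, { created: ISODate("2017-01-01") }, "year_2016")
The lower bound of the range is inclusive and the upper bound is exclusive.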
Third, manual data distribution
MongoDB's built-in balancer splits and moves chunks automatically. Sometimes you may want to turn the balancer off and move chunks by hand with the moveChunk command.
1, Disable the balancer
Connect to a mongos and update the balancer document in the config.settings namespace:
use config
db.settings.update({ "_id": "balancer" }, { $set: { "stopped": true } }, true)
The sh.setBalancerState() helper performs the same update:
sh.setBalancerState(false)
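To confirm the change before moving chunks by hand, the standard shell helpers can be checked:
sh.getBalancerState()    // returns false while the balancer is disabled
sh.isBalancerRunning()   // make sure no migration is still in progress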
2, Split a chunk
Splitting a chunk means picking a new boundary point and splitting one chunk into two at that point. In MongoDB, shard key values are ordered from small to large, and a boundary value belongs to the chunk on its right.
sh.splitAt("db_name.collection_name", { sharded_field: "new_boundary_value" })
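For instance, with the foo collection from earlier, sharded on _id, splitting at the illustrative boundary 5 turns chunk2 (4-8) into one chunk holding 4 and another holding 5 through 8, since the boundary value 5 belongs to the right-hand chunk:
sh.splitAt("db_name.foo", { _id: 5 })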
3, Move a chunk
MongoDB moves the chunk containing the specified document to the specified shard, so you must use the shard key to locate the chunk you want to move.
sh.moveChunk("db_name.collection_name", { sharded_field: "value_in_chunk" }, "new_shard_name")
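Continuing the same illustrative example, this moves the chunk that contains the document with _id 6 to shard3:
sh.moveChunk("db_name.foo", { _id: 6 }, "shard3")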
4, Re-enable the balancer
sh.setBalancerState(true)
5, Refresh the mongos cache
Between the application layer and the data storage layer sits the query router, mongos. After its first startup, or after the shard metadata changes, mongos synchronizes the configuration data from the config servers and caches it. Sometimes mongos cannot pick up the latest configuration from the config servers in time, so it fails to route requests to the right chunk and return correct data; the flushRouterConfig command manually refreshes the mongos cache:
db.adminCommand({ "flushRouterConfig": 1 })