MongoDB sharding
Why do I need a sharded cluster?
MongoDB currently has three core advantages: flexible schema, high availability, and scalability. The flexible schema comes from JSON documents, high availability is guaranteed by replica sets, and scalability is guaranteed by sharded clusters.
When to use sharding technology
Storage requirements exceed the capacity of a single disk
The active data set exceeds the memory capacity of a single machine, so many requests have to read data from disk, which hurts performance
Write IOPS exceeds the write capacity of a single MongoDB node
Sharding distributes the data of a collection across multiple shards, which gives MongoDB the ability to scale horizontally.
Sharded cluster architecture
A sharded cluster is composed of three components: Shard, Mongos, and Config Server.
Mongos is the access entrance of the sharded cluster. Mongos itself does not persist data: all metadata of the sharded cluster is stored in the Config Server, and the user data is distributed across the shards. After Mongos starts, it loads the metadata from the Config Server, begins serving requests, and routes each user request to the corresponding shard.
Data distribution strategy
Sharding allows the data of a single collection to be distributed across multiple shards. There are currently two main data distribution strategies:
Range based sharding
Hash based sharding
Range sharding
As shown in the figure, the collection is sharded on a field (the shard key): the documents of the collection are stored in different shards according to the range their field value falls into.
Each shard can store many chunks. The information about which shard holds each chunk is stored in the Config Server, and mongos automatically balances the load according to the number of chunks on each shard.
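The idea of balancing by chunk count can be pictured with a small simulation. This is only a sketch under stated assumptions: the `balance` function, the threshold of 2 chunks, and the shard names are hypothetical and are not MongoDB's actual balancing logic.

```python
def balance(chunks_per_shard, threshold=2):
    """Move one chunk at a time from the fullest shard to the emptiest shard
    until their chunk counts differ by less than `threshold`
    (a hypothetical value chosen for this illustration)."""
    counts = dict(chunks_per_shard)  # copy so the input is not mutated
    migrations = []
    while True:
        fullest = max(counts, key=counts.get)
        emptiest = min(counts, key=counts.get)
        if counts[fullest] - counts[emptiest] < threshold:
            break
        counts[fullest] -= 1
        counts[emptiest] += 1
        migrations.append((fullest, emptiest))
    return counts, migrations


counts, migrations = balance({"shard0": 8, "shard1": 2, "shard2": 2})
print(counts)           # chunk count per shard after balancing
print(len(migrations))  # number of chunk moves that were needed
```

Each recorded migration corresponds to one chunk moving between shards, which is the kind of event the text later says is logged in config.changelog.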
Range sharding is well suited to range queries, for example finding documents whose value of X lies in [100, 200]: based on the metadata stored in the Config Server, mongos can route the request directly to the chunks of the shards that hold that range.
Disadvantage: if the shard key increases (or decreases) monotonically, most newly inserted documents land in the same chunk, so write capacity cannot be scaled out.
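A minimal sketch of range-based routing may help here. The chunk boundaries, the shard names, and the helper functions `chunk_for` and `shards_for_range` are assumptions made up for this example; a real cluster keeps this table in the Config Server.

```python
import bisect

# Hypothetical chunk table for shard key X: chunk i covers
# [CHUNK_BOUNDS[i], CHUNK_BOUNDS[i + 1]). The infinities stand in for
# MongoDB's minKey / maxKey.
CHUNK_BOUNDS = [float("-inf"), 100, 200, 300, float("inf")]
CHUNK_TO_SHARD = ["shard0", "shard1", "shard0", "shard1"]


def chunk_for(x):
    """Return the index of the chunk whose range contains the finite value x."""
    return bisect.bisect_right(CHUNK_BOUNDS, x) - 1


def shards_for_range(lo, hi):
    """Return the shards owning chunks that overlap the closed range [lo, hi]."""
    first, last = chunk_for(lo), chunk_for(hi)
    return sorted({CHUNK_TO_SHARD[i] for i in range(first, last + 1)})


# A range that fits inside one chunk is served by a single shard.
print(shards_for_range(100, 199))

# Monotonically increasing keys all fall into the last chunk, so every
# insert goes to the same shard.
print({chunk_for(x) for x in range(1000, 1010)})
```

The second print illustrates the disadvantage: with an ever-growing key, only the chunk ending at maxKey receives writes.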
Hash sharding
Hash sharding computes a hash value (a 64-bit integer) from the user's shard key, and then distributes documents to chunks by applying the range sharding strategy to that hash value.
Advantages: hash sharding complements range sharding. It distributes documents randomly across the chunks, which lets writes scale out fully and makes up for the weakness of range sharding.
Disadvantages: range queries cannot be served efficiently. Every range query must be sent to all shards in the backend to find the documents that match.
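The following sketch shows the general technique of hashing the key and then range-partitioning the hash. The hash function here (the first 8 bytes of a SHA-256 digest) and the number of chunks are assumptions chosen for illustration; they are not claimed to reproduce the values MongoDB computes.

```python
import hashlib
import struct

NUM_CHUNKS = 4                 # hypothetical number of chunks
SPAN = 2 ** 64 // NUM_CHUNKS   # width of each chunk's slice of the hash space


def hash64(value):
    """Map a shard key value to a signed 64-bit integer (illustrative hash only)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return struct.unpack("<q", digest[:8])[0]


def chunk_for(value):
    """Range-shard on the hash: chunk i covers an equal slice of the 64-bit space."""
    return (hash64(value) + 2 ** 63) // SPAN


# Consecutive keys, which would pile up in one chunk under range sharding,
# are spread over all chunks once they are hashed.
counts = [0] * NUM_CHUNKS
for key in range(10000):
    counts[chunk_for(key)] += 1
print(counts)

# The flip side: the keys of a small range end up in many different chunks,
# so a range query has to visit all of them.
print(sorted({chunk_for(x) for x in range(100, 201)}))
```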
Choosing a reasonable shard key
When choosing a shard key, weigh the needs of the business against the advantages and disadvantages of range sharding and hash sharding. Also consider the actual values the field takes: if too many documents share the same shard key value, they cannot be separated and the result is an oversized chunk.
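Why a poorly chosen field leads to oversized chunks can be illustrated with a toy splitter. This is a sketch under a simplifying assumption: a chunk is a sorted run of shard key values, and a split may only happen between two different values. The function `split_into_chunks` and the limit `max_docs` are hypothetical names for this example.

```python
from itertools import groupby


def split_into_chunks(keys, max_docs):
    """Group sorted shard key values into chunks of at most `max_docs`
    documents where possible, never separating documents with equal keys."""
    chunks, current = [], []
    for _, group in groupby(sorted(keys)):
        group = list(group)
        # Start a new chunk if adding this key value would exceed the limit.
        if current and len(current) + len(group) > max_docs:
            chunks.append(current)
            current = []
        current.extend(group)
    if current:
        chunks.append(current)
    return chunks


# Many distinct key values: chunks stay within the limit.
print([len(c) for c in split_into_chunks(range(8), max_docs=4)])

# One key value shared by six documents: that chunk cannot be split further
# and exceeds the limit.
print([len(c) for c in split_into_chunks(["a"] * 6 + ["b"] * 2, max_docs=4)])
```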
Mongos
Mongos serves as the access entrance of the sharded cluster. All requests are routed and distributed by mongos, and the results are merged by mongos. These actions are transparent to the client driver: the user connects to mongos in the same way as to a mongod.
Query request
If the query request does not contain the shard key, mongos must send the query to all shards, merge the results, and return them to the client.
If the query request contains the shard key, mongos determines the chunk to be queried directly from the shard key and sends the query to the corresponding shard.
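The two query paths can be sketched as a tiny in-memory router. Everything here is hypothetical: the shard key name `user_id`, the two shards, their ranges, and the sample documents exist only for the illustration, and only equality filters are handled.

```python
SHARD_KEY = "user_id"  # hypothetical shard key

# Hypothetical cluster: each shard owns one half-open range of user_id values.
SHARDS = {
    "shard0": {"range": (0, 100),
               "docs": [{"user_id": 7, "city": "Oslo"},
                        {"user_id": 42, "city": "Lima"}]},
    "shard1": {"range": (100, 200),
               "docs": [{"user_id": 150, "city": "Oslo"}]},
}


def target_shards(query):
    """Pick the shards an equality query must visit."""
    if SHARD_KEY in query:
        value = query[SHARD_KEY]
        return [name for name, shard in SHARDS.items()
                if shard["range"][0] <= value < shard["range"][1]]
    return list(SHARDS)  # no shard key in the query: ask every shard


def find(query):
    """Run the query on the targeted shards and merge the results."""
    results = []
    for name in target_shards(query):
        for doc in SHARDS[name]["docs"]:
            if all(doc.get(k) == v for k, v in query.items()):
                results.append(doc)
    return results


print(target_shards({"user_id": 150}))  # only the shard owning that key
print(target_shards({"city": "Oslo"}))  # every shard
print(find({"city": "Oslo"}))           # results merged across shards
```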
Write request
The write operation must include the shard key. Mongos calculates which chunk the document should be stored in according to the shard key, and then sends the write request to the shard where the chunk is located.
Update/delete request
The query conditions of update and delete requests must include the shard key or _id. If they contain the shard key, the request is routed directly to the specified chunk. If they contain only _id, the request must be sent to all shards.
Other command requests
Config Server
config database
The Config Server stores all metadata of the sharded cluster, and all of that metadata lives in the config database.
The Config Server can be deployed as an independent replica set, which greatly simplifies the operation and maintenance of the sharded cluster.
config.shards
The config.shards collection stores the information of each shard. You can dynamically add or remove shards from the Sharded cluster through the addShard and removeShard commands
config.databases
The config.databases collection stores information about every database, including whether sharding is enabled for the database and which shard is its primary shard. For collections in the database that are not sharded, all data is stored on the primary shard of the database.
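As a rough illustration of how this metadata determines where data lives, the sketch below uses hypothetical entries loosely modelled on config.databases. The field names `primary` and `partitioned`, the database names, and the helper `storage_location` are assumptions for this example and may not match the document layout of a given MongoDB version.

```python
# Hypothetical entries, loosely modelled on documents in config.databases.
CONFIG_DATABASES = {
    "shop": {"primary": "shard0", "partitioned": True},
    "logs": {"primary": "shard1", "partitioned": False},
}
# Hypothetical set of namespaces for which sharding has been enabled.
SHARDED_COLLECTIONS = {"shop.orders"}


def storage_location(namespace):
    """Return where the documents of `namespace` ("db.collection") are stored."""
    db_name = namespace.split(".", 1)[0]
    db_info = CONFIG_DATABASES[db_name]
    if db_info["partitioned"] and namespace in SHARDED_COLLECTIONS:
        return "distributed across shards"
    return db_info["primary"]  # unsharded data stays on the primary shard


print(storage_location("shop.orders"))    # sharded collection
print(storage_location("shop.settings"))  # unsharded collection in a sharded database
print(storage_location("logs.events"))    # database without sharding enabled
```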
config.collections
Sharding is applied at the collection level. After sharding is enabled for a database, a collection is only stored across shards once you call the shardCollection command to enable sharding for that collection.
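The two steps can be written down as command documents. The sketch below only builds the documents and does not talk to a server; the namespace `shop.orders` and the field `user_id` are hypothetical. With a driver such as PyMongo, documents like these would typically be sent to the admin database through a mongos connection, for example with `client.admin.command(...)`.

```python
def enable_sharding_cmd(database):
    """Build the command document that enables sharding for a database."""
    return {"enableSharding": database}


def shard_collection_cmd(namespace, field, hashed=False):
    """Build the command document that shards `namespace` ("db.collection")
    on `field`, using either a ranged (1) or a hashed shard key."""
    return {"shardCollection": namespace,
            "key": {field: "hashed" if hashed else 1}}


# Hypothetical database, collection, and shard key field.
print(enable_sharding_cmd("shop"))
print(shard_collection_cmd("shop.orders", "user_id"))
print(shard_collection_cmd("shop.orders", "user_id", hashed=True))
```

The command name is placed first in each dictionary on purpose, since MongoDB identifies a command by the first key of the document.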
config.chunks
After sharding is enabled for a collection, one new chunk is created by default, and all documents, that is, those whose shard key value lies in [minKey, maxKey], are stored in this chunk. When using the hash sharding strategy, you can also create multiple chunks in advance to reduce chunk migration.
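Pre-creating chunks for a hashed key amounts to cutting the 64-bit hash space into equal slices. The sketch below shows that arithmetic only; the function `initial_chunks` is hypothetical, and the strings "minKey" and "maxKey" are placeholders for MongoDB's special boundary values.

```python
MIN_HASH = -2 ** 63    # smallest signed 64-bit value
HASH_SPACE = 2 ** 64   # number of distinct 64-bit hash values


def initial_chunks(n):
    """Split the signed 64-bit hash space into n contiguous [lo, hi) ranges."""
    points = [MIN_HASH + i * HASH_SPACE // n for i in range(1, n)]
    bounds = ["minKey"] + points + ["maxKey"]
    return list(zip(bounds[:-1], bounds[1:]))


# The default case: a single chunk covering every possible key.
print(initial_chunks(1))

# Four chunks created up front, each covering a quarter of the hash space.
for lo, hi in initial_chunks(4):
    print(lo, hi)
```

With the chunks laid out in advance, newly inserted documents are spread over all of them from the start instead of filling one chunk that later has to be split and migrated.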
config.settings
The config.settings collection mainly stores the configuration information of the sharded cluster, such as chunk size, whether to enable balancer, etc.
Other collections
config.tags mainly stores information related to the tags of the sharded cluster
config.changelog mainly stores all change operations in the sharded cluster. For example, chunk migrations performed by the balancer are recorded in the changelog
config.mongos stores the information of all mongos in the current cluster
config.locks stores lock-related information. When operating on a certain collection, such as moveChunk, you need to acquire the lock first to prevent multiple mongos from migrating chunks of the same collection at the same time.