Sharding Cluster in MongoDB

As the volume of MongoDB data grows, a single node may hit its storage limit, and a large number of application requests can also overload a single node, so a clustered environment is needed. With sharding, the whole dataset is split into smaller chunks and distributed across multiple mongod nodes in the cluster, which scales out both storage and load capacity and spreads the request pressure. In a sharded architecture, each mongod node that stores part of the data is called a shard, and the blocks of data distributed on the shards are called chunks. A collection is split into multiple chunks according to its "shard key" and those chunks are distributed relatively evenly across the shards.

1) Sharding spreads the application's data access across multiple shards; each shard serves only part of the requests. For example, a read operation only needs to access the shard that holds the requested data.

2) Sharding also reduces the amount of data each shard node has to store.
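For illustration, here is a minimal mongo shell sketch of enabling sharding for a collection; the database "mydb", collection "users" and shard key field "zipcode" are hypothetical names:

    sh.enableSharding("mydb")                            // allow collections in "mydb" to be sharded
    sh.shardCollection("mydb.users", { "zipcode": 1 })   // range-partition "users" by "zipcode"
    sh.status()                                          // show shards, databases and chunk distribution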

The typical topology of a sharded cluster contains three components: shards, config servers, and query routers:

1) Shards: the storage nodes. To improve availability and data consistency, each shard is usually a "replica set".

2) Query routers: the routing nodes, i.e. the mongos nodes. mongos receives client requests, forwards the operations to the appropriate shard or shards according to the routing rules, and returns the results to the client. It only routes and forwards, similar to a proxy layer. A sharded cluster can have multiple mongos nodes to balance client requests. In a sharded cluster, clients (including the shell) access the data through mongos; if you connect directly to a shard, you only see that shard's fragment of the data.

3) Config servers: store the cluster's metadata, including the list of shards, the mapping between chunks and the dataset, lock information, and so on. They are the hub of the cluster: mongos uses this information to route requests to specific shards. For a production environment, a cluster must have 3 config servers.

For a test environment, each shard can be a single mongod, and one config server is enough.

Data is partitioned by the "shard key": every collection that needs to be sharded must have a shard key specified. The shard key must be an indexed field or the left prefix of a compound index. MongoDB splits the data into multiple chunks according to the shard key and distributes them evenly across the shard nodes. Currently MongoDB supports two partitioning algorithms: range partitioning and hash partitioning.

1) Range partitioning: the shard key values must be orderable (they are compared using BSON ordering). The overall key space is bounded by negative and positive infinity, and each chunk covers one interval, so any shard key value is covered by exactly one chunk. The intervals do not overlap and sit next to each other (each is closed on the lower bound and open on the upper bound). Chunks are not created in advance; they are split continuously as their data grows. (see below)

2) Hash partitioning: MongoDB computes a hash value of the shard key (a 64-bit integer) and partitions on that value as the range, using essentially the same method as 1). The hash function scatters values well: different shard key values usually produce different hash values (collisions are limited), so this partitioning approach spreads documents more randomly across chunks.

Range partitioning supports range queries better: for a range query on the shard key, the router can easily determine which chunks cover the range and forward the request to just those shards. However, when the shard key increases monotonically, range partitioning leads to uneven load, because for a period of time all write requests (and read requests for the most recent data) are mapped to a single shard, so a few shards carry most of the system's requests during that period.

Hash partitioning behaves the opposite way: even for a monotonically increasing shard key, the hash values differ widely, so the data is scattered more randomly across multiple chunks. But this makes range queries harder: adjacent shard key values may land in different chunks, or even on different shards, which means a range query has to access all shards, especially when operations such as sort and limit are involved.

The insertion and deletion of data and the addition or removal of shard nodes can leave the data unevenly distributed. MongoDB provides a balancer mechanism in mongos that can split chunks and migrate them, eventually rebalancing the data distribution.

Splitting: a background process that keeps chunks from growing too large. When a chunk exceeds the configured chunk size (64 MB by default, configurable), MongoDB splits it into two halves. Inserts and updates can trigger a split. A split does not migrate any data and does not affect the shard; after the split, the metadata on the config servers is updated. (The communication pattern is different from "chunk migration", see below.)
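Chunks can also be split by hand (for example to pre-split a collection); a small mongo shell sketch, reusing the hypothetical "mydb.users" collection sharded on "zipcode":

    sh.splitAt("mydb.users", { "zipcode": "50000" })     // split the chunk covering this key exactly at this value
    sh.splitFind("mydb.users", { "zipcode": "10010" })   // split the chunk containing this document at its median point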

Balancing: a background thread that manages chunk migrations. The balancer can run on any one (or several) of the mongos nodes. When a collection's data is distributed unevenly across the cluster, the balancer migrates chunks from the shard holding the most chunks to the shard holding the fewest, until balance is reached. During a chunk migration, the source shard sends all of the chunk's data to the target shard; during that time the source shard still serves client requests (reads and writes) for the chunk, and finally the chunk's location information on the config servers is updated. If an exception occurs during the migration, the balancer aborts it and the chunk stays on the original shard; the chunk's data is removed from the original shard only after the migration succeeds.
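The balancer can be inspected and switched from any mongos; a brief mongo shell sketch using the standard helpers, shown only as an illustration:

    sh.getBalancerState()        // true if the balancer is enabled
    sh.isBalancerRunning()       // true if a balancing round is running right now
    sh.setBalancerState(false)   // disable the balancer, e.g. before maintenance
    sh.setBalancerState(true)    // enable it again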

The cluster can be adjusted dynamically: for example, when the data volume grows beyond a certain point you can add shard nodes to the cluster, or remove shards if the data volume shrinks; either change triggers dynamic rebalancing of chunks.

I. Composition of a sharded cluster

1. Shards: as noted above, shards are the MongoDB nodes that store the actual data; each shard holds multiple chunks. Some collections do not need to be sharded: their data is kept on a single shard node and is never split. That node is called the primary shard. Within a database, all non-sharded collections are stored on the same primary shard, but different databases may have different primary shards. To change the primary shard of a database, use the "movePrimary" command.
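A sketch of changing a database's primary shard from the mongos shell; the database name and target shard name are hypothetical:

    db.adminCommand({ movePrimary: "mydb", to: "shard0001" })   // move the non-sharded collections of "mydb" to "shard0001"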

In a production environment, each shard is usually a replica set to avoid data loss or inconsistency. If all members of a shard's replica set fail, that shard's data becomes unavailable, but the other shards can still serve reads and writes; the application's queries, however, need to be able to handle this situation. If you want queries to still return the data that is reachable, accepting that some data may be missing, you can specify the "partial" option on the read operation. Java code (batchSize and limit values are illustrative):

    MongoCursor<Document> cursor = collection.find(Filters.eq("name", "zhangsan"))
            .batchSize(100)    // illustrative value
            .limit(10)         // illustrative value
            .partial(true)     // allow partial results if some shards are unavailable
            .iterator();

2. Config servers: the configuration servers, the hub of the cluster, which store the metadata. A production environment needs exactly three config servers, and metadata changes can only be committed when all three are available. These three config servers do not form a replica set; they are deployed independently. (see below) For a test environment you can run just one config server, but then it is a single point of failure for the cluster: if it fails, the whole cluster becomes inaccessible, and if the metadata is lost, the whole cluster becomes unusable.

Config servers keep the metadata in the config database (described later). A mongos instance fetches the metadata from the config servers, caches it locally, and uses it to route read and write requests. MongoDB only modifies the metadata after a "chunk migration" or a "chunk split". When metadata needs to be changed, the coordinator (mongos) sends the change to all three config servers and collects their responses; if the results differ, the metadata is inconsistent and manual intervention may be required. In that situation the balancer will not perform chunk migrations and mongos will not perform chunk splits.

When mongos starts, it fetches the metadata from the config servers; some runtime errors also cause mongos to fetch the metadata again. In addition, some locks are stored on the config servers, which we will explain later.

If any one config server fails, the cluster's metadata becomes read-only; reads and writes to the shards still work, but chunk splitting and migration stop until all three config servers are available again. If all three config servers fail, the cluster cannot read its metadata, and if mongos is then restarted it cannot obtain the metadata and cannot route requests until the config servers come back. The amount of metadata is very small, so it puts no storage pressure on the config servers or mongos; a config server needs only modest hardware, memory, and storage space.

A production environment needs three config servers; for testing purposes one is enough.

Note: MongoDB 3.2+ finally changes the config server deployment model (ending the endless "why are three config servers required" questions) and drops the requirement to use exactly three: config servers can now be deployed as a "replica set", and must use the WiredTiger storage engine. This change strengthens the consistency of the config server data, and thanks to the replica set architecture the number of config servers can be scaled up to 50 nodes. However, the replica set cannot contain "arbiter" or "delayed" members, and every member's "buildIndexes" must be set to true.
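As a sketch of the 3.2+ style, assuming three mongod processes have been started with --configsvr --replSet configReplSet (the hostnames are hypothetical), the config server replica set is initiated from one of them in the mongo shell:

    rs.initiate({
        _id: "configReplSet",
        configsvr: true,                                  // this replica set acts as the config servers
        members: [
            { _id: 0, host: "cfg1.example.net:27019" },
            { _id: 1, host: "cfg2.example.net:27019" },
            { _id: 2, host: "cfg3.example.net:27019" }
        ]
    })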

3. mongos: the routers. mongos stores no user data itself; it forwards the clients' read and write requests, merges the results from the shards, runs the balancer process, tracks splits, and so on. We usually deploy one mongos on each application node: mongos consumes very little memory and almost no disk, needing only some memory and CPU to process data, and co-locating it with the application gives the shortest, most efficient communication path. You might want to put a proxy or load balancer between the applications and mongos, but this is hard to do well and can cause many problems, because the proxy would need to understand the MongoDB wire protocol. For reads and writes the client usually picks a mongos at random, which gives simple load balancing, but for a cursor-based read the requests during cursor traversal are sent to a single mongos, because only that mongos holds the cursor's state.

If the query specifies a sort, mongos passes the $orderby parameter to the selected shards (the chunks and shards are selected according to the shard key), and the primary shard of the database receives and merges the sorted results from each shard and returns them to the client via mongos. If the query specifies limit(), mongos passes the limit to the selected shards and each shard returns at most that many documents; since the combined result may contain more documents than the limit, mongos applies the limit again before returning the results to the client. If the client uses skip(), mongos does not pass the skip parameter to the shards, because it would not help filter the result: mongos receives the not-yet-skipped data from the shards, then performs the skip itself and assembles the data for the client, mainly because new data may be inserted on each shard at any moment, so mongos cannot compute in advance where each shard should skip to. When skip and limit are used together, things are a bit simpler: mongos passes limit + skip as the limit to the shards and then, as with skip alone, applies the skip locally, which improves query performance. This also means we should specify a limit whenever possible when a skip is required; and in a sharding environment sort is usually a relatively expensive operation (even though the shard key index is ordered).
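A hypothetical query against the sharded "mydb.users" collection, with comments summarizing the behavior described above:

    db.users.find({ "zipcode": "10010" })   // shard key condition: routed only to the shards holding matching chunks
            .sort({ "age": 1 })             // each shard sorts its own part; the sorted streams are then merged
            .skip(100)                      // skip is not pushed down to the shards; it is applied on the merged result
            .limit(10)                      // each shard is asked for at most skip + limit (110) documents,
                                            // and the final skip and limit are applied again before returning to the client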

For queries, updates, removes, and aggregation operations that do not specify the shard key, mongos broadcasts the operation to all shards. For non-sharded collections, the data lives on the primary shard; an application could connect to that shard directly to get the data, but for coordinated access to the cluster we still recommend going through mongos as the router.

II. Shard key

The shard key determines how a collection's data is distributed across the cluster. The shard key must be an indexed field or the left prefix of a compound index. Once a document has been inserted, no update operation may modify its shard key; attempting to do so throws an exception. A "multikey index" cannot be used as the shard key.

A hashed shard key can only be a hashed index on a single field, so choose the field carefully: it should have good cardinality, i.e. few duplicate values. Monotonically increasing fields such as ObjectId or a timestamp are good candidates for a hashed shard key. If you shard an empty collection with a hashed shard key, MongoDB automatically creates two empty chunks on each shard by default, but you can pass the "numInitialChunks" parameter to the shardCollection command to control the number of initial chunks.
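A sketch of sharding an empty collection with a hashed shard key and a fixed number of initial chunks, using the shardCollection command; the namespace and chunk count are hypothetical:

    sh.enableSharding("mydb")             // if sharding is not yet enabled for the database
    db.adminCommand({
        shardCollection: "mydb.events",
        key: { "_id": "hashed" },         // hashed shard key on _id
        numInitialChunks: 8               // pre-create 8 chunks across the shards (hashed key, empty collection only)
    })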

When selecting the shard key you need to consider the application's requirements: the read/write ratio and how the data is read. If the cluster mostly handles writes and very few reads or updates, the shard key should spread the write pressure across multiple shards, for example by using hash partitioning with an ObjectId (monotonically increasing) as the shard key. If reads dominate and writes are few, consider how data is read: if queries are typically range queries (such as timestamp > some time), then range partitioning with a monotonically increasing shard key (the timestamp) is appropriate; if queries are usually equality matches, hash partitioning can achieve better performance.

The efficient query path is for mongos to simply forward the request to a single shard. By contrast, if the query does not specify the shard key, mongos forwards the request to all shards and waits for their results; this "scatter/gather" pattern can lead to long-running operations and usually shows up in aggregation queries. If the full shard key (possibly a compound key) is specified in the query, mongos routes the request to a single shard; if the query specifies the leftmost prefix of the shard key, mongos may route the request to a few shards. The more shard key fields the query covers, the fewer shards participate in it, a principle very similar to how indexes work. For example, if the shard key is {"zipcode": 1, "name": 1, "age": 1}, a query with the condition {"zipcode": "10010", "name": "zhangsan"} performs better and touches fewer shards than one that only uses "zipcode".
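With the shard key {"zipcode": 1, "name": 1, "age": 1} assumed above, the routing behavior looks roughly like this (collection name hypothetical):

    db.users.find({ "zipcode": "10010", "name": "zhangsan", "age": 30 })  // full shard key: targeted to one shard
    db.users.find({ "zipcode": "10010" })                                 // leftmost prefix: targeted to a few shards
    db.users.find({ "age": 30 })                                          // no shard key: scatter/gather to all shards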

In general, we should use a compound field as the shard key, unless we can be sure a single field's values are unique and non-repeating. The chosen combination should maximize cardinality (minimize duplicate values), which greatly helps chunk splitting.

III. Mechanism of sharding

1. Balancing: if one shard holds more chunks than the others, the cluster is unbalanced, and mongos automatically migrates chunks to restore balance; the balancing process does not affect the user's data operations. Any mongos instance in the cluster can start the balancing thread, and the balancer is enabled by default. There is a lock collection in the config database (on the config servers): when the balancer is active, the corresponding mongos tries to acquire the "lock" by modifying a document, and if it acquires the lock it is responsible for the balancing work. Note that the local system time of the mongos instances affects this lock mechanism, so the clocks of all mongos instances (and of all shards and config servers in the cluster) need to be kept consistent, e.g. using ntpd.
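The balancer lock can be inspected in the config database; a sketch from the mongos shell (the exact document layout is version-dependent):

    use config
    db.locks.find({ _id: "balancer" })   // "who" shows which mongos holds the lock; "state" 2 means locked, 0 unlocked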

The balancer migrates chunks from the shard holding the most chunks to the shard holding the fewest, one chunk at a time, until the cluster is roughly balanced (the difference is at most 2). Chunk migrations consume disk space: chunks that have been migrated away are not deleted immediately but archived to a specific directory (the archive feature is on by default). Migration also consumes some network bandwidth and can hurt performance, affecting the I/O throughput available to user operations. It is therefore recommended to migrate one chunk at a time, to migrate only when the difference between the "largest" and "smallest" shard reaches the threshold, or to specify a time window so that the balancer only migrates chunks during that period.
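A sketch of restricting the balancer to a time window by writing to config.settings from the mongos shell; the window times are hypothetical:

    use config
    db.settings.update(
        { _id: "balancer" },
        { $set: { activeWindow: { start: "23:00", stop: "06:00" } } },   // only balance between 23:00 and 06:00
        { upsert: true }
    )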

Threshold: to minimize the impact of balancing on the cluster, chunks are rebalanced only when the difference in chunk counts between the "largest" and "smallest" shard reaches the threshold. The threshold currently cannot be modified: when the total number of chunks is < 20 the value is 2, when the total is >= 80 the value is 8, and otherwise it is 4. Once balancing starts, it stops only when the distribution is balanced again, i.e. the difference between "max" and "min" is no greater than 2.

By default, MongoDB consumes as much free disk space as possible, so its disk consumption needs to be monitored; however, when adding a shard node to the cluster you can specify the maximum amount of disk space (maxSize) that shard is allowed to use. When the shard's disk consumption reaches this maximum, the balancer stops migrating chunks to it, although this does not prevent further writes to that shard. (see the addShard command below)
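A sketch of adding a shard with a disk cap from the mongos shell; the host and size are hypothetical (maxSize is given in MB):

    db.adminCommand({ addShard: "127.0.0.1:28018", maxSize: 20480 })   // stop migrating chunks to this shard once it holds ~20 GB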

2. Chunk migration process: the balancer sends a "moveChunk" command to the source shard (see moveChunk below), and the source shard starts moving the specified chunk. During the migration, user operations are still routed to the source shard, which remains responsible for reads and writes on this chunk. If the destination shard does not have the indexes the chunk requires, it builds them at this point. The destination shard then requests the documents in the chunk and stores them locally. Data in the chunk may change during this time, so once the chunk data has been copied, the destination shard synchronizes those changes as well. When synchronization is complete, the destination shard connects to the config servers and updates the chunk's location in the metadata; during this step the source shard blocks client writes. Afterwards, read and write requests are routed to the new shard, and once no cursors remain on the old chunk, the source shard deletes it (by default the data is moved to an archive directory, the "moveChunk" directory under dbpath).

The final step, in which the source shard waits for cursors to close and then deletes the chunk, is known as the "delete phase", but the balancer does not have to wait for it to finish before starting the next chunk migration, which improves migration efficiency, lets chunk data move sooner, and brings the cluster into balance faster. Sometimes the "delete phase" can take a long time, so the "_waitForDelete" parameter controls this behavior: by default the balancer does not wait for the delete phase, while setting _waitForDelete to true makes it wait for the delete phase to complete before migrating the next chunk.
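Chunks can also be moved by hand with the moveChunk command, which accepts the parameters discussed above; the namespace, document and shard name are hypothetical:

    db.adminCommand({
        moveChunk: "mydb.users",
        find: { "zipcode": "10010" },   // any document inside the chunk to move
        to: "shard0001",                // name of the destination shard
        _waitForDelete: true            // optionally wait for the delete phase to finish before returning
    })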

If a chunk's size already exceeds the configured chunk size, or the number of documents it holds exceeds the maximum, it will not be migrated; it has to be split before it can be migrated.

_secondaryThrottle (throttle valve): each shard is usually a replica set, and a chunk migration is essentially the destination shard bulk-reading documents from the source and writing them into its replica set (through the primary), which raises the "write concern" question: how many secondaries must acknowledge a write before moving on. The "_secondaryThrottle" parameter in a sharding environment controls this. It defaults to true, which means each write must be replicated to at least one secondary, semantically equivalent to {w: 2} in write concern, so the documents of a chunk are synchronized to at least one secondary before the migration moves on to the next chunk. You can set it to false, i.e. close the valve, which effectively behaves like {w: 1}; you can also specify a "writeConcern" parameter to require that documents be replicated to more secondaries. (see the operations section below)
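A sketch of changing this balancer-wide setting via config.settings from the mongos shell (it can also be passed to an individual moveChunk command):

    use config
    db.settings.update(
        { _id: "balancer" },
        { $set: { "_secondaryThrottle": true, "writeConcern": { "w": "majority" } } },   // wait for a majority of members during migration
        { upsert: true }
    )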

3. Split: the default size of each chunk is 64 MB, and this value can be adjusted. Smaller chunks make the data distribution more even and easier to migrate, but splits happen more frequently and mongos routing becomes more expensive (each chunk holds less data, so a query means more chunks to access). Larger chunks are harder to migrate, but there are fewer splits, less metadata, and simpler mongos routing; the downside is that the data distribution may become uneven. Personally, I think 64 MB is still too small and suggest increasing it to 256 MB.

Note that the split operation is only triggered by inserts or updates.
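The global chunk size can be changed from the mongos shell by writing to config.settings; a sketch using the 256 MB value suggested above:

    use config
    db.settings.save({ _id: "chunksize", value: 256 })   // chunk size in MB; affects future splits only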

4. Shard key indexes: as noted earlier, enabling sharding on a collection requires specifying its shard key, but before that you need to create an index whose leading fields are the shard key fields. For example, if the shard key is {"zipcode": 1, "username": 1}, you need an index of the form {"zipcode": 1, "username": 1} or {"zipcode": 1, "username": 1, "others": 1 ...}.
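A sketch of preparing the index and then sharding the collection, reusing the hypothetical "mydb.users" namespace:

    db.users.createIndex({ "zipcode": 1, "username": 1 })              // index must start with the shard key fields
    sh.shardCollection("mydb.users", { "zipcode": 1, "username": 1 })  // shard on the same leading fields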

IV. Deployment

We now build a sharding test environment; the node deployment list is as follows:

1) Shards: two, on ports 27018 and 28018, each a single mongod node. (Note: in an online environment there should be at least 2 shards, and each shard should be a replica set.)

2) Config server: one, on port 27019. (Note: an online environment must have three config servers.)

3) mongos: one, on port 27017. (Note: in an online environment mongos is deployed alongside the application nodes, usually several of them.)

Note that all nodes of the same type should use the same configuration (apart from ports and file paths) to avoid problems. The configuration below is for the "test environment"; in a test environment you can also set "smallFiles" to true to save disk space, which is not recommended for production.

1. Config server deployment. The mongod configuration file (YAML format) is as follows:

    systemLog:
      quiet: false
      path: /data/configdb/logs/mongod.log
      logAppend: false
      destination: file
    processManagement:
      fork: true
      pidFilePath: /data/configdb/mongod.pid
    net:
      bindIp: 127.0.0.1
      port: 27019                  # config server port from the deployment list above
    sharding:
      clusterRole: configsvr       # run this mongod as a config server
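Once the config server, the two shard mongods and mongos are running, the shards still have to be registered through mongos; a sketch in the mongo shell using the ports listed above:

    sh.addShard("127.0.0.1:27018")   // first shard
    sh.addShard("127.0.0.1:28018")   // second shard
    sh.status()                      // verify that both shards are registered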
