A properly configured MongoDB sharded cluster has no single point of failure.
This article describes several node failure scenarios in a sharded cluster and how MongoDB handles each of them.
1. A mongos node goes down
A mongos process should run on each application server; the application talks only to its local mongos and communicates with the sharded cluster through it.
mongos processes keep no persistent state of their own. Instead, they pull all required configuration information from the config servers at startup.
This means the failure of any single application server does not affect the rest of the sharded cluster; all other application servers continue to work normally.
Recovery in this case is simple: start a new application server and a new mongos process.
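As a sketch (the hostnames and ports here are illustrative assumptions, not from any particular deployment), bringing up a replacement mongos only requires pointing it at the existing config servers, from which it reloads all cluster metadata at startup:

```shell
# Start a fresh mongos on the replacement application server.
# cfg1/cfg2/cfg3 are assumed hostnames for the three config servers.
mongos --configdb cfg1.example.com:27019,cfg2.example.com:27019,cfg3.example.com:27019 \
       --port 27017
```

No data needs to be restored, which is exactly why a mongos failure is the easiest case to handle.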
2. A mongod node in a shard goes down
Each shard consists of n servers configured as a replica set. If a single node in the replica set goes down, the shard remains available for reads and writes (if the failed node was the primary, the remaining members elect a new one).
More importantly, data on the failed server is not lost: the replication mechanism offers an option that forces a write to be copied to other nodes of the shard before returning, similar to setting W = 2 in Dynamo.
Replica sets are available in MongoDB v1.6 and later.
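In the mongo shell of that era, this replicated acknowledgement was requested through getLastError. A sketch (the collection name and timeout are illustrative, and it assumes a running replica set):

```javascript
// Insert a document, then block until at least 2 replica set members
// have acknowledged the write (analogous to Dynamo's W = 2).
db.orders.insert({ item: "abc", qty: 1 });
db.runCommand({ getlasterror: 1, w: 2, wtimeout: 5000 });
```

Later MongoDB versions express the same idea as a per-operation write concern (w: 2) instead of a separate getLastError call.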
3. All mongod nodes in a shard are down
If all nodes (replicas) in a shard are down, the data on that shard cannot be accessed. The cluster as a whole keeps running, however: operations that target the other shards continue to work, since each shard holds only its own portion of the data.
If a shard is configured as a replica set with at least one member in another data center, losing the entire shard at once becomes almost impossible. We recommend this configuration for greater redundancy.
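Such a geographically spread replica set can be initiated as follows; the hostnames and the set name shardA are assumptions for illustration:

```javascript
// Initiate a shard's replica set: two members in the primary data
// center, one member in a remote data center for redundancy.
rs.initiate({
  _id: "shardA",
  members: [
    { _id: 0, host: "dc1-node1.example.com:27018" },
    { _id: 1, host: "dc1-node2.example.com:27018" },
    { _id: 2, host: "dc2-node1.example.com:27018" }  // remote data center
  ]
});
```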
4. A config server goes down
A production-grade sharded cluster requires three config server processes, each running on a separate machine. Writes to the cluster metadata on the config servers use a two-phase commit, so each metadata change is an atomic, replicated transaction.
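A minimal sketch of that layout (the data path and hostnames are illustrative assumptions): the same config server command runs once on each of the three machines.

```shell
# Run one config server process per machine
# (repeat on each of the three assumed hosts cfg1, cfg2, cfg3):
mongod --configsvr --dbpath /data/configdb --port 27019
```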
If any one config server fails, the cluster's metadata becomes read-only. The cluster itself keeps running, but chunks can no longer be split or migrated between shards until all config servers are back. For most use cases this causes no problems, because chunk metadata does not need to change frequently.
It is still important to recover the failed config server within a reasonable time (say, a day) to avoid the load imbalance that builds up while migrations are suspended (for most production scenarios, this imbalance is not severe).