MongoDB replica sets automatically tolerate the failure of some nodes: when the replica set runs into trouble, an election-related process is triggered and the primary and secondary roles are switched automatically.
Each replica set member runs background heartbeat threads to all other nodes in the replica set. The state detection process is triggered in either of the following two cases:
- A member's heartbeat result changes, for example a node goes down or a new node appears.
- No state detection has been performed for more than 4s.
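The two triggers above can be sketched as follows. This is a minimal illustration, not MongoDB's actual code: the class `StateCheckTrigger`, its methods, and the string heartbeat states are hypothetical names chosen for the example; only the 4s threshold comes from the text.

```python
import time

CHECK_INTERVAL_SECS = 4.0  # from the text: re-check after more than 4s without a check


class StateCheckTrigger:
    """Hypothetical helper that decides when to run the state detection process."""

    def __init__(self, now=None):
        self.last_check = time.monotonic() if now is None else now
        self.last_seen = {}  # member name -> last observed heartbeat state

    def on_heartbeat(self, member, state):
        """Record a heartbeat result; return True if it differs from the previous one.

        A member seen for the first time counts as a change (a new node).
        """
        changed = self.last_seen.get(member) != state
        self.last_seen[member] = state
        return changed

    def should_run(self, heartbeat_changed, now=None):
        """True if a heartbeat result changed or the last check is over 4s old."""
        now = time.monotonic() if now is None else now
        return heartbeat_changed or (now - self.last_check) > CHECK_INTERVAL_SECS

    def mark_ran(self, now=None):
        """Remember when the state detection process last ran."""
        self.last_check = time.monotonic() if now is None else now
```

A caller would feed each heartbeat result into `on_heartbeat`, pass the returned flag to `should_run`, and call `mark_ran` after running the check.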
The state detection process generally involves the following steps:
- Check whether the node is already in an election; if so, exit the process.
- Maintain a candidate list of primary nodes. Every node in the list may be elected primary. Each node checks whether it and the overall state satisfy the following conditions:
  - A majority of the replica set is online.
  - Its priority is greater than 0.
  - It is not an arbiter.
  - Its optime does not lag the newest node by 10s or more.
  - The replica set configuration it stores is up to date.
  If all conditions are met, the node adds itself to the candidate list; otherwise it removes itself from the list.
- Check the following conditions; if all of them hold, the primary is demoted to a secondary. (If the primary to be demoted is the node itself, it calls its own demote method directly; otherwise it sends the replSetStepDown command to demote the replica set's primary to a secondary.)
  - A primary exists in the cluster.
  - The candidate list contains a node with a higher priority than the current primary.
  - The highest-priority node in the candidate list has an optime less than 10s behind the newest optime of all other nodes.
- Check whether the node itself is the primary. If it is, and it cannot see a majority of the replica set online, it demotes itself to a secondary.
- If no primary is visible in the cluster, check whether the node is in the candidate list; if it is not, write a log entry and exit the process.
- If the node is in the candidate list, it decides whether it may notify the replica set that it is standing for election as primary. The checks are:
  - Whether it can see a majority of the replica set online.
  - Whether it is itself in the candidate list.
  If the conditions are met, it sets the "already in an election" flag to true and enters the "elect self as primary" method.
- The method first verifies that the following conditions hold:
  - The current thread has acquired the lock.
  - The node is not configured with the slaveDelay option, or its slaveDelay is 0.
  - The node is not configured as an arbiter.
  If they hold, it runs an environment check. If any of the following conditions is triggered, it does not send the "elect me as primary" vote request:
  - The current time is earlier than the end of the step-down freeze period (the time the step-down was executed plus the configured freeze time; the internal call uses 60s).
  - Its optime is not the newest of all nodes:
    - If another node's optime is newer than its own, it exits the process directly.
    - If other nodes are exactly as fresh as it is, it sleeps for a random period for each such node and then checks again.
  - It has been online for less than 5 minutes and not all nodes in the replica set are online.
- If nothing else blocks it, the node tries to cast its own vote for itself. In doing so it checks whether it has already voted within the last 30s; if it has, it exits the process directly.
- After all these checks, the node can finally send the "elect me as primary" vote request to the replica set.
- After sending, it collects the votes from all nodes. If the total is less than or equal to half, it does not become primary; if the total is more than half, it sets itself as primary.
After the vote is over, it sets the "already in an election" flag back to false.
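The candidate-side checks described in the steps above can be sketched as three small predicates. This is a simplified illustration under stated assumptions, not MongoDB's implementation: the `Member` type, the function names, and the use of plain seconds for optime are all hypothetical. The 10s lag limit and the 5 minute startup window come from the text (the latter read as "online for less than 5 minutes"). The random sleep on an optime tie is left out.

```python
from dataclasses import dataclass

MAX_LAG_SECS = 10            # optime lag limit from the text
STARTUP_GRACE_SECS = 5 * 60  # 5 minute startup window from the text


@dataclass
class Member:
    """Hypothetical, simplified view of one replica set member."""
    name: str
    priority: float = 1.0
    optime: float = 0.0          # seconds; stand-in for a real optime
    is_arbiter: bool = False
    online: bool = True
    config_current: bool = True  # stored replica set config is up to date


def sees_majority(members):
    """True if strictly more than half of all members are online."""
    online = sum(1 for m in members if m.online)
    return online * 2 > len(members)


def is_candidate(me, members):
    """Conditions for a node to add itself to the candidate list."""
    newest = max(m.optime for m in members if m.online)
    return (sees_majority(members)
            and me.priority > 0
            and not me.is_arbiter
            and newest - me.optime < MAX_LAG_SECS
            and me.config_current)


def should_demote_primary(primary, candidates, members):
    """Conditions under which the current primary is demoted."""
    if primary is None:
        return False
    if not any(c.priority > primary.priority for c in candidates):
        return False
    top = max(candidates, key=lambda c: c.priority)
    newest_other = max((m.optime for m in members if m is not top and m.online),
                       default=top.optime)
    return newest_other - top.optime < MAX_LAG_SECS


def may_request_votes(me, members, now, frozen_until, started_at):
    """Environment check before sending the vote request (ties not handled)."""
    if now < frozen_until:                      # still inside the step-down freeze
        return False
    others = [m for m in members if m is not me and m.online]
    if any(m.optime > me.optime for m in others):
        return False                            # someone is fresher than us
    all_up = all(m.online for m in members)
    if not all_up and now - started_at < STARTUP_GRACE_SECS:
        return False                            # recently started, members missing
    return True
```

In this sketch `members` always includes the node itself, and times are plain numbers so the checks can be exercised without a clock.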
As you can see, some of the checks above are repeated. This does not affect the final result; the likely reason is that the logic is complex, so every condition is re-verified before each decision to make sure none is missed.
When a node in the replica set receives an "elect me as primary" request from another node, it makes the following checks:
- If its own replica set configuration version is too low, it does not vote.
- If the replica set configuration version stored by the requesting node is too low, it votes no.
- If the requesting node is not a member of its replica set, it votes no.
- If a primary already exists in the replica set, it votes no.
- If an electable node with a higher priority than the requesting node exists, it votes no.
If all checks pass, it casts its number of votes (it also checks whether it has already voted within the last 30s; if it has, it does not vote again).
Note that a "no" vote reduces the final tally by 10000, so in most cases the requesting node cannot become primary as long as a single node objects.
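The voter-side rules and the effect of an objection on the tally can be sketched like this. It is an illustration only: the `Voter` class, `handle_request`, `wins`, and their parameters are hypothetical names, and config versions and priorities are passed in as plain values. The -10000 weight and the 30s limit are taken from the text.

```python
from dataclasses import dataclass

VETO = -10000          # weight of an objection, from the text
VOTE_LEASE_SECS = 30   # a node votes at most once per 30s, from the text


@dataclass
class Voter:
    """Hypothetical node answering an 'elect me as primary' request."""
    name: str
    config_version: int
    votes: int = 1
    last_voted_at: float = float("-inf")

    def handle_request(self, candidate, cand_config_version, cand_priority,
                       member_priorities, primary_exists, now):
        """Return this node's contribution to the candidate's tally.

        member_priorities maps the names of electable members known to this
        node to their priorities.
        """
        if self.config_version < cand_config_version:
            return 0        # own config too old: do not vote
        if cand_config_version < self.config_version:
            return VETO     # candidate's config too old
        if candidate not in member_priorities:
            return VETO     # candidate is not in this replica set
        if primary_exists:
            return VETO     # a primary already exists
        if any(p > cand_priority
               for n, p in member_priorities.items() if n != candidate):
            return VETO     # a higher-priority electable node exists
        if now - self.last_voted_at < VOTE_LEASE_SECS:
            return 0        # already voted within the last 30s
        self.last_voted_at = now
        return self.votes


def wins(tally, total_votes):
    """The candidate becomes primary only with strictly more than half."""
    return tally * 2 > total_votes
```

Because a single objection contributes -10000, summing the replies of a handful of members and passing the result to `wins` fails whenever any member objects, which matches the behaviour described above.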
The election process is complex, but for practical purposes it comes down to two points:
- Electing a primary generally takes about 5s.
- If the newly elected primary goes down immediately, re-electing a primary takes at least 30s.
Reposted from the MongoDB Chinese community: MongoDB election process