The internal mechanism of the MongoDB replica set (reference: lanceyan.com)


The following are some guiding questions about MongoDB's internal mechanisms:

    • During replica set failover, how is the primary node elected? Can a primary be stepped down by manual intervention?
    • The official documentation says the number of replica set members is best kept odd. Why?
    • How does a MongoDB replica set synchronize? What happens if synchronization lags behind? Will data become inconsistent?
    • Will MongoDB failover happen automatically for no apparent reason? What conditions trigger it? Can frequent failovers put a heavier load on the system?

The bully algorithm. MongoDB's replica set failover capability benefits from its election mechanism. The election mechanism uses the bully algorithm, which makes it easy to select a master node among distributed nodes. A distributed cluster architecture typically has a so-called master node, used for many purposes, such as caching machine-node metadata or acting as the gateway into the cluster. Having a master node sounds fine, so why do we need the bully algorithm? To understand this, look at these two architectures:

    1. An architecture with a designated master node: one node is declared the primary and the other nodes are slaves, as we commonly do with MySQL. But here we are talking about the whole cluster: if the master node goes down, a new master has to be brought online manually or the data recovered from a slave node, which is not very flexible.

    2. No primary node is designated; any node in the cluster can become the primary. MongoDB uses this architecture: once the primary goes down, another node is automatically promoted to primary in its place.

OK, here is the problem: since all nodes are equal, when the primary goes down, how do we decide which node becomes the next primary? This is the problem the bully algorithm solves.

What is the bully algorithm? The bully algorithm is a coordinator (master node) election algorithm. The main idea is that each member of the cluster can declare itself the primary node and notify the other nodes. The other nodes can either accept the claim or reject it and enter the election themselves. A node accepted by all other nodes becomes the primary. Nodes use some attribute to decide who should win; this can be a static ID, or an up-to-date metric such as the latest transaction ID (the node with the newest data wins). For details, refer to the coordinator-election section of "Distributed Algorithms in NoSQL Databases" and Wikipedia's explanation.
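To make the idea concrete, here is a minimal JavaScript sketch of the bully rule (illustrative only, not MongoDB's actual implementation; the node list and the numeric id attribute are made up for the example):

    // A candidate declares itself primary; any live node whose ordering
    // attribute (here a numeric id) is higher rejects the claim and campaigns
    // instead, so the highest surviving node always ends up winning.
    function bullyElection(nodes, candidateId) {
        var higher = nodes.filter(function (n) {
            return n.alive && n.id > candidateId;
        });
        if (higher.length === 0) {
            return candidateId;                 // nobody can "bully" this candidate
        }
        // a higher node takes over the campaign; continue with the best of them
        var next = Math.max.apply(null, higher.map(function (n) { return n.id; }));
        return bullyElection(nodes, next);
    }

    var nodes = [ { id: 0, alive: true }, { id: 1, alive: true }, { id: 2, alive: false } ];
    bullyElection(nodes, 0);   // returns 1: node 2 is down, node 1 outranks node 0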

How does MongoDB's election take place? The official description:

We use a consensus protocol to pick a primary. Exact details will be spared here but the basic process is:

  1. Get maxLocalOpOrdinal from each server.
  2. If a majority of servers is not up (from this server's POV), remain in Secondary mode and stop.
  3. If the last op time seems very old, stop and await human intervention.
  4. Else, using a consensus protocol, pick the server with the highest maxLocalOpOrdinal as the Primary.

Roughly translated: a consensus protocol is used to select the primary node. The basic steps are (a sketch of these checks follows the list):

    1. Get the last operation timestamp from each server node. Every mongod has an oplog that records the operations performed on that machine; it makes it easy to compare with the primary whether data is in sync, and it can also be used for error recovery.
    2. If a majority of the servers in the cluster are down, keep the surviving nodes in the secondary state and stop; no election takes place.
    3. If the last sync time of the elected primary, or of all the secondaries, looks very old, stop the election and wait for human intervention.
    4. If none of the above applies, elect the server node whose last operation timestamp is the most recent (guaranteeing the most up-to-date data) as the primary node.
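A plain-JavaScript sketch of these four checks (illustrative only, not MongoDB source; the member fields and the staleness threshold are made up for the example):

    function pickPrimary(members, totalMembers, now, maxStalenessMs) {
        var up = members.filter(function (m) { return m.reachable; });

        // step 2: without a majority of reachable servers, stay secondary
        if (up.length <= totalMembers / 2) {
            return null;                       // no election happens
        }

        // find the freshest last-operation timestamp among reachable members
        var candidate = up.reduce(function (best, m) {
            return m.lastOpTime > best.lastOpTime ? m : best;
        });

        // step 3: if even the freshest oplog entry looks very old, wait for a human
        if (now - candidate.lastOpTime > maxStalenessMs) {
            return null;                       // stop and await intervention
        }

        return candidate;                      // step 4: newest data wins
    }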

A consensus protocol (in fact, the bully algorithm) is mentioned here, and it differs somewhat from a database consistency protocol: a consensus protocol mainly emphasizes that there is some mechanism for the participants to reach agreement, while a consistency protocol emphasizes the sequential consistency of operations, for example whether reading and writing the same data at the same time produces dirty data. There is a classic algorithm in distributed protocols called the Paxos algorithm, to be introduced later.

One remaining question: what if the last operation times of all the secondaries are the same? Then whichever node starts the election first becomes the primary.

Election trigger conditions. An election is not triggered at just any moment; the following conditions can trigger one:

    1. When a replica set is initialized.
    2. When the replica set loses contact with the primary node, which may be a network problem.
    3. When the primary node goes down.

An election also has a precondition: the number of nodes participating in the election must be more than half the total number of nodes in the replica set. If it is less than half, all nodes remain read-only,
and the log will show:

can't see a majority of the set, relinquishing primary
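From the mongo shell you can count how many members this node currently sees as healthy and compare it against the required majority; a small sketch (the majority arithmetic is our own illustration, the fields come from rs.status()):

    // If "reachable" is not more than half of the configured members,
    // the primary relinquishes itself and the log line above appears.
    var s = rs.status();
    var healthy = s.members.filter(function (m) { return m.health === 1; }).length;
    var needed = Math.floor(s.members.length / 2) + 1;
    print("reachable: " + healthy + ", majority needed: " + needed);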

Can the primary node be stepped down by human intervention? The answer is yes.

  1. You can use the replSetStepDown command to step down the primary. Run it while logged in to the primary:

    db.adminCommand({replSetStepDown : 1})

    If the primary cannot be stepped down that way, you can add the force switch:

    db.adminCommand({replSetStepDown : 1, force : true})

    Using rs.stepDown(120) achieves the same effect; the number is how long, in seconds, the node is prevented from becoming primary again.

  2. Set a secondary node's priority higher than the primary's.
    First look at the priorities in the current cluster with the rs.conf() command. The default priority is 1 and is not displayed in the output, which is why it does not appear below.

    rs.conf()
    {
        "_id" : "rs0",
        "version" : 9,
        "members" : [
            {
                "_id" : 0,
                "host" : "192.168.1.136:27017"
            },
            {
                "_id" : 1,
                "host" : "192.168.1.137:27017"
            },
            {
                "_id" : 2,
                "host" : "192.168.1.138:27017"
            }
        ]
    }

    Let's set it up so that the host with _id 1 takes priority as the primary node.

    cfg = rs.conf()
    cfg.members[0].priority = 1
    cfg.members[1].priority = 2
    cfg.members[2].priority = 1
    rs.reconfig(cfg)

    Then execute the rs.conf() command again to see that the priority has been set successfully; a primary election will also be triggered.

    {
        "_id" : "rs0",
        "version" : 9,
        "members" : [
            {
                "_id" : 0,
                "host" : "192.168.1.136:27017"
            },
            {
                "_id" : 1,
                "host" : "192.168.1.137:27017",
                "priority" : 2
            },
            {
                "_id" : 2,
                "host" : "192.168.1.138:27017"
            }
        ]
    }

    What can I do if I don't want a node to become the primary node?
    a. Use rs.freeze(120) to freeze the node for the specified number of seconds, during which it cannot be elected primary.
    b. Set the node to a non-voting type, as described in the previous article.

  3. When the primary node cannot communicate with a majority of the other nodes. Unplug the primary's network cable, heh :)

    Priority can also be used here. Suppose we do not want to set up a hidden node, and instead want to use a secondary as a backup node without letting it become the primary. As in the figure: three nodes in total, spread across two data centers. The node in data center 2 has priority 0, so it cannot become the primary, but it can still vote in elections and replicate data. The architecture is still very flexible!
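    For example, assuming the data center 2 machine is the third member (members[2]) of the configuration shown earlier, a minimal sketch of making it a never-primary backup (the member index is an assumption; match it to your own rs.conf()):

    cfg = rs.conf()
    cfg.members[2].priority = 0    // still votes and replicates, but is never elected primary
    rs.reconfig(cfg)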

Odd number of members. The official recommendation is that a replica set have an odd number of members, with at most 12 replica set nodes and at most 7 nodes participating in elections. The limit of 12 nodes exists because there is no need for that many copies of the data: too many copies increase network load and slow the cluster down. The limit of 7 voting nodes exists because with too many voters the internal election mechanism cannot pick a primary within about a minute; everything in moderation. How the numbers 12 and 7 were settled can presumably be traced to the official performance tests; see the official documentation "MongoDB Limits and Thresholds" for the specific restrictions. But one thing was never obvious: why should the cluster size be odd, when a test cluster with an even number of nodes runs just fine (see http://www.itpub.net/thread-1740982-1-1.html)? Then a Stack Overflow post finally made it click: MongoDB is designed as a cross-IDC distributed database, so we should look at it in that larger setting.

Assume four nodes are split between two IDCs, two machines in each. There is a problem: if the network between the two IDCs breaks, which happens easily on a WAN, then, as mentioned in the election section above, a new election starts as soon as the primary loses contact with a majority of the cluster. But now each side of the replica set has only two nodes, and the number of nodes participating in an election must be more than half of the total, so neither side can elect a primary and every node stays read-only. With an odd number of nodes this problem does not arise: with 3 nodes, any 2 surviving nodes can hold an election; 3 of 5, 4 of 7, and so on.
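A small arithmetic sketch (plain JavaScript, purely illustrative) of why the even split hurts:

    // Majority needed for an election is floor(n/2) + 1 of ALL configured members.
    function majority(n) { return Math.floor(n / 2) + 1; }

    // 4 nodes split 2/2 across IDCs: each side sees 2 < majority(4) = 3 -> everyone read-only.
    // 3 nodes split 2/1:             the 2-node side >= majority(3) = 2 -> elects a primary.
    // 5 nodes split 3/2:             the 3-node side >= majority(5) = 3 -> elects a primary.
    [3, 4, 5, 7].forEach(function (n) {
        print(n + " members -> majority " + majority(n));
    });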

Heartbeat. In summary, the whole cluster needs to maintain a certain amount of communication to know which nodes are alive and which are down. A MongoDB node sends a ping packet to the other nodes in the replica set every two seconds; if another node does not respond within 10 seconds, it is marked as unreachable. Each node maintains an internal state map recording every node's current role, log timestamp, and other key information. If the node is the primary, then in addition to maintaining this map it also checks whether it can still communicate with a majority of the cluster's nodes; if it cannot, it demotes itself to a read-only secondary.
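A rough JavaScript sketch of the bookkeeping just described (our own illustration of the 2-second / 10-second rule, not MongoDB internals):

    var PING_INTERVAL_MS = 2000, UNREACHABLE_AFTER_MS = 10000;

    function onHeartbeatTick(stateMap, now, isPrimary, totalMembers) {
        // mark peers silent for more than 10 seconds as unreachable
        Object.keys(stateMap).forEach(function (host) {
            var peer = stateMap[host];
            peer.reachable = (now - peer.lastPongAt) <= UNREACHABLE_AFTER_MS;
        });
        if (isPrimary) {
            var reachable = Object.keys(stateMap).filter(function (h) {
                return stateMap[h].reachable;
            }).length + 1;                      // +1 counts this node itself
            if (reachable <= totalMembers / 2) {
                return "stepDown";              // demote to read-only secondary
            }
        }
        return "ok";
    }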

Synchronization. Replica set synchronization is divided into initial sync and "keep" (ongoing) replication. Initial sync means copying the full data set from the primary; if the primary holds a lot of data, this takes a long time. Keep replication is the real-time, generally incremental, synchronization between nodes after the initial sync. Initial sync is not triggered only the first time; there are two scenarios that trigger it:

    1. A secondary joins for the first time; this one is obvious.
    2. A secondary falls so far behind that the gap exceeds what the oplog holds, so a full copy is required.

What is the size of the oplog? As mentioned earlier, the oplog records the data operations; a secondary copies the oplog and replays the operations it contains on itself. But the oplog is itself a MongoDB collection, stored in local.oplog.rs. It is a capped collection, i.e. a fixed-size collection: when new data exceeds the collection's size, the oldest entries are overwritten. Therefore, it is important to note that for replication across IDCs you should set an appropriate oplog size to avoid frequent full resyncs in production. The oplog size can be set with --oplogSize; on 64-bit Linux and Windows it defaults to 5% of the free disk space.
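To check the current oplog size and the time span it covers, the standard shell helper can be used; the startup flag below is the option mentioned above, with a purely illustrative 10 GB value:

    // In the mongo shell: prints the configured oplog size and the time span
    // ("log length start to end") it currently covers.
    db.printReplicationInfo()

    // At mongod startup: give the oplog a fixed size in megabytes, e.g. 10 GB,
    // instead of the default 5% of free disk space (64-bit Linux/Windows).
    // mongod --replSet rs0 --oplogSize 10240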

Synchronization does not happen only from the primary node. Suppose the cluster has 3 nodes: node 1 is the primary in IDC1, and nodes 2 and 3 are in IDC2. During initialization, nodes 2 and 3 sync their data from node 1. Afterwards, nodes 2 and 3 follow a proximity rule and replicate from replica set members within their own IDC, so only one node needs to replicate data from node 1 in IDC1.
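To see which member a secondary is currently pulling from, rs.status() reports a sync source per member; field naming varies across MongoDB versions, so treat this as a hedged sketch:

    // Print each member and the node it is syncing from.
    rs.status().members.forEach(function (m) {
        // older servers report "syncingTo", newer ones "syncSourceHost"
        print(m.name + " <- " + (m.syncSourceHost || m.syncingTo || "(primary / none)"));
    });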

There are several things to note about setting up synchronization:

      1. A secondary does not replicate data from delayed or hidden members.
      2. For two members to sync from each other, their buildIndexes settings must match, whether both true or both false. buildIndexes controls whether the node builds indexes so its data can serve queries; it is true by default. (A configuration sketch follows this list.)
      3. If a sync source does not respond for 30 seconds, a new node is selected as the sync source.
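A hedged configuration example of a backup-style member with buildIndexes disabled (the host address is illustrative; note that MongoDB requires such a member to also have priority 0):

    // Add a non-index-building, never-primary member, e.g. for raw backups.
    rs.add({
        _id: 3,
        host: "192.168.1.139:27017",   // illustrative address
        priority: 0,                   // required when buildIndexes is false
        buildIndexes: false
    })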
