How mongodb clusters build replica sets

Source: Internet
Author: User
Tags: failover, mongodb

How is the master node elected during replica set failover? Can I manually intervene to step down a master node?

Officially, an odd number of replica set members is preferred. Why?

How does a mongodb replica set synchronize? What happens if synchronization lags? Will there be inconsistency?

Will mongodb failover happen automatically for no reason? What conditions trigger it? Could frequent triggering increase the system load?

The Bully algorithm. Mongodb's replica set failover capability comes from its election mechanism, which uses the Bully algorithm to elect a master node from among the distributed nodes. In a distributed cluster architecture there is usually a so-called master node, which serves many purposes, such as caching machine-node metadata and acting as the cluster's access portal. But if a master node exists anyway, why do we need the Bully algorithm? To understand this, let's first look at two architectures:

Architecture with a designated master node. This architecture designates one node as the master and the others as slaves, as mysql does. However, as mentioned earlier, if the master node of such a cluster fails, manual intervention is required to mount a new master node or to recover data from a slave node, which is not flexible.

Architecture with no designated master node, where any node in the cluster can become the master. Mongodb adopts this architecture: once the master node fails, a slave node automatically takes over as the master.

And here is the problem: since all nodes are equal, once the master node fails, how do we decide which node becomes the next master? This is exactly the problem the Bully algorithm solves.

What is the Bully algorithm? The Bully algorithm is a coordinator (master node) election algorithm. The main idea is that every member of the cluster can declare itself the master node and notify the other nodes; the other nodes can either accept the claim or reject it and compete for the master role themselves. Only a node accepted by all other nodes can become the master. Nodes decide who should win based on some attribute, which can be a static ID or a fresher metric such as the last transaction ID (the most up-to-date node wins). For more information, see the Coordinator Election section of the article on distributed algorithms in NoSQL databases, and Wikipedia.
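
To make the idea concrete, here is a minimal sketch of the Bully rule in plain JavaScript. It is only an illustration under my own assumptions, not mongodb's actual implementation: the node ids and the bullyElect helper are hypothetical, and a real implementation exchanges election/answer/coordinator messages over the network.

// A node challenges every live peer with a higher id; if no such peer exists,
// it declares itself coordinator, otherwise the highest live id is expected to win.
function bullyElect(selfId, aliveIds) {
    var higher = aliveIds.filter(function (id) { return id > selfId; });
    if (higher.length === 0) {
        return { coordinator: selfId };                       // nobody outranks us
    }
    return { coordinator: Math.max.apply(null, higher) };     // defer to the strongest
}
bullyElect(3, [1, 2, 3, 4]);   // old coordinator 5 is down -> { coordinator: 4 }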

How does mongodb run its election? The official description:

We use a consensus protocol to pick a primary. Exact details will be spared here but that basic process is:

Get maxLocalOpOrdinal from each server.
If a majority of servers are not up (from this server's POV), remain in Secondary mode and stop.
If the last op time seems very old, stop and await human intervention.
Else, using a consensus protocol, pick the server with the highest maxLocalOpOrdinal as the Primary.

Roughly translated: the master node is picked with a consensus protocol. The basic steps are as follows:

Obtain the last operation timestamp from each server node. Every mongodb server has an oplog mechanism that records its local operations, which makes it easy to compare how well each node is synchronized with the master server and is also useful for error recovery.

If a majority of the servers in the cluster are down (from the current node's point of view), the remaining live nodes stay in the secondary state and the election stops.

If the last synchronization time of the master node, or of all the slave nodes in the cluster, looks very old, stop the election and wait for human intervention.

If none of the above applies, select the server node with the latest operation timestamp (guaranteeing the most up-to-date data) as the master node.
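
From the shell you can eyeball the same information the election relies on. A minimal sketch, assuming the usual rs.status() output (field names such as optimeDate and stateStr may vary slightly between versions):

// print each member's state and last applied operation time
rs.status().members.forEach(function (m) {
    print(m.name + "  " + m.stateStr + "  last op: " + m.optimeDate);
});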

Here we mention a consensus protocol (in fact the Bully algorithm), which is somewhat different from a consistency protocol. A consensus protocol mainly emphasizes that agreement can be reached through some mechanism, while a consistency protocol emphasizes the consistency of operation ordering, for example whether dirty data can appear when the same piece of data is read and written at the same time. Among distributed protocols there is a classic algorithm called Paxos, which will be introduced later.

This raises a question: what if the last operation times of all slave nodes are identical? Then whichever node completes the election process first becomes the master node.

Election trigger conditions. An election is not triggered at arbitrary times; it is triggered under the following conditions:

Initialize a replica set.

The replica set loses contact with the master node, which may be caused by network problems.

The master node fails.

There is also a precondition for an election: the number of nodes able to take part must be more than half of the total number of nodes in the replica set. If it is less than half, all nodes remain read-only.

The following logs are displayed:

Can't see a majority of the set, relinquishing primary

Can manual intervention be performed to make a master node step down? The answer is yes.

You can use the replSetStepDown command to step down the master node. Log on to the master node and run db.adminCommand({replSetStepDown: 1}).

If the node cannot be stepped down that way, you can use the force option:

db.adminCommand({replSetStepDown: 1, force: true})

You can also use rs.stepDown(120) to achieve the same effect. The number is the period, in seconds, during which the stepped-down node cannot become the master node again.
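
A quick way to confirm which member took over afterwards (a minimal check, assuming the shell is still connected to the replica set):

rs.isMaster().primary   // host:port of the current primary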

Giving a slave node a higher priority than the master node.

First, check the priorities in the current cluster with the rs.conf() command. The default priority is 1, and default values are not displayed in the output.

rs.conf()
{
    "_id": "rs0",
    "version": 9,
    "members": [
        {
            "_id": 0,
            "host": "192.168.1.136:27017"
        },
        {
            "_id": 1,
            "host": "192.168.1.20:27017"
        },
        {
            "_id": 2,
            "host": "192.168.1.138:27017"
        }
    ]
}

Let's configure it so that the host with _id 1 becomes the master node preferentially.

cfg = rs.conf()
cfg.members[0].priority = 1
cfg.members[1].priority = 2
cfg.members[2].priority = 1
rs.reconfig(cfg)

Then run rs.conf() again to check whether the priority has been set successfully; the reconfiguration also triggers a master node election.

{
    "_id": "rs0",
    "version": 9,
    "members": [
        {
            "_id": 0,
            "host": "192.168.1.136:27017"
        },
        {
            "_id": 1,
            "host": "192.168.1.20:27017",
            "priority": 2
        },
        {
            "_id": 2,
            "host": "192.168.1.138:27017"
        }
    ]
}

What can I do if I don't want a Slave node to become the master node?

A. Use rs.freeze(120) to freeze the node for the specified number of seconds, during which it cannot be elected as the master node.

B. Set the node type to Non-Voting, as described in the previous article (a minimal sketch follows after this list).

Also, when the master node cannot communicate with most of the slave nodes, it steps down on its own. (Just unplug the master node's network cable :))
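
A minimal sketch of option B, assuming the member to change is at index 2 in the configuration (adjust for your own set); note that recent mongodb versions require a non-voting member to also have priority 0:

cfg = rs.conf()
cfg.members[2].votes = 0      // hypothetical index: the member that should not vote
cfg.members[2].priority = 0   // required alongside votes: 0 on recent versions
rs.reconfig(cfg)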

Priority can also be used for this. Suppose we do not want to set up any hidden node, but want a plain secondary to act purely as a backup and never become the master. What should we do? Consider three nodes distributed across two data centers: setting node 2's priority to 0 means it can never become the master node, yet it can still vote in elections and replicate data. The architecture is quite flexible!
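
A minimal sketch of marking such a backup member (the member index here is hypothetical; pick whichever member should never become primary):

cfg = rs.conf()
cfg.members[2].priority = 0   // node 2 becomes a pure backup: replicates and votes, never primary
rs.reconfig(cfg)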

Odd numbers. The officially recommended number of replica set members is odd. A replica set can contain at most 12 members, of which at most 7 can vote in an election. The limit of 12 exists because there is no need to keep that many copies of the data; too many backups increase network load and slow the cluster down. The limit of 7 voters exists because with too many participants the internal election mechanism might not pick a master node within a minute; everything should stay within reason. The numbers 12 and 7 are reasonable limits and can be understood from the official performance tests; for the specific restrictions see MongoDB Limits and Thresholds. However, I never quite figured out why the whole cluster should have an odd number of members; you can verify by testing that a cluster with an even number also runs, see http://www.itpub.net/thread-1740982-1-1.html. Later I came across an article on stackoverflow and finally understood: mongodb is designed as a distributed database that can span IDCs, so we should consider it in that larger context.

Suppose four nodes are split between two IDCs, two machines in each. Now a problem appears: if the link between the two IDCs is cut, which can easily happen over a wide area network, then, as described in the election section above, as soon as the master node loses contact with the majority of the cluster's nodes a new round of election begins. But each side now has only two of the replica set's nodes, and since the number of nodes taking part in an election must be more than half of the total, neither side can elect a master and the whole cluster becomes read-only. With an odd number of nodes this problem does not arise: with three nodes, as long as two are alive an election can be held; three out of five; four out of seven; and so on.
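
The arithmetic behind this is simply the majority rule; a tiny illustrative sketch (the majority helper is my own, not a mongodb function):

// majority needed to elect a primary is floor(n/2) + 1
function majority(n) { return Math.floor(n / 2) + 1; }
majority(4);   // 3 -> a 2/2 split across IDCs leaves neither side able to elect
majority(5);   // 3 -> a 3-node partition can still elect a primary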

Heartbeats. In summary, the whole cluster needs to keep a certain amount of communication going so that it knows which nodes are alive and which are down. A mongodb node sends a ping packet to every other node in the replica set every two seconds; if another node does not respond within 10 seconds, it is marked as unreachable. Each node maintains a state map recording each node's role, log timestamp, and other key information. The master node additionally checks whether it can still communicate with the majority of the cluster's nodes; if not, it demotes itself to a read-only secondary.
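
The result of those heartbeats is visible in the shell. A minimal sketch assuming the standard rs.status() output (health and lastHeartbeat appear for remote members; the local member has no lastHeartbeat):

rs.status().members.forEach(function (m) {
    print(m.name + "  health: " + m.health + "  lastHeartbeat: " + m.lastHeartbeat);
});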

Synchronization. Replica set synchronization is divided into initial sync and ongoing ("keep") replication. Initial sync means a full copy of the data from the master node; if the master holds a lot of data, the sync takes a relatively long time. Keep replication means that after the initial sync, nodes synchronize in real time, generally incrementally. Initial sync is not performed only the first time; it is triggered in either of these two cases:

A secondary joins the replica set for the first time.

The secondary falls behind by more than the oplog size, which also forces a full copy.

What is the oplog size? As mentioned above, the oplog stores data operation records, and a secondary copies the oplog and replays its operations locally. The oplog itself is an ordinary mongodb collection stored in local.oplog.rs, but it is a capped collection, that is, a fixed-size collection: once new data exceeds the collection's size, old entries are overwritten. It is therefore worth noting that for cross-IDC replication you should set the oplog size appropriately to avoid frequent full re-syncs in production. The oplog size can be set with --oplogSize. On 64-bit linux and windows, the oplog defaults to 5% of the remaining disk space.
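
To check the configured oplog size and the time window it currently covers on a running member, a minimal sketch from the shell:

// prints the configured oplog size, used space, and the time range it covers
db.printReplicationInfo()
// or read it programmatically
db.getReplicationInfo().logSizeMB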

Synchronization does not only happen from the master node. Suppose the cluster has three nodes: node 1 is the master in IDC1, and node 2 and node 3 are in IDC2. Node 2 and node 3 perform their initial sync from node 1, but afterwards they follow a proximity rule and replicate from a replica set member within their own IDC, so only one node has to pull data from node 1 over in IDC1.
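
To see which member each node is currently pulling from, a minimal sketch (the field is called syncingTo in older shells and syncSourceHost in newer ones, so treat the exact name as something to verify on your version):

rs.status().members.forEach(function (m) {
    print(m.name + " syncs from: " + (m.syncSourceHost || m.syncingTo || "(primary or none)"));
});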

Pay attention to the following points when setting synchronization:

Secondary does not copy data from delayed and hidden members.

For two members to sync from each other, their buildIndexes settings must be the same, whether both true or both false. buildIndexes mainly controls whether the node builds indexes so its data can be used for queries; the default is true. A configuration sketch follows after this list.

If a synchronization source does not respond for 30 seconds, the node re-selects another member to sync from.
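
A minimal sketch of adding a pure backup member that never builds indexes (the host address is hypothetical; buildIndexes: false has to be set when the member is added and requires priority 0):

rs.add({ _id: 3, host: "192.168.1.139:27017", priority: 0, buildIndexes: false })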

At this point, all the questions raised at the beginning of this article have been answered. I have to say, the mongodb design really is impressive!
