Yan Lan: Building a Highly Available MongoDB Cluster (Part 2)


MongoDB, Inc. (formerly 10gen) was founded in 2007. After raising a total of $231 million by 2013, the company's valuation reached the $1 billion level, matching what the well-known open-source company Red Hat (founded in 1993) took 20 years to achieve.

High performance and easy scalability have been the cornerstones of MongoDB, and its document model and standardized interfaces have made it popular with users. This is evident in the DB-Engines rankings: in just one year MongoDB rose from seventh place to fifth, its score climbing from 124 to 214 points, roughly twice the gain of fourth-place PostgreSQL; it now trails PostgreSQL by only 16 points.

Much of MongoDB's rapid growth comes from the fact that many traditional relational databases, however proven, performant, and stable, can no longer handle the scalability demands of today's data processing. At the same time, many earlier NoSQL systems came with their own limitations, which made them hard to adopt. Here we share Yan Lan's blog post on how to build an efficient MongoDB cluster.

We previously shared the first part of this series; here is the second part, a deep dive into the replica set's internal mechanisms and into sharding.

The following is the original post:

A deep dive into replica set internals

The first part of this series described how to configure a replica set; this part digs into the replica set's internal mechanisms. Let's start with some questions about replica sets:

During replica set failover, how is the primary elected? Can a particular node be made primary through manual intervention?

The official recommendation is an odd number of replica set members. Why?

How does a MongoDB replica set synchronize data? What happens if synchronization lags? Will the data become inconsistent?

Does MongoDB failover ever happen automatically for no apparent reason? What conditions trigger it? Could frequent triggering aggravate the system load?

Bully algorithm

MongoDB replica set failover relies on its election mechanism, which uses the Bully algorithm to easily choose a primary node from among distributed nodes. A distributed cluster architecture generally has a so-called primary (master) node, used for many purposes, such as caching cluster metadata and serving as the entry point for cluster access. Since a primary node exists anyway, why do we need the Bully algorithm? To understand this, first consider two architectures:

Designated-primary architecture. One node is declared the primary and the others are secondaries, as in the common MySQL setup. However, as we said in the first part, if the primary in such a cluster goes down, you have to intervene manually, promote a new primary, or recover data from a secondary; this is not very flexible.

No designated primary: any node in the cluster can become the primary. MongoDB uses this architecture; when the primary goes down, one of the secondaries automatically becomes the new primary.

Here is the crux: since all the nodes are equal, when the primary goes down, how is the next primary determined? That is the problem the Bully algorithm solves.

What is the Bully algorithm? It is a coordinator (primary) election algorithm. The main idea is that every member of the cluster can declare itself the primary and notify the other nodes; each node may accept this claim or reject it and enter the competition itself. Only a node accepted by all other nodes becomes the primary. Nodes decide who should win according to some attribute, which can be a static ID or an updated metric such as the last transaction ID (the node with the newest one wins).
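The idea can be sketched in a few lines of JavaScript. This is a toy model, not MongoDB's implementation; the node structure and identifiers are illustrative:

```javascript
// Toy Bully election: each node has a numeric id; the highest id wins.
// A node that notices the coordinator is down starts an election by
// "challenging" every live node with a higher id. Any such node bullies
// the starter out of the race, and the process repeats until the
// highest live id remains and declares itself coordinator.
function bullyElect(nodes, starterId) {
  const alive = nodes.filter(n => n.alive).map(n => n.id);
  if (!alive.includes(starterId)) {
    throw new Error('starter must be alive');
  }
  return Math.max(...alive);
}

const nodes = [
  { id: 1, alive: true },
  { id: 2, alive: true },
  { id: 3, alive: false }, // old coordinator, now down
];
console.log(bullyElect(nodes, 1)); // prints 2: highest id still alive
```

The "attribute" deciding the winner here is a static ID; as the paragraph above notes, it could just as well be a dynamic metric such as the last transaction timestamp.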

Elections

So how does MongoDB hold an election? The official description:

We use a consensus protocol to pick a primary. Exact details will be spared here but that basic process is:

get maxLocalOpOrdinal from each server.

if a majority of servers are not up (from this server's POV), remain in Secondary mode and stop.

if the last op time seems very old, stop and await human intervention.

else, using a consensus protocol, pick the server with the highest maxLocalOpOrdinal as the Primary.

Roughly translated: a consensus protocol is used to pick the primary. The basic steps are:

Get the last operation timestamp from each server node. Every MongoDB node has an oplog mechanism that records local operations, used both to synchronize data with the primary and for error recovery.

If a majority of the servers in the cluster are down (from this server's point of view), the surviving nodes stay in the secondary state and stop; no election takes place.

If the last operation time of the elected primary, or of all the secondaries, looks very old, stop the election and wait for human intervention.

If none of the above applies, elect the server node with the most recent last-operation timestamp (guaranteeing it has the latest data) as the primary.
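The four steps above can be sketched as follows. This is an illustrative model rather than MongoDB's source code; the field names (`up`, `lastOpTime`) and the staleness threshold are assumptions:

```javascript
// Illustrative model of the election steps described above.
// Each node reports whether it is up and the timestamp of its last
// oplog entry (maxLocalOpOrdinal in the official description).
function pickPrimary(nodes, staleThresholdMs) {
  const up = nodes.filter(n => n.up);
  // Step 2: without a strict majority, stay secondary and stop.
  if (up.length <= nodes.length / 2) return { state: 'secondary' };
  const newest = up.reduce((a, b) => (b.lastOpTime > a.lastOpTime ? b : a));
  // Step 3: if even the newest op time looks very old, wait for a human.
  if (Date.now() - newest.lastOpTime > staleThresholdMs) {
    return { state: 'await-intervention' };
  }
  // Step 4: the node with the latest last-op timestamp becomes primary.
  return { state: 'primary', id: newest.id };
}

const now = Date.now();
const nodes = [
  { id: 'a', up: true, lastOpTime: now - 5000 },
  { id: 'b', up: true, lastOpTime: now - 1000 }, // most recent oplog entry
  { id: 'c', up: false, lastOpTime: now - 9000 },
];
console.log(pickPrimary(nodes, 60000)); // { state: 'primary', id: 'b' }
```

Note how the majority check comes before any comparison of timestamps: a partitioned minority never elects, which matters for the odd-number discussion later.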

The "consensus protocol" here (in fact, the Bully algorithm) differs somewhat from a consistency protocol. A consensus protocol focuses on getting all parties to reach agreement through certain mechanisms, while a consistency protocol emphasizes the ordering of operations, for example ensuring that reads and writes of the same data never see dirty data. A classic consensus algorithm in distributed systems is Paxos, which we will introduce later.

One remaining question: what if all the secondaries have the same last-operation timestamp? Then whichever node starts the election fastest becomes the primary.

Election trigger conditions

An election is not triggered at arbitrary times; it happens in the following situations:

When a replica set is initialized.

When a secondary loses contact with the primary, possibly because of a network problem.

When the primary goes down.

There is also a prerequisite for the election: the number of nodes participating must be greater than half the total number of nodes in the replica set. If half or fewer remain, all nodes stay read-only, and the log shows:

can not see a majority of the set, relinquishing primary

1. Can the primary be chosen through human intervention? The answer is yes.

The replSetStepDown command steps the primary down. Log in to the primary and run:

db.adminCommand({replSetStepDown: 1})

If it cannot step down cleanly, force the switch:

db.adminCommand({replSetStepDown: 1, force: true})

Alternatively, rs.stepDown(120) achieves the same effect; the number is the period, in seconds, during which the node cannot serve as primary again.

2. Set a secondary's priority higher than the primary's.

First check the current priorities in the cluster with the rs.conf() command; the default priority of 1 is not displayed in the output.
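For example, assuming a three-member set, a secondary's priority can be raised through a reconfiguration in the mongo shell; the member index and priority value here are illustrative:

```javascript
// Run in the mongo shell on the primary.
cfg = rs.conf()              // fetch the current configuration
cfg.members[1].priority = 2  // default priority is 1 and is not displayed
rs.reconfig(cfg)             // apply; member 1 is now preferred in elections
```

After the reconfiguration, the higher-priority member will stand for election at the next opportunity.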

What if you do not want a secondary ever to become the primary?

Use rs.freeze(120) to freeze it for the specified number of seconds, during which it cannot be elected primary.

Or set the node to the non-voting type, as described in the previous article.

3. Make the primary unable to communicate with a majority of the secondaries, and it will step down. Unplugging the primary's network cable does the trick, heh :)

Priority can also be used this way. Suppose we do not want to set up any hidden nodes, but want a secondary-type node purely as a backup that should never become primary. How? Consider three nodes distributed across two data centers: the node in data center 2 is given priority 0, so it cannot be elected primary but can still vote in elections and replicate data. The architecture remains very flexible!

Odd numbers

The officially recommended number of replica set members is odd, with at most 12 replica set nodes and at most 7 nodes participating in an election. Why at most 12? Because there is no need for that many copies of the data; too many backups only increase network load and slow cluster performance. Why at most 7 voters? Because with too many voting nodes, the internal election mechanism may fail to choose a primary even within a minute; everything should be kept appropriately sized. The figures "12" and "7" come from the official performance testing; for the specific restrictions, refer to the official document "MongoDB Limits and Thresholds". But this still does not explain why the cluster size should be odd; testing shows clusters with an even number of nodes also operate (see http://www.itpub.net/thread-1740982-1-1.html). Then an article on Stack Overflow brought the epiphany: MongoDB is designed as a distributed database that can span IDCs, so it should be considered in that larger environment.

Suppose four nodes are split between two IDCs, two machines in each. Now a problem arises: if the network between the two IDCs breaks, which happens easily over a WAN, then, as noted in the election section above, an election requires more than half of the replica set's nodes, but each side holds only two of the four. Neither side can elect a primary, and the whole cluster becomes read-only. With an odd number of nodes, this problem does not occur: with three nodes, an election succeeds as long as two are alive; likewise three out of five, four out of seven, and so on.
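The arithmetic above can be checked with a quick sketch: with an even 2/2 split, neither side of a partition holds a strict majority, while an odd-sized set always leaves one side able to elect.

```javascript
// Can a partition holding `aliveOnThisSide` nodes, out of `total` replica
// set members, elect a primary? It needs a strict majority of the full set.
function canElect(aliveOnThisSide, total) {
  return aliveOnThisSide > total / 2;
}

// 4 nodes split 2/2 across two IDCs: neither side can elect.
console.log(canElect(2, 4)); // false
// 3 nodes split 2/1: the side with 2 nodes can still elect.
console.log(canElect(2, 3)); // true
// Three of five, four of seven, as noted above:
console.log(canElect(3, 5), canElect(4, 7)); // true true
```

The strict inequality is the key: exactly half is not a majority, which is why an even split deadlocks.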

Heartbeat

In summary, the cluster must maintain some communication to know which nodes are alive and which have gone down. A MongoDB node sends a ping packet to every other node in the replica set every two seconds; if a node does not respond within 10 seconds, it is marked inaccessible. Each node maintains an internal state map recording each node's current role, log timestamp, and other key information. The primary additionally checks whether it can communicate with a majority of the cluster's nodes; if not, it demotes itself to a read-only secondary.
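The heartbeat rule can be sketched as a state map. The 2-second interval and 10-second timeout come from the paragraph above; the data structure and names are illustrative:

```javascript
// Each node keeps a map of peer -> time of last successful ping reply.
// A peer is considered unreachable once it has been silent for TIMEOUT_MS.
const PING_INTERVAL_MS = 2000; // ping every 2 seconds
const TIMEOUT_MS = 10000;      // unreachable after 10 seconds of silence

function peerStates(lastReply, now) {
  const states = {};
  for (const [peer, t] of Object.entries(lastReply)) {
    states[peer] = now - t <= TIMEOUT_MS ? 'reachable' : 'unreachable';
  }
  return states;
}

const now = 100000;
const lastReply = { nodeB: now - 4000, nodeC: now - 15000 };
console.log(peerStates(lastReply, now));
// { nodeB: 'reachable', nodeC: 'unreachable' }
```

A primary would run the same check over its map and step down when fewer than a majority of peers are reachable.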

Synchronization

Replica set synchronization comes in two kinds: initial sync and keep replication. Initial sync is a full copy of the data from the primary; if the primary holds a large amount of data, it can take a long time. Keep replication is the real-time, generally incremental synchronization between nodes that follows the initial sync. Initial sync is not performed only the first time; it is triggered in two cases:

The first time a node joins the set; that much is certain.

When a secondary falls behind by more than the size of the oplog, a full copy is also performed.

What is the oplog size? As mentioned, the oplog records the operations performed on the data; a secondary copies the oplog and replays the operations it contains. The oplog is itself a MongoDB collection, stored in local.oplog.rs. However, it is a capped collection, that is, a fixed-size collection: once new entries exceed the collection's size, the oldest are overwritten. So when replicating across IDCs, take care to set an appropriate oplogSize to avoid frequent full copies in production. The size can be set with --oplogSize; on 64-bit Linux and Windows the oplog defaults to 5% of the remaining disk space.
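The capped-collection behavior can be pictured as a fixed-size buffer: once full, new entries evict the oldest, which is exactly why a secondary that falls further behind than the oplog window must do a full resync. The class below is a toy model, not MongoDB's storage format:

```javascript
// Toy model of a capped oplog: fixed capacity, oldest entries evicted.
class CappedOplog {
  constructor(capacity) {
    this.capacity = capacity;
    this.entries = [];
  }
  append(op) {
    this.entries.push(op);
    if (this.entries.length > this.capacity) this.entries.shift(); // evict oldest
  }
  // A secondary whose last applied entry has timestamp `ts` can catch up
  // incrementally only if that entry is still inside the oplog window.
  canCatchUp(ts) {
    return this.entries.length === 0 || this.entries[0].ts <= ts;
  }
}

const oplog = new CappedOplog(3);
for (let ts = 1; ts <= 5; ts++) oplog.append({ ts, op: 'insert' });
// The window now holds ts 3..5; a secondary stuck at ts 1 needs a full resync.
console.log(oplog.canCatchUp(1)); // false
console.log(oplog.canCatchUp(4)); // true
```

A larger oplogSize widens the window, buying slow cross-IDC links more time before a full copy becomes necessary.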

Synchronization does not happen only from the primary. Suppose the cluster has three nodes: node 1 is the primary in IDC1, while nodes 2 and 3 are in IDC2. When nodes 2 and 3 are initialized, they sync their data from node 1. Afterwards, nodes 2 and 3 follow the proximity principle and replicate from within their own IDC, as long as one node there replicates data from node 1 in IDC1.

Note the following points about synchronization:

A secondary does not copy data from delayed or hidden members.

For two members to sync from each other, their buildIndexes settings must match, whether both true or both false. buildIndexes controls whether the node builds indexes on its data for queries; the default is true.

If a sync source does not respond for 30 seconds, another node is selected for synchronization.

At this point, all the questions raised at the start of this chapter have been answered. I have to say, MongoDB's design really is powerful!

In the follow-up, we will continue with these questions from the previous part:

When the primary goes down, do client connections switch over automatically? Currently they must be switched manually.

How does the primary cope with excessive read and write pressure?

Two further issues will be resolved later (see the next part):

Every secondary holds a full copy of the data; won't the pressure on the secondaries become too great?

When the data pressure exceeds what the machines can support, can the cluster scale out automatically?
