Build highly available MongoDB clusters

Source: Internet
Author: User

MongoDB, Inc., formerly known as 10gen, was founded in 2007. After receiving US$231 million in financing in 2013, the company's valuation rose to the US$1 billion level, a height that the well-known open source company Red Hat (founded in 1993) needed 20 years of effort to reach.

High performance and easy scalability have always been MongoDB's foundation, while its well-specified documents and interfaces have made it even more popular with users. The DB-Engines ranking makes this easy to see: in just one year MongoDB climbed from 7th to 5th place and its score rose from 124 to 214 points, an increase twice that of fourth-placed PostgreSQL, leaving it only 16 points behind PostgreSQL.

MongoDB has grown this fast largely because many traditional relational databases cannot cope with the expanding demands of today's data processing, even though they are time-tested and offer good performance and stability. However, unlike those earlier systems, many NoSQL databases have limitations of their own, which makes them hard to get started with. Here we share a blog post by Yan Lan (严澜): how to build an efficient MongoDB cluster.

The blog post follows:

A deep dive into the internal mechanism of replica sets

The first part of this series described how to configure a replica set; this part delves into the replica set's internal mechanism. Let's start with the questions that replica sets raise:

When a replica set fails over, how is the primary node elected? Can we intervene manually and take a primary node down?

The official documentation says the number of replica set members is best kept odd. Why?

How does a MongoDB replica set synchronize data? What happens if synchronization falls behind? Can the data become inconsistent?

Does MongoDB failover happen automatically? What conditions trigger it? Can frequent triggering increase the system load?

Bully algorithm

The failover capability of a MongoDB replica set comes from its election mechanism. The election mechanism uses the bully algorithm, which makes it easy to elect a master node from among distributed nodes. A distributed cluster architecture generally has a so-called master node, which can serve many purposes, such as caching the metadata of the machine nodes or acting as the access entry point of the cluster. If there is already a master node, why do we need the bully algorithm? To understand this, let's look at two architectures:

An architecture with a designated master node. One node is declared the master and the other nodes are slaves, as in the MySQL setups we often use. But as we said in the first part of this series, if the master node of such a cluster goes down, someone has to step in manually to bring up a new master node or recover data from a slave, which is not very flexible.

An architecture without a designated master node, where any node in the cluster can become the master. MongoDB uses this architecture: once the primary node goes down, one of the secondary nodes automatically takes over as primary. See the following figure:

Here is the question: since all nodes are equal, once the primary node goes down, how is the next primary chosen? This is the problem the bully algorithm solves.

So what is the bully algorithm? It is a coordinator (master node) election algorithm. The main idea is that every member of the cluster can declare itself the master node and notify the other nodes. The other nodes can either accept the claim or reject it and enter the competition themselves. Only a node accepted by all other nodes becomes the master. Nodes decide who should win according to some attribute, which can be a static ID or a more dynamic metric such as the most recent transaction ID (the most up-to-date node wins). For details, refer to the coordinator election section of the article on distributed algorithms in NoSQL databases and to the explanation on Wikipedia.
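The idea can be illustrated with a minimal JavaScript sketch. It is not MongoDB's implementation: the function name bullyElect and the id and alive fields are made up for illustration, and the deciding attribute is assumed to be a static ID.

```javascript
// Bully-style election sketch. Names here are illustrative, not MongoDB code.
// Every live node may claim leadership; a live node with a higher id rejects
// the claim and competes itself, so the highest live id wins.
function bullyElect(nodes) {
  const alive = nodes.filter((n) => n.alive);
  if (alive.length === 0) return null; // nobody left to elect
  let winner = alive[0];
  for (const candidate of alive) {
    if (candidate.id > winner.id) winner = candidate; // higher id takes over
  }
  return winner.id;
}

const cluster = [
  { id: 1, alive: true },
  { id: 2, alive: true },
  { id: 3, alive: false }, // the previous master has failed
];
console.log(bullyElect(cluster)); // 2
```

Swapping the static id for a freshness metric such as a last transaction ID would give the "newest node wins" variant mentioned in the text.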


So how does MongoDB conduct the election? The official description is:

We use a consensus protocol to pick a primary. Exact details will be spared here but the basic process is:

get maxLocalOpOrdinal from each server.

if a majority of servers are not up (from this server's POV), remain in Secondary mode and stop.

if the last op time seems very old, stop and await human intervention.

else, using a consensus protocol, pick the server with the highest maxLocalOpOrdinal as the Primary.

Roughly translated: a consensus protocol is used to select the primary node. The basic steps are:

Get the timestamp of the last operation from each server node. Every MongoDB instance has an oplog mechanism that records its operations, which makes it easy to check whether its data is in sync with the primary and can also be used for error recovery.

If a majority of the servers in the cluster are down, the surviving nodes stay in the secondary state and stop; no election is held.

If the last sync time of the elected primary, or of all the secondary nodes, looks very old, stop the election and wait for human intervention.

If none of the above problems apply, select the server node with the latest last-operation timestamp (which guarantees its data is the most recent) as the primary node.
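The four steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not MongoDB's election code: the function pickPrimary, the fields name, up and optime, and the staleAfterSecs threshold for "looks very old" are all hypothetical.

```javascript
// Election decision sketch following the four steps above.
// Field names (name, up, optime) and staleAfterSecs are assumptions.
function pickPrimary(members, nowSecs, staleAfterSecs) {
  const up = members.filter((m) => m.up);
  // Step 2: without a strict majority in view, stay secondary and stop.
  if (up.length * 2 <= members.length) {
    return { primary: null, reason: "no majority" };
  }
  // Step 1: compare the last operation time of every reachable server.
  const freshest = up.reduce((a, b) => (b.optime > a.optime ? b : a));
  // Step 3: if even the freshest data looks very old, wait for a human.
  if (nowSecs - freshest.optime > staleAfterSecs) {
    return { primary: null, reason: "stale" };
  }
  // Step 4: the server with the latest operation becomes primary.
  return { primary: freshest.name, reason: "elected" };
}

const members = [
  { name: "a", up: false, optime: 1000 }, // the failed primary
  { name: "b", up: true, optime: 990 },
  { name: "c", up: true, optime: 998 },
];
console.log(pickPrimary(members, 1005, 600)); // { primary: 'c', reason: 'elected' }
```

In this sketch a tie on optime simply keeps the member listed first; it does not model which node reacts fastest.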

The consensus protocol here (in fact the bully algorithm) is somewhat different from a database consistency protocol. A consensus protocol focuses on mechanisms that make all participants agree, whereas a consistency protocol emphasizes the sequential consistency of operations, for example whether dirty data can appear when the same piece of data is read and written at the same time. A classic consensus algorithm in distributed systems is the Paxos algorithm, which will be introduced later.

One question remains: what if the last operation time of all the secondaries is the same? Then whichever node is fastest to become primary is chosen.

Election trigger conditions

An election is not triggered at just any time. The following situations can trigger one:

When a replica set is initialized.

When the replica set loses its connection to the primary node, possibly because of a network problem.

When the primary node goes down.

An election also has a precondition: the number of nodes taking part must be greater than half of the total number of nodes in the replica set. If the surviving nodes are no longer a majority, all nodes stay read-only. The log will then show:

can't see a majority of the set, relinquishing primary

1. Can the primary node be taken down by manual intervention? The answer is yes.

You can use the replSetStepDown command to step the primary down. Log in to the primary node and run:

db.adminCommand({replSetStepDown: 1})

If it refuses to step down, you can force the switch:

db.adminCommand({replSetStepDown: 1, force: true})

Alternatively, rs.stepDown(120) achieves the same effect. The number is the period, in seconds, during which the stepped-down node cannot become primary again.

2. Give a secondary node a higher priority than the primary node.

First check the priorities in the current cluster with the rs.conf() command. Note that a default priority of 1 is not displayed in the output.

rs.conf();

{
    "_id": "rs0",
    "version": 9,
    "members": [
        { "_id": 0, "host": "" },
        { "_id": 1, "host": "" },
        { "_id": 2, "host": "" }
    ]
}



What if you don't want a particular secondary node to become primary?

Use rs.freeze(120) to freeze the node so that it cannot be elected primary for the given number of seconds.

Set the node to the non-voting type, as described in the previous article.

An election is also triggered when the primary node cannot communicate with a majority of the nodes. Just unplug the primary's network cable, hehe:

Priority can also be used in another way. What if we don't want to set up a hidden node, but want to use an ordinary secondary as a backup node that never becomes primary? Look at the figure below: three nodes are distributed across two data centers, and the node in data center 2 has its priority set to 0, so it cannot become primary, but it can still take part in elections and data replication. The architecture is quite flexible!
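As a rough illustration of the priority-0 setup just described, the sketch below changes one member's priority in a plain configuration object. The helper withPriority and the host names are hypothetical placeholders; in the mongo shell the configuration would normally come from rs.conf() and be applied with rs.reconfig().

```javascript
// Returns a copy of a replica set config with one member's priority changed.
function withPriority(cfg, memberId, priority) {
  return {
    ...cfg,
    members: cfg.members.map((m) =>
      m._id === memberId ? { ...m, priority } : { ...m }
    ),
  };
}

// Hypothetical three-member config; the host names are placeholders.
const cfg = {
  _id: "rs0",
  version: 9,
  members: [
    { _id: 0, host: "dc1-a.example:27017" },
    { _id: 1, host: "dc1-b.example:27017" },
    { _id: 2, host: "dc2-a.example:27017" },
  ],
};

// The data center 2 node replicates and votes but never becomes primary.
const updated = withPriority(cfg, 2, 0);
console.log(updated.members[2].priority); // 0

// In the mongo shell the equivalent change would be:
//   cfg = rs.conf(); cfg.members[2].priority = 0; rs.reconfig(cfg);
```

Giving a member a priority above the default of 1 works the same way and makes it the preferred candidate in elections.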



The official recommendation is that a replica set has an odd number of members, with at most 12 members in total and at most 7 members taking part in elections. The limit of 12 members exists because there is no need to keep that many copies of the data: too many backups increase the network load and slow the cluster down. The limit of 7 voting members exists because with too many nodes the internal election mechanism may fail to elect a primary within 1 minute; moderation is best. The numbers 12 and 7 are acceptable as they are and can be understood as the result of official performance testing. For the specific limits, refer to the official document "MongoDB Limits and Thresholds". What I could not understand was why the whole cluster has to be odd, since tests show that a cluster with an even number of members also runs. Then I came across a StackOverflow article and finally had an epiphany: MongoDB is designed as a distributed database that can span IDCs, so we should look at it in that larger environment.

Suppose four nodes are split across two IDCs, two machines in each, as shown below. This raises a problem. If the network between the two IDCs is cut, which happens easily on a wide area network, then, as mentioned in the election section above, a new election starts as soon as the primary loses contact with a majority of the nodes in the cluster. But each side of the MongoDB replica set has only two nodes, while an election requires more than half of the nodes to take part, so no node in the cluster can take part in an election and all of them stay read-only. With an odd number of nodes this problem does not arise: with 3 nodes, an election can go ahead as long as 2 are alive; with 5 nodes, 3; with 7 nodes, 4; and so on.
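The arithmetic behind the odd-number recommendation can be checked with a few lines of JavaScript. The function names are made up for this sketch; the only rule used is the one stated above, that more than half of the members must be able to take part.

```javascript
// Votes needed for a majority, and how many members may fail.
function majority(n) {
  return Math.floor(n / 2) + 1;
}
function faultTolerance(n) {
  return n - majority(n);
}
// Can one side of a network split still hold an election?
function sideCanElect(sideSize, total) {
  return sideSize >= majority(total);
}

for (const n of [3, 4, 5, 7]) {
  console.log(`members=${n} majority=${majority(n)} tolerated failures=${faultTolerance(n)}`);
}
console.log(sideCanElect(2, 4)); // false: a 2/2 split leaves both sides read-only
console.log(sideCanElect(2, 3)); // true: 2 of 3 can still elect a primary
```

Four members tolerate only one failure, the same as three, so the extra even member adds cost without adding availability.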



To sum up, the whole cluster needs to keep up a certain amount of communication to know which nodes are alive and which are down. Each MongoDB node sends a ping to the other nodes in the replica set every two seconds, and a node that does not reply within 10 seconds is marked as unreachable. Every node maintains an internal state map recording each node's current role, oplog timestamp, and other key information. The primary node, besides maintaining this map, also checks whether it can still communicate with a majority of the nodes in the cluster; if it cannot, it demotes itself to a read-only secondary.
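The heartbeat bookkeeping described here might look roughly like the sketch below. The class StatusMap and its methods are invented for illustration and only track reply times; the two constants are the 2-second and 10-second values from the text.

```javascript
// Heartbeat bookkeeping sketch. Names are illustrative; times are in ms.
const HEARTBEAT_INTERVAL_MS = 2000; // a ping goes out every two seconds
const UNREACHABLE_AFTER_MS = 10000; // no reply for 10 seconds: unreachable

class StatusMap {
  constructor(peerNames) {
    // last time each peer answered a ping
    this.lastReply = new Map(peerNames.map((name) => [name, 0]));
  }
  recordReply(name, now) {
    this.lastReply.set(name, now);
  }
  reachablePeers(now) {
    return [...this.lastReply]
      .filter(([, t]) => now - t < UNREACHABLE_AFTER_MS)
      .map(([name]) => name);
  }
  // A primary counts itself and demotes itself without a majority in view.
  primaryShouldStepDown(now) {
    const total = this.lastReply.size + 1;
    const visible = this.reachablePeers(now).length + 1;
    return visible * 2 <= total;
  }
}

const status = new StatusMap(["b", "c"]);
// Peer "b" answers every ping up to t = 20 s; peer "c" stops after t = 4 s.
for (let t = 0; t <= 20000; t += HEARTBEAT_INTERVAL_MS) {
  status.recordReply("b", t);
  if (t <= 4000) status.recordReply("c", t);
}
console.log(status.reachablePeers(20000)); // [ 'b' ]
console.log(status.primaryShouldStepDown(20000)); // false
console.log(status.primaryShouldStepDown(31000)); // true
```

With one of two peers still visible the primary keeps a 2-of-3 majority; once both peers have been silent for more than 10 seconds it sees only itself and steps down.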


Replica set synchronization is divided into initial sync and ongoing ("keep") replication. Initial sync means copying the full data set from the primary node; the larger the primary's data volume, the longer it takes. Ongoing replication is the incremental synchronization between nodes that normally follows the initial sync. Initial sync is not triggered only the very first time; the following two situations trigger it:

A secondary joins for the first time; this one is obvious.

A secondary has fallen behind by more than the size of the oplog; in that case the data is also copied in full.

So how big is the oplog? As mentioned earlier, the oplog stores the record of data operations; a secondary copies the oplog and replays its operations locally. The oplog is itself a MongoDB collection, kept in the "local" database, but it is a capped collection, that is, a fixed-size collection in which newly added data overwrites the oldest once the size limit is exceeded. So take note: for replication across IDCs, set an appropriate oplog size to avoid full resyncs in production. The size can be set with --oplogSize, and on 64-bit Linux and Windows the oplog defaults to 5% of the free disk space.
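Why a small oplog forces a full resync can be shown with a toy capped log. This is a simplified sketch under assumptions that differ from the real oplog: the class CappedOplog is hypothetical, capacity is counted in entries rather than bytes, and timestamps are consecutive integers.

```javascript
// Fixed-size oplog sketch: when full, the oldest entries are overwritten.
class CappedOplog {
  constructor(capacity) {
    this.capacity = capacity; // number of entries, an assumption of this sketch
    this.entries = [];
  }
  append(ts, op) {
    this.entries.push({ ts, op });
    if (this.entries.length > this.capacity) this.entries.shift(); // drop oldest
  }
  // A secondary can catch up incrementally only if the entry right after
  // its last applied timestamp is still in the oplog.
  needsFullResync(lastAppliedTs) {
    if (this.entries.length === 0) return false;
    return this.entries[0].ts > lastAppliedTs + 1;
  }
}

const oplog = new CappedOplog(3);
for (let ts = 1; ts <= 5; ts++) oplog.append(ts, "insert");
console.log(oplog.entries.map((e) => e.ts)); // [ 3, 4, 5 ]
console.log(oplog.needsFullResync(2)); // false: entry 3 is still available
console.log(oplog.needsFullResync(1)); // true: entry 2 has been overwritten
```

A larger capacity widens the window in which a lagging secondary can still catch up incrementally, which is the reason for sizing the oplog generously across IDCs.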

Synchronization does not always come from the primary node. Suppose the cluster has 3 nodes: node 1 is the primary in IDC1, and node 2 and node 3 are in IDC2. For the initial sync, node 2 and node 3 both copy data from node 1. Afterwards node 2 and node 3 follow the proximity principle and replicate from a replica set member in their own IDC, so only one node needs to replicate data from node 1 in IDC1.

The following points should also be noted when setting up synchronization:

A secondary does not replicate from delayed or hidden members.

For one member to sync from another, the buildIndexes setting of the two members must be the same, whether true or false. buildIndexes mainly controls whether this node's data is used for queries, and defaults to true.

If a synchronization operation gets no response for 30 seconds, another node is selected as the sync source.
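The rules above can be combined into a small selection sketch. It is an assumption-laden illustration, not the server's actual logic: chooseSyncSource and pingMs are hypothetical, "proximity" is modelled as the lowest ping time, and hidden, slaveDelay and buildIndexes are used as simple member fields.

```javascript
// Pick a sync source: skip hidden and delayed members, require the same
// buildIndexes setting, then prefer the nearest (lowest ping) candidate.
function chooseSyncSource(self, candidates) {
  const eligible = candidates.filter(
    (c) =>
      !c.hidden &&
      !(c.slaveDelay > 0) &&
      c.buildIndexes === self.buildIndexes
  );
  if (eligible.length === 0) return null;
  return eligible.reduce((a, b) => (b.pingMs < a.pingMs ? b : a)).name;
}

const node3 = { name: "node3", buildIndexes: true };
const candidates = [
  { name: "node1", pingMs: 40, hidden: false, slaveDelay: 0, buildIndexes: true }, // primary in IDC1
  { name: "node2", pingMs: 1, hidden: false, slaveDelay: 0, buildIndexes: true }, // same IDC
  { name: "node4", pingMs: 1, hidden: true, slaveDelay: 0, buildIndexes: true }, // hidden member
];
console.log(chooseSyncSource(node3, candidates)); // node2
```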

All the questions raised at the beginning of this chapter have now been answered, and I have to say that MongoDB's design is really powerful!

The following parts continue with the questions left over from the previous section:

Can connections be switched automatically when the primary node goes down?

How can heavy read and write pressure on the primary node be relieved?

In the early stage of a system, the amount of data causes few problems, but as the data keeps growing, sooner or later a single machine's hardware becomes the bottleneck. MongoDB is positioned as an architecture for massive data; how could it get by without solving the massive data problem? Sharding is what solves it.

How do traditional databases handle reads and writes of massive data? In one phrase: divide and conquer. See the architecture diagram below, cited from Taobao's Yeu Xuqiang:


The TDDL in the figure above is a Taobao data access layer component whose main job is SQL parsing and routing. Based on the functionality the application requests, it parses the current SQL statement, determines which business database and which table to query, and returns the data results. Detailed diagram:

So much for traditional database architectures; how does NoSQL do it? For MySQL to scale automatically, a data access layer has to be added and extended by program, and adding, removing, and backing up databases all have to be controlled by program as well. Once there are many database nodes, maintenance becomes a real headache. MongoDB, however, can do all of this through its own internal mechanisms! The figure below shows the mechanisms MongoDB uses for routing and sharding:

The figure shows four components: mongos, config server, shard, and replica set.

mongos is the entry point for requests to the database cluster. All requests are coordinated through mongos, so the application does not need its own routing selector. mongos is a request dispatch center that forwards each data request to the corresponding shard server. A production environment usually runs multiple mongos instances as entry points, so that if one of them goes down the MongoDB requests can still be served.

The config server, as the name implies, is the configuration server that stores all of the database metadata (routing and shard configuration). mongos does not physically store the shard server and data routing information itself; it only caches it in memory, while the config server actually stores the data. When mongos starts for the first time or restarts, it loads the configuration from the config server; afterwards, if the config server information changes, all mongos instances are notified to update their state so that they can keep routing accurately. A production environment usually has multiple config servers, because they store the sharding route metadata, which must not be lost! Even if one of them goes down, the MongoDB cluster keeps running.

A shard is the legendary slice of data. As mentioned above, even a powerful machine has a ceiling. It is like an army at war: one soldier, however many health potions he drinks, cannot beat an entire enemy division. As the saying goes, three cobblers with their wits combined equal Zhuge Liang, and this is where the strength of a team shows. The same holds on the Internet: what one ordinary machine cannot do, several machines can, as shown below:

Collection1, a table on a single machine, stores 1 TB of data, which is far too much pressure! Once it is spread over 4 machines, each machine holds 256 GB and the pressure on the single machine is shared out. Someone may ask: why not just give the one machine a bigger hard disk instead of splitting across four machines? Don't think only of storage space; a running database is also limited by disk I/O, network I/O, CPU, and memory. Once the sharding rules are set in the MongoDB cluster, data operations sent through mongos are automatically forwarded to the corresponding shard machine. In production the shard key must be chosen well, because it determines how evenly the data is divided among the shard machines. You do not want one machine to end up with 1 TB while the others get nothing, which would be worse than not sharding at all!
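How a router can forward an operation to the right shard by shard key can be sketched with a small range table. The chunk boundaries and the route function are invented for illustration; a real cluster keeps its chunk ranges in the config servers and splits and moves them automatically.

```javascript
// Range-based routing sketch: each chunk owns the key range [min, max).
// The boundaries below are made up for illustration.
const chunks = [
  { min: -Infinity, max: 1000, shard: "shard1" },
  { min: 1000, max: 50000, shard: "shard2" },
  { min: 50000, max: Infinity, shard: "shard3" },
];

// Find the chunk whose range contains the shard key value.
function route(key, chunkTable) {
  const chunk = chunkTable.find((c) => key >= c.min && key < c.max);
  return chunk ? chunk.shard : null;
}

console.log(route(1, chunks)); // shard1
console.log(route(1000, chunks)); // shard2
console.log(route(99999, chunks)); // shard3
```

With badly placed boundaries most keys fall into one range, which is the uneven situation the text warns about.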

Replica sets were covered in detail in the previous two parts, so why do they show up here again? Because without replica sets the four shards above form an incomplete architecture: if one shard goes down, a quarter of the data is lost. A highly available sharded architecture therefore also needs a replica set for each shard to make the shard reliable. A production environment usually uses 2 replicas + 1 arbiter.

Enough talk; let's get hands-on and build a highly available MongoDB cluster:

First decide the number of each component: 3 mongos, 3 config servers, and 3 shard servers for data split into 3 shards; each shard also has one replica and one arbiter, which adds 3 * 2 = 6, so 15 instances need to be deployed in total. These instances can be deployed on separate machines or on the same machine. Our test resources are limited and we prepared only 3 machines; on the same machine, instances only need different ports. See the physical deployment diagram:


With the architecture settled, let's install the software!

1. Prepare the machines, each with its own IP address.

2. On each machine, create the test folder for the MongoDB shards.

3. Download the MongoDB installation package and unpack it:

tar xvzf mongodb-linux-x86_64-2.4.8.tgz

4. On each machine, create five directories: mongos, config, shard1, shard2, and shard3.

Because mongos does not store data, it only needs a log directory.



# Create the config server data directory

# Create the config server log directory

# Create the shard1 data directory

# Create the shard1 log directory

# Create the shard2 data directory

# Create the shard2 log directory

# Create the shard3 data directory

# Create the shard3 log directory


5. Plan the port numbers for the 5 components. Because each machine runs mongos, config server, shard1, shard2, and shard3 at the same time, the ports must be different.

The ports can be chosen freely. In this article mongos uses 20000, config server 21000, shard1 22001, shard2 22002, and shard3 22003.
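The deployment plan can be written down as data and checked against the count of 15 instances. The port numbers are the ones chosen in this article; the machine names are placeholders, since only the number of machines matters here.

```javascript
// Port plan from the article; machine names are placeholders.
const ports = { mongos: 20000, config: 21000, shard1: 22001, shard2: 22002, shard3: 22003 };
const machines = ["machine1", "machine2", "machine3"];

// Every machine runs one instance of each of the five components.
const instances = [];
for (const machine of machines) {
  for (const [component, port] of Object.entries(ports)) {
    instances.push({ machine, component, port });
  }
}

console.log(instances.length); // 15
console.log(instances.filter((i) => i.machine === "machine1").map((i) => i.port));
```

Because all five ports differ, the five processes on one machine never collide.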

6. Start the config server on each machine.

/data/mongodbtest/mongodb-linux-x86_64-2.4.8/bin/mongod --configsvr --dbpath /data/mongodbtest/config/data --port 21000 --logpath /data/mongodbtest/config/log/config.log --fork

7. Start the mongos server on each machine.

/data/mongodbtest/mongodb-linux-x86_64-2.4.8/bin/mongos --configdb ,, --port 20000 --logpath /data/mongodbtest/mongos/log/mongos.log --fork

8. Configure the replica set for each shard.

/data/mongodbtest/mongodb-linux-x86_64-2.4.8/bin/mongod --shardsvr --replSet shard1 --port 22001 --dbpath /data/mongodbtest/shard1/data --logpath /data/mongodbtest/shard1/log/shard1.log --fork --nojournal --oplogSize 10

To start quickly and save storage space in the test environment, --nojournal is added here to turn off journaling; our test environment does not need to initialize such a large redo log. --oplogSize is likewise set to reduce the size of the local files. The oplog is a fixed-length capped collection that lives in the "local" database and records the replica set's operation log. Note that these settings are for testing only!

/data/mongodbtest/mongodb-linux-x86_64-2.4.8/bin/mongod --shardsvr --replSet shard2 --port 22002 --dbpath /data/mongodbtest/shard2/data --logpath /data/mongodbtest/shard2/log/shard2.log --fork --nojournal --oplogSize 10

/data/mongodbtest/mongodb-linux-x86_64-2.4.8/bin/mongod --shardsvr --replSet shard3 --port 22003 --dbpath /data/mongodbtest/shard3/data --logpath /data/mongodbtest/shard3/log/shard3.log --fork --nojournal --oplogSize 10

Next, configure a replica set for each shard. For an in-depth look at replica sets, see the earlier articles in this series.

Log in to any one of the machines and connect to MongoDB.

use admin

config = {_id: "shard1", members: [
    {_id: 0, host: ""},
    {_id: 1, host: ""},
    {_id: 2, host: "", arbiterOnly: true}
]};

rs.initiate(config);




use admin

config = {_id: "shard2", members: [
    {_id: 0, host: ""},
    {_id: 1, host: ""},
    {_id: 2, host: "", arbiterOnly: true}
]};

rs.initiate(config);




use admin

config = {_id: "shard3", members: [
    {_id: 0, host: ""},
    {_id: 1, host: ""},
    {_id: 2, host: "", arbiterOnly: true}
]};

rs.initiate(config);

9. At this point the MongoDB config servers, routing servers, and shard servers are all set up, but an application connecting to the mongos routing server cannot use the sharding mechanism yet. The sharding configuration still has to be set up so that sharding takes effect.

// Switch to the admin database
use admin

db.runCommand({addShard: "shard1/,,"});

If a shard is a single server, add it with a command of the form db.runCommand({addShard: "<serverhostname>[:<port>]"}); if the shard is a replica set, use the format db.runCommand({addShard: "replicaSetName/<serverhostname1>[:<port>][,<serverhostname2>[:<port>],...]"});.

db.runCommand({addShard: "shard2/,,"});

db.runCommand({addShard: "shard3/,,"});

db.runCommand({listShards: 1});


{
    "shards": [
        {
            "_id": "shard1",
            "host": "shard1/,"
        },
        {
            "_id": "shard2",
            "host": "shard2/,"
        },
        {
            "_id": "shard3",
            "host": "shard3/,"
        }
    ],
    "ok": 1
}

Because the third machine is the arbiter node of each shard's replica set, it is not listed in the results above.

10. The config service, routing service, shard service, and replica set service are now all wired together. But our goal is that inserted data is sharded automatically, and we are just a little bit short of that.

Connect to mongos and enable sharding for the chosen database and collection.

db.runCommand({enableSharding: "testdb"});

db.runCommand({shardCollection: "testdb.table1", key: {id: 1}})

This tells the cluster that the table1 collection in testdb needs to be sharded, and that its documents are automatically distributed over shard1, shard2, and shard3 according to id. It has to be set explicitly because not every MongoDB database and collection needs sharding!

11. Test the sharding configuration.

// Switch to testdb
use testdb;

// Insert test data
for (var i = 1; i <= 100000; i++) db.table1.insert({id: i, "test1": "testval1"});

// Check the sharding status; some irrelevant output is omitted
db.table1.stats();

{
    "sharded": true,
    "ns": "testdb.table1",
    "count": 100000,
    "numExtents": 13,
    "size": 5600000,
    "storageSize": 22372352,
    "totalIndexSize": 6213760,
    "indexSizes": {
        "_id_": 3335808,
        "id_1": 2877952
    },
    "avgObjSize": 56,
    "nindexes": 2,
    "nchunks": 3,
    "shards": {
        "shard1": {
            "ns": "testdb.table1",
            "count": 42183,
            "size": 0,
            ...
            "ok": 1
        },
        "shard2": {
            "ns": "testdb.table1",
            "count": 38937,
            "size": 2180472,
            ...
            "ok": 1
        },
        "shard3": {
            "ns": "testdb.table1",
            "count": 18880,
            "size": 3419528,
            ...
            "ok": 1
        }
    },
    "ok": 1
}


You can see that the data has been divided into 3 shards, with the following document counts: shard1 "count": 42183, shard2 "count": 38937, shard3 "count": 18880. It works! The distribution does not look very even, though, so sharding takes some care; this will be discussed in depth later.
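To put a number on how uneven the result is, the counts reported by stats() can be turned into percentages. The helper distribution is written just for this check; the counts are the ones shown above.

```javascript
// Document counts per shard, taken from the stats() output above.
const counts = { shard1: 42183, shard2: 38937, shard3: 18880 };

// Compute the total and each shard's share in percent.
function distribution(c) {
  const total = Object.values(c).reduce((a, b) => a + b, 0);
  const share = {};
  for (const [name, n] of Object.entries(c)) share[name] = (100 * n) / total;
  return { total, share };
}

const result = distribution(counts);
console.log(result.total); // 100000
for (const [name, pct] of Object.entries(result.share)) {
  console.log(`${name}: ${pct.toFixed(1)}%`);
}
```

An even split would put about 33.3% on each shard; here shard1 holds more than twice as many documents as shard3.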

Calling the sharded cluster from a Java program: because we have three mongos instances as entry points, it does not matter if one of them goes down. The client program that uses the cluster is as follows:

import java.util.ArrayList;
import java.util.List;

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;
import com.mongodb.ServerAddress;

public class TestMongoDBShards {

    public static void main(String[] args) {
        try {
            // The three mongos entry points, all listening on port 20000
            List<ServerAddress> addresses = new ArrayList<ServerAddress>();
            ServerAddress address1 = new ServerAddress("", 20000);
            ServerAddress address2 = new ServerAddress("", 20000);
            ServerAddress address3 = new ServerAddress("", 20000);
            addresses.add(address1);
            addresses.add(address2);
            addresses.add(address3);

            MongoClient client = new MongoClient(addresses);
            DB db = client.getDB("testdb");
            DBCollection coll = db.getCollection("table1");

            // Look up the document with id = 1 through mongos
            BasicDBObject object = new BasicDBObject();
            object.append("id", 1);
            DBObject dbObject = coll.findOne(object);
            System.out.println(dbObject);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

The whole sharded cluster is now built. But is our architecture good enough? There is still plenty to optimize. For example, we put all the arbiter nodes on one machine, so the other two machines carry all the read and write operations while the machine acting as arbiter sits quite idle. Let machine 3 share some of the responsibility! The architecture can be adjusted so that the load is spread more evenly: each machine can act as a primary node, a replica node, and an arbiter node, which balances the pressure much better, as shown in the figure:

Of course, production data is far larger than this test data, and in large-scale data applications we cannot deploy all nodes like this. When hardware becomes the bottleneck there is no way around it; the only option is to add machines. Many mechanisms need tuning to use MongoDB well, but with it we can quickly achieve high availability and scalability, so it is a very good NoSQL component.

Take a look at MongoDB's Java driver client, MongoClient(addresses). It accepts multiple mongos addresses as entry points to the MongoDB cluster and fails over automatically, but does it balance load well? Open the source code to find out:

Its mechanism is to pick the machine with the fastest ping as the entry point for all requests, and to use the next machine if that one goes down. So... that will definitely not do! In a situation like Double 11, with every request sent to this one machine, it is very likely to go down. Once it does, the mechanism moves the requests to the next machine, but the total pressure has not been reduced at all! The next machine is likely to crash as well, so this architecture still has a weakness! The solution follows in a later part.
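The behaviour described above can be mimicked in a few lines to show where the load ends up. This is a sketch of the described mechanism only, not the driver's source: pickEntry, the server names, and the ping values are all hypothetical.

```javascript
// Sketch of the selection behaviour described above (illustrative names).
// The live server with the lowest ping receives every request.
function pickEntry(servers) {
  const alive = servers.filter((s) => s.alive);
  if (alive.length === 0) return null;
  return alive.reduce((a, b) => (b.pingMs < a.pingMs ? b : a)).name;
}

const mongosList = [
  { name: "mongos1", pingMs: 3, alive: true },
  { name: "mongos2", pingMs: 1, alive: true },
  { name: "mongos3", pingMs: 2, alive: true },
];

// Every request goes to the same fastest entry point.
const hits = {};
for (let i = 0; i < 1000; i++) {
  const target = pickEntry(mongosList);
  hits[target] = (hits[target] || 0) + 1;
}
console.log(hits); // { mongos2: 1000 }

// When that machine dies, the whole load moves to the next fastest one.
mongosList[1].alive = false;
console.log(pickEntry(mongosList)); // mongos3
```

Failover works, but at no point is the traffic shared between the three entry points, which is exactly the weakness pointed out above.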
