MongoDB Replica Set principle


Copyright notice: This is an original article; please credit the source when reproducing it.
Original link: https://www.qcloud.com/community/article/136

Source: Tencent Cloud community, https://www.qcloud.com/community

In MongoDB's single-instance mode, one mongod process is one instance; an instance contains several DBs, and each DB contains several collections.
MongoDB stores the oplog in a special collection, local.oplog.rs. It has a fixed size: when it is full, the oldest records are deleted as new ones are inserted, and it supports only append operations, so it can be understood as a persistent ring buffer. The oplog is the core building block of MongoDB replica sets.
A MongoDB replica set is the mechanism by which MongoDB instances achieve data redundancy: each instance replicates and applies the oplog of another instance.

There are generally two ways to set up an ordinary replica set (note that you can manually specify a replication source via the mongo shell, but MongoDB does not guarantee that this specification is persistent; in some cases MongoDB switches the replication source automatically).

MongoDB's replica set technology is nothing unusual; it closely resembles MySQL's asynchronous replication model and involves several technical points:

    1. How a newly joined node initializes its data before it can sync normally

    2. How the remaining secondary nodes continue to provide service after the primary goes down

    3. How to avoid losing data after the primary goes down, and how to handle the data that is lost when it does

As a mature database product, MongoDB solves these problems well. A complete replica set includes the following features:

    1. Data synchronization
      1.1 Initial-sync
      1.2 Steady-sync
      1.3 Exception Data rollback

    2. MongoDB Cluster heartbeat and election

One. Data Synchronization: Initial-sync

When a node joins a cluster, it must first initialize its data so that the gap between its data and that of the other nodes in the cluster is as small as possible. This process is called initial-sync.
An initial-sync consists of six steps (following the logic of the rs_initialSync.cpp:_initialSync function):

    1. Delete all local DBs except the local database
    2. Choose a source node and import all of its DBs locally (note that only the data is imported here, not the indexes)
    3. Apply locally the oplog that the source generated between the start and the end of step 2
    4. Apply locally the oplog that the source generated between the start and the end of step 3
    5. Rebuild all collection indexes locally from the source (import the indexes)
    6. Apply locally the oplog that the source generated between the start and the end of step 5
      When step 6 completes, the gap between the source and the local node is small enough, and MongoDB enters the secondary state.
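The successive approximation in these steps can be modelled with a small sketch (all types and names here are illustrative, not MongoDB internals): each catch-up round applies the oplog window that the source generated during the previous round, so the remaining gap shrinks to the size of the last window.

```cpp
#include <cstddef>

// Toy model of initial sync's successive approximation: while the clone (step 2)
// runs, the source keeps generating oplog entries; each follow-up round (steps 3,
// 4 and 6) applies the oplog window produced during the previous round, so the
// gap between source and local shrinks each time. All names are illustrative.
struct Source {
    std::size_t opCount = 0;                       // total ops ever written on the source
    void run(std::size_t ops) { opCount += ops; }  // the source keeps writing
};

struct InitialSync {
    std::size_t applied = 0;  // ops already reflected locally

    // Apply every source op in [applied, snapshot); meanwhile the source
    // writes `concurrent` new ops, modelling "oplog generated during this step".
    void catchUp(Source& src, std::size_t concurrent) {
        std::size_t snapshot = src.opCount;
        src.run(concurrent);
        applied = snapshot;
    }
};

// Returns the remaining gap after a clone plus three oplog-application rounds.
std::size_t gapAfterInitialSync(std::size_t cloneWrites, std::size_t roundWrites) {
    Source src;
    src.run(1000);                   // pre-existing data, cloned in step 2
    InitialSync node;
    node.catchUp(src, cloneWrites);  // step 2: long clone, big new oplog window
    node.catchUp(src, roundWrites);  // step 3: apply clone-time oplog
    node.catchUp(src, roundWrites);  // step 4: apply step-3-time oplog
    node.catchUp(src, roundWrites);  // step 6: apply index-build-time oplog
    return src.opCount - node.applied;
}
```

After the final round, the node is behind by only the last window, which is why MongoDB can then safely switch to steady-state tailing.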

Step 2 copies all the data, so it generally takes the longest; steps 3 and 4 are a process of successive approximation. MongoDB applies the oplog twice here because step 2 usually takes so long that the oplog accumulated during it is itself fairly large, which indirectly lengthens step 3. This is not strictly necessary, however: the TODO started at rs_initialsync.cpp:384 suggests using SyncTail to read and apply the data in a single pass (if you are unfamiliar with the behavior and principles of SyncTail and tailable cursors, see the official documentation).

Steady-sync

Once a node finishes initialization, it enters the Steady-sync state. As the name implies, under normal circumstances this is a stable, quiet background process that continuously pulls new oplog entries from the replication source and applies them. Two problems typically arise in this process:

    1. The replication source writes too fast (or the local node applies too slowly, relatively speaking), so the source's oplog overwrites the position of the cursor the local node maintains on the source for tailing its oplog.
    2. The node was the primary before it went down, and after restarting, its local oplog contains entries that conflict with the current primary's oplog.
      (The original article illustrates these two scenarios with a diagram.)

Both situations are detected in the bgsync.cpp:_produce function. Although they are quite different, both ultimately go into the bgsync.cpp:_rollback function for handling.
For the second case, the handling in rs_rollback.cpp proceeds in the following steps:

  1. Maintain one reverse cursor on the local oplog and one on the remote oplog, and find the LCA (the most recent common ancestor, Record4 in Conflict.png) in linear time.
    This is similar to the classic problem of finding the common node of two sorted lists; it is implemented in roll_back_local_operations.cpp:syncRollBackLocalOperations. Readers may want to think about how this can be done in linear time.
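A sketch of that linear-time common-point search with two reverse cursors (timestamps stand in for oplog entries; this illustrates the idea and is not MongoDB's implementation):

```cpp
#include <vector>

// Walk two oplogs from newest to oldest with one reverse cursor each and find
// the most recent entry they share (the common ancestor of the two histories).
// Entries are modelled as strictly increasing timestamps; the scan is linear,
// like the merge step on two sorted lists. Illustrative sketch only.
long findCommonPoint(const std::vector<long>& localTs,
                     const std::vector<long>& remoteTs) {
    auto l = localTs.rbegin();   // newest local entry
    auto r = remoteTs.rbegin();  // newest remote entry
    while (l != localTs.rend() && r != remoteTs.rend()) {
        if (*l == *r) return *l;  // first match is the most recent common op
        if (*l > *r) ++l;         // local entry is newer: a conflicting op, skip it
        else ++r;                 // remote entry is newer: not yet seen locally
    }
    return -1;  // no common point: rollback is impossible, full resync needed
}
```

Every local entry skipped on the way to the common point is exactly the set of conflicting operations that the later steps must roll back.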

  2. For each conflicting local oplog entry, enumerate its type and derive the inverse operation needed to roll it back, recording it as follows:
    2.1: create_table -> drop_table
    2.2: drop_table -> resync the table
    2.3: drop_index -> resync the table and rebuild the index
    2.4: drop_db -> the rollback fails; the user must trigger a manual initial resync
    2.5: apply_ops -> for each entry inside the applyOps oplog, recursively perform step 2
    2.6: create_index -> drop_index
    2.7: CRUD operations on ordinary documents -> re-read the current value from the primary and replace the local one (relevant function: rs_rollback.cpp:refetch)

  3. Execute the handling derived for each oplog entry in step 2; the relevant function is rs_rollback.cpp:syncFixUp. This mainly puts step 2's plan into practice, and the actual process is quite tedious.

  4. Truncate the conflicting entries from the local oplog.
    As mentioned above, the case where the local node has gone stale also goes through the unified _rollback path. For a stale node, _rollback fails at the find-the-LCA step, and the node then tries to switch its replication source, looking among all currently live secondary and primary nodes for one that is not stale.
    It is worth explaining that, since the oplog is a ring buffer of finite size, the only condition for going stale is: the cursor the local node maintains on its replication source is overwritten by the source's writes (imagine you and a classmate start running laps around a playground at the same time; once your classmate is more than a full lap ahead of you, you "meet" again). Therefore, a node with a larger oplog takes longer to complete a lap, and using such a node as the replication source makes going stale less likely.
    This concludes the description of cluster data synchronization in MongoDB. (The original article summarizes the process with a flowchart.)
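The stall condition above can be modelled as a ring buffer being "lapped" (a toy model; all names are illustrative):

```cpp
#include <cstddef>

// The oplog as a fixed-size ring buffer: the source's head advances as it
// writes; a secondary's cursor goes stale exactly when the head "laps" it,
// i.e. the source has written more than one buffer's worth of entries past
// the cursor position. Illustrative model of the condition described above.
struct OplogRing {
    std::size_t capacity;  // number of entries the capped collection can hold
    std::size_t head = 0;  // sequence number of the next entry to be written

    void write(std::size_t n) { head += n; }

    // A cursor at sequence `pos` is stale once its entry has been overwritten.
    bool isStale(std::size_t pos) const { return head > pos + capacity; }
};
```

A larger `capacity` means more writes are needed before any cursor gets lapped, which is exactly why a replication source with a bigger oplog makes staleness less likely.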
Steady-sync's threading model and out-of-order acceleration of oplog application

The code related to Steady-sync lives in bgsync.cpp and sync_tail.cpp. As described above, the Steady-sync process reads newly generated oplog entries from the replication source and applies them locally, so it is a producer-consumer model. Because the oplog must preserve order, the producer can only be single-threaded.
Is there a way to speed up the consumer side with concurrency?

  1. First, there is no need to preserve the relative apply order of oplog entries that touch unrelated documents, so entries can be grouped by a hash of the document _id; a strict write order only has to be maintained within each group.

```cpp
void fillWriterVectors(OperationContext* txn,
                       MultiApplier::Operations* ops,
                       std::vector<MultiApplier::OperationPtrs>* writerVectors) {
    for (auto&& op : *ops) {
        StringMapTraits::HashedKey hashedNs(op.ns);
        uint32_t hash = hashedNs.hash();

        // For doc locking engines, include the _id of the document in the hash so we get
        // parallelism even if all writes are to a single collection. We can't do this for capped
        // collections because the order of inserts is a guaranteed property, unlike for normal
        // collections.
        if (supportsDocLocking && op.isCrudOpType() && !isCapped(txn, hashedNs)) {
            BSONElement id = op.getIdElement();
            const size_t idHash = BSONElement::Hasher()(id);
            MurmurHash3_x86_32(&idHash, sizeof(idHash), hash, &hash);
        }

        auto& writer = (*writerVectors)[hash % numWriters];
        if (writer.empty())
            writer.reserve(8);  // Skip a few growth rounds.
        writer.push_back(&op);
    }
}
```
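A minimal, runnable version of the same grouping idea (using std::hash as a stand-in for MurmurHash3; the types and names here are illustrative, not MongoDB's): writes to the same document always land in the same worker vector, which preserves their relative order.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Distribute oplog entries across N applier workers by a hash of
// (namespace, _id), so all writes to the same document go to the same worker
// vector and keep their relative order. std::hash stands in for MurmurHash3.
struct Op {
    std::string ns;  // collection namespace
    std::string id;  // the document's _id
};

std::vector<std::vector<Op>> fillWriterVectors(const std::vector<Op>& ops,
                                               std::size_t numWriters) {
    std::vector<std::vector<Op>> writers(numWriters);
    for (const Op& op : ops) {
        std::size_t h = std::hash<std::string>{}(op.ns + '\0' + op.id);
        writers[h % numWriters].push_back(op);  // same doc -> same worker
    }
    return writers;
}
```

Each worker thread can then apply its own vector independently; only ops on the same document are serialized.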
  2. Second, a command has a global effect on a table or the whole database, so commands must be processed separately, after the current batch of consumer work completes, and no other operation may be applied while a command oplog entry is being processed. This is analogous to a CPU memory barrier in an SMP architecture.

```cpp
// Check for ops that must be processed one at a time.
if (entry.raw.isEmpty() ||  // sentinel that network queue is drained
    (entry.opType[0] == 'c') ||  // commands
    // Index builds are achieved through the use of an insert op, not a command op.
    // The following line is the same as what the insert code uses to detect an index build.
    (!entry.ns.empty() && nsToCollectionSubstring(entry.ns) == "system.indexes")) {
    if (ops->getCount() == 1) {
        // Apply commands one-at-a-time.
        _networkQueue->consume(txn);
    } else {
        // This op must be processed alone, but we already have ops in the queue so we can't
        // include it in this batch. Since we didn't call consume(), we'll see this again next
        // time and process it alone.
        ops->pop_back();
    }
}
```
  3. The oplog order on a secondary must be exactly the same as on the primary, so no matter how the user-data writes are reordered by mechanisms 1 and 2, the order of the oplog itself must be preserved. A capped collection under the MMAP engine can only guarantee order through sequential insertion, so there oplog insertion is a single-threaded process. For a capped collection under the WiredTiger engine, an index on the ts (timestamp) field makes the read order independent of the insertion order.

```cpp
// Only doc-locking engines support parallel writes to the oplog because they are required to
// ensure that oplog entries are ordered correctly, even if inserted out-of-order. Additionally,
// there would be no way to take advantage of multiple threads if a storage engine doesn't
// support document locking.
if (!enoughToMultiThread ||
    !txn->getServiceContext()->getGlobalStorageEngine()->supportsDocLocking()) {
    threadPool->schedule(makeOplogWriterForRange(0, ops.size()));
    return false;
}
```

The original article summarizes Steady-sync's class dependencies and threading model in a diagram.

Two. MongoDB Heartbeat and Election Mechanism

MongoDB's primary election is triggered by heartbeats. In a replica set of n nodes, every pair of nodes maintains a heartbeat, and each node tracks the state of the other n-1 nodes (that state is only that node's POV; under a network partition, for example, A may observe C as down at the same moment that B observes C as secondary).

From its own POV, every node attempts to step down the primary after each heartbeat (topology_coordinator_impl.cpp:_updatePrimaryFromHBData). The reasons for demoting the primary are:

    1. The heartbeat detects that some other node has a higher priority than the current primary, so an attempt is made to demote (step down) the primary to
      secondary; dynamically changing priority values gives operators a way to hot-swap the primary
    2. If this node is the primary but cannot ping a majority of the nodes in the cluster (the majority rule), it demotes itself to secondary

Electing the primary

When a secondary node detects that the cluster currently has no surviving primary, it attempts to elect itself primary. The primary election is a two-stage process with a majority protocol.

First Stage

From its own POV, the node checks whether it is eligible for election:

    1. It can ping a majority of the nodes in the cluster
    2. Its priority must be greater than 0
    3. It cannot be an arbiter node
      If these checks pass, it sends a FreshnessCheck to all surviving nodes in the cluster (asking the other nodes whether "I" am eligible for election)

Peer arbitration

During the first stage of an election, a node that receives an election request from another node applies stricter checks to it:

    1. No other node in the cluster may have a priority higher than the initiator's
    2. The initiator cannot be an arbiter node
    3. The initiator's priority must be greater than 0
    4. From the arbitrating node's POV, the initiator's oplog must be the most up-to-date among the cluster's surviving nodes (ties are allowed, i.e. everyone is equally up to date)
Second Stage

The initiator sends an Elect request to the surviving nodes in the cluster. A node that receives the request acts as an arbitrator and performs a series of legality checks; if the checks pass, it casts a vote for the initiator and takes a 30-second "election lock", which means that while holding the lock it will not vote for any other initiator.
If the initiator receives a majority of the votes, the election succeeds and it becomes the primary. Apart from ordinary network problems, one reason for falling short of a majority is that several nodes of the same priority all pass the first stage's peer arbitration and enter the second stage. Therefore, when the votes are insufficient, the node sleeps for a random interval of [0, 1] seconds and then retries the election.
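A toy model of the voter-side logic in this stage (the names and the plain-seconds clock are illustrative): each voter grants at most one vote per 30-second lock window, and the initiator wins only with a strict majority.

```cpp
#include <cstddef>

// Sketch of the second election phase from a single voter's perspective: a
// voter grants at most one vote per 30-second "election lock" window, and the
// initiator wins only with votes from a strict majority of the replica set.
struct Voter {
    double lockedUntil = 0;  // end of the current election-lock window

    bool grantVote(double now) {
        if (now < lockedUntil) return false;  // still holding the lock: refuse
        lockedUntil = now + 30;               // vote and take the 30-second lock
        return true;
    }
};

bool wonElection(std::size_t votesFor, std::size_t clusterSize) {
    return 2 * votesFor > clusterSize;  // strict majority required
}
```

The election lock is what prevents a voter from voting for two same-priority initiators that both reached the second stage in the same window.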
