MongoDB Replica Set
Introduction to Replica Sets
A MongoDB replica set consists of a group of mongod instances (processes) comprising one primary node and multiple secondary nodes. All writes from the MongoDB driver (client) go to the primary, and the secondaries continuously synchronize data from the primary, so that every member of the replica set stores the same data set, providing high availability.
The figure below (from the MongoDB official documentation) shows a typical MongoDB replica set, containing one primary node and two secondary nodes.
[Figure: MongoDB Replica Set]
Primary elections
A replica set is initialized with the replSetInitiate command (or rs.initiate() in the mongo shell). After initialization, the members exchange heartbeat messages and start a primary election; the node that receives votes from a "majority" of members becomes the primary, and the remaining nodes become secondaries.
Initialize the replica set:

config = {
    _id: "my_replica_set",
    members: [
        {_id: 0, host: "rs1.example.net:27017"},
        {_id: 1, host: "rs2.example.net:27017"},
        {_id: 2, host: "rs3.example.net:27017"}
    ]
}
rs.initiate(config)
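After initiation you can verify the members' states with the standard rs.status() helper in the mongo shell:

// Shows every member's state (PRIMARY, SECONDARY, ARBITER, ...),
// its optime, and the last heartbeat received from it
rs.status()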
Definition of "majority"
Assume the number of voting members in the replica set is N; a majority is then floor(N/2) + 1. When the number of surviving voting members falls below a majority, the replica set cannot elect a primary and therefore cannot accept writes; it becomes read-only.
Number of voting members    Majority    Failure tolerance
1                           1           0
2                           2           0
3                           2           1
4                           3           1
5                           3           2
6                           4           2
7                           4           3
It is generally recommended to give a replica set an odd number of members. As the table above shows, 3-node and 4-node replica sets can each tolerate only 1 node failure, so from the standpoint of "service availability" they are equivalent (although a 4-node set undoubtedly provides more reliable data storage).
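As a sanity check on the table, the majority and failure-tolerance rules can be expressed in two lines of shell JavaScript (a small illustrative sketch, not part of MongoDB itself):

// majority(n): smallest number of votes that wins an election among n voters
function majority(n) { return Math.floor(n / 2) + 1; }
// tolerance(n): how many voters can fail while a majority still survives
function tolerance(n) { return n - majority(n); }
// majority(4) === 3 and tolerance(4) === 1, matching the table above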
Special secondaries
Normally, the secondaries of a replica set participate in primary elections (and may themselves be elected primary) and synchronize the most recently written data from the primary, keeping their data consistent with the primary's.
Secondaries can serve read requests, so adding secondary nodes increases both the read capacity and the availability of the replica set. In addition, MongoDB supports flexible configuration of secondary nodes to accommodate a variety of scenarios.
Arbiter
An arbiter node participates only in voting; it cannot be elected primary and does not synchronize data from the primary.
For example, if you deploy a 2-node replica set with 1 primary and 1 secondary, the failure of either node leaves the replica set unable to provide service (no primary can be elected). Adding an arbiter node to the replica set allows a primary to be elected even when one node is down.
The arbiter itself stores no data and is a very lightweight service. When a replica set has an even number of members, it is best to add an arbiter node to increase the availability of the replica set.
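Adding an arbiter from the mongo shell uses the rs.addArb() helper (the hostname below is illustrative):

// Run on the primary; the arbiter votes but holds no data
rs.addArb("rs4.example.net:27017")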
Priority0
A priority0 node has an election priority of 0 and will never be elected primary.
For example, suppose you deploy a replica set across two data centers, A and B, and want the primary to always be located in data center A. You can set the priority of the members in data center B to 0, so that the primary is always a member in data center A. (Note: if you deploy this way, it is best to place a "majority" of the nodes in data center A; otherwise a network partition may leave the set unable to elect a primary.)
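Priority is changed through reconfiguration; a minimal sketch, assuming members[2] is the data center B member:

cfg = rs.conf()
cfg.members[2].priority = 0   // assumed index of the data center B member
rs.reconfig(cfg)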
Vote0
Since MongoDB 3.0, a replica set can have up to 50 members, of which at most 7 may vote in primary elections; the remaining members (vote0 members) must have their votes attribute set to 0, i.e. they do not vote.
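Marking a member as non-voting is again a reconfiguration; a sketch assuming a hypothetical 8th member at index 7:

cfg = rs.conf()
cfg.members[7].votes = 0      // hypothetical member; beyond 7 voters, votes must be 0
rs.reconfig(cfg)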
Hidden
A hidden node cannot be elected primary (its priority is 0) and is invisible to the driver.
Because hidden nodes receive no driver requests, you can use them for data backups, offline computation, and similar tasks without affecting the replica set's service.
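Configuring a hidden member (a sketch, assuming members[3] is the node to hide; hidden members must also have priority 0):

cfg = rs.conf()
cfg.members[3].priority = 0   // required for hidden members
cfg.members[3].hidden = true
rs.reconfig(cfg)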
Delayed
A delayed node must also be a hidden node, and its data lags behind the primary by a configurable period (for example, 1 hour).
Because the delayed node's data is behind the primary by that period, if erroneous or invalid data is written to the primary, the delayed node's data can be used to restore the data set to an earlier point in time.
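Configuring a 1-hour delay (a sketch, assuming members[4]; the field is slaveDelay in the MongoDB 3.x shell, renamed secondaryDelaySecs in MongoDB 5.0):

cfg = rs.conf()
cfg.members[4].priority = 0        // delayed members must be hidden, hence priority 0
cfg.members[4].hidden = true
cfg.members[4].slaveDelay = 3600   // lag behind the primary by 3600 seconds
rs.reconfig(cfg)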
Data synchronization
The primary and the secondaries synchronize data through the oplog. When the primary completes a write operation, it writes an entry to the special local.oplog.rs collection; the secondaries continuously pull new oplog entries from the primary and apply them.
Because oplog data keeps growing, local.oplog.rs is a capped collection: when its size reaches the configured limit, the oldest entries are removed. In addition, since an oplog entry may be applied more than once on a secondary, oplog entries must be idempotent, that is, applying the same entry repeatedly produces the same result.
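For example (a hedged illustration; the document values are made up), an $inc is not idempotent if replayed, so the oplog records the resulting value as a $set instead:

// What the client runs:
db.nosql.update({_id: 1}, {$inc: {score: 10}})
// Roughly what the oplog records (idempotent form):
// {op: "u", ns: "test.nosql", o2: {_id: 1}, o: {$set: {score: 110}}}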
An oplog entry includes the ts, h, op, ns, and o fields, for example:

{
    "ts": Timestamp(1446011584, 2),
    "h": NumberLong("1687359108795812092"),
    "v": 2,
    "op": "i",
    "ns": "test.nosql",
    "o": {"_id": ObjectId("563062c0b085733f34ab4129"), "name": "mongodb", "score": "100"}
}
- ts: time of the operation, a current timestamp plus a counter; the counter is reset every second
- h: globally unique identifier of the operation
- v: oplog version
- op: operation type
  - i: insert
  - u: update
  - d: delete
  - c: run a command (e.g. createDatabase, dropDatabase)
  - n: no-op, used for special purposes
- ns: the collection the operation targets
- o: the content of the operation; for an update, the update document
- o2: the query condition of an update; only update entries contain this field
When a secondary synchronizes data for the first time, it performs an initial sync (init sync), copying the full data set from the primary (or from another secondary whose data is sufficiently up to date). It then continuously queries the latest oplog entries from the primary's local.oplog.rs collection via a tailable cursor and applies them to itself.
The init sync process consists of the following steps:
1. At time T1, synchronize all database data from the primary (except the local database), using a combination of the listDatabases, listCollections, and cloneCollection commands. Suppose this completes at time T2.
2. Apply all of the primary's oplog entries for the period [T1, T2]. Some of them may already be reflected in step 1, but because the oplog is idempotent they can safely be applied again.
3. Create indexes on the secondary for each collection, based on the index definitions of the corresponding collections on the primary. (The _id index of each collection was already created in step 1.)
The size of the oplog collection should be configured according to the database size and the application's write volume: if it is too large, storage space is wasted; if it is too small, a secondary's init sync may never succeed. For example, if in step 1 the database holds a great deal of data and the oplog is configured too small, the oplog will not be able to hold all entries for the period [T1, T2], and the secondary will be unable to synchronize the complete data set from the primary.
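The oplog size is set at startup (for example with mongod's --oplogSize option); you can inspect the configured size and the time window it currently covers from the mongo shell:

// Prints the oplog's configured size, used size, and the timestamps of
// its first and last entries, i.e. how far behind a secondary may fall
rs.printReplicationInfo()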
Modifying a replica set configuration
When you need to modify a replica set, for example to add or remove members, or to change member configuration (priority, votes, hidden, slaveDelay, and so on), you can reconfigure the replica set with the replSetReconfig command (rs.reconfig() in the mongo shell).
For example, to set the priority of the replica set's 2nd member to 2, execute the following commands:

cfg = rs.conf()
cfg.members[1].priority = 2
rs.reconfig(cfg)
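Adding and removing members can also be done with the dedicated shell helpers (the hostname is illustrative):

rs.add("rs4.example.net:27017")      // add a new member
rs.remove("rs4.example.net:27017")   // remove it again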
Primary elections in detail
Besides replica set initialization, a primary election occurs in the following scenarios:
- The replica set is reconfigured.
- A secondary node detects that the primary is down and triggers a new primary election.
- The primary actively steps down (voluntarily demotes itself to secondary, as shown below), which also triggers a new primary election.
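Stepping down can be requested explicitly from the mongo shell:

// Ask the current primary to demote itself; it will refuse to be
// re-elected for the given number of seconds (here 60)
rs.stepDown(60)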
Primary elections are influenced by several factors, such as heartbeats between nodes, node priority, and the latest oplog time (optime).
Heartbeat between nodes
Members of a replica set send heartbeat messages to one another by default (every 2 seconds). If no heartbeat is received from a node within 10s, that node is considered down; if the down node is the primary, the secondaries (provided they are themselves eligible to become primary) initiate a new primary election.
Node priority
- Each node tends to vote for the node with the highest priority.
- A node with a priority of 0 cannot trigger a primary election.
- When the primary finds a secondary with a higher priority whose data lags behind by no more than 10s, the primary proactively steps down, giving the higher-priority secondary a chance to become primary.
Optime
Only a node holding the latest optime (the timestamp of the most recent oplog entry) can be elected primary.
Network partition
A node can be elected primary only while it remains connected to a majority of the voting nodes; if the primary loses its connection to a majority of nodes, it proactively demotes itself to secondary. When a network partition occurs, multiple primaries may exist for a short period, so it is best for the driver to use a "write succeeds on a majority" policy: even if multiple primaries appear, only one of them can successfully write to a majority.
Read and write settings for replica sets
Read Preference
By default, all read requests to a replica set are sent to the primary, but the driver can route reads to other nodes through the read preference setting:
- primary: the default; all read requests are sent to the primary
- primaryPreferred: primary first; if the primary is unreachable, request the secondaries
- secondary: all read requests are sent to secondaries
- secondaryPreferred: secondary first; when all secondaries are unreachable, request the primary
- nearest: read requests are sent to the nearest reachable node (determined by ping latency)
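In the mongo shell, the read preference of the current connection can be set with setReadPref():

// Route this connection's reads to a secondary when one is available
db.getMongo().setReadPref("secondaryPreferred")
db.products.find({type: "Clasp"})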
Write Concern
By default, the primary returns as soon as it completes the write operation; the driver can define what counts as a successful write by setting a write concern (https://docs.mongodb.org/manual/core/write-concern/).
The following write concern specifies that the write must succeed on a majority of nodes, with a timeout of 5s:

db.products.insert(
    {item: "envelopes", qty: 100, type: "Clasp"},
    {writeConcern: {w: "majority", wtimeout: 5000}}
)
The setting above applies to a single request; you can also modify the replica set's default write concern so that it need not be set on every request:

cfg = rs.conf()
cfg.settings = {}
cfg.settings.getLastErrorDefaults = {w: "majority", wtimeout: 5000}
rs.reconfig(cfg)
Exception Handling (rollback)
When a primary goes down, some of its data may not yet have been synchronized to the secondaries. If writes have already happened on the new primary by the time the old primary rejoins the set, the old primary must roll back some operations to make its data set consistent with the new primary's.
The old primary writes the rolled-back data to a separate rollback directory, and a database administrator can use mongorestore to recover it as needed.
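A hedged sketch of inspecting and restoring rollback data (the file path and collection name are illustrative; rollback files are BSON files named after the affected collection):

# Examine the rolled-back documents
bsondump /data/db/rollback/test.nosql.2015-10-28T08-33-04.0.bson
# Restore them into a separate collection for manual review
mongorestore --host rs1.example.net --port 27017 --db test --collection nosql_rollback /data/db/rollback/test.nosql.2015-10-28T08-33-04.0.bson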