Synchronization process
After selecting the node to pull the oplog from, the secondary loops through three steps:
1. Applies the op.
2. Writes the op to its own oplog (also local.oplog.rs).
3. Requests the next op.
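The loop above can be sketched in a few lines of Python. This is an illustrative toy model, not MongoDB's implementation: the function name, the key/value data model, and the op fields are all assumptions made for the example.

```python
# Toy sketch of the secondary's pull/apply loop (names and data model are
# illustrative, not MongoDB's): apply the op, append it to the local oplog,
# then ask the sync source for the next op.

def pull_oplog(source_oplog, local_oplog, data, last_ts):
    """Replicate every op with ts > last_ts from the sync source, in order."""
    for op in source_oplog:
        if op["ts"] <= last_ts:            # already replicated, skip
            continue
        data[op["key"]] = op["value"]      # 1. apply the op
        local_oplog.append(op)             # 2. write it to our own oplog (local.oplog.rs)
        last_ts = op["ts"]                 # 3. the next request asks for ts > last_ts
    return last_ts

# Example: replay two ops on an empty secondary.
source = [{"ts": 1, "key": "a", "value": 1},
          {"ts": 2, "key": "b", "value": 2}]
data, local_oplog = {}, []
last_ts = pull_oplog(source, local_oplog, data, last_ts=0)
```

After the call, the secondary's data and oplog mirror the source, and `last_ts` marks where the next fetch should resume.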
How does the primary know where a secondary has synced to?
The primary learns the secondary's position from the oplog timestamps the secondary requests. The secondary is continuously querying the primary's oplog for more results, so if the secondary requests an op written at time 3, the primary knows the secondary has already replicated all ops written before 3.
So, it goes like:
1. Do a write on primary.
2. Write is written to the oplog on primary, with a field "ts" saying the write occurred at time t.
3. {getLastError: 1, w: 2} is called on primary. primary has done the write, so it is just waiting for one more server to get the write (w: 2).
4. secondary queries the oplog on primary and gets the op
5. secondary applies the op from time t
6. secondary requests ops with {ts: {$gt: t}} from primary's oplog
7. primary updates that secondary has applied up to t, because it is requesting ops > t.
8. getLastError notices that primary and secondary both have the write, so w: 2 is satisfied and it returns.
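The eight steps above can be condensed into a small simulation. This is a toy model under my own assumptions (the `Primary` class, `query_oplog`, and `acked` are invented names): it shows how a request for ops with ts > t doubles as an acknowledgement that lets w: 2 be satisfied.

```python
# Toy model (invented names, not MongoDB's API) of how the primary infers a
# secondary's replication position from the oplog queries it receives.

class Primary:
    def __init__(self):
        self.oplog = []    # list of {"ts": ...} entries
        self.acked = {}    # member -> highest ts known to be replicated

    def write(self, ts):
        self.oplog.append({"ts": ts})

    def query_oplog(self, member, after_ts):
        # A request for ops with ts > after_ts proves the member already
        # has everything up to and including after_ts.
        self.acked[member] = after_ts
        return [op for op in self.oplog if op["ts"] > after_ts]

    def get_last_error(self, ts, w):
        # The primary itself counts as one copy of the write.
        copies = 1 + sum(1 for a in self.acked.values() if a >= ts)
        return copies >= w

p = Primary()
p.write(ts=3)                    # steps 1-2: write at time t = 3
p.query_oplog("secondaryA", 0)   # steps 4-5: secondary fetches and applies the op
p.query_oplog("secondaryA", 3)   # step 6: requests ops > 3, implicitly acking t = 3
ok = p.get_last_error(3, w=2)    # step 8: primary + secondaryA -> w: 2 satisfied
```

The key design point is that no separate acknowledgement message is needed: the read position of the secondary's oplog cursor is itself the acknowledgement.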
Synchronization principle
If A synchronizes data from B, and B from C, how does C know where A has synced to? Look at the oplog reading protocol:
When A starts syncing from B, A says to B: "I am going to sync from your oplog; let me know about any write operations."
B replies: "I am not the primary, but I can forward for you," and tells the primary C: "Hi, I'm B, and I will sync from you on behalf of A." At this point B has two connections to the primary C: one for B itself, and one on behalf of A.
Whenever A requests ops (write operations) from B, B uses A's connection to C to report A's position, completing A's request.
A B C
<====>
<====> <---->
<====> is A's "real" sync connection. <----> is the "ghost" connection, through which B represents A to C.
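A toy model of the ghost connection (the class and method names here are hypothetical, chosen only for illustration): each intermediate node forwards a downstream member's reported position to its own upstream, so the primary C can track A without a direct connection from A.

```python
# Hypothetical sketch of chained replication A -> B -> C: B keeps a "ghost"
# connection to C on A's behalf and forwards A's sync position upstream.

class Node:
    def __init__(self, name, upstream=None):
        self.name = name
        self.upstream = upstream    # the node this node syncs from (None for primary)
        self.positions = {}         # downstream member -> last reported ts

    def report_position(self, member, ts):
        self.positions[member] = ts
        if self.upstream is not None:
            # ghost connection: relay the position toward the primary
            self.upstream.report_position(member, ts)

c = Node("C")                 # primary
b = Node("B", upstream=c)     # syncs from C
b.report_position("A", 42)    # A tells B it has replicated up to ts 42
```

After the call, both B and C know that A has reached ts 42, which is what allows w-based write concerns to be satisfied across a sync chain.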
Initial sync
When a new member is added, or an existing member redoes its synchronization, an initial sync is performed, in seven steps:
1. Check the oplog. If it is not empty, this node does not initial sync; it just starts syncing normally. If the oplog is empty, initial sync is necessary; continue to step 2.
2. Get the latest oplog time from the source member; call this time start.
3. Clone all of the data from the source member to the destination member.
4. Build indexes on the destination. (In version 2.0 this happened during the clone step; since 2.2 indexes are built after the data is copied.)
5. Get the latest oplog time from the sync target, called minValid.
6. Apply the sync target's oplog from start to minValid.
7. Become a "normal" member (transition into SECONDARY state).
As I understand it, applying the oplog between start and minValid replays the operations that happened while the clone was running; without this replay the copied data would not reach consistency.
See the source code in rs_initialsync.cpp; the comment there describes the initial-sync steps:
/**
 * Do the initial sync for this member.  There are several steps to this process:
 *
 *   1. Record start time.
 *   2. Clone.
 *   3. Set minValid1 to sync target's latest op time.
 *   4. Apply ops from start to minValid1, fetching missing docs as needed.
 *   5. Set minValid2 to sync target's latest op time.
 *   6. Apply ops from minValid1 to minValid2.
 *   7. Build indexes.
 *   8. Set minValid3 to sync target's latest op time.
 *   9. Apply ops from minValid2 to minValid3.
 *
 * At that point, initial sync is finished.  Note that the oplog from the sync target is applied
 * three times: steps 4, 6, and 8.  4 may involve refetching, 6 should not.  By the end of 6,
 * this member should have consistent data.  8 is "cosmetic," it is only to get this member
 * closer to the latest op time before it can transition to secondary state.
 */
The clone step copies the data roughly like this:

for each db on sourceServer:
    for each collection in db:
        for each doc in db.collection.find():
            destinationServer.getDB(db).getCollection(collection).insert(doc)
Characteristics of initial sync
Benefit: the data ends up more compact, saving disk space, because all operations are inserts (note that the padding factor is 1).
Drawback: synchronization is slow. Copying the data files directly (under fsync plus a lock, or a write lock) is faster.
In addition, mongodump/mongorestore is not well suited for seeding a member, because the dump does not include the oplog.
Who to sync from
During both initial sync and normal replication, MongoDB may sync from the primary or from a secondary, choosing the nearest node. By default, a member syncs from the closest member of the set that is either the primary or another secondary with more recent oplog entries; this prevents two secondaries from syncing from each other.
http://docs.mongodb.org/manual/core/replication-internals/
As the log quoted in the previous article showed: [rsSync] replSet syncing to: 10.0.0.106:20011
Here "syncing to" actually means syncing from; the wording is kept for backwards-compatibility reasons. As Kristina Chodorow put it, "backwards compatibility sucks".
Replica sets select the nearest node (based on ping time) using the following algorithm:

for each member that is healthy:
    if member[state] == PRIMARY
        add to set of possible sync targets
    if member[lastOpTimeWritten] > our[lastOpTimeWritten]
        add to set of possible sync targets

sync target = member with the min ping time from the possible sync targets
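The pseudocode above can be turned into a runnable sketch. This is a deliberate simplification: it ignores slave delay, hidden members, and the veto list that the real getMemberToSyncTo (shown further below) also handles, and the field names are my own.

```python
# Simplified, runnable version of the sync-target selection pseudocode
# (field names are illustrative; the real code also checks slave delay,
# hidden members, and vetoes).

def choose_sync_target(members, my_last_op):
    # A healthy member qualifies if it is the primary, or a member that is
    # ahead of us in the oplog.
    candidates = [m for m in members
                  if m["healthy"] and (m["state"] == "PRIMARY"
                                       or m["lastOpTimeWritten"] > my_last_op)]
    if not candidates:
        return None
    # Among the candidates, pick the one with the lowest ping time.
    return min(candidates, key=lambda m: m["ping"])

members = [
    {"name": "p",  "state": "PRIMARY",   "healthy": True, "lastOpTimeWritten": 10, "ping": 5},
    {"name": "s1", "state": "SECONDARY", "healthy": True, "lastOpTimeWritten": 9,  "ping": 1},
    {"name": "s2", "state": "SECONDARY", "healthy": True, "lastOpTimeWritten": 7,  "ping": 1},
]
# With our own last op time at 8, s1 is ahead of us and has the lowest ping.
target = choose_sync_target(members, my_last_op=8)
```

Note how a secondary that is behind us (s2) never qualifies, which is exactly what prevents two secondaries from syncing from each other.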
Versions differ in how they judge whether a node is healthy, but the goal is the same: find a healthy node. In version 2.0, the check also takes slave delay into account.
Run db.adminCommand({replSetGetStatus: 1}) or rs.status() on a secondary to view the current node status; the syncingTo field shows which member this secondary syncs from.
Version 2.2 adds the replSetSyncFrom command to specify the node from which to sync:

db.adminCommand({replSetSyncFrom: "[hostname]:[port]"})

or

rs.syncFrom("[hostname]:[port]")
How is the nearest node selected? See the source code, taking the latest 2.2.2 as an example: mongodb-src-r2.2.2/src/mongo/db/repl/rs_initialsync.cpp
Member* ReplSetImpl::getMemberToSyncTo() {
    lock lk(this);

    bool buildIndexes = true;

    // if we have a target we've requested to sync from, use it
    if (_forceSyncTarget) {
        Member* target = _forceSyncTarget;
        _forceSyncTarget = 0;
        sethbmsg( str::stream() << "syncing to: " << target->fullName() << " by request", 0);
        return target;
    }

    Member* primary = const_cast<Member*>(box.getPrimary());

    // wait for 2N pings before choosing a sync target
    if (_cfg) {
        int needMorePings = config().members.size()*2 - HeartbeatInfo::numPings;

        if (needMorePings > 0) {
            OCCASIONALLY log() << "waiting for " << needMorePings << " pings from other members before syncing" << endl;
            return NULL;
        }

        buildIndexes = myConfig().buildIndexes;

        // If we are only allowed to sync from the primary, return that
        if (!_cfg->chainingAllowed()) {
            // Returns NULL if we cannot reach the primary
            return primary;
        }
    }

    // find the member with the lowest ping time that has more data than me

    // Find primary's oplog time. Reject sync candidates that are more than
    // MAX_SLACK_TIME seconds behind.
    OpTime primaryOpTime;
    static const unsigned maxSlackDurationSeconds = 10 * 60; // 10 minutes
    if (primary)
        primaryOpTime = primary->hbinfo().opTime;
    else
        // choose a time that will exclude no candidates, since we don't see a primary
        primaryOpTime = OpTime(maxSlackDurationSeconds, 0);

    if (primaryOpTime.getSecs() < maxSlackDurationSeconds) {
        // erh - I think this means there was just a new election
        // and we don't yet know the new primary's optime
        primaryOpTime = OpTime(maxSlackDurationSeconds, 0);
    }

    OpTime oldestSyncOpTime(primaryOpTime.getSecs() - maxSlackDurationSeconds, 0);

    Member* closest = 0;
    time_t now = 0;

    // Make two attempts.  The first attempt, we ignore those nodes with
    // slave delay higher than our own.  The second attempt includes such
    // nodes, in case those are the only ones we can reach.
    // This loop attempts to set 'closest'.
    for (int attempts = 0; attempts < 2; ++attempts) {
        for (Member* m = _members.head(); m; m = m->next()) {
            if (!m->hbinfo().up())
                continue;
            // make sure members with buildIndexes sync from other members w/indexes
            if (buildIndexes && !m->config().buildIndexes)
                continue;

            if (!m->state().readable())
                continue;

            if (m->state() == MemberState::RS_SECONDARY) {
                // only consider secondaries that are ahead of where we are
                if (m->hbinfo().opTime <= lastOpTimeWritten)
                    continue;
                // omit secondaries that are excessively behind, on the first attempt at least.
                if (attempts == 0 &&
                    m->hbinfo().opTime < oldestSyncOpTime)
                    continue;
            }

            // omit nodes that are more latent than anything we've already considered
            if (closest &&
                (m->hbinfo().ping > closest->hbinfo().ping))
                continue;

            if (attempts == 0 &&
                (myConfig().slaveDelay < m->config().slaveDelay || m->config().hidden)) {
                continue; // skip this one in the first attempt
            }

            map<string, time_t>::iterator vetoed = _veto.find(m->fullName());
            if (vetoed != _veto.end()) {
                // Do some veto housekeeping
                if (now == 0) {
                    now = time(0);
                }

                // if this member was on the veto list, check if it was vetoed in
                // the last "while".  if it was, skip.
                if (vetoed->second >= now) {
                    if (time(0) % 5 == 0) {
                        log() << "replSet not trying to sync from " << (*vetoed).first
                              << ", it is vetoed for " << ((*vetoed).second - now) << " more seconds" << rsLog;
                    }
                    continue;
                }
                _veto.erase(vetoed);
                // fall through, this is a valid candidate now
            }
            // This candidate has passed all tests; set 'closest'
            closest = m;
        }
        if (closest) break; // no need for second attempt
    }

    if (!closest) {
        return NULL;
    }

    sethbmsg( str::stream() << "syncing to: " << closest->fullName(), 0);

    return closest;
}