The sync process
After picking a member to sync from, the syncing member pulls ops from that member's oplog. For each op, it:
1. Applies the op
2. Writes the op to its own oplog (also local.oplog.rs)
3. Requests the next op
How far has a secondary synced?
The primary can tell how far each secondary has replicated from the timestamps of the oplog entries that secondary requests.
How does the primary know where a secondary is synced to? Well, the secondary is querying the primary's oplog for more results. So, if the secondary requests an op written at 3pm, the primary knows the secondary has replicated all ops written before 3pm.
So, it goes like this (a shell sketch follows the list):
1. Do a write on the primary.
2. The write is recorded in the primary's oplog, with a field "ts" saying the write occurred at time t.
3. {getLastError: 1, w: 2} is called on the primary. The primary has done the write, so it is just waiting for one more server to get the write (w: 2).
4. The secondary queries the oplog on the primary and gets the op.
5. The secondary applies the op from time t.
6. The secondary requests ops with {ts: {$gt: t}} from the primary's oplog.
7. The primary now knows the secondary has applied up to t, because it is requesting ops > t.
8. getLastError notices that the primary and the secondary both have the write, so w: 2 is satisfied and it returns.
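To make this concrete, here is a rough mongo shell sketch of both sides; it is my illustration, not MongoDB's internal code, and the timestamp and wtimeout values are placeholders:

// What the secondary's oplog fetch looks like, run against the primary.
// 't' stands in for the last op time this secondary has applied (placeholder value).
var t = Timestamp(1355287200, 1);
var cursor = db.getSiblingDB("local").oplog.rs
    .find({ts: {$gt: t}})
    .addOption(DBQuery.Option.tailable | DBQuery.Option.awaitData); // keep waiting for new ops
while (cursor.hasNext()) printjson(cursor.next());                  // apply loop, simplified

// Meanwhile, a client write on the primary can wait for w:2 (pre-2.6 style):
db.getSiblingDB("test").foo.insert({x: 1});
db.getSiblingDB("test").runCommand({getLastError: 1, w: 2, wtimeout: 5000});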
How syncing works
If A syncs from B, and B syncs from C, how does C know where A is synced to? Look at the oplog-reading protocol:
When A syncs from B, A says to B: "I'm going to sync from your oplog; if you get a write, let me know."
B replies: "I'm not the primary, let me forward that." B then tells the primary C: "Pretend I'm A; I'm syncing from you on A's behalf." At this point B has two connections to the primary C: one of its own and one on behalf of A.
When A requests ops (writes) from B, B turns to C to fulfill A's request.
A        B        C
  <====>
  <====>   <---->

<====> is a "real" sync connection. <----> is a "ghost" connection: B's connection to C on behalf of A.
Initial sync
Initial sync runs when a member is newly added or is resynced from scratch.
It goes through the following seven steps:
1. Check the oplog. If it is not empty, this node does not initial sync; it just starts syncing normally. If the oplog is empty, then initial sync is necessary; continue to step 2.
2. Get the latest oplog time from the source member; call this time start.
3. Clone all of the data from the source member to the destination member.
4. Build indexes on the destination. In 2.0 this was part of the clone step; in 2.2 indexes are built after the data is cloned.
5. Get the latest oplog time from the sync target, which is called minValid.
6. Apply the sync target's oplog from start to minValid.
7. Become a "normal" member (transition into secondary state).
My understanding: the ops between start and minValid are the ones that arrived during the clone and have not been applied yet, i.e. the part still missing for eventual consistency, so this step is essentially an oplog replay.
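For reference, "the latest oplog time" in these steps is just the ts field of the newest entry in local.oplog.rs. A quick way to look at it in the mongo shell (my example, not from the source):

// Read the newest oplog entry on the sync source; its "ts" field is
// what the steps above call start / minValid.
var newest = db.getSiblingDB("local").oplog.rs
    .find().sort({$natural: -1}).limit(1).next();
printjson(newest.ts);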
Looking at the source, rs_initialsync.cpp, the initial sync steps are:
/**
 * Do the initial sync for this member. There are several steps to this process:
 *
 *     1. Record start time.
 *     2. Clone.
 *     3. Set minValid1 to sync target's latest op time.
 *     4. Apply ops from start to minValid1, fetching missing docs as needed.
 *     5. Set minValid2 to sync target's latest op time.
 *     6. Apply ops from minValid1 to minValid2.
 *     7. Build indexes.
 *     8. Set minValid3 to sync target's latest op time.
 *     9. Apply ops from minValid2 to minValid3.
 *
 * At that point, initial sync is finished. Note that the oplog from the sync target is applied
 * three times: step 4, 6, and 8. 4 may involve refetching, 6 should not. By the end of 6,
 * this member should have consistent data. 8 is "cosmetic," it is only to get this member
 * closer to the latest op time before it can transition to secondary state.
 */
The clone data step works like this (a runnable shell version follows the pseudocode):
for each db on sourceServer:
    for each collection in db:
        for each doc in db.collection.find():
            destinationServer.getDB(db).getCollection(collection).insert(doc)
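A rough, runnable mongo shell equivalent of that pseudocode, assuming we run it on the destination, that "sourceHost:27017" (a placeholder) is the sync source, and that the legacy shell helpers getDBNames()/getCollectionNames() are available:

// Hypothetical shell translation of the clone loop above.
var source = new Mongo("sourceHost:27017");           // connection to the sync source
source.getDBNames().forEach(function(dbName) {
    if (dbName == "local") return;                    // the local db (and its oplog) is not cloned
    var srcDb = source.getDB(dbName);
    srcDb.getCollectionNames().forEach(function(collName) {
        var dest = db.getSiblingDB(dbName).getCollection(collName);
        srcDb.getCollection(collName).find().forEach(function(doc) {
            dest.insert(doc);                         // every write is a plain insert
        });
    });
});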
Characteristics of initial sync
Upside: the data ends up more compact and uses less disk space, because every operation is an insert. Note that the padding factor is set to 1.
Downside: it is slow. Taking a write lock with fsync+lock and copying the data files over is faster.
Also, restoring with mongodump/mongorestore does not carry an oplog, so in practice it is not a good fit as a "restore from backup" strategy for seeding a member.
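The fsync+lock approach mentioned above looks roughly like this in the shell; the file copy itself happens at the OS level, outside MongoDB:

// On the sync source: flush to disk and block writes, copy the files, then unlock.
db.fsyncLock();     // shell helper for {fsync: 1, lock: 1}
// ... copy the dbpath files to the new member at the OS level ...
db.fsyncUnlock();   // release the lock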
Who to sync from
When MongoDB does initial sync, the source may be the primary or a secondary; following the nearest-member rule, it picks the closest member to sync from.
By default, the member syncs from the closest member of the set that is either the primary or another secondary with more recent oplog entries. This prevents two secondaries from syncing from each other.
http://docs.mongodb.org/manual/core/replication-internals/
For example, the log line mentioned in the previous post: [rsSync] replSet syncing to: 10.0.0.106:20011
Here "syncing to" actually means "syncing from"; the wording survives for backwards-compatibility reasons, or as Kristina Chodorow put it, "Backwards compatibility sucks."
A replica set picks the nearest member (by ping time), choosing the sync source with the following algorithm:
for each member that is healthy:
    if member[state] == PRIMARY
        add to set of possible sync targets
    if member[lastOpTimeWritten] > our[lastOpTimeWritten]
        add to set of possible sync targets
sync target = member with the min ping time from the possible sync targets
What counts as "healthy" differs between versions, but the goal is always to find members that are running normally. In 2.0, the check also factored in slave delay.
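As an illustration (my own sketch, not MongoDB code), the gist of this selection can be approximated in the shell from replSetGetStatus output, assuming your version reports optime (as a Timestamp) and pingMs for each member:

// Approximate the sync-source choice using rs.status() output.
function tsAfter(a, b) {                 // compare shell Timestamps: seconds, then increment
    return a.t > b.t || (a.t == b.t && a.i > b.i);
}
function pickSyncSource() {
    var s = rs.status();
    var me = s.members.filter(function(m) { return m.self; })[0];
    var candidates = s.members.filter(function(m) {
        if (m.self || m.health != 1) return false;   // healthy members only
        if (m.state == 1) return true;               // the primary is always a candidate
        return tsAfter(m.optime, me.optime);         // secondaries must be ahead of us
    });
    candidates.sort(function(a, b) { return a.pingMs - b.pingMs; });
    return candidates.length ? candidates[0].name : null;
}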
Run db.adminCommand({replSetGetStatus: 1}) or rs.status() on a secondary to inspect the members' current state; the syncingTo field shows which member this secondary is syncing from.
2.2 adds the replSetSyncFrom command, which lets you specify the member to sync from:
db.adminCommand( { replSetSyncFrom: "[hostname]:[port]" } )
or
rs.syncFrom("[hostname]:[port]")
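For example, after pointing a secondary at a specific source (the host:port here is a placeholder), you can confirm it took effect from the status output:

rs.syncFrom("10.0.0.106:20011");   // redirect this secondary's sync source
rs.status().syncingTo;             // should now report "10.0.0.106:20011"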
How is the nearest member chosen? Look at the source, taking the latest 2.2.2 as an example: mongodb-src-r2.2.2/src/mongo/db/repl/rs_initialsync.cpp
Member* ReplSetImpl::getMemberToSyncTo() {
    lock lk(this);

    bool buildIndexes = true;

    // if we have a target we've requested to sync from, use it
    if (_forceSyncTarget) {
        Member* target = _forceSyncTarget;
        _forceSyncTarget = 0;
        sethbmsg( str::stream() << "syncing to: " << target->fullName() << " by request", 0);
        return target;
    }

    Member* primary = const_cast<Member*>(box.getPrimary());

    // wait for 2N pings before choosing a sync target
    if (_cfg) {
        int needMorePings = config().members.size()*2 - HeartbeatInfo::numPings;

        if (needMorePings > 0) {
            OCCASIONALLY log() << "waiting for " << needMorePings << " pings from other members before syncing" << endl;
            return NULL;
        }

        buildIndexes = myConfig().buildIndexes;

        // If we are only allowed to sync from the primary, return that
        if (!_cfg->chainingAllowed()) {
            // Returns NULL if we cannot reach the primary
            return primary;
        }
    }

    // find the member with the lowest ping time that has more data than me

    // Find primary's oplog time. Reject sync candidates that are more than
    // MAX_SLACK_TIME seconds behind.
    OpTime primaryOpTime;
    static const unsigned maxSlackDurationSeconds = 10 * 60; // 10 minutes
    if (primary)
        primaryOpTime = primary->hbinfo().opTime;
    else
        // choose a time that will exclude no candidates, since we don't see a primary
        primaryOpTime = OpTime(maxSlackDurationSeconds, 0);

    if ( primaryOpTime.getSecs() < maxSlackDurationSeconds ) {
        // erh - I think this means there was just a new election
        // and we don't yet know the new primary's optime
        primaryOpTime = OpTime(maxSlackDurationSeconds, 0);
    }

    OpTime oldestSyncOpTime(primaryOpTime.getSecs() - maxSlackDurationSeconds, 0);

    Member *closest = 0;
    time_t now = 0;

    // Make two attempts. The first attempt, we ignore those nodes with
    // slave delay higher than our own. The second attempt includes such
    // nodes, in case those are the only ones we can reach.
    // This loop attempts to set 'closest'.
    for (int attempts = 0; attempts < 2; ++attempts) {
        for (Member *m = _members.head(); m; m = m->next()) {
            if (!m->hbinfo().up())
                continue;

            // make sure members with buildIndexes sync from other members w/indexes
            if (buildIndexes && !m->config().buildIndexes)
                continue;

            if (!m->state().readable())
                continue;

            if (m->state() == MemberState::RS_SECONDARY) {
                // only consider secondaries that are ahead of where we are
                if (m->hbinfo().opTime <= lastOpTimeWritten)
                    continue;
                // omit secondaries that are excessively behind, on the first attempt at least.
                if (attempts == 0 &&
                    m->hbinfo().opTime < oldestSyncOpTime)
                    continue;
            }

            // omit nodes that are more latent than anything we've already considered
            if (closest &&
                (m->hbinfo().ping > closest->hbinfo().ping))
                continue;

            if (attempts == 0 &&
                (myConfig().slaveDelay < m->config().slaveDelay || m->config().hidden)) {
                continue; // skip this one in the first attempt
            }

            map<string,time_t>::iterator vetoed = _veto.find(m->fullName());
            if (vetoed != _veto.end()) {
                // Do some veto housekeeping
                if (now == 0) {
                    now = time(0);
                }

                // if this was on the veto list, check if it was vetoed in the last "while".
                // if it was, skip.
                if (vetoed->second >= now) {
                    if (time(0) % 5 == 0) {
                        log() << "replSet not trying to sync from " << (*vetoed).first
                              << ", it is vetoed for " << ((*vetoed).second - now) << " more seconds" << rsLog;
                    }
                    continue;
                }
                _veto.erase(vetoed);
                // fall through, this is a valid candidate now
            }

            // This candidate has passed all tests; set 'closest'
            closest = m;
        }
        if (closest) break; // no need for second attempt
    }

    if (!closest) {
        return NULL;
    }

    sethbmsg( str::stream() << "syncing to: " << closest->fullName(), 0);

    return closest;
}