Version: 2.2.0
Bug involved: SERVER-5351
Operation: move a data chunk from one shard to another (described below as: from shard shard3_2 to shard shard1_1).

Symptoms:

1) The moveChunk command issued on mongos hung. The following appeared in the log:
Tue Mar 4 20:34:12 [conn25194] CMD: movechunk: { moveChunk: "archive.archive", to: "shard1_1", find: { uid: -2130706432 } }
Tue Mar 4 20:34:12 [conn25194] moving chunk ns: archive.archive moving ( ns:archive.archive at: shard3_2:shard3_2/192.168.130.13:15001,192.168.130.14:15002,10.1.2.13:15003 lastmod: 1|2||000000000000000000000000 min: { uid: -2130706432 } max: { uid: MaxKey }) shard3_2:shard3_2/192.168.130.13:15001,192.168.130.14:15002,10.1.2.13:15003 -> shard1_1:shard1_1/10.1.2.12:13003,192.168.130.11:13001,192.168.130.12:13002
Tue Mar 4 20:34:20 [LockPinger] cluster 192.168.130.14:9001,192.168.130.15:9001,192.168.130.16:9001 pinged successfully at Tue Mar 4 20:34:20 2014 by distributed lock pinger '192.168.130.14:9001,192.168.130.15:9001,192.168.130.16:9001/server-a08:27227:1392462352:1804289383', sleeping for 30000ms
2) In shard3_2's primary log, entries like the following appeared, and they kept coming for 10 hours:
Wed Mar 5 05:34:57 [conn4754510] moveChunk data transfer progress: { active: true, ns: "archive.archive", from: "shard3_2/192.168.130.13:15001,192.168.130.14:15002,10.1.2.13:15003", min: { uid: -2130706432 }, max: { uid: MaxKey }, shardKeyPattern: { uid: 1 }, state: "catchup", counts: { cloned: 0, clonedBytes: 0, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
At the end there is a failure log:
Wed Mar 5 06:40:13 [conn4754510] distributed lock 'archive.archive/server-a03:15001:1349773025:1657874366' unlocked.
Wed Mar 5 06:40:13 [conn4754510] about to log metadata event: { _id: "server-a03-2014-03-04T22:40:13-163", server: "server-a03", clientAddr: "192.168.130.18:21606", time: new Date(1393972813700), what: "moveChunk.from", ns: "archive.archive", details: { min: { uid: -2130706432 }, max: { uid: MaxKey }, step1 of 6: 1, step2 of 6: 442, step3 of 6: 9, step4 of 6: 36000896, note: "aborted" } }
Wed Mar 5 06:40:13 [conn4754510] command admin.$cmd command: { moveChunk: "archive.archive", from: "shard3_2/192.168.130.13:15001,192.168.130.14:15002,10.1.2.13:15003", to: "shard1_1/10.1.2.12:13003,192.168.130.11:13001,192.168.130.12:13002", fromShard: "shard3_2", toShard: "shard1_1", min: { uid: -2130706432 }, max: { uid: MaxKey }, maxChunkSizeBytes: 134217728, shardId: "archive.archive-uid_-2130706432", configdb: "192.168.130.14:9001,192.168.130.15:9001,192.168.130.16:9001", secondaryThrottle: false, $auth: { local: { __system: 2 } } } ntoreturn:1 keyUpdates:0 locks(micros) r:66 w:28 reslen:398 36361617ms
3) In shard1_1's primary log, entries like the following appeared, likewise for 10 hours:
Tue Mar 4 22:14:47 [migrateThread] warning: migrate commit waiting for 3 slaves for 'archive.archive' { uid: -2130706432 } -> { uid: MaxKey } waiting for: 5315c844:6
At the end there is a failure log:
Wed Mar 5 06:40:13 [migrateThread] about to log metadata event: { _id: "server-a01-2014-03-04T22:40:13-161", server: "server-a01", clientAddr: ":27017", time: new Date(1393972813571), what: "moveChunk.to", ns: "archive.archive", details: { min: { uid: -2130706432 }, max: { uid: MaxKey }, step1 of 5: 9, step2 of 5: 0, step3 of 5: 0, step4 of 5: 0, note: "aborted" } }
4) There was nothing at all in the logs of shard1_1's secondaries, which shows that chunk migration is effectively transparent to secondaries.

To sum up (below, shardx_y's primary is written shardx_y_pri and its secondary shardx_y_bjs), the flow is this: the chunk is to move from shard3_2 to shard1_1. By MongoDB's design, shard1_1_pri pulls the data from shard3_2_pri, and shard1_1_pri does not report success until the data has replicated to some number of its secondaries. In our migration, shard1_1's primary kept waiting for the oplog to finish replicating to the secondaries, per the log line "warning: migrate commit waiting for 3 slaves for". Meanwhile, shard3_2 kept polling shard1_1 for the migration's status.
Questions:

1) The log line
Tue Mar 4 22:14:47 [migrateThread] warning: migrate commit waiting for 3 slaves for 'archive.archive' { uid: -2130706432 } -> { uid: MaxKey } waiting for: 5315c844:6
ends with a timestamp: a MongoDB oplog timestamp. The primary is waiting for the oplog entry with this timestamp to be written to the secondaries. But when I looked at the data on the secondaries, other ordinary database operations (unrelated to the moveChunk) had long since replicated there, so that oplog entry must already have been applied; there is no replication-lag problem.

2) A 10gen engineer (a programmer who graduated from BUPT) replied to this on the Google group: the failure probably triggers a bug in MongoDB 2.2.0, and she suggested we upgrade to 2.2.7. See https://jira.mongodb.org/browse/SERVER-5351

3) So I searched the logs of a previously successful moveChunk, and found:
Wed Aug 21 01:31:35 [migrateThread] warning: migrate commit waiting for 2 slaves for 'post.post' { uid: 1107296256 } -> { uid: MaxKey } waiting for: 5213a7f7:4
Strange: we clearly have 3 secondaries, so why does this one say 2?

"waiting for 2 slaves for 'post.post'" (normal)
"waiting for 3 slaves for 'archive.archive'" (abnormal)

4) Same cluster, same replica set: why is it 2 in one case and 3 in the other? So I traced some code:
// d_migrate.cpp
class MigrateStatus {
    void _go() {
        // ......
        // pause to wait for replication
        Timer t;
        while ( t.minutes() < 600 ) { // loop for up to 10 hours, checking replication progress
            if ( flushPendingWrites( lastOpApplied ) )
                break;
            sleepsecs(1);
        }
    }

    bool flushPendingWrites( const ReplTime& lastOpApplied ) {
        if ( ! opReplicatedEnough( lastOpApplied ) ) {
            OpTime op( lastOpApplied );
            // this is where the "waiting for 3 slaves" warning is logged
            OCCASIONALLY warning() << "migrate commit waiting for " << slaveCount
                << " slaves for '" << ns << "' " << min << " -> " << max
                << " waiting for: " << op << migrateLog;
            return false;
        }
        // ...
    }

    // The two functions below check replication progress. Note that slaveCount
    // is passed by value, not by reference: it was computed by earlier code,
    // and opReplicatedEnough does not modify it.
    bool opReplicatedEnough( const ReplTime& lastOpApplied ) {
        return mongo::opReplicatedEnough( lastOpApplied , slaveCount );
    }

    bool opReplicatedEnough( OpTime op , BSONElement w ) {
        return slaveTracking.opReplicatedEnough( op , w );
    }
};
In flushPendingWrites you can see which variable gets printed as the slave count. The key question is how slaveCount, the value passed into mongo::opReplicatedEnough, is computed. Its computation can be found:

slaveCount = ( getSlaveCount() / 2 ) + 1; (in 2.2.7, slaveCount is no longer computed with this function)

In other words, the primary only waits for a majority of the secondaries to finish syncing. Our replica set has 3 secondaries, so a normal log line should print 3/2+1 = 2. But when the failure occurred, it printed 3, which is 4/2+1 or 5/2+1. That is, this mongod believed it had 4 or 5 secondaries; we can guess that something had gone wrong inside the database.
Handling:

Judging by this, perhaps the only real fix is to upgrade the database as the 10gen engineer suggested. But an upgrade cannot happen quickly: our cluster runs over a hundred mongod processes, so we had to find some other way to cope with the bug first.

1) Look at where slaveCount comes in: mongo::opReplicatedEnough( lastOpApplied , slaveCount ); This function returns true if slaveCount secondaries have an opTime greater than lastOpApplied. Digging into it, I found this code:
bool _replicatedToNum_slaves_locked(OpTime& op, int numSlaves ) {
    for ( map<Ident,OpTime>::iterator i = _slaves.begin(); i != _slaves.end(); i++ ) {
        OpTime s = i->second;
        if ( s < op ) {
            continue;
        }
        if ( --numSlaves == 0 )
            return true;
    }
    return numSlaves <= 0;
}
And here it becomes clear: the member _slaves is what stores the secondaries' information. Its definition: map<Ident,OpTime> _slaves;

2) At this point we can be fairly sure that in the faulty mongod, _slaves holds 4 or 5 entries. So I thought: if there were some way to empty or reset _slaves, that would be ideal. I found the places that modify _slaves:
// repl_block.cpp:111
void reset() {
    if ( _currentlyUpdatingCache )
        return;
    scoped_lock mylk(_mutex);
    _slaves.clear(); // clear
}

// repl_block.cpp:118
void update( const BSONObj& rid , const string& host , const string& ns , OpTime last ) {
    REPLDEBUG( host << " " << rid << " " << ns << " " << last );
    Ident ident(rid,host,ns);
    scoped_lock mylk(_mutex);
    _slaves[ident] = last; // modify
    _dirty = true;
    // .....
}
These are the only two places that directly modify _slaves. Reading on through the code, _slaves is reset when the database shuts down and when the replica set configuration is changed. Because there are quite a few call paths above _slaves.clear(), and I wanted to look for a simpler method and to confirm that a restart or a replica set reconfiguration really does reset _slaves, I went ahead and added log statements at several places in the source, recompiled mongodb, and tested it on a test cluster:
src/mongo/db/repl_block.cpp:115:log(LL_ERROR) << "[yyy] clear _slave in " << " void reset() _slave.clear()\n";
src/mongo/db/repl_block.cpp:126:log(LL_ERROR) << "[yyy] ident:" << rid.toString(false, false) << " " << host << " " << ns << " | last:" << last;
src/mongo/db/repl/replset_commands.cpp:215:log(LL_ERROR) << "[yyy] clear _slave in " << " CmdReplSetReconfig";
src/mongo/db/repl/rs_config.cpp:72:log(LL_ERROR) << "[yyy] clear _slave in " << "saveConfigLocally" << rsLog;
src/mongo/db/oplog.cpp:50:log(LL_ERROR) << "[yyy] clear _slave in " << " close database";
In the end this confirmed the operations that reset _slaves: mongod startup and replica set reconfiguration.

3) Putting the reasoning and the tests together, we concluded that restarting a mongod lets it rebuild _slaves to match reality; otherwise, in normal operation, the only other trigger is a replica set reconfiguration. Since changing the configuration carries more risk, we decided to handle it by restarting the mongod.

4) Because our cluster takes heavy traffic, to be safe we operated in the small hours: step the primary down - wait for the switchover and for sync - shut down the old primary - restart it - wait for sync - switch the primary back - wait for the switchover and for sync - moveChunk - success: { "millis" : 2977, "ok" : 1 }!
Follow-up: I have traced a lot of MongoDB code these past couple of days; a deeper code analysis will follow in a later post.