MongoDB moveChunk Fault Analysis and Handling (SERVER-5351)


Version: 2.2.0
Involved error: SERVER-5351
Operation: move a data chunk from one shard to another (in this case, move a chunk from shard3_2 to shard1_1).
Symptom: 1) The moveChunk command issued on mongos hangs after execution, and the following log entries are found:

Tue Mar  4 20:34:12 [conn25194] CMD: movechunk: { moveChunk: "archive.archive", to: "shard1_1", find: { uid: -2130706432 } }
Tue Mar  4 20:34:12 [conn25194] moving chunk ns: archive.archive moving ( ns:archive.archive at: shard3_2:shard3_2/192.168.130.13:15001,192.168.130.14:15002,10.1.2.13:15003 lastmod: 1|2||000000000000000000000000 min: { uid: -2130706432 } max: { uid: MaxKey }) shard3_2:shard3_2/192.168.130.13:15001,192.168.130.14:15002,10.1.2.13:15003 -> shard1_1:shard1_1/10.1.2.12:13003,192.168.130.11:13001,192.168.130.12:13002
Tue Mar  4 20:34:20 [LockPinger] cluster 192.168.130.14:9001,192.168.130.15:9001,192.168.130.16:9001 pinged successfully at Tue Mar  4 20:34:20 2014 by distributed lock pinger '192.168.130.14:9001,192.168.130.15:9001,192.168.130.16:9001/server-a08:27227:1392462352:1804289383', sleeping for 30000ms
2) The following record is found in the primary log of shard3_2, and it keeps repeating for about 10 hours:
Wed Mar  5 05:34:57 [conn4754510] moveChunk data transfer progress: { active: true, ns: "archive.archive", from: "shard3_2/192.168.130.13:15001,192.168.130.14:15002,10.1.2.13:15003", min: { uid: -2130706432 }, max: { uid: MaxKey }, shardKeyPattern: { uid: 1 }, state: "catchup", counts: { cloned: 0, clonedBytes: 0, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
Finally, there is a failure log:
Wed Mar  5 06:40:13 [conn4754510] distributed lock 'archive.archive/server-a03:15001:1349773025:1657874366' unlocked.
Wed Mar  5 06:40:13 [conn4754510] about to log metadata event: { _id: "server-a03-2014-03-04T22:40:13-163", server: "server-a03", clientAddr: "192.168.130.18:21606", time: new Date(1393972813700), what: "moveChunk.from", ns: "archive.archive", details: { min: { uid: -2130706432 }, max: { uid: MaxKey }, step1 of 6: 1, step2 of 6: 442, step3 of 6: 9, step4 of 6: 36000896, note: "aborted" } }
Wed Mar  5 06:40:13 [conn4754510] command admin.$cmd command: { moveChunk: "archive.archive", from: "shard3_2/192.168.130.13:15001,192.168.130.14:15002,10.1.2.13:15003", to: "shard1_1/10.1.2.12:13003,192.168.130.11:13001,192.168.130.12:13002", fromShard: "shard3_2", toShard: "shard1_1", min: { uid: -2130706432 }, max: { uid: MaxKey }, maxChunkSizeBytes: 134217728, shardId: "archive.archive-uid_-2130706432", configdb: "192.168.130.14:9001,192.168.130.15:9001,192.168.130.16:9001", secondaryThrottle: false, $auth: { local: { __system: 2 } } } ntoreturn:1 keyUpdates:0 locks(micros) r:66 w:28 reslen:398 36361617ms
3) The following record is found in the primary log of shard1_1, and it also keeps repeating for about 10 hours:
Tue Mar  4 22:14:47 [migrateThread] warning: migrate commit waiting for 3 slaves for 'archive.archive' { uid: -2130706432 } -> { uid: MaxKey } waiting for: 5315c844:6
Last failure log:
Wed Mar  5 06:40:13 [migrateThread] about to log metadata event: { _id: "server-a01-2014-03-04T22:40:13-161", server: "server-a01", clientAddr: ":27017", time: new Date(1393972813571), what: "moveChunk.to", ns: "archive.archive", details: { min: { uid: -2130706432 }, max: { uid: MaxKey }, step1 of 5: 9, step2 of 5: 0, step3 of 5: 0, step4 of 5: 0, note: "aborted" } }
4) There is no related log on the secondaries of shard1_1, which indicates that chunk migration is transparent to the secondaries. To sum up (writing the primary of shardx_y as shardx_y_pri and the secondary as shardx_y_bjs), the process is as follows: the chunk is to be moved from shard3_2 to shard1_1. By MongoDB's design, the data is pulled by shard1_1_pri from shard3_2_pri, and shard1_1_pri only reports success after the data has been replicated to enough of its secondaries. In our migration, the primary of shard1_1 kept waiting for the oplog to be synchronized to its secondaries, as shown by the log "warning: migrate commit waiting for 3 slaves", while shard3_2 kept polling the migration status on shard1_1.
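Incidentally, the value after "waiting for:" in that warning ("5315c844:6") is an oplog OpTime, printed as hexadecimal seconds plus an increment counter. A minimal standalone sketch to decode it (assuming that <hex seconds>:<increment> formatting; this is illustrative code of my own, not MongoDB source):

#include <cstdint>
#include <cstdio>
#include <ctime>
#include <string>

int main() {
    // OpTime string copied from the shard1_1 warning log.
    const std::string optime = "5315c844:6";
    const std::size_t colon = optime.find(':');
    // Seconds since the epoch, printed in hex by the server.
    const uint32_t secs = static_cast<uint32_t>(std::stoul(optime.substr(0, colon), nullptr, 16));
    // Ordinal of the operation within that second.
    const uint32_t inc = static_cast<uint32_t>(std::stoul(optime.substr(colon + 1)));

    std::time_t t = static_cast<std::time_t>(secs);
    char buf[64];
    std::strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S UTC", std::gmtime(&t));
    std::printf("seconds=%u (%s), increment=%u\n", static_cast<unsigned>(secs), buf, static_cast<unsigned>(inc));
    return 0;
}

Decoded this way, 0x5315c844 is 2014-03-04 12:34:12 UTC, which lines up with the local start time of the migration (Tue Mar 4 20:34:12, assuming the servers run at UTC+08:00); that is, the destination primary is waiting on an oplog entry written right at the beginning of the migration.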
Questions: 1) In the log
Tue Mar  4 22:14:47 [migrateThread] warning: migrate commit waiting for 3 slaves for 'archive.archive' { uid: -2130706432 } -> { uid: MaxKey } waiting for: 5315c844:6
the trailing value is a MongoDB oplog timestamp (OpTime); the primary is waiting for the oplog entry with this timestamp to be written to the secondaries. However, when I checked the data on the secondaries, other ordinary operations (unrelated to the moveChunk) had already been replicated, so this oplog entry must already have been applied; in other words, there was no replication-lag problem. 2) For this phenomenon we got a reply from a 10gen engineer (a Mr. Yuan, a BUPT graduate) on the Google group: the fault most likely triggers a bug in MongoDB 2.2.0, and the recommendation is to upgrade to 2.2.7; see https://jira.mongodb.org/browse/SERVER-5351. 3) So I searched the log of a previously successful moveChunk operation, shown below:
Wed Aug 21 01:31:35 [migrateThread] warning: migrate commit waiting for 2 slaves for 'post.post' { uid: 1107296256 } -> { uid: MaxKey } waiting for: 5213a7f7:4
It is strange: we have three secondaries, so why does it say 2 here? Compare "waiting for 2 slaves for 'post.post'" (normal) with "waiting for 3 slaves for 'archive.archive'" (abnormal). 4) Same cluster, same replica set, yet the number changed from 2 to 3, so I followed the code (d_migrate.cpp):
// d_migrate.cpp
class MigrateStatus {
    void _go() {
        // ......
        // pause to wait for replication
        Timer t;
        while (t.minutes() < 600) {   // poll the synchronization progress for up to 10 hours
            if (flushPendingWrites(lastOpApplied))
                break;
            sleepsecs(1);
        }
    }

    bool flushPendingWrites(const ReplTime& lastOpApplied) {
        if (!opReplicatedEnough(lastOpApplied)) {
            OpTime op(lastOpApplied);
            // this is where the "waiting for N slaves" warning is printed
            OCCASIONALLY warning() << "migrate commit waiting for " << slaveCount
                                   << " slaves for '" << ns << "' " << min << " -> " << max
                                   << " waiting for: " << op << migrateLog;
            return false;
        }
        // ...
    }

    // The following two functions check the synchronization progress.
    // Note that slaveCount is passed by value, i.e. it was already computed by earlier code.
    bool opReplicatedEnough(const ReplTime& lastOpApplied) {
        return mongo::opReplicatedEnough(lastOpApplied, slaveCount);
    }

    bool opReplicatedEnough(OpTime op, BSONElement w) {
        return slaveTracking.opReplicatedEnough(op, w);
    }
};

OK, in flushPendingWrites we can see that the number of slaves being printed is the slaveCount variable; the key is the slaveCount passed to mongo::opReplicatedEnough. How is it calculated? Searching the code turns up: slaveCount = (getSlaveCount() / 2) + 1; (in version 2.2.7 slaveCount is no longer used). In other words, the primary waits until more than half of the secondaries have completed synchronization. Our replica set has three secondaries, so the normal log prints 3/2 + 1 = 2; when the fault occurs, however, it prints 3, i.e. 4/2 + 1 or 5/2 + 1. That means this mongod believes it has 4 or 5 secondaries, which suggests some internal state in the database has gone wrong.
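A quick sketch of that arithmetic (integer division, as in the source), just to make the 2-versus-3 difference concrete; the helper name is mine, not MongoDB's:

#include <cstdio>

// slaveCount = (number of tracked slaves / 2) + 1, as in d_migrate.cpp (2.2.0).
static int majoritySlaveCount(int trackedSlaves) {
    return trackedSlaves / 2 + 1;
}

int main() {
    for (int n = 3; n <= 5; ++n)
        std::printf("tracked slaves = %d -> waiting for %d slaves\n", n, majoritySlaveCount(n));
    // tracked slaves = 3 -> waiting for 2 slaves   (matches the normal log)
    // tracked slaves = 4 -> waiting for 3 slaves   (matches the faulty log)
    // tracked slaves = 5 -> waiting for 3 slaves   (also matches the faulty log)
    return 0;
}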
Handling: In principle the only real fix is to upgrade, as the 10gen engineer said, but a database upgrade cannot be done quickly; we run hundreds of mongod instances, so we had to find another way to work around this bug first. 1) Let's see where this slaveCount ends up: mongo::opReplicatedEnough(lastOpApplied, slaveCount) returns true if at least slaveCount secondaries have an opTime no older than lastOpApplied. Digging into that function, we find this code:
bool _replicatedToNum_slaves_locked(OpTime& op, int numSlaves) {
    for (map<Ident, OpTime>::iterator i = _slaves.begin(); i != _slaves.end(); i++) {
        OpTime s = i->second;
        if (s < op) {
            continue;
        }
        if (--numSlaves == 0)
            return true;
    }
    return numSlaves <= 0;
}
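The counting mechanics can be reproduced with a small standalone sketch (plain ints stand in for OpTime, strings for the Ident keys; the maps are made-up examples, since we could not inspect the real _slaves contents): with three live, caught-up entries the requirement is 2 and the check passes, whereas a map polluted with entries that never advance raises the requirement and can keep the check failing.

#include <cstdio>
#include <map>
#include <string>

// Same counting idea as _replicatedToNum_slaves_locked: walk the tracked
// slaves and see whether at least numSlaves of them have reached 'op'.
static bool replicatedToNum(const std::map<std::string, long>& slaves,
                            long op, int numSlaves) {
    for (const auto& kv : slaves) {
        if (kv.second < op)
            continue;                  // this entry has not caught up yet
        if (--numSlaves == 0)
            return true;
    }
    return numSlaves <= 0;
}

static void check(const char* label, const std::map<std::string, long>& slaves, long op) {
    int need = static_cast<int>(slaves.size()) / 2 + 1;   // the slaveCount formula
    std::printf("%s: tracked=%zu, waiting for %d slaves, replicated enough: %s\n",
                label, slaves.size(), need,
                replicatedToNum(slaves, op, need) ? "yes" : "no");
}

int main() {
    const long target = 1000;   // stand-in for the migration's lastOpApplied

    // Healthy tracking state: three live secondaries, all caught up -> need 2.
    check("healthy", {{"sec-a", 1005}, {"sec-b", 1003}, {"sec-c", 1001}}, target);

    // Hypothetical polluted state: extra entries that never advance -> need 3,
    // and if fewer than 3 tracked entries ever reach the target optime, the
    // migrate thread keeps printing "waiting for 3 slaves" and never commits.
    check("polluted", {{"sec-a", 1005}, {"sec-b", 1003},
                       {"stale-1", 10}, {"stale-2", 10}, {"stale-3", 10}}, target);
    return 0;
}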
Here we can see that the variable _slaves holds the secondaries' replication state; its definition is map<Ident, OpTime> _slaves;. 2) So in the faulty mongod, _slaves apparently contains four or five elements. My thought was: if we can find a way to clear or reset _slaves, the problem should go away. So we located the places that modify _slaves:
// repl_block.cpp:111
void reset() {
    if (_currentlyUpdatingCache)
        return;
    scoped_lock mylk(_mutex);
    _slaves.clear();        // clear
}

// repl_block.cpp:118
void update(const BSONObj& rid, const string& host, const string& ns, OpTime last) {
    REPLDEBUG(host << " " << rid << " " << ns << " " << last);
    Ident ident(rid, host, ns);
    scoped_lock mylk(_mutex);
    _slaves[ident] = last;  // modify
    _dirty = true;
    // .....
}

These are the only two places that modify _slaves directly. Reading further up the call chain, I found that _slaves is reset when the database is closed and when the replica set configuration is modified. Because there is quite a lot of upper-layer code above _slaves.clear(), and we also wanted to see whether any easier trigger existed and to confirm that a restart or a replica configuration change really does reset _slaves, we simply added log statements at several places in the source, recompiled mongod, and tested it on a test-environment cluster:
src/mongo/db/repl_block.cpp:115:    log(LL_ERROR) << "[yyy] clear _slave in " << " void reset()  _slave.clear()\n";
src/mongo/db/repl_block.cpp:126:    log(LL_ERROR) << "[yyy] ident:" << rid.toString(false, false) << " " << host << " " << ns << "  |   last:" << last;
src/mongo/db/repl/replset_commands.cpp:215:    log(LL_ERROR) << "[yyy] clear _slave in " << " CmdReplSetReconfig";
src/mongo/db/repl/rs_config.cpp:72:    log(LL_ERROR) << "[yyy] clear _slave in " << "saveConfigLocally" << rsLog;
src/mongo/db/oplog.cpp:50:    log(LL_ERROR) << "[yyy] clear _slave in " << " close database";
Finally, we confirmed the operations that reset _slaves: mongod startup and replica set configuration modification. 2) Combining the above reasoning with the test, we concluded that restarting mongod would let it rebuild _slaves according to the actual situation; alternatively, in normal operation, one could simply modify the replica set configuration. Since changing the configuration carries a higher risk, we decided to restart mongod. 3) Because our cluster handles a large volume of requests, we performed the operation in the early morning for safety: switch the master, wait for the step-down and synchronization, shut down the original primary, restart it, wait for synchronization, switch the master back, wait for the step-down and synchronization, then run moveChunk again. It succeeded: { "millis": 2977, "ok": 1 }!
Follow-up: I have gone through a lot of MongoDB code over the past two days and will write up some code analysis later.














