Version: 2.2.0
This post follows up on the previous article, <MongoDb move chunk Fault Analysis and Handling (SERVER-5351)>.
In that article, our move chunk operation failed, and the investigation led to the _slaves variable. This variable is cleared only when mongod shuts down and when mongod starts; at all other times, entries are only added or modified. The move chunk log suggests that the _slaves variable of mongod contains 5/2 + 1 or 4/2 + 1 elements. How can that be explained? We have only three secondaries.
Tracing through the code turned up several points:
1) The local.me collection on a secondary mongod holds a single record, which we will call M for now. M has two key fields: _id and host. _id is a unique id for the mongod, generated as an ObjectId. host is the hostname of the machine on which the mongod is deployed.
2) While moving a chunk, the primary obtains a _remoteId from each secondary, which is the _id mentioned above.
3) After a secondary starts, whenever it needs the record M, it looks it up in local.me and validates it. The source code is as follows:
    bool replHandshake(DBClientConnection *conn) {
        string myname = getHostName();

        BSONObj me;
        {
            Lock::DBWrite l("local");
            // Read this machine's id from local.me. This is the so-called _remoteId.
            if (!Helpers::getSingleton("local.me", me) ||
                !me.hasField("host") ||
                me["host"].String() != myname) {
                // The record is missing, or its host differs from the current
                // hostname of the machine: clear local.me and generate a new _id.
                Helpers::emptyCollection("local.me");

                BSONObjBuilder b;
                b.appendOID("_id", 0, true);  // create a new id, the so-called _remoteId
                b.append("host", myname);
                me = b.obj();
                Helpers::putSingleton("local.me", me);
            }
        }

        // Wrap the id as { "handshake": _remoteId }. This matches the log,
        // where the key of _remoteId in the BSONObj is "handshake".
        BSONObjBuilder cmd;
        cmd.appendAs(me["_id"], "handshake");
        if (theReplSet) {
            cmd.append("member", theReplSet->selfId());
        }

        BSONObj res;
        bool ok = conn->runCommand("admin", cmd.obj(), res);  // execute the handshake command
        return true;
    }
4) Therefore, if the machine hosting a secondary mongod has had its hostname changed, that mongod has had two different local.me documents over time, each with a different _id. Call the _id from before the hostname change id1 and the _id from after the change id2, with corresponding optime1 and optime2. Both (id1, optime1) and (id2, optime2) are then in _slaves, but they refer to the same mongod, and optime1 is out of date. As a result, _slaves contains more elements than there are secondaries. (P.S. An entry in _slaves holds more than these two values, but these two are the most important and are what distinguishes the entries.)
5) We confirmed that our cluster had indeed gone through a data center migration, during which the hostnames of the machines hosting the secondaries were changed. In addition, I built another cluster to test this. After changing the hostname twice, move chunk failed there too, with the same log output as before.
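Points 3) and 4) can be illustrated with a small, self-contained model. This is only a sketch, not MongoDB code: Me, Secondary, handshakeId, Slaves and entriesAfterRename are hypothetical names, an integer counter stands in for the ObjectId, a plain long stands in for OpTime, and the map is keyed by the remote id alone rather than by the full Ident.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Simplified stand-in for the single document kept in local.me.
struct Me {
    int id;            // stands in for the ObjectId stored in _id
    std::string host;  // hostname recorded when the document was created
};

// Simplified stand-in for the persistent state of one secondary.
struct Secondary {
    bool hasMe = false;
    Me me{0, ""};
    int nextId = 1;  // hypothetical id generator replacing ObjectId

    // Mirrors the check in replHandshake: regenerate the record when it is
    // missing or when its host differs from the current hostname.
    int handshakeId(const std::string& hostname) {
        if (!hasMe || me.host != hostname) {
            me = Me{nextId++, hostname};
            hasMe = true;
        }
        return me.id;
    }
};

// Simplified model of the primary's _slaves map: remote id -> last optime.
using Slaves = std::map<int, long>;

// Runs one secondary through a hostname change and returns how many
// entries the primary ends up tracking for it.
std::size_t entriesAfterRename() {
    Secondary s;
    Slaves slaves;
    slaves[s.handshakeId("old.example.com")] = 100;  // first sync: id1 created
    slaves[s.handshakeId("old.example.com")] = 150;  // same host: id1 reused
    slaves[s.handshakeId("new.example.com")] = 200;  // renamed: id2 created
    assert(slaves.at(1) == 150);  // the id1 entry is never updated again
    return slaves.size();
}
```

In this model a single secondary ends up with two entries on the primary, and the older one is frozen at its last optime, which is the situation described in point 4).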
To summarize so far:
Tying this together with the other blog post, <MongoDb move chunk Fault Analysis and Handling>: the primary uses the _slaves variable to track how far each secondary has synchronized. In map<Ident, OpTime> _slaves, the key is the unique identifier of a secondary machine, and the value is the point in the oplog up to which it has synchronized. Once a machine has synchronized data from the primary as a secondary, its Ident is recorded in the primary's _slaves. The Ident is made up of _remoteId, host (the IP address, which was not modified) and ns (the log shows local.oplog.rs), and it determines a unique _slaves record.
If the hostname has been changed, a new Ident is inserted into _slaves, while the old entry stays behind with an opTime that is badly out of date. As a result, move chunk fails while waiting for a majority of secondaries to catch up.
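The effect on the majority check can be sketched as follows. This is a simplified model under stated assumptions, not the MongoDB implementation: it assumes, as the 5/2 + 1 and 4/2 + 1 figures from the log suggest, that the required count is derived from the number of entries in _slaves. The names requiredCount, caughtUpCount and replicatedToMajority are hypothetical, a string stands in for Ident, and a long stands in for OpTime.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

using OpTime = long;                            // stand-in for the real OpTime
using Slaves = std::map<std::string, OpTime>;   // stand-in for map<Ident, OpTime>

// Assumed threshold: a majority of however many entries _slaves holds.
std::size_t requiredCount(const Slaves& slaves) {
    return slaves.size() / 2 + 1;
}

// Number of tracked entries whose optime has reached the target.
std::size_t caughtUpCount(const Slaves& slaves, OpTime target) {
    std::size_t n = 0;
    for (const auto& entry : slaves) {
        if (entry.second >= target) {
            ++n;
        }
    }
    return n;
}

// True when enough tracked entries have replicated up to the target.
bool replicatedToMajority(const Slaves& slaves, OpTime target) {
    return caughtUpCount(slaves, target) >= requiredCount(slaves);
}
```

With three real secondaries, two of which are caught up, the model needs 3/2 + 1 = 2 entries and the check passes. Add two stale entries left over from hostname changes and it needs 5/2 + 1 = 3, but the stale entries can never advance, so the same two caught-up secondaries are no longer enough.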