At the end of the article "MongoDB Source Code Overview: Memory Management and Storage Engine", we left an open problem: because MongoDB's memory management and storage engine rely on the operating system's MMAP mechanism, mapping files on disk into the process's address space brings great convenience to MongoDB, but it also brings problems. How often should the in-memory view be flushed to the persistent file on disk so that the server loses the least data when it goes down? And is there a good solution to the data corruption that a crash in the middle of a flush can cause?
On the latest branch of version 1.7, the MongoDB team began improving single-machine reliability by introducing the journal/durability module. This module addresses the problems raised above and plays a decisive role in improving the reliability of standalone data. Its mechanism is to periodically record operation logs (only operations that change the database; queries are not recorded) into a folder named journal under dbpath; when the system restarts after a crash, lost data is restored from this folder.
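With journaling enabled, the journal folder typically holds sequentially numbered log files plus the lsn bookkeeping file discussed later. The layout below is from memory of the 1.7/1.8 series and should be treated as illustrative:

dbpath/journal/j._0   <- current journal file
dbpath/journal/j._1   <- created after the first rotation
dbpath/journal/lsn    <- records the time of the last flush to the data files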
Next we will give a brief analysis of the source code.
Journal Logging Module
The call path into the journal/durability module is as follows:
main() --> initAndListen() --> _initAndListen() --> dur::startup();
The startup() code is as follows:
void startup() {
    if( !cmdLine.dur )
        return;
    DurableInterface::enableDurability(); // use DurableImpl to instantiate the interface
    journalMakeDir();                     // confirm the log directory
    try {
        recover();                        // repair mode
    }
    catch(...) {
        log() << "exception during recovery" << endl;
        throw;
    }
    preallocateFiles();                   // pre-allocate two log files
    boost::thread t(durThread);
}
In the code above, DurableInterface::enableDurability() makes the internal _impl pointer refer to a DurableImpl instance; it defaults to pointing at a NonDurableImpl instance. Their relationship is as follows:
NonDurableImpl does not persist any journal, while DurableImpl provides journal persistence.
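A simplified sketch of the relationship follows. Only the names DurableInterface, NonDurableImpl, DurableImpl and writingPtr come from the MongoDB source; the bodies here are illustrative, not the real implementation:

class DurableInterface {
public:
    virtual ~DurableInterface() {}
    // declare intent to modify [x, x+len); journaled only in the durable case
    virtual void* writingPtr(void* x, unsigned len) = 0;
    static void enableDurability();                      // repoints _impl at a DurableImpl
    static DurableInterface& getDur() { return *_impl; }
protected:
    static DurableInterface* _impl;                      // a NonDurableImpl by default
};

class NonDurableImpl : public DurableInterface {
public:
    void* writingPtr(void* x, unsigned) { return x; }    // no journaling at all
};

class DurableImpl : public DurableInterface {
public:
    void* writingPtr(void* x, unsigned len);             // records a WriteIntent first
};

DurableInterface* DurableInterface::_impl = 0;
void DurableInterface::enableDurability() {
    static DurableImpl durable;
    _impl = &durable;
}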
The journalMakeDir() function checks whether the journal directory exists and creates it if it does not.
The recover() function checks for existing journal files. If any are present, the system went down last time and data must be restored from the journal; this part is discussed later in this article.
preallocateFiles() prepares the files that will hold the persisted journal; the system decides, based on the current environment, whether pre-allocation is required.
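A minimal sketch of what pre-allocation amounts to; preallocateOne is a hypothetical helper, not the real preallocateFiles(). Reserving the file's full size up front means later journal appends never stall on filesystem allocation:

#include <cstddef>
#include <fstream>
#include <vector>

void preallocateOne(const char* path, std::size_t bytes) {
    std::ofstream f(path, std::ios::binary);
    std::vector<char> zeros(1024 * 1024, 0);  // write 1 MB of zeros at a time
    for (std::size_t written = 0; written < bytes; written += zeros.size())
        f.write(&zeros[0], static_cast<std::streamsize>(zeros.size()));
}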
Next, the system starts a new thread to run durThread(). To greatly reduce the amount of code pasted into the text, I will describe the process and explain several important steps. After all, I don't find it interesting to post page after page of code; the article bloats, yet carries little practically useful content. This is why I like to call my articles a source code overview rather than a source code analysis.
durThread is mainly responsible for committing the journal roughly every 90 milliseconds (recording the user's database-changing operations; queries are outside the recorded range). It is a separate thread; the recording interface that stores journal data in memory runs when the user calls it, and that part was already covered in the article MongoDB Source Code Overview: Logging.
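A minimal sketch of the durThread loop under the description above; all three helpers are stand-ins, not the real MongoDB functions:

void groupCommit();        // serialize intents, append to journal, apply views
bool shuttingDown();       // stand-in for the real shutdown check
void sleepmillis(int ms);  // stand-in for MongoDB's sleep helper

void durThread() {
    while (!shuttingDown()) {
        sleepmillis(90);   // the ~90 ms commit interval mentioned above
        groupCommit();
    }
}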
The thread's work can be divided into the following steps:
- Recording the time of the last MMAP flush and clearing unneeded journal files
When journalRotate() is called, the lsn file is updated. This file records the last time the MMAP'ed files were flushed to disk; the data comes from the lastFlushTime attribute, and the assignments related to this attribute are as follows:
void Journal::init() {
    assert( _curLogFile == 0 );
    MongoFile::notifyPreFlush = preFlush;   // two function pointers,
    MongoFile::notifyPostFlush = postFlush; // used to simulate event notification
}
void Journal::preFlush() {
    j._preFlushTime = Listener::getElapsedTimeMillis(); // time elapsed since system startup
}
void Journal::postFlush() {
    j._lastFlushTime = j._preFlushTime;
    j._writeToLSNNeeded = true;
}
So far we know that lastFlushTime holds a millisecond count measured from system startup, obtained from the Listener class, and that it is updated (through the function-pointer notification) whenever the MMAP view is flushed to disk. In addition, this call also checks whether the current journal file is full; the system sets a different maximum size for 32-bit and 64-bit environments:
DataLimit = (sizeof(void*) == 4) ? 256 * 1024 * 1024 : 1 * 1024 * 1024 * 1024;
If the current write position exceeds this limit, the following are called in turn:
closeCurrentJournalFile();
removeUnneededJournalFiles();
I will not post the code of these two functions. In essence, they close the journal file that is already full and delete the journal files written before the last flush time (multiple journal files can exist at the same time). The changes those older files record have already been persisted, so they are no longer needed.
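A sketch of the rotation decision just described; the two prototypes stand in for the real functions, while the DataLimit expression mirrors the code shown above:

void closeCurrentJournalFile();
void removeUnneededJournalFiles();

void journalRotateIfNeeded(unsigned long long writtenSoFar) {
    const unsigned long long DataLimit =
        (sizeof(void*) == 4) ? 256ULL * 1024 * 1024 : 1ULL * 1024 * 1024 * 1024;
    if (writtenSoFar > DataLimit) {
        closeCurrentJournalFile();      // the journal file that just filled up
        removeUnneededJournalFiles();   // files older than the last flush time
    }
}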
- Serializing user operations and persisting them
Before serialization, the system calls commitJob.wi()._deferred.invoke(). This traverses the TaskQueue<D> entries (each D was recorded in the step where the user operation declared its write), runs D::go() on each one, and finally wraps all the data into WriteIntent objects stored in Writes::_writes (a set<WriteIntent>). Looking closely at the difference between WriteIntent and the D struct: D stores the start address of the data source, while WriteIntent stores the end address; the official explanation is that this makes the overloaded "<" operator used by _writes (set<WriteIntent>) run faster. I am genuinely puzzled by this practice. Why couldn't D do all of it by itself? Making a separate WriteIntent just distracts whoever reads the code.
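A sketch, simplified from what the text says about dur_commitjob.h, of why WriteIntent stores the END address: ordering the set by end pointer keeps operator< a single comparison. The details here are illustrative:

#include <set>

struct WriteIntent {
    void*    p;    // END of the written range, i.e. start + len
    unsigned len;  // length of the range
    WriteIntent(void* start, unsigned l)
        : p(static_cast<char*>(start) + l), len(l) {}
    void* start() const { return static_cast<char*>(p) - len; }
    void* end()   const { return p; }
    bool operator<(const WriteIntent& rhs) const { return p < rhs.p; }
};

typedef std::set<WriteIntent> Writes;  // mirrors Writes::_writes in the text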
At this point, all the WriteIntents are lined up in _writes (set<WriteIntent>) and the system prepares to serialize them, like meat laid out on the chopping board, waiting for the cook's knife.
_groupCommit() then calls PREPLOGBUFFER() to start the journal serialization:
AlignedBuilder& bb = commitJob._ab; // can be thought of as a buffer
...
for( vector< shared_ptr<DurOp> >::iterator i = commitJob.ops().begin(); i != commitJob.ops().end(); ++i ) {
    (*i)->serialize(bb);
}
...
for( set<WriteIntent>::iterator i = commitJob.writes().begin(); i != commitJob.writes().end(); i++ ) {
    prepBasicWrite_inlock(bb, &(*i), lastDbPath);
}
From the code above we can see that DurOp serialization is handled by each op's own serialize method, and these serializations do not involve the modified data itself, so the result can be very concise. For example, a DropDbOp drops a database; to redo it, you simply run the drop again, so recording little more than an opcode is enough. Basic writes are different: if a new record is inserted, we must record the entire record as the data source for recovery. That is exactly what prepBasicWrite_inlock, shown above but not yet explained, does:
JEntry e;
...
bb.appendStruct(e);
bb.appendBuf(i->start(), e.len);
AlignedBuilder can be understood as the buffer used during serialization; it stores the serialized data waiting to be persisted, and appendBuf performs a memcpy of the data at the address given by its parameter. In fact, "serialization" is a rather grand word for it: what ends up in the journal log file is simply this binary data, so don't get hung up on the name. Besides the data source, AlignedBuilder also holds the JEntry, which describes some basic attributes; it is the pairing of JEntry with the WriteIntent data that makes correct addressing possible when the journal is read back.
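A sketch of the pairing just described: each basic write is appended as a fixed-size header followed by the raw bytes it covers. The field names are illustrative, appendWrite is a hypothetical helper, and a plain byte vector stands in for AlignedBuilder:

#include <cstring>
#include <vector>

struct JEntry {
    unsigned len;    // length of the data that follows the header
    unsigned ofs;    // offset of the write inside the data file
    int      fileNo; // which database file the bytes belong to
};

void appendWrite(std::vector<char>& bb, const JEntry& e, const void* data) {
    const char* h = reinterpret_cast<const char*>(&e);
    bb.insert(bb.end(), h, h + sizeof(e));      // like bb.appendStruct(e)
    const char* d = static_cast<const char*>(data);
    bb.insert(bb.end(), d, d + e.len);          // like bb.appendBuf(i->start(), e.len)
}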
After all the data has been serialized, the system calls writeToJournal(commitJob._ab) to persist the AlignedBuilder to the journal log file; ultimately LogFile::synchronousAppend writes the data out to the file on disk. Then the system calls writeToDataFiles(). Honestly, I was quite puzzled when I first read this part of the source. For example, when we insert data, the record to be inserted has already been memcpy'ed once into the in-memory view; why does writeToDataFiles need another memcpy? I struggled with this question for a long time and finally found the answer. The secret is that when dur mode is enabled, each MemoryMappedFile produces two views: _view_private and _view_write. (For MongoDB on a 32-bit system without dur mode, the official line is that database data cannot exceed 2.5 GB; through this principle we can see how much water that figure holds, since with two views the realistic ceiling is only about 1 GB.) The code is as follows:
bool MongoMMF::finishOpening() {
    if( _view_write ) { // _view_write was created
        if( cmdLine.dur ) {
            _view_private = createPrivateMap(); // create _view_private
            if( _view_private == 0 ) {
                massert( 13636, "createPrivateMap failed (look in log for error)", false );
            }
            privateViews.add(_view_private, this); // note that testIntent builds use this, even though it points to view_write then
        }
        else {
            // with dur off, there is only one view
            _view_private = _view_write;
        }
        return true;
    }
    return false;
}
Of the two views in MongoMMF, only one is ever flushed to disk: the first one created. _view_write is created first, so it is the only view that is truly persisted:
void MemoryMappedFile::flush(bool sync) {
    uassert(13056, "async flushing not supported on windows", sync);
    if( !views.empty() ) {
        WindowsFlushable f( views[0], fd, filename(), _flushMutex );
        f.flush();
    }
}
Now we know there are two memcpys in dur mode, but why? There are two different views in this mode; does that remind you of anything? The insert method in pdfile.cpp calls memcpy to copy the content into _view_private (line 1596 of pdfile.cpp shows that recordAt works off the pointer p, where _mb = mmf.getView(), so the record actually lives in _view_private), not into the persistable _view_write. Therefore writeToDataFiles must copy once more, and its data source is the copy already sitting in _view_private.
From the two code fragments above we can also see that in non-dur mode, _view_private and _view_write are actually the same thing. This explains why non-dur mode does not need the second memcpy (writeToDataFiles is not run in non-dur mode).
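A sketch of where the two copies happen in dur mode; the pointers and the function itself are illustrative, not MongoDB's code:

#include <cstddef>
#include <cstring>

void applyWrite(char* viewPrivate, char* viewWrite, std::size_t ofs,
                const char* record, std::size_t len) {
    // copy 1: the insert path writes the record into the private view
    std::memcpy(viewPrivate + ofs, record, len);
    // ... the journal copy (PREPLOGBUFFER + writeToJournal) happens here ...
    // copy 2: writeToDataFiles moves it to the write view, the only view
    // the OS ever flushes to disk
    std::memcpy(viewWrite + ofs, viewPrivate + ofs, len);
}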
Okay. At this point, all the questions we raised have been answered.
Finally, I use a rather rough sequence diagram to describe this process (the process is not entirely object-oriented).
Journal Recovery Module
This module runs at system startup. It interprets the journal files left over from the last crash (again via MMAP); records that were never flushed to the database files are memcpy'ed back into _view_write, for the storage engine thread to persist.
If the system exited normally last time, a final flush (dur mode only) is performed during shutdown and the existing journal files are removed, so a normal exit leaves no journal files behind.
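A sketch of the replay idea only; the real recover() also walks file headers, sections and checksums. JEntry here is the illustrative header from the earlier sketch, and writeViewForFile is a hypothetical lookup:

#include <cstddef>
#include <cstring>

struct JEntry { unsigned len; unsigned ofs; int fileNo; };

void replay(const char* journal, std::size_t journalLen,
            char* (*writeViewForFile)(int fileNo)) {
    const char* p = journal;
    while (p + sizeof(JEntry) <= journal + journalLen) {
        JEntry e;
        std::memcpy(&e, p, sizeof(e));   // read the entry header
        p += sizeof(e);
        std::memcpy(writeViewForFile(e.fileNo) + e.ofs, p, e.len); // redo the write
        p += e.len;
    }
}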
This part of the operation is also quite simple; for reasons of time, this article will not elaborate on it. The sequence diagram is as follows:
It's getting late. I have to go to bed!
I am also looking for friends who love low-level technology (C/C++, Linux) to study with and build interesting things together!