Original: http://weibo.com/p/1001603830912709174661
Why writes to mmap'd memory and files can cause hundreds of milliseconds of latency
April 12, 2015 21:10
I recently came across a bug write-up in which the author spent four months tracking down why the JVM's garbage collection process would sometimes pause for several hundred milliseconds. The cause turned out to be the JVM's statistics, which are written to a memory region mmap'd to a file under /tmp.
For more information on this bug, see the following link:
The Four Month Bug: JVM statistics cause garbage collection pauses
Link: http://www.tuicool.com/articles/rime6bV, original link: http://www.tuicool.com/articles/goto?id=rime6bV
We know that the Linux kernel uses the page cache to speed up reads and writes to block devices. When an application writes to a file, the data goes into the page cache and the call returns from the kernel; a background flush process later writes the data back to the block device. There are two ways to access a file on a block device: the standard open/read/write/close interface, and mmap, which maps part of the file into a memory region that the program then reads and writes directly.
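To make the two access paths concrete, here is a minimal sketch (my own illustration, not code from the original post) that writes to the same file first with write() and then through an mmap'd region; the file name /tmp/mmap_demo and the sizes are arbitrary. In both cases the data initially lands only in the page cache.

/* Sketch: the two file access paths described above. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        const char msg[] = "hello page cache\n";
        int fd = open("/tmp/mmap_demo", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Path 1: standard write(); the data lands in the page cache and the
         * call normally returns before anything reaches the block device. */
        if (write(fd, msg, sizeof(msg) - 1) < 0) { perror("write"); return 1; }

        /* Path 2: map part of the file and store into the mapping directly;
         * this only dirties page-cache pages, writeback happens later. */
        if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        memcpy(p + sizeof(msg) - 1, "written via mmap\n", 17);

        munmap(p, 4096);
        close(fd);
        return 0;
}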
On the face of it, then, modifying the memory behind the mmap should not produce pauses of hundreds of milliseconds. In reality there is such a pause. Why?
I sent this question to our department's internal mailing list, and a colleague found the following information:
http://www.evanjones.ca/linux-write-caching.html
(You may need to get around the GFW to access it.) Excerpts below:
- While the percentage of dirty pages is less than dirty_background_ratio (default: 10% on my system), dirty pages stay in memory until they are older than dirty_expire_centisecs (default: 30 seconds). The pdflush kernel process wakes up every dirty_writeback_centisecs to flush these expired pages out.
- If a writing process dirties enough pages that the percentage rises above dirty_background_ratio, it proactively wakes pdflush to start writing data out in the background.
- If the percentage of dirty pages rises above dirty_ratio (default: 20% on my system), then the writing process itself will synchronously write pages out to disk. This puts the process in "uninterruptible sleep" (indicated by a D in top). The CPU will be shown in the "iowait" state. This is actually idle time: if there were processes that needed CPU, they would be scheduled to run.
- The percentages are of the total reclaimable memory (free + active + inactive from /proc/meminfo). On a 32-bit system, the "high memory" region is excluded if vm_highmem_is_dirtyable is 0 (the default).
......
From the third point above we can see that once the dirty-page ratio exceeds dirty_ratio, a process generating dirty pages switches from asynchronous to synchronous behavior: its write does not return until enough dirty pages have been written back to the block device.
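As a rough way to see what these thresholds mean on a given machine, the sketch below (my own addition) reads the two ratios from /proc/sys/vm and estimates the corresponding byte values against MemTotal from /proc/meminfo. Note that, per the quoted text, the kernel actually computes the percentages against reclaimable memory, so the numbers printed here are only an approximation.

#include <stdio.h>

/* Read a single integer from a /proc file; returns -1 on failure. */
static long read_long(const char *path)
{
        long v = -1;
        FILE *f = fopen(path, "r");
        if (f) {
                if (fscanf(f, "%ld", &v) != 1)
                        v = -1;
                fclose(f);
        }
        return v;
}

int main(void)
{
        long bg = read_long("/proc/sys/vm/dirty_background_ratio");
        long ratio = read_long("/proc/sys/vm/dirty_ratio");
        long mem_total_kb = -1;
        char line[128];

        FILE *f = fopen("/proc/meminfo", "r");
        if (f) {
                while (fgets(line, sizeof(line), f)) {
                        if (sscanf(line, "MemTotal: %ld kB", &mem_total_kb) == 1)
                                break;
                }
                fclose(f);
        }
        if (bg < 0 || ratio < 0 || mem_total_kb < 0) {
                fprintf(stderr, "failed to read /proc\n");
                return 1;
        }
        /* Rough estimate only: the kernel uses reclaimable memory as the base. */
        printf("dirty_background_ratio = %ld%% (~%ld MB)\n",
               bg, mem_total_kb * bg / 100 / 1024);
        printf("dirty_ratio            = %ld%% (~%ld MB)\n",
               ratio, mem_total_kb * ratio / 100 / 1024);
        return 0;
}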
Searching the kernel code shows that after a process writes data, it eventually calls the function balance_dirty_pages in mm/page-writeback.c; the detailed call path is omitted here.
The key code of balance_dirty_pages in kernel 2.6.32.61 is as follows:
unsigned long pause = 1;              /* initial blocking time is 1 ms */
for (;;) {
        writeback_inodes_wbc(&wbc);   /* start writing data back to the block device; does not block */
        __set_current_state(TASK_INTERRUPTIBLE);
        io_schedule_timeout(pause);   /* give up the CPU and sleep for "pause" before continuing */

        /*
         * Increase the delay for each loop, up to our previous
         * default of taking a 100ms nap.
         * (i.e. the delay doubles each iteration, capped at 100 ms)
         */
        pause <<= 1;
        if (pause > HZ / 10)
                pause = HZ / 10;
}
As you can see from the code above, a process writing data may go around this loop many times, and does not exit until the dirty-page ratio meets the requirement; a single wait can last up to 100 ms.
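If you want to observe these stalls directly, a minimal sketch like the following (my own, assuming the file sits on a real block device rather than tmpfs; the path and sizes are arbitrary) maps a large file, dirties it chunk by chunk, and reports any chunk that takes unusually long. On a slow or busy disk, occasional iterations jump from well under a millisecond to tens or hundreds of milliseconds once the dirty thresholds are crossed.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define CHUNK (4UL << 20)            /* dirty 4 MB per iteration      */
#define TOTAL (2UL << 30)            /* 2 GB file, adjust to taste    */

static long elapsed_ms(struct timespec a, struct timespec b)
{
        return (b.tv_sec - a.tv_sec) * 1000 + (b.tv_nsec - a.tv_nsec) / 1000000;
}

int main(void)
{
        int fd = open("/var/tmp/dirty_test", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0 || ftruncate(fd, TOTAL) < 0) { perror("setup"); return 1; }

        char *p = mmap(NULL, TOTAL, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        for (size_t off = 0; off < TOTAL; off += CHUNK) {
                struct timespec t0, t1;
                clock_gettime(CLOCK_MONOTONIC, &t0);
                memset(p + off, 0xab, CHUNK);          /* dirty the pages */
                clock_gettime(CLOCK_MONOTONIC, &t1);
                long ms = elapsed_ms(t0, t1);
                if (ms > 50)                           /* report only big stalls */
                        printf("offset %zu MB: %ld ms\n", off >> 20, ms);
        }
        munmap(p, TOTAL);
        close(fd);
        return 0;
}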
So when there is too much dirty data and the block device cannot keep up, the kernel forcibly suspends the processes that are writing in order to slow them down. But this means writes to the block device no longer return in predictable time: the delay is uncontrollable, and delays of hundreds of milliseconds are normal. Worse, processes that write only a little are penalized along with the heavy writers, which is unfair.
I think there are other problems as well. Writeback was originally done by the background flusher process precisely so that block-device throughput could be maximized. With this scheme, many processes issue writeback requests themselves, which disrupts the block device's I/O scheduling; on a hard disk the head has to seek more, so throughput actually gets worse, and the device falls even further behind the incoming writes. In addition, multiple processes queueing writeback at the same time also produces a lot of lock contention.
Next, look at the kernel 3.10.73 code, which is what CentOS 7 uses. The delay-related code in balance_dirty_pages is as follows:
for (;;) {
        __set_current_state(TASK_KILLABLE);
        io_schedule_timeout(pause);
}
The kernel 3 code has changed: the pause is now computed dynamically, somewhere between 10 ms and 200 ms. The algorithm is fairly complex and I did not fully understand it; I will look at it again when I get the chance. Also, the process no longer issues the writeback request itself; it only sleeps for a while to slow down its writing.
So it appears that in kernel 3, when the dirty-page percentage is high, the writing process no longer initiates writeback requests itself, so block-device throughput is not hurt. But the throttling is still there: a process writing data can still be suspended and see fairly large delays.
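One way to watch this happening on either kernel is to poll the Dirty and Writeback counters in /proc/meminfo while a write-heavy workload runs. The small helper below (my own addition) does just that, printing the two values once a second; stop it with Ctrl-C.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
        char line[128];
        for (;;) {
                long dirty_kb = -1, writeback_kb = -1;
                FILE *f = fopen("/proc/meminfo", "r");
                if (!f) { perror("fopen"); return 1; }
                while (fgets(line, sizeof(line), f)) {
                        sscanf(line, "Dirty: %ld kB", &dirty_kb);
                        sscanf(line, "Writeback: %ld kB", &writeback_kb);
                }
                fclose(f);
                printf("Dirty: %ld kB  Writeback: %ld kB\n", dirty_kb, writeback_kb);
                sleep(1);
        }
        return 0;
}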
I have only skimmed the code, and the kernel's throttling of file writes is still not entirely clear to me. I then found the following documents; interested readers can use them to dig further.
Dynamic writeback throttling, Link: http://lwn.net/Articles/405076/
NO-I/O Dirty Throttling, Link: http://lwn.net/Articles/456904/
As you can see from the documents above, Fengguang Wu of Intel submitted the patches in this area. Kernel developers also commented that the algorithm is rather complex, so it seems I am not the only one who feels that way.
Linux Storage, Filesystem, and Memory Management summit-day 1, Link: http://lwn.net/Articles/490114/
The document above also mentions another cause of write stalls: stable pages. I was about to investigate this when I found there are already good write-ups online:
Why writing to mmap'd memory is slow, link: http://www.360doc.com/content/12/0309/10/2459_192929550.shtml
Mmap Internals, Link: http://blog.chinaunix.net/uid-20662820-id-3873318.html
The next issue I want to look at is isolation with Linux containers: if a process in one container writes a large amount of file data, will processes in other containers be suspended when they write file data? My initial feeling is that the containers will affect each other, but this needs further confirmation.