Make a little progress every day: the disk cache in Linux

Source: Internet
Author: User
For more information, see http://blog.csdn.net/cywosp/article/details/21126161.

Some time ago, while developing a system that uses SSDs for caching, I found that a large amount of data could accumulate in the disk cache during high-speed writes. If too much cached data is not written back to the disk in time, a machine failure becomes very dangerous: a large amount of data will be lost. On the other hand, if data is flushed to the disk immediately on every write, write efficiency is far too low. To understand this disk-write behavior in Linux, I recently studied it a bit.

The VFS (Virtual File System) layer is what allows Linux to support different file systems, such as ext3, ext4, XFS, and NTFS. It not only provides a common external interface for all file systems, but also plays an important role in system performance: caching. The VFS introduces the disk cache mechanism, a software mechanism that allows the kernel to keep in RAM some of the information that lives on disk, so that further accesses to that data can be served quickly, without slow accesses to the disk itself. The disk cache can be roughly divided into three types:

Directory entry (dentry) cache: stores the dentry objects that represent file system path names.

Inode cache: stores the inode objects that describe on-disk inodes.

Page cache: stores complete pages of data. The data on each cached page belongs to some file, and all file read/write operations go through the page cache; it is the main disk cache used by the Linux kernel. Because of this cache, the VFS uses delayed writing for file data: unless a synchronous write mode is requested when the write interface is called, most data is first stored in the cache, and is only flushed to the disk when certain conditions are met.
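From user space, delayed writing means that a successful write only guarantees the data has reached the page cache, not the disk; an explicit flush request is needed for durability. A minimal sketch (assuming a POSIX system; the file name is a scratch file created for the demonstration):

```python
import os
import tempfile

# A successful os.write() only guarantees the data reached the kernel's
# page cache; the affected pages are merely marked dirty at this point.
fd, path = tempfile.mkstemp()
data = b"hello, page cache\n"
written = os.write(fd, data)   # lands in the page cache, not on the disk
assert written == len(data)

# Only an explicit flush (fsync/fdatasync, O_SYNC, sync, ...) forces the
# dirty pages of this file to be written to the underlying disk.
os.fsync(fd)                   # data is now durable
os.close(fd)
os.remove(path)
```

This is the trade-off described above: the write call returns quickly because it only touches RAM, and durability costs an extra, slow flush.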

How does the kernel flush data to the disk? The following two sections give the answer.

1. Writing dirty pages to the disk

As we know, the kernel constantly fills the page cache with pages containing block device data. Whenever a process modifies data, the corresponding page is marked dirty, that is, its PG_dirty flag is set. Unix systems allow the operation of writing dirty buffers to block devices to be delayed, because this policy significantly improves system performance: several write operations to a page in the cache may be satisfied by a single, slow physical update of the corresponding disk blocks. Moreover, write operations are less urgent than read operations, because a process is generally not suspended by a delayed write, while in most cases it is suspended by a delayed read. Precisely because of delayed writing, any physical block device serves, on average, many more read requests than write requests.

A dirty page may remain in main memory until the last possible moment, that is, until the system shuts down. However, the delayed-write policy has two main disadvantages:

1. If a hardware error or power failure occurs, the contents of RAM can no longer be recovered, so many modifications made to files since the system was booted are lost.

2. The page cache (and hence the RAM required to hold it) would have to be huge, at least as big as the block devices being accessed.

Therefore, dirty pages are flushed (written) to the disk under the following conditions:

The page cache becomes too full and more pages are needed, or the number of dirty pages grows too large.

A page has remained dirty for too long.

A process requests that any pending changes to a block device or to a particular file be flushed; it does this by invoking the sync(), fsync(), or fdatasync() system call.

Buffer pages introduce a complication. The buffer heads associated with each buffer page let the kernel track the state of each individual block buffer. If at least one of a page's buffer heads has its dirty flag (BH_Dirty) set, the PG_dirty flag of the corresponding buffer page should also be set. When the kernel selects buffers to flush, it scans the associated buffer heads and effectively writes only the contents of the dirty blocks to disk. Once the kernel has flushed all the dirty buffers of a page to disk, it clears the page's PG_dirty flag.
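The three flush interfaces named above differ in scope, which the following sketch illustrates (assuming a POSIX system; the scratch file is created only for the demonstration):

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"log entry\n")

# fdatasync() flushes the file's dirty data pages, but may skip metadata
# (such as timestamps) that is not needed to read the data back.
os.fdatasync(fd)

# fsync() additionally flushes the file's metadata (its inode).
os.fsync(fd)

# os.sync() asks the kernel to flush *all* pending changes for *all*
# block devices, which is what the sync() system call does.
os.sync()

os.close(fd)
os.remove(path)
```

For write-heavy workloads such as logging, fdatasync() can be noticeably cheaper than fsync() because it may avoid a second disk write for the inode.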

2. The pdflush kernel threads

In earlier versions of Linux, the bdflush kernel thread systematically scanned the page cache looking for dirty pages to flush, and a second kernel thread, kupdate, ensured that no page remained dirty for too long. Linux 2.6 replaces both with a pool of general-purpose kernel threads called pdflush. These kernel threads have a flexible structure: they act on two parameters, a pointer to the function the thread should execute and an argument for that function. The number of pdflush kernel threads in the system is adjusted dynamically: new threads are created when there are too few, and threads are killed when there are too many. Because the functions these threads execute can block, creating several pdflush kernel threads instead of a single one improves system performance. Thread creation and destruction are governed by the following rules:

There must be at least two and at most eight pdflush kernel threads.

If no pdflush thread has been idle during the last second, a new pdflush thread is created.

If a pdflush thread has been idle for more than one second, one pdflush thread is removed.

Every pdflush kernel thread has a pdflush_work descriptor, with the following structure:


Type                      Field                 Description
struct task_struct *      who                   Pointer to the kernel thread descriptor
void (*)(unsigned long)   fn                    Callback function executed by the kernel thread
unsigned long             arg0                  Argument passed to the callback function
struct list_head          list                  Links for the pdflush_list list
unsigned long             when_i_went_to_sleep  Time (in jiffies) at which the kernel thread became available
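The descriptor and the sizing rules above can be modeled together in a simplified sketch (this is illustrative pseudocode in Python, not kernel code; the field names follow the table, and the helper functions are hypothetical):

```python
from dataclasses import dataclass
from typing import Callable, Optional

MIN_PDFLUSH_THREADS = 2   # there must be at least two threads...
MAX_PDFLUSH_THREADS = 8   # ...and at most eight

@dataclass
class PdflushWork:
    """Simplified model of the pdflush_work descriptor."""
    who: object = None                     # the kernel thread (task_struct)
    fn: Optional[Callable] = None          # callback the thread will run
    arg0: int = 0                          # argument for the callback
    when_i_went_to_sleep: int = 0          # jiffies when it became idle

def should_spawn(idle_threads: int, total: int,
                 none_idle_for_jiffies: int, hz: int) -> bool:
    # Create a new thread if no thread has been idle during the last
    # second (and the maximum has not been reached).
    return (total < MAX_PDFLUSH_THREADS
            and idle_threads == 0
            and none_idle_for_jiffies >= hz)

def should_kill(idle_for_jiffies: int, total: int, hz: int) -> bool:
    # Kill a thread that has been idle for more than one second
    # (but never drop below the minimum).
    return total > MIN_PDFLUSH_THREADS and idle_for_jiffies > hz
```

The point of the model is only to show why both bounds exist: the pool grows while work is backing up, and shrinks once threads sit idle.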

When the system has no dirty pages to flush, a pdflush thread automatically goes to sleep, until it is eventually woken by the pdflush_operation() function. What, then, is the main work of a pdflush kernel thread? All of it is related to flushing dirty data. In particular, pdflush usually executes one of the following callback functions:

1. background_writeout(): systematically scans the page cache looking for dirty pages to flush.

To find the dirty pages to flush, the kernel would have to exhaustively search all the address_space objects (each a radix tree) associated with inodes that have an image on disk. Because the page cache may contain a huge number of pages, scanning the whole cache in a single execution flow could keep the CPU and the disk busy for a long time; therefore, Linux uses a more sophisticated mechanism that splits the scan of the page cache over several execution flows.

When memory becomes scarce, or when a user explicitly requests a flush (for example, a user-mode process issues the sync() system call), the wakeup_bdflush() function is executed. wakeup_bdflush() calls pdflush_operation() to wake a pdflush kernel thread and delegate to it the execution of the background_writeout() callback. The background_writeout() function effectively retrieves a specified number of dirty pages from the page cache and writes them back to disk. In addition, a pdflush kernel thread executing background_writeout() can only be woken when both of the following conditions hold: first, page contents in the page cache have been modified; second, the number of dirty pages has grown beyond a background dirty threshold. The background threshold is normally set to 10% of the pages in the system, but the value can be changed via the file /proc/sys/vm/dirty_background_ratio.
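The background-threshold condition is a simple ratio check, sketched below (a hypothetical helper for illustration; the real kernel accounts in pages and reads the ratio from /proc/sys/vm/dirty_background_ratio, default 10):

```python
def over_background_threshold(dirty_pages: int, total_pages: int,
                              dirty_background_ratio: int = 10) -> bool:
    """True when dirty pages exceed the background threshold, i.e. when
    a pdflush thread running background_writeout() should be woken."""
    # Compare dirty/total against ratio/100 without floating point.
    return dirty_pages * 100 > total_pages * dirty_background_ratio

# With the default 10% threshold, 1000 dirty pages out of 8000 total
# (12.5%) exceed it, while 500 (6.25%) do not.
```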

2. wb_kupdate(): checks whether any page in the cache has remained dirty for too long, so that no page starves by never being flushed.

During initialization, the kernel sets up a wb_timer dynamic timer whose interval is the number of hundredths of a second specified in the dirty_writeback_centisecs file (usually 500, i.e. five seconds, but the value can be changed via /proc/sys/vm/dirty_writeback_centisecs). The timer function calls pdflush_operation(), passing in the address of the wb_kupdate() function. wb_kupdate() traverses the page cache looking for "old" dirty inodes and writes out pages that have stayed dirty for more than 30 seconds; it then restarts the timer.
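The interval arithmetic is worth making explicit: dirty_writeback_centisecs is expressed in hundredths of a second, so the default value of 500 makes wb_timer fire every five seconds (a trivial sketch with a hypothetical helper name):

```python
def writeback_interval_seconds(dirty_writeback_centisecs: int = 500) -> float:
    """Convert a /proc/sys/vm/dirty_writeback_centisecs value
    (hundredths of a second) into seconds between wb_timer firings."""
    return dirty_writeback_centisecs / 100

# Default: the timer wakes a pdflush thread via pdflush_operation()
# every 5 seconds to run wb_kupdate().
```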
