Write dirty pages to disk

Source: Internet
Author: User

As we know, the kernel keeps filling the page cache with pages that contain block device data. Whenever a process modifies some data, the corresponding page is marked as dirty, that is, its PG_dirty flag is set.

 

UNIX systems allow the writing of dirty buffers to block devices to be deferred, because this strategy noticeably improves system performance. Several write operations on a page in the cache may be satisfied by a single slow physical update of the corresponding disk blocks. Moreover, write operations are less critical than read operations: a process is usually not suspended because of a delayed write, while it is most often suspended because of a delayed read. Thanks to delayed writes, each physical block device services, on average, many more read requests than write requests.

 

A dirty page might stay in main memory until the last possible moment (that is, until system shutdown). However, pushing the delayed-write strategy to its limit has two major drawbacks:

 

(1) If a hardware or power supply failure occurs, the contents of RAM can no longer be retrieved, so many file updates made since the system was booted are lost.

 

(2) The size of the page cache (and therefore of the RAM required to hold it) might have to be huge: at least as large as the block devices being accessed.

 

Therefore, dirty pages are flushed (written) to disk under the following conditions:

(1) The page cache gets too full and more pages are needed, or the number of dirty pages becomes too large.

(2) Too much time has elapsed since a page became dirty.

(3) A process requests that all pending changes to a block device or to a particular file be flushed. It does so by invoking the sync(), fsync(), or fdatasync() system call.

 

Buffer pages introduce a further complication. The buffer heads associated with each buffer page allow the kernel to track the status of each individual block buffer. The PG_dirty flag of a buffer page should be set if at least one of its buffer heads has the BH_Dirty flag set. When the kernel selects a buffer page for flushing, it scans the associated buffer heads and writes to disk only the contents of the dirty blocks. As soon as the kernel has flushed all dirty blocks of a buffer page to disk, it clears the PG_dirty flag of the page.
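The relationship between the per-block and per-page dirty flags can be sketched in user-space C. This is a minimal illustration, not kernel code: the toy_ types and functions are hypothetical stand-ins for the buffer head, the buffer page, and their BH_Dirty and PG_dirty flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for a buffer head and a buffer page. */
struct toy_buffer_head {
    bool dirty;                     /* plays the role of BH_Dirty */
};

struct toy_buffer_page {
    struct toy_buffer_head bh[4];   /* e.g. four 1 KB blocks in a 4 KB page */
    size_t nr_bh;                   /* number of block buffers in use */
    bool dirty;                     /* plays the role of PG_dirty */
};

/* Dirtying one block makes the whole page dirty. */
void toy_mark_block_dirty(struct toy_buffer_page *pg, size_t i)
{
    pg->bh[i].dirty = true;
    pg->dirty = true;
}

/*
 * "Flush" the page: only dirty blocks are written. Returns the number of
 * blocks written and clears the page flag once no dirty block is left.
 */
size_t toy_flush_page(struct toy_buffer_page *pg)
{
    size_t written = 0;

    for (size_t i = 0; i < pg->nr_bh; i++) {
        if (pg->bh[i].dirty) {
            pg->bh[i].dirty = false;    /* pretend the block reached the disk */
            written++;
        }
    }
    pg->dirty = false;
    return written;
}
```

Dirtying two of the four blocks and then flushing the page writes exactly two blocks and leaves the page clean.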

 

1. The pdflush kernel threads

 

In earlier versions of Linux, the bdflush kernel thread systematically scanned the page cache looking for dirty pages to flush, and another kernel thread, kupdate, ensured that no page remained dirty for too long. Linux 2.6 replaced both with a group of general-purpose kernel threads called pdflush.

 

These kernel threads have a flexible structure. They act on two parameters: a pointer to the function to be executed by the thread and an argument for that function. The number of pdflush kernel threads in the system is adjusted dynamically: new threads are created when there are too few, and existing threads are killed when there are too many. Because the functions executed by these kernel threads can block, having several pdflush kernel threads rather than a single one improves system performance.

 

The creation and destruction of pdflush threads follow these rules:
- There must be at least two and at most eight pdflush kernel threads.
- If no pdflush thread has been idle during the last second, a new pdflush thread should be created.
- If more than one second has elapsed since the last pdflush thread became idle, a pdflush thread should be removed.
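The three rules above can be expressed as one small decision function. The sketch below is a user-space illustration under stated assumptions: the toy_ names, the enum, and the tick rate TOY_HZ are hypothetical, and the real kernel makes this decision inside the thread loop rather than in a separate function.

```c
#include <assert.h>

#define TOY_MIN_PDFLUSH_THREADS 2
#define TOY_MAX_PDFLUSH_THREADS 8
#define TOY_HZ 1000UL               /* assumed number of ticks per second */

enum toy_action { TOY_KEEP, TOY_CREATE, TOY_REMOVE };

/*
 * now               current time in ticks
 * nr_threads        total number of pdflush threads (idle and busy)
 * idle_list_empty   nonzero if no idle thread is currently available
 * last_empty        time at which the idle list last became empty
 * oldest_idle_since time at which the longest-idle thread went to sleep
 */
enum toy_action toy_pdflush_policy(unsigned long now, int nr_threads,
                                   int idle_list_empty,
                                   unsigned long last_empty,
                                   unsigned long oldest_idle_since)
{
    if (idle_list_empty) {
        /* No idle thread for more than one second: grow, up to the maximum. */
        if (now - last_empty > TOY_HZ && nr_threads < TOY_MAX_PDFLUSH_THREADS)
            return TOY_CREATE;
        return TOY_KEEP;
    }
    /* A thread has been idle for more than one second: shrink, down to the minimum. */
    if (nr_threads > TOY_MIN_PDFLUSH_THREADS && now - oldest_idle_since > TOY_HZ)
        return TOY_REMOVE;
    return TOY_KEEP;
}
```

Keeping the policy separate from the thread loop makes the lower and upper bounds easy to check in isolation.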

 

Each pdflush kernel thread has a pdflush_work descriptor. The descriptors of idle pdflush kernel threads are collected in the pdflush_list list; on multiprocessor systems, the pdflush_lock spin lock protects that list from concurrent access. The nr_pdflush_threads variable (its value can be read from the /proc/sys/vm/nr_pdflush_threads file) stores the total number of pdflush kernel threads, idle and busy. Finally, the last_empty_jifs variable stores the last time (in jiffies) at which the pdflush_list list became empty.

 

The fields of the pdflush_work descriptor are:

Type                       Field                  Description
struct task_struct *       who                    Pointer to the kernel thread descriptor
void (*)(unsigned long)    fn                     Callback function to be executed by the kernel thread
unsigned long              arg0                   Argument for the callback function
struct list_head           list                   Links for the pdflush_list list
unsigned long              when_i_went_to_sleep   Time (in jiffies) at which the kernel thread became available

 

Every pdflush kernel thread executes the __pdflush() function, which essentially loops until the kernel thread dies. Assume that the pdflush kernel thread is idle, so the thread is sleeping in the TASK_INTERRUPTIBLE state. As soon as the kernel thread is woken up, __pdflush() accesses its pdflush_work descriptor and executes the callback function stored in the fn field, passing it the argument stored in the arg0 field. When the callback function returns, __pdflush() checks the value of the last_empty_jifs variable: if there has been no idle pdflush kernel thread for more than one second and there are fewer than eight pdflush kernel threads, __pdflush() starts another kernel thread. Conversely, if the last entry in the pdflush_list list corresponds to a pdflush kernel thread that has been idle for more than one second, and there are more than two pdflush kernel threads in the system, __pdflush() terminates: as described in the "kernel thread" blog, the corresponding kernel thread executes the _exit() system call and is consequently destroyed. Otherwise, __pdflush() reinserts the pdflush_work descriptor of the kernel thread into the pdflush_list list and puts the kernel thread to sleep.

 

The pdflush_operation() function is used to activate an idle pdflush kernel thread. It acts on two parameters: a pointer fn to the function that must be executed and an argument arg0. It performs the following steps:

 

1. Extracts from the pdflush_list list a pointer pdf to the pdflush_work descriptor of an idle pdflush kernel thread. If the list is empty, it returns -1. If the list contained just one element, it sets last_empty_jifs to the current value of jiffies.

2. Stores the fn and arg0 parameters in pdf->fn and pdf->arg0, respectively.

3. Invokes wake_up_process() to wake up the idle pdflush kernel thread, that is, pdf->who.
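The three steps can be sketched in user-space C as follows. This is an illustration under simplifying assumptions, not the kernel's implementation: all toy_ names are hypothetical, the idle list is a plain singly linked list, the spin lock is omitted, and a woken flag stands in for wake_up_process().

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the pdflush_work descriptor. */
struct toy_pdflush_work {
    int who;                            /* stand-in for the task_struct pointer */
    void (*fn)(unsigned long);          /* callback to run */
    unsigned long arg0;                 /* argument for the callback */
    struct toy_pdflush_work *next;      /* stand-in for the list_head link */
    int woken;                          /* set instead of calling wake_up_process() */
};

struct toy_pdflush_work *toy_pdflush_list;  /* descriptors of idle threads */
unsigned long toy_last_empty_jifs;          /* when the idle list last became empty */
unsigned long toy_jiffies;                  /* simulated clock */
unsigned long toy_seen;                     /* written by the sample callback */

/* Sample callback used only to show that fn and arg0 are handed over intact. */
void toy_record(unsigned long arg)
{
    toy_seen = arg;
}

int toy_pdflush_operation(void (*fn)(unsigned long), unsigned long arg0)
{
    struct toy_pdflush_work *pdf = toy_pdflush_list;

    if (pdf == NULL)
        return -1;                          /* step 1: no idle thread available */
    toy_pdflush_list = pdf->next;           /* step 1: take it off the idle list */
    pdf->next = NULL;
    if (toy_pdflush_list == NULL)
        toy_last_empty_jifs = toy_jiffies;  /* the list has just become empty */

    pdf->fn = fn;                           /* step 2 */
    pdf->arg0 = arg0;

    pdf->woken = 1;                         /* step 3: wake up pdf->who */
    return 0;
}
```

The -1 return value lets the caller detect that all pdflush threads are busy and decide what to do instead.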

 

Which kinds of jobs are delegated to the pdflush kernel threads? Some of them are related to flushing dirty data. In particular, pdflush usually executes one of the following callback functions:
- background_writeout(): systematically walks the page cache looking for dirty pages to be flushed (see the next section, "Search for dirty pages to be refreshed").
- wb_kupdate(): checks that no page in the page cache remains dirty for too long (see the later section, "Write back old dirty pages").

 

2. Search for dirty pages to be refreshed

 

Every radix tree may include dirty pages to be flushed. Retrieving all of them therefore requires an exhaustive search among all the address_space objects associated with inodes that have an image on disk. Because the page cache may contain a large number of pages, scanning the whole cache in a single execution flow could keep the CPU and the disks busy for a long time. Therefore, Linux adopts a sophisticated mechanism that splits the scan of the page cache across several execution flows.

 

The wakeup_bdflush() function receives as its argument the number of dirty pages in the page cache that should be flushed; the value 0 means that all dirty pages in the cache should be written back to disk. The function invokes pdflush_operation() to wake up a pdflush kernel thread (see the previous section) and delegates to it the execution of the background_writeout() callback function, which retrieves the specified number of dirty pages from the page cache and writes them back to disk.

 

The wakeup_bdflush() function is executed when memory is scarce or when a user explicitly requests a flush operation. In particular, the function is invoked in the following cases:
- A User Mode process issues a sync() system call.
- The grow_buffers() function fails to allocate a new buffer page.
- The page frame reclaiming algorithm invokes free_more_memory() or try_to_free_pages().
- The mempool_alloc() function fails to allocate a new memory pool element.

 

In addition, a pdflush kernel thread executing the background_writeout() callback function is woken up by any process that modifies the contents of pages in the page cache and thereby causes the fraction of dirty pages to rise above the dirty background threshold. The background threshold is typically set to 10% of all pages in the system, but its value can be adjusted by writing to the /proc/sys/vm/dirty_background_ratio file.
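The threshold test itself is simple arithmetic. The following sketch assumes that the threshold is expressed as a percentage of the total number of pages, as described above; the function name and its parameters are hypothetical.

```c
#include <assert.h>

/*
 * Returns nonzero if the number of dirty pages exceeds the background
 * threshold, expressed as a percentage of all pages in the system.
 * Hypothetical helper: the kernel computes this from its own counters.
 */
int toy_over_background_threshold(unsigned long nr_dirty,
                                  unsigned long total_pages,
                                  unsigned int background_ratio)
{
    unsigned long thresh = total_pages * background_ratio / 100;

    return nr_dirty > thresh;
}
```

With 1,000 pages and the usual ratio of 10, the threshold is 100 pages, so the 101st dirty page is the one that triggers a wakeup.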

 

The background_writeout() function relies on a writeback_control structure, which acts as a two-way communication device: on one hand, it tells the auxiliary function writeback_inodes() what to do; on the other hand, it stores statistics about the number of pages written to disk. The most important fields of this structure are:

 

sync_mode: the synchronization mode. WB_SYNC_ALL means that if a locked inode is encountered, it must be waited for and cannot be skipped; WB_SYNC_HOLD means that locked inodes are put in a list for later consideration; WB_SYNC_NONE means that locked inodes are simply skipped.

bdi: if not NULL, it points to a backing_dev_info structure; in this case, only dirty pages belonging to the underlying block device are flushed.

older_than_this: if not NULL, inodes dirtied more recently than the specified time are skipped.

nr_to_write: the number of dirty pages to be written in this execution flow.

nonblocking: if this flag is set, the process cannot be blocked.

 

The background_writeout() function acts on a single parameter, nr_pages, which denotes the minimum number of pages that should be flushed to disk. It essentially executes the following steps:

 

1. Reads from the per-CPU page_state variable the number of pages and of dirty pages currently stored in the page cache. If the fraction of dirty pages is below a given threshold and at least nr_pages pages have been flushed to disk, the function terminates. The threshold is typically about 40% of the total number of pages in the system; it can be adjusted by writing to the /proc/sys/vm/dirty_ratio file.

2. Invokes writeback_inodes() to try to write 1,024 dirty pages (see below).

3. Checks the number of pages actually written and decreases the number of pages that still have to be written.

4. If fewer than 1,024 pages have been written, or if some pages have been skipped, the request queue of the block device is probably congested: in this case, background_writeout() puts the current process to sleep on a special wait queue for 100 ms, or until the queue is no longer congested.

 

5. Return to step 1.
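The five steps above form a loop that can be sketched in user-space C. This is a rough model, not the kernel function: every toy_ name is hypothetical, the simulated writeback merely drains a counter of dirty pages, the congestion sleep is replaced by a counter, and the early exit when nothing is left to write exists only to keep the sketch from looping forever.

```c
#include <assert.h>

#define TOY_MAX_WRITEBACK_PAGES 1024

struct toy_wbc {
    long nr_to_write;       /* in: page budget; out: budget left over */
};

/* Toy model of the page cache. */
struct toy_cache {
    long total_pages;
    long dirty_pages;
    int naps;               /* how many times the loop would have slept */
};

/* Simulated writeback_inodes(): writes as many dirty pages as the budget allows. */
void toy_sim_writeback(struct toy_cache *c, struct toy_wbc *wbc)
{
    long n = c->dirty_pages < wbc->nr_to_write ? c->dirty_pages
                                               : wbc->nr_to_write;

    c->dirty_pages -= n;
    wbc->nr_to_write -= n;
}

void toy_background_writeout(struct toy_cache *c, long nr_pages, int dirty_ratio)
{
    for (;;) {
        long thresh = c->total_pages * dirty_ratio / 100;
        struct toy_wbc wbc;

        /* Step 1: stop once below the threshold and the minimum was flushed. */
        if (c->dirty_pages < thresh && nr_pages <= 0)
            break;

        /* Step 2: try to write one batch of 1,024 pages. */
        wbc.nr_to_write = TOY_MAX_WRITEBACK_PAGES;
        toy_sim_writeback(c, &wbc);

        /* Step 3: account for the pages actually written. */
        nr_pages -= TOY_MAX_WRITEBACK_PAGES - wbc.nr_to_write;

        /* Step 4: a short batch suggests congestion; the kernel would sleep. */
        if (wbc.nr_to_write > 0) {
            c->naps++;
            if (c->dirty_pages == 0)
                break;      /* sketch-only exit: nothing left to write */
        }
        /* Step 5: loop back to step 1. */
    }
}
```

Starting from 5,000 dirty pages out of 10,000 with nr_pages set to 1,500, the loop runs two full batches and stops with 2,952 dirty pages, below the 40% threshold.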

 

The writeback_inodes() function acts on a single parameter, the wbc pointer to a writeback_control descriptor. The nr_to_write field of this descriptor contains the number of pages to be flushed to disk. When the function returns, the same field contains the number of pages that still have to be flushed; if everything went smoothly, its value is 0.

 

Let us assume that writeback_inodes() is called under the following conditions: the wbc->bdi and wbc->older_than_this pointers are NULL, the synchronization mode is WB_SYNC_NONE, and the wbc->nonblocking flag is set (these are the values set by the background_writeout() function). The writeback_inodes() function scans the list of superblocks rooted at the super_blocks variable. The scan stops when the whole list has been traversed or when the target number of pages to be flushed has been reached. For each superblock sb, the function executes the following steps:

 

1. Checks whether the sb->s_dirty and sb->s_io lists are empty: the first list collects the dirty inodes of the superblock, while the second collects the inodes waiting to be transferred to disk (see below). If both lists are empty, the inodes of the corresponding filesystem have no dirty pages, so the function moves on to the next superblock in the list.

 

2. At this point, the superblock has dirty inodes. Invokes sync_sb_inodes() on the superblock sb. This function performs the following operations:

a) Moves all inodes of sb->s_dirty into the list pointed to by sb->s_io and clears the list of dirty inodes.

b) Gets the next inode pointer from sb->s_io. If this list is empty, it returns.

c) If the inode was dirtied after sync_sb_inodes() started, it skips the dirty pages of the inode and returns. Note that some dirty inodes might remain in the sb->s_io list.

d) If the current process is a pdflush kernel thread, it checks whether a pdflush kernel thread running on another CPU is already trying to flush dirty pages of files belonging to this block device. This is done with an atomic test-and-set operation on the BDI_pdflush flag of the inode's backing_dev_info. Essentially, it is pointless to have more than one pdflush kernel thread working on the same request queue.

e) Increases the usage counter of the inode by one.

f) Invokes __writeback_single_inode() to write back the dirty buffers associated with the selected inode:

i. If the inode is locked, moves it into the list of dirty inodes (inode->i_sb->s_dirty) and returns 0. (Because we assume that the wbc->sync_mode field is not equal to WB_SYNC_ALL, the function does not block waiting for the inode to be unlocked.)

ii. Uses the writepages method of the inode's address space, or the mpage_writepages() function if that method is not defined, to write up to wbc->nr_to_write dirty pages. This function uses find_get_pages_tag() to quickly retrieve all dirty pages in the inode's address space (see the section on radix tree tags earlier in this chapter). The details are described in the next chapter.

iii. If the inode is dirty, uses the write_inode method of the superblock to write the inode to disk. The functions that implement this method usually rely on submit_bh() to transfer a single block of data.

iv. Checks the status of the inode. If the inode still has dirty pages, moves it back into the sb->s_dirty list; if its usage counter is 0, moves it into the inode_unused list; otherwise, moves it into the inode_in_use list.

v. Returns the error code of the function invoked in step 2f(ii).

g) Back in the sync_sb_inodes() function. If the current process is a pdflush kernel thread, clears the BDI_pdflush flag set in step 2d.

h) If some pages of the inode just processed were skipped, the inode includes locked buffers: moves all remaining inodes in the sb->s_io list back into the sb->s_dirty list, so that they are reconsidered later.

i) Decreases the usage counter of the inode by one.

j) If wbc->nr_to_write is greater than 0, goes back to step 2b to look for other dirty inodes of the same superblock. Otherwise, the sync_sb_inodes() function terminates.

 

3. Back in the writeback_inodes() function. If wbc->nr_to_write is greater than 0, jumps to step 1 and continues with the next superblock in the global list. Otherwise, it returns.
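The way the nr_to_write budget is shared across superblocks and inodes can be sketched as follows. This is a heavily simplified model under stated assumptions: the toy_ types are hypothetical, arrays stand in for the s_dirty and s_io lists, and locking, the BDI_pdflush flag, usage counters, and skipped pages (steps 2c to 2e and 2g to 2i) are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

struct toy_inode {
    long dirty_pages;
};

struct toy_super_block {
    struct toy_inode *inodes;   /* stand-in for the s_dirty and s_io lists */
    size_t nr_inodes;
};

struct toy_writeback_control {
    long nr_to_write;           /* in: page budget; out: budget left over */
};

/* Stand-in for sync_sb_inodes(): models steps 2b, 2f and 2j only. */
void toy_sync_sb_inodes(struct toy_super_block *sb,
                        struct toy_writeback_control *wbc)
{
    for (size_t i = 0; i < sb->nr_inodes && wbc->nr_to_write > 0; i++) {
        struct toy_inode *inode = &sb->inodes[i];
        long n = inode->dirty_pages < wbc->nr_to_write ? inode->dirty_pages
                                                       : wbc->nr_to_write;

        inode->dirty_pages -= n;    /* pretend n pages were written */
        wbc->nr_to_write -= n;
    }
}

/* Stand-in for writeback_inodes(): walks the superblocks until the budget runs out. */
void toy_writeback_inodes(struct toy_super_block *sbs, size_t nr_sbs,
                          struct toy_writeback_control *wbc)
{
    for (size_t s = 0; s < nr_sbs; s++) {
        if (sbs[s].nr_inodes == 0)
            continue;                       /* step 1: nothing dirty here */
        toy_sync_sb_inodes(&sbs[s], wbc);   /* step 2 */
        if (wbc->nr_to_write <= 0)
            break;                          /* step 3: budget exhausted */
    }
}
```

Because the remaining budget is returned in the same field, a caller can tell whether a batch was cut short simply by inspecting nr_to_write afterwards.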

 

3. Write back old dirty pages

 

As mentioned above, the kernel tries to avoid the risk of starvation that arises when some pages are not flushed for a long time. Therefore, when a page has stayed dirty for a certain amount of time, the kernel explicitly starts an I/O data transfer that writes its contents to disk.

 

The job of writing back old dirty pages is delegated to a pdflush kernel thread that is woken up periodically. During kernel initialization, the page_writeback_init() function sets up the wb_timer dynamic timer so that it expires after the number of hundredths of a second specified in the dirty_writeback_centisecs file (usually 500 hundredths of a second, that is, 5 seconds; the value can be adjusted by writing to the /proc/sys/vm/dirty_writeback_centisecs file). The timer function wb_timer_fn() essentially invokes pdflush_operation(), passing it the address of the wb_kupdate() callback function.

 

The wb_kupdate() function walks the page cache looking for "old" dirty inodes. It executes the following steps:

 

1. Invokes the sync_supers() function to write the dirty superblocks to disk (see the next section). Although this is not strictly related to flushing pages in the page cache, the call ensures that no superblock remains dirty for more than, usually, 5 seconds.

2. Stores in the older_than_this field of the writeback_control descriptor a pointer to a value, expressed in jiffies, that corresponds to the current time minus 30 seconds. The longest time a page is allowed to remain dirty is therefore 30 seconds.

3. Determines, from the per-CPU page_state variable, the approximate number of dirty pages currently stored in the page cache.

4. Invokes writeback_inodes() repeatedly until the number of pages written to disk reaches the value determined in the previous step, or until all pages that have been dirty for more than 30 seconds have been written. The function may sleep if some request queue becomes congested during the loop.

5. Uses mod_timer() to restart the wb_timer dynamic timer: it expires dirty_writeback_centisecs hundredths of a second after the moment this function was invoked (or one second from now, if this execution took too long).
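The two time computations in steps 2 and 5 can be sketched in a few lines. The sketch assumes a tick rate of TOY_HZ ticks per second and uses hypothetical function names; the wraparound-safe comparison mirrors the usual way of comparing jiffies values.

```c
#include <assert.h>

#define TOY_HZ 1000UL   /* assumed number of ticks per second */

/* Step 2: pages dirtied before this time are considered too old. */
unsigned long toy_oldest_allowed(unsigned long now, unsigned int max_age_secs)
{
    return now - max_age_secs * TOY_HZ;
}

/*
 * Step 5: the timer should fire interval_centisecs hundredths of a second
 * after the function started, but never sooner than one second from now.
 */
unsigned long toy_next_expiry(unsigned long start, unsigned long now,
                              unsigned int interval_centisecs)
{
    unsigned long next = start + interval_centisecs * TOY_HZ / 100;

    /* Wraparound-safe test for "next is earlier than now + 1 s". */
    if ((long)(next - (now + TOY_HZ)) < 0)
        next = now + TOY_HZ;
    return next;
}
```

With the usual interval of 500, a run that starts at tick 10,000 and finishes quickly is rescheduled for tick 15,000, while a run that finishes at tick 14,500 is pushed back to tick 15,500.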

 

4. The sync(), fsync(), and fdatasync() system calls

 

Next, we briefly introduce the three system calls that user applications can use to flush dirty buffers to disk:

 

sync()

Allows a process to flush all dirty buffers to disk.

fsync()

Allows a process to flush all blocks that belong to a specific open file to disk.

fdatasync()

Very similar to fsync(), but does not flush the inode block of the file.

 

4.1 The sync() system call

 

The service routine of the sync() system call, sys_sync(), invokes a series of auxiliary functions:

    wakeup_bdflush(0);
    sync_inodes(0);
    sync_supers();
    sync_filesystems(0);
    sync_filesystems(1);
    sync_inodes(1);

As described in the previous section, wakeup_bdflush() starts a pdflush kernel thread, which flushes to disk all dirty pages contained in the page cache.

 

The sync_inodes() function scans the list of superblocks looking for dirty inodes to be flushed. It acts on a wait parameter, which specifies whether the function must wait until flushing has completed before returning. The function scans the superblocks of all currently mounted filesystems. For each superblock containing dirty inodes, sync_inodes() first invokes sync_sb_inodes() to flush the corresponding dirty pages, and then invokes sync_blockdev() to explicitly flush the dirty buffer pages of the block device that holds the superblock. This second step is needed because the write_inode superblock method of many disk-based filesystems simply marks the block buffer corresponding to the disk inode as dirty; sync_blockdev() makes sure that the updates made by sync_sb_inodes() are actually written to disk.

 

The sync_supers() function writes the dirty superblocks to disk, if necessary, by using the appropriate write_super superblock operation. Finally, sync_filesystems() executes the sync_fs superblock method for all writable filesystems. This method is simply a hook offered to a filesystem in case it needs to perform some special operation at each sync; it is used only by journaling filesystems such as Ext3.

 

Note that sync_inodes() and sync_filesystems() are each invoked twice, once with the wait parameter equal to 0 and once with it equal to 1. This is done on purpose: in the first pass, they quickly flush the unlocked inodes to disk; in the second pass, they wait for each locked inode to be unlocked and finish writing them one by one.

 

4.2 The fsync() and fdatasync() system calls

 

The fsync() system call forces the kernel to write to disk all dirty buffers that belong to the file specified by the fd file descriptor parameter (including the buffer that contains its inode, if necessary). The corresponding service routine derives the address of the file object and then invokes the fsync method. Usually, this method ends up invoking the __writeback_single_inode() function, which writes back both the dirty pages associated with the selected inode and the inode itself.

 

The fdatasync() system call is very similar to fsync(), but it is supposed to write to disk only the buffers that contain the file's data, not those that contain inode information. Because Linux 2.6 does not provide a dedicated file method for fdatasync(), this system call uses the fsync method and is therefore identical to fsync().
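From the application side, these system calls are typically used right after a write that must survive a crash. The sketch below is a minimal user-space example for a Linux system; the write_durably() helper and its data_only parameter are hypothetical names chosen for this illustration, while open(), write(), fsync(), fdatasync(), and close() are the standard POSIX calls.

```c
#define _XOPEN_SOURCE 700   /* expose fsync(), fdatasync() and sync() in <unistd.h> */

#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Writes text to path and asks the kernel to flush it to disk before
 * returning. If data_only is nonzero, fdatasync() is used instead of
 * fsync(). Returns 0 on success and -1 on failure.
 */
int write_durably(const char *path, const char *text, int data_only)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    size_t len = strlen(text);
    int rc;

    if (fd < 0)
        return -1;

    /* The write only dirties pages in the page cache. */
    rc = (write(fd, text, len) == (ssize_t)len) ? 0 : -1;

    /* Force the dirty buffers of this file to disk. */
    if (rc == 0)
        rc = data_only ? fdatasync(fd) : fsync(fd);

    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

A program that wants to flush everything, rather than a single file, would call sync() instead, which takes no arguments and returns nothing.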

 
