Disk caching under Linux

Last Update:2017-02-28 Source: Internet

Author: User

Tags data structures file system


Some time ago, while developing a system that uses SSDs for caching, I noticed that a large amount of data accumulates in the disk cache when writing at high speed. Holding too much data in the disk cache is dangerous: if the machine fails before the data has been written to disk, a lot of data can be lost. Flushing every write to disk immediately, however, makes writes far too slow. To understand how Linux handles disk writes, I recently studied the mechanism in depth.

The VFS (Virtual File System) lets Linux work with different file systems such as ext3, ext4, XFS, and NTFS. Besides providing a common interface to all file systems, it has another important function related to system performance: caching. The VFS introduces a disk cache, a software mechanism that lets the kernel keep in RAM some information that normally resides on disk, so that further accesses to that data can be served quickly without going to the slow disk. The disk caches fall roughly into the following three types:

Directory entry cache (dentry cache): mainly stores directory entry objects, which describe file system path names.

Inode cache: mainly stores inode objects, which describe on-disk inodes.

Page cache: mainly stores whole pages of data. The data in each page belongs to a file, and all file reads and writes go through the page cache. It is the main disk cache used by the Linux kernel.

Because of this caching, the VFS uses delayed (deferred) writes for file data. Unless a write system call is issued in synchronous mode, most of the data stays in the cache and is flushed to disk only when certain conditions are met.
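To make the difference between a deferred write and a synchronous one concrete, here is a minimal Python sketch. The helper name write_durably and the file name are hypothetical and exist only for illustration. It relies on the standard os.write() and os.fsync() calls: a plain write normally only copies data into the page cache, while fsync asks the kernel to push the file's dirty pages to the device before returning.

```python
import os
import tempfile


def write_durably(path: str, data: bytes) -> int:
    """Write data to path and return once the kernel reports it flushed.

    Hypothetical helper: a plain write() usually lands in the page cache
    only; os.fsync() requests that the file's dirty pages be written out.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        while written < len(data):
            # os.write may write fewer bytes than requested, so loop.
            written += os.write(fd, data[written:])
        os.fsync(fd)  # returns after the flush request for this file completes
        return written
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "example.dat")  # hypothetical file name
        print(write_durably(target, b"hello page cache\n"))
```

Calling fsync after every write trades throughput for safety, which is exactly the tension described at the start of this article. Batching several writes before one fsync is a common middle ground.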

How does the kernel flush data to disk? The following two sections answer this question.

1. Writing dirty pages to disk

As we know, the kernel keeps filling the page cache with pages containing block device data. Whenever a process modifies some data, the corresponding page is marked as dirty, that is, its PG_dirty flag is set. Unix systems allow the write of dirty buffers to block devices to be deferred, because this strategy significantly improves system performance. Several writes to a page in the cache can be satisfied by a single slow physical update of the corresponding disk blocks. Moreover, writes are less urgent than reads: a process is usually not suspended because of a delayed write, whereas it is most often suspended because of a delayed read. Thanks to deferred writes, any physical block device serves, on average, many more read requests than write requests. A dirty page might stay in main memory until the last possible moment (that is, until system shutdown). However, pushing the deferred-write strategy to this limit has two major drawbacks. First, if a hardware error or power failure occurs, the contents of RAM can no longer be retrieved, so many modifications made to files since system startup are lost. Second, the page cache (and hence the RAM needed to hold it) could become very large, at least as large as the block devices being accessed. Therefore, dirty pages are flushed (written) to disk under the following conditions:

The page cache has become too full and more pages are needed, or the number of dirty pages has become too large.

Too much time has passed since a page became dirty.

A process requests that all pending changes to a block device or to a particular file be flushed, by invoking the sync(), fsync(), or fdatasync() system call.

Buffer pages make matters more complicated. The buffer heads associated with each buffer page let the kernel keep track of the state of each individual block buffer. If at least one of the buffer heads has its BH_Dirty flag set, the PG_dirty flag of the corresponding buffer page should be set. When the kernel selects a buffer page for flushing, it scans the associated buffer heads and writes only the contents of the dirty blocks to disk. Once the kernel has flushed all the dirty blocks of a buffer page to disk, it clears the PG_dirty flag of the page.
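The relationship between a buffer page and its block buffers can be illustrated with a small Python model. This is a toy sketch, not kernel code: the class names BufferHead and BufferPage and their fields are assumptions made for illustration. It shows only the two rules stated above: the page counts as dirty if any of its buffers is dirty, and a flush writes only the dirty blocks.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BufferHead:
    """Toy stand-in for a buffer head (hypothetical model)."""
    block_no: int
    dirty: bool = False  # plays the role of the per-buffer dirty flag


@dataclass
class BufferPage:
    """Toy stand-in for a buffer page holding several block buffers."""
    buffers: List[BufferHead] = field(default_factory=list)

    @property
    def dirty(self) -> bool:
        # The page is dirty if at least one of its buffers is dirty.
        return any(b.dirty for b in self.buffers)

    def flush(self) -> List[int]:
        """Write back only the dirty blocks; return their block numbers."""
        written = []
        for b in self.buffers:
            if b.dirty:
                written.append(b.block_no)  # a kernel would submit block I/O here
                b.dirty = False
        return written


if __name__ == "__main__":
    page = BufferPage([BufferHead(10), BufferHead(11, dirty=True)])
    print(page.dirty, page.flush(), page.dirty)
```

In this model the page-level flag is derived from the buffers rather than stored, which keeps the two levels from going out of sync; the kernel keeps separate flags and updates them explicitly.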

2. The pdflush kernel threads

Earlier versions of Linux used the bdflush kernel thread to systematically scan the page cache for dirty pages to flush, and a second kernel thread, kupdate, to ensure that no page stayed dirty for too long. Linux 2.6 replaced both with a group of general-purpose kernel threads called pdflush. These threads have a flexible structure and act on two parameters: a pointer to the function the thread is to execute, and an argument for that function. The number of pdflush kernel threads in the system is adjusted dynamically: new threads are created when there are too few, and threads are killed when there are too many. Because the functions executed by these kernel threads can block, having several pdflush threads rather than one improves system performance. The creation and termination of pdflush threads follow these rules:

There must be at least two and at most eight pdflush kernel threads.

If there has been no idle pdflush thread during the last second, a new pdflush thread should be created.

If more than one second has passed since the last pdflush thread became idle, a pdflush thread should be removed.

Every pdflush kernel thread has a pdflush_work descriptor, whose fields are as follows:

Type                       Field                  Description
struct task_struct *       who                    Pointer to the kernel thread descriptor
void (*)(unsigned long)    fn                     Callback function to be executed by the kernel thread
unsigned long              arg0                   Argument to the callback function
struct list_head           list                   Links for the pdflush_list list
unsigned long              when_i_went_to_sleep   Time (in jiffies) when the kernel thread became available
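The descriptor fields and the thread-count rules described above can be sketched in Python as follows. This is a simplified model under stated assumptions, not the kernel's implementation: the function adjust_thread_count, its parameters, and the use of seconds instead of jiffies are all hypothetical, and the task_struct pointer is replaced by a plain name.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

MIN_PDFLUSH_THREADS = 2
MAX_PDFLUSH_THREADS = 8
IDLE_WINDOW_SECONDS = 1.0


@dataclass
class PdflushWork:
    """Toy version of the pdflush_work descriptor."""
    who: str                                    # stands in for the task_struct pointer
    fn: Optional[Callable[[int], None]] = None  # callback to execute
    arg0: int = 0                               # argument for the callback
    when_i_went_to_sleep: float = 0.0           # seconds here; the kernel uses jiffies


def adjust_thread_count(nr_threads: int, now: float,
                        idle: List[PdflushWork],
                        idle_list_empty_since: Optional[float]) -> int:
    """Return the new thread count after applying the rules from the text.

    idle: descriptors of the currently idle threads.
    idle_list_empty_since: when the idle list became empty, or None if it is not empty.
    """
    if not idle:
        # No idle thread for more than one second: create one, up to the maximum.
        if (idle_list_empty_since is not None
                and now - idle_list_empty_since > IDLE_WINDOW_SECONDS
                and nr_threads < MAX_PDFLUSH_THREADS):
            return nr_threads + 1
        return nr_threads
    # More than one second since the most recent thread went idle: remove one.
    most_recent_sleep = max(w.when_i_went_to_sleep for w in idle)
    if (now - most_recent_sleep > IDLE_WINDOW_SECONDS
            and nr_threads > MIN_PDFLUSH_THREADS):
        return nr_threads - 1
    return nr_threads
```

For example, with two threads, no idle thread since time 8.0, and the current time 10.0, the function returns 3; with eight threads in the same situation it stays at 8 because of the upper bound.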

When the system has no dirty pages to flush, the pdflush threads sleep; they are woken up by the pdflush_operation() function. What do the pdflush kernel threads do once woken? Some of their work concerns flushing dirty data. In particular, pdflush usually executes one of the following callback functions:

1. background_writeout(): systematically scans the page cache for dirty pages to flush.

Finding the dirty pages to flush requires an exhaustive search of all address_space objects (each a search tree) that correspond to inodes having an image on disk. Because the page cache may contain a large number of pages, scanning the whole cache in a single execution flow could keep the CPU and the disks busy for a long time. Linux therefore uses a fairly complex mechanism that splits the scan of the page cache into several execution flows. The wakeup_bdflush() function is executed when memory is scarce or when a user explicitly requests a flush (for example, a user-mode process issues a sync() system call). wakeup_bdflush() invokes pdflush_operation() to wake up a pdflush kernel thread and delegates to it the execution of the callback function background_writeout(), which retrieves a specified number of dirty pages from the page cache and writes them back to disk. In addition, a pdflush kernel thread executing background_writeout() is woken up when two conditions hold: a process modifies the contents of a page in the page cache, and this causes the fraction of dirty pages to rise above a dirty background threshold. The background threshold is typically 10% of all pages in the system, and it can be adjusted by writing to the file /proc/sys/vm/dirty_background_ratio.
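The threshold test can be illustrated with a short Python sketch. It is a simplification for illustration only: the function names are hypothetical, the comparison is done against all pages as the text describes, and the helper falls back to the typical value of 10 when the /proc file cannot be read (for example on a non-Linux system).

```python
from pathlib import Path

RATIO_FILE = Path("/proc/sys/vm/dirty_background_ratio")
DEFAULT_RATIO_PERCENT = 10  # the typical value mentioned in the text


def read_background_ratio(path: Path = RATIO_FILE) -> int:
    """Return the configured percentage, or the default if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return DEFAULT_RATIO_PERCENT


def over_background_threshold(dirty_pages: int, total_pages: int,
                              ratio_percent: int = DEFAULT_RATIO_PERCENT) -> bool:
    """True when dirty pages exceed ratio_percent of all pages."""
    # Integer arithmetic avoids rounding problems with percentages.
    return dirty_pages * 100 > total_pages * ratio_percent


if __name__ == "__main__":
    ratio = read_background_ratio()
    print(ratio, over_background_threshold(150, 1000, ratio))
```

With 1000 pages and the default ratio, background writeout in this model would start once more than 100 pages are dirty.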

2. wb_kupdate(): checks whether the page cache contains pages that have been dirty for a long time, so that pages are not starved by going unflushed for long periods.

During initialization the kernel sets up the wb_timer dynamic timer. Its interval is the number of hundredths of a second specified in the dirty_writeback_centisecs file (usually 500, that is, 5 seconds); this value can be adjusted by writing to /proc/sys/vm/dirty_writeback_centisecs. The timer function invokes pdflush_operation(), passing it the address of the wb_kupdate() function. wb_kupdate() walks the page cache looking for stale dirty inodes, writes to disk the pages that have stayed dirty for more than 30 seconds, and then resets the timer.
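The periodic pass can be modeled in a few lines of Python. This is a toy sketch of the selection rule only, not the kernel's code: the function names and the dictionary mapping page indexes to the time each page became dirty are assumptions made for illustration, and times are plain seconds.

```python
from typing import Dict, List

WRITEBACK_INTERVAL_CENTISECS = 500  # typical dirty_writeback_centisecs value
DIRTY_EXPIRE_SECONDS = 30.0         # age limit quoted in the text


def centisecs_to_seconds(centisecs: int) -> float:
    """Convert hundredths of a second to seconds."""
    return centisecs / 100.0


def expired_dirty_pages(dirtied_at: Dict[int, float], now: float,
                        expire_seconds: float = DIRTY_EXPIRE_SECONDS) -> List[int]:
    """Return indexes of pages that have been dirty for longer than expire_seconds."""
    return sorted(idx for idx, t in dirtied_at.items() if now - t > expire_seconds)


if __name__ == "__main__":
    # Pages 1 and 2 became dirty long enough ago; page 3 is still recent.
    pages = {1: 0.0, 2: 25.0, 3: 50.0}
    print(centisecs_to_seconds(WRITEBACK_INTERVAL_CENTISECS))
    print(expired_dirty_pages(pages, now=60.0))
```

In this model a page dirtied just after one pass may wait up to the expiry age plus one timer interval before it is written, which is why both values matter when estimating how much data could be lost on a crash.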

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
