An in-depth analysis of the Linux disk caching mechanism and SSD write amplification

Source: Internet
Author: User
Tags: garbage collection, trim

Some time ago, while developing a system that uses SSDs for caching, I noticed that a large amount of data accumulated in the disk cache when writing at high speed. Holding too much data in the disk cache is dangerous: if the machine fails before the data is written to disk, a lot of data can be lost. Flushing every write to disk in real time, on the other hand, makes writing far too slow. To understand how Linux handles disk writes, I recently studied the mechanism in depth.
The VFS (Virtual File System) allows Linux to work with different file systems, such as ext3, ext4, XFS, and NTFS. Besides providing a common external interface for all file systems, it has another important function related to system performance: caching. The VFS introduces a disk cache mechanism, a software mechanism that lets the kernel keep in RAM some information that is normally stored on disk, so that further accesses to that data can be satisfied quickly without a slow access to the disk itself. Disk caches fall roughly into the following three types:
Dentry cache: stores mainly the dentry (directory entry) objects that describe file system pathnames.
Inode cache: stores mainly the inode (index node) objects that describe disk inodes.
Page cache: stores complete pages of data. The data in each page belongs to some file, and all file reads and writes rely on the page cache. It is the main disk cache used by the Linux kernel.
Because of this caching, the VFS uses deferred writes for file data. If a write system call is issued without a synchronous write mode, most of the data stays in the cache and is flushed to disk only when certain conditions are met.
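The difference between a deferred write and a synchronous one can be sketched in Python. A plain os.write() returns once the data is in the page cache, while os.fsync() blocks until the kernel reports that the file's data has been pushed to the storage device. The function name durable_write and the file name used in the demo are hypothetical, chosen only for illustration.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> int:
    """Write data to path and wait until the kernel has flushed it to the device."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        # os.write may write fewer bytes than requested, so loop until done.
        while written < len(data):
            written += os.write(fd, data[written:])
        # Without this call the data may sit in the page cache as dirty pages.
        os.fsync(fd)
    finally:
        os.close(fd)
    return written


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "block.bin")  # hypothetical file
    print(durable_write(demo_path, b"cached block"), "bytes written and synced")
```

Calling fsync after every small write gives the strongest durability but also the low write throughput described above, so a real system would normally batch several writes before each sync.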

How does the kernel flush data to disk? The following two sections answer this question.

1. Writing dirty pages to disk

As we know, the kernel keeps filling the page cache with pages containing block device data. Whenever a process modifies some data, the corresponding page is marked as dirty, that is, its PG_dirty flag is set.
Unix systems allow the writing of dirty buffers to block devices to be deferred, because this strategy noticeably improves system performance. Several write operations on a page in the cache can be satisfied by a single slow physical update of the corresponding disk blocks. Moreover, write operations are less urgent than read operations: a process is usually not suspended because of a delayed write, whereas it is most often suspended because of a delayed read. Thanks to deferred writes, each physical block device services, on average, many more read requests than write requests.
A dirty page might stay in main memory until the last possible moment, that is, until system shutdown. However, pushing the deferred-write strategy to this limit has two major drawbacks:
First, if a hardware error or power failure occurs, the contents of RAM can no longer be retrieved, so many modifications made to files since the system was started are lost.
Second, the page cache, and hence the RAM needed to hold it, would have to be huge: at least as large as the block devices being accessed.
Therefore, dirty pages are flushed (written) to disk under the following conditions:
The page cache becomes too full and more pages are needed, or the number of dirty pages becomes too large.
Too much time has passed since a page became dirty.
A process requests that all pending changes to a block device or to a particular file be flushed, by invoking the sync(), fsync(), or fdatasync() system call.
Buffer pages introduce a further complication. The buffer heads associated with each buffer page let the kernel track the state of each individual block buffer. The PG_dirty flag of a buffer page should be set if at least one of its buffer heads has the BH_Dirty flag set. When the kernel selects a buffer page for flushing, it scans the associated buffer heads and writes only the contents of the dirty blocks to disk. Once the kernel has flushed all dirty blocks of a buffer page to disk, it clears the page's PG_dirty flag.
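The amount of data waiting to be flushed can be observed from user space. The sketch below parses the Dirty and Writeback fields of /proc/meminfo; the helper names and the SAMPLE text (with made-up values) are assumptions for illustration, and the sample is used only when /proc/meminfo cannot be read.

```python
def parse_meminfo(text: str) -> dict:
    """Return {field: number} for lines shaped like 'Dirty:   1234 kB'."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


# Hypothetical sample with invented values, in the layout of /proc/meminfo.
SAMPLE = """MemTotal:       16384000 kB
Dirty:              2048 kB
Writeback:           128 kB
"""


def pending_writeback_kb(text: str) -> int:
    """kB that are dirty in the cache plus kB currently being written back."""
    info = parse_meminfo(text)
    return info.get("Dirty", 0) + info.get("Writeback", 0)


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE
    print(pending_writeback_kb(text), "kB dirty or under write-back")
```

Watching this number while a program writes at high speed shows dirty data building up in the cache and then draining as the kernel flushes it.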

2. The pdflush kernel threads
Earlier versions of Linux used the bdflush kernel thread to systematically scan the page cache for dirty pages to flush, and a second kernel thread, kupdate, to ensure that no page remained dirty for too long. Linux 2.6 replaces both with a group of general-purpose kernel threads called pdflush.
These kernel threads have a flexible structure. They act on two parameters: a pointer to the function the thread should execute, and an argument for that function. The number of pdflush kernel threads in the system is adjusted dynamically: new threads are created when there are too few, and threads are killed when there are too many. Because the functions these threads execute can block, creating several pdflush kernel threads rather than one improves system performance.
The creation and destruction of pdflush threads follow these rules:
There must be at least two and at most eight pdflush kernel threads.
If there has been no idle pdflush thread during the last second, a new pdflush thread should be created.
If more than one second has passed since the last pdflush thread became idle, a pdflush thread should be removed.
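These three rules can be expressed as a small decision function. This is a toy model in Python, not the kernel's code: the function name, its parameters, and the use of floating-point seconds are all assumptions, and the third rule is read here as referring to the thread that went idle most recently, which is one interpretation of the text.

```python
MIN_PDFLUSH_THREADS = 2
MAX_PDFLUSH_THREADS = 8
IDLE_WINDOW = 1.0  # seconds


def adjust_pool(n_threads, idle_since, last_empty, now):
    """Return the thread count after applying the three rules once.

    n_threads:  current number of pdflush threads
    idle_since: times at which each currently idle thread went to sleep
    last_empty: time at which the set of idle threads last became empty
    now:        current time
    """
    if not idle_since:
        # No idle thread: grow if that has lasted more than one second.
        if now - last_empty > IDLE_WINDOW and n_threads < MAX_PDFLUSH_THREADS:
            return n_threads + 1
        return n_threads
    # Some thread is idle: shrink if even the newest idler has waited over 1 s.
    if n_threads > MIN_PDFLUSH_THREADS and now - max(idle_since) > IDLE_WINDOW:
        return n_threads - 1
    return n_threads


if __name__ == "__main__":
    print(adjust_pool(2, [], last_empty=0.0, now=2.0))     # all busy for 2 s
    print(adjust_pool(3, [0.5], last_empty=0.0, now=2.0))  # idle for 1.5 s
```

The lower and upper bounds are checked before any change, so the pool never leaves the range of two to eight threads regardless of load.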
Every pdflush kernel thread has a pdflush_work descriptor, which records, among other things, the function the thread is to execute and the argument to pass to it.

When there are no dirty pages to flush, a pdflush thread sleeps; it is eventually woken up by the pdflush_operation() function. What do the pdflush kernel threads do once awake? Part of their work relates to flushing dirty data. In particular, pdflush usually executes one of the following callback functions:
1. background_writeout(): systematically scans the page cache for dirty pages to flush.

Finding the dirty pages that need flushing requires an exhaustive search of all address_space objects (each with its own search tree) belonging to inodes that have an image on disk. Because the page cache may hold a very large number of pages, scanning the whole cache in a single execution flow could keep the CPU and the disks busy for a long time, so Linux uses a more elaborate mechanism that splits the scan across several execution flows. The wakeup_bdflush() function is executed when memory runs short or when a user explicitly requests a flush (for example, a user-mode process issues the sync() system call). wakeup_bdflush() calls pdflush_operation() to wake up a pdflush kernel thread and delegate to it the background_writeout() callback. background_writeout() retrieves a specified number of dirty pages from the page cache and writes them back to disk. In addition, a pdflush kernel thread executing background_writeout() is woken up when two conditions hold together: a process modifies the contents of a page in the page cache, and that modification pushes the proportion of dirty pages above the dirty background threshold. The background threshold is typically set to 10% of all pages in the system, and it can be adjusted by writing to the file /proc/sys/vm/dirty_background_ratio.
2. wb_kupdate(): checks that no page in the page cache stays dirty for too long, which avoids the risk of starvation when some pages go unflushed for a long time.

During initialization the kernel creates the wb_timer dynamic timer. Its interval is specified in hundredths of a second in dirty_writeback_centisecs (usually 500, that is, 5 seconds), and the value can be changed by writing to the file /proc/sys/vm/dirty_writeback_centisecs. The timer function calls pdflush_operation(), passing it the address of the wb_kupdate() function. wb_kupdate() walks the page cache looking for stale dirty inodes, writes to disk the pages that have stayed dirty for more than 30 seconds, and then resets the timer.
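The two triggers described above, the background threshold and the age limit, come down to simple arithmetic. The following Python sketch works under stated assumptions: all function names are hypothetical, the threshold is computed against the total page count exactly as the text describes (a simplification of what a real kernel does), and the 30-second limit is taken from the text as a default.

```python
def background_threshold_pages(total_pages, ratio_percent=10):
    """Number of dirty pages above which background write-back should start."""
    return total_pages * ratio_percent // 100


def over_background_threshold(dirty_pages, total_pages, ratio_percent=10):
    """True when the dirty page count exceeds the background threshold."""
    return dirty_pages > background_threshold_pages(total_pages, ratio_percent)


def centisecs_to_seconds(centisecs):
    """The *_centisecs tunables are expressed in hundredths of a second."""
    return centisecs / 100


def pages_due_for_writeback(dirtied_at, now, max_age_seconds=30.0):
    """Names of pages that have been dirty for longer than max_age_seconds."""
    return sorted(p for p, t in dirtied_at.items() if now - t > max_age_seconds)


def read_vm_tunable(name, default):
    """Read an integer from /proc/sys/vm/<name>, or return default on failure."""
    try:
        with open(f"/proc/sys/vm/{name}") as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return default


if __name__ == "__main__":
    ratio = read_vm_tunable("dirty_background_ratio", 10)
    period = centisecs_to_seconds(read_vm_tunable("dirty_writeback_centisecs", 500))
    print(f"background ratio {ratio}%, periodic write-back every {period} s")
```

Lowering either value makes the kernel flush sooner, which shrinks the window in which a crash can lose data at the cost of more frequent disk writes.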

PS: SSD write amplification

Solid-state drives are now increasingly used as server disks. While designing and implementing a caching system that stores data blocks on SSDs (Solid State Drives), I ran into a number of problems. For example, once the disk was full, evicting the least recently used blocks and continuing to write large amounts of new data made the write speed drop, over time, far below what it was at the start. To find out why, I searched the Internet for information about SSDs. It turned out that the cause lies in the SSD hardware design itself; the effect seen by the application is known as write amplification (WA). WA is an extremely important property of flash memory and SSDs. The term was first proposed and used in public papers in 2008 by Intel and SiliconSystems (acquired by Western Digital in 2009). Below is a brief explanation of why it occurs and how the process works.
An SSD is designed completely differently from a traditional mechanical disk. It is a purely electronic device with none of the read/write heads of a mechanical disk. Because no time is spent moving a head to seek between tracks, an SSD delivers very high IOPS when reading and writing data. Because there are no heads to move, it also consumes less power, which is very useful for enterprises running data centers.
SSDs have a large performance advantage over traditional disks, along with many other strengths, but everything has two sides, and SSDs have problems of their own. Data already written to an SSD cannot be updated in place; it can only be overwritten, and the target location must be erased before it is overwritten. Erasure cannot be performed on a single sector, only on a whole block. Before a block is erased, the valid data it still holds must first be read out and then written back together with the new data. These repeated operations increase the amount of data written, shorten the life of the flash, consume more of the flash's available bandwidth, and indirectly hurt random write performance.
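Write amplification is commonly expressed as the ratio of bytes written to flash to bytes written by the host. The Python sketch below works through the erase-and-rewrite case just described; the function names, the 4096-byte page size, and the 64-page block are illustrative assumptions rather than properties of any particular drive.

```python
def write_amplification(host_bytes, flash_bytes):
    """Ratio of bytes physically written to flash to bytes the host asked to write."""
    if host_bytes <= 0:
        raise ValueError("host_bytes must be positive")
    return flash_bytes / host_bytes


def rewrite_cost(valid_pages_kept, new_pages, page_size=4096):
    """Return (host_bytes, flash_bytes) for erasing and rewriting one block.

    valid_pages_kept: old pages that are still valid and must be written back
    new_pages:        pages of new data supplied by the host
    """
    host_bytes = new_pages * page_size
    flash_bytes = (valid_pages_kept + new_pages) * page_size
    return host_bytes, flash_bytes


if __name__ == "__main__":
    # Updating one page in a block that holds 63 other valid pages.
    host, flash = rewrite_cost(valid_pages_kept=63, new_pages=1)
    print(write_amplification(host, flash))
    # Filling an empty block with 64 new pages.
    host, flash = rewrite_cost(valid_pages_kept=0, new_pages=64)
    print(write_amplification(host, flash))
```

In this model a single-page update inside a full block costs 64 page writes, while filling an empty block costs exactly what the host wrote, which is why small random overwrites on a full drive are the expensive case.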

Write amplification solutions
In practice, SSD write amplification cannot be eliminated entirely; it can only be reduced by various methods. One simple method is to use only part of the capacity of a larger SSD, for example using only 64 GB of a 128 GB drive; even in the worst case, write amplification is then reduced roughly threefold. This method is, of course, rather wasteful. Another option is to write data sequentially: when an SSD is written sequentially, its write amplification is generally 1, although some factors can affect this value.
Beyond these methods, the approach currently regarded as better is TRIM. TRIM operates at the operating system level. The operating system uses the TRIM command to tell the SSD that the data in a given page is no longer needed and can be reclaimed. The main difference between an operating system that supports TRIM and one that does not lies in how a page is deleted. In the hard disk era, deleting a page merely marked it as available in the file system's records; the data itself was not removed. With an SSD and an operating system that supports TRIM, deleting a page also tells the SSD that the page's data is no longer needed. The SSD has an internal garbage collection process that runs when the drive is idle: it gathers the data that is no longer needed and erases it in one go. As a result, new data is always written to pages that have already been erased.
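The effect of TRIM on garbage collection can be illustrated with a toy model in Python. It assumes a single block, identifies pages by number, and counts how many pages the SSD must copy elsewhere before it can erase the block; the function name and the page counts are hypothetical, and a real flash translation layer is far more involved.

```python
def pages_to_copy(written_pages, deleted_by_fs, trim_supported):
    """Pages the SSD must preserve before erasing a block.

    written_pages:  page numbers in the block that hold data
    deleted_by_fs:  page numbers the file system has already deleted
    trim_supported: whether deletions are reported to the SSD
    """
    # Without TRIM the SSD never learns about deletions, so it treats
    # every written page as valid and copies all of them.
    known_invalid = set(deleted_by_fs) if trim_supported else set()
    return len(set(written_pages) - known_invalid)


if __name__ == "__main__":
    block = range(64)    # a fully written block of 64 pages
    deleted = range(48)  # the file system has deleted 48 of them
    print("without TRIM:", pages_to_copy(block, deleted, trim_supported=False))
    print("with TRIM:   ", pages_to_copy(block, deleted, trim_supported=True))
```

Every page that does not have to be copied is a flash write avoided, so in this model TRIM reduces the copy work for the block from 64 pages to 16.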

Although SSDs suffer from write amplification, that is no reason to avoid them. SSDs are already used for cache acceleration in many projects, especially database caching projects, where their efficient read performance is put to full use. With the release of Facebook's open-source Flashcache and its extensive use inside Facebook, Flashcache has become a fairly mature technology, giving more companies reason to choose SSDs for storage or caching.
