The Linux Kernel's File Readahead, in Detail


Disk I/O performance has long lagged far behind that of the CPU and memory, making it a major bottleneck in modern computer systems. Prefetching can effectively reduce the number of disk seeks and the application's I/O wait time, and is one of the important optimizations for disk read I/O performance. The author, a PhD student at the University of Science and Technology of China, began studying Linux in 1998; in order to optimize server performance, he set out to improve the Linux kernel, and eventually rewrote its file readahead code, which was merged into Linux kernel 2.6.23 and subsequent versions.

From registers, L1/L2 cache, and memory, through flash, to disk/CD/tape/storage networks, the various levels of a computer's storage hardware form a pyramid: the lower the level, the larger the capacity, but also the slower the access, i.e. the smaller the bandwidth and the greater the latency. It is therefore natural for each level to act as a cache for the one below it, layer by layer. This gives rise to three basic cache management and optimization problems:

Prefetching (readahead) algorithms, which load data from slow storage into the cache in advance;

Replacement algorithms, which discard unneeded data from the cache;

Writeback algorithms, which save dirty data from the cache back to slow storage.

The prefetching algorithm is especially important at the disk level. The disk's mode of positioning and reading data, with a mechanical actuator over rotating platters, determines its most salient performance characteristics: it is good at sequential reads and writes, bad at random I/O, and its I/O latency is very large. This gives rise to readahead requirements from two directions.

Requirements from disk

Simply put, a typical disk I/O operation consists of two phases:

1. Data positioning

The average positioning time consists mainly of two parts: the average seek time and the average rotational delay. A typical seek time is 4.6ms. The rotational delay depends on the disk's rotational speed: an ordinary 7200RPM desktop hard disk has a rotational delay of 4.2ms, a high-end 10000RPM disk about 3ms. These figures have hovered at the same level for years and will probably not improve much in the future. In what follows, we will use 8ms as a typical positioning time.

2. Data transfer

The sustained transfer rate depends mainly on the disk's rotational speed (linear velocity) and its storage density; a recent typical value is 80MB/s. While rotational speed is difficult to increase, storage density improves every year. A series of new technologies, such as giant magnetoresistance and perpendicular magnetic recording, have not only greatly increased disk capacity but also brought higher sustained transfer rates.

Obviously, the larger the I/O granularity, the larger the proportion of transfer time in the total time, and hence the higher the disk utilization and throughput. A simple estimate is shown in Table 1. With lots of 4KB random I/O, the disk spends over 99% of its time positioning, and the throughput of a single disk is less than 500KB/s; but when the I/O size reaches 1MB, throughput approaches 50MB/s. Using a larger I/O granularity can thus improve disk efficiency and throughput by a full factor of 100. It is therefore worth doing everything possible to avoid small-sized I/O, and that is exactly what the readahead algorithm does.

Table 1 Relationship between random read size and disk performance
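These figures can be reproduced with a back-of-the-envelope calculation. The sketch below (Python, written for this article rather than taken from any kernel code) assumes the 8ms typical positioning time and 80MB/s sustained transfer rate quoted above:

```python
def disk_stats(io_size_kb, seek_ms=8.0, transfer_mb_s=80.0):
    """Estimate disk utilization and throughput for random reads of a given size."""
    transfer_ms = io_size_kb / (transfer_mb_s * 1024) * 1000  # time spent moving data
    total_ms = seek_ms + transfer_ms                          # positioning + transfer
    utilization = transfer_ms / total_ms                      # fraction spent transferring
    throughput_kb_s = io_size_kb / (total_ms / 1000)
    return utilization, throughput_kb_s

for size in (4, 16, 64, 256, 1024):
    util, tput = disk_stats(size)
    print(f"{size:5d}KB  util={util:6.1%}  throughput={tput / 1024:6.2f}MB/s")
```

For 4KB reads this gives about 0.6% utilization and roughly 497KB/s; for 1MB reads, about 61% utilization and roughly 49MB/s, matching the text.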

Requirements from the program

A typical process by which an application handles data looks like this: while (!done) { read(); compute(); }. Assuming the loop repeats 5 times and processes 5 batches of data in total, the sequence diagram of the program's execution may look like Figure 1.

Figure 1 A typical I/O sequence diagram

It is easy to see that the disk and the CPU take turns being busy: while disk I/O is in progress, the CPU is waiting; while the CPU is computing and processing data, the disk is idle. Could the two be pipelined to speed the program up? Readahead can help achieve exactly that. The basic approach: while the CPU starts processing the first batch of data, the kernel's readahead mechanism preloads the next batch. The readahead runs asynchronously in the background, as shown in Figure 2.

Figure 2 Pre-read pipelined operations

Note that we have not changed the application's behavior: the program still issues its next read request only after the current batch of data has been processed. It is just that this time the requested data may already be in the kernel cache and can be copied over directly, without waiting. The role of asynchronous prefetching here is to "hide" the large disk I/O latency from the upper-level application: the latency still exists, but the application no longer sees it, and so runs more smoothly.

The concept of readahead

Prefetching is a broad concept with wide application; it exists at the level of the CPU, the hard disk, the kernel, the application, and the network. There are two approaches to prefetching: heuristic prefetching and informed prefetching. The former makes readahead decisions automatically and is transparent to the upper layers, but places high demands on the algorithm; the latter simply provides an API through which the upper-level program issues explicit prefetch instructions. At the disk level, Linux provides us with three such API interfaces: posix_fadvise(2), readahead(2), and madvise(2).

In practice, however, few applications use these prefetch APIs, because in general the heuristic algorithm in the kernel works well enough. The readahead algorithm predicts the pages about to be accessed and reads them into the cache in batches, ahead of time.
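For the informed variant, the posix_fadvise(2) interface mentioned above can be exercised from Python via os.posix_fadvise (Linux/Unix only). A minimal sketch, using a throwaway temporary file so it is self-contained:

```python
import os
import tempfile

# Create a throwaway 1MB file so the example is self-contained.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * (1 << 20))
os.close(fd)

fd = os.open(path, os.O_RDONLY)
try:
    # Declare sequential access over the whole file (length 0 = to end of file),
    # so the kernel can prefetch aggressively ...
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
    # ... and ask it to start pulling the first 1MB into the pagecache now.
    os.posix_fadvise(fd, 0, 1 << 20, os.POSIX_FADV_WILLNEED)
    data = os.read(fd, 4096)  # likely served from the pagecache
finally:
    os.close(fd)
    os.unlink(path)
```

The hints are advisory: the kernel may prefetch less (or nothing) under memory pressure, which is exactly why they are safe to sprinkle into I/O-heavy code.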

Its main functions and tasks can be summarized in three key phrases:

Batching: aggregate small I/Os into large I/Os, to improve disk utilization and system throughput.

Advance: hide the disk's I/O latency from the application, to speed up program execution.

Prediction: this is the core task of the readahead algorithm, on which the two functions above depend. Current mainstream operating systems, including Linux, FreeBSD, and Solaris, follow a simple and effective principle: divide read patterns into random and sequential, and perform readahead only for sequential reads. This principle is relatively conservative, but it guarantees a high readahead hit rate while keeping efficiency/coverage good, because sequential reads are the simplest and most common, whereas random reads are genuinely unpredictable inside the kernel.

The readahead architecture of Linux

A major feature of the Linux kernel is that it supports more file systems than any other, through its virtual file system (VFS) layer. As early as 2002, during development of the 2.5 kernel, Andrew Morton introduced a basic framework for file readahead into the VFS layer, to serve the various file systems uniformly. The Linux kernel caches recently accessed file pages in memory for a period of time; this file cache is called the pagecache (see Figure 3). An ordinary read() operation takes place between the buffer provided by the application and the pagecache, and the readahead algorithm is responsible for filling the pagecache. The application's read granularity is generally quite small, for example the file copy command cp reads and writes at 4KB granularity; the kernel's readahead algorithm enlarges read I/O to a size it considers more appropriate, such as 16-128KB.

Figure 3 Pagecache-centric reading and readahead

About a year later, Linus Torvalds split the prefetching for mmap page-fault I/O into its own path, creating two independent algorithms: read-around and read-ahead (Figure 4). The read-around algorithm suits program code and data accessed via mmap, which exhibit strong locality of reference: when a page fault occurs, it reads a total of 128KB of pages centered on the current page. The read-ahead algorithm mainly targets read() system calls, which generally show good sequentiality. However, random and atypical read patterns also occur in large numbers, so the readahead algorithm must be intelligent and adaptive.

Figure 4 Read-around, read-ahead and direct read in Linux

Another year later, through a great deal of work by Steven Pratt, Ram Pai, and others, the readahead algorithm was further refined. The most important change was good support for random reads, which occupy a very prominent position in database applications. Before this, the readahead algorithm took discrete page-read positions as its input, so a random read spanning multiple pages could trigger "sequential readahead", increasing the number of read I/Os and lowering the hit rate. The improved algorithm monitors complete read() calls, obtaining both the page offset and the size of each read request, and can thereby distinguish sequential from random reads much better.

Overview of the readahead algorithm

This section takes Linux 2.6.22 as an example and analyzes several key points of its readahead algorithm.

1. Sequential detection

To guarantee the readahead hit rate, Linux performs readahead only for sequential reads. The kernel judges whether a read() is sequential by checking two conditions:

It is the first read after the file was opened, and it reads the beginning of the file;

The current read request and the previous (recorded) read request are at consecutive positions within the file.

If neither condition is satisfied, the read is judged to be random. Any random read terminates the current sequential run, and thereby terminates readahead (rather than merely reducing the readahead size). Note that spatial order here means the offset within the file, not the continuity of physical disk sectors. This is a simplification on Linux's part, and the basic premise of its effectiveness is that files are mostly stored contiguously on disk, without serious fragmentation.
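The decision can be sketched in a few lines of Python (names and structure invented for illustration; the real check lives in the kernel's readahead code and works in units of pages):

```python
def is_sequential(prev_end, req_offset, first_read):
    """Return True if this read() continues a sequential stream.

    prev_end   -- page index just past the previous read (None if no history)
    req_offset -- page index where the current read starts
    first_read -- True for the first read after open()
    """
    # Condition 1: first read after open(), starting at the head of the file.
    if first_read and req_offset == 0:
        return True
    # Condition 2: the read follows directly on from the previous one.
    return prev_end is not None and req_offset == prev_end

assert is_sequential(None, 0, first_read=True)       # open + read file head
assert is_sequential(8, 8, first_read=False)         # contiguous follow-up
assert not is_sequential(16, 100, first_read=False)  # random seek: stream ends
```

Any read for which this returns False terminates the sequential stream, and readahead with it.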

2. Pipelined readahead

While the program is processing one batch of data, we would like the kernel to be preparing the next batch in the background, so that the CPU and the disk can work in a pipeline. Linux uses two readahead windows to track the readahead state of the current sequential stream: the current window and the ahead window. The ahead window exists precisely for pipelining: while the application is working inside the current window, the kernel may be doing asynchronous readahead in the ahead window; as soon as the program enters the ahead window, the kernel advances both windows and issues read I/O in the new ahead window.
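The two-window bookkeeping can be modeled schematically (names invented; a sketch of the behavior described above, not the kernel's actual file_ra_state logic):

```python
class ReadaheadState:
    """Two-window pipeline: the app reads inside `current`, while async
    readahead I/O for `ahead` is already in flight (simplified sketch)."""

    def __init__(self, size):
        self.current = (0, size)        # (start, end) pages already in cache
        self.ahead = (size, 2 * size)   # async readahead in flight

    def on_access(self, page):
        """Advance both windows once the app crosses into the ahead window."""
        if self.ahead[0] <= page < self.ahead[1]:
            size = self.ahead[1] - self.ahead[0]
            self.current = self.ahead
            self.ahead = (self.ahead[1], self.ahead[1] + size)
            return ("async_readahead", *self.ahead)  # new I/O issued
        return None  # still inside the current window: nothing to do

ra = ReadaheadState(8)
assert ra.on_access(3) is None                          # within current window
assert ra.on_access(8) == ("async_readahead", 16, 24)   # entered ahead window
```

As long as the application enters the ahead window only after its I/O has completed, it never blocks, which is exactly the pipelining effect of Figure 2.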

3. Readahead size

Once a sequential readahead is decided on, an appropriate readahead size must be determined. If the readahead granularity is too small, the desired performance gain will not materialize; if it is too large, pages the program never needs may be loaded, wasting resources. To this end, Linux uses a fast window-expansion scheme:

First readahead: readahead_size = read_size * 2 (or * 4)

The initial readahead window is two to four times the read size, which means that using a larger read granularity in your program (such as 32KB) can slightly improve I/O efficiency.

Subsequent readahead: readahead_size *= 2

Subsequent readahead windows double in size until the system's maximum readahead size is reached, 128KB by default. This default has been in service for at least five years and, in the face of today's faster hard disks and larger memories, is far too conservative. For example, the WD Raptor, Western Digital's 10000RPM SATA hard drive of recent years, achieves only 16% disk utilization on 128KB random reads (Figure 5). So if you are running a Linux server or desktop system, try raising the maximum readahead size to 1MB with the following command; there may be a pleasant surprise:

# blockdev --setra 2048 /dev/sda

(blockdev counts in 512-byte sectors, so 2048 sectors = 1MB.) Of course, a bigger readahead size is not always better; in many cases the I/O latency also needs to be taken into account.
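The expansion rules above can be simulated in a few lines (a simplified model of the rules as stated, not the actual kernel code; raising ra_max_kb mimics the blockdev change):

```python
def window_growth(read_size_kb, ra_max_kb=128):
    """Successive readahead window sizes: start at 2x the read size,
    then double, capped at the maximum readahead size."""
    sizes = []
    size = min(read_size_kb * 2, ra_max_kb)
    while True:
        sizes.append(size)
        if size >= ra_max_kb:
            break
        size = min(size * 2, ra_max_kb)
    return sizes

print(window_growth(4))                  # [8, 16, 32, 64, 128]
print(window_growth(4, ra_max_kb=1024))  # doubling continues up to 1MB
```

A stream of 4KB reads thus reaches the full 128KB window after only a handful of requests, which is why the expansion is called "fast".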

Figure 5 Proportions of data positioning time and transfer time for 128KB I/O

Rediscovering sequential reads

The previous section answered the basic questions of readahead: whether, when, and how much to read ahead. Because reality is complicated, these algorithms do not always work, even for sequential reads. One example is the recently discovered problem of retried reads.

Retried reads are common with asynchronous I/O and non-blocking I/O, which allow the kernel to interrupt a read request. As a result, a subsequent read request submitted by the program appears to overlap the previously interrupted one, as shown in Figure 6.

Figure 6 Retried reads

Linux 2.6.22 did not understand this, and misjudged such reads as random. The problem is that a read request does not mean the read actually took place; the basis for readahead decisions should be the latter, not the former. This has been improved in the freshly released 2.6.23. The new algorithm bases its decisions on the state of the page currently being read, and adds a new page flag, PG_readahead, which serves as a hint meaning "please start the next readahead asynchronously". Each time a readahead is performed, the algorithm selects one of the new pages and marks it with this flag. The readahead rules change accordingly:

On reading a missing page, perform synchronous readahead;

On reading a PG_readahead page, perform asynchronous readahead.

As a result, the ahead window is no longer needed: it had in fact tied the readahead size and the asynchronous advance distance together unnecessarily. The new marking mechanism lets us control the advance distance of readahead flexibly and precisely, which will help introduce support for laptop power-saving modes in the future.
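A toy model of the marker-based rules (all names invented for illustration; the real logic lives in the kernel's readahead code, and the marker placement here is an arbitrary choice of one of the newly read pages):

```python
MISSING, CACHED, MARKED = "missing", "cached", "marked"

def on_read(cache, page, ra_size):
    """Apply the two 2.6.23 rules to one accessed page; report the I/O issued."""
    state = cache.get(page, MISSING)
    if state not in (MISSING, MARKED):
        return None  # plain cache hit: no readahead trigger
    # Missing page: read synchronously starting at the faulting page.
    # Marked page: read asynchronously, continuing past the cached pages.
    start = page if state == MISSING else max(cache) + 1
    for p in range(start, start + ra_size):
        cache[p] = CACHED
    cache[start + ra_size // 2] = MARKED  # PG_readahead-style hint page
    return ("sync" if state == MISSING else "async", start, ra_size)

cache = {}
print(on_read(cache, 0, 8))  # missing page -> synchronous readahead
print(on_read(cache, 4, 8))  # marked page  -> asynchronous readahead
```

Note how the advance distance is now simply where the marker is placed, independent of the readahead size, which is exactly the flexibility the text describes.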

Figure 7 Dynamics of the Linux 2.6.23 readahead algorithm

Another increasingly serious problem comes from interleaved reads. This read pattern is common in multimedia and multithreaded applications. When several streams read the same open file simultaneously, their read requests interleave and look, to the kernel, like a lot of random reads. Worse, the current kernel can track the readahead state of only one stream per open file descriptor, so even if the kernel did read ahead for two streams, they would overwrite and destroy each other's readahead state. We will improve this in the upcoming 2.6.24, using state information carried by the page and the pagecache to support interleaved reading by multiple streams.

