Linux Kernel Notes Series: The I/O Block Layer


Under Linux, I/O processing can be divided into four layers:

    1. System call layer: the application uses system calls to specify which file to read or write and at what offset.
    2. File system layer: when a file is written, the buffer is copied from user space to kernel space, and the data is held in the page cache.
    3. Block layer: manages the block device I/O queue, merging and sorting I/O requests.
    4. Device layer: interacts directly with memory via DMA to write the data to disk.

[Figure: the Linux I/O hierarchy]

Writing a file

Writing a file also involves reading: the file is first loaded from disk into memory and held in the page cache, which establishes a mapping between the disk contents and physical memory pages. The write function for writing a file is declared as follows:

    ssize_t write(int fd, const void *buf, size_t count);

Here fd corresponds to the process's file structure, and buf points to the data to be written. The kernel finds the physical pages in the cache that correspond to the region being written; how many pages write touches depends on the offset and size. For example, "echo 1 > a.out" (which calls write underneath) writes at offset 0 of a.out, so write modifies the first page of the corresponding memory.

When write modifies the memory contents, the corresponding memory pages and inode are marked dirty, and write returns. Note that at this point the data has not been written to disk; only the contents of the cache have been modified.
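
To make this concrete, here is a minimal sketch (with a hypothetical file name and abbreviated error handling) of what "echo 1 > a.out" boils down to on the user-space side:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("a.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* write() copies the buffer into the page cache and returns;
         * at this point the data is not yet on disk. */
        if (write(fd, "1\n", 2) < 0)
            perror("write");

        close(fd);
        return 0;
    }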

So when are the memory contents flushed to disk?

Flushing dirty data to disk is the job of the kernel flush threads. flush searches memory for dirty data and writes it to disk according to the configured policy, which can be viewed and set with the sysctl command:

    linux # sysctl -a | grep centi
    vm.dirty_writeback_centisecs = 500
    vm.dirty_expire_centisecs = 3000
    linux # sysctl -a | grep background_ratio
    vm.dirty_background_ratio = 10

The above values are in hundredths of a second. "dirty_writeback_centisecs = 500" means that flush runs every 5 seconds; "dirty_expire_centisecs = 3000" means that dirty data that has resided in memory for more than 30 seconds will be written to disk the next time flush runs; and "dirty_background_ratio = 10" means that once dirty pages exceed 10% of total physical memory, flush is triggered to write the dirty data back to disk.
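
A process does not have to wait for the flush threads: it can push its own dirty pages to disk explicitly. A minimal sketch using fsync (file name hypothetical):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("a.out", O_WRONLY);
        if (fd < 0) { perror("open"); return 1; }

        write(fd, "1\n", 2);    /* dirties a page in the cache */

        /* Block until this file's dirty pages (and metadata) reach
         * disk, without waiting for the periodic writeback. */
        if (fsync(fd) < 0)
            perror("fsync");

        close(fd);
        return 0;
    }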

Once flush has identified the dirty data to write back, which disk sectors does a physical page holding dirty data correspond to?

The correspondence between physical pages and sectors is defined by the file system, which determines how many blocks a memory page (4 KB) maps to. This relationship is fixed when the disk is formatted, and at run time it is held by struct buffer_head:

    linux # cat /proc/slabinfo | grep buffer_head
    buffer_head  12253  12284  104  37  1 : tunables  332  332  0

The file system layer tells the block I/O layer which device and which block to write. After executing the following commands, the read and write requests that the file system layer issues to the block layer can be seen in /var/log/messages:

    linux # echo 1 > /proc/sys/vm/block_dump
    linux # tail -n 3 /var/log/messages
    7 00:50:31 linux-q62c kernel: [ 7523.602144] bash(5466): READ block 1095792 on sda1
    7 00:50:31 linux-q62c kernel: [ 7523.622857] bash(5466): dirtied inode 27874 (tail) on sda1
    7 00:50:31 linux-q62c kernel: [ 7523.623213] tail(5466): READ block 1095824 on sda1

The block I/O layer records each I/O request issued by the file system layer in a struct bio, which mainly stores the physical page information to be flushed to disk and the corresponding sector information on disk.
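
For reference, the fields this note relies on look roughly as follows in 2.6-era kernels. This is a trimmed, self-contained sketch, not the full definition; bi_sector and bi_size are the same fields the SystemTap probe later in this note reads:

    /* Trimmed sketch of the 2.6-era struct bio; many fields omitted,
     * pointer types simplified for illustration. */
    typedef unsigned long long sector_t;

    struct bio_vec {                  /* one physical page segment */
        void          *bv_page;       /* struct page * in the kernel */
        unsigned int   bv_len;        /* bytes in this segment */
        unsigned int   bv_offset;     /* offset within the page */
    };

    struct bio {
        sector_t        bi_sector;    /* first sector on the device */
        void           *bi_bdev;      /* struct block_device * */
        unsigned long   bi_rw;        /* READ or WRITE, plus flags */
        unsigned short  bi_vcnt;      /* number of bio_vecs */
        unsigned int    bi_size;      /* total transfer size, bytes */
        struct bio_vec *bi_io_vec;    /* array of page segments */
        /* ... */
    };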

The block I/O layer maintains an I/O request queue for each disk device; the request queue is represented in the kernel by struct request_queue. Every read or write request passes through the submit_bio function, which places it on the corresponding I/O request queue. The most important job of this layer is to merge and sort I/O requests, which reduces the actual number of disk operations and the seek time, optimizing disk read/write performance.

Using crash to parse a vmcore file, the "dev -d" command shows information about the block device request queues:

    crash> dev -d
    MAJOR  GENDISK             NAME  REQUEST_QUEUE       TOTAL  ASYNC  SYNC  DRV
        8  0xffff880119e85800  sda   0xffff88011a6a6948     10      0     0   10
        8  0xffff880119474800  sdb   0xffff8801195632d0      0      0     0    0
The "struct Request_queue 0xffff88011a6a6948" can be executed to resolve the Request_queue request queue structure corresponding to the SDA device above.

You can view the request queue size of the SDA device by executing the following command:

    linux # cat /sys/block/sda/queue/nr_requests
    128

How I/O requests are merged and sorted is the job of the I/O scheduling algorithm. Linux supports several I/O schedulers, which can be viewed with the following command:

    linux # cat /sys/block/sda/queue/scheduler
    noop anticipatory deadline [cfq]

Another function of the block I/O layer is to collect I/O statistics; running the iostat command shows the statistics this layer provides:

    linux # iostat -x -k -d 1
    Device:  rrqm/s   wrqm/s   r/s   w/s    rkB/s  wkB/s     avgrq-sz  avgqu-sz  await   svctm  %util
    sda      0.00     9915.00  1.00  90.00  4.00   34360.00  755.25    11.79     120.57  6.33   57.60

where rrqm/s and wrqm/s indicate the number of read requests and write requests merged per second, respectively.

The task_io_account_read function counts the amount of data read by each process, and it yields an exact figure for the read volume a process initiates. For writes, however, write returns as soon as the data has been written to the cache, so the kernel cannot attribute an exact write volume to a process. A reading process waits until the buffer is filled, while a writing process returns once the cache is written: reads are synchronous, writes are not necessarily so, and this is the most important difference between read and write.
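
This asymmetry is easy to observe. A sketch (file name and size arbitrary): time how long write itself takes versus the fsync that actually pushes the data to disk. On a typical disk, write returns almost immediately while fsync accounts for the bulk of the time, which is exactly the cache-then-flush behavior described above.

    #define _POSIX_C_SOURCE 199309L
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    static double now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        int fd = open("big.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        size_t len = 64 << 20;            /* 64 MB of data */
        char *buf = malloc(len);
        memset(buf, 'x', len);

        double t0 = now();
        write(fd, buf, len);              /* returns once in the cache */
        double t1 = now();
        fsync(fd);                        /* waits for the disk */
        double t2 = now();

        printf("write: %.3fs  fsync: %.3fs\n", t1 - t0, t2 - t1);
        close(fd);
        free(buf);
        return 0;
    }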

Next down is the device layer, where the device driver pulls I/O requests off the queue; for SCSI devices, the scsi_request_fn function fetches and processes requests. The SCSI layer eventually translates each request into instructions, followed by a DMA (direct memory access) mapping, which maps part of the cache memory for DMA so that the device can bypass the CPU and operate on main memory directly.

After the device layer finishes copying the data from memory to disk, the completion is reported layer by layer back up the stack, and the kernel clears the dirty flag on the pages that were written.

The above is roughly how a disk write is carried out. For a disk read, the kernel first looks for the content in the cache; on a hit, no disk operation is needed. If a process reads a single byte, the kernel does not return just that byte: the cache works in page units (4 KB), so at least one page of data is read. In addition, the kernel reads ahead on disk reads; the following command shows the maximum read-ahead size (in kilobytes):

    linux # cat /sys/block/sda/queue/read_ahead_kb
    512

Let's look at the kernel's read-ahead mechanism through a piece of SystemTap code:

    test.stp:

    probe kernel.function("submit_bio") {
        if (execname() == "dd" && __bio_ino($bio) == 5234)
        {
            printf("inode %d %s on %s %d bytes start %d\n",
                   __bio_ino($bio),
                   bio_rw_str($bio),
                   __bio_devname($bio),
                   $bio->bi_size,
                   $bio->bi_sector)
        }
    }

The code above says: whenever the dd command reads or writes the file with inode number 5234 and passes through the kernel function submit_bio, output the inode number, the operation type (read or write), the file's device name, the read/write size, and the starting sector. Execute the following to install the probe module:

    stap test.stp &

We then use the dd command to read the file with inode number 5234 (a file's inode number can be obtained with the stat command):

    dd if=airport.txt of=/dev/null bs=1 count=10000000

The command intentionally sets bs to 1, reading one byte at a time, so that the kernel's read-ahead behavior can be observed. While this command runs, the following output appears in the terminal:

    inode 5234 R on sda2 16384 bytes start 70474248
    inode 5234 R on sda2 32768 bytes start 70474280
    inode 5234 R on sda2 32768 bytes start 70474352
    inode 5234 R on sda2 131072 bytes start 70474416
    inode 5234 R on sda2 262144 bytes start 70474672
    inode 5234 R on sda2 524288 bytes start 70475184

From this output, the read-ahead size starts at 16384 bytes (16 KB) and grows step by step to 524288 bytes (512 KB); evidently the kernel dynamically adjusts the read-ahead amount according to how the file is being read.
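
Besides the per-device read_ahead_kb knob, a process can hint its access pattern to the kernel's read-ahead logic with posix_fadvise. A sketch (file name borrowed from the dd example above):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("airport.txt", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* Sequential hint: the kernel may enlarge the read-ahead
         * window; POSIX_FADV_RANDOM would disable read-ahead instead. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

        char buf[4096];
        while (read(fd, buf, sizeof(buf)) > 0)
            ;                             /* consume the file */

        close(fd);
        return 0;
    }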

Since both disk reads and disk writes go through the submit_bio function, the underlying implementations of read and write are roughly the same from submit_bio onward.

Direct I/O

When a file is opened by calling open with the O_DIRECT flag, subsequent reads and writes on that file are performed as direct I/O; for raw devices, I/O is also direct I/O.

Direct I/O skips the file system layer, but the block layer still works: memory pages are still mapped to disk sectors, except that it is no longer the cache that is mapped for DMA but the process's own buffer. Direct I/O must read and write in whole multiples of a sector (512 bytes); otherwise the non-aligned portion is read and written through the cache.
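
A minimal sketch of a direct write (file name hypothetical): the buffer address and the I/O size must both be sector-aligned, which is why posix_memalign is used here.

    #define _GNU_SOURCE               /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Buffer address and I/O size must be multiples of the sector
         * size; 4096 bytes is a safe alignment on most setups. */
        void *buf;
        if (posix_memalign(&buf, 4096, 4096)) { perror("posix_memalign"); return 1; }
        memset(buf, 'x', 4096);

        if (write(fd, buf, 4096) < 0)
            perror("write");          /* EINVAL usually means bad alignment */

        close(fd);
        free(buf);
        return 0;
    }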

With direct I/O, a disk write saves the user-to-kernel copy, which improves write efficiency; this is also what direct I/O is for. For reads, the first direct I/O read is faster than going through the cache, but subsequent re-reads are served from the cache in the cached case, so later cached reads beat direct I/O. Some databases use direct I/O and implement their own caching scheme on top.

Asynchronous I/O

Linux has two kinds of asynchronous I/O. One is the aio_read/aio_write library calls, implemented purely in user space and relying on multithreading: the main thread hands I/O off to dedicated I/O threads, which is what makes the main thread asynchronous.
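
A sketch of this library-based variant using glibc's aio_read (link with -lrt; file name hypothetical): the call returns immediately and the main thread polls for completion.

    #include <aio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("airport.txt", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        char buf[4096];
        struct aiocb cb;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof(buf);
        cb.aio_offset = 0;

        aio_read(&cb);                  /* returns immediately */

        /* ... the main thread is free to do other work here ... */

        while (aio_error(&cb) == EINPROGRESS)
            usleep(1000);               /* poll until the I/O completes */

        printf("read %zd bytes\n", aio_return(&cb));
        close(fd);
        return 0;
    }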

The other is io_submit, a system call provided by the kernel. Using io_submit also requires that the file be opened with O_DIRECT and that reads and writes be sector-aligned.
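
A sketch of this kernel-native variant using the libaio wrappers around io_setup/io_submit/io_getevents (link with -laio; file name hypothetical); note the O_DIRECT open and the sector-aligned buffer, as just described:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        posix_memalign(&buf, 4096, 4096);     /* sector-aligned buffer */

        io_context_t ctx = 0;
        io_setup(1, &ctx);                    /* queue depth 1 */

        struct iocb cb, *cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, 4096, 0); /* read 4 KB at offset 0 */
        io_submit(ctx, 1, cbs);               /* hand the request to the kernel */

        struct io_event ev;
        io_getevents(ctx, 1, 1, &ev, NULL);   /* wait for completion */
        printf("res = %ld\n", (long)ev.res);

        io_destroy(ctx);
        close(fd);
        free(buf);
        return 0;
    }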

Reference: Chapter 14, "The Block I/O Layer," Linux Kernel Development, 3rd Edition.
