IO scheduling algorithm and writeback mechanism for Linux block devices

Source: Internet
Author: User

**************************************************************************************

Reference:

"Linux kernel design and implementation"

http://laokaddk.blog.51cto.com/368606/699028/

http://www.cnblogs.com/zhenjing/archive/2012/06/20/linux_writeback.html

**************************************************************************************

1 Linux block IO requests

The smallest addressable unit of a block device is the sector. Sector sizes are generally powers of two, and 512 bytes is by far the most common. The sector size is a physical property of the device and is the fundamental unit of all block devices: a block device cannot address or operate on anything smaller than a sector, although many block devices can transfer several sectors at once. From the software point of view, the smallest logically addressable unit is the block, which is an abstraction of the file system: a file system can only be accessed in units of blocks. Although physical disk addressing is done at the sector level, all disk operations performed by the kernel are done in terms of blocks. Since the sector is the device's smallest addressable unit, a block cannot be smaller than a sector and must be an integer multiple of the sector size. The kernel further requires that the block size be a power of two and not exceed the page size, so the final requirement is that the block size be a power-of-two multiple of the sector size that is no larger than a page. Common block sizes are therefore 512 bytes, 1 KB or 4 KB.

When a block is read into memory it is stored in a buffer, and each buffer corresponds to exactly one block: the buffer is the in-memory representation of a disk block. Because the kernel also needs some control information to manage the data, each buffer is associated with a descriptor, struct buffer_head, called the buffer head, defined in <linux/buffer_head.h>.
This structure acts as a descriptor in the kernel, describing the mapping from buffer to block. Using buffer heads as the unit of I/O has several drawbacks, which are not elaborated here. What matters is that the kernel now uses a new, flexible and lightweight container: the bio structure.
The bio structure is defined in <linux/bio.h> and represents an in-flight (active) block I/O operation as a list of segments. A segment is a small, contiguous memory buffer. With segments, a single buffer no longer has to be contiguous: even if a buffer is scattered over several locations in memory, the bio structure can still describe it and guarantee that the I/O operation is carried out. The bio structure and its individual fields are described below:

struct bio {
        sector_t             bi_sector;         /* associated sector on disk */
        struct bio          *bi_next;           /* list of requests */
        struct block_device *bi_bdev;           /* associated block device */
        unsigned long        bi_flags;          /* status and command flags */
        unsigned long        bi_rw;             /* read or write? */
        unsigned short       bi_vcnt;           /* number of bio_vecs */
        unsigned short       bi_idx;            /* current index into bi_io_vec */
        unsigned short       bi_phys_segments;  /* number of segments after coalescing */
        unsigned short       bi_hw_segments;    /* number of segments after remapping */
        unsigned int         bi_size;           /* I/O count */
        unsigned int         bi_hw_front_size;  /* size of the first mergeable segment */
        unsigned int         bi_hw_back_size;   /* size of the last mergeable segment */
        unsigned int         bi_max_vecs;       /* maximum bio_vecs possible */
        struct bio_vec      *bi_io_vec;         /* bio_vec list */
        bio_end_io_t        *bi_end_io;         /* I/O completion method */
        atomic_t             bi_cnt;            /* usage counter */
        void                *bi_private;        /* owner-private data */
        bio_destructor_t    *bi_destructor;     /* destructor method */
};

The purpose of the bio structure is primarily to represent an in-flight (active) block I/O operation, and most of the fields in the structure exist to manage the related information. The most important fields are bi_io_vec, bi_vcnt and bi_idx; how they relate to one another is explained below.


The struct bio definition was given above; here is struct bio_vec:

struct bio_vec {
        struct page  *bv_page;    /* pointer to the physical page this buffer resides in */
        unsigned int  bv_len;     /* length of this buffer in bytes */
        unsigned int  bv_offset;  /* byte offset within the page where the buffer starts */
};


Putting this together: each block I/O request is represented by one bio structure. Each request covers one or more blocks, which are stored in an array of bio_vec structures. Each bio_vec describes the actual location of a segment within its physical page, and the segments are organized like a vector: bi_io_vec points to the first segment of the I/O operation, the remaining segments follow it in order, and there are bi_vcnt of them in total. As the block I/O layer carries out the request and consumes segments, the bi_idx field is updated so that it always points to the current segment; the block I/O layer uses it to track how far the operation has progressed. An even more important role of bi_idx is that it allows a bio structure to be split, for example by RAID drivers that spread a single request across several devices.
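To make the relationship between these fields concrete, here is a minimal illustrative helper (not taken from the kernel sources; the function name bytes_remaining is made up) that walks the segments of a bio exactly as described above. Real 2.6-era kernel code would normally use the bio_for_each_segment() helper rather than an open-coded loop.

#include <linux/bio.h>

/* Sum the sizes of the segments that have not been processed yet:
 * bi_idx is the current segment, bi_vcnt the total number of segments,
 * and bi_io_vec the array that holds them. */
static unsigned int bytes_remaining(struct bio *bio)
{
        unsigned short i;
        unsigned int bytes = 0;

        for (i = bio->bi_idx; i < bio->bi_vcnt; i++) {
                struct bio_vec *bv = &bio->bi_io_vec[i];
                /* each segment lives in the page bv->bv_page,
                 * starting at byte offset bv->bv_offset */
                bytes += bv->bv_len;
        }
        return bytes;
}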

The bi_cnt field records the usage count of the bio structure; when it drops to zero, the structure is destroyed and its memory freed. The usage count is managed with the following two functions:

void bio_get(struct bio *bio);
void bio_put(struct bio *bio);
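As a rough sketch (not taken from any particular driver) of how the usage count is handled in practice: code that still needs to look at a bio after handing it to the block layer takes an extra reference first and drops it when it is done. bio_get(), bio_put() and submit_bio() are standard 2.6-era kernel interfaces; the WRITE constant comes from <linux/fs.h>, and the function name submit_and_inspect is invented for this example.

#include <linux/bio.h>
#include <linux/fs.h>

static void submit_and_inspect(struct bio *bio)
{
        bio_get(bio);            /* keep the bio alive across submit_bio() */
        submit_bio(WRITE, bio);  /* the block layer may complete and release it at any time */

        /* ... it is still safe to examine the bio here, because we hold a reference ... */

        bio_put(bio);            /* drop our reference; the bio is freed when the count reaches 0 */
}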
The last field is bi_private, a private field belonging to the owner (that is, the creator) of the bio structure; only the owner should read and write it.
Block devices keep their pending block I/O requests in a request queue, represented by the request_queue structure, defined in <linux/blkdev.h>, which contains a doubly linked list of requests and associated control information. Requests are added to the queue by higher-level code in the kernel, such as file systems. As long as the request queue is not empty, the block device driver associated with the queue takes requests from the head of the queue and submits them to the corresponding block device. Each item in the queue is a single request, represented by struct request, also defined in <linux/blkdev.h>. Because a single request may operate on several consecutive disk blocks, each request can be made up of several bio structures. Note that while the blocks on disk must be adjacent, the blocks in memory need not be: each bio structure can describe several segments, and each request can contain several bio structures.
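As a rough sketch of how a 2.6-era block driver drains its request queue, in the style of the sbull example from "Linux Device Drivers": elv_next_request(), blk_fs_request() and end_request() are the old (pre-2.6.31) request-queue interfaces, and my_transfer() is a hypothetical device-specific helper.

#include <linux/blkdev.h>

static void my_transfer(struct request *req);    /* hypothetical, device specific */

static void my_request_fn(struct request_queue *q)
{
        struct request *req;

        while ((req = elv_next_request(q)) != NULL) {
                if (!blk_fs_request(req)) {       /* skip non-filesystem requests */
                        end_request(req, 0);
                        continue;
                }
                /* req->sector, req->current_nr_sectors and req->buffer describe
                   the next contiguous chunk of the (possibly multi-bio) request */
                my_transfer(req);
                end_request(req, 1);              /* report success for this chunk */
        }
}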

Now that we understand block I/O requests, let us turn to I/O scheduling. Every seek operation positions the disk head over a particular block on a specific track. To optimize seeking, the kernel neither accepts requests in raw submission order nor submits them to the disk immediately; instead, it first performs merging and sorting, a pre-processing step that greatly improves the overall performance of the system.

2 Linux kernel block device IO subsystem

The Linux I/O scheduler is the primary component of the block device I/O subsystem and sits between the generic block layer and the block device drivers. When a Linux kernel component reads or writes data, the request is not executed as soon as it is issued; instead it is placed in the request (input) queue and its execution is deferred. Why this design? Because the most important block device Linux has to deal with is the disk. Disk seek time severely limits disk performance, so to improve disk I/O performance the number of disk seeks must be reduced.

The core task of the block device I/O subsystem is therefore to improve the overall performance of block devices. To this end Linux implements four I/O scheduling algorithms. Their basic idea is to merge and sort the requests in the I/O request queue, which greatly reduces the required disk seek time and thereby improves overall I/O performance.

The 2.6 kernel implements four I/O scheduling algorithms: the anticipatory algorithm, the deadline algorithm, the completely fair queuing (CFQ) algorithm and the noop (no operation) algorithm. The I/O scheduler can be chosen when the kernel boots, and it can also be changed at run time through the sysfs file /sys/block/sda/queue/scheduler (reading the file with cat shows the scheduler currently in use; writing a scheduler name to it switches the block device to that scheduler). The default I/O scheduler is the anticipatory scheduler.
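As a concrete illustration, the following user-space snippet is a minimal sketch of that sysfs interaction (it assumes a disk named sda, root privileges, and that the deadline scheduler is available; the choice of "deadline" is only an example). It has the same effect as cat and echo on the command line.

#include <stdio.h>

int main(void)
{
        char line[256];
        FILE *f = fopen("/sys/block/sda/queue/scheduler", "r");

        if (!f) { perror("open scheduler"); return 1; }
        if (fgets(line, sizeof(line), f))
                printf("schedulers: %s", line);   /* the active one is shown in [brackets] */
        fclose(f);

        f = fopen("/sys/block/sda/queue/scheduler", "w");
        if (!f) { perror("open scheduler for writing"); return 1; }
        fputs("deadline\n", f);                   /* same as: echo deadline > .../scheduler */
        fclose(f);
        return 0;
}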

"Noop" algorithm

This is the simplest I/O scheduling algorithm. It only merges requests where appropriate and does not sort them: new requests are usually inserted at the head or tail of the dispatch queue, and the next request to be processed is always the first request in the queue. This algorithm is intended for block devices that do not need to seek, such as SSDs.
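The following user-space toy model (purely illustrative, not kernel code; all names are invented) captures the idea: requests stay in arrival order, an incoming request is merged with the tail only when it is physically adjacent to it, and dispatch always takes the head.

#include <stdio.h>

struct req { unsigned long sector, nr_sectors; };

static struct req queue[16];
static int head, tail, count;

/* Merge with the tail request when adjacent, otherwise append; never sort. */
static void noop_add(unsigned long sector, unsigned long nr)
{
        if (count) {
                struct req *last = &queue[(tail + 15) % 16];
                if (last->sector + last->nr_sectors == sector) {
                        last->nr_sectors += nr;     /* back-merge */
                        return;
                }
        }
        queue[tail] = (struct req){ sector, nr };
        tail = (tail + 1) % 16;
        count++;
}

/* Always dispatch the request at the head of the FIFO. */
static int noop_dispatch(struct req *out)
{
        if (!count)
                return 0;
        *out = queue[head];
        head = (head + 1) % 16;
        count--;
        return 1;
}

int main(void)
{
        struct req r;
        noop_add(100, 8);
        noop_add(108, 8);   /* adjacent: merged into the previous request */
        noop_add(50, 8);    /* not adjacent: appended, never moved ahead */
        while (noop_dispatch(&r))
                printf("dispatch: sector %lu, %lu sectors\n", r.sector, r.nr_sectors);
        return 0;
}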

"CFQ" algorithm

The primary goal of the CFQ (Completely Fair Queuing) algorithm is to ensure a fair allocation of disk I/O bandwidth among all processes that issue I/O requests. To achieve this, the algorithm uses a number of sort queues (64 by default) that store the requests issued by different processes. When a request arrives, the kernel applies a hash function to the thread group identifier (PID) of the current process and inserts the new request at the tail of the resulting queue. Requests issued by the same process therefore usually end up in the same queue.

The algorithm essentially scans the I/O input queues in a round-robin fashion, selects the first non-empty queue, dispatches a fixed (fair) number of requests from it, and moves those requests to the tail of the dispatch queue.
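A toy user-space model of this idea (not the real cfq-iosched code; the hash is deliberately simplistic, and CFQ_QUANTUM is an invented name for the per-queue batch size):

#include <stdio.h>
#include <sys/types.h>

#define CFQ_NR_QUEUES 64
#define CFQ_QUANTUM    4                    /* requests dispatched per queue per scan */

static int pending[CFQ_NR_QUEUES];          /* requests waiting in each sort queue */

static unsigned int cfq_hash(pid_t tgid)
{
        return (unsigned int)tgid % CFQ_NR_QUEUES;   /* the kernel uses a better hash */
}

static void cfq_add_request(pid_t tgid)
{
        pending[cfq_hash(tgid)]++;          /* same process -> same queue */
}

static void cfq_dispatch_round(void)
{
        for (int i = 0; i < CFQ_NR_QUEUES; i++) {
                int take = pending[i] < CFQ_QUANTUM ? pending[i] : CFQ_QUANTUM;
                if (take) {
                        pending[i] -= take;
                        printf("queue %d: dispatch %d request(s)\n", i, take);
                }
        }
}

int main(void)
{
        cfq_add_request(1234);
        cfq_add_request(1234);
        cfq_add_request(5678);
        cfq_dispatch_round();
        return 0;
}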

"Deadline" algorithm

In addition to the dispatch queue, the "deadline" algorithm uses four other queues. Two sort queues contain read requests and write requests respectively, each ordered by starting sector number. The other two deadline queues contain the same read and write requests, but sorted by their "deadlines". These queues exist to avoid request starvation, which occurs because the elevator strategy (like the scheduling algorithms above) favors requests close to the one just serviced and can therefore neglect some request for a very long time. A request's deadline is essentially a timeout timer that starts when the request is passed to the elevator algorithm. By default, the timeout for a read request is 500 ms and the timeout for a write request is 5 s; read requests take precedence over write requests because read requests usually block the process that issues them. The deadline guarantees that the scheduler attends to a request that has been waiting for a long time, even if it sits at the end of the sort queue.


When the algorithm needs to replenish the dispatch queue, it first determines the data direction of the next request. If there are both read and write requests to schedule, the algorithm chooses the "read" direction, unless the "write" direction has already been passed over too many times (in order to avoid starving write requests).

Next, the algorithm checks the deadline queue associated with the chosen direction: if the deadline of the first request in that queue has expired, the algorithm moves that request to the tail of the dispatch queue. It also moves a batch of requests from the sort queue, starting with the request that follows the expired one in sector order. The batch is long if the requests being moved are physically adjacent on disk, and short otherwise.

Finally, if no request has timed out, the algorithm dispatches a batch of requests from the sort queue that follow the last dispatched request in sector order. When the pointer reaches the end of the sort queue, the scan starts over from the beginning (a "one-way elevator" algorithm).
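A toy sketch of this dispatch decision (purely illustrative, not the real deadline-iosched code; WRITES_STARVED and the other names are invented for this example, and the deadlines are treated as absolute timestamps in milliseconds):

#include <stdbool.h>
#include <stdio.h>

#define WRITES_STARVED 2    /* how many times reads may be preferred in a row */

struct ddl_dir {
        bool has_requests;
        long fifo_head_deadline;   /* absolute expiry time (ms) of the oldest request */
};

static int starved;                /* counts how often writes have been passed over */

/* Prefer reads, unless writes have already been passed over too many times. */
static int pick_direction(const struct ddl_dir *rd, const struct ddl_dir *wr)
{
        if (rd->has_requests && (!wr->has_requests || starved < WRITES_STARVED)) {
                starved++;
                return 0;          /* READ */
        }
        starved = 0;
        return 1;                  /* WRITE */
}

static bool fifo_expired(const struct ddl_dir *d, long now)
{
        return d->has_requests && now >= d->fifo_head_deadline;
}

int main(void)
{
        /* example: the oldest read expires at t = 1000 ms, the oldest write at t = 10000 ms */
        struct ddl_dir reads  = { true, 1000 };
        struct ddl_dir writes = { true, 10000 };
        long now = 1500;

        int dir = pick_direction(&reads, &writes);
        const struct ddl_dir *d = dir ? &writes : &reads;

        if (fifo_expired(d, now))
                printf("%s FIFO head expired: move it (and a batch after it) to the dispatch queue\n",
                       dir ? "write" : "read");
        else
                printf("no expiry: keep scanning the %s sort queue one way\n",
                       dir ? "write" : "read");
        return 0;
}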

"expected" algorithm

The "expected" algorithm is one of the most complex 1/o scheduling algorithms provided by Linux. Basically, it is an evolution of the "deadline" algorithm, borrowing the basic mechanism of the "deadline" algorithm: Two deadline queues and two sort queues; the I/O Scheduler interactively scans the sort queue between read and write requests, but prefers to read requests. The scan is basically contiguous unless a request times out. The default time-out for a read request is 125MS, and the default time-out for write requests is 250ms. However, the algorithm also follows some additional heuristic guidelines:

In some cases, the algorithm may pick a request located before the current position in the sort queue, forcing the head to seek backwards. This typically happens when the backward seek distance to that request is less than half the forward seek distance to the request after the current position in the sort queue.

The algorithm collects statistics about the I/O operations issued by each process in the system. Right after dispatching a read request from some process P, the algorithm checks whether the next request in the sort queue comes from the same process P. If so, the next request is dispatched immediately. Otherwise, it examines the statistics for process P: if they suggest that P is likely to issue another read request soon, the algorithm stalls for a short period (about 7 ms by default). The algorithm is thus anticipating that the read request P is about to issue may be a "near neighbor" on disk of the request just dispatched.
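A toy sketch of this anticipation heuristic (purely illustrative, not the real as-iosched code; ANTIC_EXPIRE_MS and likely_to_read_again() are invented stand-ins for the kernel's anticipation window and per-process statistics):

#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>

#define ANTIC_EXPIRE_MS 7     /* default anticipation window (roughly 7 ms) */

static bool likely_to_read_again(pid_t pid)
{
        (void)pid;            /* placeholder for the per-process statistics the kernel keeps */
        return true;
}

static void after_read_dispatch(pid_t just_served, pid_t next_in_queue, bool queue_empty)
{
        if (!queue_empty && next_in_queue == just_served)
                printf("next request is from the same process: dispatch it immediately\n");
        else if (likely_to_read_again(just_served))
                printf("anticipate: idle up to %d ms waiting for process %d\n",
                       ANTIC_EXPIRE_MS, (int)just_served);
        else
                printf("no anticipation: move on to the next request\n");
}

int main(void)
{
        after_read_dispatch(1234, 5678, false);
        return 0;
}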

3 Linux write-back mechanism

No matter how well the block device scheduling algorithm is optimized, it cannot solve the fundamental mismatch between disk I/O speed and CPU speed, which is why Linux introduces the page cache. The page cache was originally designed for memory management, and in the 2.6 kernel all page-based data management was folded into it, so the I/O buffers of block devices are also part of the page cache. These internals do not concern user space; they are what kernel developers need to care about. What application developers do need to know is that all file I/O goes through the cache. For a read, real I/O is needed only when the data is not already cached. For a write, the kernel copies the data into the cache and the write system call returns immediately; the kernel later writes the dirty cache pages back with a dedicated writeback process. In other words, reads and writes are treated differently: "synchronous read, asynchronous write"!
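The "synchronous read, asynchronous write" behaviour can be seen from user space with a minimal sketch like the one below (the path /tmp/writeback-demo is just an example): write() returns as soon as the data has been copied into the page cache, and only fsync() (or sync) forces the dirty pages out to disk.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char buf[] = "hello, page cache\n";
        int fd = open("/tmp/writeback-demo", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0) { perror("open"); return 1; }

        if (write(fd, buf, strlen(buf)) < 0)   /* returns once the data is in the cache */
                perror("write");

        if (fsync(fd) < 0)                     /* blocks until the data is on disk */
                perror("fsync");

        close(fd);
        return 0;
}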

Writeback in Linux is performed by a dedicated process following a specific policy. In kernels before 2.6.32, the pdflush background writeback routines were responsible for this work. When does writeback happen? Dirty pages are written to disk in the following two cases:

1. When free memory falls below a specific threshold, the kernel must write dirty pages back to disk in order to free memory.
2. When dirty pages have been resident in memory longer than a specific threshold, the kernel must write the timed-out dirty pages back to disk to ensure that dirty pages do not stay in memory indefinitely.

Once writeback begins, pdflush keeps writing data until the following two conditions are both met (a small sketch follows the list):

1. At least the specified minimum number of pages has been written back to disk.
2. The amount of dirty memory has fallen below the dirty_background_ratio threshold.
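A purely illustrative user-space sketch of this stop condition (not the real pdflush code; DIRTY_BACKGROUND_PAGES and the helper functions merely stand in for the kernel's dirty_background_ratio check and its writeback I/O):

#include <stdio.h>

#define DIRTY_BACKGROUND_PAGES 100   /* stand-in for the dirty_background_ratio threshold */

static int nr_dirty = 500;           /* simulated number of dirty pages */

static void write_one_page(void)     /* stand-in for issuing one page of writeback I/O */
{
        nr_dirty--;
}

static void background_writeback(int min_pages)
{
        int written = 0;

        /* keep writing until the minimum has been written AND dirty memory
         * has dropped below the background threshold */
        while (written < min_pages || nr_dirty > DIRTY_BACKGROUND_PAGES) {
                if (nr_dirty == 0)
                        break;
                write_one_page();
                written++;
        }
        printf("wrote %d pages, %d still dirty\n", written, nr_dirty);
}

int main(void)
{
        background_writeback(64);
        return 0;
}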

System administrators can set the writeback-related parameters under /proc/sys/vm, or via the sysctl interface. The main tunables are dirty_background_ratio, dirty_ratio, dirty_expire_centisecs and dirty_writeback_centisecs.

The writeback mechanism looks perfect, but it is not always so in practice. One problem is that, although the number of pdflush threads is variable (2 to 8), they collectively serve all block devices; when one block device is slow, it can hold up writeback for the other block devices. For this reason 2.6.32 introduced a new per-device writeback threading model in place of pdflush: each backing device gets its own flusher thread, so writeback to different devices no longer interferes.

Writeback defects

This seemingly perfect writeback mechanism has two defects in practice: 1) data that has not yet been written back is lost if the system crashes, unless the application forces it out in time with sync/fsync; 2) read I/O performance is very poor while writeback is in progress, especially with large data volumes.

Kernel: Linux 2.6.16
Memory: 32 GB
Test procedure: data is continuously appended to the disk at a rate limited to roughly 10-20 MB/s.
Measured curve: (figure omitted)

Conclusion: disk I/O is heavily consumed during pdflush writeback, which severely affects read performance.


