IO scheduling algorithm and writeback mechanism for Linux block devices


**************************************************************************************

References:

"Linux kernel design and implementation"

http://laokaddk.blog.51cto.com/368606/699028/

http://www.cnblogs.com/zhenjing/archive/2012/06/20/linux_writeback.html

**************************************************************************************

1 Linux block IO requests

The smallest addressable unit on a block device is the sector. Sector sizes are usually a power of two, the most common being 512 bytes. The sector size is a physical property of the device, and the sector is the basic unit of all block devices: a block device cannot be addressed or operated on in units smaller than a sector, although many block devices can transfer several sectors at once. From a software perspective, the smallest logically addressable unit is the block, which is an abstraction of the file system: the file system can only be accessed in units of blocks. Although physical disk addressing is done at the sector level, all disk operations performed by the kernel are done in terms of blocks.

As noted above, the sector is the smallest addressable unit of the device, so a block cannot be smaller than a sector and must be an integer multiple of the sector size.

In addition, the kernel requires that the block size be a power of two and not exceed the length of a page. The final requirement is therefore that the block size be a power-of-two multiple of the sector size and no larger than the page size. Typical block sizes are 512 bytes, 1 KB, or 4 KB.
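As a small illustration of these rules, here is a hedged userspace sketch (not from the article) that checks whether a candidate block size is valid; the 512-byte sector and 4 KB page values are assumptions for a typical x86 system.

#include <stdbool.h>

#define SECTOR_SIZE 512
#define MY_PAGE_SIZE 4096

static bool is_valid_block_size(unsigned int blk)
{
        return blk >= SECTOR_SIZE &&           /* not smaller than a sector            */
               blk <= MY_PAGE_SIZE &&          /* not larger than a page               */
               (blk & (blk - 1)) == 0 &&       /* a power of two ...                   */
               blk % SECTOR_SIZE == 0;         /* ... and a multiple of the sector size */
}

/* is_valid_block_size(512), (1024), (4096) -> true; (768), (8192) -> false */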

When a block is read into memory, it is stored in a buffer. Each buffer corresponds to one block and is the in-memory representation of that disk block.

Also, because the kernel needs some control information to manage the data, each buffer has a descriptor called the buffer head (struct buffer_head), defined in <linux/buffer_head.h>.


This structure acts as a descriptor within the kernel: it describes the mapping from a buffer to a block.

Using the buffer head as the unit of I/O operations has its drawbacks; we will not dwell on them here. What matters is that today's kernel uses a new, flexible, and lightweight container: the bio structure.
The bio structure is defined in <linux/bio.h> and represents an in-flight (active) block I/O operation as a list of segments. A segment is a small contiguous area of memory. With segments, a single buffer no longer has to be contiguous: even if a buffer is scattered across several locations in memory, the bio structure can still describe the I/O operation. The bio structure and its fields are shown below:

struct bio {
        sector_t              bi_sector;          /* associated sector on disk */
        struct bio            *bi_next;           /* list of requests */
        struct block_device   *bi_bdev;           /* associated block device */
        unsigned long         bi_flags;           /* status and command flags */
        unsigned long         bi_rw;              /* read or write? */
        unsigned short        bi_vcnt;            /* number of bio_vecs */
        unsigned short        bi_idx;             /* current index in bi_io_vec */
        unsigned short        bi_phys_segments;   /* number of segments after coalescing */
        unsigned short        bi_hw_segments;     /* number of segments after remapping */
        unsigned int          bi_size;            /* I/O count */
        unsigned int          bi_hw_front_size;   /* size of the first mergeable segment */
        unsigned int          bi_hw_back_size;    /* size of the last mergeable segment */
        unsigned int          bi_max_vecs;        /* maximum bio_vecs possible */
        struct bio_vec        *bi_io_vec;         /* bio_vec list */
        bio_end_io_t          *bi_end_io;         /* I/O completion method */
        atomic_t              bi_cnt;             /* usage counter */
        void                  *bi_private;        /* owner-private data */
        bio_destructor_t      *bi_destructor;     /* destructor method */
};
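To make the fields concrete, here is a minimal, hedged sketch (not from the article) of how kernel code in the 2.6 era might build and submit a bio. The device, page, and completion callback are placeholders, and the exact signature of bi_end_io varies between 2.6 versions; the two-argument form below matches later 2.6 kernels.

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/fs.h>

static void my_end_io(struct bio *bio, int error)
{
        /* called by the block layer when the I/O completes */
        bio_put(bio);          /* drop the reference taken by bio_alloc() */
}

static int write_one_page(struct block_device *bdev, struct page *page,
                          sector_t sector)
{
        struct bio *bio = bio_alloc(GFP_KERNEL, 1);   /* room for one bio_vec */

        if (!bio)
                return -ENOMEM;

        bio->bi_bdev    = bdev;       /* target block device  */
        bio->bi_sector  = sector;     /* starting sector      */
        bio->bi_end_io  = my_end_io;  /* completion method    */
        bio->bi_private = NULL;       /* owner-private data   */

        bio_add_page(bio, page, PAGE_SIZE, 0);  /* one full-page segment   */
        submit_bio(WRITE, bio);                 /* hand it to the block layer */
        return 0;
}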

The main purpose of the bio structure is to represent an in-flight I/O operation, so most of its fields exist for bookkeeping. The most important fields are bi_io_vec, bi_vcnt, and bi_idx.


The layout of struct bio was given above; struct bio_vec is described next:

struct bio_vec {
        struct page    *bv_page;    /* pointer to the physical page on which this buffer resides */
        unsigned int   bv_len;      /* the length in bytes of this buffer */
        unsigned int   bv_offset;   /* the byte offset within the page where the buffer resides */
};


Putting these structures together: each block I/O request is represented by one bio structure.

Each request consists of one or more blocks, which are stored in an array of bio_vec structures. Each bio_vec describes the actual position of a segment within a physical page, and the segments are organized like a vector: bi_io_vec points to the first segment of the I/O operation, the remaining segments follow in order, and there are bi_vcnt segments in total. As the block I/O layer carries out the request and works through the individual segments, it keeps updating bi_idx so that it always points to the current segment. bi_idx indexes the current bio_vec in the array, and through it the block I/O layer tracks how far the block I/O operation has progressed.
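The following hedged sketch (illustrative, not the article's code) shows how a driver might walk the bio_vec array using exactly the fields described above; the function name is hypothetical.

#include <linux/bio.h>
#include <linux/kernel.h>

static void walk_bio_segments(struct bio *bio)
{
        unsigned short i;

        /* start at bi_idx: segments before it have already been handled */
        for (i = bio->bi_idx; i < bio->bi_vcnt; i++) {
                struct bio_vec *bvec = &bio->bi_io_vec[i];

                /* each segment is bv_len bytes starting at bv_offset inside bv_page */
                printk(KERN_DEBUG "segment %u: page %p, offset %u, len %u\n",
                       i, bvec->bv_page, bvec->bv_offset, bvec->bv_len);
        }
}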

Two other fields of the bio structure are also important. The bi_cnt field records the usage count of the bio structure; when it drops to 0, the bio structure is destroyed and the memory it occupies is freed.

Use the following two functions to manage the usage count:

void bio_get(struct bio *bio);
void bio_put(struct bio *bio);
The last field is bi_private, a private field belonging to the owner of the structure: whoever creates the bio structure may read and write this field.
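A hedged sketch of how bi_cnt and bi_private are typically used (the context structure and function name here are hypothetical): take an extra reference with bio_get() so the bio stays valid while the submitter still needs it, then release it with bio_put().

#include <linux/bio.h>
#include <linux/fs.h>

struct my_ctx { int id; };

static void submit_tracked(struct bio *bio, struct my_ctx *ctx)
{
        bio->bi_private = ctx;   /* we created the bio, so this field is ours */
        bio_get(bio);            /* usage count goes up: bio cannot be freed yet */
        submit_bio(READ, bio);
        /* ... it is safe to inspect the bio here, even if the I/O has completed ... */
        bio_put(bio);            /* drop our reference; freed when the count reaches 0 */
}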
Block devices keep their pending block I/O requests in a request queue, represented by the request_queue structure defined in <linux/blkdev.h>, which contains a doubly linked list of requests along with the related control information. Requests are added to the queue by higher-level code in the kernel, such as file systems. As long as the request queue is not empty, the block device driver associated with the queue takes requests from the head of the queue and submits them to the corresponding block device. Each item in the queue's list is a single request, represented by struct request, also defined in <linux/blkdev.h>. Because a request may operate on multiple consecutive disk blocks, a request can be made up of multiple bio structures. Note that although the blocks on disk must be adjacent, the blocks in memory need not be: each bio structure can describe multiple segments, and each request can contain multiple bio structures.
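For illustration, here is a hedged sketch of the classic 2.6-era driver request function that drains such a queue; the driver name is hypothetical, and helpers such as elv_next_request(), blk_fs_request(), and end_request() belong to older 2.6 kernels (they were replaced in later releases).

#include <linux/blkdev.h>

static void sbd_request(struct request_queue *q)
{
        struct request *req;

        while ((req = elv_next_request(q)) != NULL) {
                if (!blk_fs_request(req)) {       /* skip non-filesystem requests */
                        end_request(req, 0);
                        continue;
                }
                /* transfer req->current_nr_sectors sectors starting at req->sector
                 * to or from the device here */
                end_request(req, 1);              /* report success */
        }
}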


Good. Now that block I/O requests are clear, the next topic is I/O scheduling. Every addressing operation positions the disk head over a specific block. To optimize these seek operations, the kernel neither handles requests simply in the order it receives them nor submits them to disk immediately; instead it first performs pre-processing called merging and sorting, which greatly improves the overall performance of the system.

2 Linux kernel block device IO subsystem

The Linux I/O scheduler is the main component of the block device I/O subsystem and sits between the generic block layer and the block device drivers. When a Linux kernel component reads or writes data, the request is not carried out as soon as it is issued; instead it is placed in the request queue and the operation is deferred. Why is it designed this way? Because the most important block device Linux has to deal with is the disk.

Disk seek time severely limits disk performance. To improve disk I/O performance, the number of disk seeks must be reduced as much as possible.

The core task of the block device I/O subsystem is to improve the overall performance of block devices. To this end Linux implements four I/O scheduling algorithms, whose basic idea is to merge and sort the requests in the I/O request queue so as to greatly reduce the required disk seek time and thereby improve overall I/O performance.

The 2.6 kernel implements four I/O scheduling algorithms: the anticipatory algorithm, the deadline algorithm, the completely fair queuing (CFQ) algorithm, and the NOOP (no operation) algorithm. The user can specify an I/O scheduling algorithm at kernel boot time, or change a block device's I/O scheduling algorithm at run time through the sysfs file /sys/block/sda/queue/scheduler (cat the file to see which scheduler is currently in use). The default I/O scheduler is the anticipatory scheduler.
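The sysfs file can be read and written from user space; the sketch below is the C equivalent of "cat /sys/block/sda/queue/scheduler" followed by "echo deadline > /sys/block/sda/queue/scheduler". It assumes the device sda exists and that the write is done with root privileges.

#include <stdio.h>

int main(void)
{
        char line[256];
        FILE *f = fopen("/sys/block/sda/queue/scheduler", "r");

        if (!f)
                return 1;
        if (fgets(line, sizeof(line), f))
                printf("current: %s", line);   /* the active scheduler is shown in brackets */
        fclose(f);

        f = fopen("/sys/block/sda/queue/scheduler", "w");
        if (!f)
                return 1;
        fputs("deadline\n", f);                /* switch this device to the deadline scheduler */
        fclose(f);
        return 0;
}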

"Noop" algorithm

The simplest I/O scheduling algorithm.

The algorithm only merges adjacent requests where appropriate; it does not sort them. New requests are usually inserted at the beginning or end of the dispatch queue, and the next request to be processed is always the first request in the queue. Such an algorithm is meant for block devices that do not need to seek, such as SSDs.

"CFQ" algorithm

The main goal of the CFQ (completely fair queuing) algorithm is to ensure a fair allocation of disk I/O bandwidth among all the processes that issue I/O requests. To achieve this, the algorithm uses a number of sort queues (64 by default), which store the requests issued by different processes. When the algorithm handles a request, the kernel calls a hash function on the thread group identifier (PID) of the current process to select a queue, and the new request is inserted at the tail of that queue. Requests issued by the same process therefore usually end up in the same queue.

The algorithm essentially scans the I/O input queues in round-robin fashion, selects the first non-empty queue, dispatches a fixed number of requests from it (so the different queues are treated fairly), and moves those requests to the end of the dispatch queue.
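A toy illustration of the idea just described (not CFQ's actual code, and the hash function here is an arbitrary choice): hash the issuing process's thread group identifier into one of the 64 queues, so requests from the same process land in the same queue.

#define CFQ_NR_QUEUES 64

static unsigned int cfq_queue_index(unsigned int tgid)
{
        /* any reasonable hash works; CFQ's real hash differs from this one */
        return (tgid * 2654435761u) % CFQ_NR_QUEUES;
}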

"Deadline" algorithm

In addition to the dispatch queue, the "Deadline" algorithm uses four other queues. Two sorted queues hold read requests and write requests respectively, ordered by starting sector number. The other two deadline queues hold the same read and write requests, but ordered by their deadlines. These queues were introduced to avoid request starvation: because the elevator strategy (the sorted scanning just described) favors requests close to the one most recently serviced, a given request could otherwise be ignored for a very long time, and that is when starvation occurs.

The deadline of a request is essentially a timeout timer that starts when the request is handed to the elevator. By default the timeout is 500 ms for read requests and 5 s for write requests; read requests take precedence over write requests because read requests usually block the process that issued them. The deadline guarantees that the scheduler attends to a request that has been waiting a very long time, even if it sits at the far end of the sorted queue.


When the algorithm needs to refill the dispatch queue, it first determines the data direction of the next request.

If both read and write requests need to be dispatched at the same time, the algorithm chooses the "read" direction, unless the "write" direction has already been skipped too many times (to avoid starving write requests).

Next, the algorithm examines the deadline queue for the chosen direction. If the deadline of the first request in that queue has expired, the algorithm moves that request to the end of the dispatch queue, and it also moves a group of requests from the sorted queue that follow the expired request on disk. The group is long if the requests being moved are physically adjacent on disk, and short otherwise.

Finally, if no request has expired, the algorithm dispatches the group of requests from the sorted queue that follow the last one dispatched. When the pointer reaches the end of the sorted queue, the scan starts again from the beginning (a "one-way elevator").
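The decision described above can be summarized by the following illustrative pseudologic. It is not the kernel's actual deadline scheduler code; the types and helper are hypothetical and stand in for the real queue operations.

/* 1. choose the data direction: prefer reads unless writes are being starved   */
/* 2. if the head of that direction's deadline FIFO has expired, serve it       */
/*    (together with a batch of nearby requests from the sorted queue)          */
/* 3. otherwise keep scanning the sorted queue from the last served position    */

struct dl_request { unsigned long deadline; /* jiffies-style expiry time */ };

static const struct dl_request *
deadline_next(const struct dl_request *fifo_head,
              const struct dl_request *sorted_next,
              unsigned long now)
{
        if (fifo_head && fifo_head->deadline <= now)
                return fifo_head;    /* an expired request jumps the queue */
        return sorted_next;          /* normal elevator order otherwise    */
}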

"expected" algorithm

The "expected" algorithm is one of the most complex 1/o scheduling algorithms provided by Linux.

Basically, it is an evolution of the deadline algorithm and borrows its basic machinery: two deadline queues and two sorted queues. The I/O scheduler alternately scans the read and write sorted queues, with a preference for read requests.

The scan is essentially continuous unless some request times out. The default timeout is 125 ms for read requests and 250 ms for write requests.

However, the algorithm also follows some additional heuristic guidelines:

In some cases the algorithm may select a request located behind the current position in the sorted queue, forcing the disk head to seek backward.

This typically happens when the backward seek distance for that request is less than half the seek distance of the request that follows the current position in the sorted queue.

The algorithm collects statistics about the kinds of I/O operations issued by each process in the system. Immediately after a read request issued by some process P has been dispatched, the algorithm checks whether the next request in the sorted queue comes from the same process P.

If it does, the next request is dispatched immediately. Otherwise, the algorithm looks at the statistics for process P: if they suggest that process P will issue another read request soon, the algorithm stalls for a short period of time (about 7 ms by default).

In this way the algorithm anticipates that the read request process P issues next may be a "near neighbor" on disk of the request just dispatched.
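The anticipation heuristic can be sketched as follows. This is a toy illustration with hypothetical names, not the kernel's as-iosched code; the 7 ms value is the default mentioned above.

#define ANTIC_EXPIRE_MS 7

static int should_anticipate(int next_req_from_same_process,
                             int process_stats_suggest_nearby_read)
{
        if (next_req_from_same_process)
                return 0;   /* dispatch the next request immediately, no need to wait */
        /* otherwise stall for up to ANTIC_EXPIRE_MS if another nearby read is likely */
        return process_stats_suggest_nearby_read;
}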

3 Linux write-back mechanism

No matter how much the block device scheduling algorithm is optimized, it cannot solve the severe speed mismatch between disk I/O and the CPU. For this reason Linux introduced the page cache. The page cache was originally designed for memory management; in the 2.6 kernel, all kinds of page-based data management were folded into the page cache, so the I/O buffers of block devices are also part of the page cache. These details are invisible to user space and mainly concern kernel developers. What application developers need to know is that all file I/O goes through the read/write cache.

For reads, actual disk I/O is needed only when the data is not already in the cache.

For writes, disk I/O is still required eventually, but the kernel first writes the data into the cache and the write system call returns immediately; dedicated writeback threads later write the dirty cache pages back to disk in a unified way.

In other words, the kernel treats reads and writes differently: "synchronous read, asynchronous write."
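A userspace sketch of "synchronous read, asynchronous write": write() returns as soon as the data is in the page cache, while fsync() blocks until the dirty pages reach the disk. The file name is an arbitrary example.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char buf[] = "hello, page cache\n";
        int fd = open("/tmp/writeback-demo", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
                return 1;

        write(fd, buf, strlen(buf)); /* returns once the data is in the page cache */
        fsync(fd);                   /* blocks until the dirty pages reach the disk */
        close(fd);
        return 0;
}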

Linux writeback is carried out by dedicated threads according to a specific policy. In kernels before 2.6.32, the pdflush background writeback routines are responsible for this work. When does writeback happen? Dirty pages are written to disk in the following two cases:

1. When free memory falls below a specific threshold, the kernel must write dirty pages back to disk in order to reclaim memory.


2. When dirty pages have resided in memory longer than a specific threshold, the kernel must write the timed-out dirty pages back to disk, to ensure that dirty pages do not stay in memory indefinitely.

Once writeback begins, pdflush keeps writing data until both of the following conditions are met:

1. The specified minimum number of pages has been written back to disk.
2. The amount of free memory has risen back above the dirty_background_ratio threshold.

System administrators can set the writeback-related parameters under /proc/sys/vm, either directly or through the sysctl interface. The main tunables are:

dirty_background_ratio     Percentage of total memory at which the background writeback threads start writing out dirty data
dirty_ratio                Percentage of total memory at which a process generating dirty data starts writing it out itself
dirty_expire_centisecs     Age, in hundredths of a second, at which dirty data is considered old enough to be written out
dirty_writeback_centisecs  Interval, in hundredths of a second, at which the writeback threads wake up to write data to disk
laptop_mode                Boolean that enables laptop mode (batching writeback to minimize disk spin-ups)

The writeback mechanism looks fine on paper, but in practice it does not always work out that way.

One problem is that although the number of pdflush threads is variable (2 to 8), they serve all block devices in the system. When one block device is very slow, it inevitably blocks writeback for the other devices.

For this reason, 2.6.32 introduced a new writeback threading model: each block device gets its own flusher thread, so writeback on different devices no longer interferes with each other.

Writeback defects

As good as the writeback mechanism looks, it has two defects in practice: 1) if writeback is not triggered in time, data can be lost (unless sync/fsync is used); 2) read I/O performance is very poor while writeback is in progress, especially with large volumes of data.

Kernel: Linux 2.6.16
Memory: 32 GB
Test procedure: data is continuously appended to the disk at a limited rate of roughly 10-20 MB/s.
Measured curve: (not reproduced here)

Conclusion: disk I/O is heavily consumed during pdflush writeback, which severely affects read performance.



