Linux IO Scheduler
Each block device (or partition of a block device) has its own request queue (request_queue), and each request queue can select an I/O scheduler to coordinate the requests submitted to it. The basic purpose of the I/O scheduler is to arrange requests according to the sectors they correspond to on the block device, reducing head movement and improving efficiency. Requests in a device's request queue are serviced in order. In addition to this queue, each scheduler maintains some number of internal queues of its own, which are used to process submitted requests; the request at the front of an internal queue is moved onto the request queue in time to wait for service.
In the kernel storage stack, the IO scheduler sits between the generic block layer, which receives requests from the file systems above it, and the block device driver below.
The kernel implements four main IO schedulers: noop, deadline, anticipatory, and cfq.
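As a quick way to see which of these schedulers each device on a system is using, the sketch below prints the scheduler line (the active one appears in brackets) for every block device. The function name and the SYSFS override are illustrative additions; the override only exists to make the function testable against a fake sysfs tree.

```shell
# Print "<device>: <scheduler line>" for each block device.
SYSFS="${SYSFS:-/sys}"

list_schedulers() {
    for f in "$SYSFS"/block/*/queue/scheduler; do
        [ -r "$f" ] || continue           # skip if the glob matched nothing
        dev=${f#"$SYSFS"/block/}          # strip the leading path ...
        dev=${dev%%/*}                    # ... leaving only the device name
        printf '%s: %s\n' "$dev" "$(cat "$f")"
    done
}

list_schedulers
```

On a RHEL6-era kernel this typically prints lines like `sda: noop anticipatory deadline [cfq]`.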
1. The noop algorithm
The noop scheduling algorithm is the simplest IO scheduler in the kernel. It is essentially a single elevator queue: IO requests are placed into a FIFO queue and executed in order, although noop will still perform appropriate merging of requests that are contiguous on disk. This scheduler is especially suitable for applications that do not want the scheduler to reorder their IO requests.
This scheduler has clear advantages in the following scenarios:
1) There is a smarter IO-scheduling device below the IO scheduler. If your block device is a RAID controller, or a storage device such as a SAN or NAS, the device itself will organize IO requests better, and there is no need for the kernel's IO scheduler to perform additional scheduling work;
2) The upper-level application understands the underlying device better than the IO scheduler does, or the IO requests reaching the scheduler have already been carefully optimized by the application. In that case the IO scheduler need not do redundant work and should simply execute the requests in the order they arrive;
3) For storage devices without rotating heads, noop is often more effective. The request reordering done by the IO scheduler costs CPU time that only pays off on disks with moving heads; on SSDs this CPU time can be saved, since the drive itself provides smart request scheduling and does not need the kernel to duplicate the work. One of the articles referenced below notes that using noop on SSDs gives better results.
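A small sketch of scenario 3: switch a device to noop only if the kernel reports it as non-rotational. The function name and the SYSFS override are illustrative (the override exists only to make the function testable, not part of the original article), and writing to the scheduler file requires root.

```shell
# Set the noop scheduler on a device, but only if queue/rotational
# reports 0 (i.e. the device has no spinning heads, such as an SSD).
SYSFS="${SYSFS:-/sys}"

set_noop_if_nonrotational() {
    dev="$1"
    rot="$SYSFS/block/$dev/queue/rotational"
    sched="$SYSFS/block/$dev/queue/scheduler"
    if [ -r "$rot" ] && [ "$(cat "$rot")" = "0" ]; then
        echo noop > "$sched"          # requires root on a real /sys
        echo "$dev: set to noop"
    else
        echo "$dev: rotational or unknown, unchanged"
    fi
}
```

For example, `set_noop_if_nonrotational sdb` would switch sdb to noop only when sdb is an SSD.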
2. The deadline algorithm
The core of the deadline algorithm is to guarantee that every IO request is serviced within a certain time, avoiding starvation of any request.
The deadline algorithm introduces four queues, which fall into two categories, each category containing one read queue and one write queue. One category orders requests by starting sector number and is organized as a red-black tree, called sort_list; the other orders requests by generation time and is organized as a linked list, called fifo_list. Each time the scheduler chooses a direction (read or write), a batch of consecutive requests is dispatched from the corresponding sort_list to the request_queue; the batch size is determined by fifo_batch. Only the following three cases end a batch transfer:
1) There are no more requests in the corresponding sort_list.
2) The next request's sector is not contiguous with the previous one (it does not satisfy the sector-increment condition).
3) The previous request was already the last request of the batch.
Every request is assigned an expiry value (in jiffies) when it is generated, and requests are sorted by that deadline in the fifo_list. The deadline of a read request defaults to 500 ms and that of a write request to 5 s, so the kernel clearly favours reads. Beyond that, the deadline scheduler also defines starved and writes_starved. writes_starved defaults to 2 and can be understood as the starvation threshold for writes: the kernel always prioritizes read requests, starved counts how many read batches have been processed so far, and only when starved exceeds writes_starved are write requests considered. Therefore, even if a write request has exceeded its deadline, it will not necessarily be serviced immediately: the current read batch must finish first, and even then the write must wait until starved exceeds writes_starved.

Why does the kernel favour reads? It is a matter of overall performance. Reads are synchronous with respect to the application: the application must wait for the data before it can take its next step, so a read request blocks the process. Writes are different: once the application issues a write, the contents being written to the block device no longer affect the program, so the scheduler gives reads priority.
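The deadline parameters discussed above are exposed as sysfs tunables under queue/iosched/ while deadline is the active scheduler on a device. A sketch that dumps them (the function name and SYSFS override are illustrative, the override existing only so the function can be tested against a fake tree):

```shell
# Print the deadline scheduler's tunables for a device. These files exist
# only while the deadline scheduler is active on that device.
SYSFS="${SYSFS:-/sys}"

show_deadline_tunables() {
    dev="$1"
    for t in read_expire write_expire writes_starved fifo_batch; do
        f="$SYSFS/block/$dev/queue/iosched/$t"
        [ -r "$f" ] && printf '%s=%s\n' "$t" "$(cat "$f")"
    done
}
```

read_expire and write_expire are in milliseconds, so the defaults described above appear as read_expire=500 and write_expire=5000.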
Some of the articles referenced below report that the deadline algorithm outperforms CFQ in certain multithreaded applications, and others report the same for certain database applications.
3. The anticipatory algorithm
The core of the anticipatory algorithm is the principle of locality: after a process issues an IO request, it is expected to issue further IO requests nearby. IO workloads exhibit a phenomenon called "deceptive idleness": a process that has just finished a burst of reads appears idle and issues no more reads, but it is actually processing the data it read and will continue reading once it is done. If the IO scheduler turns to another process's requests during this time, then when the next request of the seemingly idle process arrives, the head must seek back to where it just was, greatly increasing seek time and head rotation time. The anticipatory algorithm therefore waits a short time T (typically 6 ms) after completing a read request; if another read request from the same process arrives within those 6 ms, it continues to be serviced, otherwise the scheduler moves on to the read and write requests of the next process.
In some scenarios the anticipatory algorithm yields a very noticeable performance boost, as several of the articles referenced below report.
It is worth mentioning that the anticipatory algorithm was removed in Linux 2.6.33, because CFQ can be configured to achieve the same effect.
4. The CFQ algorithm
The CFQ (Completely Fair Queuing) algorithm is, as the name implies, a fairness-oriented algorithm. It tries to assign a request queue and a time slice to every process competing for the block device; while a process holds its time slice it can send its read and write requests to the underlying device, and when the time slice is exhausted its request queue is suspended and waits to be scheduled again. The length of each process's time slice and the depth of its queue depend on the process's IO priority. Every process has an IO priority, and the CFQ scheduler takes it into account when deciding when the process's request queue gets to use the block device. IO priorities fall into three classes: RT (real time), BE (best effort), and IDLE, with RT and BE each further divided into 8 sub-priorities. Note that CFQ's fairness is per process, and only for synchronous requests (reads and synchronous writes); asynchronous requests are pooled into shared queues by priority level, 8 (RT) + 8 (BE) + 1 (IDLE) = 17 queues in total.
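The per-process IO priorities that CFQ honours can be inspected and changed from userspace with ionice: -c selects the class (1 = RT, 2 = BE, 3 = idle) and -n the 0-7 sub-priority for the RT and BE classes. The snippet below is a sketch and is guarded because ionice may not be installed everywhere:

```shell
# Show the IO class/priority of the current shell; CFQ uses this value
# when deciding how to schedule the process's requests.
if command -v ionice >/dev/null 2>&1; then
    ionice -p $$
    # To run a command at best-effort class, lowest sub-priority:
    # ionice -c 2 -n 7 some_command
fi
```

A freshly started process typically reports class "none", in which case CFQ derives an effective priority from the process's CPU nice value.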
Since Linux 2.6.18, CFQ has been the default IO scheduling algorithm.
For a general-purpose server, CFQ is a good choice.
As for which scheduling algorithm to use, run thorough benchmarks against your specific workload and decide from the results; do not rely on other people's write-ups alone.
5. Change the IO scheduling algorithm
In RHEL5/OEL5 and later versions (such as RHEL6 and RHEL7), the I/O scheduler can be set per disk, and the change takes effect immediately, for example:
$ cat /sys/block/sda/queue/scheduler
[noop] anticipatory deadline cfq
# change to cfq
$ echo 'cfq' > /sys/block/sda/queue/scheduler
# takes effect immediately
$ cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
(Note that the scheduler file belongs to the whole device, e.g. sda, not to a partition such as sda1.)
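A change made through sysfs does not survive a reboot. A common way to make the choice persistent (not covered in the original text, and the exact file is distribution-dependent) is the elevator= kernel boot parameter, appended to the kernel line in the boot loader configuration:

```
# /boot/grub/grub.conf on RHEL5/6 (grub2-based distributions append
# elevator=deadline to GRUB_CMDLINE_LINUX in /etc/default/grub instead)
kernel /vmlinuz-<version> ro root=<root-device> elevator=deadline
```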
6. Some disk-related kernel parameters
/sys/block/sda/queue/nr_requests — the disk queue depth. The default is 128; it can be raised to 512. A longer queue uses more memory and can increase latency, but allows more read and write requests to be merged, so more data is read and written per operation.
/sys/block/sda/queue/iosched/antic_expire — the anticipation wait time (for the anticipatory scheduler): how long to wait for a new request to be generated nearby.
/sys/block/sda/queue/read_ahead_kb — useful for sequential reads; it specifies how much content to read ahead, regardless of how much is actually needed. The default of 128 KB is much smaller than a large sequential read; setting it higher is very useful when reading large files and can effectively reduce the number of seeks. This parameter can also be set with blockdev --setra, which takes a count of 512-byte sectors, so the value in KB is the sector count divided by 2: setting 512 sectors means reading ahead 256 KB.
/proc/sys/vm/dirty_ratio
This parameter controls the size of the file system write buffer, expressed as a percentage of system memory: when the write buffer reaches this share of memory, data starts being written to disk. Increasing it lets more system memory be used for write buffering, which can greatly improve write performance; but under continuous, constant writes you should lower it. The default is 10. To increase it, echo a larger percentage into the file: echo <percent> > /proc/sys/vm/dirty_ratio
/proc/sys/vm/dirty_background_ratio
This parameter controls when pdflush, the file system flush process, starts flushing to disk. It is a percentage of system memory: when the write buffer reaches this share of memory, pdflush begins writing data to disk. Increasing it lets more system memory be used for write buffering and can greatly improve write performance; but under continuous, constant writes you should lower it. The default is 5. To increase it: echo <percent> > /proc/sys/vm/dirty_background_ratio
/proc/sys/vm/dirty_writeback_centisecs
This parameter controls the run interval of pdflush, the kernel's dirty-data flush process, in units of 1/100 second. The default is 500, i.e. 5 seconds. If your system writes continuously, it is actually better to lower this value so that spiky write bursts are flattened into multiple smaller writes: echo <interval> > /proc/sys/vm/dirty_writeback_centisecs. If your system's writes are short-term spikes with small amounts of data (tens of MB at a time) and memory is plentiful, you should increase this value instead: echo <interval> > /proc/sys/vm/dirty_writeback_centisecs
/proc/sys/vm/dirty_expire_centisecs
This parameter declares when data in the Linux write buffer becomes "old" enough for the pdflush process to start considering writing it to disk, in units of 1/100 second. The default is 30000, meaning data that has been dirty for 30 seconds is old and will be flushed to disk. For especially heavy write workloads it is good to shrink this value, but not by too much, because shrinking it too far makes IO be issued too frequently. A suggested setting is 1500, i.e. data becomes old after 15 seconds: echo 1500 > /proc/sys/vm/dirty_expire_centisecs. Of course, if your system has plenty of memory, writes are intermittent, and each write is small (say, tens of MB), then it is better to increase this value.
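To review all four writeback settings discussed above at once, a small sketch (the function name and the PROCFS override are illustrative; the override exists only so the function can be tested against a fake /proc tree):

```shell
# Print the current values of the vm writeback parameters discussed above.
PROCFS="${PROCFS:-/proc}"

show_dirty_params() {
    for p in dirty_ratio dirty_background_ratio \
             dirty_writeback_centisecs dirty_expire_centisecs; do
        f="$PROCFS/sys/vm/$p"
        [ -r "$f" ] && printf '%s = %s\n' "$p" "$(cat "$f")"
    done
}

show_dirty_params
```

Running this before and after tuning makes it easy to confirm that an echo into /proc/sys/vm actually took effect.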
Reference documents
1. https://en.wikipedia.org/wiki/noop_scheduler
2. https://en.wikipedia.org/wiki/deadline_scheduler
3. https://en.wikipedia.org/wiki/anticipatory_scheduling
4. https://en.wikipedia.org/wiki/cfq
5. http://www.redhat.com/magazine/008jun05/features/schedulers/ — introduces the four IO scheduling algorithms and reviews their application scenarios.
6. http://www.dbform.com/html/2011/1510.html
7. https://support.rackspace.com/how-to/configure-flash-drives-in-high-io-instances-as-data-drives/ — describes configuring the IO scheduler for some SSDs.
8. https://www.percona.com/blog/2009/01/30/linux-schedulers-in-tpcc-like-benchmark/
9. http://www.ibm.com/support/knowledgecenter/api/content/linuxonibm/liaat/liaatbestpractices_pdf.pdf
10. http://dl.acm.org/citation.cfm?id=502046&dl=guide&coll=guide
11. http://www.nuodb.com/techblog/tuning-linux-io-scheduler-ssds