(WIP) I/O elevator algorithms on SSDs and hugepage settings may result in crash (by Quqi99)


Zhang Hua, posted 2016-03-24
Copyright notice: this article may be reproduced freely, but please include a hyperlink to the original source, the author's information, and this copyright notice
(http://blog.csdn.net/quqi99)

A process in the VM was stuck, and "cat /proc/diskstats" showed a large number of queued requests on an SSD drive.
hung_task_timeout_secs and D-state processes: a process waiting for I/O in the D state (TASK_UNINTERRUPTIBLE) does not handle signals, so kill cannot terminate it. If a process stays in the D state for a long time, something is certainly wrong, usually for one of two reasons:
    1. A hardware problem on the I/O path, such as a hard drive failure (only a few such cases result in a long-term D state; usually an error is returned);
    2. A bug in the kernel itself.
These problems are hard to locate and, once they occur, usually unrecoverable: kill has no effect, and typically only a reboot helps. The kernel therefore provides a hung-task detection mechanism. The basic principle: periodically scan the system for processes in the D state, and if one has been in that state for longer than the configured timeout (120s by default), print its stack trace. A proc parameter can also make the detector panic the machine directly (it can also happen that the kernel is already crashing for a separate error, and because the crash takes longer than 2 minutes, its own processes are detected by the hung-task detector as D-state tasks).
1) Set the timeout (120 is the default):
echo 120 > /proc/sys/kernel/hung_task_timeout_secs
2) Choose whether a detected hung task triggers a panic:
echo 1 > /proc/sys/kernel/hung_task_panic
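To make these settings persist across reboots, the equivalent sysctl keys can be added to /etc/sysctl.conf (file locations vary by distribution) and loaded with "sysctl -p":
kernel.hung_task_timeout_secs = 120
kernel.hung_task_panic = 1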

Hugepages and the D state: the translation from logical address to linear address is done by segmentation, and the translation from linear address to physical address is done by paging. The default page size is 4K; to reduce the number of page-table entries, the page size can be increased, hence the name hugepage. THP (Transparent Huge Pages) is an abstraction layer that automates hugepage management. (Because of the way it is implemented, THP can take memory locks that hurt performance, especially when the program was not written with hugepages in mind: khugepaged scans the memory of all processes in the background and collapses 4K pages into huge pages wherever possible, and it must take memory locks to do so.) On Red Hat 6 the khugepaged process starts automatically at boot; to turn it off, either:
1. echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
   cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
   always madvise [never]
2. or pass transparent_hugepage=never on the kernel command line.
The never option clears the TRANSPARENT_HUGEPAGE_FLAG and TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG bits.
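As an aside, a program that knows huge pages hurt its access pattern can also opt a single region out of THP itself with madvise(MADV_NOHUGEPAGE), so khugepaged skips it. A minimal sketch (the mapping size here is illustrative):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 64UL << 20;        /* 64 MB, illustrative */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
                perror("mmap");
                return EXIT_FAILURE;
        }
        /* Ask the kernel not to back this region with transparent
         * huge pages; khugepaged will then leave it alone. */
        if (madvise(buf, len, MADV_NOHUGEPAGE) != 0)
                perror("madvise(MADV_NOHUGEPAGE)");
        memset(buf, 0, len);            /* touch the pages */
        munmap(buf, len);
        return 0;
}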


Memory locking and the D state: a process may call mlock to keep its memory from being swapped out (which would reduce efficiency). mlock calls lru_add_drain_all, which runs the lru_add_drain_per_cpu callback on each CPU to drain its pagevecs, and then calls flush_work to wait for those kworkers to finish. flush_work in turn calls wait_for_completion, which calls do_wait_for_common, and that sets the process to the D state:

static long __sched
wait_for_common(struct completion *x, long timeout, int state)
{
        return __wait_for_common(x, schedule_timeout, timeout, state);
}

static inline long __sched
__wait_for_common(struct completion *x,
                  long (*action)(long), long timeout, int state)
{
        might_sleep();

        spin_lock_irq(&x->wait.lock);
        timeout = do_wait_for_common(x, action, timeout, state);
        spin_unlock_irq(&x->wait.lock);
        return timeout;
}

static inline long __sched
do_wait_for_common(struct completion *x,
                   long (*action)(long), long timeout, int state)
{
        if (!x->done) {
                DECLARE_WAITQUEUE(wait, current);

                __add_wait_queue_tail_exclusive(&x->wait, &wait);
                do {
                        if (signal_pending_state(state, current)) {
                                timeout = -ERESTARTSYS;
                                break;
                        }
                        /* here the caller is put into the D state */
                        __set_current_state(state);
                        spin_unlock_irq(&x->wait.lock);
                        timeout = action(timeout);
                        spin_lock_irq(&x->wait.lock);
                } while (!x->done && timeout);
                __remove_wait_queue(&x->wait, &wait);
                if (!x->done)
                        return timeout;
        }
        x->done--;
        return timeout ?: 1;
}
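For reference, the user-space trigger is just an mlock call; a minimal sketch (the buffer size is illustrative, and mlock may require raising RLIMIT_MEMLOCK):

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 1UL << 20;         /* 1 MB, illustrative */
        void *buf = malloc(len);
        if (buf == NULL)
                return EXIT_FAILURE;
        /* In the kernel this drives lru_add_drain_all() -> flush_work(),
         * which is where the caller can end up waiting in the D state. */
        if (mlock(buf, len) != 0) {
                perror("mlock");
                return EXIT_FAILURE;
        }
        munlock(buf, len);
        free(buf);
        return 0;
}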

Page faults and the D state: handling a page fault can also put a process into the D state. The fault handler do_page_fault calls down_read(&mm->mmap_sem) to take the read lock, which calls __down_read, which in turn calls rwsem_down_failed_common and sets the D state:

        /* wait to be given the lock */
        for (;;) {
                /* the read lock has been acquired: break out of the loop */
                if (!waiter.task)
                        break;
                schedule();     /* otherwise yield the CPU */
                /* reset the state after being scheduled back */
                set_task_state(tsk, TASK_UNINTERRUPTIBLE);
        }
        /* back to the running state after acquiring the lock */
        tsk->state = TASK_RUNNING;


How the crash happens: this page (http://www.oenhan.com/rwsem-realtime-task-hung) describes one crash scenario. Suppose the kworker process above is in the D state, and there is also a real-time process H whose priority is greater than the kworker's, using the FIFO scheduling policy. If H is so busy with its own work that it never releases the CPU, the kworker never gets scheduled, stays in the D state, and hung_task_timeout_secs fires, panicking the machine directly.
On the same CPU core there are two such processes, H1 and H2, both bound to core 5. H2 handles the main business and H1 assists it; H2 has the higher real-time priority and H1 the lower, so when H2 is under enough load to occupy 100% of the CPU, H1 is never scheduled.
Under high business pressure H2 always occupies the CPU and is in the R state, while H1 is not scheduled and sits in the S state. First, khugepaged takes the write lock during its scan while several H threads, H2 among them, request the read lock because of page faults; H2 enters the D state and schedule() gives up the CPU. H1 then also takes a page fault, gets the CPU, requests the read lock after H2, and likewise enters the D state. When khugepaged releases the write lock, it first hands the rwsem to all processes waiting for the read lock and sets them to the TASK_WAKING state (see the __rwsem_do_wake function). H2 is scheduled back first, holds the read lock, and fully occupies the core. The read lock allows concurrent readers, but H1 now has no CPU to run on and loops inside schedule() without ever being scheduled; although it is not running, khugepaged has already granted it the read lock, which it therefore holds without releasing. khugepaged's next write-lock request can then never complete, subsequent page faults requesting the read lock queue up behind it, and the D state reaches 20s. The situation can eventually recover: H2 takes another page fault, requests the read lock again, queues up, enters the D state, and schedule() hands the CPU to H1, which finishes its work and releases the read lock.
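To make the preemption side of this concrete, a busy SCHED_FIFO task pinned to one core starves everything of lower priority on that core, kworkers included. A minimal sketch (the core number and priority are illustrative, and this needs root or CAP_SYS_NICE):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        cpu_set_t set;
        struct sched_param sp = { .sched_priority = 50 };   /* illustrative */

        CPU_ZERO(&set);
        CPU_SET(5, &set);       /* bind to core 5, as in the H1/H2 example */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                perror("sched_setaffinity");
                return EXIT_FAILURE;
        }
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
                perror("sched_setscheduler");
                return EXIT_FAILURE;
        }
        for (;;)
                ;       /* 100% busy: lower-priority tasks on this core never run */
}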

Workaround: the I/O path in a virtual machine is: application -> file/block in VM -> virtio-blk driver in VM -> virtio backend driver in host -> file/block in host -> SSD disk in host
    • VM (file/block in VM -> virtio-blk driver in VM): because the guest runs inside QEMU, the guest's I/O scheduler should be elevator=noop
    • Hypervisor (virtio-blk driver in VM -> virtio backend driver in host): there are two I/O mechanisms (io='native' or io='threads') and five caching modes (writethrough, writeback, none, directsync, unsafe), for example: <driver name='qemu' type='raw' cache='writethrough' io='native'/> (a fuller disk definition is sketched after this list)
    • Host (file/block in host -> SSD disk in host): because the disk is an SSD, the host's I/O scheduler should also be elevator=noop
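Putting the hypervisor-level options together, a complete libvirt disk definition might look like this (the source path and target device here are illustrative):
<disk type='file' device='disk'>
  <driver name='qemu' type='raw' cache='writethrough' io='native'/>
  <source file='/var/lib/libvirt/images/vm1.img'/>
  <target dev='vda' bus='virtio'/>
</disk>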
Therefore, it is necessary to add kernel command-line parameters in grub that switch the elevator and turn off the khugepaged process: transparent_hugepage=never elevator=noop
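On a grub2-based system this is typically done by editing /etc/default/grub and regenerating the configuration (the exact file and update command vary by distribution; on Ubuntu, for example):
GRUB_CMDLINE_LINUX_DEFAULT="transparent_hugepage=never elevator=noop"
$ sudo update-grub && sudo reboot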

Annex I: how to confirm the I/O scheduler
$ sudo lsblk -io KNAME,TYPE,SCHED
KNAME  TYPE SCHED
sda    disk deadline
sda1   part deadline
sda2   part deadline
sda5   part deadline
sda6   part deadline
sda7   part deadline
sda8   part deadline
sda9   part deadline
sda10  part deadline
sdb    disk deadline

$ cat /sys/block/sda/queue/scheduler
noop [deadline] cfq
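The scheduler can also be switched at runtime per device (this does not persist across reboots), e.g. to move sda to noop:
echo noop > /sys/block/sda/queue/scheduler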

http://www.circlingcycle.com.au/Unix-sources/Linux-check-IO-scheduler-and-discard-support.pl.txt
$ perl ./Linux-check-IO-scheduler-and-discard-support.pl
INFO: file systems and raids
NAME    FSTYPE LABEL MOUNTPOINT
sda
├─sda1               [SWAP]
├─sda2
├─sda5               /bak
├─sda6               /data1
├─sda7               /win
├─sda8               /data2
├─sda9               /
└─sda10              /images
sdb


INFO: block devices
NAME    ALIGNMENT  MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED    RQ-SIZE
sda             0    4096      0    4096            1 deadline     128
├─sda1          0    4096      0    4096            1 deadline     128
├─sda2       1024    4096      0    4096            1 deadline     128
├─sda5          0    4096      0    4096            1 deadline     128
├─sda6          0    4096      0    4096            1 deadline     128
├─sda7          0    4096      0    4096            1 deadline     128
├─sda8          0    4096      0    4096            1 deadline     128
├─sda9          0    4096      0    4096            1 deadline     128
└─sda10         0    4096      0    4096            1 deadline     128
sdb             0 1048576   2048                    1 deadline     128


INFO: I/O elevator (scheduler) and discard support summary
INFO: hard disk sda is configured with I/O scheduler "deadline"
INFO: hard disk sdb is configured with I/O scheduler "deadline" and supports the discard operation
