Solution to a process kernel stack overflow caused by XFS
System Environment
- System Version: CentOS release 6.5
- Kernel Version: 2.6.32-431.20.3.el6.x86_64
- File System: XFS
Problem Description
The system panicked and printed the following calltrace:
kvm: 16396: cpu1 unhandled wrmsr: 0x391 data 2000000f
BUG: scheduling while atomic: qemu-system-x86/27122/0xffff8811
BUG: unable to handle kernel paging request at 00000000dd7ed3a8
IP: [<ffffffff81058e5d>] task_rq_lock+0x4d/0xa8
PGD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/host0/target0:2:1/0:2:1/block/sdb/queue/logical_block_size
...
[<ffffffff81058e5d>] ? task_rq_lock+0x4d/0xa0
[<ffffffff8106195c>] ? try_to_wake_up+0x3c/0x3e0
[<ffffffff81061d55>] ? wake_up_process+0x15/0x20
[<ffffffff810a0f62>] ? __up+0x2a/0x40
[<ffffffffa03394c2>] ? xfs_buf_unlock+0x32/0x90 [xfs]
[<ffffffffa030297f>] ? xfs_buf_item_unpin+0xcf/0x1a0 [xfs]
[<ffffffffa032f18c>] ? xfs_trans_committed_bulk+0x29c/0x2b0 [xfs]
[<ffffffff81069f15>] ? enqueue_entity+0x125/0x450
[<ffffffff81060aa3>] ? perf_event_task_sched_out+0x33/0x70
[<ffffffff81069973>] ? dequeue_entity+0x113/0x2e0
[<ffffffffa032326d>] ? xlog_cil_committed+0x3d/0x100 [xfs]
[<ffffffffa031f79d>] ? xlog_state_do_callback+0x15d/0x2b0 [xfs]
[<ffffffffa031f96e>] ? xlog_state_done_syncing+0x7e/0xb0 [xfs]
[<ffffffffa03200e9>] ? xlog_iodone+0x59/0xb0 [xfs]
[<ffffffffa033ae50>] ? xfs_buf_iodone_work+0x0/0x50 [xfs]
[<ffffffffa033ae76>] ? xfs_buf_iodone_work+0x26/0x50 [xfs]
Error Tracking
Unable to handle kernel paging request at 00000000dd7ed3a8
The address 00000000dd7ed3a8 is a user-space address that the kernel would not normally access, so this can be identified as a bug in the kernel.
IP: [<ffffffff81058e5d>] task_rq_lock+0x4d/0xa8
Because kdump was not deployed on this system, we can only use objdump for static analysis to track down the faulting instruction address.
ffffffff81058e10 <task_rq_lock>:
 * interrupts. Note the ordering: we can safely lookup the task_rq without
 * explicitly disabling preemption.
 */
static struct rq *task_rq_lock(struct task_struct *p, unsigned long *flags)
	__acquires(rq->lock)
{
ffffffff81058e10:	55                      push   %rbp
ffffffff81058e11:	48 89 e5                mov    %rsp,%rbp
ffffffff81058e14:	48 83 ec 20             sub    $0x20,%rsp
ffffffff81058e18:	48 89 1c 24             mov    %rbx,(%rsp)
ffffffff81058e1c:	4c 89 64 24 08          mov    %r12,0x8(%rsp)
ffffffff81058e21:	4c 89 6c 24 10          mov    %r13,0x10(%rsp)
ffffffff81058e26:	4c 89 74 24 18          mov    %r14,0x18(%rsp)
ffffffff81058e2b:	e8 10 1f fb ff          callq  ffffffff8100ad40 <mcount>
ffffffff81058e30:	48 c7 c3 40 68 01 00    mov    $0x16840,%rbx
ffffffff81058e37:	49 89 fc                mov    %rdi,%r12
ffffffff81058e3a:	49 89 f5                mov    %rsi,%r13
ffffffff81058e3d:	ff 14 25 80 8b a9 81    callq  *0xffffffff81a98b80
ffffffff81058e44:	48 89 c2                mov    %rax,%rdx
	PVOP_VCALLEE1(pv_irq_ops.restore_fl, f);
}

static inline void raw_local_irq_disable(void)
{
	PVOP_VCALLEE0(pv_irq_ops.irq_disable);
ffffffff81058e47:	ff 14 25 90 8b a9 81    callq  *0xffffffff81a98b90
	struct rq *rq;

	for (;;) {
		local_irq_save(*flags);
ffffffff81058e4e:	49 89 55 00             mov    %rdx,0x0(%r13)
		rq = task_rq(p);
ffffffff81058e52:	49 8b 44 24 08          mov    0x8(%r12),%rax
ffffffff81058e57:	49 89 de                mov    %rbx,%r14
ffffffff81058e5a:	8b 40 18                mov    0x18(%rax),%eax
ffffffff81058e5d:	4c 03 34 c5 60 cf bf    add    -0x7e4030a0(,%rax,8),%r14
ffffffff81058e64:	81
		spin_lock(&rq->lock);
ffffffff81058e65:	4c 89 f7                mov    %r14,%rdi
ffffffff81058e68:	e8 a3 23 4d 00          callq  ffffffff8152b210 <_spin_lock>
Disassembling vmlinux with objdump and locating the faulting instruction shows that the crash happens at address ffffffff81058e5d, inside task_rq_lock(), at the point where task_rq() is evaluated. The preceding instructions load p->stack (that is, the thread_info pointer) and read its cpu field (mov 0x18(%rax),%eax), which is then used to index the per-CPU offset table; a corrupted cpu value therefore produces exactly the kind of wild address reported in the oops.
kernel/sched.c
#define task_rq(p)		cpu_rq(task_cpu(p))

/*
 * task_rq_lock - lock the runqueue a given task resides on and disable
 * interrupts. Note the ordering: we can safely lookup the task_rq without
 * explicitly disabling preemption.
 */
static struct rq *task_rq_lock(struct task_struct *p, unsigned long *flags)
	__acquires(rq->lock)
{
	struct rq *rq;

	for (;;) {
		local_irq_save(*flags);
		rq = task_rq(p);
		spin_lock(&rq->lock);
		if (likely(rq == task_rq(p)))
			return rq;
		spin_unlock_irqrestore(&rq->lock, *flags);
	}
}
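For context, cpu_rq() resolves the runqueue through the per-CPU data area. Paraphrased from memory of the 2.6.32 source (not a verbatim excerpt), the related macros in kernel/sched.c look roughly like this:

#define cpu_rq(cpu)	(&per_cpu(runqueues, (cpu)))	/* per-CPU runqueue lookup */
#define this_rq()	(&__get_cpu_var(runqueues))
#define task_rq(p)	cpu_rq(task_cpu(p))		/* indexes per-CPU data with thread_info->cpu */

So task_rq() trusts thread_info->cpu; if that field contains garbage, the per-CPU lookup computes a wild pointer, which matches the faulting add instruction in the disassembly above.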
include/linux/sched.h
#define task_thread_info(task)	((struct thread_info *)(task)->stack)

static inline unsigned int task_cpu(const struct task_struct *p)
{
	return task_thread_info(p)->cpu;
}

union thread_union {
	struct thread_info thread_info;
	unsigned long stack[THREAD_SIZE/sizeof(long)];
};
Here we can see that thread_info and the process's kernel stack share a single union, so a kernel stack overflow corrupts thread_info. Let's look at the kernel stack size:
arch/x86/include/asm/page_64_types.h
#define THREAD_ORDER	1
#define THREAD_SIZE	(PAGE_SIZE << THREAD_ORDER)
#define CURRENT_MASK	(~(THREAD_SIZE - 1))
With a 4 KB page size and THREAD_ORDER of 1, the kernel stack on a 64-bit system is 8 KB.
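To make the layout concrete, here is a small user-space C sketch (purely illustrative; the struct fields are simplified stand-ins, not the real kernel definitions) showing why a deep call chain corrupts thread_info: the structure sits at the low end of the 8 KB region while the stack grows down from the high end.

#include <stdio.h>
#include <string.h>

#define PAGE_SIZE    4096UL
#define THREAD_ORDER 1
#define THREAD_SIZE  (PAGE_SIZE << THREAD_ORDER)    /* 8 KB, as on 2.6.32 x86_64 */

/* Simplified stand-ins for the kernel structures (illustrative only). */
struct thread_info {
	int cpu;
	int preempt_count;
};

union thread_union {
	struct thread_info thread_info;                  /* occupies the low addresses */
	unsigned long stack[THREAD_SIZE / sizeof(long)]; /* stack grows down from the top */
};

int main(void)
{
	union thread_union tu;

	memset(&tu, 0, sizeof(tu));

	/* Simulate a call chain that consumes the entire 8 KB: filling the region
	 * from the top downwards eventually lands on thread_info at the bottom. */
	unsigned long *sp = &tu.stack[THREAD_SIZE / sizeof(long)]; /* initial "stack pointer" */
	while (sp > tu.stack)
		*--sp = 0xdeadbeefdeadbeefUL;

	printf("cpu = 0x%x, preempt_count = 0x%x\n",
	       tu.thread_info.cpu, tu.thread_info.preempt_count);  /* both now garbage */
	return 0;
}

In the real kernel the corruption happens the same way: the deepest frames of an overflowing XFS call chain overwrite the thread_info fields (cpu, preempt count, and so on) that the scheduler later relies on.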
The thread_info structure and the process's kernel-mode stack live in the same union, which is 8 KB in total by default. For some reason the XFS code path consumed too much stack space, overflowing the stack and corrupting the thread_info structure.
"Scheduling while atomic" should be caused by a stack overflow that overwrites the preemptible count (preempt count) in the thread_info struct of the process. As a result, the preemption count is non-zero when it is awakened next time, and panic appears.
Cause Analysis
Based on the calltrace and objdump analysis, there are two likely causes of the XFS-induced stack overflow:
One possibility is that xfs_iomap_write_direct() does not set the XFS_BMAPI_STACK_SWITCH flag, so the xfs_bmapi_allocate() call is not handed off to a dedicated worker thread (which would guarantee enough stack) but instead runs directly on the process's own kernel stack, which overflows.
This bug was fixed in kernel 3.4 (commit c999a22, "xfs: introduce an allocation workqueue"); the general pattern of the fix is sketched below.
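The idea of the allocation-workqueue fix is to run the deep allocation path on a worker thread, which has its own fresh kernel stack, while the original process sleeps until the work completes. A minimal sketch of that pattern (the struct and function names here are illustrative placeholders, not the actual XFS code):

#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/completion.h>

/* Hypothetical argument block for illustration (not the real struct xfs_bmalloca). */
struct alloc_args {
	struct work_struct	work;
	struct completion	*done;
	int			result;
	/* ... allocation parameters ... */
};

static int do_allocation(struct alloc_args *args)
{
	/* The deep call chain runs here, on the worker thread's stack. */
	return 0;
}

static void alloc_worker(struct work_struct *work)
{
	struct alloc_args *args = container_of(work, struct alloc_args, work);

	args->result = do_allocation(args);
	complete(args->done);			/* wake the original caller */
}

/* Called from the user process: queue the work and sleep until it is done,
 * so the deep allocation path never touches this process's 8 KB stack. */
static int allocate_via_worker(struct workqueue_struct *wq, struct alloc_args *args)
{
	DECLARE_COMPLETION_ONSTACK(done);

	args->done = &done;
	INIT_WORK(&args->work, alloc_worker);
	queue_work(wq, &args->work);
	wait_for_completion(&done);
	return args->result;
}

This is the same handoff idea the real commit applies to xfs_bmapi_allocate(): the allocation's stack consumption moves to the worker thread instead of the issuing process.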
There was also some debate about this fix: a dedicated allocation workqueue adds thread creation and switching overhead, which can slow down I/O writeback, and an 8 KB kernel stack still cannot help code paths whose call depth exceeds 8 KB. For these reasons kernel 3.16 expanded the kernel stack to 16 KB (commit 6538b8e, "x86_64: expand kernel stack to 16K").
In short, the kernel community produced two fixes: offloading the allocation/writeback path to a worker thread (commit c999a22, "xfs: introduce an allocation workqueue") and expanding the kernel stack to 16 KB (commit 6538b8e, "x86_64: expand kernel stack to 16K"). Both are worth reading if you are interested.
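The effect of the 16 KB patch on the header shown earlier is simply to raise the order. Expressed in the 2.6.32 macro names used above (illustrative of the effect, not the verbatim upstream diff, which touches THREAD_SIZE_ORDER in later kernels):

/* arch/x86/include/asm/page_64_types.h, conceptually after the 16 KB change */
#define THREAD_ORDER	2				/* was 1 */
#define THREAD_SIZE	(PAGE_SIZE << THREAD_ORDER)	/* 4 KB << 2 = 16 KB */
#define CURRENT_MASK	(~(THREAD_SIZE - 1))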
Currently, CentOS kernel 2.6.32-520.el6 has backported the kernel 3.16 patch (6538b8e, "x86_64: expand kernel stack to 16K") from mainline, and the two patches do not conflict. We recommend upgrading the kernel first to see whether the 16 KB kernel stack solves the xfs_iomap_write_direct() problem; if not, the allocation workqueue patch (commit c999a22, "xfs: introduce an allocation workqueue") can be applied on top.
The other possible cause is that xfs_buf_lock() performs a log force right before blocking on a semaphore; the log force call chain is deep and consumes a large amount of stack, which can also trigger the panic. This matches bug 1028831 in the CentOS kernel changelog, which was fixed in 2.6.32-495.el6.
Solution
Upgrade the kernel to 2.6.32-520.el6 or later, which contains both patches.
Changelog
[2.6.32-520.el6]
- [kernel] x86_64: expand kernel stack to 16K (Johannes Weiner) [1045190 1060721]
[2.6.32-495.el6]
- [fs] xfs: always do log forces via the workqueue (Eric Sandeen) [1028831]
- [fs] xfs: Do background CIL flushes via a workqueue (Eric Sandeen) [1028831]