Linux crash: hard lockup troubleshooting record

Source: Internet
Author: User

The kernel is 3.10.0-327. The crash record is as follows:

      KERNEL: vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 48
        DATE: Wed Oct 18 20:37:18 2017
      UPTIME: 1 days, 09:43:06
LOAD AVERAGE: 13.42, 10.66, 9.48
       TASKS: 7329
    NODENAME: host-10-229-143-10
     RELEASE: 3.10.0-327.22.2.el7.x86_64
     VERSION: #1 SMP Fri Sep 15:13:08 CST 2017
     MACHINE: x86_64  (2199 Mhz)
      MEMORY: 383.6 GB
       PANIC: "Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 10"
         PID: 24023
     COMMAND: "Fas_readwriter"
        TASK: ffff882f460a2e00  [THREAD_INFO: ffff882f10c44000]
         CPU: 10
       STATE: TASK_RUNNING (PANIC)

Note on STATE: this is an R-state lockup. The process has stayed in TASK_RUNNING and held the CPU for a long time without being switched out. Typically the process has disabled preemption and then keeps executing, or it is stuck in an infinite loop or sleeps with preemption disabled. This often causes several CPUs to lock each other up and makes the whole system misbehave.

crash> bt
PID: 24023  TASK: ffff882f460a2e00  CPU: 10  COMMAND: "Fas_readwriter"
 #0 [ffff882fbfd459c8] machine_kexec at ffffffff81051c5b
 #1 [ffff882fbfd45a28] crash_kexec at ffffffff810f3ec2
 #2 [ffff882fbfd45af8] panic at ffffffff816326d1
 #3 [ffff882fbfd45b78] watchdog_overflow_callback at ffffffff8111d0e2
 #4 [ffff882fbfd45b88] __perf_event_overflow at ffffffff811608d1
 #5 [ffff882fbfd45c00] perf_event_overflow at ffffffff811613a4
 #6 [ffff882fbfd45c10] intel_pmu_handle_irq at ffffffff81032628
 #7 [ffff882fbfd45e60] perf_event_nmi_handler at ffffffff81642bcb
 #8 [ffff882fbfd45e80] nmi_handle at ffffffff81642319
 #9 [ffff882fbfd45ec8] do_nmi at ffffffff81642430
#10 [ffff882fbfd45ef0] end_repeat_nmi at ffffffff81641753
    [exception RIP: put_compound_page+336]
    RIP: ffffffff81178b60  RSP: ffff882f10c47d80  RFLAGS: 00000006
    RAX: 006016c60138402c  RBX: ffffea0123302a40  RCX: 0000000000000022
    RDX: 0000000000000246  RSI: 000000000a6a9000  RDI: ffffea0123300000
    RBP: ffff882f10c47d98   R8: ffff882f10c47dc8   R9: ffff882f10c47d74
    R10: ffff880000000298  R11: 000000000a6aa000  R12: ffffea0123300000
    R13: 0000000000000246  R14: 0000000000000000  R15: ffffea0123302a40
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#11 [ffff882f10c47d80] put_compound_page at ffffffff81178b60
#12 [ffff882f10c47da0] put_page at ffffffff81178bac
#13 [ffff882f10c47db0] get_futex_key at ffffffff810e3c86
#14 [ffff882f10c47e08] futex_wake at ffffffff810e3f1a
#15 [ffff882f10c47e70] do_futex at ffffffff810e6a12
#16 [ffff882f10c47f08] sys_futex at ffffffff810e6f20
#17 [ffff882f10c47f80] system_call_fastpath at ffffffff81649909
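
A quick sanity check on the register dump above is the interrupt flag. On x86, bit 9 of RFLAGS is IF, and the dump shows RFLAGS 00000006 at the moment the NMI arrived. The helper below is only an illustrative sketch (the function and macro names are made up for this note, not taken from the kernel or from crash); it decodes that bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 9 of RFLAGS is IF, the interrupt enable flag on x86. */
#define X86_RFLAGS_IF (UINT64_C(1) << 9)

/* Hypothetical helper: true when maskable interrupts were enabled
 * for the given RFLAGS value. */
bool rflags_irqs_enabled(uint64_t rflags)
{
	return (rflags & X86_RFLAGS_IF) != 0;
}
```

Calling rflags_irqs_enabled(0x00000006) returns false, which means interrupts were masked on CPU 10 when the watchdog NMI fired. That is what one would expect for a hard lockup.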

First, a hard lockup is generally triggered because interrupts have been disabled for too long, so the first step is to look through the stack for code that disables them. Common candidates are spinlock variants that disable interrupts, local_irq_disable(), and so on.
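
To make the link between "interrupts off" and the watchdog panic concrete, here is a small user-space model of the check that the NMI watchdog performs. It is a simplified sketch modeled on the hrtimer counter comparison in kernel/watchdog.c, not the kernel code itself. The struct and function names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical per-CPU state, modeled on the counters in kernel/watchdog.c. */
struct cpu_watchdog {
	unsigned long hrtimer_interrupts;        /* bumped by the timer interrupt */
	unsigned long hrtimer_interrupts_saved;  /* value seen at the previous NMI check */
};

/* Timer interrupt path: it can only run while interrupts are enabled. */
void watchdog_timer_tick(struct cpu_watchdog *w)
{
	w->hrtimer_interrupts++;
}

/* NMI path: report a hard lockup if no timer tick arrived since the last check. */
bool watchdog_nmi_check(struct cpu_watchdog *w)
{
	if (w->hrtimer_interrupts_saved == w->hrtimer_interrupts)
		return true;
	w->hrtimer_interrupts_saved = w->hrtimer_interrupts;
	return false;
}
```

If a CPU keeps interrupts disabled, watchdog_timer_tick() never runs on it, so two consecutive NMI checks see the same counter and the second one reports a lockup. This matches the watchdog_overflow_callback frame in the backtrace.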

Following the stack, get_futex_key contains this code:

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	page_head = page;
	if (unlikely(PageTail(page))) {
		put_page(page);
		/* serialize against __split_huge_page_splitting() */
		local_irq_disable();	/* <-- interrupts are disabled here */
		if (likely(__get_user_pages_fast(address, 1, !ro, &page) == 1)) {	/* <-- calls __get_user_pages_fast */
			page_head = compound_head(page);
			/*
			 * page_head is valid pointer but we must pin
			 * it before taking the PG_lock and/or
			 * PG_compound_lock. The moment we re-enable
			 * irqs __split_huge_page_splitting() can
			 * return and the head page can be freed from
			 * under us. We can't take the PG_lock and/or
			 * PG_compound_lock on a page that could be
			 * freed from under us.
			 */
			if (page != page_head) {
				get_page(page_head);
				put_page(page);
			}
			local_irq_enable();
		} else {
			local_irq_enable();
			goto again;
		}
	}
#else
	page_head = compound_head(page);
	if (page != page_head) {
		get_page(page_head);
		put_page(page);
	}
#endif
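
For context on how user space reaches this code: in the 3.10 source, get_futex_key appears to return early for process-private futexes and only looks up the backing page (and so reaches the block above) for shared ones. The sketch below shows the kind of call that takes the sys_futex -> do_futex -> futex_wake path seen in the backtrace. The wrapper name is hypothetical, and this is not the code of the crashed Fas_readwriter process.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: wake up to nr_wake waiters on a futex word.
 * FUTEX_WAKE is passed without FUTEX_PRIVATE_FLAG, so the kernel
 * treats the futex as shared.
 * Returns the number of waiters woken, or -1 with errno set.
 */
long shared_futex_wake(int *uaddr, int nr_wake)
{
	return syscall(SYS_futex, uaddr, FUTEX_WAKE, nr_wake, NULL, NULL, 0);
}
```

With no thread waiting on the word, the call simply returns 0. The interesting part for this crash is the kernel-side page lookup it triggers.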

Next, check whether CONFIG_TRANSPARENT_HUGEPAGE is configured:

grep CONFIG_TRANSPARENT_HUGEPAGE /boot/config-3.10.0-327.22.2.el7.x86_64
CONFIG_TRANSPARENT_HUGEPAGE=y

So the option is configured. To double-check, disassemble get_futex_key and search the output for a call to __get_user_pages_fast. Finding it confirms that the transparent huge page branch was actually compiled in.

Next, we need to analyze why put_compound_page, called from put_page, does not return.

void put_page(struct page *page)
{
	if (unlikely(PageCompound(page)))
		put_compound_page(page);
	else if (put_page_testzero(page))
		__put_single_page(page);
}
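
The register dump can be used to cross-check that the page really is a tail page of a compound page. The sketch below rests on several assumptions that the dump itself does not confirm: that RBX/R15 (ffffea0123302a40) hold page, that RDI/R12 (ffffea0123300000) hold the head page, that RSI (000000000a6a9000) is the page-aligned user address, and that sizeof(struct page) is 64 bytes with 4 KB base pages and 2 MB huge pages. The function and macro names are hypothetical.

```c
#include <stdint.h>

/* Assumptions: typical x86_64 values for a 3.10 kernel, not read from the vmcore. */
#define ASSUMED_STRUCT_PAGE_SIZE 64u   /* sizeof(struct page) */
#define ASSUMED_PAGE_SHIFT       12u   /* 4 KB base pages */
#define ASSUMED_PAGES_PER_HUGE   512u  /* 2 MB huge page / 4 KB */

/* Index of a tail page inside its compound page, from two struct page addresses. */
uint64_t tail_index_from_pages(uint64_t page, uint64_t page_head)
{
	return (page - page_head) / ASSUMED_STRUCT_PAGE_SIZE;
}

/* Index of the 4 KB page holding a user address inside its 2 MB-aligned region. */
uint64_t tail_index_from_address(uint64_t address)
{
	return (address >> ASSUMED_PAGE_SHIFT) % ASSUMED_PAGES_PER_HUGE;
}
```

Under these assumptions, tail_index_from_pages(0xffffea0123302a40, 0xffffea0123300000) and tail_index_from_address(0x0a6a9000) both give 169. That agreement would be consistent with page being tail page 169 of a 2 MB transparent huge page, but it is a plausibility check, not proof.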

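On this system /proc/sys/vm/nr_hugepages is set to 0, so no static huge pages are reserved. Any compound page involved here most likely comes from transparent huge pages.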
