Concept Description
A Linux kernel lockup means that kernel code monopolizes a CPU. Lockups come in two kinds: soft lockup and hard lockup.
Soft lockup means the CPU is held by kernel code so that other processes cannot run. To detect it, the kernel gives each CPU a periodically run kernel thread, [watchdog/x]; if that thread has not run within the set deadline, a soft lockup has occurred. [watchdog/x] is a SCHED_FIFO real-time process with the highest priority, 99, so it is scheduled ahead of other tasks.
Hard lockup is more severe than soft lockup: the CPU not only cannot run other processes, it also no longer responds to interrupts. Hard lockup detection relies on a PMU perf event that raises an NMI.
Because NMIs cannot be masked, the NMI handler still runs even when the CPU no longer responds to ordinary interrupts, and it checks whether the clock-interrupt counter hrtimer_interrupts is still increasing.
If the counter has stalled, the clock interrupt has not been serviced, which means a hard lockup has occurred.
The Linux kernel implements this in kernel/watchdog.c.
Three things are involved: a kernel thread, the clock interrupt, and the NMI (non-maskable interrupt).
Their priorities rise in that order: kernel thread < clock interrupt < NMI.
Detection mechanism
The Linux kernel detects lockups with a mechanism called the NMI watchdog, which is built on NMIs. NMIs are used because a lockup can occur while interrupts are disabled; since NMIs cannot be masked, they are then the only way to regain control of the CPU. The kernel's NMI watchdog includes both a soft lockup detector and a hard lockup detector; the 2.6 implementation works as follows.
The NMI watchdog is driven by two triggers:
1. A high-resolution timer (hrtimer), whose interrupt handler is kernel/watchdog.c:watchdog_timer_fn(). This routine:
- increments the counter hrtimer_interrupts, which the hard lockup detector uses to decide whether the CPU still responds to interrupts;
- wakes the [watchdog/x] kernel thread, whose job is to update a timestamp;
- runs the soft lockup detector, which checks that timestamp: if it has not been updated for longer than the soft lockup threshold, [watchdog/x] has not run, meaning the CPU is monopolized, i.e. a soft lockup.
2. A PMU-based NMI perf event: when the PMU counter overflows, it triggers an NMI whose handler is kernel/watchdog.c:watchdog_overflow_callback().
This handler runs the hard lockup detector, which checks whether hrtimer_interrupts is still increasing; if it has stalled, the hrtimer interrupt is not being serviced, i.e. a hard lockup has occurred.
Parameter setting
The hrtimer period is softlockup_thresh/5.
- In the 2.6 kernel:
The value of softlockup_thresh equals the kernel parameter kernel.watchdog_thresh, default 60 seconds;
- In the 3.10 kernel:
The kernel parameter kernel.watchdog_thresh keeps its name, but it now means the hard lockup threshold, default 10 seconds;
the soft lockup threshold equals 2*kernel.watchdog_thresh, i.e. 20 seconds by default.
The NMI perf event is PMU-based. In the 2.6 kernel its trigger period (the hard lockup threshold) is fixed at 60 seconds and cannot be adjusted manually; in the 3.10 kernel it can be adjusted,
because it maps directly to the kernel parameter kernel.watchdog_thresh, whose default is 10 seconds.
Exception handling
What happens once a lockup is detected? The kernel can panic automatically, or simply print a warning and carry on; this is controlled by kernel parameters:
- kernel.softlockup_panic: determines whether the kernel panics automatically when a soft lockup is detected; the default value is 0;
- kernel.nmi_watchdog: defines whether the NMI watchdog is enabled and whether a hard lockup causes a panic; the corresponding boot parameter has the format nmi_watchdog=[panic,][nopanic,][num].
Reference: https://www.kernel.org/doc/Documentation/lockup-watchdogs.txt
Analysis of the Linux kernel lockup mechanism