A Linux process can be in one of several states, such as the running state TASK_RUNNING, the exit state EXIT_DEAD, and the signal-receiving wait state TASK_INTERRUPTIBLE (see include/linux/sched.h). One of these is the wait state TASK_UNINTERRUPTIBLE, known as the D state, in which the process does not handle signals and can only be woken by wake_up(). A process can enter this state in many situations: a contended mutex puts the waiter into it, and a process waiting for an I/O resource to become ready (the wait_event mechanism) may also be placed in it. Normally a process does not stay in this state for long, but if an I/O device fails or processes deadlock, the process may remain there indefinitely and never return to TASK_RUNNING. The kernel therefore provides the hung task mechanism, which detects processes that have been in the D state for a long time and raises an alarm so that such situations are easy to discover. This article analyzes the source code of the kernel hung task mechanism and gives a demonstration.
I. Analysis of the Hung Task Mechanism
The hung task mechanism was introduced in an early kernel version. This article uses the source of the more recent Linux 4.1.15 as its example. There is not much code; the source file is kernel/hung_task.c.
First, the overall flow diagram and the design idea:
Figure: D-state deadlock detection flowchart
The core idea is to create a kernel monitoring thread that periodically walks every process (task) in the D state and records how many times each has been scheduled. If that count has not changed between two consecutive checks, the process has been in the D state the whole time and has very likely deadlocked, so an alarm is written to the log: basic information about the process, a stack backtrace, and saved register information, for kernel developers to use in locating the problem.
The implementation is analyzed in detail below:
static int __init hung_task_init(void)
{
	atomic_notifier_chain_register(&panic_notifier_list, &panic_block);
	watchdog_task = kthread_run(watchdog, NULL, "khungtaskd");

	return 0;
}
subsys_initcall(hung_task_init);
First, if this mechanism is enabled in the kernel configuration, hung_task_init() is called during the subsys initialization phase of the kernel to start it. It first registers a callback on the kernel's panic_notifier_list notifier chain:
static struct notifier_block panic_block = {
	.notifier_call = hung_task_panic,
};
The hung_task_panic() function is called when the kernel panics; it is examined later. Initialization then continues by calling kthread_run() to create a kernel thread named khungtaskd, which runs the watchdog() function and is woken immediately so that it can be scheduled. This is the background kernel thread dedicated to detecting processes deadlocked in the D state.
/*
 * kthread which checks for tasks stuck in D state
 */
static int watchdog(void *dummy)
{
	set_user_nice(current, 0);

	for ( ; ; ) {
		unsigned long timeout = sysctl_hung_task_timeout_secs;

		while (schedule_timeout_interruptible(timeout_jiffies(timeout)))
			timeout = sysctl_hung_task_timeout_secs;

		if (atomic_xchg(&reset_hung_task, 0))
			continue;

		check_hung_uninterruptible_tasks(timeout);
	}

	return 0;
}
This thread first sets its nice value to 0, i.e. normal priority, so that it does not affect other processes. It then enters the main loop (one iteration per timeout period). It first sleeps for sysctl_hung_task_timeout_secs, whose default value CONFIG_DEFAULT_HUNG_TASK_TIMEOUT can be changed through the kernel configuration and is 120 s by default. After the sleep ends, the thread tests the atomic flag reset_hung_task; if the flag is set, this round of detection is skipped and the flag is cleared. The flag is set through the reset_hung_task_detector() function (it has very few callers in the kernel, and none on the ARM platform used here):
void reset_hung_task_detector(void)
{
	atomic_set(&reset_hung_task, 1);
}
EXPORT_SYMBOL_GPL(reset_hung_task_detector);
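The sleep interval and the skip flag described above can be modelled in user space. The following is only a sketch, not kernel code: MODEL_HZ, MODEL_MAX_TIMEOUT and every model_* name are hypothetical stand-ins, and the tick rate of 100 is an assumption made for illustration. It shows two properties of the loop: a timeout of 0 turns into an effectively endless sleep, which disables detection, and one reset request suppresses exactly one round because the flag is read and cleared in a single atomic exchange.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed tick rate for illustration; the real HZ comes from the kernel config. */
#define MODEL_HZ 100UL
/* Hypothetical stand-in for "sleep with no timeout". */
#define MODEL_MAX_TIMEOUT LONG_MAX

/* Model of the reset flag: 1 means "skip the next detection round". */
static atomic_int model_reset_flag = 0;

/* Convert the timeout in seconds to ticks; 0 disables detection entirely. */
static unsigned long model_timeout_jiffies(unsigned long timeout_secs)
{
	return timeout_secs ? timeout_secs * MODEL_HZ
			    : (unsigned long)MODEL_MAX_TIMEOUT;
}

/* Counterpart of reset_hung_task_detector(): ask for one round to be skipped. */
static void model_reset_detector(void)
{
	atomic_store(&model_reset_flag, 1);
}

/*
 * Called after each sleep. Reads the flag and clears it in one atomic step,
 * so a single reset request suppresses exactly one round.
 */
static bool model_should_skip_round(void)
{
	return atomic_exchange(&model_reset_flag, 0) != 0;
}
```

With these assumptions, model_timeout_jiffies(120) yields 12000 ticks, and after one call to model_reset_detector() only the next call to model_should_skip_round() returns true.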
Next in the loop comes the detection function check_hung_uninterruptible_tasks(), whose parameter is the detection timeout.
/*
 * Check whether a TASK_UNINTERRUPTIBLE does not get woken up for
 * a really long time (120 seconds). If that happens, print out
 * a warning.
 */
static void check_hung_uninterruptible_tasks(unsigned long timeout)
{
	int max_count = sysctl_hung_task_check_count;
	int batch_count = HUNG_TASK_BATCHING;
	struct task_struct *g, *t;

	/*
	 * If the system crashed already then all bets are off,
	 * do not report extra hung tasks:
	 */
	if (test_taint(TAINT_DIE) || did_panic)
		return;

	rcu_read_lock();
	for_each_process_thread(g, t) {
		if (!max_count--)
			goto unlock;
		if (!--batch_count) {
			batch_count = HUNG_TASK_BATCHING;
			if (!rcu_lock_break(g, t))
				goto unlock;
		}
		/* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */
		if (t->state == TASK_UNINTERRUPTIBLE)
			check_hung_task(t, timeout);
	}
unlock:
	rcu_read_unlock();
}
The function first checks whether the kernel has already died (oops) or panicked. If so, the kernel has crashed, no further detection is needed, and the function returns immediately. Note that the did_panic flag is set in hung_task_panic(), the panic notifier chain callback mentioned earlier:
static int
hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr)
{
	did_panic = 1;

	return NOTIFY_DONE;
}
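To make the callback registration concrete, here is a deliberately simplified user-space sketch of a notifier chain. It is an assumption-laden model, not the kernel implementation: the real chain is a priority-ordered list with its own locking rules, while this version is a fixed array, and all model_* names are hypothetical.

```c
#include <stddef.h>

#define MODEL_NOTIFY_DONE 0
#define MODEL_MAX_BLOCKS 8

/* Simplified notifier block: just a callback pointer. */
struct model_notifier_block {
	int (*notifier_call)(struct model_notifier_block *self,
			     unsigned long event, void *data);
};

static struct model_notifier_block *model_chain[MODEL_MAX_BLOCKS];
static size_t model_chain_len;

/* Flag that the detection pass would consult before scanning tasks. */
static int model_did_panic;

/* Append a block to the chain; returns -1 if the chain is full. */
static int model_chain_register(struct model_notifier_block *nb)
{
	if (model_chain_len == MODEL_MAX_BLOCKS)
		return -1;
	model_chain[model_chain_len++] = nb;
	return 0;
}

/* Invoke every registered callback in registration order. */
static void model_chain_call(unsigned long event, void *data)
{
	for (size_t i = 0; i < model_chain_len; i++)
		model_chain[i]->notifier_call(model_chain[i], event, data);
}

/* Counterpart of hung_task_panic(): only records that a panic happened. */
static int model_hung_task_panic(struct model_notifier_block *self,
				 unsigned long event, void *data)
{
	(void)self;
	(void)event;
	(void)data;
	model_did_panic = 1;
	return MODEL_NOTIFY_DONE;
}

static struct model_notifier_block model_panic_block = {
	.notifier_call = model_hung_task_panic,
};
```

Registering model_panic_block and then calling model_chain_call() sets model_did_panic, which is the only effect the detector needs from the panic path.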
If no kernel crash has occurred, detection proceeds and every process (task) in the kernel is examined in turn. The walk runs under the RCU read lock, so to avoid holding the lock for too long when there are many processes, a batch_count is used: at most HUNG_TASK_BATCHING processes are examined before the lock is briefly released. The user can also cap the total number of processes examined with max_count = sysctl_hung_task_check_count, whose default is the maximum PID count PID_MAX_LIMIT (it can be set with the sysctl command).
The function uses the for_each_process_thread() macro to iterate over all processes (tasks) in the kernel. Only tasks in the TASK_UNINTERRUPTIBLE state are checked for a timeout, by calling check_hung_task() with the task_struct and the timeout (120 s):
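The interaction of the two counters is easy to get wrong, so here is a small user-space model of the loop. It is a sketch under stated assumptions: model_scan() and its struct are hypothetical names, the task list is replaced by a plain count, and a lock break is assumed to always succeed (in the kernel a failed break ends the scan early).

```c
/* Result of one modelled detection pass. */
struct model_scan_stats {
	int checked;      /* tasks actually examined */
	int lock_breaks;  /* times the read-side lock was dropped and retaken */
};

/*
 * Walk ntasks tasks, stopping after max_count of them and taking a
 * lock break every batch_size tasks, mirroring the counter handling
 * in the detection loop.
 */
static struct model_scan_stats model_scan(int ntasks, int max_count,
					  int batch_size)
{
	struct model_scan_stats stats = {0, 0};
	int batch_count = batch_size;

	for (int i = 0; i < ntasks; i++) {
		if (!max_count--)
			break;                 /* overall limit reached */
		if (!--batch_count) {
			batch_count = batch_size;
			stats.lock_breaks++;   /* stands in for rcu_lock_break() */
		}
		stats.checked++;
	}
	return stats;
}
```

For example, scanning 10 tasks with a limit of 100 and a batch size of 4 examines all 10 tasks and breaks the lock twice, while a limit of 3 stops after 3 tasks without any lock break.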
static void check_hung_task(struct task_struct *t, unsigned long timeout)
{
	unsigned long switch_count = t->nvcsw + t->nivcsw;

	/*
	 * Ensure the task is not frozen.
	 * Also, skip vfork and any other user process that freezer should skip.
	 */
	if (unlikely(t->flags & (PF_FROZEN | PF_FREEZER_SKIP)))
		return;

	/*
	 * When a freshly created task is scheduled once, changes its state to
	 * TASK_UNINTERRUPTIBLE without having ever been switched out once, it
	 * mustn't be checked.
	 */
	if (unlikely(!switch_count))
		return;

	if (switch_count != t->last_switch_count) {
		t->last_switch_count = switch_count;
		return;
	}

	trace_sched_process_hang(t);

	if (!sysctl_hung_task_warnings)
		return;

	if (sysctl_hung_task_warnings > 0)
		sysctl_hung_task_warnings--;
First, the sum of t->nvcsw and t->nivcsw is the total number of times the process has been switched out since its creation: t->nvcsw counts the times the process gave up the CPU voluntarily, and t->nivcsw counts the times it was preempted. The function then applies two checks: (1) if the process is frozen, detection is skipped; (2) if the switch count is 0, the process is not checked.
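The same two counters can be observed from user space, because /proc/<pid>/status reports them as voluntary_ctxt_switches and nonvoluntary_ctxt_switches. The helper below is a minimal sketch that sums them from the text of such a file; status_field() and total_switches() are hypothetical names, and reading the file into a buffer is left to the caller.

```c
#include <stdio.h>
#include <string.h>

/*
 * Find "key:" at the start of a line in /proc/<pid>/status style text
 * and parse the unsigned number after it.
 * Returns 0 on success, -1 if the key is absent or has no number.
 */
static int status_field(const char *text, const char *key, unsigned long *out)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line && *line) {
		if (strncmp(line, key, klen) == 0 && line[klen] == ':')
			return sscanf(line + klen + 1, "%lu", out) == 1 ? 0 : -1;
		line = strchr(line, '\n');
		if (line)
			line++;   /* step past the newline to the next line */
	}
	return -1;
}

/* Sum of voluntary and involuntary switches, mirroring nvcsw + nivcsw. */
static int total_switches(const char *text, unsigned long *total)
{
	unsigned long vol, invol;

	if (status_field(text, "voluntary_ctxt_switches", &vol) ||
	    status_field(text, "nonvoluntary_ctxt_switches", &invol))
		return -1;
	*total = vol + invol;
	return 0;
}
```

Sampling this sum twice for a task that stays in the D state and seeing the same value both times is the user-space equivalent of the comparison the detector performs.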
Next, the function compares the switch count saved at the previous check with the current one. If they differ, the process was scheduled during the timeout period (120 s), so the saved value is updated and the function returns. Otherwise the process has not been scheduled for the whole timeout period and has been in the D state throughout. The trace_sched_process_hang() call that follows emits a tracepoint event for the hang. The function then tests sysctl_hung_task_warnings, the number of alarms still to be issued, which the user can also configure with the sysctl command and which defaults to 10. So if a process stays in the D state, by default an alarm is issued every 2 minutes, 10 times in total, and then no more. Here is the alarm code:
	/*
	 * Ok, the task did not get scheduled for more than 2 minutes,
	 * complain:
	 */
	pr_err("INFO: task %s:%d blocked for more than %ld seconds.\n",
		t->comm, t->pid, timeout);
	pr_err("      %s %s %.*s\n",
		print_tainted(), init_utsname()->release,
		(int)strcspn(init_utsname()->version, " "),
		init_utsname()->version);
	pr_err("\"echo 0 > /proc/sys/kernel/hung_task_timeout_secs\""
		" disables this message.\n");
	sched_show_task(t);
	debug_show_held_locks(t);

	touch_nmi_watchdog();
The name of the hung task, its PID, the timeout, the kernel tainted information, the kernel release and version, the kernel stack backtrace, and register information are printed to the console and the log. If lock debugging is enabled, the locks held by the task are printed as well, and touch_nmi_watchdog() is called to keep the NMI watchdog from timing out (the NMI watchdog does not need to be considered in my ARM environment).
	if (sysctl_hung_task_panic) {
		trigger_all_cpu_backtrace();
		panic("hung_task: blocked tasks");
	}
}
Finally, if sysctl_hung_task_panic is set, a panic is triggered directly (this value can be set through the kernel configuration or through sysctl).
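The per-task decision walked through above can be condensed into a small user-space model. This is a sketch only: struct model_task, model_check() and the verdict names are hypothetical, the frozen flag stands in for the PF_FROZEN | PF_FREEZER_SKIP test, and the caller is assumed to pass only tasks that are currently in the D state.

```c
#include <stdbool.h>

struct model_task {
	unsigned long nvcsw;             /* voluntary context switches */
	unsigned long nivcsw;            /* involuntary context switches */
	unsigned long last_switch_count; /* total seen at the previous check */
	bool frozen;                     /* stands in for the freezer flags */
};

enum model_verdict {
	MODEL_SKIP,        /* frozen or never switched out: not examined */
	MODEL_PROGRESS,    /* scheduled since the previous check */
	MODEL_HUNG_SILENT, /* hung, but the warning budget is used up */
	MODEL_HUNG_REPORT  /* hung, and a warning would be printed */
};

/* Remaining warnings: >0 counts down, 0 is silent, <0 never runs out. */
static int model_warnings = 10;

static enum model_verdict model_check(struct model_task *t)
{
	unsigned long switch_count = t->nvcsw + t->nivcsw;

	if (t->frozen || switch_count == 0)
		return MODEL_SKIP;
	if (switch_count != t->last_switch_count) {
		/* Remember the new total for the next round. */
		t->last_switch_count = switch_count;
		return MODEL_PROGRESS;
	}
	if (model_warnings == 0)
		return MODEL_HUNG_SILENT;
	if (model_warnings > 0)
		model_warnings--;
	return MODEL_HUNG_REPORT;
}
```

Note that the first check of any task always records its count and reports progress, so a task has to stay unscheduled across two consecutive rounds before it is reported. The warning budget is global rather than per task, which is why the report stops after a fixed number of messages in total.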
II. Example Demonstration
Demo environment: Raspberry Pi B (Linux 4.1.15)
1. First check the kernel configuration options to confirm that the hung task mechanism is enabled
Kernel hacking  --->
Debug Lockups and Hangs  --->
[*] Detect Hung Tasks
(120) Default timeout for hung task detection (in seconds)
2. Write the test program
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/mutex.h>

DEFINE_MUTEX(dlock);

static int __init dlock_init(void)
{
	mutex_lock(&dlock);
	/* Locking the same mutex again never succeeds: deliberate deadlock */
	mutex_lock(&dlock);
	return 0;
}

static void __exit dlock_exit(void)
{
	return;
}

module_init(dlock_init);
module_exit(dlock_exit);
MODULE_LICENSE("GPL");
This sample program defines a mutex and then locks it twice in the module's init function, which artificially creates a deadlock (mutex_lock() calls __mutex_lock_slowpath(), which puts the process into the TASK_UNINTERRUPTIBLE state). The process enters the D state and cannot leave it. This can be seen with the ps command:
# busybox ps
PID USER TIME COMMAND
......
521 root 0:00 insmod dlock.ko
......
Then look at the status of the process; it has entered the D state:
# cat /proc/521/status
Name:	insmod
State:	D (disk sleep)
Tgid:	521
Ngid:	0
Pid:	521
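The same check can be automated. The function below is a minimal sketch that extracts the one-letter state code from the text of a /proc/<pid>/status file; status_state() and is_uninterruptible() are hypothetical helper names, and reading the file into a buffer is left to the caller.

```c
#include <string.h>

/*
 * Return the one-letter state code from /proc/<pid>/status style text,
 * or '\0' if there is no usable "State:" line.
 */
static char status_state(const char *text)
{
	const char *line = text;

	while (line && *line) {
		if (strncmp(line, "State:", 6) == 0) {
			const char *p = line + 6;

			/* Skip the tab or spaces after the colon. */
			while (*p == ' ' || *p == '\t')
				p++;
			return (*p && *p != '\n') ? *p : '\0';
		}
		line = strchr(line, '\n');
		if (line)
			line++;   /* step past the newline to the next line */
	}
	return '\0';
}

/* Non-zero when the task is in uninterruptible sleep (the D state). */
static int is_uninterruptible(const char *text)
{
	return status_state(text) == 'D';
}
```

A monitoring script written around this helper can flag a task before the kernel's own timeout expires, for example by polling it a few times and checking that the state stays at D.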
After waiting two minutes, the debug serial port prints the following information, and the same report is printed again every two minutes:
[360.625466] INFO: task insmod:521 blocked for more than 120 seconds.
[360.631878] Tainted: G O 4.1.15 #5
[360.637042] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[360.644986] [<c05278e8>] (__schedule) from [<c0527d34>] (schedule+0x40/0xa4)
[360.652129] [<c0527d34>] (schedule) from [<c0527ec8>] (schedule_preempt_disabled+0x18/0x1c)
[360.660570] [<c0527ec8>] (schedule_preempt_disabled) from [<c0529200>] (__mutex_lock_slowpath+0x6c/0xe4)
[360.670142] [<c0529200>] (__mutex_lock_slowpath) from [<c05292bc>] (mutex_lock+0x44/0x48)
[360.678432] [<c05292bc>] (mutex_lock) from [<bf026020>] (dlock_init+0x20/0x2c [dlock])
[360.686480] [<bf026020>] (dlock_init [dlock]) from [<c0009558>] (do_one_initcall+0x90/0x1e8)
[360.694976] [<c0009558>] (do_one_initcall) from [<c007ac4c>] (do_init_module+0x6c/0x1c0)
[360.703170] [<c007ac4c>] (do_init_module) from [<c007c568>] (load_module+0x1690/0x1d34)
[360.711284] [<c007c568>] (load_module) from [<c007cce8>] (sys_init_module+0xdc/0x130)
[360.719239] [<c007cce8>] (sys_init_module) from [<c000f800>] (ret_fast_syscall+0x0/0x54)
[480.725351] INFO: task insmod:521 blocked for more than 120 seconds.
[480.731759] Tainted: G O 4.1.15 #5
[480.736917] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[480.744842] [<c05278e8>] (__schedule) from [<c0527d34>] (schedule+0x40/0xa4)
[480.752029] [<c0527d34>] (schedule) from [<c0527ec8>] (schedule_preempt_disabled+0x18/0x1c)
[480.760479] [<c0527ec8>] (schedule_preempt_disabled) from [<c0529200>] (__mutex_lock_slowpath+0x6c/0xe4)
[480.770066] [<c0529200>] (__mutex_lock_slowpath) from [<c05292bc>] (mutex_lock+0x44/0x48)
[480.778363] [<c05292bc>] (mutex_lock) from [<bf026020>] (dlock_init+0x20/0x2c [dlock])
[480.786402] [<bf026020>] (dlock_init [dlock]) from [<c0009558>] (do_one_initcall+0x90/0x1e8)
[480.794897] [<c0009558>] (do_one_initcall) from [<c007ac4c>] (do_init_module+0x6c/0x1c0)
[480.803085] [<c007ac4c>] (do_init_module) from [<c007c568>] (load_module+0x1690/0x1d34)
[480.811188] [<c007c568>] (load_module) from [<c007cce8>] (sys_init_module+0xdc/0x130)
[480.819113] [<c007cce8>] (sys_init_module) from [<c000f800>] (ret_fast_syscall+0x0/0x54)
[600.825353] INFO: task insmod:521 blocked for more than 120 seconds.
[600.831759] Tainted: G O 4.1.15 #5
[600.836916] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[600.844865] [<c05278e8>] (__schedule) from [<c0527d34>] (schedule+0x40/0xa4)
[600.852005] [<c0527d34>] (schedule) from [<c0527ec8>] (schedule_preempt_disabled+0x18/0x1c)
[600.860445] [<c0527ec8>] (schedule_preempt_disabled) from [<c0529200>] (__mutex_lock_slowpath+0x6c/0xe4)
[600.870014] [<c0529200>] (__mutex_lock_slowpath) from [<c05292bc>] (mutex_lock+0x44/0x48)
[600.878303] [<c05292bc>] (mutex_lock) from [<bf026020>] (dlock_init+0x20/0x2c [dlock])
[600.886339] [<bf026020>] (dlock_init [dlock]) from [<c0009558>] (do_one_initcall+0x90/0x1e8)
[600.894835] [<c0009558>] (do_one_initcall) from [<c007ac4c>] (do_init_module+0x6c/0x1c0)
[600.903023] [<c007ac4c>] (do_init_module) from [<c007c568>] (load_module+0x1690/0x1d34)
[600.911133] [<c007c568>] (load_module) from [<c007cce8>] (sys_init_module+0xdc/0x130)
[600.919059] [<c007cce8>] (sys_init_module) from [<c000f800>] (ret_fast_syscall+0x0/0x54)
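When such logs are collected automatically, the header line of each report is the part worth extracting. The following is a sketch of a parser for that line, assuming the format produced by the first pr_err() call ("INFO: task <name>:<pid> blocked for more than <n> seconds."); struct hung_report and parse_hung_line() are hypothetical names introduced for illustration.

```c
#include <stdio.h>
#include <string.h>

struct hung_report {
	char comm[16];   /* task name: at most 15 characters plus NUL */
	int pid;
	long seconds;
};

/*
 * Parse one log line. Returns 0 and fills *r if the line is a hung task
 * report header, -1 for any other line.
 */
static int parse_hung_line(const char *line, struct hung_report *r)
{
	static const char head[] = "INFO: task ";
	static const char mid[] = " blocked for more than ";
	const char *start = strstr(line, head);
	const char *end, *colon;
	size_t len;

	if (!start)
		return -1;
	start += sizeof(head) - 1;
	end = strstr(start, mid);
	if (!end)
		return -1;
	/* A task name may itself contain ':', so search back for the last one. */
	for (colon = end; colon > start && *colon != ':'; colon--)
		;
	if (*colon != ':')
		return -1;
	len = (size_t)(colon - start);
	if (len == 0 || len >= sizeof(r->comm))
		return -1;
	memcpy(r->comm, start, len);
	r->comm[len] = '\0';
	if (sscanf(colon + 1, "%d", &r->pid) != 1)
		return -1;
	if (sscanf(end + sizeof(mid) - 1, "%ld", &r->seconds) != 1)
		return -1;
	return 0;
}
```

Feeding every captured line through parse_hung_line() separates report headers from backtrace lines, so the lines that follow a match can be grouped with the task they belong to.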
III. Summary
D-state deadlocks are fairly common during driver development and are not easy to track down. With the hung task mechanism that the kernel provides, developers only need to capture and keep this output to locate the problem quickly.
Linux kernel debugging techniques: process D-state deadlock detection