How to restart the system after a Linux kernel thread deadlock or an endless loop

Last Update:2018-12-05 Source: Internet

Author: User

When you develop a kernel module or driver, a bug that deadlocks a kernel thread or sends it into an endless loop leaves you with no option but to reboot. Keyboard input is ignored, and the console (the local terminal, not a remote SSH session) just keeps printing messages like "BUG: soft lockup - CPU#0 stuck for 67s! [Fclustertool:2043]". Worse, the stack trace that shows where the system hung is gone once you reboot. All you can do is add debugging output and reboot the machine again and again (I did exactly this, and it seems silly in hindsight).
This is hardly a new problem, so the kernel must provide some mechanism for handling it. But how do you find that mechanism, or know what to search for? The most useful clue is the message itself: "BUG: soft lockup - CPU#0 stuck for 67s! [Fclustertool:2043]". First, the fact that the message is printed at all shows that some code can still run during a deadlock or an endless loop. Second, searching the kernel source for the message text leads to the function that prints it, and the code around that function is what detects a CPU being held for too long. The lesson is that any line the kernel prints may be the key to solving the problem, so read the output carefully and pick out what is useful.
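
Because the message has a fixed layout, its fields can be pulled out of a saved console log mechanically. The following is a minimal sketch, assuming the 2.6.32 message format quoted above; the struct and function names are made up for illustration, and other kernel versions word the message differently.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields carried by a soft lockup message (illustrative struct, not a kernel type). */
struct lockup_report {
    int cpu;               /* CPU that was stuck */
    unsigned long seconds; /* how long it had been stuck */
    char comm[32];         /* name of the task running on that CPU */
    int pid;               /* PID of that task */
};

/*
 * Parse a line such as
 *   "BUG: soft lockup - CPU#0 stuck for 67s! [Fclustertool:2043]"
 * Any prefix before "BUG:" (for example a log timestamp) is skipped.
 * Returns 0 on success, -1 if the line is not a soft lockup message.
 */
int parse_soft_lockup(const char *line, struct lockup_report *out)
{
    assert(line != NULL && out != NULL);

    const char *msg = strstr(line, "BUG: soft lockup");
    if (msg == NULL)
        return -1;

    /* %31[^:] reads the task name up to the colon, bounded by the buffer size. */
    int n = sscanf(msg, "BUG: soft lockup - CPU#%d stuck for %lus! [%31[^:]:%d]",
                   &out->cpu, &out->seconds, out->comm, &out->pid);
    return n == 4 ? 0 : -1;
}
```

The task name and PID point at the thread that was running when the lockup was detected, which is usually the first place to look.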

The kernel I work with most is the official 2.6.32 release. In that version the function I found is softlockup_tick(), which is called from run_local_timers() in the timer interrupt path. The function checks whether the watchdog thread has hung: it compares the current time with touch_timestamp, the per-CPU timestamp that the watchdog thread keeps up to date, and if the gap exceeds the configured threshold, softlockup_thresh, it prints the log message shown above. It then prints the module list, the registers, and the stack, and checks whether softlockup_panic is 1. If it is, the function calls panic(), which halts the kernel and prints the panic information. The code is as follows:

/*
 * This callback runs from the timer interrupt, and checks
 * whether the watchdog thread has hung or not:
 */
void softlockup_tick(void)
{
    int this_cpu = smp_processor_id();
    unsigned long touch_timestamp = per_cpu(touch_timestamp, this_cpu);
    unsigned long print_timestamp;
    struct pt_regs *regs = get_irq_regs();
    unsigned long now;

    ......

    /* Warn about unreasonable delays: */
    if (now <= (touch_timestamp + softlockup_thresh))
        return;

    per_cpu(print_timestamp, this_cpu) = touch_timestamp;

    spin_lock(&print_lock);
    printk(KERN_ERR "BUG: soft lockup - CPU#%d stuck for %lus! [%s:%d]\n",
            this_cpu, now - touch_timestamp,
            current->comm, task_pid_nr(current));
    print_modules();
    print_irqtrace_events(current);
    if (regs)
        show_regs(regs);
    else
        dump_stack();
    spin_unlock(&print_lock);

    if (softlockup_panic)
        panic("softlockup: hung tasks");
}
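
The decision that softlockup_tick() makes can be modeled in user space as a small pure function. This is only a sketch of the comparison and the panic switch shown above; the enum and function names are hypothetical, and the real function also handles details that the listing elides.

```c
#include <assert.h>

/* Possible outcomes of one timer-tick check (illustrative names). */
enum lockup_action {
    LOCKUP_NONE,  /* watchdog ran recently enough, nothing to report */
    LOCKUP_WARN,  /* threshold exceeded: log the message and the stack */
    LOCKUP_PANIC  /* threshold exceeded and panic requested */
};

/*
 * now             - current time in seconds
 * touch_timestamp - last time the watchdog thread updated its timestamp
 * thresh          - allowed gap in seconds (the role of softlockup_thresh)
 * panic_on_lockup - nonzero to panic (the role of softlockup_panic)
 */
enum lockup_action check_soft_lockup(unsigned long now,
                                     unsigned long touch_timestamp,
                                     unsigned long thresh,
                                     int panic_on_lockup)
{
    /* Same comparison as the kernel code: within the threshold is fine. */
    if (now <= touch_timestamp + thresh)
        return LOCKUP_NONE;

    return panic_on_lockup ? LOCKUP_PANIC : LOCKUP_WARN;
}
```

With the panic switch left at 0, a stuck CPU only ever yields LOCKUP_WARN, so the machine keeps running in its hung state.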

However, softlockup_panic defaults to 0, so when a deadlock or an endless loop occurs the kernel only logs the message and keeps running. This is a real pitfall! To make the kernel panic on a soft lockup, you have to set /proc/sys/kernel/softlockup_panic to 1 yourself. If kdump is set up on the machine, the panic produces a kernel core file that is available after the reboot. Finding the problem from the core file is much easier, and you no longer need to restart the machine by hand. On a standard kernel, you can change the timeout threshold through /proc/sys/kernel/softlockup_thresh; on a CentOS kernel, the corresponding file is /proc/sys/kernel/watchdog_thresh. The CentOS kernel also differs from the standard kernel in the function that detects a CPU being held for too long: CentOS uses watchdog_timer_fn().
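
Setting the switch amounts to writing a short string to the /proc file. Below is a minimal sketch of doing that from a C program, assuming the path given above exists on the running kernel and the program runs as root; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Path taken from the text above; it may not exist on every kernel. */
#define SOFTLOCKUP_PANIC_PATH "/proc/sys/kernel/softlockup_panic"

/* Write a short value to a sysctl-style file. Returns 0 on success, -1 on error. */
int write_sysctl(const char *path, const char *value)
{
    assert(path != NULL && value != NULL);

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = fputs(value, f) >= 0;
    /* fclose() flushes the buffer, so a rejected write shows up here. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Read the first integer from a sysctl-style file. Returns 0 on success. */
int read_sysctl_int(const char *path, int *out)
{
    assert(path != NULL && out != NULL);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    int n = fscanf(f, "%d", out);
    fclose(f);
    return n == 1 ? 0 : -1;
}

/* Ask the kernel to panic on a soft lockup instead of only logging it. */
int enable_softlockup_panic(void)
{
    return write_sysctl(SOFTLOCKUP_PANIC_PATH, "1\n");
}
```

A value written this way is lost on reboot, so a test machine would need it reapplied at boot, for example from the system's sysctl configuration.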
A note on the term lockup: there are soft lockups and hard lockups. A soft lockup is a kernel bug that keeps a CPU looping in kernel mode for more than about 10 s (the exact value depends on the implementation and configuration), so that other tasks cannot run. A hard lockup means the kernel is stuck more completely, to the point that the CPU does not even service interrupts; a watchdog mechanism can still detect it and report detailed information. The two concepts are similar. For more information about lockups, refer to this document:
http://www.mjmwired.net/kernel/documentation/lockup-watchdogs.txt

Note that everything above applies only to kernel threads and does nothing for endless loops in user space. To monitor user-space endless loops or resource problems such as memory exhaustion, I strongly recommend a software watchdog. You can write your own monitoring program on top of the soft watchdog, or install the watchdog package and adjust its configuration, which is very convenient. For the specific steps, refer to the following articles, which are well written and very practical:
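
As a rough illustration of the first option, a monitoring program keeps a watchdog device open and writes to it at regular intervals; if the writes stop, the watchdog resets the machine. The sketch below assumes a watchdog driver that exposes the conventional /dev/watchdog device and supports the "magic close" character; the function names are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Open the watchdog device, e.g. "/dev/watchdog". Opening it arms the timer. */
int watchdog_open(const char *dev)
{
    assert(dev != NULL);
    return open(dev, O_WRONLY);
}

/* Pet the watchdog: a write to the device restarts its countdown. */
int watchdog_pet(int fd)
{
    assert(fd >= 0);
    const char beat = '\0';
    return write(fd, &beat, 1) == 1 ? 0 : -1;
}

/*
 * Stop monitoring cleanly: write 'V' right before closing so that a
 * driver with magic close support disarms instead of rebooting.
 * Drivers configured with "nowayout" ignore this and stay armed.
 */
int watchdog_disarm(int fd)
{
    assert(fd >= 0);
    const char magic = 'V';
    if (write(fd, &magic, 1) != 1) {
        close(fd);
        return -1;
    }
    return close(fd) == 0 ? 0 : -1;
}
```

A monitor would call watchdog_open() once, then call watchdog_pet() in a loop with a sleep shorter than the watchdog timeout, and only while its own health checks pass. If the monitor itself hangs or the checks fail, the writes stop and the watchdog reboots the system.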

http://purplegrape.blog.51cto.com/1330104/1131910

http://www.ibm.com/developerworks/cn/linux/l-cn-watchdog/index.html#resources

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
