How to restart the system after a Linux kernel thread deadlock or an endless loop

Source: Internet
Author: User

When you develop a kernel module or driver, a bug that makes a kernel thread deadlock or loop forever leaves you with little choice but to reboot. Keyboard input has no effect, and the local console (not a remote SSH session) just keeps printing messages such as "BUG: soft lockup - CPU#0 stuck for 67s! [Fclustertool:2043]". Worse, the stack trace that would explain the hang is lost once you reboot. All you can do is add more debugging output and reboot again and again (I went through this myself, and it seems silly in hindsight).
Others must have hit this problem before, so the kernel surely provides some mechanism for handling it. But how do you find that mechanism, or even know what to search for? The most useful clue is the message itself: "BUG: soft lockup - CPU#0 stuck for 67s! [Fclustertool:2043]". It tells us two things. First, the fact that it is printed at all means that some code still runs even while a thread is deadlocked or spinning. Second, searching the kernel source for the message text leads to the function that prints it, and the module containing that function is the one that handles a task occupying a CPU for too long. The general lesson is that any message the kernel prints may be the key to solving a problem, so read the output carefully and look for the useful parts.

The kernel I work with most often is the official 2.6.32 release. In this version the function in question is softlockup_tick(), which is called from run_local_timers() in the timer interrupt path. It checks whether the watchdog thread has hung, that is, whether the time since the watchdog timestamp of the current CPU was last updated exceeds the threshold configured for the system, softlockup_thresh. If it does, the function writes the message shown above to the system log, followed by the module list, the registers, and the stack trace. It then checks softlockup_panic: if the value is 1, it calls panic(), which halts the kernel and prints the panic information. The code is as follows:

/*
 * This callback runs from the timer interrupt, and checks
 * whether the watchdog thread has hung or not:
 */
void softlockup_tick(void)
{
    int this_cpu = smp_processor_id();
    unsigned long touch_timestamp = per_cpu(touch_timestamp, this_cpu);
    unsigned long print_timestamp;
    struct pt_regs *regs = get_irq_regs();
    unsigned long now;

    ......

    /* Warn about unreasonable delays: */
    if (now <= (touch_timestamp + softlockup_thresh))
        return;

    per_cpu(print_timestamp, this_cpu) = touch_timestamp;

    spin_lock(&print_lock);
    printk(KERN_ERR "BUG: soft lockup - CPU#%d stuck for %lus! [%s:%d]\n",
            this_cpu, now - touch_timestamp,
            current->comm, task_pid_nr(current));
    print_modules();
    print_irqtrace_events(current);
    if (regs)
        show_regs(regs);
    else
        dump_stack();
    spin_unlock(&print_lock);

    if (softlockup_panic)
        panic("softlockup: hung tasks");
}

However, softlockup_panic defaults to 0, so a deadlock or an endless loop only produces log output and the system does not go down. That is the real trap. To make the kernel panic when a soft lockup occurs, you have to set /proc/sys/kernel/softlockup_panic to 1 by hand. If kdump is set up on the machine, the panic produces a kernel core file (vmcore) that is available after the restart. Finding the problem from the core file is much easier, and you no longer need to restart the machine by hand. On a standard kernel, you can change the timeout threshold through /proc/sys/kernel/softlockup_thresh. On a CentOS kernel, the corresponding file is /proc/sys/kernel/watchdog_thresh. The CentOS kernel differs from the standard kernel in one more way: the function that detects the lockup is watchdog_timer_fn().
Here is a brief introduction to the concept of a lockup. Lockups are divided into soft lockups and hard lockups. A soft lockup is a kernel bug that makes a task loop in kernel mode for more than 10 s (the exact value depends on implementation and configuration) without giving other tasks a chance to run. A hard lockup is more severe: the CPU is stuck in kernel mode without even servicing interrupts, and detailed information can only be obtained through a mechanism such as the watchdog. The two concepts are similar. For more information about lockups, refer to this document:
http://www.mjmwired.net/kernel/documentation/lockup-watchdogs.txt

Note that everything described above applies only to kernel threads and does nothing for endless loops in user space. To monitor user-space endless loops or resource problems such as running out of memory, a software-level watchdog is strongly recommended. You can write your own monitoring program on top of the soft watchdog, or install the watchdog package and adjust its configuration, which is very convenient. For the specific steps, refer to the following articles, which are well written and very practical:

http://purplegrape.blog.51cto.com/1330104/1131910

http://www.ibm.com/developerworks/cn/linux/l-cn-watchdog/index.html#resources
