Use watchdog to build a high-availability Linux system and applications

Zhou Ting (moting9@gmail.com), software engineer, IBM China System Technology Lab

Zhou Ting is a software engineer at the IBM China System Technology Lab, where he currently develops server management firmware.

Brief introduction: Linux is widely used in many fields, from telecommunications to portable terminal devices, and each field places its own demands on the system. Carrier Grade Linux, the carrier-grade Linux standard released by OSDL (Open Source Development Labs), states that for system availability Linux must support the watchdog mechanism. The Linux kernel has provided hardware and software watchdog drivers since version 1.3.51, and as the kernel has developed it has gained support for many types of hardware watchdog cards. This article first introduces the Linux kernel's support for hardware and software watchdogs. It then shows how to add a watchdog to system monitoring through the open-source watchdog daemon project to improve system availability, and how to add a watchdog module to a Linux service application to improve the availability of the application.

1. Watchdog support in Linux

1.1 How watchdog works in Linux

A watchdog can be either a hardware circuit or a software timer, and it automatically restarts the system when the system fails. In Linux, the basic working principle is as follows: once the watchdog is started (that is, once the /dev/watchdog device is opened), if no write operation is performed on /dev/watchdog within the specified interval, the hardware watchdog circuit or the software timer restarts the system.

/dev/watchdog is a character device node with major device number 10 and minor device number 130. The Linux kernel provides drivers for various types of hardware watchdog circuits as well as a purely software, timer-based watchdog driver. The driver source code is located in the drivers/char/watchdog/ directory of the kernel source tree.

1.2 Differences between hardware and software watchdog

  1. A hardware watchdog requires hardware circuit support. The device node /dev/watchdog corresponds to a real physical device, and each type of hardware watchdog device is managed by its own hardware driver. The software watchdog is implemented by the kernel module softdog.ko using the kernel timer mechanism. In that case /dev/watchdog does not correspond to a real physical device, but it offers the same interface as a hardware watchdog.
  2. A hardware watchdog is more reliable than the software watchdog. The software watchdog is built on a kernel timer, so it stops working when the kernel or interrupt handling fails. A hardware watchdog is controlled by its own circuit and is independent of the kernel: whatever state the system is in, it restarts the system if it is not written to within the specified interval.
  3. Some hardware watchdog cards, such as the WDT501P and some Berkshire cards, can also monitor the system temperature and provide a /dev/temperature interface.

For applications, software and hardware watchdogs are operated in basically the same way: open the device /dev/watchdog, then write to /dev/watchdog within each restart interval. In other words, the difference between software and hardware watchdogs is largely transparent to applications.

Only one watchdog driver module can be loaded at a time to manage the /dev/watchdog device node. If the system has no hardware watchdog circuit, you can load the software watchdog driver softdog.ko.
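Because every driver presents the same interface, an application that wants to know which driver is actually behind /dev/watchdog has to ask the driver. The sketch below uses the WDIOC_GETSUPPORT ioctl and struct watchdog_info from <linux/watchdog.h>. This ioctl belongs to the kernel watchdog API and is not discussed in the article, so treat its use here as an assumption about the driver in question; the function names are hypothetical. Note that the file descriptor must come from opening the device, and opening the device starts the watchdog.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>  /* struct watchdog_info, WDIOC_GETSUPPORT */

/* Hypothetical helper: fill *info from an already opened watchdog
 * descriptor. Returns 0 on success, -1 if the ioctl fails. */
int query_watchdog(int fd, struct watchdog_info *info)
{
    assert(info != NULL);
    return ioctl(fd, WDIOC_GETSUPPORT, info) == -1 ? -1 : 0;
}

/* Nonzero if the driver reports support for the magic close character. */
int has_magic_close(const struct watchdog_info *info)
{
    return (info->options & WDIOF_MAGICCLOSE) != 0;
}

/* Nonzero if the driver reports that its timeout can be changed at run time. */
int can_set_timeout(const struct watchdog_info *info)
{
    return (info->options & WDIOF_SETTIMEOUT) != 0;
}

/* Print the driver identity string and the two capabilities above. */
void print_watchdog_info(const struct watchdog_info *info)
{
    printf("identity:     %.32s\n", (const char *)info->identity);
    printf("magic close:  %s\n", has_magic_close(info) ? "yes" : "no");
    printf("set timeout:  %s\n", can_set_timeout(info) ? "yes" : "no");
}
```

The identity string is what distinguishes a hardware driver from the software watchdog; the option bits describe what the driver claims to support, not how the kernel was configured.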

1.3 Watchdog configuration in the Linux kernel

Before using watchdog to develop applications on Linux, make sure that the kernel has been configured correctly to support watchdog. The drivers/char/watchdog/Kconfig file in the kernel source describes the various watchdog configuration options in detail. Pay particular attention to the CONFIG_WATCHDOG_NOWAYOUT option. The module information for the software watchdog in Listing 1 shows that the default value of the nowayout parameter is CONFIG_WATCHDOG_NOWAYOUT. If this option is set to 'y' during kernel configuration, then once the watchdog has been started (that is, once /dev/watchdog has been opened), neither closing the device nor writing the character 'V' can stop it.


Listing 1. Module information for the software watchdog driver softdog

    linux-mach:~ # modinfo softdog
    filename:       /lib/modules/2.6.16.21-0.8-smp/kernel/drivers/char/watchdog/softdog.ko
    author:         Alan Cox
    description:    Software Watchdog Device Driver
    license:        GPL
    alias:          char-major-10-130
    vermagic:       2.6.16.21-0.8-smp SMP 586 REGPARM gcc-4.1
    supported:      yes
    depends:
    srcversion:     EAE9E5688843C073B0EF5BC
    parm:           soft_noboot:Softdog action, set to 1 to ignore reboots, 0 to reboot
                         (default depends on ONLY_TESTING) (int)
    parm:           nowayout:Watchdog cannot be stopped once started
                         (default=CONFIG_WATCHDOG_NOWAYOUT) (int)
    parm:           soft_margin:Watchdog soft_margin in seconds.
                         (0<soft_margin<65536, default=60) (int)

1.4 Setting the watchdog restart interval

Before developing an application, you need to know the restart interval of the watchdog driver. Each hardware and software watchdog driver has a default restart interval, which can also be set when the driver module is loaded.

The module information for the software watchdog in Listing 1 shows that the soft_margin parameter is the restart interval of softdog.ko. The default value is 60 seconds. You can specify a different restart interval when loading softdog.ko, for example 'modprobe softdog soft_margin=661'.

1.5 Starting watchdog

All hardware and software watchdog drivers offer applications the same operations. Opening the /dev/watchdog device starts the watchdog. If no write operation is performed on /dev/watchdog within the specified restart interval, the system restarts.


Listing 2. Code snippet for starting watchdog

    int wdt_fd = -1;

    wdt_fd = open("/dev/watchdog", O_WRONLY);
    if (wdt_fd == -1)
    {
        // fail to open watchdog device
    }

1.6 Stopping watchdog

If the kernel configuration option CONFIG_WATCHDOG_NOWAYOUT is set to 'y', the watchdog cannot be stopped once started by default. If the module's nowayout parameter is 0, writing the character 'V' to /dev/watchdog and then closing the device stops the watchdog. To see the stop logic, refer to the write functions of the hardware and software watchdog drivers under the drivers/char/watchdog directory of the 2.6 kernel source code (reference 2), for example the write function in the software watchdog driver softdog.c.

The close_all function in the watchdog daemon source code, watchdog-5.4.tar.gz (reference 3), provides an example of stopping the watchdog. The following is a simple code segment that stops the watchdog:


Listing 3. Code segment for stopping watchdog

    if (wdt_fd != -1)
    {
        write(wdt_fd, "V", 1);
        close(wdt_fd);
        wdt_fd = -1;
    }

1.7 Keeping watchdog running

The write function of the softdog module mentioned in section 1.6 shows that each write made within the restart interval calls softdog_keepalive, which pushes back the expiry time of the timer.

Therefore, after an application starts the watchdog, it must periodically write to /dev/watchdog within the restart interval to prevent the system from being restarted.

The keep_alive function in the watchdog daemon source code, watchdog-5.4.tar.gz (reference 3), provides an example of keeping the watchdog running. The following is a simple code snippet that keeps the watchdog running:


Listing 4. Code snippet that keeps watchdog running

    if (wdt_fd != -1)
        write(wdt_fd, "a", 1);
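Listings 2 to 4 can be combined into one small life cycle: open the device, write to it at regular intervals, then write 'V' and close it. The following is a minimal sketch of that sequence, not the watchdog daemon's implementation. The function names (wdt_start, wdt_feed, wdt_stop, wdt_run) and the parameters of wdt_run are hypothetical, and the stop step only takes effect when the driver's nowayout parameter is 0.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Open the device node. On a real watchdog this starts the countdown.
 * Returns the file descriptor, or -1 on failure. */
int wdt_start(const char *path)
{
    assert(path != NULL);
    return open(path, O_WRONLY);
}

/* Write one byte to keep the watchdog from firing (Listing 4 uses "a").
 * Returns 0 on success, -1 on failure. */
int wdt_feed(int fd)
{
    return write(fd, "a", 1) == 1 ? 0 : -1;
}

/* Write the 'V' character and close the device (Listing 3).
 * Returns 0 if both steps succeed, -1 otherwise. */
int wdt_stop(int fd)
{
    int rc = write(fd, "V", 1) == 1 ? 0 : -1;

    if (close(fd) == -1)
        rc = -1;
    return rc;
}

/* Hypothetical driver loop: feed the watchdog `beats` times, sleeping
 * `interval_sec` seconds after each write, then stop it. */
int wdt_run(const char *path, int beats, unsigned int interval_sec)
{
    int fd = wdt_start(path);

    if (fd == -1) {
        perror(path);
        return -1;
    }
    for (int i = 0; i < beats; i++) {
        if (wdt_feed(fd) == -1) {
            perror("write");
            /* Closed without 'V', so a real watchdog keeps counting. */
            close(fd);
            return -1;
        }
        sleep(interval_sec);
    }
    return wdt_stop(fd);
}
```

With the softdog default interval of 60 seconds, a call such as wdt_run("/dev/watchdog", 10, 10) would write every 10 seconds, which leaves a wide margin. The interval passed in must always be shorter than the restart interval of the driver, and the calling process normally needs root privileges to open the device.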