Functions of oprocd and hangcheck-timer in Linux

Last Update:2018-12-04 Source: Internet

Author: User

I. hangcheck-timer

From Oracle 9.2.0.2.0 through 11.1 (the latest release at the time of writing), Oracle recommends using an I/O fencing module called hangcheck-timer when building RAC on Linux. The module monitors whether the Linux kernel on a node has hung. If a hang lasts long enough, it is treated as a threat to the stability of the RAC node, and the node is restarted. The module has three parameters: hangcheck_tick, hangcheck_margin, and hangcheck_reboot. If the kernel does not respond within the sum of hangcheck_tick and hangcheck_margin, hangcheck-timer decides whether to restart the system based on hangcheck_reboot: a value of 1 or greater restarts the node, and 0 does not. In kernel 2.6 the default value is 0. In that case the alert "Hangcheck: hangcheck value past margin!" is logged instead, meaning that the margin was exceeded and the node would have been restarted had hangcheck_reboot been 1, but it was not.
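
The decision described in this section can be modelled in a few lines. The following is a minimal Python sketch of that logic, not the kernel module's code. The function name hangcheck_action and the idea of passing the observed delay in seconds are assumptions made for illustration, and the tick and margin values in the usage lines are illustrative rather than defaults.

```python
def hangcheck_action(elapsed_s, hangcheck_tick, hangcheck_margin, hangcheck_reboot=0):
    """Illustrative model of the hangcheck-timer decision (not kernel code).

    elapsed_s is the time, in seconds, that actually passed between two
    timer checks. The three other arguments mirror the module parameters.
    """
    # Within tick + margin the kernel is considered responsive.
    if elapsed_s <= hangcheck_tick + hangcheck_margin:
        return "ok"
    # Past the margin: restart only if hangcheck_reboot is 1 or greater.
    if hangcheck_reboot >= 1:
        return "restart"
    # hangcheck_reboot == 0: only the "value past margin" alert is logged.
    return "log only"


print(hangcheck_action(30, 30, 180))       # ok
print(hangcheck_action(240, 30, 180, 1))   # restart
print(hangcheck_action(240, 30, 180, 0))   # log only
```

The last call corresponds to the "hangcheck value past margin!" case: the threshold was exceeded, but nothing restarts the node.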

II. oprocd

On Linux, Oracle Clusterware 10.2.0.4 and later introduce a new process, the Oracle Clusterware process monitor daemon (oprocd), which monitors the system state and the health of each node in the cluster, as it already does on UNIX systems that do not use third-party cluster software. Let's take a look at what oprocd is.

In 10.2.0.4 on Linux, oprocd runs alongside hangcheck-timer but is independent of the hangcheck-timer module. It is spawned by the init.cssd process and runs as the root user. The oprocd process is locked in memory and monitors the cluster node it runs on, detecting hardware or driver freezes on the machine in order to provide I/O fencing (which is different from the interrupt fencing function provided by SCSI). If a machine freezes for long enough that the cluster evicts it, the node must forcibly restart itself. Otherwise, while the cluster reconfigures the lock resources of the failed node, that node could still issue questionable I/O against the shared data files. To provide this function, oprocd performs a check and then sleeps. If it is not woken up within the expected time, oprocd restarts the local node.

Note: oprocd is not present in a third-party cluster environment. Because no third-party cluster solution has passed certification on the Linux platform, oprocd is always present in 10.2.0.4 on Linux.

When oprocd is started, it takes two parameters:
-t: timeout, in milliseconds. Default value: 1000 (oprocd_default_timeout = 1000)
-m: acceptable latency (margin) before restart, in milliseconds. Default value: 500 (oprocd_default_margin = 500)
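
One way to read these two parameters is that oprocd sleeps for the timeout and tolerates waking up late by at most the margin. The Python sketch below models that reading only. The function name must_reboot and the exact comparison are assumptions for illustration, not oprocd's actual implementation.

```python
OPROCD_DEFAULT_TIMEOUT_MS = 1000  # timeout parameter, in milliseconds
OPROCD_DEFAULT_MARGIN_MS = 500    # margin parameter, in milliseconds


def must_reboot(slept_ms,
                timeout_ms=OPROCD_DEFAULT_TIMEOUT_MS,
                margin_ms=OPROCD_DEFAULT_MARGIN_MS):
    """Return True if a wake-up arrived later than timeout + margin.

    slept_ms is how long the process was actually asleep, in milliseconds.
    This is an assumed model of the check, not oprocd's real code.
    """
    return slept_ms > timeout_ms + margin_ms


print(must_reboot(1200))                    # False: 200 ms late, within the margin
print(must_reboot(1800))                    # True: 800 ms late, past the 500 ms margin
print(must_reboot(1800, margin_ms=10000))   # False: tolerated with a larger margin
```

The third call shows why a larger margin makes the node more tolerant of short freezes.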

Setting diagwait to 13 is recommended: it increases the acceptable delay before restart so that more log information can be written to disk.
By default, the -m value is 500:
$ ps -efl | grep oprocd
0 S root 6444 3080 0 78 0 - 636 - Apr15 ? 00:00:00 /bin/sh /etc/init.d/init.cssd oprocd
4 S root 7255 6444 0 -40 - - 516 - Apr15 ? 00:00:00 /u01/app/crs11g/bin/oprocd run -t 1000 -m 500 -f

After diagwait is set to 13, the -m value is increased. The following output shows that with diagwait set to 13, the -m value is 10000:
$ ps -efl | grep oprocd
0 S root 6444 3080 0 78 0 - 636 - Apr15 ? 00:00:00 /bin/sh /etc/init.d/init.cssd oprocd
4 S root 7255 6444 0 -40 - - 516 - Apr15 ? 00:00:00 /u01/app/crs11g/bin/oprocd run -t 1000 -m 10000 -f
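
To check the effective values without reading the ps output by eye, the timeout and margin can be pulled out of the oprocd command line. This is a small Python sketch. The helper name oprocd_params is hypothetical, and the sample string is a cleaned-up command line of the kind shown in this section, not captured output.

```python
import re


def oprocd_params(cmdline):
    """Extract the -t and -m values (milliseconds) from an oprocd command line.

    Returns a dict containing 'timeout_ms' and/or 'margin_ms' for the
    flags that were found.
    """
    values = {}
    for flag, name in (("t", "timeout_ms"), ("m", "margin_ms")):
        # The flag must start a word and be followed by a number.
        match = re.search(r"(?<!\S)-%s\s+(\d+)" % flag, cmdline)
        if match:
            values[name] = int(match.group(1))
    return values


sample = "/u01/app/crs11g/bin/oprocd run -t 1000 -m 10000 -f"
print(oprocd_params(sample))  # {'timeout_ms': 1000, 'margin_ms': 10000}
```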

III. Relationship between the two

oprocd and hangcheck-timer run at the same time on Linux and provide different detection mechanisms. When either one restarts a node, it leaves a different record in the system log:
A restart caused by oprocd records "SysRq : Resetting".
A restart caused by hangcheck-timer records "Hangcheck: hangcheck is restarting the machine".

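The two log signatures can be used to guess which mechanism restarted a node. The Python sketch below matches on the message text quoted in this section. The function name reboot_cause and the sample log lines are hypothetical. A SysRq reset can also be triggered manually, so a match is a hint rather than proof.

```python
def reboot_cause(log_line):
    """Guess which mechanism restarted the node from one system log line.

    Returns 'hangcheck-timer', 'oprocd', or None when neither signature
    is present. Matching is case-insensitive.
    """
    lowered = log_line.lower()
    if "hangcheck is restarting the machine" in lowered:
        return "hangcheck-timer"
    if "sysrq" in lowered and "resetting" in lowered:
        return "oprocd"
    return None


# Hypothetical log lines, for illustration only.
print(reboot_cause("kernel: SysRq : Resetting"))                               # oprocd
print(reboot_cause("kernel: Hangcheck: hangcheck is restarting the machine."))  # hangcheck-timer
print(reboot_cause("kernel: Hangcheck: hangcheck value past margin!"))         # None
```
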
This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
