Functions of oprocd and hangcheck-timer in Linux
I. hangcheck-timer
From Oracle 9.2.0.2.0 up to the latest release, 11.1, Oracle recommends using an I/O fencing module called hangcheck-timer when building RAC on Linux. This module monitors whether the node's Linux kernel has hung. If the hang lasts for a long time, Oracle considers it a threat to the stability of the RAC node, and the node is restarted. The module has three parameters: hangcheck_tick, hangcheck_margin, and hangcheck_reboot. If the kernel does not respond within the sum of hangcheck_tick and hangcheck_margin, hangcheck-timer decides whether to restart the system based on the value of hangcheck_reboot: a value greater than or equal to 1 means restart, and 0 means do not restart. In kernel 2.6, the default value is 0. In that case the alarm message "Hangcheck: hangcheck value past margin!" appears, indicating that the system would have been restarted had hangcheck_reboot been 1, but it was not restarted.
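The decision described above can be modeled in a few lines of shell. This is only an illustrative sketch, not the kernel module's code: the function name hangcheck_action, its delay argument, and the sample values 30 and 180 are assumptions made for the example.

```shell
#!/bin/sh
# Illustrative model only; the real check lives inside the kernel module.
# hangcheck_action TICK MARGIN REBOOT DELAY
#   TICK, MARGIN: hangcheck_tick and hangcheck_margin, in seconds
#   REBOOT:       hangcheck_reboot (0 = do not restart, >= 1 = restart)
#   DELAY:        hypothetical number of seconds the kernel did not respond
hangcheck_action() {
    tick=$1
    margin=$2
    reboot=$3
    delay=$4
    if [ "$delay" -le $((tick + margin)) ]; then
        echo "ok"          # still within tick + margin
    elif [ "$reboot" -ge 1 ]; then
        echo "restart"     # past the margin and restart is enabled
    else
        echo "log-only"    # past the margin, only the alarm is logged
    fi
}

hangcheck_action 30 180 1 200   # prints "ok" (200 <= 30 + 180)
hangcheck_action 30 180 1 240   # prints "restart"
hangcheck_action 30 180 0 240   # prints "log-only"
```

The third call corresponds to the kernel 2.6 default (hangcheck_reboot = 0), where only the "value past margin" alarm is written.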
II. oprocd
On the Linux platform, Oracle Clusterware 10.2.0.4 and later introduce a new process, the Oracle Clusterware process monitor daemon (oprocd), to monitor the system state and the health of each node in the cluster, as it already does on UNIX systems that do not use third-party cluster software. Let's take a look at what oprocd is.
In 10.2.0.4 on Linux, oprocd runs alongside hangcheck-timer. It is not tied to the hangcheck-timer module; it is spawned by the init.cssd process and runs as the root user. The oprocd process is locked in memory and runs on every node of the cluster, where it monitors its own node to detect hardware or driver freezes on the machine and so provides I/O fencing (which is different from the interrupt fencing function provided by SCSI). If a machine freezes for long enough that the cluster evicts the node, the node must forcibly restart itself so that, once the cluster has reorganized the lock resources of the failed node, the failed node cannot still issue questionable I/O against the shared data files. To provide this function, oprocd performs a check and then goes to sleep. If it is not woken within the expected time, oprocd restarts the local node.
Note: oprocd does not run in a third-party cluster environment. Because no third-party cluster solution has passed certification on the Linux platform, oprocd is always present in 10.2.0.4 on Linux.
When oprocd is started, it takes two parameters:
-t: the timeout, in milliseconds. The default value is 1000 (oprocd_default_timeout = 1000)
-m: the acceptable delay (margin) before a restart, in milliseconds. The default value is 500 (oprocd_default_margin = 500)
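How the two parameters combine can be sketched as follows. The sketch assumes that oprocd restarts the node when it wakes later than the timeout plus the margin; the function name oprocd_verdict and the elapsed-time values are hypothetical, and the real daemon's logic is internal to Oracle Clusterware.

```shell
#!/bin/sh
# Illustrative model only, under the assumption stated above.
# oprocd_verdict TIMEOUT_MS MARGIN_MS ELAPSED_MS
#   TIMEOUT_MS: the timeout value, in milliseconds
#   MARGIN_MS:  the margin value, in milliseconds
#   ELAPSED_MS: hypothetical time between going to sleep and waking up
oprocd_verdict() {
    timeout_ms=$1
    margin_ms=$2
    elapsed_ms=$3
    if [ "$elapsed_ms" -gt $((timeout_ms + margin_ms)) ]; then
        echo "restart"   # woke up too late: the node may have been frozen
    else
        echo "ok"        # woke up within the acceptable delay
    fi
}

oprocd_verdict 1000 500 1200   # prints "ok" (1200 <= 1000 + 500)
oprocd_verdict 1000 500 1800   # prints "restart"
```

With the defaults, a wake-up that is more than half a second late is already enough to trigger a restart, which is why the margin is the value that matters when tuning.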
Setting diagwait to 13 is recommended: it increases the acceptable delay before a restart, so that more log information can be written to disk.
By default, the -m value is 500:
$ ps -efl | grep oprocd
0 S root 6444 3080 0 78 0 - 636 - Apr15 ? 00:00:00 /bin/sh /etc/init.d/init.cssd oprocd
4 S root 7255 6444 0 -40 - - 516 - Apr15 ? 00:00:00 /u01/app/crs11g/bin/oprocd run -t 1000 -m 500 -f
If diagwait is set to 13, the -m value is increased. The following shows that after diagwait is set to 13, the -m parameter value is 10000:
$ ps -efl | grep oprocd
0 S root 6444 3080 0 78 0 - 636 - Apr15 ? 00:00:00 /bin/sh /etc/init.d/init.cssd oprocd
4 S root 7255 6444 0 -40 - - 516 - Apr15 ? 00:00:00 /u01/app/crs11g/bin/oprocd run -t 1000 -m 10000 -f
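To check the values on a running node without reading the process list by eye, the two numbers can be pulled out of the ps output. The following is a minimal sketch: the helper name oprocd_args is hypothetical, and it assumes the command line contains "oprocd run" with space-separated -t and -m flags.

```shell
#!/bin/sh
# oprocd_args: read ps output on stdin and print "timeout=<ms> margin=<ms>"
# for every line whose command contains "oprocd run".
oprocd_args() {
    awk '/oprocd run/ {
        t = ""; m = ""
        # Scan the fields for the flags and take the value that follows each.
        for (i = 1; i < NF; i++) {
            if ($i == "-t") t = $(i + 1)
            if ($i == "-m") m = $(i + 1)
        }
        print "timeout=" t " margin=" m
    }'
}

# Sample line, as it would appear after diagwait has been set to 13.
printf '%s\n' '4 S root 7255 6444 0 -40 - - 516 - Apr15 ? 00:00:00 /u01/app/crs11g/bin/oprocd run -t 1000 -m 10000 -f' | oprocd_args
# prints "timeout=1000 margin=10000"
```

On a live node the same helper could be fed directly, for example with ps -efl | oprocd_args. The init.cssd wrapper line is skipped because its command does not contain "oprocd run".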
III. Relationship between the two
oprocd and hangcheck-timer run simultaneously on the Linux platform and provide different detection mechanisms. When they cause a node restart, the information recorded in the system log differs:
A restart caused by oprocd records "SysRq : Resetting".
A restart caused by hangcheck-timer records "Hangcheck: hangcheck is restarting the machine".
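After an unexpected restart, the two messages make it possible to tell which mechanism was responsible. The following is a minimal sketch: the helper name restart_source is hypothetical, the timestamp and host name in the sample line are invented, and matching is case-insensitive with flexible spacing around the colon because the exact form of the SysRq message may vary.

```shell
#!/bin/sh
# restart_source: read system log lines on stdin and print the name of the
# mechanism whose restart message appears on each matching line.
restart_source() {
    awk '{
        line = tolower($0)
        if (line ~ /sysrq *: *resetting/)
            print "oprocd"
        else if (line ~ /hangcheck: hangcheck is restarting the machine/)
            print "hangcheck-timer"
    }'
}

# Hypothetical log line for illustration.
printf '%s\n' 'Apr 15 10:00:01 node1 kernel: SysRq : Resetting' | restart_source
# prints "oprocd"
```

On a real node the helper would be fed the system log, for example restart_source < /var/log/messages, where the log path is an assumption that depends on the distribution's syslog configuration.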