Key points about OPROCD
OPROCD is the process Oracle introduced in RAC to perform I/O fencing.
On UNIX systems, an OPROCD process runs if no third-party (non-Oracle) cluster software is used.
On Linux, the OPROCD process is available only from version 10.2.0.4 onward.
On Windows there is no OPROCD process; a service named OraFenceService provides the same functionality. Unlike OPROCD, it is Windows-based.
The OPROCD process can run in one of two modes, fatal and non-fatal. In fatal mode, OPROCD automatically restarts the server if the system hangs or if some other condition triggers OPROCD. In non-fatal mode, under the same conditions, OPROCD writes a warning message to its log but does not restart the system.
The OPROCD process has two parameters. timeout specifies the interval at which OPROCD wakes up, and margin specifies the delay tolerated beyond that interval. If the time skew exceeds margin, OPROCD restarts the system or writes an error message to the log.
The log files for the OPROCD process are located in /etc/oracle/oprocd or /var/opt/oracle/oprocd.
The OPROCD process is spawned from the CSSD process and runs as root:
# ps -ef | grep oprocd
root 5109 11227 0 20:37 pts/0 00:00:00 grep oprocd
root 5758 4849 0 19:14 ? 00:00:00 /bin/sh /etc/init.d/init.cssd oprocd
root 6084 5758 0 19:14 ? 00:00:00 /u01/app/crs_home/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:90 -f
If a node hangs for a long time, the other nodes in the cluster evict it. In that case the hung node must be restarted to achieve I/O fencing. OPROCD takes two parameters, timeout and margin. The process wakes up once every timeout interval; if the time elapsed since its previous wake-up exceeds timeout + margin, OPROCD considers the node hung and either restarts it automatically or writes a warning message to the log.
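The wake-up check described above can be sketched in a few lines of Python. This is only an illustrative model of the rule stated in the text, not OPROCD's actual implementation; the function name `check_wakeup`, the `Action` enum, and the millisecond arguments are all hypothetical names chosen for the example.

```python
from enum import Enum


class Action(Enum):
    OK = "ok"                # woke up inside the allowed window
    REBOOT = "reboot"        # fatal mode: restart the node
    LOG_WARNING = "log"      # non-fatal mode: only write a warning


def check_wakeup(prev_wake_ms: int, now_ms: int, timeout_ms: int,
                 margin_ms: int, fatal: bool = True) -> Action:
    """Decide what a watchdog would do at one wake-up (hypothetical helper).

    The node is treated as hung when the gap since the previous wake-up
    exceeds timeout + margin.
    """
    elapsed_ms = now_ms - prev_wake_ms
    if elapsed_ms <= timeout_ms + margin_ms:
        return Action.OK
    return Action.REBOOT if fatal else Action.LOG_WARNING


# Values quoted in this article: timeout=1000 ms, margin=500 ms.
on_time = check_wakeup(0, 1200, 1000, 500)               # inside the 1.5 s window
late_fatal = check_wakeup(0, 1600, 1000, 500)            # window exceeded, fatal mode
late_nonfatal = check_wakeup(0, 1600, 1000, 500, fatal=False)
print(on_time.name, late_fatal.name, late_nonfatal.name)
```

With a 1000 ms timeout and a 500 ms margin, a wake-up that arrives 1600 ms after the previous one falls outside the window, so the model reboots in fatal mode and only logs in non-fatal mode.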
In general, the reasons OPROCD reboots a system fall into four categories:
1: Operating system scheduling issues
2: Hardware or driver problems in the operating system
3: Heavy system load that prevents the scheduler from running the OPROCD process in time
4: Oracle bugs
Bug 5015469: OPROCD may reboot the node whenever the system date is moved backwards. Fixed in 10.2.0.3+.
Bug 4206159: OPROCD is prone to time regression due to the current API used (AIX only). Fixed in 10.1.0.3 + one-off patch for Bug 4206159.
Diagnostic fixes (very necessary in most cases):
Bug 5137401: OPROCD log file is cleared after a reboot. Fixed in 10.2.0.4+.
Bug 5037858: Increase the warning levels if a reboot is approaching. Fixed in 10.2.0.3+.
The default values of the two OPROCD parameters, timeout and margin, are specified in the init.cssd file:
# cat init.cssd | grep ^oprocd_default_
oprocd_default_timeout=1000
oprocd_default_margin=500
oprocd_default_historgram=
Therefore, by default, if the interval between two OPROCD wake-ups exceeds 1.5 s (1000 ms timeout + 500 ms margin), OPROCD restarts the system. This default is often inappropriate, but manually modifying the default values in the init.cssd file requires Oracle Support.
To go beyond the 1.5 s limit, two CSS parameters can be modified instead: reboottime and diagwait. If diagwait > reboottime, then margin = diagwait - reboottime. Before setting diagwait, all clusterware processes on all nodes in the cluster must be stopped; otherwise data corruption can result. The change only needs to be made on one node of the RAC. The recommended value for diagwait is 13.
# ./crsctl get css reboottime
3
# ./crsctl get css diagwait
13
# ./crsctl set css diagwait 13 -force
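As a worked example of the rule margin = diagwait - reboottime, the small Python sketch below plugs in the values from the transcript above. The helper name `margin_from_diagwait` is hypothetical, and the units are an assumption: diagwait and reboottime are taken to be seconds and the resulting margin is converted to milliseconds, which would match the `-m 10000` argument seen in the earlier ps output.

```python
from typing import Optional


def margin_from_diagwait(diagwait_s: int, reboottime_s: int) -> Optional[int]:
    """Return the OPROCD margin in milliseconds, or None if the rule does not apply.

    Rule from the text: if diagwait > reboottime, margin = diagwait - reboottime.
    Assumption: inputs are seconds, the margin is expressed in milliseconds.
    """
    if diagwait_s > reboottime_s:
        return (diagwait_s - reboottime_s) * 1000
    return None


# Values from the transcript: reboottime=3, diagwait=13.
margin_ms = margin_from_diagwait(diagwait_s=13, reboottime_s=3)
print(margin_ms)  # 10000
```

Under these assumptions, setting diagwait to 13 with reboottime at 3 raises the margin from the default 500 ms to 10 s.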
From 11.2.0.1 onward, diagwait no longer needs to be modified because the architecture has changed.
On Windows, diagwait can also be modified, but unlike on Linux, doing so does not change the margin as described above.
The rest of this article covers hangcheck-timer. Hangcheck-timer and OPROCD provide the same function, but there is no inherent link between the two.
Hangcheck-timer Module
Hangcheck-timer Module Requirements for Oracle 9i, 10g, and 11g RAC on Linux
Starting with release 9.2.0.2, Oracle RAC environments required a new I/O fencing model, the hangcheck-timer module. This module replaces the Watchdog module, which provided similar fencing functionality. Hangcheck-timer was subsequently delivered as part of the standard kernel distribution for Linux kernel releases 2.4 and above.
Hangcheck-timer is loaded at boot time and monitors the Linux kernel for long operating system hangs that could affect the reliability of a RAC node. It runs in kernel mode and uses the Time Stamp Counter (TSC) to catch scheduling delays or node hangs. It does this by setting a timer and then checking, when the timer fires, whether it was delayed by more than the allowed margin of error. If the duration exceeds the allowed time of (hangcheck_tick + hangcheck_margin) seconds, the machine is restarted. Hangcheck-timer will not cause reboots due to CPU starvation.
Hangcheck-timer requires three configuration parameters:
hangcheck_tick: defines how often, in seconds, hangcheck-timer checks the node for hangs. The default value is seconds.
hangcheck_margin: defines how much margin is allowed, in seconds, between expected scheduling and real scheduling time. The default value is seconds.
hangcheck_reboot: determines whether hangcheck-timer restarts the node if the kernel fails to respond within the sum of the hangcheck_tick and hangcheck_margin parameter values. If hangcheck_reboot is equal to or greater than 1, the hangcheck-timer module restarts the system. If hangcheck_reboot is set to zero, the module will not reboot the node even if a hang is detected. The default value varies by kernel version: in 2.4 kernels the default is 1; in 2.6 kernels the default is 0.
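How the three parameters combine can be modeled with a short Python sketch. This only mirrors the decision described in the text and is not the kernel module's code; the function name `hangcheck_decision` and the return strings are hypothetical, and the tick and margin values in the usage lines are illustrative placeholders rather than defaults.

```python
def hangcheck_decision(elapsed_s: float, hangcheck_tick: int,
                       hangcheck_margin: int, hangcheck_reboot: int) -> str:
    """Model the hangcheck-timer decision for one timer firing (hypothetical helper).

    elapsed_s is the time that actually passed before the timer fired.
    """
    if elapsed_s <= hangcheck_tick + hangcheck_margin:
        return "ok"              # fired within tick + margin: no hang detected
    if hangcheck_reboot >= 1:
        return "reboot"          # hang detected and reboot is enabled
    return "log-only"            # hang detected, but hangcheck_reboot is 0


# Illustrative values only: tick=30 s, margin=180 s, so the limit is 210 s.
normal = hangcheck_decision(35, 30, 180, 1)
hung_reboot = hangcheck_decision(240, 30, 180, 1)
hung_no_reboot = hangcheck_decision(240, 30, 180, 0)
print(normal, hung_reboot, hung_no_reboot)
```

The third call shows the case that matters operationally: a hang is detected, but because hangcheck_reboot is 0 the node keeps running and only a log entry is produced.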
Hangcheck-timer logs a message to the system messages log when a failure is detected and a node restart is initiated by the module:
When hangcheck-timer reboots the node, it may leave a "Hangcheck: hangcheck is restarting the machine" message in /var/log/messages.
If you see the message "Hangcheck: hangcheck value past margin!" in /var/log/messages, a reboot was required but was not performed because hangcheck_reboot is not set to 1. If this message is seen, you must reload the hangcheck-timer module, as described earlier in this note, with hangcheck_reboot set to 1.
Note: hangcheck-timer is not required starting with Oracle Clusterware 11gR2.