Key points about OPROCD
OPROCD is the process Oracle introduced in RAC to provide I/O fencing.
On UNIX systems, the OPROCD process is present when no third-party (non-Oracle) cluster software is used.
On Linux, the OPROCD process exists only from version 10.2.0.4 onward.
On Windows there is no OPROCD process. Instead, a service named OraFenceService provides the same functionality; its implementation is Windows-specific and differs from OPROCD.
The OPROCD process can run in two modes, fatal and non-fatal. In fatal mode, if the system hangs or OPROCD is otherwise triggered, the OPROCD process restarts the server itself. In non-fatal mode, under the same conditions, the OPROCD process writes a warning message to the log but does not restart the system.
The OPROCD process has two parameters, timeout and margin. timeout specifies the interval at which the OPROCD process wakes up; if a wake-up is late by more than margin, the OPROCD process restarts the system or writes an error message to the log.
The log file for the OPROCD process is located in /etc/oracle/oprocd or /var/opt/oracle/oprocd
The OPROCD process is spawned by the CSSD process and runs as the root user
[... init.d]# ps -ef | grep oprocd
root      5109 11227  0 20:37 pts/0    00:00:00 grep oprocd
root      5758  4849  0 19:14 ?        00:00:00 /bin/sh /etc/init.d/init.cssd oprocd
root      6084  5758  0 19:14 ?        00:00:00 /u01/app/crs_home/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:90 -f
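The effective timeout and margin of a running OPROCD can be read off the -t and -m arguments in the ps output. A minimal Python sketch of that extraction follows; the function name parse_oprocd_args and the key names are hypothetical, and it assumes both values are in milliseconds (consistent with the 1000 + 500 = 1.5 s default discussed in this post). The other flags (-hsi, -f) are deliberately ignored.

```python
import re

def parse_oprocd_args(cmdline: str) -> dict:
    """Extract the -t (timeout) and -m (margin) values from an oprocd.bin command line.

    Assumption: both values are expressed in milliseconds.
    """
    values = {}
    for flag, name in (("-t", "timeout_ms"), ("-m", "margin_ms")):
        # Match the flag as a separate word, followed by an integer value.
        match = re.search(r"(?:^|\s)" + re.escape(flag) + r"\s+(\d+)\b", cmdline)
        if match:
            values[name] = int(match.group(1))
    return values

# Command line taken from the ps output in this post.
cmd = "/u01/app/crs_home/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:90 -f"
print(parse_oprocd_args(cmd))
```

On this command line the sketch reports a timeout of 1000 and a margin of 10000.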
If a node hangs for a long time, the other nodes in the cluster evict it, and the hung node must then be restarted to achieve I/O fencing. OPROCD takes two parameters, timeout and margin. The process wakes up once per timeout interval; if the time between the current wake-up and the previous one exceeds timeout + margin, OPROCD concludes that the node is hung and either restarts the node itself or writes a warning message to the log.
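The wake-up check described above can be sketched in a few lines of Python. This is only an illustration of the timeout + margin rule, not OPROCD's actual implementation; the function name oprocd_check, the return strings, and the millisecond units are assumptions, and the default values of 1000 and 500 are the init.cssd defaults quoted in this post.

```python
def oprocd_check(last_wake_ms: int, now_ms: int,
                 timeout_ms: int = 1000, margin_ms: int = 500,
                 fatal: bool = True) -> str:
    """Decide what a watchdog following the timeout + margin rule would do.

    Hypothetical sketch: returns "ok", "reboot" (fatal mode)
    or "log_warning" (non-fatal mode).
    """
    elapsed = now_ms - last_wake_ms
    # The node is considered healthy while the gap between two wake-ups
    # stays within timeout + margin.
    if elapsed <= timeout_ms + margin_ms:
        return "ok"
    # Past the limit: fatal mode restarts the node, non-fatal mode only logs.
    return "reboot" if fatal else "log_warning"

print(oprocd_check(0, 1200))               # woke up 1.2 s after the last wake-up
print(oprocd_check(0, 1600))               # 1.6 s gap, fatal mode
print(oprocd_check(0, 1600, fatal=False))  # 1.6 s gap, non-fatal mode
```

With the defaults, a 1.2 s gap is tolerated, while a 1.6 s gap triggers a reboot in fatal mode and a log entry in non-fatal mode.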
Typically, reboots triggered by the OPROCD process fall into four categories:
1: Operating system scheduling issues
2: Hardware or driver problems in the operating system
3: Heavy system load, so that the scheduler cannot run the OPROCD process in time
4: Oracle bugs, for example:
Bug 5015469 - OPROCD may reboot the node whenever the system date is moved backwards.
Fixed in 10.2.0.3+
Bug 4206159 - OPROCD is prone to time regression due to the current API used (AIX only)
Fixed in 10.1.0.3 + one-off patch for Bug 4206159.
Diagnostic fixes (VERY necessary in most cases):
Bug 5137401 - OPROCD log file is cleared after a reboot
Fixed in 10.2.0.4+
Bug 5037858 - Increase the warning levels if a reboot is approaching
Fixed in 10.2.0.3+
The default values of the two OPROCD parameters, timeout and margin, are specified in the init.cssd file:
[... init.d]# cat init.cssd | grep ^OPROCD_DEFAULT_
OPROCD_DEFAULT_TIMEOUT=1000
OPROCD_DEFAULT_MARGIN=500
OPROCD_DEFAULT_HISTOGRAM=
Therefore, by default, the OPROCD process restarts the system if the interval between two consecutive wake-ups exceeds 1.5 s. This is often too strict. Changing the default values in the init.cssd file by hand, however, requires the involvement of Oracle Support.
If we need to go beyond the 1.5 s limit, we can do so through two CSS parameters, reboottime and diagwait, which init.cssd takes into account: if diagwait > reboottime, then margin = diagwait - reboottime. Before setting diagwait, the clusterware processes must be stopped on all nodes in the cluster, otherwise data corruption can occur; the change itself only needs to be made on one node of the RAC. The recommended value for diagwait is 13.
[... bin]# ./crsctl get css reboottime
3
[... bin]# ./crsctl get css diagwait
13
[... bin]# ./crsctl set css diagwait 13 -force
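As a worked example of the margin = diagwait - reboottime rule, here is a small Python sketch. It assumes that diagwait and reboottime are in seconds and that the OPROCD margin is in milliseconds, which is consistent with diagwait 13 and reboottime 3 yielding the -m 10000 seen in the ps output in this post. The function name oprocd_margin_ms is hypothetical, and falling back to the default margin when diagwait is not greater than reboottime is an assumption, since the text only covers the diagwait > reboottime case.

```python
def oprocd_margin_ms(diagwait_s: int, reboottime_s: int,
                     default_margin_ms: int = 500) -> int:
    """Compute the OPROCD margin (ms) from diagwait and reboottime (s).

    Assumption: when diagwait <= reboottime the default margin is kept.
    """
    if diagwait_s > reboottime_s:
        # margin = diagwait - reboottime, converted from seconds to milliseconds
        return (diagwait_s - reboottime_s) * 1000
    return default_margin_ms

margin = oprocd_margin_ms(diagwait_s=13, reboottime_s=3)
print(margin)          # margin in milliseconds
print(1000 + margin)   # timeout + margin: the longest tolerated gap, in ms
```

With diagwait at 13 and reboottime at 3, the margin grows from 0.5 s to 10 s, so the tolerated gap between wake-ups rises from 1.5 s to 11 s.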
From 11.2.0.1 onward, diagwait no longer needs to be changed, because the architecture has changed.
On Windows diagwait can also be changed, but unlike on Linux, changing diagwait there does not have the effect described above.
The following covers hangcheck-timer. hangcheck-timer and OPROCD provide the same function, but the two are not related to each other.
Hangcheck-timer Module
Hangcheck-timer Module Requirements for Oracle 9i, 10g, and 11g RAC on Linux
Starting with Release 9.2.0.2, Oracle RAC environments require a new I/O fencing model, the hangcheck-timer module. This module replaces the Watchdog module, which provided similar fencing functionality. Hangcheck-timer was subsequently delivered as part of the standard kernel distribution for Linux kernel releases 2.4 and above.
Hangcheck-timer should be loaded at boot time. It monitors the Linux kernel for long operating system hangs that could affect the reliability of a RAC node. It runs in kernel mode and uses the Time Stamp Counter (TSC) to catch scheduling delays or node hangs. This is done by setting a timer and then, when the timer fires, checking whether it was delayed by more than the allowed margin of error. If the duration exceeds the allowed time of (hangcheck_tick + hangcheck_margin) seconds, the machine is restarted. Hangcheck-timer will not cause reboots due to CPU starvation.
Hangcheck-timer requires three configuration parameters:
hangcheck_tick - defines how often, in seconds, the hangcheck-timer checks the node for hangs. The default value is seconds.
hangcheck_margin - defines how much margin is allowed, in seconds, between the expected scheduling time and the real scheduling time. The default value is seconds.
hangcheck_reboot - determines whether the hangcheck-timer restarts the node if the kernel fails to respond within the sum of the hangcheck_tick and hangcheck_margin parameter values. If the value of hangcheck_reboot is equal to or greater than 1, the hangcheck-timer module restarts the system. If hangcheck_reboot is set to zero, the hangcheck-timer module will not reboot the node, even if a hang is detected. The default value varies by kernel version: in 2.4 kernels the default is 1; in 2.6 kernels the default is 0.
Hangcheck-timer writes a message to the system messages log when a failure is detected and a node restart is initiated by the module:
When hangcheck-timer reboots the node, it may leave the message "Hangcheck: hangcheck is restarting the machine" in /var/log/messages.
If you see the message "Hangcheck: hangcheck value past margin!" in /var/log/messages, a reboot was required but was not performed, because hangcheck_reboot is not set to 1. If this message is seen, you must reload the hangcheck module as described earlier in this note, with the hangcheck_reboot value set to 1.
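The interplay of the three parameters and the two log messages can be summarised in a short Python sketch. It only models the decision described above and is not the kernel module's code; the function name hangcheck_action and the "ok" return value are hypothetical, and the tick and margin values in the usage lines are illustrative, not defaults.

```python
def hangcheck_action(elapsed_s: int, tick_s: int, margin_s: int, reboot: int) -> str:
    """Model the hangcheck-timer decision for one timer firing.

    elapsed_s is the time since the timer was set. Returns "ok" or the
    message that would be expected in /var/log/messages.
    """
    # Within hangcheck_tick + hangcheck_margin the kernel is considered healthy.
    if elapsed_s <= tick_s + margin_s:
        return "ok"
    # Past the limit: hangcheck_reboot >= 1 restarts the node ...
    if reboot >= 1:
        return "Hangcheck: hangcheck is restarting the machine"
    # ... otherwise the hang is only reported.
    return "Hangcheck: hangcheck value past margin!"

# Illustrative values: tick of 30 s and margin of 180 s.
print(hangcheck_action(100, tick_s=30, margin_s=180, reboot=1))
print(hangcheck_action(240, tick_s=30, margin_s=180, reboot=1))
print(hangcheck_action(240, tick_s=30, margin_s=180, reboot=0))
```

With these values, a 100 s gap is tolerated, while a 240 s gap yields the restart message when hangcheck_reboot is 1 and the "past margin" message when it is 0.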
Note: hangcheck-timer is not required starting with Oracle Clusterware 11gR2.
Further discussion of the Oracle OPROCD process