Oracle oprocd Process

Source: Internet
Author: User

Key points about oprocd:

oprocd was introduced by Oracle in RAC to provide I/O fencing.

On UNIX systems, the oprocd process is present when no third-party (non-Oracle) cluster software is used.

On Linux, the oprocd process is available only from 10.2.0.4 onward.

On Windows there is no oprocd process. Instead, a service named oraFenceService provides the same function. It relies on Windows-specific technology and is implemented differently from oprocd.

oprocd can run in one of two modes: fatal and non-fatal. In fatal mode, if the system hangs or oprocd is triggered for some other reason, oprocd automatically reboots the server. In non-fatal mode, when oprocd is triggered by a hang or another cause, it writes a warning to its log but does not reboot the system.

oprocd takes two parameters. timeout specifies the interval at which oprocd wakes up. margin specifies the allowed deviation from that interval. If the deviation exceeds margin, oprocd either reboots the system or records an error message in the log, depending on the mode.

The oprocd log files are located in /etc/oracle/oprocd or /var/opt/oracle/oprocd.

The oprocd process is started by the cssd startup script (init.cssd) and runs as the root user, as the following process listing shows:

[root@node2 init.d]# ps -ef | grep oprocd
root      5109 11227  0 20:37 pts/0    00:00:00 grep oprocd
root      5758  4849  0 19:14 ?        00:00:00 /bin/sh /etc/init.d/init.cssd oprocd
root      6084  5758  0 19:14 ?        00:00:00 /u01/app/crs_home/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:90 -f

If a node hangs for a long time, the other nodes in the cluster evict it. The hung node must then be restarted so that I/O fencing is achieved. This is where the two oprocd parameters, timeout and margin, come in. oprocd is woken up once per timeout interval. If the time between the current wake-up and the previous one exceeds timeout + margin, oprocd concludes that the node has hung, and it either reboots the node automatically or writes a warning to the log.
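
The wake-up check described above can be sketched in a few lines of Python. This is only an illustration of the rule, not oprocd's actual implementation: the function name check_wakeup, the millisecond units, the example values, and the returned strings are all assumptions made for the example.

```python
def check_wakeup(last_wake_ms, now_ms, timeout_ms, margin_ms, fatal=True):
    """Illustrative model of the oprocd wake-up check (not Oracle's code).

    Returns "ok" when the gap between two wake-ups is within
    timeout + margin. Otherwise returns "reboot" in fatal mode
    or "log" in non-fatal mode.
    """
    elapsed = now_ms - last_wake_ms
    if elapsed <= timeout_ms + margin_ms:
        return "ok"
    return "reboot" if fatal else "log"


# Example values (assumed): timeout 1000 ms, margin 500 ms.
print(check_wakeup(0, 1200, 1000, 500))               # gap within timeout + margin
print(check_wakeup(0, 1600, 1000, 500))               # gap too long, fatal mode
print(check_wakeup(0, 1600, 1000, 500, fatal=False))  # gap too long, non-fatal mode
```

The only difference between the two modes in this sketch is the action taken once the threshold is exceeded. The detection rule itself is the same.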

In general, the reasons for oprocd to reboot the system fall into four categories:

1: Operating system scheduling problems

2: Hardware or driver problems in the operating system

3: Heavy system load that prevents the scheduler from dispatching the oprocd process in time

4: Oracle bugs

Bug 5015469 - OPROCD may reboot the node whenever the system date is moved backwards.
Fixed in 10.2.0.3+

Bug 4206159 - Oprocd is prone to time regression due to current API used (AIX only)
Fixed in 10.1.0.3 + one-off patch for Bug 4206159

Diagnostic fixes (very necessary in most cases):

Bug 5137401 - Oprocd logfile is cleared after a reboot
Fixed in 10.2.0.4+

Bug 5037858 - Increase the warning levels if a reboot is approaching
Fixed in 10.2.0.3+


The two oprocd parameters are timeout and margin. Their default values are set in the init.cssd file, as shown below:

[root@node2 init.d]# cat init.cssd | grep ^OPROCD_DEFAULT_
OPROCD_DEFAULT_TIMEOUT=1000
OPROCD_DEFAULT_MARGIN=500
OPROCD_DEFAULT_HISTORGRAM=

Therefore, by default, if the interval between two consecutive oprocd wake-ups exceeds 1.5 s (timeout + margin = 1000 ms + 500 ms), oprocd reboots the system. Do not manually modify the default values in init.cssd without involving Oracle Support.
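
As a quick check of the 1.5 s figure, the following Python sketch parses text in the same form as the grep output and adds the two values. It assumes the values are in milliseconds, which is what the 1.5 s figure implies. The helper name parse_oprocd_defaults and the inline sample text are made up for illustration, and the script does not read a real init.cssd file.

```python
def parse_oprocd_defaults(text):
    """Collect OPROCD_DEFAULT_* assignments from init.cssd-style text."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("OPROCD_DEFAULT_") and "=" in line:
            key, _, value = line.partition("=")
            values[key] = value
    return values


# Sample text copied from the grep output (not read from disk).
sample = """OPROCD_DEFAULT_TIMEOUT=1000
OPROCD_DEFAULT_MARGIN=500
OPROCD_DEFAULT_HISTORGRAM=
"""

defaults = parse_oprocd_defaults(sample)
timeout_ms = int(defaults["OPROCD_DEFAULT_TIMEOUT"])
margin_ms = int(defaults["OPROCD_DEFAULT_MARGIN"])

# Assumption: both values are milliseconds.
threshold_s = (timeout_ms + margin_ms) / 1000
print(threshold_s)  # 1.5
```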

To go beyond the 1.5 s limit, we can change two CSS parameters instead of editing init.cssd by hand: reboottime and diagwait. If diagwait > reboottime, then margin = diagwait - reboottime. Before setting diagwait, you must stop the cluster processes on all nodes in the cluster, because changing it while they are running can cause data corruption. The change itself only needs to be made on one node of the RAC cluster. We recommend setting diagwait to 13.

[root@node2 bin]# ./crsctl get css reboottime
3
[root@node2 bin]# ./crsctl get css diagwait
13
[root@node2 bin]# ./crsctl set css diagwait 13 -force

From 11.2.0.1 onward, diagwait no longer needs to be modified, because the architecture has changed.
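
The rule margin = diagwait - reboottime can be worked through with the values shown by crsctl. The Python sketch below is only an arithmetic illustration: the function name derived_margin_ms is hypothetical, it assumes both CSS values are in seconds, and it returns None when diagwait is not greater than reboottime because the text does not say which margin applies in that case.

```python
def derived_margin_ms(reboottime_s, diagwait_s):
    """Apply margin = diagwait - reboottime and convert to milliseconds.

    Returns None when diagwait <= reboottime, since the rule in the
    text only covers the case diagwait > reboottime.
    """
    if diagwait_s > reboottime_s:
        return (diagwait_s - reboottime_s) * 1000
    return None


# Values from the crsctl output: reboottime 3, diagwait 13.
print(derived_margin_ms(3, 13))  # 10000
```

The result of 10000 ms is consistent with the -m 10000 argument in the oprocd.bin process listing, on the assumption that -m is the margin in milliseconds.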

On Windows, diagwait can also be modified. Unlike on Linux, however, changing it there does not produce the effect described above.

Next, let's look at hangcheck-timer. hangcheck-timer and oprocd provide the same kind of function, but they are independent of each other.

Hangcheck-Timer Module
Hangcheck-Timer Module Requirements for Oracle 9i, 10g, and 11g RAC on Linux
Starting with release 9.2.0.2, Oracle RAC environments required a new I/O fencing model, the hangcheck-timer module. This module was implemented to replace the Watchdog module, which provided similar fencing functionality. hangcheck-timer was subsequently delivered as part of the standard kernel distribution for Linux kernel releases 2.4 and above.
hangcheck-timer should be loaded at boot time. It monitors the Linux kernel for long operating system hangs that could affect the reliability of a RAC node. It runs in kernel mode and uses the Time Stamp Counter (TSC) to catch scheduling delays or node hangs. It does this by setting a timer and then checking, when the timer fires, whether it was delayed by more than the allowed margin of error. If the duration exceeds the allowed time of (hangcheck_tick + hangcheck_margin) seconds, the machine is restarted. hangcheck-timer will not cause reboots to occur due to CPU starvation.
hangcheck-timer takes three configuration parameters:
hangcheck_tick - defines how often, in seconds, hangcheck-timer checks the node for hangs. The default value is 60 seconds.
hangcheck_margin - defines how much margin is allowed, in seconds, between the expected scheduling time and the real scheduling time. The default value is 180 seconds.
hangcheck_reboot - determines whether hangcheck-timer restarts the node if the kernel fails to respond within the sum of the hangcheck_tick and hangcheck_margin values. If hangcheck_reboot is equal to or greater than 1, the hangcheck-timer module restarts the system. If hangcheck_reboot is set to zero, the module does not reboot the node, even if a hang is detected. The default value varies by kernel version: in the 2.4 kernel the default is 1, and in 2.6 kernels the default is 0.
hangcheck-timer writes to the system messages log when a failure is detected and when the module initiates a node restart:
When hangcheck-timer reboots the node, it may leave the message "Hangcheck: hangcheck is restarting the machine" in /var/log/messages.
If you see the message "Hangcheck: hangcheck value past margin!" in /var/log/messages, a reboot was required but was not performed, because hangcheck_reboot was not set to 1. If this message appears, you must reload the hangcheck module with hangcheck_reboot set to 1.
Note: hangcheck-timer is not required starting with Oracle Clusterware 11gR2.
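
To make the interaction of the three parameters concrete, here is a small Python model of the decision described above. It is a sketch of the described behavior, not the kernel module's code: the function name hangcheck_outcome and the delay_s argument are assumptions, and the defaults are the values quoted in the text, with hangcheck_reboot defaulting to 1 as in 2.4 kernels.

```python
def hangcheck_outcome(delay_s, hangcheck_tick=60, hangcheck_margin=180,
                      hangcheck_reboot=1):
    """Return the log message the text associates with a timer delay.

    delay_s is the time, in seconds, between setting the timer and the
    timer actually firing. Returns None when the delay is within
    hangcheck_tick + hangcheck_margin.
    """
    if delay_s <= hangcheck_tick + hangcheck_margin:
        return None  # timer fired in time, nothing to report
    if hangcheck_reboot >= 1:
        return "Hangcheck: hangcheck is restarting the machine"
    return "Hangcheck: hangcheck value past margin!"


print(hangcheck_outcome(100))                      # within 240 s
print(hangcheck_outcome(300))                      # past margin, reboot enabled
print(hangcheck_outcome(300, hangcheck_reboot=0))  # past margin, reboot disabled
```

With the quoted defaults, the threshold is 60 + 180 = 240 seconds, so only a delay longer than that produces either message.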






