Further discussion of the Oracle OPROCD process

Source: Internet
Author: User

Key points about OPROCD:

OPROCD is the process Oracle introduced in RAC to provide I/O fencing.

On UNIX systems, the OPROCD process is present whenever no third-party (non-Oracle) cluster software is used.

On Linux, the OPROCD process exists only from version 10.2.0.4 onward.

On Windows there is no OPROCD process. Instead, a service named OraFenceService provides the same functionality; it is a Windows-specific implementation and works differently from OPROCD.

OPROCD can run in two modes: fatal and non-fatal. In fatal mode, if the system hangs or OPROCD is otherwise triggered, OPROCD itself restarts the server. In non-fatal mode, under the same conditions, OPROCD writes a warning message to its log but does not restart the system.

OPROCD takes two parameters, timeout and margin. timeout specifies the interval at which the OPROCD process wakes up; margin specifies how much delay is tolerated. If a wake-up is late by more than margin, OPROCD restarts the system or writes an error message to the log, depending on the mode.

The log files for the OPROCD process are located in /etc/oracle/oprocd or /var/opt/oracle/oprocd


The OPROCD process is spawned by the CSSD process and runs as the root user

[[email protected] init.d]# ps -ef | grep oprocd
root      5109 11227  0 20:37 pts/0    00:00:00 grep oprocd
root      5758  4849  0 19:14 ?        00:00:00 /bin/sh /etc/init.d/init.cssd oprocd
root      6084  5758  0 19:14 ?        00:00:00 /u01/app/crs_home/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:90 -f

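
The timeout and margin in effect can be read off the oprocd.bin command line in the listing above. The following is a minimal Python sketch, assuming that -t carries the timeout and -m the margin, both in milliseconds. That mapping is inferred from the parameter names discussed in this article and is not stated in the listing itself. The helper name parse_oprocd_args is hypothetical.

```python
import shlex


def parse_oprocd_args(cmdline: str) -> dict:
    """Extract the -t and -m values from an oprocd.bin command line.

    Assumption: -t is the timeout and -m is the margin, in milliseconds.
    """
    tokens = shlex.split(cmdline)
    values = {}
    for flag, name in (("-t", "timeout_ms"), ("-m", "margin_ms")):
        if flag in tokens:
            idx = tokens.index(flag)
            if idx + 1 < len(tokens):
                values[name] = int(tokens[idx + 1])
    return values


cmd = "/u01/app/crs_home/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:90 -f"
print(parse_oprocd_args(cmd))  # {'timeout_ms': 1000, 'margin_ms': 10000}
```
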
If a node hangs for a long time, the other nodes in the cluster evict it. In that case the hung node must be restarted in order to achieve I/O fencing. OPROCD is configured with two parameters, timeout and margin. The process wakes up once every timeout interval; if the time between this wake-up and the previous one exceeds timeout + margin, OPROCD concludes that the node is hung and either restarts the node itself or writes a warning message to the log.
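
The wake-up check described above can be sketched in a few lines of Python. This is an illustration of the timeout + margin rule and the two modes only, not OPROCD's actual implementation. All names (Mode, check_wakeup, run_watchdog) are hypothetical, and the clock and sleep functions are injectable so the logic can be exercised without really sleeping.

```python
import time
from enum import Enum


class Mode(Enum):
    FATAL = "fatal"
    NON_FATAL = "non-fatal"


def check_wakeup(last_wake_ms, now_ms, timeout_ms, margin_ms, mode):
    """Decide what to do on one wake-up, following the timeout + margin rule."""
    elapsed = now_ms - last_wake_ms
    if elapsed <= timeout_ms + margin_ms:
        return "ok"
    # The node is considered hung: reboot in fatal mode, only log otherwise.
    return "reboot" if mode is Mode.FATAL else "log-warning"


def run_watchdog(timeout_ms, margin_ms, mode, iterations,
                 sleep=time.sleep, clock=time.monotonic):
    """Sleep for timeout_ms, then compare the real elapsed time with the limit."""
    actions = []
    last = clock()
    for _ in range(iterations):
        sleep(timeout_ms / 1000.0)
        now = clock()
        actions.append(
            check_wakeup(last * 1000.0, now * 1000.0, timeout_ms, margin_ms, mode)
        )
        last = now
    return actions


# Simulated clock: the third wake-up arrives 2.2 s after the second one.
ticks = iter([0.0, 1.0, 2.0, 4.2])
print(run_watchdog(1000, 500, Mode.NON_FATAL, 3,
                   sleep=lambda s: None, clock=lambda: next(ticks)))
# ['ok', 'ok', 'log-warning']
```

With Mode.FATAL the last entry would be "reboot" instead of "log-warning".
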

Reboots triggered by OPROCD typically fall into four categories:

1: Operating system scheduling issues

2: Hardware or driver problems in the operating system

3: Heavy system load that prevents the scheduler from running the OPROCD process in time

4: Oracle bugs

Bug 5015469 - oprocd may reboot the node whenever the system date is moved backwards.
Fixed in 10.2.0.3+
Fixed in 10.1.0.3 + one off patch for Bug 4206159.
Fixed in 10.2.0.4+
Fixed in 10.2.0.3+

Bug 4206159 - oprocd is prone to time regression due to the current API used (AIX only)

Diagnostic fixes (very necessary in most cases):

Bug 5137401 - oprocd logfile is cleared after a reboot

Bug 5037858 - increase the warning levels if a reboot is approaching


The default values of the two OPROCD parameters, timeout and margin, are specified in the init.cssd file:

[[email protected] init.d]# cat init.cssd | grep ^oprocd_default_
oprocd_default_timeout=1000
oprocd_default_margin=500
oprocd_default_historgram=

Therefore, by default, the OPROCD process restarts the system if the interval between two of its wake-ups exceeds 1.5 s. This is often too strict. Manually changing the default values in the init.cssd file should only be done with the help of Oracle Support.
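
The 1.5 s figure is simply the sum of the two defaults. As a small sketch, the following Python snippet parses KEY=VALUE lines like the ones in the listing and adds timeout and margin. The function name parse_defaults is hypothetical, and the sample text is copied from the listing in this article rather than read from a real init.cssd file.

```python
def parse_defaults(text: str) -> dict:
    """Parse KEY=VALUE lines; empty values are kept as empty strings."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip()
    return result


sample = """\
oprocd_default_timeout=1000
oprocd_default_margin=500
oprocd_default_historgram=
"""

defaults = parse_defaults(sample)
limit_ms = (int(defaults["oprocd_default_timeout"])
            + int(defaults["oprocd_default_margin"]))
print(limit_ms / 1000.0)  # 1.5
```
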

If we need to go beyond the 1.5 s limit, we can do so with crsctl, which can change two CSS parameters: reboottime and diagwait. If diagwait > reboottime, then margin = diagwait - reboottime. Before setting diagwait, all Clusterware processes on all nodes of the cluster must be stopped, otherwise data corruption may result; the change itself only needs to be made from one node of the RAC. The recommended value for diagwait is 13.

[[email protected] bin]# ./crsctl get css reboottime
3
[[email protected] bin]# ./crsctl get css diagwait
13
[[email protected] bin]# ./crsctl set css diagwait 13 -force

After 11.2.0.1 we no longer need to change diagwait, because the architecture has changed.
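
The relationship margin = diagwait - reboottime can be checked with a short calculation. The sketch below assumes that both CSS values are in seconds and that the OPROCD margin is in milliseconds. Falling back to the 500 ms default when diagwait is not greater than reboottime is also an assumption, since the text only covers the diagwait > reboottime case. The function name is hypothetical.

```python
def oprocd_margin_ms(diagwait_s: int, reboottime_s: int,
                     default_margin_ms: int = 500) -> int:
    """Apply margin = diagwait - reboottime when diagwait > reboottime.

    Assumption: in every other case the default margin stays in effect.
    """
    if diagwait_s > reboottime_s:
        return (diagwait_s - reboottime_s) * 1000
    return default_margin_ms


print(oprocd_margin_ms(13, 3))  # 10000
```

With diagwait 13 and reboottime 3 this gives 10000 ms, which matches the -m 10000 argument visible on the oprocd.bin command line earlier in this article.
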

On Windows we can also change diagwait, but unlike on Linux, changing diagwait there does not have the effect described above.

Next, let us look at hangcheck-timer. hangcheck-timer and OPROCD serve the same purpose, but the two are not connected to each other.

Hangcheck-timer Module
Hangcheck-timer Module Requirements for Oracle 9i, 10g, and 11g RAC on Linux
Starting with Release 9.2.0.2, Oracle RAC environments required a new I/O fencing model, the hangcheck-timer module. This module replaces the Watchdog module, which provided similar fencing functionality. Hangcheck-timer was subsequently delivered as part of the standard kernel distribution for Linux kernel releases 2.4 and above.
Hangcheck-timer should be loaded at boot time. It monitors the Linux kernel for long operating system hangs that could affect the reliability of a RAC node. It runs in kernel mode and uses the Time Stamp Counter (TSC) to catch scheduling delays or node hangs. This is done by setting a timer and then checking, when the timer fires, whether it was delayed by more than the allowed margin of error. If the duration exceeds the allowed time of (hangcheck_tick + hangcheck_margin) seconds, the machine is restarted. Hangcheck-timer will not cause reboots to occur due to CPU starvation.
Hangcheck-timer requires three configuration parameters:
hangcheck_tick - defines how often, in seconds, the hangcheck-timer checks the node for hangs. The default value is seconds.
hangcheck_margin - defines how much margin is allowed, in seconds, between the expected scheduling time and the real scheduling time. The default value is seconds.
hangcheck_reboot - determines whether the hangcheck-timer restarts the node if the kernel fails to respond within the sum of the hangcheck_tick and hangcheck_margin parameter values. If the value of hangcheck_reboot is equal to or greater than 1, the hangcheck-timer module restarts the system. If hangcheck_reboot is set to zero, the hangcheck-timer module will not reboot the node, even if a hang is detected. The default value varies by kernel version: in 2.4 kernels the default is 1, in 2.6 kernels the default is 0.
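
How the three parameters combine can be sketched as a small decision function in Python. This only illustrates the rule stated above (compare the elapsed time with hangcheck_tick + hangcheck_margin, then consult hangcheck_reboot); it is not the kernel module's code. The function name and the values 30 and 180 are chosen purely for the example and are not the module defaults.

```python
def hangcheck_decision(elapsed_s: float, hangcheck_tick: int,
                       hangcheck_margin: int, hangcheck_reboot: int) -> str:
    """Return what the rule described in the text implies for one timer firing."""
    if elapsed_s <= hangcheck_tick + hangcheck_margin:
        return "ok"
    # A hang was detected; the reboot flag decides what happens next.
    return "reboot" if hangcheck_reboot >= 1 else "log-only"


# Illustrative values only (not defaults): tick = 30 s, margin = 180 s.
print(hangcheck_decision(200, 30, 180, 1))  # ok
print(hangcheck_decision(215, 30, 180, 1))  # reboot
print(hangcheck_decision(215, 30, 180, 0))  # log-only
```
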
Hangcheck-timer writes a message to the system messages log when a failure is detected and when a node restart is initiated by the module:
When hangcheck-timer reboots the node, it may leave a "Hangcheck: hangcheck is restarting the machine" message in /var/log/messages
If you see the following message in /var/log/messages: "Hangcheck: hangcheck value past margin!", it means a reboot was required but was not performed, because hangcheck_reboot is not set to 1. If this message is seen, you must reload the hangcheck module as described earlier in this note, with the hangcheck_reboot value set to 1.
Note: hangcheck-timer is not required starting with Oracle Clusterware 11gR2
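
The two log messages described above can be told apart with a simple scan of the messages log. The Python sketch below matches on fragments of the quoted messages; the function name is hypothetical, and the sample lines (timestamps, host name) are made up for illustration rather than taken from a real system.

```python
def classify_hangcheck_lines(lines):
    """Return (kind, line) pairs for hangcheck-timer messages found in lines."""
    events = []
    for line in lines:
        text = line.strip()
        lower = text.lower()
        if "hangcheck" not in lower:
            continue
        if "restarting the machine" in lower:
            events.append(("reboot", text))
        elif "value past margin" in lower:
            # A hang was detected but hangcheck_reboot did not allow a reboot.
            events.append(("hang-detected-no-reboot", text))
    return events


# Made-up sample lines in the style of /var/log/messages.
sample_log = [
    "Jan  1 00:00:01 node1 kernel: Hangcheck: hangcheck value past margin!",
    "Jan  1 00:05:00 node1 kernel: some unrelated message",
    "Jan  1 00:10:00 node1 kernel: Hangcheck: hangcheck is restarting the machine.",
]

for kind, text in classify_hangcheck_lines(sample_log):
    print(kind)
# hang-detected-no-reboot
# reboot
```

On a real node the lines would come from /var/log/messages, for example by passing an open file object to the function.
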






