Revisiting the Oracle OPROCD Process

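A summary of the key facts about oprocd:

oprocd is the process Oracle introduced in RAC to perform I/O fencing.

On UNIX systems, oprocd exists only when no third-party (non-Oracle) clusterware is in use.

On Linux, oprocd exists only in version 10.2.0.4 and later.

On Windows there is no oprocd process. Instead, a service named oraFenceService provides the same functionality. It relies on Windows-specific technology and works differently from oprocd.

oprocd can run in one of two modes: fatal and non-fatal. In fatal mode, if the system hangs or some other condition triggers oprocd, oprocd reboots the server automatically. In non-fatal mode, under the same conditions, oprocd only writes a warning to its log and does not reboot the system.

oprocd takes two parameters:
    timeout - the interval at which oprocd is woken up.
    margin - the allowed time deviation. If the deviation exceeds margin, oprocd reboots the system or writes an error message to its log, depending on the mode.

The oprocd log files are located in /etc/oracle/oprocd or /var/opt/oracle/oprocd.


oprocd is spawned from the cssd process and runs as root: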

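[root@<host> init.d]# ps -ef | grep oprocd
root      5109 11227  0 20:37 pts/0    00:00:00 grep oprocd
root      5758  4849  0 19:14 ?        00:00:00 /bin/sh /etc/init.d/init.cssd oprocd
root      6084  5758  0 19:14 ?        00:00:00 /u01/app/crs_home/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:90 -f

If a node hangs for a long time, the other nodes in the cluster evict it. In that situation the hung node must be rebooted so that I/O fencing is actually achieved. oprocd is configured with two parameters, timeout and margin. The process is woken up once every timeout interval. If the time between the current wake-up and the previous one exceeds timeout + margin, oprocd concludes that the Oracle node has hung, and it either reboots the node automatically or writes a warning to its log.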
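The wake-up check described above can be sketched in a few lines of Python. This is only an illustration of the timeout + margin rule, not oprocd's actual implementation: the function name `check_wakeup`, the millisecond units, and the returned strings are all assumptions made for the example.

```python
def check_wakeup(last_wake_ms, now_ms, timeout_ms, margin_ms, fatal=True):
    """Classify one wake-up of a hypothetical oprocd-like watchdog.

    Returns "ok" when the process woke up on time, "reboot" when the gap
    exceeded timeout + margin in fatal mode, and "warn" when it did so in
    non-fatal mode.
    """
    elapsed_ms = now_ms - last_wake_ms
    if elapsed_ms <= timeout_ms + margin_ms:
        return "ok"
    return "reboot" if fatal else "warn"


# Illustrative values: a 1000 ms timeout with a 500 ms margin.
print(check_wakeup(0, 1200, 1000, 500))               # woke up within 1.5 s
print(check_wakeup(0, 1800, 1000, 500))               # late, fatal mode
print(check_wakeup(0, 1800, 1000, 500, fatal=False))  # late, non-fatal mode
```

The sketch compares wall-clock style timestamps only for simplicity; a real watchdog would need a time source that does not jump when the system date is changed.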

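In general, the reasons oprocd reboots a system fall into four categories:

1: Operating system scheduling problems

2: Hardware or driver problems on the operating system

3: Heavy system load that prevents the scheduler from running oprocd in time

4: Oracle bugs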

Bug 5015469 – OPROCD may reboot the node whenever the system date is moved backwards.
Fixed in 10.2.0.3+
Fixed in 10.1.0.3 + One off patch for Bug 4206159.
Fixed in 10.2.0.4+
Fixed in 10.2.0.3+

Bug 4206159 – Oprocd is prone to time regression due to current API used (AIX only)

Diagnostic Fixes (VERY NECESSARY IN MOST CASES):

Bug 5137401 – Oprocd logfile is cleared after a reboot

Bug 5037858 – Increase the warning levels if a reboot is approaching


The default values of oprocd's two parameters, timeout and margin, are specified in the init.cssd file, for example:

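[root@<host> init.d]# cat init.cssd | grep ^OPROCD_DEFAULT
OPROCD_DEFAULT_TIMEOUT=1000
OPROCD_DEFAULT_MARGIN=500
OPROCD_DEFAULT_HISTORGRAM=

So by default, if the interval between two consecutive oprocd wake-ups exceeds 1.5 s (1000 ms timeout plus 500 ms margin), oprocd reboots the system. This is often too aggressive. The defaults in init.cssd should be edited by hand only with the approval of Oracle Support.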
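As a rough illustration of where the 1.5 s figure comes from, the following Python sketch reads the OPROCD_DEFAULT_* assignments from a snippet of text and adds timeout and margin. The helper names are hypothetical, and treating the values as milliseconds is an assumption inferred from the 1.5 s figure quoted above.

```python
def parse_oprocd_defaults(text):
    """Collect OPROCD_DEFAULT_* assignments from shell-style KEY=VALUE lines."""
    defaults = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("OPROCD_DEFAULT_") and "=" in line:
            key, _, value = line.partition("=")
            defaults[key] = value
    return defaults


def reboot_threshold_seconds(defaults):
    """Return timeout + margin, converted from milliseconds (assumed unit)."""
    timeout_ms = int(defaults["OPROCD_DEFAULT_TIMEOUT"])
    margin_ms = int(defaults["OPROCD_DEFAULT_MARGIN"])
    return (timeout_ms + margin_ms) / 1000.0


# Sample text copied from the grep output in this article.
sample = """OPROCD_DEFAULT_TIMEOUT=1000
OPROCD_DEFAULT_MARGIN=500
OPROCD_DEFAULT_HISTORGRAM=
"""
print(reboot_threshold_seconds(parse_oprocd_defaults(sample)))  # 1.5
```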

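To go beyond the 1.5 s limit, we can work through init.cssd, which takes two parameters into account: reboottime and diagwait. If diagwait > reboottime, then margin = diagwait - reboottime. With reboottime 3 and diagwait 13, as in the output that follows, the margin works out to 10 seconds, which is consistent with the -m 10000 in the ps output shown earlier. Before setting diagwait, all processes on all nodes of the cluster must be stopped, otherwise data corruption may result. The change itself only needs to be made on one node of the RAC cluster. The recommended value for diagwait is 13.

[root@<host> bin]# ./crsctl get css reboottime
3
[root@<host> bin]# ./crsctl get css diagwait
13
[root@<host> bin]# ./crsctl set css diagwait 13 -force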
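The relationship between diagwait, reboottime, and margin can be written out as a small Python sketch. The function name is hypothetical. Two things here are assumptions rather than facts from the article: that diagwait and reboottime are in seconds while margin is in milliseconds, and that the default margin is kept when diagwait is not greater than reboottime.

```python
def effective_margin_ms(diagwait_s, reboottime_s, default_margin_ms=500):
    """Margin implied by the rule margin = diagwait - reboottime.

    The rule applies only when diagwait > reboottime. Otherwise this sketch
    falls back to the default margin, which is an assumption made here.
    """
    if diagwait_s > reboottime_s:
        return (diagwait_s - reboottime_s) * 1000
    return default_margin_ms


print(effective_margin_ms(13, 3))  # 10000
print(effective_margin_ms(3, 3))   # 500
```

With diagwait 13 and reboottime 3 the sketch yields 10000, the same value that appears after -m on the oprocd.bin command line in the ps output.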
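From 11.2.0.1 onward, diagwait no longer needs to be changed, because the architecture has changed.

On Windows, diagwait can also be modified, but unlike on Linux, changing it does not produce the effect described above.

Next, some information about hangcheck_timer. hangcheck_timer and oprocd can provide the same functionality, but there is no inherent connection between the two.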

Hangcheck-Timer Module
Hangcheck-Timer Module Requirements for Oracle 9i, 10g, and 11g RAC on Linux
Starting in release 9.2.0.2 and later, Oracle RAC environments required using a new I/O fencing model, named the hangcheck-timer module. This module was implemented to replace the Watchdog module, which provided similar fencing functionality. Hangcheck-timer was subsequently delivered as part of the standard kernel distribution for Linux kernel releases 2.4 and above.
Hangcheck-timer should be loaded at boot time, and monitors the Linux kernel for long operating system hangs that could affect the reliability of a RAC node.  It runs in kernel mode and uses the Time Stamp Counter (TSC) to catch scheduling delays or node hangs.  This is done by setting a timer, then checking when the timer fires as to whether it was delayed by more than the allowed margin of error.  If the duration exceeds the allowed time of (hangcheck_tick + hangcheck_margin seconds), the machine is restarted.  Hangcheck-timer will not cause reboots to occur due to CPU starvation.
 Hangcheck-timer requires three configuration parameters:
    hangcheck_tick - defines how often, in seconds, the hangcheck-timer checks the node for hangs. The default value is 60 seconds.
    hangcheck_margin - defines how much margin is allowed, in seconds, between expected scheduling and real scheduling time. The default value is 180 seconds.
    hangcheck_reboot - determines if the hangcheck-timer restarts the node if the kernel fails to respond within the sum of the hangcheck_tick and hangcheck_margin parameter values. If the value of hangcheck_reboot is equal to or greater than 1, then the hangcheck-timer module restarts the system. If the hangcheck_reboot parameter is set to zero, then the hangcheck-timer module will not reboot the node, even if a hang is detected.   The default value varies by kernel version.  In the 2.4 kernel, the default is 1.  In 2.6 kernels, the default is 0.
Hangcheck-timer will provide message logging to the system messages log when a failure is detected, and a node restart is initiated by the module:
    When Hangcheck-timer reboots it may leave "Hangcheck: hangcheck is restarting the machine" message in /var/log/messages
    If you see the following message in /var/log/messages:  "Hangcheck: hangcheck value past margin!" this means a reboot was required but was not performed, because hangcheck_reboot was not set to 1.  If this message is seen, you must reload the hangcheck module as described earlier in this note, with the hangcheck_reboot value set to 1.
Note: Hangcheck-timer is not required starting with Oracle Clusterware 11gR2.
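The interaction of the three hangcheck-timer parameters can be summarized in a short Python sketch. It models only the decision rule quoted above and is not the kernel module's code. The function name and return strings are hypothetical, and the default of 0 for hangcheck_reboot follows the 2.6 kernel default mentioned in the text.

```python
def hangcheck_action(delay_s, hangcheck_tick=60, hangcheck_margin=180,
                     hangcheck_reboot=0):
    """Decide what a hangcheck-like timer would do for a given delay.

    delay_s is the time, in seconds, that elapsed before the timer fired.
    """
    limit_s = hangcheck_tick + hangcheck_margin
    if delay_s <= limit_s:
        return "ok"
    # Past the margin: reboot only when hangcheck_reboot is 1 or greater.
    if hangcheck_reboot >= 1:
        return "reboot"
    return "log only"


print(hangcheck_action(100))                      # within 60 + 180 seconds
print(hangcheck_action(300))                      # past margin, reboot disabled
print(hangcheck_action(300, hangcheck_reboot=1))  # past margin, reboot enabled
```

The "log only" case corresponds to the "hangcheck value past margin!" message: a hang was detected, but no restart was performed.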






