I. Official description
Refer to MOS (My Oracle Support):
Linux: hangcheck-timer module requirements for Oracle 9i, 10g, and 11gR1 RAC [ID 726833.1]
The hangcheck-timer module is required to run a supported configuration in Oracle Real Application Clusters environments on Linux with Oracle releases 9i, 10g, or 11gR1 RAC. This note identifies and outlines the requirements needed to configure hangcheck-timer in an Oracle Enterprise Linux, Red Hat Linux, or SUSE Linux environment.
-- On Linux, the hangcheck-timer module must be configured for RAC on Oracle 9i, 10g, and 11gR1.
Note: hangcheck-timer is not required starting with Oracle Clusterware 11gR2.
-- The module does not need to be configured for 11gR2 RAC.
Starting with release 9.2.0.2, Oracle RAC environments required a new I/O fencing model, named the hangcheck-timer module. This module was implemented to replace the watchdog module, which provided similar fencing functionality. hangcheck-timer was subsequently delivered as part of the standard kernel distribution for Linux kernel releases 2.4 and above.
-- From version 9.2.0.2 onward, an Oracle RAC environment requires a new I/O fencing module called hangcheck-timer. It replaces the watchdog module and provides similar fencing functions. The hangcheck-timer module ships with the standard Linux kernel, 2.4 and later.
hangcheck-timer should be loaded at boot time. It monitors the Linux kernel for long operating system hangs that could affect the reliability of a RAC node. It runs in kernel mode and uses the time stamp counter (TSC) to catch scheduling delays or node hangs. This is done by setting a timer, then checking when the timer fires whether it was delayed by more than the allowed margin of error. If the duration exceeds the allowed time of (hangcheck_tick + hangcheck_margin) seconds, the machine is restarted. hangcheck-timer will not cause reboots to occur due to CPU starvation.
-- hangcheck-timer should be loaded at system startup. It watches the kernel for long hangs that can affect the stability of a RAC node. It runs at kernel level and uses the time stamp counter (TSC) to detect scheduling latency and node hangs: it sets a timer and, when the timer fires, checks whether the delay exceeds the allowed margin. If the elapsed time exceeds hangcheck_tick + hangcheck_margin seconds, the machine is restarted. CPU starvation alone does not cause hangcheck-timer to reboot the node.
hangcheck-timer requires three configuration parameters:
-- hangcheck-timer has three configuration parameters:
(1) hangcheck_tick - Defines how often, in seconds, the hangcheck-timer checks the node for hangs. The default value is 60 seconds.
-- hangcheck_tick: the interval, in seconds, at which hangcheck-timer checks the node for hangs. The default value is 60 seconds.
(2) hangcheck_margin - Defines how much margin is allowed, in seconds, between expected scheduling and real scheduling time. The default value is 180 seconds.
-- hangcheck_margin: the allowed difference, in seconds, between the expected and the actual scheduling time. The default value is 180 seconds.
(3) hangcheck_reboot - Determines if the hangcheck-timer restarts the node if the kernel fails to respond within the sum of the hangcheck_tick and hangcheck_margin parameter values. If the value of hangcheck_reboot is equal to or greater than 1, then the hangcheck-timer module restarts the system. If the hangcheck_reboot parameter is set to zero, then the hangcheck-timer module will not reboot the node, even if a hang is detected. The default value varies by kernel version. In the 2.4 kernel, the default is 1. In 2.6 kernels, the default is 0.
-- hangcheck_reboot: whether hangcheck-timer restarts the node when the kernel fails to respond within hangcheck_tick + hangcheck_margin seconds. If the value is greater than or equal to 1, the module restarts the system. If it is 0, the module does not restart the system even when a hang is detected. The default is 1 in the Linux 2.4 kernel and 0 in the 2.6 kernel.
When hangcheck_reboot = 1, hangcheck-timer reboots the system if the following condition holds: system hang time > (hangcheck_tick + hangcheck_margin)
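The decision rule described by the three parameters can be sketched as a small shell function. This is only an illustration of the rule as the note states it, not code from the kernel module; the function name hangcheck_action and its output words are made up for this example.

```shell
#!/bin/sh
# hangcheck_action TICK MARGIN REBOOT ELAPSED
# Hypothetical helper: prints what the note says should happen when the
# timer fires ELAPSED seconds after it was set.
hangcheck_action() {
    tick=$1; margin=$2; reboot=$3; elapsed=$4
    limit=$((tick + margin))
    if [ "$elapsed" -le "$limit" ]; then
        echo "ok"          # delay is within tick + margin
    elif [ "$reboot" -ge 1 ]; then
        echo "reboot"      # hang detected and hangcheck_reboot >= 1
    else
        echo "log-only"    # hang detected but hangcheck_reboot = 0
    fi
}

hangcheck_action 30 180 1 211   # prints: reboot
hangcheck_action 30 180 0 211   # prints: log-only
hangcheck_action 1 10 1 11      # prints: ok
```

With hangcheck_reboot left at 0 (the 2.6 default according to the note), a hang past the limit is only logged, which is why the parameter is set explicitly.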
All hangcheck-timer default values should be explicitly overridden when loading the kernel module, based on the Oracle release as follows:
-- The default values of all hangcheck-timer parameters must be explicitly overridden when the kernel module is loaded. The settings depend on the Oracle release:
(1) 9i: Assuming the default setting of "oracm misscount" is set to 220 seconds:
hangcheck_tick=30 hangcheck_margin=180 hangcheck_reboot=1
-- 9i: if "oracm misscount" has its default value of 220 seconds, use hangcheck_tick=30 hangcheck_margin=180 hangcheck_reboot=1
(2) 10g/11gR1: Assuming the default setting of "CSS misscount" is set to either 30 or 60 seconds:
hangcheck_tick=1 hangcheck_margin=10 hangcheck_reboot=1
-- 10g/11gR1: if "CSS misscount" is set to 30 or 60 seconds, use hangcheck_tick=1 hangcheck_margin=10 hangcheck_reboot=1
You must always ensure that the cluster misscount setting is greater than the sum of the settings for hangcheck_tick + hangcheck_margin.
-- Note: the cluster's misscount value must be greater than hangcheck_tick + hangcheck_margin.
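The misscount rule is simple arithmetic, so it is easy to check before loading the module. The sketch below assumes the three values are passed in by hand; check_misscount is a hypothetical helper name, and the script does not query the cluster itself.

```shell
#!/bin/sh
# check_misscount MISSCOUNT TICK MARGIN
# Hypothetical helper: succeeds only if misscount > tick + margin.
check_misscount() {
    misscount=$1; tick=$2; margin=$3
    sum=$((tick + margin))
    if [ "$misscount" -gt "$sum" ]; then
        echo "OK: misscount $misscount > $sum"
    else
        echo "BAD: misscount $misscount <= $sum"
        return 1
    fi
}

check_misscount 220 30 180   # 9i values from the note: 220 > 210
check_misscount 30 1 10      # 10g/11gR1 values from the note: 30 > 11
```

Both recommended combinations pass; a misscount equal to the sum would fail, because the rule requires it to be strictly greater.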
When running Oracle Clusterware on Linux, hangcheck-timer should always be configured on each RAC cluster node, as the functionality of this module is required to provide I/O fencing to ensure no stray writes will occur from an evicted node in a RAC cluster. To verify if the hangcheck-timer module is running on a node, execute as the root or oracle user:
-- With Clusterware on Linux, the hangcheck-timer module must be configured on every node. Run the following command as the root or oracle user to verify whether hangcheck-timer is running:
# /sbin/lsmod | grep hangcheck
hangcheck-timer 2672 0
If the hangcheck-timer module is loaded (running), you will see output similar to the above. When hangcheck-timer is not loaded, no output is generated and the command prompt is returned to the user.
In an Oracle Enterprise Linux, Red Hat 4/5, or SUSE 9/10 environment, the hangcheck-timer module is loaded using the modprobe command:
-- Run the following command to load hangcheck-timer:
# modprobe hangcheck-timer hangcheck_tick=1 hangcheck_margin=10 hangcheck_reboot=1
In order to ensure the module is loaded at boot time, you should also place the same command in the appropriate local command execution file (e.g. /etc/rc.d/rc.local or /etc/init.d/boot.local). In earlier releases, hangcheck-timer was loaded using insmod in place of modprobe. Consult your release-specific documentation to determine which initialization method is required.
-- To ensure that the hangcheck-timer module is loaded at system startup, add the command to /etc/rc.d/rc.local or /etc/init.d/boot.local.
hangcheck-timer logs to the system messages log when a failure is detected and a node restart is initiated by the module:
-- When hangcheck-timer detects a system hang, it writes a message to the system log and restarts the system.
(1) When hangcheck-timer reboots the node, it may leave the message "Hangcheck: hangcheck is restarting the machine" in /var/log/messages.
-- A reboot triggered by hangcheck-timer is recorded in /var/log/messages as "Hangcheck: hangcheck is restarting the machine".
(2) If you see the message "Hangcheck: hangcheck value past margin!" in /var/log/messages, a reboot was required but was not performed, because hangcheck_reboot was not set to 1. If this message is seen, you must reload the hangcheck module as described earlier in this note, with the hangcheck_reboot value set to 1.
-- The message "Hangcheck: hangcheck value past margin!" in /var/log/messages means that the system should have been restarted but was not, because the hangcheck_reboot parameter is not set to 1.
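The two messages can be counted with a short filter over the system log. The sketch below is an assumption-laden illustration: classify_hangcheck_log is a hypothetical name, it reads log lines on standard input so that it can be tried without a real /var/log/messages, and it matches only the message text quoted in the note.

```shell
#!/bin/sh
# classify_hangcheck_log
# Hypothetical helper: reads syslog lines on stdin and counts the two
# hangcheck messages described in the note.
classify_hangcheck_log() {
    awk '
        /hangcheck is restarting the machine/ { restart++ }
        /hangcheck value past margin/         { past++ }
        END { printf "restarts=%d past_margin=%d\n", restart, past }'
}

# Sample input; on a real node use: classify_hangcheck_log < /var/log/messages
printf '%s\n' \
    'Sep 7 19:50:01 rac1 kernel: Hangcheck: hangcheck value past margin!' \
    'Sep 7 19:53:03 rac1 kernel: some unrelated message' |
    classify_hangcheck_log
```

A non-zero past_margin count is the case the note warns about: a hang was detected while hangcheck_reboot was not set to 1.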
Note:
Bug 6125546 can prevent hangcheck-timer from rebooting the node on RHEL4 (fixed in 2.6.9.56 or RHEL 4.6).
II. Description
hangcheck-timer is a kernel-level I/O fencing module provided by Linux. It monitors the running state of the Linux kernel and automatically restarts the system if the kernel hangs for too long. Because it runs in kernel space, it is not affected by system load. It uses the CPU's time stamp counter (TSC) register, whose value increases on every clock cycle, so it measures hardware time and is therefore more accurate.
The two key parameters for configuring this module are hangcheck_tick and hangcheck_margin.
hangcheck_tick defines the interval between checks. The default given in the note above is 60 seconds; 30 seconds is the value recommended for 9i. Because a very busy kernel can postpone a check, the module also lets you define an upper bound on that delay, hangcheck_margin, which defaults to 180 seconds.
The hangcheck-timer module checks the kernel periodically according to hangcheck_tick. As long as the interval between two checks is less than hangcheck_tick + hangcheck_margin, the kernel is considered to be running normally. Otherwise the system is considered to be hung and is restarted automatically.
CRS also has a misscount parameter, which can be viewed with the crsctl get css misscount command.
When the heartbeat between RAC nodes is lost, Clusterware must be certain that the faulty node is really dead before it reconfigures the cluster. Otherwise, if a node only stalls temporarily because of overload, the surviving nodes may start reconfiguration while the stalled node has not been restarted and can still write, which can corrupt the database. Therefore misscount must be greater than hangcheck_tick + hangcheck_margin, so that hangcheck-timer has already reset a hung node by the time the other nodes evict it.
2.1 Installing the hangcheck-timer.ko module
hangcheck-timer is included by default starting with Linux kernel 2.4.9-e.12. Run the following command to check whether it is installed:
[root@rac1 ~]# find /lib/modules -name "hangcheck-timer.ko"
/lib/modules/2.6.18-164.el5/kernel/drivers/char/hangcheck-timer.ko
/lib/modules/2.6.18-164.el5xen/kernel/drivers/char/hangcheck-timer.ko
Output like the above indicates that the module is installed.
2.2 Configure the hangcheck-timer module
Set the hangcheck-timer parameters by adding an options line to /etc/modprobe.conf. The values depend on the database version:
(1) 9i: if "oracm misscount" has its default value of 220 seconds, use hangcheck_tick=30 hangcheck_margin=180 hangcheck_reboot=1
(2) 10g/11gR1: if "CSS misscount" is set to 30 or 60 seconds, use hangcheck_tick=1 hangcheck_margin=10 hangcheck_reboot=1
For example (9i values):
[root@rac1 ~]# vi /etc/modprobe.conf
options hangcheck-timer hangcheck_tick=30 hangcheck_margin=180 hangcheck_reboot=1
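To compare the configured values with misscount, the options line can be parsed and tick and margin added up. The following is a minimal sketch, assuming the options are written as name=value pairs with no spaces around "="; hangcheck_conf_sum is a hypothetical helper that reads modprobe.conf-style text on standard input.

```shell
#!/bin/sh
# hangcheck_conf_sum
# Hypothetical helper: prints hangcheck_tick + hangcheck_margin taken from
# the last "options hangcheck-timer ..." line on stdin; prints nothing if
# no such line (or either value) is present.
hangcheck_conf_sum() {
    awk '
        $1 == "options" && ($2 == "hangcheck-timer" || $2 == "hangcheck_timer") {
            tick = ""; margin = ""
            for (i = 3; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == "hangcheck_tick")   tick = kv[2]
                if (kv[1] == "hangcheck_margin") margin = kv[2]
            }
        }
        END { if (tick != "" && margin != "") print tick + margin }'
}

# Sample input; on a real node use: hangcheck_conf_sum < /etc/modprobe.conf
printf '%s\n' \
    'options hangcheck-timer hangcheck_tick=30 hangcheck_margin=180 hangcheck_reboot=1' |
    hangcheck_conf_sum   # prints: 210
```

The printed sum must stay below the cluster's misscount value (220 in the 9i example).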
2.3 Configure automatic module loading at system startup
Load the module now, then add the same command to /etc/rc.d/rc.local:
[root@rac1 ~]# modprobe hangcheck-timer
[root@rac1 ~]# vi /etc/rc.d/rc.local
modprobe hangcheck-timer
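If the startup file is edited by a script rather than by hand, it helps to append the line only when it is not already there. This is a sketch under that assumption; ensure_line is a hypothetical helper, and the demonstration writes to a temporary file rather than to /etc/rc.d/rc.local.

```shell
#!/bin/sh
# ensure_line FILE LINE
# Hypothetical helper: appends LINE to FILE unless an identical line exists.
ensure_line() {
    file=$1; line=$2
    # -q: quiet, -x: match the whole line, -F: treat LINE as a fixed string
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Demonstration on a temporary file; on a real node FILE would be
# /etc/rc.d/rc.local and the script would run as root.
demo=$(mktemp)
ensure_line "$demo" "modprobe hangcheck-timer"
ensure_line "$demo" "modprobe hangcheck-timer"   # second call changes nothing
cat "$demo"
rm -f "$demo"
```

Running the helper repeatedly leaves a single modprobe line in the file, so the module is not loaded twice at boot.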
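2.4 Check that the module is loaded successfully
[root@rac1 ~]# grep -i hangcheck /var/log/messages | tail -2
Sep 7 19:53:03 rac1 kernel: Hangcheck: starting hangcheck timer 0.9.0 (tick is 180 seconds, margin is 60 seconds).
Sep 7 19:53:03 rac1 kernel: Hangcheck: Using monotonic_clock().
The startup message reports the values actually in effect. In this sample it shows tick 180 and margin 60, not the tick 30 and margin 180 configured in 2.2, so confirm that the reported values match your settings; if they differ, the options were not applied when the module was loaded.
Check that the module is loaded (lsmod shows only that the module is present, not its parameters):
[root@rac2 ~]# /sbin/lsmod | grep hangcheck
hangcheck_timer 7897 0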
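The tick and margin in effect can be pulled out of the startup message and compared with the intended settings. The sketch below assumes the message has the "(tick is N seconds, margin is M seconds)" form shown in the sample log; hangcheck_effective is a hypothetical helper that reads log lines on standard input.

```shell
#!/bin/sh
# hangcheck_effective
# Hypothetical helper: prints "TICK MARGIN" from the most recent
# hangcheck startup message found on stdin.
hangcheck_effective() {
    sed -n 's/.*[Ss]tarting hangcheck timer.*(tick is \([0-9][0-9]*\) seconds, margin is \([0-9][0-9]*\) seconds).*/\1 \2/p' |
        tail -n 1
}

# Sample input; on a real node use: hangcheck_effective < /var/log/messages
printf '%s\n' \
    'Sep 7 19:53:03 rac1 kernel: Hangcheck: starting hangcheck timer 0.9.0 (tick is 180 seconds, margin is 60 seconds).' |
    hangcheck_effective   # prints: 180 60
```

For the sample line the result is "180 60", which differs from the 30 and 180 written to /etc/modprobe.conf in the example, a sign that the module was loaded without those options.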
Generally, the RAC parameters are set as follows:
misscount = 220
hangcheck_tick = 30
hangcheck_margin = 180
-------------------------------------------------------------------------------------------------------
All rights reserved. This article may be reprinted, but the source must be credited with a link; otherwise legal action may be taken.
Skype: tianlesoftware
QQ: tianlesoftware@gmail.com
Email: tianlesoftware@gmail.com
Blog: http://www.tianlesoftware.com
Weibo: http://weibo.com/tianlesoftware
Twitter: http://twitter.com/tianlesoftware
Facebook: http://www.facebook.com/tianlesoftware
LinkedIn: http://cn.linkedin.com/in/tianlesoftware