RAC Special Problems and Practical Experience (V)
Overview: To record the original intent and motivation of this document: it grew out of the earlier Oracle Basic Operations Manual. That manual summarized the foundational Oracle knowledge the author studied during a vacation, organized into a systematic summary that serves both as a review and as a convenient reference, and this document follows in the same vein. Before the Oracle RAC installation and usage tutorials, I first lay out the overall idea and structure of this article. Because readers bring different levels of background knowledge, I start with the preparation and planning that precede the installation and deployment of Oracle RAC. The work began under the guidance of Dr. Tang, and the database cluster configuration and installation took two to three months of exploration; many problems were encountered along the way, and they are also documented here. This article is original work compiled by the author; please cite the original source when reproducing: RAC Special Problems and Practical Experience (V)
Bai Ningsu July 18, 2015 10:28:41
Shared storage
When a LUN (logical unit number) needs to be mapped to multiple nodes to provide a shared storage volume for a cluster, the same storage LUN must be presented with the same LUN ID on every host. For example:
(i) When creating a VMFS volume for multiple ESX nodes
(ii) When creating shared storage in a dual-machine HA cluster
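As a quick sanity check on Linux hosts (a supplementary sketch, not from the original text; the device name is only an example and the scsi_id syntax differs between releases), compare the SCSI identifier each node reports for the candidate device; on a genuinely shared LUN the value should be identical on every node:
On RHEL 4/5-era systems:
# scsi_id -g -u -s /block/sdb
On later releases:
# /lib/udev/scsi_id --whitelisted --device=/dev/sdb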
Time consistency
In a cluster the nodes work together, so the clocks of all hosts must be consistent. The time difference between nodes must not exceed the timeout; in general, if it exceeds about 30 seconds a node is likely to be restarted, so the clocks of all nodes must be synchronized. For example, you can configure an NTP server to synchronize time for every node of the RAC. Synchronizing the nodes only against each other also guarantees consistency between the nodes, but it does not guarantee that the time of the RAC database itself is accurate.
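As an illustration (a minimal sketch for a RHEL-style Linux node; the NTP server address is a placeholder and must be replaced with your own time source), Oracle's 11gR2 installation documentation recommends running ntpd in slewing mode so that corrections are applied gradually rather than as sudden jumps:
In /etc/ntp.conf on every RAC node:
server 192.168.1.10 prefer
In /etc/sysconfig/ntpd, add the -x option to enable slewing:
OPTIONS="-x -u ntp:ntp -p /var/run/ntpd.pid"
Then restart the service:
# service ntpd restart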
Interconnect (private network, heartbeat)
The cluster relies on the internal interconnect for data communication and for the heartbeat. The choice of network therefore matters a great deal, whether ordinary Ethernet, a high-speed network such as InfiniBand (IB), or even a serial line used to carry the heartbeat. In addition, the network parameters that are used have a major effect on the overall performance and robustness of the cluster.
Case:
XX City, 4-node Oracle 10g RAC
The operating system was RHEL 4. Following the default installation documentation, the network parameters were set to the following values:
net.core.rmem_default = 262144
net.core.rmem_max = 262144
With these settings, a query statement took 11 minutes to execute. After the parameters were changed to:
net.core.rmem_default = 1048576
net.core.rmem_max = 1048576
the same statement took only 16.2 seconds to execute.
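For reference (a generic Linux sketch, not the commands recorded in the original case), such kernel parameters can be changed at runtime with sysctl and made persistent in /etc/sysctl.conf:
# apply immediately
sysctl -w net.core.rmem_default=1048576
sysctl -w net.core.rmem_max=1048576
# add the same settings to /etc/sysctl.conf, then reload so they survive a reboot
sysctl -p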
Consistency of firmware, drivers, and upgrade packages
Case:
XX City, HPC cluster, running LS-DYNA (a general-purpose explicit nonlinear finite element analysis program).
Environment of the cluster storage system: the storage system's 3 I/O nodes connect to shared storage through an FC SAN switch.
- FC HBA cards on the nodes: QLogic QLE2460;
- Fibre Channel switch: Brocade 200E;
- Disk array: Dawning DS8313FF
Failure symptoms
After the cluster arrived on site, it was found that when the disk array and the hosts were both connected to the 200E switch, they could not see each other. Testing showed that a problem with the switch's IOS version prevented it from recognizing the fibre port of the disk array. The switch vendor was contacted and the IOS was updated twice; the port of the disk array could then be recognized, but the hosts still could not find the disk array. Further testing found that the firmware versions of the HBA cards in the three I/O nodes were inconsistent: the I/O1 node, which had first been connected directly to the fibre switch and the disk array, had firmware v4.03.02, while the other two I/O nodes had firmware v4.06.03. After the latter two nodes were tested, the disk array, the hosts and the switch could all communicate with each other normally, and no abnormality had been found as of that night. Judging from the situation, the failure was caused by an incompatibility between the QLE2460 HBA card with firmware v4.03.01 and the 200E switch running IOS V5.3.1. Whether the connection between the newer HBA firmware v4.06.03 and 200E IOS V5.3.1 is stable still needs further testing.
Diagnosis and resolution
After the firmware of the HBA card in the I/O1 node was upgraded to v4.06.03, the failure to find the disk array through the 200E switch was resolved. The root cause was simply the inconsistent firmware versions of the FC HBA cards.
Shared files: OCR and voting disk
Oracle Cluster Registry (OCR): records the cluster configuration information, including node membership and the configuration of CRS resources such as the database, ASM, instances, listeners, and VIPs; it can be stored on a raw device or on a cluster file system. Voting disk: the quorum disk; it stores node membership information, including which nodes are members of the cluster and node additions and deletions. The number of voting disks must be configured as an odd number, and a node must be able to reach more than half of the voting disks at the same time to survive.
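For reference (a supplementary example using standard Clusterware utilities, not part of the original text), the location and health of these shared files can be checked from any node:
- # crsctl query css votedisk (lists the configured voting disks)
- # ocrcheck (shows the OCR location, space usage and integrity)
- # ocrconfig -showbackup (lists the automatic OCR backups)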
Installation
In Oracle RAC it is not recommended to install the software on a shared file system, including CRS_HOME and ORACLE_HOME, and especially the CRS software, which should be placed on a local file system. This makes it possible to apply patches and patchsets in a rolling-upgrade fashion when upgrading the software, which reduces planned downtime; installing the software on a shared file system also introduces a single point of failure. If ASM storage is used, install a separate Oracle software home for ASM (an independent ORACLE_HOME), which makes management and maintenance easier; for example, when a bug in ASM needs to be patched, the RDBMS files and software are not affected.
Split brain
In a shared-storage cluster, if the heartbeat between nodes is lost but the nodes continue to operate concurrently on the shared storage, the result is catastrophic. Oracle RAC solves this problem with a voting algorithm. The idea is that each node has one vote. Consider a cluster with three nodes A, B and C: when node A can no longer communicate with B and C for whatever reason, the cluster splits into two domains. Node A forms one domain with one vote; B and C form another domain with two votes. In that case B and C have control of the cluster, and node A is kicked out, which is enforced through I/O fencing. For a two-node cluster a quorum disk is introduced: when the two nodes cannot communicate, the node that first acquires the quorum disk gains control of the cluster. Split brain can be caused by network problems (a broken interconnect), inconsistent time, a misscount timeout, and so on. When it occurs, the priority is to protect the rest of the cluster from the problematic node. Oracle uses server fencing, that is, the problematic node is restarted in an attempt to repair the problem. Of course, many problems cannot be repaired automatically, for example inconsistent time with no NTP configured, or a broken network cable; these require manual intervention, and the symptom in the meantime is that the problematic node restarts repeatedly.
Cluster software
Starting with Oracle 10g, Oracle provides its own cluster software, Oracle Clusterware (CRS), which is a prerequisite for installing Oracle RAC; the third-party clusterware mentioned above becomes optional. Another new feature, ASM, can be used to manage the shared disk devices under RAC and provides striping and mirroring of data files to improve performance and security (SAME: Stripe And Mirror Everything), so building a RAC system no longer depends on third-party storage software. In particular, Oracle 11gR2 no longer supports raw devices, and Oracle is fully promoting ASM and completely abandoning support for third-party cluster components.
The heartbeat of Oracle Clusterware
Oracle Clusterware uses two heartbeat devices to verify the state of the members and ensure the integrity of the cluster.
- One is the voting disk heartbeat: the OCSSD process writes a heartbeat message to the voting disk every second.
- The other is the heartbeat over the private Ethernet between the nodes.
Both heartbeat mechanisms have a corresponding timeout, called misscount and disktimeout respectively:
- misscount defines the timeout, in seconds, for heartbeat communication between the nodes over the private network;
- disktimeout, default 200 seconds, defines the timeout for I/O between the CSS process and the voting disk.
reboottime: after split brain occurs and a node is evicted, that node is restarted within reboottime; the default is 3 seconds. Use the following commands to view the actual values of these parameters:
- # crsctl get css misscount
- # grep misscount $CRS_HOME/log/<hostname>/cssd/ocssd.log
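The other two CSS parameters can be queried in the same way (a supplementary example using standard crsctl options):
- # crsctl get css disktimeout
- # crsctl get css reboottime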
In the following two scenarios, CSS evicts a node in order to keep the data intact:
(i) Private network I/O time > misscount: split brain occurs and the cluster divides into multiple subclusters, and a vote decides which one survives. The eviction principle is: if the subclusters contain different numbers of nodes, the subcluster with more nodes survives; if they contain the same number of nodes, the subcluster containing the lower node number survives.
(ii) Voting disk I/O time > disktimeout: the eviction principle is that a node that loses the connection to more than half of the voting disks is restarted within reboottime. For example, with 5 voting disks, a node whose connections to 3 or more voting disks time out because of network or storage problems is restarted. The loss of one or two voting disks does not affect the operation of the cluster.
How to view the configuration of an existing system
For an existing system, the following methods can be used to confirm the heartbeat configuration of the DB instance, including the NIC name, IP address, and network protocol used.
- The simplest way is to read it from the database alert log. It can also be obtained with oradebug:
SQL> oradebug setmypid
Statement processed.
SQL> oradebug ipc
Information written to trace file.
SQL> oradebug tracefile_name
/oracle/admin/orcl/udump/orcl2_ora_24208.trc
In the trace file, find the corresponding line, for example: socket no 7 IP 10.0.0.55 UDP 16878
- Get it from the data dictionary:
SQL> select * from v$cluster_interconnects;
SQL> select * from x$ksxpia;
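In addition (a supplementary example using the standard Clusterware oifcfg utility), the interfaces that Clusterware has registered as public or cluster_interconnect can be listed directly:
# $CRS_HOME/bin/oifcfg getif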
Heartbeat Tuning and Settings
To prevent the heartbeat network from becoming a single point of failure of the system, we can simply bind NICs at the operating system level to carry the Oracle heartbeat network. For example, on AIX we can use EtherChannel technology: assuming the system has four NICs ent0/1/2/3, we bind ent2 and ent3 as the heartbeat interface. The corresponding technologies on HP-UX and Linux are called APA and bonding respectively; a Linux bonding sketch is given below.
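A minimal Linux bonding sketch (an illustrative example, assuming a RHEL-style system where eth2 and eth3 are reserved for the private network; the interface names and IP address are placeholders, not values from the original text):
In /etc/modprobe.conf (or a file under /etc/modprobe.d/):
alias bond0 bonding
options bond0 mode=1 miimon=100
(mode=1 selects active-backup; miimon is the link monitoring interval in milliseconds)
In /etc/sysconfig/network-scripts/ifcfg-bond0:
DEVICE=bond0
IPADDR=10.0.0.55
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
In /etc/sysconfig/network-scripts/ifcfg-eth2 (and similarly for ifcfg-eth3):
DEVICE=eth2
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
Restart the network service afterwards for the bond to take effect.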
Tuning the UDP private network: when UDP is used as the communication protocol for cache fusion between DB instances, the relevant operating system parameters need to be adjusted to improve UDP transmission efficiency and to avoid errors caused by exceeding OS limits on larger transfers:
(i) UDP packet send buffer: usually set to more than (db_block_size * db_file_multiblock_read_count) + 4 KB (see the worked example after this list);
(ii) UDP packet receive buffer: usually set to 10 times the send buffer;
(iii) UDP buffer maximum: set as large as possible (usually greater than 2 MB); it must be greater than the two values above.
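As a worked example with hypothetical values (not taken from the original text), assume db_block_size = 8192 and db_file_multiblock_read_count = 16:
send buffer: must exceed 8192 * 16 + 4096 = 135168 bytes, so 262144 (256 KB) is a reasonable choice
receive buffer: about 10 * 262144 = 2621440 bytes
UDP buffer maximum: at least 2621440 bytes (and greater than 2 MB)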
The commands to view and modify these parameters on each platform are as follows:
Solaris
View:   ndd /dev/udp udp_xmit_hiwat udp_recv_hiwat udp_max_buf
Modify: ndd -set /dev/udp udp_xmit_hiwat 262144
        ndd -set /dev/udp udp_recv_hiwat 262144
        ndd -set /dev/udp udp_max_buf 2621440
AIX
View:   no -a | egrep "udp_|tcp_|sb_max"
Modify: no -p -o udp_sendspace=262144
        no -p -o udp_recvspace=1310720
        no -p -o tcp_sendspace=262144
        no -p -o tcp_recvspace=262144
        no -p -o sb_max=2621440
Linux
View:   the file /etc/sysctl.conf
Modify: sysctl -w net.core.rmem_max=2621440
        sysctl -w net.core.wmem_max=2621440
        sysctl -w net.core.rmem_default=262144
        sysctl -w net.core.wmem_default=262144
HP-UX
No changes required.
HP Tru64
View:   /sbin/sysconfig -q udp
Modify: edit the file /etc/sysconfigtab
        inet: udp_recvspace = 65536
              udp_sendspace = 65536
Windows
No changes required.
Article Navigation
- Introduction to Cluster Concepts (i)
- Oracle Cluster Concepts and Principles (ii)
- How RAC Works and Related Components (iii)
- Cache Fusion Technology (iv)
- RAC Special Problems and Practical Experience (v)
- Oracle 11g Release 2 RAC Preparation and Use with NFS on Linux (vi)
- Database 11g RAC Cluster Installation on Oracle Enterprise Linux 5.7 (vii)
- Database 11g RAC Database Installation on Oracle Enterprise Linux 5.7 (viii)
- Basic Testing and Use of Database 11g RAC on Oracle Enterprise Linux 5.7 (ix)
Note: This article is original work compiled by the author; please cite the original source when reproducing. (The following articles cover the installation preparation, cluster installation, database installation, and testing of a real Oracle RAC environment, which are the key content of this series.)
"Oracle Cluster" Oracle DATABASE 11G RAC detailed tutorial on RAC special issues and practical experience (V)