[Repost] RAC concepts and principles
1 Special problems in a cluster environment
1.1 Concurrency Control
In a cluster environment, key data is usually shared, for example stored on a shared disk, and every node has the same access rights to it. There must therefore be a mechanism that controls how the nodes access the data. Oracle RAC uses a DLM (Distributed Lock Manager) mechanism to coordinate concurrent access among multiple instances.
1.2 Amnesia
The cluster configuration file is not stored centrally; instead, each node keeps its own local copy. While the cluster is running normally, you can change the configuration on any node, and the change is automatically synchronized to the other nodes.
There is a special situation: node A is shut down normally, the configuration is then modified on node B, node B is shut down, and node A is started. Node A's stale local copy now becomes the effective configuration, so the change made on node B is lost. This problem is called amnesia.
1.3 Split Brain
In a cluster, nodes learn about each other's health through a heartbeat mechanism so that they can coordinate their work. Now assume that only the heartbeat fails while every node is still running normally. Each node then believes that the other nodes are down and that it is the sole survivor, so it should take control of the entire cluster. Because the storage is shared, having several "controllers" writing to it independently means data disaster. This situation is called split-brain.
The usual solution to this problem is the voting algorithm (Quorum Algorithm), which works as follows:
Each node in the cluster uses the heartbeat mechanism to announce its health to the others, and each "notification" a node receives counts as one vote. In a three-node cluster, each node counts three votes during normal operation. When node A's heartbeat fails but node A itself keeps running, the cluster splits into two partitions: node A on one side and the remaining two nodes on the other. One partition must be evicted to keep the cluster healthy.
For this three-node cluster, after the heartbeat problem occurs, B and C form a partition with two votes while A has only one vote. According to the voting algorithm, the partition formed by B and C gains control and A is evicted.
If there are only two nodes, the voting algorithm breaks down, because each node holds just one vote. In that case a third device, the Quorum Device, has to be introduced. The Quorum Device is usually a shared disk, also called the Quorum Disk, and it too represents one vote. When the heartbeat between the two nodes fails, both nodes race for the Quorum Disk's vote at the same time, and the first request to arrive wins. The node that obtains the Quorum Disk therefore holds two votes, and the other node is evicted.
1.4 IO isolation (Fencing)
When a cluster runs into the split-brain problem, the voting algorithm decides which partition obtains control of the cluster. That alone is not enough, however: we must also guarantee that the evicted node can no longer operate on the shared data. This is the problem IO Fencing solves.
IO Fencing can be implemented in two ways: hardware and software:
Software: for storage devices that support the SCSI Reserve/Release commands, the SG commands can be used. A healthy node uses the SCSI Reserve command to "lock" the storage device. When the failed node finds the storage locked, it knows it has been evicted from the cluster, i.e. that something abnormal has happened, and it restarts itself to recover. This mechanism is also called suicide; Sun and Veritas use it. (An illustrative command sketch follows this list.)
Hardware: STONITH (Shoot The Other Node in the Head). This approach operates the power switch directly: if one node fails and another node can detect it, a command is sent over the serial port to the failed node's power switch, and the failed node is rebooted by cutting and restoring its power. This method requires hardware support.
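As a rough illustration only (this shows SCSI-3 persistent reservations, a variant of the same idea, and is not necessarily the exact mechanism Sun or Veritas use), the reservation state of a shared LUN can be inspected on Linux with the sg_persist tool from the sg3_utils package; the device name /dev/sdb is a placeholder:
$ sg_persist --in --read-keys /dev/sdb          # list the registered reservation keys
$ sg_persist --in --read-reservation /dev/sdb   # show which node currently holds the reservation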
2 RAC cluster
2.1 Clusterware
In a single-host environment, Oracle runs on top of the OS Kernel. The OS Kernel manages the hardware devices and provides hardware access interfaces; Oracle does not operate the hardware directly, the OS Kernel completes the hardware calls on its behalf.
In a cluster environment, the storage devices are shared. The OS Kernel was designed for a single machine and can only arbitrate access among multiple processes on that machine; if we relied on OS Kernel services alone, coordination between multiple hosts could not be guaranteed. An additional control mechanism is therefore needed. In RAC, this mechanism is the Clusterware that sits between Oracle and the OS Kernel: it intercepts requests before they reach the OS Kernel, negotiates with the Clusterware on the other nodes, and only then completes the upper-layer request.
Before Oracle 10g, RAC required cluster software from hardware and software vendors such as Sun, HP, and Veritas. Oracle 10.1 introduced Oracle's own cluster product, Cluster Ready Services (CRS), so RAC no longer depends on any vendor's cluster software. In Oracle 10.2 this product was renamed Oracle Clusterware.
Therefore, there are actually two clusters in the entire RAC environment: one formed by the Clusterware software and one formed by the database.
2.2 Composition of Clusterware
Oracle Clusterware is a separate installation package. After installation, Oracle Clusterware starts automatically on each node. The Oracle Clusterware runtime environment consists of two disk files (OCR and Voting Disk), several background processes, and the network components.
2.2.1 Disk files
While running, Clusterware needs two files: OCR and the Voting Disk. Both files must be kept on shared storage. OCR is used to solve the amnesia problem, and the Voting Disk is used to solve the split-brain problem. Oracle recommends storing these two files on raw devices, one raw device per file.
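After Clusterware is installed, you can confirm where these two files actually live with the following commands (run as the Clusterware owner; the exact output format varies by release):
$ ocrcheck                      # shows the OCR location, size, and integrity status
$ crsctl query css votedisk     # shows the Voting Disk location(s)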
2.2.1.1 OCR
The amnesia problem arises because each node keeps its own copy of the configuration information and a change made on one node is not synchronized to a node that is down. The solution Oracle adopted is to place the configuration file on shared storage; this file is the OCR Disk.
OCR holds the configuration information of the entire cluster, stored as key-value pairs. Before Oracle 10g this file was called the Server Manageability Repository (SRVM). In Oracle 10g this content was redesigned and renamed OCR. During the Oracle Clusterware installation the installer prompts you to specify the OCR location, and the location you specify is recorded in /etc/oracle/ocr.loc (Linux) or /var/opt/oracle/ocr.loc (Solaris). The equivalent file in Oracle 9i RAC is srvConfig.loc. Oracle Clusterware reads the OCR content from this location at startup.
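On Linux, ocr.loc is a small key-value file. A typical example looks like the following; the raw device path is a placeholder and depends on your own configuration:
$ cat /etc/oracle/ocr.loc
ocrconfig_loc=/dev/raw/raw1
local_only=FALSE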
1). OCR key
The entire OCR content is a tree structure with three major branches: SYSTEM, DATABASE, and CRS. Each branch has many smaller branches beneath it. These records can only be modified by the root user.
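You can browse the key tree with the ocrdump tool. For example, the following dumps only the SYSTEM branch to standard output; the keys you actually see depend on your configuration:
$ ocrdump -stdout -keyname SYSTEM
$ ocrdump -stdout -keyname SYSTEM.css -xml   # dump only the SYSTEM.css subtree, in XML format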
2) OCR process
Oracle Clusterware stores the cluster configuration information in OCR, so the OCR content is very important, and every operation on OCR must preserve its integrity. For this reason, not every node is allowed to operate on the OCR disk while Oracle Clusterware is running.
There is a copy of OCR content in the memory of each node. This copy is called OCR Cache. Each node has an OCR Process to read and write the OCR Cache, but only one node's OCR process can read and write the content in the OCR Disk. This node is called the OCR Master node. The OCR process of this node updates the OCR Cache content of the local node and other nodes.
All other processes that need OCR content, such as OCSSD and EVM, are called Client Processes. These processes do not access the OCR Cache directly; they send requests to their local OCR process and obtain the content through it. To modify OCR content, the local node's OCR process likewise submits the request to the Master node's OCR process, which performs the physical read/write and synchronizes the OCR Cache content on all nodes.
2.2.1.2 Voting Disk
The Voting Disk file mainly records the node membership status. When split-brain occurs, it decides which partition obtains control and which partitions must be evicted from the cluster. You are also prompted to specify this location when installing Clusterware. After installation, you can view the location of the Voting Disk with the following command:
$ crsctl query css votedisk
2.2.2 Clusterware background processes
Clusterware consists of several processes, of which the three most important are CRSD, CSSD, and EVMD. At the final stage of the Clusterware installation you are asked to run the root.sh script on each node. This script adds the three processes as startup entries at the end of the /etc/inittab file, so that Clusterware starts automatically every time the system boots. If the EVMD or CRSD process terminates abnormally, the system restarts that process automatically; if the CSSD process terminates abnormally, the system reboots immediately.
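On a 10g-era Linux system, the entries that root.sh appends to /etc/inittab typically look like the lines below (runlevels and script paths may differ slightly between versions). The respawn action is what restarts EVMD and CRSD automatically, while init.cssd is started with the fatal flag, which is why a CSSD failure reboots the node:
h1:35:respawn:/etc/init.d/init.evmd run >/dev/null 2>&1 </dev/null
h2:35:respawn:/etc/init.d/init.cssd fatal >/dev/null 2>&1 </dev/null
h3:35:respawn:/etc/init.d/init.crsd run >/dev/null 2>&1 </dev/null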
1). OCSSD
OCSSD is the most critical process of Clusterware. If an exception occurs, the system restarts. This process provides the CSS (Cluster Synchronization Service) Service. The CSS Service monitors the cluster status in real time through multiple heartbeat mechanisms and provides basic cluster services such as split-brain protection.
The CSS service has two Heartbeat mechanisms: One is through the Network Heartbeat of the private Network, and the other is through the Disk Heartbeat of the Voting Disk.
Each heartbeat has a maximum allowed latency. For the Disk Heartbeat this latency is called IOT (I/O Timeout); for the Network Heartbeat it is called MC (Misscount). Both parameters are measured in seconds, and by default IOT is greater than MC. By default these two parameters are determined automatically by Oracle, and adjusting them is not recommended. You can view their values with the following commands:
$ crsctl get css disktimeout
$ crsctl get css misscount
Note: this process is required not only by Clusterware but also when ASM is used in a single-node environment, where it supports communication between the ASM instance and the RDBMS instance. If you install RAC on a node that already uses ASM, you run into a problem: a RAC node needs only one OCSSD process, and it should run out of $CRS_HOME. In that case you must stop ASM and run $ORACLE_HOME/bin/localconfig.sh delete to remove the previously created inittab entry. (When ASM was installed on its own, the same script was used to start OCSSD: $ORACLE_HOME/bin/localconfig.sh add.)
2). CRSD
CRSD is the main process for achieving "HA". The Service it provides is called the CRS (Cluster Ready Service) Service.
Oracle Clusterware is positioned at the cluster layer and must provide a "high availability service" for application-layer resources (CRS Resources). Oracle Clusterware therefore monitors these resources and intervenes when they behave abnormally, which includes shutting them down, restarting the process, or relocating the service. The CRSD process provides these services.
All components that need high availability are registered in OCR as CRS Resources during configuration or installation. Based on the content of OCR, the CRSD process decides which processes to monitor, how to monitor them, and how to handle problems. In other words, CRSD is responsible for the running state of CRS resources, including starting, stopping, monitoring, and failing over resources. By default, CRS automatically restarts a failed resource five times; if it still fails, CRS stops trying.
CRS resources include GSD (Global Services Daemon), ONS (Oracle Notification Service), VIP, Database, Instance, and Service. These resources fall into two categories:
GSD, ONS, VIP, and Listener belong to the Nodeapps class.
Database, Instance, and Service belong to the Database-Related Resource class.
It can be understood as follows: Nodeapps are resources of which each node needs exactly one (for example, each node has only one Listener), while Database-Related Resources are tied to the database rather than to a particular node; for example, a node can run multiple instances, and each instance can have multiple Services.
The GSD, ONS, and VIP services are created and registered in OCR when VIPCA is executed during the Clusterware installation. Database, Listener, Instance, and Service are registered in OCR automatically or manually during their respective configuration processes.
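You can list the registered CRS resources and check both resource classes with the commands below; the node name rac1 and the database name orcl are hypothetical, and crs_stat is the 10g-era tool (deprecated in later releases):
$ crs_stat -t                        # tabular list of all CRS resources and their states
$ srvctl status nodeapps -n rac1     # GSD, ONS, VIP, and Listener status on node rac1
$ srvctl status database -d orcl     # status of all instances of database orcl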
3). EVMD
The EVMD process is responsible for publishing the various events (Events) generated by CRS. These events can be delivered to clients in two ways: ONS and Callout Scripts. You can write your own callout scripts and place them in a specific directory; when certain events occur, EVMD automatically scans that directory and invokes the scripts. This invocation is performed through the racgevt process.
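A minimal, purely illustrative callout script is shown below. It just appends every event it receives to a local log file; $ORA_CRS_HOME/racg/usrco is the standard directory for user callouts, while the script name and the log path are made up for this example:
$ cat $ORA_CRS_HOME/racg/usrco/callout_log.sh
#!/bin/sh
# Append the event arguments passed by EVMD/racgevt to a local log file.
echo "`date` event received: $*" >> /tmp/rac_callout.log
$ chmod +x $ORA_CRS_HOME/racg/usrco/callout_log.sh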
Besides publishing events, the EVMD process also acts as a bridge between the CRSD and CSSD processes: communication between the CRS and CSS services goes through EVMD.
4). RACGIMON
RACGIMON checks the health of the database and is responsible for starting, stopping, and failing over services (Failover). This process maintains a persistent connection to the database and periodically checks specific information in the SGA, which is periodically refreshed by the PMON process.
5). OPROCD
The OPROCD process is also called the Process Monitor Daemon. It appears only on non-Linux platforms that do not use third-party cluster software. This process checks the node for Processor Hang (a stalled CPU); if scheduling is delayed by more than 1.5 seconds, the node is considered to have a CPU problem and is rebooted. In other words, this process provides the "IO fencing" function. This can also be seen from its service name on Windows: OraFenceService. On Linux, the hangcheck-timer kernel module is used to implement "IO fencing" instead.
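On Linux, the hangcheck-timer module is usually loaded with the two parameters shown below: hangcheck_tick is the check interval and hangcheck_margin the tolerated scheduling delay. The values 30 and 180 seconds are the commonly documented recommendations for RAC, so treat them as an example rather than a mandate:
$ /sbin/modprobe hangcheck-timer hangcheck_tick=30 hangcheck_margin=180
$ lsmod | grep hangcheck     # verify that the module is loaded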
2.3 principles and features of VIP
Oracle's TAF is built on top of the VIP technology. The difference between a normal IP address and a VIP is that a normal IP relies on TCP-layer timeouts, while a VIP gives an immediate response at the application layer. A VIP is a floating IP address: when a node fails, the VIP is automatically relocated to another node.
Suppose we have a two-node RAC. During normal operation each node has a VIP: VIP1 and VIP2. When node 2 fails, for example through an abnormal shutdown, RAC performs the following steps:
1). After the failure of node 2 (rac2) is detected, CRS triggers a Clusterware reconfiguration and finally evicts the rac2 node from the cluster; node 1 forms the new cluster.
2). The RAC failover mechanism relocates node 2's VIP to node 1. At this point the public NIC of node 1 carries three IP addresses: VIP1, VIP2, and PUBLIC IP1.
3). Client connection requests sent to VIP2 are routed to node 1 at the IP layer.
4). Because VIP2 is now up on node 1, the packets pass smoothly through the routing, network, and transport layers.
5). However, node 1 listens only on VIP1 and PUBLIC IP1; nothing listens on VIP2, so no application-layer program receives the packets and the error is detected immediately.
6). The client receives this error immediately and then re-initiates a connection request to VIP1.
VIP features:
1). VIP is created through VIPCA script
2). VIP is registered to OCR as the CRS Resource of Nodeapps type and maintained by CRS.
3). The VIP is bound to the node's public NIC, so the public NIC carries two addresses.
4). When a node fails, CRS relocates the failed node's VIP to another node.
5). The Listener on each node listens on both the public IP and the VIP of the public NIC.
6). The client's tnsnames.ora is usually configured to point to the node's VIP (see the example below).
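For example, a typical client-side tnsnames.ora entry lists the VIPs (not the physical public IPs) of both nodes and enables TAF; the host names rac1-vip and rac2-vip and the service name orcl are placeholders:
ORCL =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = rac1-vip)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = rac2-vip)(PORT = 1521))
    (LOAD_BALANCE = yes)
    (FAILOVER = yes)
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = orcl)
      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 30)(DELAY = 5))
    )
  )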
2.4 Clusterware log system
Oracle Clusterware can only be diagnosed with the help of its logs and traces, and its log system is fairly complex.
alert.log:
$ORA_CRS_HOME/log/hostname/alert.log, which is the first file to check.
Clusterware background process logs:
crsd.log: $ORA_CRS_HOME/log/hostname/crsd/crsd.log
ocssd.log: $ORA_CRS_HOME/log/hostname/cssd/ocssd.log
evmd.log: $ORA_CRS_HOME/log/hostname/evmd/evmd.log
Nodeapps log location:
$ORA_CRS_HOME/log/hostname/racg/
This directory contains the nodeapps logs, including ONS and VIP, for example ora.rac1.ons.log.
Tool execution logs:
$ORA_CRS_HOME/log/hostname/client/
Clusterware provides many command line tools:
For example ocrcheck, ocrconfig, ocrdump, oifcfg, and clscfg; the logs generated by these tools are stored in this directory.
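For instance, oifcfg can show how the public and private (cluster interconnect) networks are registered; the interface names and subnets below are illustrative:
$ oifcfg getif
eth0  192.168.100.0  global  public
eth1  10.0.0.0       global  cluster_interconnect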
$ORACLE_HOME/log/hostname/client/ and
$ORACLE_HOME/log/hostname/racg also contain related logs.