RAC 1 -- Clusterware Concept Introduction 1

Source: Internet
Author: User

1 Special problems in a clustered environment

1.1 Concurrency control

In a clustered environment, critical data is usually shared, for example on a shared disk. Because every node has the same access rights to the data, some mechanism must coordinate the nodes' access to it. Oracle RAC uses the DLM (Distributed Lock Management) mechanism for concurrency control between multiple instances.

1.2 Amnesia

The cluster configuration file is not stored centrally; each node keeps a local copy. While the cluster is running normally, a user can change the cluster configuration on any node, and the change is automatically synchronized to the other nodes.

There is a special case: node A is shut down normally, the configuration is then modified on node B, node B is shut down, and node A is started. The change made on node B never reached node A, so the modified configuration is lost. This is called amnesia.
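The special case above can be replayed in a few lines. This is only a toy model: the `Node` class, the `change_config` helper, and the `listener_port` setting are hypothetical names made up for illustration, not anything Oracle ships.

```python
class Node:
    """A cluster node holding its own local copy of the configuration."""

    def __init__(self, name, config):
        self.name = name
        self.config = dict(config)  # private copy, not shared storage
        self.running = True


def change_config(nodes, key, value):
    """Apply a change and sync it to every node that is currently running."""
    for node in nodes:
        if node.running:
            node.config[key] = value


# Hypothetical setting, used only for illustration.
initial = {"listener_port": 1521}
node_a = Node("A", initial)
node_b = Node("B", initial)

node_a.running = False                                  # A shuts down normally
change_config([node_a, node_b], "listener_port", 1522)  # change is made on B
node_b.running = False                                  # B shuts down
node_a.running = True                                   # A starts again

print(node_a.config["listener_port"])  # A still holds the old value
print(node_b.config["listener_port"])  # only B's copy has the change
```

Node A comes back with the old value because nothing outside node B ever recorded the change.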

1.3 Split brain

In a cluster, nodes learn about each other's health through a heartbeat mechanism so that they can coordinate their work. Suppose only the heartbeat fails while the nodes themselves keep running normally. Each node then assumes that the other nodes are down, that it is the only survivor in the cluster, and that it should take control of the whole cluster. Because the storage devices are shared, this can corrupt data. This situation is called split brain.

The usual way to solve this problem is a voting algorithm (quorum algorithm). It works as follows:

The nodes in the cluster use the heartbeat mechanism to report their health to each other; assume that each notification a node receives counts as one vote. In a three-node cluster that is running normally, each node therefore has 3 votes (one from each node, including itself). When node A's heartbeat fails while node A itself is still running, the cluster splits into 2 partitions: node A is one, and the remaining 2 nodes are the other. One partition must be evicted to keep the cluster healthy.

In this three-node example, B and C form a partition with 2 votes, while A has only 1 vote. According to the voting algorithm, the partition formed by B and C gains control and A is evicted.

If there are only 2 nodes, the voting algorithm alone fails, because each node has only 1 vote. A third device must be introduced: the quorum device. The quorum device is typically a shared disk, also known as the quorum disk, and it also represents one vote. When the heartbeat between the 2 nodes fails, both nodes try to acquire the quorum disk at the same time, and the request that arrives first is granted. The node that acquires the quorum disk first therefore has 2 votes, and the other node is evicted.
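The vote counting described above can be sketched as a small function. It is an illustrative model only: the function name `surviving_partition` and the `quorum_disk_winner` parameter are assumptions made for this example, and a partition wins only when it holds a strict majority of all votes.

```python
from typing import List, Optional, Set


def surviving_partition(partitions: List[Set[str]],
                        quorum_disk_winner: Optional[str] = None) -> Optional[Set[str]]:
    """Return the partition that keeps control, or None if nobody has a majority.

    Every node counts as one vote. If a quorum disk exists, the node that
    acquired it first contributes one extra vote to its partition.
    """
    total_votes = sum(len(p) for p in partitions)
    if quorum_disk_winner is not None:
        total_votes += 1

    for partition in partitions:
        votes = len(partition)
        if quorum_disk_winner in partition:
            votes += 1
        if votes * 2 > total_votes:  # strict majority
            return partition
    return None


# Three nodes, A loses its heartbeat: B and C hold 2 of 3 votes.
print(surviving_partition([{"A"}, {"B", "C"}]))

# Two nodes without a quorum disk: 1 vote each, no majority.
print(surviving_partition([{"A"}, {"B"}]))

# Two nodes with a quorum disk that B acquired first: B holds 2 of 3 votes.
print(surviving_partition([{"A"}, {"B"}], quorum_disk_winner="B"))
```

The second call shows why a two-node cluster needs the extra vote: without it, neither side can ever win.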

1.4 IO Isolation (Fencing)

When split brain occurs in a cluster, the voting algorithm decides who gets control of the cluster. But this is not enough: we must also ensure that the evicted node can no longer touch the shared data. This is the problem that IO fencing solves.

IO fencing can be implemented in 2 ways, in software or in hardware:

Software: For storage devices that support the SCSI reserve/release commands, fencing can be implemented with the sg commands. A healthy node uses the SCSI reserve command to "lock" the storage device. When the faulty node finds that the storage device is locked, it knows that it has been evicted from the cluster, which means something is wrong with it, and it restarts itself to return to a normal state. This mechanism is also known as suicide. It is used by Sun and Veritas.

Hardware: STONITH (Shoot The Other Node In The Head). This approach operates the power switch directly. When a node fails and another node detects it, the other node sends a command through the serial port to the failed node's power switch and restarts the failed node by briefly cutting and then restoring its power. This approach requires hardware support.
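The software variant can be pictured with an in-memory stand-in for a reservable disk. This sketch does not issue real SCSI commands; `SharedDisk`, `ReservationConflict`, and `guarded_write` are hypothetical names that only mimic the reserve-then-reject behavior described above.

```python
class ReservationConflict(Exception):
    """Raised when a node touches a disk reserved by another node."""


class SharedDisk:
    """In-memory stand-in for a disk that supports reserve/release."""

    def __init__(self):
        self.holder = None  # node currently holding the reservation
        self.blocks = {}

    def reserve(self, node):
        if self.holder is not None and self.holder != node:
            raise ReservationConflict(f"disk is reserved by {self.holder}")
        self.holder = node

    def write(self, node, block, data):
        if self.holder is not None and self.holder != node:
            raise ReservationConflict(f"disk is reserved by {self.holder}")
        self.blocks[block] = data


def guarded_write(disk, node, block, data):
    """Write if allowed; otherwise report that the node must restart itself."""
    try:
        disk.write(node, block, data)
        return "written"
    except ReservationConflict:
        return "evicted, restarting"


disk = SharedDisk()
disk.reserve("B")                          # the surviving node locks the disk
print(guarded_write(disk, "B", 0, "ok"))   # written
print(guarded_write(disk, "A", 0, "bad"))  # evicted, restarting
```

The evicted node learns about its eviction from the storage layer itself, so it needs no working heartbeat to find out.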

2 RAC clusters

2.1 Clusterware

In a stand-alone environment, Oracle runs on top of the OS kernel. The OS kernel manages the hardware devices and provides the hardware access interfaces. Oracle does not manipulate the hardware directly; instead, the OS kernel carries out hardware requests on its behalf.

In a clustered environment, storage devices are shared. The OS kernel is designed for a single machine and can only control access among multiple processes on that machine. Relying on OS kernel services alone cannot guarantee coordination between multiple hosts. RAC therefore needs an additional control mechanism: the clusterware, which sits between Oracle and the OS kernel. It intercepts each request before it reaches the OS kernel and negotiates with the clusterware on the other nodes before the upper-level request is completed.

Prior to Oracle 10g, the cluster components required by RAC depended on third-party vendors such as Sun, HP, and Veritas. Starting with release 10.1, Oracle shipped its own cluster product, Cluster Ready Services (CRS), and since then RAC no longer depends on any vendor's cluster software. In Oracle 10.2, this product was renamed Oracle Clusterware.

So a RAC cluster actually contains 2 cluster environments: one is the cluster formed by the Clusterware software, and the other is the cluster formed by the databases.

2.2 Clusterware composition

Oracle Clusterware is a separate installation package. After installation, Oracle Clusterware starts automatically on each node. Its operating environment consists of 2 disk files (OCR and voting disk), a number of background processes, and network elements.

2.2.1 Disk files:

Clusterware requires two files during operation: the OCR and the voting disk. Both files must be stored on shared storage. The OCR solves the amnesia problem, and the voting disk solves the split-brain problem. Oracle recommends storing these 2 files on raw devices: create one raw device for each file and allocate about 100 MB to each.

2.2.1.1 OCR (Oracle Cluster Registry)

Amnesia arises because each node has its own copy of the configuration information, and a change made on one node is not synchronized to the others. Oracle's solution is to put the configuration file on shared storage. This file is the OCR disk.

The configuration information of the entire cluster is saved in the OCR as key-value pairs. Prior to Oracle 10g, this file was called the Server Manageability Repository (SRVM). In Oracle 10g, this part was redesigned and renamed OCR. During the installation of Oracle Clusterware, the installer prompts the user to specify the OCR location. The user-specified location is recorded in the /etc/oracle/ocr.loc file (Linux) or the /var/opt/oracle/ocr.loc file (Solaris). In Oracle 9i RAC, the equivalent is the srvConfig.loc file. At startup, Oracle Clusterware reads the OCR content from the location recorded in this file.

1). OCR Key

The OCR information is organized as a tree with 3 main branches: SYSTEM, DATABASE, and CRS. Each branch has many smaller branches beneath it. These records can only be modified by the root user.

2). OCR Process

Oracle Clusterware stores the cluster configuration in the OCR, so the OCR content is very important, and every operation on the OCR must preserve its integrity. For this reason, not every node may operate on the OCR disk while Oracle Clusterware is running.

Each node keeps a copy of the OCR content in memory, called the OCR cache. Each node also has an OCR process that reads and writes the OCR cache, but only the OCR process on one node, called the OCR master node, can read and write the OCR disk. The OCR process on the master node is responsible for updating the OCR cache on its own node and on the other nodes.

All other processes that need OCR content, such as OCSSD and EVM, are called client processes. They do not access the OCR cache directly; instead they send requests to the local OCR process and obtain the content through it. To modify OCR content, the local OCR process submits the request to the OCR process on the master node, which performs the physical read and write and then synchronizes the OCR caches on all nodes.
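The read and write paths just described can be modeled with a shared dictionary for the OCR disk and one dictionary per node for the OCR cache. This is a simplified sketch under those assumptions; the class `OcrCluster`, its methods, and the key `DATABASE.demo.setting` are invented for illustration and do not reflect Oracle's internal interfaces.

```python
class OcrCluster:
    """Toy model: one shared OCR disk, one in-memory OCR cache per node."""

    def __init__(self, node_names, master):
        self.disk = {}                                   # shared OCR disk
        self.caches = {name: {} for name in node_names}  # per-node OCR cache
        self.master = master                             # OCR master node
        self.disk_writes = []                            # (writer, key) audit trail

    def read(self, node, key):
        """Client read: served from the local node's OCR cache."""
        return self.caches[node].get(key)

    def write(self, node, key, value):
        """Client write: the local OCR process forwards it to the master."""
        self._master_write(key, value)

    def _master_write(self, key, value):
        # Only the master touches the disk, then it refreshes every cache.
        self.disk[key] = value
        self.disk_writes.append((self.master, key))
        for cache in self.caches.values():
            cache[key] = value


cluster = OcrCluster(["rac1", "rac2"], master="rac1")
cluster.write("rac2", "DATABASE.demo.setting", "on")  # request comes from rac2

print(cluster.read("rac1", "DATABASE.demo.setting"))  # on
print(cluster.read("rac2", "DATABASE.demo.setting"))  # on
print(cluster.disk_writes)                            # the writer is always rac1
```

Funnelling every disk write through one node is what keeps concurrent changes from corrupting the OCR.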

2.2.1.2 Voting Disk

The voting disk mainly records node membership status. When split brain occurs, it is used to decide which partition gains control; the other partitions must be evicted from the cluster. The Clusterware installer also prompts for the voting disk location. After installation, you can view the voting disk location with the following command:

$ crsctl query css votedisk

2.2.2 Clusterware Background process

Clusterware consists of a number of processes, the 3 most important of which are CRSD (Cluster Ready Services), CSSD (Cluster Synchronization Services), and EVMD (Event Manager). In the final stage of the Clusterware installation, the root.sh script must be run on each node. The script appends entries for these 3 processes to the end of the /etc/inittab file so that Clusterware starts automatically whenever the system boots. If EVMD or CRSD fails, the system restarts that process automatically; if CSSD fails, the node reboots immediately.

1). OCSSD

OCSSD is the most critical Clusterware process; if it fails, the node reboots. It provides the CSS (Cluster Synchronization Services) service. The CSS service monitors the state of the cluster in real time through several heartbeat mechanisms and provides basic cluster functions such as split-brain protection.

The CSS service has 2 heartbeat mechanisms: the network heartbeat over the private network, and the disk heartbeat through the voting disk.

These 2 heartbeats each have a maximum allowed delay. For the disk heartbeat, the delay is called IOT (I/O Timeout); for the network heartbeat, it is called MC (Misscount). Both parameters are in seconds. By default IOT is larger than MC. The defaults are determined automatically by Oracle, and changing them is not recommended. You can view the parameter values with the following commands:

$ crsctl get css disktimeout

$ crsctl get css misscount
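How the two limits interact can be shown with a small classifier. This is a sketch only: the function `heartbeat_status` is hypothetical, and the values chosen for `MISSCOUNT` and `DISKTIMEOUT` are assumed for the example (they merely respect the rule that IOT is larger than MC); check the real values with the crsctl commands above.

```python
def heartbeat_status(now, last_network_beat, last_disk_beat,
                     misscount, disktimeout):
    """Classify a peer from the age of its last heartbeats (all in seconds)."""
    network_late = (now - last_network_beat) > misscount
    disk_late = (now - last_disk_beat) > disktimeout
    if network_late and disk_late:
        return "both heartbeats missed"
    if network_late:
        return "network heartbeat missed"
    if disk_late:
        return "disk heartbeat missed"
    return "healthy"


# Assumed example values, not authoritative defaults.
MISSCOUNT = 30     # MC: maximum network heartbeat delay
DISKTIMEOUT = 200  # IOT: maximum disk heartbeat delay

print(heartbeat_status(1000, 990, 995, MISSCOUNT, DISKTIMEOUT))  # healthy
print(heartbeat_status(1000, 960, 995, MISSCOUNT, DISKTIMEOUT))  # network heartbeat missed
```

In the second call the peer still writes to the voting disk but has been silent on the private network for 40 seconds, which is the split-brain situation from section 1.3.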

Note: Besides Clusterware, this process is also required in a single-node environment that uses ASM, where it supports communication between the ASM instance and the RDBMS instance. If you install RAC on a node that already uses ASM, you run into a problem: a RAC node needs only one OCSSD process, and it should run from the $CRS_HOME directory. In that case, stop ASM first and run $ORACLE_HOME/bin/localconfig.sh delete to remove the earlier inittab entry. Before ASM is installed, the same script is used to start OCSSD: $ORACLE_HOME/bin/localconfig.sh add.

2). CRSD

CRSD is the main process for achieving high availability (HA). The service it provides is called the CRS (Cluster Ready Services) service.

Oracle Clusterware is a cluster-level component that provides high availability for application-tier resources (CRS resources). It must therefore monitor these resources and intervene when they run abnormally, which includes shutting down or restarting processes and relocating services. These functions are provided by the CRSD process.

All components that require high availability are registered in the OCR as CRS resources when they are installed and configured. Based on the OCR content, the CRSD process decides which processes to monitor, how to monitor them, and how to handle problems when they occur. In other words, CRSD is responsible for monitoring the state of CRS resources and for starting, stopping, and failing over these resources. By default, CRS tries to restart a resource up to 5 times and gives up if the resource still fails.
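The bounded restart policy mentioned at the end of the paragraph above can be sketched as a simple loop. The names `restart_with_limit` and `make_flaky_start` are hypothetical helpers for this example; only the limit of 5 attempts comes from the text.

```python
def restart_with_limit(start_resource, max_attempts=5):
    """Try to start a resource; return the attempt that succeeded, else None."""
    for attempt in range(1, max_attempts + 1):
        if start_resource():
            return attempt
    return None  # give up after max_attempts failures


def make_flaky_start(failures_before_success):
    """Build a fake start action that fails a fixed number of times first."""
    state = {"calls": 0}

    def start():
        state["calls"] += 1
        return state["calls"] > failures_before_success

    return start


print(restart_with_limit(make_flaky_start(2)))  # 3: succeeds on the third try
print(restart_with_limit(make_flaky_start(5)))  # None: all 5 attempts fail
```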

CRS resources include GSD (Global Services Daemon), ONS (Oracle Notification Service), VIP, Database, Instance, and Service. These resources fall into 2 categories:

GSD, ONS, VIP, and Listener belong to the Nodeapps class.

Database, Instance, and Service belong to the Database-Related Resource class.

One way to understand this: Nodeapps are resources of which each node needs only one; for example, each node has only one listener. Database-related resources are tied to the database and are not restricted by node; for example, a node can have multiple instances, and each instance can have multiple services.

GSD, ONS, and VIP are created and registered in the OCR when VIPCA runs at the end of the Clusterware installation. The Database, Listener, Instance, and Service resources are registered in the OCR automatically or manually during their respective configuration.

3). EVMD

The EVMD process is responsible for publishing the events generated by CRS. Events can be delivered to clients in 2 ways: ONS and callout scripts. Users can place custom callout scripts in a specific directory; when an event occurs, EVMD scans that directory and invokes the scripts. The invocation is done through the racgevt process.

Besides publishing events, the EVMD process also acts as a bridge between the CRSD and CSSD processes: communication between the CRS and CSS services goes through EVMD.
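The callout mechanism (scan a directory, run every executable script found there with the event as an argument) can be sketched as follows. This is an illustration of the pattern, not of how racgevt is implemented; the function `run_callouts` and the way the event is passed as a single argument are assumptions.

```python
import os
import subprocess


def run_callouts(directory, event):
    """Run every executable file in `directory`, passing the event text.

    Returns a list of (script name, exit code, captured stdout) tuples.
    """
    results = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip subdirectories and files without the execute permission.
        if os.path.isfile(path) and os.access(path, os.X_OK):
            done = subprocess.run([path, event], capture_output=True, text=True)
            results.append((name, done.returncode, done.stdout.strip()))
    return results
```

A caller would point `run_callouts` at the callout directory and pass a short event description; each script decides on its own what to do with it, for example sending a mail or writing a log entry.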

4). racgimon

The racgimon process checks the health of the database and is responsible for starting, stopping, and failing over services. It holds a persistent connection to the database and periodically checks specific information in the SGA, which the PMON process updates regularly.

5). OPROCD

The OPROCD process is also known as the Process Monitor Daemon. You will see it on non-Linux platforms when no third-party cluster software is used. It checks for processor hangs: if scheduling is delayed by more than 1.5 seconds, the CPU is considered to be working abnormally and the node is restarted. In other words, this process provides the "IO isolation" function, which is also apparent from its service name on Windows: OraFenceService. On Linux, the hangcheck-timer module provides "IO isolation" instead.
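The hang check can be pictured as a timer that sleeps for a fixed interval and then measures how late it woke up. The sketch below illustrates only that idea; `overslept` and `watch_once` are hypothetical names, and the 1.5 second margin is the value quoted in the text.

```python
import time


def overslept(expected_wake, actual_wake, margin=1.5):
    """True if the wake-up was delayed by more than `margin` seconds."""
    return (actual_wake - expected_wake) > margin


def watch_once(interval=0.05, margin=1.5,
               clock=time.monotonic, sleep=time.sleep):
    """Sleep for `interval` and report whether scheduling was delayed too long."""
    start = clock()
    sleep(interval)
    return overslept(start + interval, clock(), margin)


print(overslept(10.0, 10.2))  # False: a 0.2 s delay is within the margin
print(overslept(10.0, 12.0))  # True: a 2 s delay would trigger a restart
```

A real monitor would call something like `watch_once` in a loop and reboot the node when it returns True; here the decision is kept in a pure function so that it can be checked without actually stalling the machine.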

2.3 VIP Principle and features

Oracle's TAF is built on top of VIP technology. The difference between an ordinary IP and a VIP is that an IP relies on TCP-layer timeouts, while a VIP gets an immediate response from the application layer. A VIP is a floating IP: when a node has a problem, its VIP automatically moves to another node.

Suppose there is a 2-node RAC in which each node has one VIP during normal operation: VIP1 and VIP2. When node 2 fails, for example because it shuts down abnormally, RAC does the following:

1). After detecting that the RAC2 node is abnormal, CRS triggers a Clusterware reconfiguration. The RAC2 node is evicted from the cluster, and node 1 forms the new cluster.

2). The RAC failover mechanism moves node 2's VIP to node 1, so the public network card of node 1 now has 3 IP addresses: VIP1, VIP2, and public IP1.

3). Connection requests to VIP2 are routed to node 1 by IP-layer routing.

4). Because the VIP2 address exists on node 1, all packets pass through the routing, network, and transport layers without problems.

5). However, the listener on node 1 only listens on VIP1 and public IP1, not on VIP2. No application-layer program is available to receive the packet, so the error is detected immediately. (This shows that the VIP does not rely on the timeout mechanism of the operating system's TCP/IP stack, but on the timely response of the listener at the application layer.)

6). The client receives this error immediately and then re-initiates the connection request to VIP1.
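The six steps above can be simulated without any real networking. In this sketch, a moved VIP with no listener produces an immediate `ConnectionRefusedError`, while an address that no live node owns produces a `TimeoutError` that stands in for the long TCP timeout. The `Node` class, the helper functions, and the 192.0.2.x addresses (a documentation-only range) are all assumptions made for the example.

```python
class Node:
    def __init__(self, name, public_ip, vip):
        self.name = name
        self.up = True
        self.addresses = {public_ip, vip}  # bound to the public network card
        self.listening = {public_ip, vip}  # addresses the listener serves


def fail_over_vip(failed, survivor, vip):
    """Move the failed node's VIP to the survivor; its listener is unchanged."""
    failed.up = False
    failed.addresses.discard(vip)
    survivor.addresses.add(vip)


def connect(nodes, address):
    """Return the name of the node that accepts the connection."""
    for node in nodes:
        if node.up and address in node.addresses:
            if address in node.listening:
                return node.name
            raise ConnectionRefusedError(address)  # immediate error
    raise TimeoutError(address)  # nobody owns the address: TCP timeout


def connect_with_failover(nodes, addresses):
    """Try each address in turn, moving on as soon as one is refused."""
    for address in addresses:
        try:
            return connect(nodes, address)
        except ConnectionRefusedError:
            continue
    raise ConnectionError("no address accepted the connection")


VIP1, VIP2 = "192.0.2.101", "192.0.2.102"
node1 = Node("node1", "192.0.2.11", VIP1)
node2 = Node("node2", "192.0.2.12", VIP2)

fail_over_vip(node2, node1, VIP2)  # node 2 fails, VIP2 floats to node 1
print(connect_with_failover([node1, node2], [VIP2, VIP1]))  # node1
```

Without the VIP move, the first `connect` call would end in the `TimeoutError` branch, which is exactly the slow path that the VIP is meant to avoid.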

VIP Features:

1). The VIP is created by the VIPCA script in the last stage of the Clusterware installation.

2). The VIP is registered in the OCR as a Nodeapps-type CRS resource, and its state is maintained by CRS.

3). The VIP is bound to the node's public network card, so the public network card has 2 addresses.

4). When a node fails, CRS moves the VIP of the failed node to another node.

5). The listener on each node listens on both the public IP and the VIP of the public network card.

6). The client's tnsnames.ora is typically configured to point to the nodes' VIPs.

2.4 Clusterware log system

Oracle Clusterware can only be diagnosed from its log and trace files, and its log system is fairly complex.

alert.log:

$ORA_CRS_HOME/log/hostname/alert.log. This is the first file to check.

Clusterware background process logs:

crsd.log: $ORA_CRS_HOME/log/hostname/crsd/crsd.log

ocssd.log: $ORA_CRS_HOME/log/hostname/cssd/ocssd.log

evmd.log: $ORA_CRS_HOME/log/hostname/evmd/evmd.log

Nodeapp log location:

$ORA_CRS_HOME/log/hostname/racg/

This directory contains the Nodeapp logs, including those of ONS and VIP, for example ora.rac1.ons.log.

Tool execution logs:

$ORA_CRS_HOME/log/hostname/client/

Clusterware provides a number of command-line tools, such as ocrcheck, ocrconfig, ocrdump, oifcfg, and clscfg. The logs generated by these tools are placed in this directory.

Related logs can also be found in $ORACLE_HOME/log/hostname/client/ and $ORACLE_HOME/log/hostname/racg.

Ext.: http://blog.csdn.net/cymm_liu/article/details/7899038
