In this mode, all nodes can provide services; no node sits idle on standby waiting for user requests. In most cases the cluster members have identical hardware, which avoids potential performance problems and makes load balancing easier to implement. Active/active clusters require more sophisticated management software, because all resources, such as disks and memory, must be kept synchronized across all nodes. More often than not, a private network is used for the heartbeat connections. The cluster management software must be able to detect node problems such as node failures and split-brain. Split-brain is a bad situation for a cluster: all nodes are still running, but the internal communication between them is broken. The cluster then splits into several partitions, and the cluster software in each partition tries to take over the resources of the other nodes, because from its point of view those nodes have failed. The following problem can then occur: if applications can still connect to these parts of the cluster, then, because the partitions are no longer synchronized, different data may be written to disk. The harm split-brain does to a cluster is obvious, and cluster software vendors must provide a solution to it. Oracle's cluster software (Grid Infrastructure in 11g) uses a quorum device, called the voting disk, to determine the members of the cluster. All nodes in the cluster share the voting disk. A node that cannot send heartbeats over the private network or to the voting disk is evicted from the cluster. If a node cannot communicate with the other nodes but can still reach the voting disk, the cluster votes and issues an instruction to remove that node. The eviction uses the STONITH approach: the software sends a request that automatically reboots the evicted node. This gets tricky when a hung node needs to be rebooted but can no longer process the reboot command. Fortunately, Grid Infrastructure supports IPMI (the Intelligent Platform Management Interface) and can send a termination command to such a node. When a node fails or is evicted from the cluster, the remaining nodes take over the user service requests.

Configuring an active/passive cluster: the members of an active/passive cluster should have identical or nearly identical hardware, but only one of the two nodes processes user requests at any one time. The cluster management software constantly monitors the health of the resources in the cluster. When a resource fails, it tries to restart the resource a number of times; if the resource still fails, the standby node takes over. Depending on the options chosen during installation, cluster resources can be placed on shared storage or on a file system that is failed over together with the other resources. Using a shared (cluster) file system has advantages over a non-shared file system, which may require an fsck(8) check before it can be remounted on the standby node. The Veritas Cluster Suite, Sun (Oracle) Cluster, and IBM HACMP can be used as cluster managers for active/passive clusters. It is not widely known that it is very simple to set up an active/passive cluster with Oracle Grid Infrastructure: using the Grid Infrastructure application API and Oracle ASM as a clustered logical volume manager, a single-instance Oracle database can easily be monitored and protected. When its node fails, the database is automatically migrated to the standby node. Depending on the initialization parameter fast_start_mttr_target and the size of the recovery set, this failover can be very quick; however, users' database connections are disconnected as part of the failover. Active/passive mode can be enabled by setting the active_instance_count parameter to 1, but only when the number of nodes is two.
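To make the voting-disk and membership discussion more concrete, here is a minimal sketch of commands an administrator might run on a Grid Infrastructure node. The crsctl and olsnodes utilities are standard, but the surrounding setup (PATH, privileges) is assumed rather than shown.

```bash
# Minimal sketch: checking quorum and membership on a Grid Infrastructure node.
# Assumes $GRID_HOME/bin is on the PATH; some commands need root or the grid owner.

crsctl check crs            # health of the local CRS, CSS and EVM daemons
crsctl query css votedisk   # location and state of the voting disk(s)
olsnodes -n                 # nodes currently known to the cluster, with node numbers
```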
Configuring the shared-all architecture: a cluster in which all nodes access the shared storage and data simultaneously is called a shared-all or shared-everything architecture. Oracle RAC is based on the shared-everything architecture: one database resides on shared storage and is accessed through instances running on each node of the cluster. In Oracle terminology, an instance consists of memory structures and a set of processes, while the database consists of the data files on disk. In RAC, an instance failure does not mean that the data managed by that instance is lost: when a node fails, another instance in the cluster performs recovery for it, and all remaining nodes continue to provide service. High-availability technologies such as FCF or TAF can minimize the impact of an instance failure on users. The failed node later rejoins the cluster and shares the workload again.

Configuring a shared-nothing architecture: in a shared-nothing database cluster, each node has its own private, independent storage that the other nodes cannot access. The database is partitioned across the nodes of the cluster, and the result set returned by a query is the combination of the result sets from the individual nodes. If a node is lost, its part of the data becomes inaccessible; therefore, a shared-nothing cluster is often implemented as a set of individual active/passive or active/active clusters to improve availability. MySQL Cluster is based on the shared-nothing architecture.

Main RAC concepts. Cluster nodes: a cluster consists of individual nodes, and in Oracle RAC the number of nodes allowed depends on the cluster software version; the public documentation states that the Oracle 10.2 cluster software supports 100 nodes, while 10.1 supports 63 instances. Even though an application on RAC can keep running when a node fails, you should still make sure that no single component in a database server is a single point of failure (SPOF). When buying new hardware, use hot-swappable components, such as internal disks and fans. In addition, the server power supplies, host bus adapters, NICs, and hard disks should be redundant. Where possible, combine them logically, for example through hardware or software RAID for the disks, NIC bonding, and multipathing for the storage network. In the data center you also need an uninterruptible power supply, sufficient cooling, and professional server racking, and it is better still to have a remote lights-out console: when a node hangs for no apparent reason, it may urgently need to be examined or rebooted.

The cluster interconnect is one of the defining features of Oracle RAC. It not only lets the cluster pass data blocks between instances without the limitations of the old block-ping algorithm, it is also used for the heartbeat and for regular inter-node communication. If the interconnect fails, the cluster reorganizes itself to avoid a split-brain, and Grid Infrastructure restarts one or more nodes. It is possible to configure separate interconnects for RAC and for Grid Infrastructure; in that case RAC must be configured to use the correct one. The interconnect should always be private and must not be disturbed by other network traffic. RAC users can choose between two technologies for the interconnect: Ethernet and InfiniBand. Ethernet-based interconnects: using 10G Ethernet for the cluster interconnect is probably the most common choice at present, and the cluster background processes communicate over it using TCP/IP.
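As a hedged sketch of how the interconnect classification can be inspected and, if necessary, registered, the oifcfg utility shipped with Grid Infrastructure distinguishes public networks from the private cluster interconnect. The interface names and subnet below are examples only.

```bash
# Sketch: inspecting and registering the private interconnect classification.
# eth0/eth1 and 192.168.100.0 are example names; adjust for your environment.

oifcfg getif                                                  # list interfaces and their roles
oifcfg setif -global eth1/192.168.100.0:cluster_interconnect  # mark eth1 as the private interconnect
```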
Cache Fusion (which maintains cache coherency between instances) uses a different protocol: UDP, the User Datagram Protocol. Both UDP and TCP belong to the transport layer. TCP is connection-oriented: it uses an explicit handshake to make sure network packets arrive in order and retransmits packets that are lost. UDP keeps no such state; it is a fire-and-forget protocol that simply sends a datagram to its destination. The main advantage of UDP over TCP is that it is relatively lightweight.

Note: do not use a crossover cable to connect the two nodes of a two-node cluster directly; intra-cluster communication must go through a switch, and the use of crossover cables should be explicitly prohibited!

Jumbo frames can improve the efficiency and performance of intra-cluster communication. Ethernet frames can have different sizes, normally limited to 1,500 bytes (the MTU). The frame size determines how much data a single Ethernet frame can carry: the larger the payload of a frame, the less work the servers and switches have to do, giving more efficient communication. Many switches allow frames larger than the standard MTU, between 1,500 and 9,000 bytes, known as jumbo frames. Note that jumbo frames are not routed, so they cannot be used on the public network. When jumbo frames are used, make sure that all nodes in the cluster use the same MTU. I have already said that the components of a database server should be redundant, and the NIC is one of them. In Linux, multiple network ports can be bonded into one logical unit using the bonding driver; unlike on many other operating systems, NIC bonding on Linux does not require buying additional software.
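The following is a minimal sketch, assuming a RHEL-style Linux with the bonding driver available, of how jumbo frames and NIC bonding might be configured for the private interconnect. The interface names, address, file paths, and bonding options are examples, and the exact syntax varies with the distribution release; the switch ports must also accept the larger MTU.

```bash
# Sketch: jumbo frames and NIC bonding for the private interconnect on a
# RHEL-style system. Names and addresses are placeholders.

# Temporarily raise the MTU to verify that the NICs and switch accept it
ip link set dev bond0 mtu 9000
ip link show bond0

# Persistent bond device carrying the interconnect address
cat > /etc/sysconfig/network-scripts/ifcfg-bond0 <<'EOF'
DEVICE=bond0
IPADDR=192.168.100.1
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
MTU=9000
BONDING_OPTS="mode=active-backup miimon=100"
EOF

# Each physical port is enslaved to the bond
cat > /etc/sysconfig/network-scripts/ifcfg-eth1 <<'EOF'
DEVICE=eth1
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
EOF
```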
InfiniBand-based interconnects are often used to implement RDMA (Remote Direct Memory Access). This is a high-speed interconnect usually associated with high-performance computing (HPC) environments. RDMA allows parallel, direct, memory-to-memory transfers between the cluster nodes; it requires dedicated RDMA adapters, switches, and software, and it avoids the CPU processing and context-switching overhead associated with Ethernet. On Linux there are two ways to implement an InfiniBand interconnect. The first is IP over InfiniBand (IPoIB), which uses the IB fabric as the link layer and uses encapsulation to convert between IP and IB packets, so that programs written for Ethernet can run directly over InfiniBand. The other is to use InfiniBand-based Reliable Datagram Sockets (RDS), supported by Oracle 10.2.0.3 and later. RDS is available through the OpenFabrics Enterprise Distribution (OFED) on Linux and Windows, and is characterized by low latency, low overhead, and high bandwidth. Oracle database servers and the Exadata storage servers use InfiniBand to provide up to 40 Gbit/s of bandwidth for intra-cluster communication, which is impossible with Ethernet. InfiniBand offers great advantages for performance, but its cost is also very high.
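As a rough sketch, and assuming the OFED stack is installed, the following commands can be used to check the state of an InfiniBand interconnect. ib0 is the conventional IPoIB interface name and the available kernel modules vary by distribution, so treat the names below as assumptions.

```bash
# Sketch: basic InfiniBand/RDS checks on a Linux node with OFED installed.
# ib0 is the conventional IPoIB interface name; it may differ on your system.

ibstat                                        # HCA state and link rate (infiniband-diags)
ip addr show ib0                              # IPoIB interface, if IP over InfiniBand is used
lsmod | grep -E '^(rds|rds_rdma|ib_ipoib)'    # relevant kernel modules, if loaded
```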
Clusterware/Grid Infrastructure is closely integrated with the operating system and provides the following services: connectivity between the nodes; maintenance of cluster membership; messaging; cluster logical volume management; and isolation (fencing).

I/O fencing:
When a cluster runs into a split-brain, a voting algorithm can decide which partition keeps control of the cluster, but that alone is not enough: the evicted node must also be prevented from operating on the shared data. This is the problem that I/O fencing solves.

I/O fencing can be implemented in two ways, in software or in hardware:

Software: for storage devices that support the SCSI reserve/release commands, the SG commands are used. A healthy node issues a SCSI reserve to "lock" the storage device; when the failed node finds that the storage device is locked, it knows it has been evicted from the cluster, concludes that it has run into a problem, and reboots itself to recover. This mechanism is also called suicide; Sun and Veritas use it.
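To illustrate the software approach, here is a minimal sketch using the sg3_utils package; note that modern fencing implementations generally rely on SCSI-3 persistent reservations rather than the older reserve/release commands described above, and /dev/sdc is a placeholder for a shared LUN.

```bash
# Sketch: inspecting reservations on a shared LUN with sg3_utils (run as root).
# /dev/sdc is an example device path; SCSI-3 persistent reservations are shown
# here in place of the older reserve/release commands.

sg_persist --in --read-keys /dev/sdc          # registration keys on the LUN
sg_persist --in --read-reservation /dev/sdc   # current reservation holder, if any
```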
Hardware: STONITH ("shoot the other node in the head"). This approach operates the power switch directly: if one node fails and the other node can detect it, a command is sent over a serial port to the power switch of the failed node, which is rebooted by briefly cutting and restoring its power. This method requires hardware support.
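For the hardware approach, a common building block today is the server's baseboard management controller reached over IPMI, which was mentioned earlier as supported by Grid Infrastructure. The sketch below uses ipmitool with a placeholder BMC address and credentials; it is illustrative rather than a description of any particular cluster product.

```bash
# Sketch: power-cycling a failed node out-of-band via its BMC using IPMI.
# 192.0.2.10, admin and secret are placeholders for the BMC address and credentials.

ipmitool -I lanplus -H 192.0.2.10 -U admin -P secret chassis power status
ipmitool -I lanplus -H 192.0.2.10 -U admin -P secret chassis power cycle
```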
Oracle's cluster software has carried a different name in each release; in 11g Release 2 it is called Grid Infrastructure. Process structure: after installation, a number of background processes are created that keep the cluster working and communicate with the outside world. On Linux, some of them must be started with root privileges; for example, higher privileges are needed to modify the network configuration. The other background processes run under the account of the operating system user that owns the Grid software. The following table describes the main background processes.
| Background Process | Description |
| --- | --- |
| Oracle High Availability Service (OHAS) | OHAS is the first Grid Infrastructure component started after the server boots. It is configured to be started by init(1) and is responsible for spawning the agent processes. |
| Oracle Agent | Grid Infrastructure uses two Oracle agent processes. The first, broadly speaking, is responsible for starting resources that need access to the OCR and the voting files; it is created by OHAS. The second agent process is created by CRSD and starts all resources that do not require root privileges. It runs under the account of the Grid Infrastructure owner and takes over the work performed by racg in RAC 11.1. |
| Oracle Root Agent | As with the Oracle agent, two root agent processes are created. The first is spawned by OHAS and initializes the Linux resources that need elevated privileges; the main background processes it creates are CSSD and CRSD. CRSD in turn spawns another root agent, which starts the resources that require root privileges, mainly network-related ones. |
| Cluster Ready Services (CRSD) | The main clusterware background process. It uses the information in the Oracle Cluster Registry to manage the resources in the cluster. |
| Cluster Synchronization Services (CSSD) | Manages the cluster configuration and node membership. |
| Oracle Process Monitor (OPROCD) | In 11.1, oprocd is responsible for I/O fencing. It was introduced to Linux with the 10.2.0.4 patch set; before that patch set, the kernel hangcheck-timer module performed a similar task. Interestingly, oprocd was already commonly used on non-Linux platforms. Grid Infrastructure replaces the oprocd process with the cssdagent process. |
| Event Manager (EVM) | EVM is responsible for publishing the events that Grid Infrastructure creates. |
| Cluster Time Synchronization Service (CTSS) | CTSS is provided as an option to a Network Time Protocol server for synchronizing time across the cluster; time synchronization is very important to RAC. It runs in one of two modes, observer or active: when NTP is active, CTSS runs in observer mode; when NTP is not started, it synchronizes the clocks of all nodes against the master node. |
| Oracle Notification Service (ONS) | The main background process for publishing events through the Fast Application Notification framework. |
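As a small sketch of how these daemons can be observed on a running node (process names such as ocssd and octssd are the on-disk names of CSSD and CTSS, and the exact list varies with the configuration):

```bash
# Sketch: observing the main Grid Infrastructure background processes on one node.

ps -ef | grep -E 'ohasd|crsd|ocssd|evmd|octssd|oraagent|orarootagent' | grep -v grep

# Resources managed directly by OHAS (the lower stack), including cssd and crsd
crsctl stat res -t -init
```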
In RAC 11.2 the startup sequence of Grid Infrastructure changed significantly. Instead of starting CRS, CSS, and EVM directly from inittab(5), the OHAS process is now responsible for creating the agent processes, monitoring the health of the other nodes, and starting the cluster resources. Among the processes not managed by Oracle, NTP plays a special role: every cluster needs clock synchronization, and Grid Infrastructure is no exception. The main background processes of Grid Infrastructure 11.2 are the ones described in the table above.

Configuring the network components: Grid Infrastructure needs a number of IP addresses to operate normally: one public network address per host; one private network address per host; one virtual IP address per host (unassigned before installation); and one to three unassigned IP addresses for the Single Client Access Name feature. If Grid Plug and Play is used, a further unused virtual IP address is required, to be assigned to the Grid Naming Service.

Node virtual IP addresses are one of the most useful features of Oracle clusters. They must be configured on the same subnet as the public IP addresses and are maintained as cluster resources by Grid Infrastructure. In 9i, when a node failed its public IP address no longer responded to connection requests, and a client session trying to connect to the failed node had to wait for the connection to time out, which could take a long time. With virtual IP addresses this is much faster: when a node fails, Grid Infrastructure fails the node's virtual IP address over to another node in the cluster. When a client session then connects to the virtual IP address of the failed node, Grid Infrastructure knows that the node is not working and the client is connected to the next node in the cluster.

Another requirement, new in Grid Infrastructure, is one to three IP addresses regardless of the size of the cluster. This type of address is called the SCAN (Single Client Access Name). The SCAN is created and configured during the installation or upgrade of Grid Infrastructure. Before installation, these SCAN IP addresses need to be added to DNS for round-robin resolution. If you use the Grid Naming Service (GNS), you need to assign it a virtual IP address on the public network.
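As a hedged post-installation sanity check (cluster-scan.example.com is a placeholder SCAN name and the exact settings depend on the environment), the resulting address configuration can be reviewed with srvctl and a DNS query:

```bash
# Sketch: reviewing SCAN and VIP configuration after installation.
# cluster-scan.example.com is a placeholder; substitute your own SCAN name.

srvctl config scan            # SCAN name and its 1-3 SCAN VIP addresses
srvctl config scan_listener   # SCAN listeners
srvctl config nodeapps        # node VIPs, network and ONS configuration

dig +short cluster-scan.example.com   # DNS should return the SCAN IPs round-robin
```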
From Pro Oracle Database 11g RAC on Linux