For server engineers, HACMP hardly needs an introduction: it is IBM's signature high-availability architecture. Recently, several readers have contacted me by phone and email with questions about HACMP.
Here I am reposting an article from IBM that is well written and accurate, and much better than anything I could explain myself. Everyone will benefit from reading it! Wen Ping
Detailed explanations of Oracle large database systems on AIX/UNIX: 45. What is HACMP?
==============================================================================
The main function of HACMP (High Availability Cluster Multi-Processing) hot-standby software is to improve the reliability of the customer's computer system and its applications as a whole, rather than the reliability of a single host.
I. Working Principle of the HACMP Dual-Host System
HACMP uses the network to monitor the status of hosts, networks, and network adapters. In an HACMP environment there are both TCP/IP and non-TCP/IP networks. The TCP/IP network is the public network accessed by application clients; it can be any network type supported by AIX, such as Ethernet, Token Ring, FDDI, ATM, SOCC, or SLIP. A non-TCP/IP network provides a communication path independent of TCP/IP that HACMP uses to monitor the nodes in the HA environment (the cluster); it can be an RS232 serial line connecting the nodes, or the SCSI card or SSA card of each node set to Target Mode.
1. Two servers (host A and host B) run the HACMP software simultaneously;
2. In addition to running its own applications, each server acts as the backup host for the other;
Host A (running the application): Service_ip 172.16.1.1, Standby_ip 172.16.2.1, Boot_ip 172.16.1.3
Host B (standby): Service_ip 172.16.1.2, Standby_ip 172.16.2.2, Boot_ip 172.16.1.4
3. While the two host systems (A and B) are running, they monitor each other's status through the heartbeat line (including the system software and hardware, network communication, and application status);
4. Once the application on one host is found to have failed (for example, the host has crashed), the backup host immediately starts the application of the faulty host and takes over its resources (including the IP address and the disk space in use), so that the application continues to run on the backup machine;
5. The HA software takes over applications and resources automatically, without manual intervention;
6. When both hosts are working normally, you can also manually switch the applications from one server to the other (the backup server) as needed; a sketch of one way to do this follows.
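A minimal sketch of such a manual switchover, assuming classic HACMP SMIT menus and the node names used later in this article (S85_1 as the active node): stopping cluster services on the active node with the takeover option moves its resource group to the backup node, and restarting cluster services later lets a cascading resource group fall back.
#smit clstop        (on the active node, choose the shutdown mode "graceful with takeover")
#tail -f /tmp/hacmp.out        (on the backup node, watch the takeover events complete)
#smit clstart       (later, restart cluster services on the original node to fall back)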
(Figure: HACMP dual-host system structure)
II. Preparations before installing and configuring HACMP
1. Clearly define the applications to be run by the two server hosts (for example, host A runs the application and host B acts as standby);
2. Assign a Service_ip, Standby_ip, Boot_ip, and heartbeat tty to each application (group), as in the address plan shown above;
3. Create volume groups and allocate disk space according to the application requirements of each host (see the sketch after this list);
4. Modify the operating system parameters of both servers as required by the HA software.
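For item 3, here is a minimal sketch of creating a shared volume group on the first node; the disk names (hdisk2, hdisk3), the major number (60), the logical volume name and the mount point are assumptions for illustration only, while the volume group name datavg matches the resource-group example later in this article.
#mkvg -y datavg -V 60 -n hdisk2 hdisk3    (-V gives the same major number on both nodes; -n: do not vary on at boot)
#varyonvg datavg
#mklv -y oradatalv -t jfs datavg 100
#crfs -v jfs -d oradatalv -m /oradata -A no    (-A no: do not mount automatically at boot)
#varyoffvg datavg
Then import the same volume group on the second node with the same major number:
#importvg -V 60 -y datavg hdisk2
#chvg -a n datavg    (do not activate the volume group automatically at boot)
#varyoffvg datavg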
III. Solution for the IBM HACMP dual-host server system
HACMP is installed and configured as follows:
(1) Install the HACMP software on both servers:
#smit installp
(2) Check whether the software was installed successfully on both hosts:
#/usr/sbin/cluster/diag/clverify
software
cluster
clverify>software
Valid Options are:
lpp
clverify.software> lpp
If no error occurs, the installation is successful.
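Besides clverify, you can also list the installed HACMP filesets directly; in classic HACMP releases their names begin with "cluster.":
#lslpp -l "cluster.*"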
(3) Configure the boot IP address and standby IP address of each server, making sure that the boot network and the standby network can be pinged (use smit tcpip), and then run netstat to check whether the configuration is correct:
#netstat -i
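smit tcpip ultimately runs standard AIX commands; as a rough command-line sketch for host A, assuming en0 carries the boot address and en1 the standby address (the interface names and netmask are assumptions):
#mktcpip -h s85_a -a 172.16.1.3 -m 255.255.255.0 -i en0    (boot address on en0)
#chdev -l en1 -a netaddr=172.16.2.1 -a netmask=255.255.255.0 -a state=up    (standby address on en1)
#netstat -i    (confirm that both interfaces are up with the expected addresses)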
(4) Use smit tty to add a TTY port on each host and configure the heartbeat line (RS232):
#smitty tty
  TTY                         tty0
  TTY type                    tty
  TTY interface               rs232
  Description                 Asynchronous Terminal
  Status                      Available
  Location                    20-70-01-00
  Parent adapter              sa2
  PORT number                 [0]
  Enable LOGIN                disable
  BAUD rate                   [9600]
  PARITY                      [none]
  BITS per character          [8]
  Number of STOP BITS         [1]
Use lsdev -Cc tty to check whether the tty has been configured:
#lsdev -Cc tty
Test the heartbeat line by entering the following commands on the two machines:
S85_1# cat /etc/hosts > /dev/tty0
S85_2# cat < /dev/tty0
If S85_2 receives the information, the heartbeat line is configured correctly.
(5) Detailed configuration and tips
Note: The HACMP configuration (or any configuration change) only needs to be performed on one of the hosts. After the configuration (or modification) is complete, run the synchronization command to propagate the result to the other host. Here S85_1 is chosen for the configuration.
Run smit hacmp on S85_1 and configure it as follows:
#smit hacmp
1. Cluster Configuration
1.1 Configure Cluster Topology
Configure Cluster / Add a Cluster Definition:
  * Cluster ID        [100]
  * Cluster Name      [sb_ha]
Configure Nodes -- add the two nodes:
  * Node Names        [s85_a]
  * Node Names        [s85_b]
Configure Adapters -- define the service, boot and standby addresses and the tty of both machines (a_svc, b_svc, a_boot, b_boot, a_stdby, b_stdby, a_tty, b_tty), for example:
  * Adapter IP Label            a_svc
    Network Type                [ether]
    Network Name                [ethnet]
    Network Attribute           public
    Adapter Function            service
    Adapter Identifier          [172.16.1.1]
    Adapter Hardware Address    []
    Node Name                   [s85_a]
Modify the /etc/hosts and /.rhosts files. Add the following to /etc/hosts:
  172.16.1.1 a_svc
  172.16.1.2 b_svc
  172.16.1.3 a_boot
  172.16.1.4 b_boot
  172.16.2.1 a_stdby
  172.16.2.2 b_stdby
Add the following to /.rhosts:
  a_svc
  b_svc
  a_boot
  b_boot
  a_stdby
  b_stdby
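Before synchronizing, you can list what HACMP has recorded; cllsif is a standard HACMP utility that prints all of the defined adapters and networks, which makes typing mistakes easy to spot:
#/usr/sbin/cluster/utilities/cllsif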
1.2 Synchronize the cluster topology (Cluster Configuration / Cluster Topology / Synchronize Cluster Topology)
During synchronization, you can first perform an Emulate (simulated) synchronization and then perform the Actual synchronization once the simulation completes without errors:
Synchronize Cluster Topology
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[TOP] [Entry Fields]
Ignore Cluster Verification Errors? [No] +
* Emulate or Actual? [Emulate] +
2. Configure Cluster Resources
2.1 Define a Resource Group
Note: When defining a resource group, pay attention to the order of the participating node names; in a cascading resource group, the node listed first has the highest priority.
  Resource Group Name               data_res
  New Resource Group Name           []
  Node Relationship                 cascading
  Participating Node Names          [s85_a s85_b]
2.2 Define Application Servers:
  Server Name                       ora_app
  New Server Name                   []
  Start Script                      [/etc/start]
  Stop Script                       [/etc/stop]
2.3 Modify the resource group attributes (Change/Show Resources for a Resource Group), group data_res:
  Resource Group Name               data_res
  Node Relationship                 cascading
  Participating Node Names          s85_a s85_b
  Service IP label                  [a_svc]
  Filesystems (default is all)      []
  Filesystems Consistency Check     fsck
  Filesystems Recovery Method       sequential
  Filesystems to Export             []
  Filesystems to NFS mount          []
  Volume Groups                     [datavg logvg]
  Concurrent Volume groups          []
  Raw Disk PVIDs                    []
  Application Servers               [ora_app]
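The application server above points to /etc/start and /etc/stop. Below is a minimal ksh sketch of such scripts, assuming the protected application is an Oracle instance with a listener; the oracle user, the startup commands and the script contents are assumptions that must be adapted to your own application, and HACMP only cares that the scripts exist on both nodes and return 0 on success.
#cat /etc/start
#!/bin/ksh
# start script called by HACMP after the resource group has been acquired
su - oracle -c "lsnrctl start"
su - oracle -c "echo startup | sqlplus -s / as sysdba"
exit 0
#cat /etc/stop
#!/bin/ksh
# stop script called by HACMP before the resource group is released
su - oracle -c "echo 'shutdown immediate' | sqlplus -s / as sysdba"
su - oracle -c "lsnrctl stop"
exit 0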
2.4 Synchronize Cluster Resources
During synchronization, you can first perform an Emulate (simulated) synchronization and then perform the Actual synchronization once the simulation completes without errors:
Synchronize Cluster Resources
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[TOP] [Entry Fields]
Ignore Cluster Verification Errors? [No] +
* Emulate or Actual? [Emulate] +
3. Starting and stopping HACMP
(1) Startup Process:
#smit clstart
#tail -f /tmp/hacmp.out
May 22 17:29:23 EVENT COMPLETED: node_up_complete s85_a
If information similar to the above appears in the /tmp/hacmp.out file, HACMP has started properly on the local machine.
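To confirm that the cluster daemons really are running, and to watch the cluster state afterwards, two standard checks (clstat requires the clinfo daemon to be running):
#lssrc -g cluster    (clstrmgr, clsmuxpd and clinfo should show "active")
#/usr/sbin/cluster/clstat    (interactive cluster status monitor)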
(2) Shutdown process:
#smit clstop
4. Test the HACMP Function
After the HACMP configuration is complete and verified, you can start HACMP as in step 3 and test its functions, including whether an application can switch between the two NICs of the same server and between the two servers. A useful command is:
#netstat -in
which shows the address switching.
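A minimal sketch of an NIC switch test, assuming the service address 172.16.1.1 currently sits on en0 of the active node: take the adapter down to simulate an adapter failure and watch the swap_adapter event move the service address to the standby adapter.
#netstat -in    (note which interface currently holds 172.16.1.1)
#ifconfig en0 down    (simulate an adapter failure; unplugging the cable works as well)
#tail -f /tmp/hacmp.out    (watch the swap_adapter event complete)
#netstat -in    (the service address should now appear on the former standby adapter)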
IV. Common troubleshooting methods for HACMP
HACMP diagnoses and responds to three types of faults: (1) NIC (network adapter) faults, (2) network faults, and (3) node faults. The three types are described below.
1. NIC fault
In the HACMP cluster structure, in addition to the TCP/IP network there is also a non-TCP/IP network, which is in fact the "heartbeat" line; it is used to determine whether a node has crashed or only the network has failed. Once a node joins the cluster (that is, HACMP has started properly on that node), every network adapter of the node, as well as the non-TCP/IP network, continuously sends and receives Keep-Alive (K-A) packets. The K-A parameters are tunable; after a certain number of consecutive packets are lost, HACMP concludes that the peer network adapter, network, or node has failed. With K-A packets, HACMP can therefore easily detect a network adapter fault, because once an adapter fails the K-A packets sent to it are lost.
When this happens, the cluster manager (the "brain" of HACMP) on node 1 generates a swap-adapter event and executes the script for that event (HACMP provides event scripts for most common environments, written with standard AIX commands and HACMP utilities). Each node has at least two network adapters: one is the service adapter, which provides service to the outside, and the other is the standby adapter, whose existence is known only to the cluster manager, not to the applications or clients.
When a swap-adapter event occurs, the cluster manager moves the IP address of the original service adapter to the standby adapter, moves the standby address to the faulty adapter, and the other nodes on the network refresh their ARP caches. The adapter switch completes within a few seconds (about 3 seconds on Ethernet) and is transparent to the client: there is only a brief delay, and the connection is not interrupted.
2. Network fault
If all the K-A packets sent to the service and standby adapters on node1 are lost, but the K-A packets on the non-TCP/IP network still arrive, HACMP concludes that node1 itself is still alive and that the network has failed, and it runs the corresponding network-failure (network_down) event.
3. Node fault
Only when the K-A packets are lost on both the TCP/IP network and the non-TCP/IP network does HACMP conclude that the node has failed and generate a node-down event. Resource takeover then occurs: the resources on the shared disk subsystem are taken over by the backup node, which involves a series of operations: acquire the disks, vary on the volume groups, mount the file systems, export the NFS file systems, take over the IP network address, and restart the highly available applications. The IP address takeover and the application restart are implemented by HACMP; the others are implemented by AIX.
When an entire node fails, HACMP moves the service IP address of the faulty node to the backup node so that clients on the network can keep using that IP address; this process is called IP Address Takeover (IPAT). If IPAT is configured, clients on the network are automatically reconnected to the takeover node after a node goes down. Similarly, if application takeover is configured, the application is automatically restarted on the backup node so that the system continues to provide service. For an application to be taken over, you only need to define it as an application server in HACMP and tell HACMP the full path of the start script that starts the application and the full path of the stop script that stops it. As you can see, configuring application takeover in HACMP is very simple; what matters is writing the start and stop scripts, which requires users to understand their own applications.
4. Other faults
HACMP only detects network adapter, network, and node faults and performs the corresponding switch and takeover actions. For other faults, HACMP takes no action by default.
A. Hard Disk fault
Usually we use RAID-5 or mirroring to provide high availability for the disks. RAID-5 distributes the parity information across the disks in a group, so when one disk in a group fails, its data can be reconstructed from the parity information on the other disks in the group; RAID-5 is generally implemented in hardware. If two disks in the same group fail, however, the data on them may be lost. Mirroring writes the same data to at least two physical disks, so it is less space-efficient than RAID-5 and consumes more disk capacity, but it is safer than RAID-5 and easy to implement: it can be set up simply through the Logical Volume Manager (LVM) in AIX.
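A minimal sketch of mirroring an existing volume group with AIX LVM, assuming datavg currently lives on hdisk2 and a free disk hdisk3 is available:
#extendvg datavg hdisk3    (add the second physical disk to the volume group)
#mirrorvg datavg hdisk3    (create a second copy of every logical volume on hdisk3)
#syncvg -v datavg    (make sure all partition copies are synchronized)
#lsvg -l datavg    (each logical volume should now show two copies: PPs = 2 x LPs)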
B. Hard disk controller fault
Storage devices connect to the host through a controller card: SCSI devices through a SCSI adapter, SSA devices through an SSA adapter. If the adapter fails, the attached storage cannot be used. There are several ways to solve this problem.
One way is to use multiple adapters: each host has two or more adapters, each connected to one copy of the mirrored data. Then, whether a disk or an adapter fails, the host can still access a good copy of the data, and there is no single point of failure. This method is not difficult to implement, but it requires multiple adapters and mirrored data; it does not need to be implemented through HACMP.
Another method is to use only one adapter and handle the failure with the Error Notification Facility in HACMP.
The Error Notification Facility is a monitoring mechanism for other devices: any error reported to AIX can be captured and responded to, and HACMP provides a SMIT interface that simplifies its configuration.
We already know that LVM can mirror a disk. When one disk fails, a copy of the data still exists on the mirror disk and can still be read and written, but the data is no longer redundant: if the mirror disk also fails, all the data is lost. In this case, the Error Notification Facility can be used, for example, to display the PV-missing (LVM_PVMISS) message on the console, prompting the administrator to examine the error log, locate the fault, and repair it in time. Here HACMP provides the interface and AIX provides the underlying function; together they monitor the occurrence of the fault.
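Under the covers, such a notification is just a stanza in the ODM class errnotify; HACMP's SMIT screens build it for you, but a hand-written sketch looks roughly like the following. The error label, the notification script and the file names are assumptions; confirm the exact label on your system with errpt -t before using it.
#cat /tmp/pvmiss.add
errnotify:
        en_name = "pvmiss_notify"
        en_persistenceflg = 1
        en_label = "LVM_SA_PVMISS"
        en_class = "H"
        en_type = "UNKN"
        en_method = "/usr/local/bin/notify_pvmiss $1"
#odmadd /tmp/pvmiss.add    (register the notification method)
#odmget -q "en_name=pvmiss_notify" errnotify    (verify that it was added)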
C. Application faults
If your application makes kernel calls or runs as root, then when the application fails the operating system may go down with it; in that case the node itself has failed, and HACMP takes the appropriate actions. If only the application dies while AIX keeps running normally, HACMP at most provides monitoring through the Error Notification Facility and takes no action on the application itself. However, if the application uses the API provided by the AIX SRC (System Resource Controller) mechanism, it can be restarted automatically after it dies. Besides the SRC API, clinfo in HACMP also provides such an API.
clinfo is the cluster information daemon; it maintains the status information of the entire cluster, and the clinfo API allows applications to take actions based on that status information.
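For the SRC route mentioned above, a minimal sketch: defining the application as an SRC subsystem with the -R flag makes SRC restart it automatically when it stops abnormally. The subsystem name, program path and user are assumptions for illustration.
#mkssys -s myapp -p /usr/local/bin/myappd -u 0 -R    (-R: restart the subsystem when it ends abnormally)
#startsrc -s myapp
#lssrc -s myapp    (check that the subsystem is active)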
D. HACMP fault
If the HACMP process on a node in the cluster goes down, HACMP escalates it to a node fault, resulting in resource takeover.
As described above, HACMP itself is only responsible for diagnosing network adapter, network, and node faults, for switching or taking over IP addresses, and for taking over the system resources (hardware, file systems, applications, and so on). For faults other than these three types, you can combine the basic features of AIX with the mechanisms provided by HACMP, such as the Error Notification Facility and the clinfo API, to monitor the faults and take the corresponding measures.
Wen Ping