Repost: Linux Cluster - HA Discussion

Source: Internet
Author: User
Tags: failover

A high-availability (HA) cluster connects several servers and provides failover through dedicated software. Availability refers to the system's uptime: in a 24x7x365 environment, 99% availability still allows 87 hours and 36 minutes of downtime per year, and for critical services that much failure time is usually unacceptable. That is why failure recovery is introduced, to meet availability requirements as high as 99.999% (a quick downtime calculation follows the list below). Let's start with a few concepts:
    1. Service: a resource provided by the HA cluster, such as a floating IP, shared storage, or Apache.
    2. Member server (also called a node): a server in the HA cluster that actually runs the services.
    3. Failover domain: a collection of servers that provide resources in the HA cluster; when one member fails, its services can be switched to another healthy member. A failover domain typically contains two member servers (when virtualization is not in play).
    4. Heartbeat: a method of monitoring the state of member servers; heartbeats are typically carried over a network cable or a serial line.
    5. Single point of failure (SPOF): a part of the system which, if it fails or stops running, brings the whole system down. In HA setups, dual power supplies, multiple NICs, and dual switches are commonly used to avoid a SPOF.
    6. Quorum: a method of storing member server information on a shared disk so that the cluster can accurately determine whether servers and the services they provide are healthy. Shared state information includes whether the cluster is active; service state information includes whether a service is running and which member is running it. Each member checks this information to keep the others up to date. In a two-member cluster, each member periodically writes a timestamp and cluster state information to the two shared cluster partitions on the shared disk. To ensure proper cluster operation, a member that cannot write to both the primary and the shadow shared cluster partition at startup is not allowed to join the cluster. In addition, if a cluster member stops updating its timestamp, or the system's heartbeats fail, the member is removed from the cluster.
    7. Fence device: when a node runs into a problem, the other node uses the fence device to restart it, which avoids manual intervention and prevents the failed node from still accessing shared storage and corrupting the file system. Fence devices include external power managers such as APC power switches; many servers also have one built in, under a different name per vendor: HP calls it iLO, IBM calls it the BMC, and Dell calls it the DRAC.
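Back to the availability numbers in the opening paragraph: the yearly downtime budget is simply (1 - availability) * 365 * 24 hours. A minimal sketch to tabulate a few levels (the awk one-liner and the chosen levels are just an illustration; 99% reproduces the 87h36m figure above):

# Downtime budget per year for a few availability levels
awk 'BEGIN {
    for (i = 1; i < ARGC; i++) {
        a = ARGV[i] + 0                      # availability, in percent
        h = (1 - a / 100) * 365 * 24         # allowed downtime, hours per year
        printf "%g%% availability -> %dh %02dm downtime/year\n", a, h, (h - int(h)) * 60
    }
}' 99 99.9 99.99 99.999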
Below we take Red Hat Cluster Suite (RHCS) as an example and briefly walk through building an HA cluster. RHCS currently comes in versions V3, V4, and V5, and configurations cannot be carried between versions. RHCS V3/V4 on Red Hat Enterprise Linux 3.0 and 4.0 supports a maximum of 16 nodes; RHCS V5 on Red Hat Enterprise Linux 5.0 supports a maximum of 128 nodes. A Red Hat high-availability cluster is composed of:
    1. Cluster Configuration System (CCS): manages the cluster.conf file
    2. Cluster Manager (CMAN): the cluster manager
    3. Distributed Lock Manager (DLM): the distributed lock manager
    4. Fence: the I/O fencing system (fence devices)
    5. Resource Group Manager (rgmanager): monitors, starts, and stops applications, services, and resources
    6. Quorum Disk: the quorum disk
    7. Conga: the RHCS web management suite, consisting of luci and ricci
    8. system-config-cluster: a graphical tool for managing the machines in a cluster
In this HA environment, everything from the firewall down to the server NICs is redundant. The switches here are Cisco, with HSRP enabled. Environment: two IBM BladeCenter HS21 servers (two quad-core processors at 2.5 GHz or above, 8 GB of memory, 2 x 146 GB disks, integrated redundant dual-port NIC). Fence device: the Advanced Management Module (AMM), the free lunch IBM ships with the chassis (well, presumably you paid for it =.=!); its IP is 192.168.110.47, the username is USERID, and the password is PASSW0RD. OS: Red Hat Enterprise Linux Advanced Server 4 Update 6. Preparation:
1. Configure bonding:
    • Create a new /etc/sysconfig/network-scripts/ifcfg-bond0 with the following content:
DEVICE=bond0
BOOTPROTO=static
BROADCAST=192.168.100.255
IPADDR=192.168.100.21
NETMASK=255.255.255.0
ONBOOT=yes
TYPE=Ethernet
    • Edit the real NIC eth0 to enslave it to the bond:
# vi /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
BOOTPROTO=none
MASTER=bond0
SLAVE=yes
ONBOOT=yes
    • Edit the real NIC to bind eth1
# vi/etc/sysconfig/network-scripts/ifcfg-eth1
device=eth1
Bootproto=none
master=bond0
Slave=yes
Onboot=yes
    • Configure the bonding mode:
Edit the /etc/modprobe.conf file and add the following two lines, so that the system loads the bonding module at startup and exposes the virtual network interface bond0:
# vi /etc/modprobe.conf
alias bond0 bonding
options bond0 miimon=100 mode=1 primary=eth0
Note: miimon is for link monitoring. For example, miimon=100 means the system checks the link state every 100 ms and, if one line loses its connection, switches to the other. The value of mode selects the working mode; there are several (0, 1, 2, 3, 4, ...), with 0 and 1 the most common: mode=0 (round-robin) provides load balancing, with both NICs working; mode=1 (active-backup) provides redundancy, working in a primary/standby fashion where by default only one NIC is active and the other stands by; primary specifies which NIC is active after startup.
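Once bond0 is up, mode=1 failover is easy to exercise by hand. A quick sketch (ifdown/ifup rely on the RHEL network scripts configured above; physically pulling the cable tests the same path):

# Check which slave is currently carrying traffic (expect eth0, the primary)
grep "Currently Active Slave" /proc/net/bonding/bond0
ifdown eth0     # simulate a link failure on the primary
grep "Currently Active Slave" /proc/net/bonding/bond0   # should now report eth1
ifup eth0       # with primary=eth0, the bond switches back automatically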
    • Modify the default route:
As described in the environment section, the 192.168.100.0 segment carries the production traffic, so we need to point the default route at it:
# more /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=am1
GATEWAY=192.168.100.2
    • Restart the network service to apply the configuration:
# /etc/init.d/network restart
Bringing up interface bond0:  [ OK ]
Bringing up interface eth0:   [ OK ]
Bringing up interface eth1:   [ OK ]
# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v2.6.3-rh (June 8, 2005)
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Link Failure Count: 1
Permanent HW Addr: xx:xx:xx:xx:xx:xx

Slave Interface: eth1
MII Status: up
Link Failure Count: 1
Permanent HW Addr: xx:xx:xx:xx:xx:xx
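To confirm the GATEWAY setting took effect, a quick look at the routing table (route -n is standard net-tools; the default-route line should look roughly like this):

# route -n | grep '^0.0.0.0'
0.0.0.0         192.168.100.2   0.0.0.0         UG    0      0        0 bond0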
    • Configure bond1 the same way; server AM2 is configured the same as AM1.
2. Modify the /etc/hosts file to add the host names and IP addresses of the nodes. The file must be identical on both blades:
# vi /etc/hosts
127.0.0.1      localhost.localdomain localhost
192.168.110.21 amcluster1
192.168.100.21 am1
192.168.110.22 amcluster2
192.168.100.22 am2
3. Install the RHCS suite:
# mount -o loop rhel-4-u6-rhcs-i386-disc1.iso /mnt
# cd /mnt
# ./autorun
An "install cluster software" dialog pops up; select the packages to install. Packages whose names mention Xen, largesmp, hugemem, luci, or ricci can be left unselected. Click Next; if no error message appears, the installation succeeded. If you select many packages you may be asked to insert the Red Hat installation disc. Note: you must turn off SELinux and iptables; of course, if you know which ports the cluster software uses, opening them on the firewall works too ^_^.
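On the SELinux/iptables note above, one way to disable both on RHEL4 is sketched below (the sed edit to /etc/selinux/config only takes effect at the next reboot; adjust to your own security policy):

# service iptables stop && chkconfig iptables off
# sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
# setenforce 0      # drop to permissive immediately, without waiting for a reboot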
4. Stop all cluster services on both machines:
# /etc/init.d/rgmanager stop
# /etc/init.d/fenced stop
# /etc/init.d/cman stop
# /etc/init.d/ccsd stop
Modify the /usr/share/system-config-cluster/faildomcontroller.py file: line 213 is missing an "="; this is a bug in the cluster suite itself.
5. Write the cluster startup and shutdown scripts:
# more /root/cluster.sh
#!/bin/sh
start() {
    /etc/init.d/ccsd start
    /etc/init.d/cman start
    /etc/init.d/fenced start
    /etc/init.d/rgmanager start
}
stop() {
    /etc/init.d/rgmanager stop
    /etc/init.d/fenced stop
    /etc/init.d/cman stop
    /etc/init.d/ccsd stop
}
status() {
    /etc/init.d/rgmanager status
    /etc/init.d/fenced status
    /etc/init.d/cman status
    /etc/init.d/ccsd status
}
case "$1" in
    start)
        start
        ;;
    stop)
        stop
        exit 0
        ;;
    restart|reload)
        stop
        start
        RETVAL=$?
        ;;
    status)
        status
        ;;
    *)
        echo $"Usage: $0 {start|stop|restart|status}"
        exit 1
esac
Modify permissions:
# chmod 700 /root/cluster.sh
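With the script in place on both nodes, the whole stack is then one command away (paths as above):

# /root/cluster.sh start     # brings up ccsd, cman, fenced, rgmanager in order
# /root/cluster.sh status    # checks all four daemons
# /root/cluster.sh stop      # tears the stack down in reverse order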
Start configuring the cluster (using the RHCS V5 version of the system-config-cluster tool):
1. Under X Windows, run:
# system-config-cluster
to open the graphical configuration tool.
2. Click the "Create New Configuration" button, enter the file name and multicast address, then click OK. Next, select "Cluster" on the left and choose "Edit Cluster Properties" to bring up the "Cluster Properties" dialog with the cluster's name and related properties. "Post-Join Delay" is the number of seconds the fence daemon waits for a new node to confirm after receiving its join request; the default is 3, and a typical setting is 20 to 30, depending on your network and cluster. "Post-Fail Delay" is how long the fence daemon waits after detecting a failure before kicking the node out of the failover domain; the default of 0 means the node is kicked out immediately, and the right value depends on your cluster network.
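For reference, these two values end up as attributes of the fence_daemon element in cluster.conf; with the defaults kept, as later in this article, the line reads:

<fence_daemon post_fail_delay="0" post_join_delay="3"/>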
3. Select "Cluster Nodes" to add the member nodes: click "Add a Cluster Node", enter amcluster1, and click OK. The node name must match the machine name configured in the /etc/hosts file. Add amcluster2 the same way.
4. Select "Fence Devices" and click "Add a Fence Device". Fence devices must be configured according to your actual hardware; this installation uses IBM blade servers managed through the IBM BladeCenter management module, whose IP is 192.168.110.47, username USERID, password PASSW0RD. Click OK.
5. Select the node amcluster1 you just added and click "Manage Fencing for This Node". In the dialog that pops up, click "Add a New Fence Level", then select Fence-Level-1 on the left and click "Add a New Fence to This Level". In the dialog box that follows, enter the slot the blade sits in and click OK. Configure the fence device for amcluster2 the same way.
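Before relying on fencing, it is worth driving the agent by hand once. A sketch, assuming the classic RHEL4-era fence_bladecenter options (-a address, -l login, -p password, -n blade slot, -o action; check fence_bladecenter -h on your system):

# fence_bladecenter -a 192.168.110.47 -l USERID -p PASSW0RD -n 1 -o status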
6. Create the failover domain and add the corresponding nodes: select "Failover Domains" and click "Create a Failover Domain" in the lower right corner. In the popup dialog, enter the name of the failover domain; here we enter AMHA and save.
7. Add the failover domain's member nodes, and note the two checkboxes on the right: "Restrict Failover to This Domain's Members" and "Prioritized List". If you have only two machines, ticking the first one is enough; it restricts failover to just these two machines. "Prioritized List" is for when you have more than two servers and need to set each one's takeover priority; once it is ticked, you can use "Adjust Priority" to order the nodes.
8. Create the resources: select "Resources", click "Create a Resource" in the lower right corner, select "IP Address", and fill in the floating IP. Then add the Tomcat startup script the same way.
9. Create the service: click "Create a Service" in the lower right corner, click OK in the dialog that appears, and configure it as follows. Notes: in the drop-down box, select the failover domain name AMHA that we just created. For the recovery policy, "Restart" means the cluster tries to restart the service in place when it fails, while "Relocate" means it tries to switch the service to the other node. After these two settings, click the button in the lower left corner of the popup dialog to add the resources the service should switch. A reminder: it is best to nest the service script under the IP resource, so the service is only offered where the NIC is up, which is logically correct. The flow: select "Create a New Resource for This Service", choose "IP Address" so the address resource is loaded, select the newly created address, then select "Attach a New Private Resource to the Selection", choose "Script", and add the script our service needs.
10. Save the configuration and synchronize it to the standby node. The final cluster.conf should be structured as follows:
<?xml version="1.0"?>
<cluster alias="AMHA" config_version="2" name="AMHA">
    <fence_daemon post_fail_delay="0" post_join_delay="3"/>
    <clusternodes>
        <clusternode name="amcluster1" nodeid="1" votes="1">
            <multicast addr="225.0.0.1" interface="bond0"/>
            <fence>
                <method name="1">
                    <device blade="1" name="AMM"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="amcluster2" nodeid="2" votes="1">
            <multicast addr="225.0.0.1" interface="bond0"/>
            <fence>
                <method name="1">
                    <device blade="2" name="AMM"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <cman expected_votes="1" two_node="1">
        <multicast addr="225.0.0.1"/>
    </cman>
    <fencedevices>
        <fencedevice agent="fence_bladecenter" ipaddr="192.168.110.47" login="USERID" name="AMM" passwd="PASSW0RD"/>
    </fencedevices>
    <rm>
        <failoverdomains>
            <failoverdomain name="AMHA" ordered="0" restricted="1">
                <failoverdomainnode name="amcluster1" priority="1"/>
                <failoverdomainnode name="amcluster2" priority="1"/>
            </failoverdomain>
        </failoverdomains>
        <resources>
            <ip address="192.168.100.18" monitor_link="1"/>
            <script file="/etc/init.d/tomcat" name="tomcat"/>
        </resources>
        <service autostart="1" domain="AMHA" name="AMHA">
            <ip ref="192.168.100.18">
                <script ref="tomcat"/>
            </ip>
        </service>
    </rm>
</cluster>
The file can be copied to the other node with:
# scp /etc/cluster/cluster.conf am2:/etc/cluster/
You can also use the system-config-cluster graphical tool and simply click the "Send to Cluster" button to push it out. You can likewise tell CCS to pick up the new configuration version using the tools RHCS provides:
1. # ccs_tool update /etc/cluster/cluster.conf
2. # cman_tool status | grep "Config version" to find out the configuration version the cluster is currently using; if the returned version differs from your new one, update it with # cman_tool version -r <new_version_number>
3. Verify that the latest version is in effect.
The above applies to RHEL4; on RHEL5 you only need ccs_tool update /etc/cluster/cluster.conf.
At this point you can start the services using the cluster startup script written earlier.
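Once both nodes are up, it is worth confirming membership and rehearsing a manual switchover; clustat and clusvcadm ship with rgmanager, and the service name AMHA matches the cluster.conf above:

# clustat                           # both members should show Online, with service AMHA started
# clusvcadm -r AMHA -m amcluster2   # relocate the service to the other node as a failover drill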
