Context
You have decided to use clustering to provide highly available services when designing or modifying the infrastructure layer.
Problem
How should you design a highly available infrastructure layer so that the failure of a single server, or of the software it runs, does not cause a loss of service?
Impact factors
When designing a highly available infrastructure layer, consider the following factors:
- A failed hardware component, application, or service can make an application unusable or unavailable. For example, imagine a power supply failure on a server that hosts an application. If this is the only server, or the only power supply in the server, it is a single point of failure and the application becomes unavailable.
- Planned server downtime can affect the availability of the application. For example, to patch the operating system on a database server that has no standby server, you may have to take the application offline while the patch is installed.
- Monitoring and maintaining a multi-server tier increases the demand on system and network resources.
- Applications that use failover clustering may require special coding so that the failover process is transparent to the user and the application remains available when a failure occurs. For example, placing time-outs and retries in the code that saves data to the database helps ensure that the transaction completes even if a failover occurs.
Solution
Install applications or services on multiple servers that are configured to take over for one another when a failure occurs. The process by which one server takes over for a failed server is commonly called "failover." A failover cluster is a set of servers configured so that if one server becomes unavailable, another server automatically takes over for it and continues processing. Each server in the cluster has at least one other server in the cluster identified as its standby server.
Detecting faults
For a standby server to become the active server, it must first determine that the active server is no longer functioning properly. Typically, the system uses one of the following general types of heartbeat mechanisms to do this:
- Send a signal. In the send-signal approach, the active server sends a specified signal to the standby server at a defined interval. If the standby server does not receive a signal within a certain period, it concludes that the active server has failed and assumes the active role. For example, the active server sends status messages to the standby server every 30 seconds. Because of a memory leak, the active server eventually runs out of memory and crashes. The standby server notices that no status message has arrived for 90 seconds (three intervals), so it takes over the work of the active server.
- Receive a signal. In the receive-signal approach, the standby server sends requests to the active server. If the active server does not respond, the standby server repeats the request a specific number of times. If the active server still does not respond, the standby server takes over its work. For example, a standby server might send a getcustomerdetails message to the active server every minute. The active server eventually crashes because of a memory leak. The standby server sends the getcustomerdetails request three times without receiving a response. At this point, the standby server takes over the work of the active server.
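The two mechanisms above can be sketched as follows. This is a minimal illustration, not any cluster product's implementation: the names `PushHeartbeatMonitor`, `pull_check`, and `probe` are hypothetical, and timestamps are passed in explicitly so that the logic can be exercised without waiting in real time. The defaults mirror the figures in the examples (30-second interval, three missed intervals, three unanswered requests).

```python
from typing import Callable, Optional


class PushHeartbeatMonitor:
    """Standby-side tracker for signals sent by the active server."""

    def __init__(self, interval_s: float = 30.0, missed_allowed: int = 3) -> None:
        # Declare failure after this long without a signal (3 x 30 s = 90 s).
        self.timeout_s = interval_s * missed_allowed
        self.last_seen: Optional[float] = None

    def record_heartbeat(self, now: float) -> None:
        """Call whenever a signal arrives from the active server."""
        self.last_seen = now

    def active_has_failed(self, now: float) -> bool:
        if self.last_seen is None:
            return False  # assumption: never judge before the first signal
        return now - self.last_seen >= self.timeout_s


def pull_check(probe: Callable[[], bool], max_attempts: int = 3) -> bool:
    """Return True if the active server failed to answer every request.

    `probe` is a hypothetical callable that sends one request (for example
    getcustomerdetails) and returns True when a valid response arrives.
    """
    for _ in range(max_attempts):
        try:
            if probe():
                return False  # got a response; the active server is alive
        except OSError:
            pass  # network errors and time-outs count as a missed response
    return True
```

In a running standby process, the `now` arguments would typically come from a monotonic clock such as `time.monotonic()`, so that wall-clock adjustments do not trigger a false failover.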
A cluster can use multiple levels of signaling. For example, a cluster can use send signals at the server level and a set of receive signals at the application level. In this configuration, the active server sends heartbeat messages to the standby server for as long as it is running and connected to the network. These heartbeat messages are sent at frequent intervals (for example, every 5 seconds), and the standby server may be programmed to take over the work of the active server after as few as two missed heartbeat messages. That is, the standby server detects the failure and starts the failover process within 10 seconds after the active server fails.
Signals are commonly sent through a dedicated communication channel so that network congestion and general network problems do not cause false failovers. In addition, the standby server may send query messages to one or more critical applications running on the active server and wait for a response within a specified time-out interval. If the standby server receives a correct response, it takes no further action. To minimize the performance impact on the active server, application-level queries are typically sent at longer intervals, such as every minute or longer. The standby server may be programmed to wait until at least five requests have gone unanswered before taking over from the active server. This means that the standby server may not start the failover process for up to 5 minutes.
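The detection delays quoted above follow from multiplying the signal interval by the number of missed signals tolerated. A small sketch of that arithmetic, using a hypothetical `HeartbeatLevel` type and ignoring any per-request time-out:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartbeatLevel:
    """One level of signaling and its failover threshold (illustrative only)."""

    name: str
    interval_s: float
    missed_before_failover: int

    def worst_case_detection_s(self) -> float:
        # Approximate time from failure until the standby starts failover.
        return self.interval_s * self.missed_before_failover


# Values taken from the examples in the text.
server_level = HeartbeatLevel("server", interval_s=5.0, missed_before_failover=2)
application_level = HeartbeatLevel("application", interval_s=60.0, missed_before_failover=5)

for level in (server_level, application_level):
    print(f"{level.name}: about {level.worst_case_detection_s():.0f} s to detect a failure")
```

The trade-off is visible in the numbers: shorter intervals and lower thresholds detect failures sooner but add load and raise the risk of a false failover.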
Synchronizing state
The standby server must first synchronize its state with the state of the failed server before it can begin processing transactions. There are three different methods of synchronization:
- Transaction log. In the transaction log method, the active server records every change to its state in a log. A synchronization utility periodically processes this log to bring the state of the standby server in line with the state of the active server. When the active server fails, the standby server must use this synchronization utility to process any entries added to the transaction log since the last update. After the state is synchronized, the standby server becomes the active server and begins processing transactions.
- Hot standby. In the hot standby method, updates to the internal state of the active server are immediately replicated to the standby server. Because the state of the standby server is a clone of the active server's state, the standby server can immediately become the active server and begin processing transactions.
- Shared storage. In the shared storage method, both servers keep their state on a shared storage device, such as a storage area network or a dual-hosted disk array. Failover can occur immediately because no state synchronization is needed.
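As an illustration of the transaction log method, the sketch below replays only the log entries added since the last synchronization. The classes, the `(key, value)` entry format, and the in-memory log are assumptions made for the example; a real deployment would need the log to live somewhere that survives the failure of the active server.

```python
from typing import Dict, List, Tuple

LogEntry = Tuple[str, int]  # (key, new value); a hypothetical entry format


class ActiveServer:
    def __init__(self) -> None:
        self.state: Dict[str, int] = {}
        self.log: List[LogEntry] = []

    def apply(self, key: str, value: int) -> None:
        self.state[key] = value
        self.log.append((key, value))  # every state change is also logged


class StandbyServer:
    def __init__(self) -> None:
        self.state: Dict[str, int] = {}
        self.applied = 0  # number of log entries already replayed

    def sync(self, log: List[LogEntry]) -> int:
        """Replay entries added since the last sync; return how many."""
        pending = log[self.applied:]
        for key, value in pending:
            self.state[key] = value
        self.applied = len(log)
        return len(pending)


active, standby = ActiveServer(), StandbyServer()
active.apply("balance", 100)
standby.sync(active.log)          # periodic synchronization
active.apply("balance", 80)       # change made just before a failure
standby.sync(active.log)          # final catch-up before taking over
print(standby.state == active.state)
```

The final `sync` call corresponds to the catch-up step that the standby server must complete before it can begin processing transactions.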
Identify the Active server
It is extremely important that there is only one active server for a given set of applications. If more than one server behaves as the active server, data corruption and deadlocks usually result. A common way to address this problem is to use a variant of the active token concept. At its simplest, the token is a flag that identifies a server as the active server for an application. There is only one active token for each group of applications, so only one server can hold it. When a server starts, it checks whether its partner owns the active token. If so, the server starts as a standby server. If it does not detect an active token, it takes ownership of the token and starts as the active server. When a standby server becomes the active server, the failover process hands the active token to it.
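A minimal sketch of the active token idea follows. The `ActiveToken` class and `start_role` function are hypothetical, and the in-process lock only stands in for whatever shared, atomically updatable resource a real cluster would use to hold the token.

```python
import threading
from typing import Optional


class ActiveToken:
    """One token per application group; at most one owner at a time."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._owner: Optional[str] = None

    def try_acquire(self, server: str) -> bool:
        with self._lock:  # the check and the claim must happen atomically
            if self._owner is None:
                self._owner = server
            return self._owner == server

    def transfer(self, from_server: str, to_server: str) -> bool:
        """Hand the token over during failover; fails if ownership changed."""
        with self._lock:
            if self._owner != from_server:
                return False
            self._owner = to_server
            return True

    @property
    def owner(self) -> Optional[str]:
        with self._lock:
            return self._owner


def start_role(token: ActiveToken, server: str) -> str:
    """Decide the startup role: active if the token is free, else standby."""
    return "active" if token.try_acquire(server) else "standby"


token = ActiveToken()
print(start_role(token, "DATABASE01"))  # first server to start
print(start_role(token, "DATABASE02"))  # partner already owns the token
```

Making the check and the claim a single atomic step is the point of the design: two servers that start at the same moment cannot both conclude that the token is free.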
In most cases, the switch from standby server to active server is transparent to the applications and users that the server supports. If a failure occurs during a transaction, however, the transaction may have to be retried before it completes successfully. This makes it all the more important to write application code that keeps the failover process transparent, for example by including retries and time-outs when committing data to the database.
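One way to add such retries is sketched below. The function name `save_with_retry`, its parameters, and the choice of exception types are assumptions for illustration; real database drivers raise their own error classes. Retrying is only safe when the save operation is idempotent or when a failed transaction is fully rolled back.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def save_with_retry(
    save: Callable[[], T],
    attempts: int = 5,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call save() until it succeeds or the attempts run out.

    `save` is a hypothetical callable that performs one complete
    transaction and raises ConnectionError or TimeoutError when the
    database is unreachable, for example while a failover is in progress.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return save()
        except (ConnectionError, TimeoutError):
            if attempt == attempts:
                raise  # out of retries; surface the last error to the caller
            sleep(delay_s)  # give the standby server time to take over
```

The total waiting time (`attempts` multiplied by `delay_s`) should be chosen to exceed the time the cluster needs to detect the failure and complete the failover.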
In addition, most servers use Internet Protocol (IP) addresses for communication, so for a failover to succeed, the infrastructure must support transferring an IP address from one server to another. For example, you can use a network switch that supports IP address transfer. If the system infrastructure does not support this, you may need to use a load-balanced cluster instead of a failover cluster. For more information, see the Load-Balanced Cluster pattern.
Scaling failover cluster servers
Scalability in a failover cluster is typically achieved by scaling up the individual servers within the cluster, that is, by adding more capacity to them. Two points are important to understand: the failover cluster must be designed to handle the expected load, and each server must be sized to accommodate the expected growth in CPU, memory, and disk usage. Failover cluster servers are typically high-end multiprocessor servers configured with multiple redundant subsystems for high availability. If the resource requirements of the solution exceed the limits of the servers in the cluster, the cluster is extremely difficult to scale.
Example
To help you better understand how to use failover clustering for high availability, the following discussion walks through refactoring an existing basic solution, which contains a single point of failure, into a highly available solution.
Non-failover solution
Initially, the organization may have only a basic solution architecture (for example, the architecture outlined in Figure 1). While the solution may meet the initial availability requirements, some factors, such as the increase in the number of users or the need for less application downtime, may force you to make changes to the design.
Figure 1: non-failover solution with a single point of failure
In Figure 1, the data tier contains only one database server (DATABASE10) that serves the application layer. If the database server or the software it is running fails, the application server will no longer be able to access the data that is used to serve the client. This will make the application unavailable to clients.
Failover Clustering Solution
To improve the availability of the solution, the organization may decide to eliminate the potential point of failure caused by the single database server in the data tier. To do this, you can add a server to the data tier and create a failover cluster from the existing database server, the new server, and a shared storage device. Figure 2 shows this change: the cluster consists of two servers connected to a shared storage array.
Figure 2: solution with failover data tier
The first server (DATABASE01) is the active server and handles all transactions. The second server (DATABASE02) is idle and processes transactions only if DATABASE01 fails. The cluster exposes a virtual IP address and host name (DATABASE10) on the network, which clients and applications use.
Note: You can extend this design to include multiple active servers (in addition to the one shown), either sharing a single standby server or with each active server configured as the standby server for another active server.
Resulting context
The Failover Cluster pattern has the following advantages and disadvantages:
Advantages
- Accommodates planned downtime. A failover cluster allows individual systems to be taken down without affecting availability, which accommodates routine maintenance and upgrades.
- Reduces unplanned downtime. Failover clustering reduces the application downtime associated with server and software failures by eliminating single points of failure at the system and application levels.
Disadvantages
- Can increase response time. A failover cluster design can increase response time because of the additional load placed on the standby server or the need to update state information on more than one server.
- Increases equipment costs. The additional hardware that a failover cluster requires can easily double the cost of the infrastructure tier.