Introduction
A physical failure, an operating system failure, or a SQL Server failure can cause a session between two availability replicas to fail. An availability replica does not periodically check the components on which Sqlservr.exe depends to verify whether they are working correctly or have failed. For some types of failure, however, the affected component reports an error to Sqlservr.exe. An error reported by another component is called a "hard error."
To detect other failures that would otherwise go unnoticed, Always On availability groups implement their own session-timeout mechanism. The session-timeout period is specified in seconds. It is the maximum time a server instance waits to receive a PING message from another instance before it considers that instance disconnected. When a session timeout occurs between two availability replicas, the replicas assume that a failure has occurred and declare a "soft error."
Failures caused by hard errors
Possible causes of hard errors include (but are not limited to) the following situations:
- A disconnected connection or network cable
- A failed network card
- Router changes
- Firewall changes
- Endpoint reconfiguration
- Loss of the drive where the transaction log resides
- Operating system or process failure
For example, if the log drive of the primary database stops responding or fails, the operating system notifies Sqlservr.exe that a critical error has occurred.
Some components, such as network components and some I/O subsystems, use their own time-out periods to determine that a failure has occurred. These time-out periods are independent of Always On availability groups, which have no knowledge of them and no awareness of their behavior. In these cases, the component's time-out delay increases the time between the failure and the moment the availability replica receives the resulting hard error.
Failures caused by soft errors
Scenarios that may cause session timeouts include (but are not limited to) the following:
- Network errors such as TCP link timeouts, dropped or corrupted packets, or packets arriving out of order.
- An operating system, server, or database that is not responding (hung state).
- A Windows server timing out.
- Insufficient computing resources, such as CPU or disk overload, the transaction log filling up, or the system running out of memory or threads. In these cases, you need to increase the time-out period, reduce the workload, or change the hardware so that it can handle the workload.
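Increasing the time-out period is done per replica with the `SESSION_TIMEOUT` option of `ALTER AVAILABILITY GROUP ... MODIFY REPLICA`. The following is a minimal sketch, in Python, of a helper that builds that statement as text. The helper name, the group name `AG1`, and the server name `SQLNODE2` are hypothetical placeholders, and the 5-second lower bound reflects the documented minimum as I understand it, so verify it for your SQL Server version before relying on it.

```python
def build_session_timeout_statement(group_name: str, replica_server: str, seconds: int) -> str:
    """Build the T-SQL that sets SESSION_TIMEOUT for one availability replica.

    Hypothetical helper for illustration: it only produces the statement text.
    The statement itself would be run on the server hosting the primary replica.
    """
    # Assumption: 5 seconds is the documented minimum; 10 seconds is the default.
    if not isinstance(seconds, int) or seconds < 5:
        raise ValueError("session timeout must be an integer of at least 5 seconds")

    # Quote the group name as a bracketed identifier and the server as a string literal.
    group = "[" + group_name.replace("]", "]]") + "]"
    server = "N'" + replica_server.replace("'", "''") + "'"

    return (
        f"ALTER AVAILABILITY GROUP {group} "
        f"MODIFY REPLICA ON {server} "
        f"WITH (SESSION_TIMEOUT = {seconds});"
    )


# Placeholder names; substitute your own availability group and replica.
print(build_session_timeout_statement("AG1", "SQLNODE2", 15))
```

A longer time-out makes the session more tolerant of short stalls, at the cost of detecting a real failure later, so it is usually worth trying to reduce the workload first.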
Session-timeout mechanism
Because a server instance cannot detect soft errors directly, a soft error could cause an availability replica to wait indefinitely for a response from the other availability replica in the session. To prevent this, Always On availability groups implement a session-timeout mechanism that works as follows: the connected availability replicas send a ping at a fixed interval on each open connection. Receiving a ping within the time-out period indicates that the connection is still open and that the server instances are communicating over it. On receiving a ping, a replica resets its time-out counter for that connection. The primary and secondary replicas ping each other to signal that they are still active. The session-timeout limit is a user-configurable replica property with a default value of 10 seconds.
If no ping is received from the other replica within the session-timeout period, the connection times out and is closed, and the timed-out replica enters the DISCONNECTED state. Even if the disconnected replica is configured for synchronous-commit mode, transactions do not wait for it to reconnect; the secondary replica is temporarily treated as if it were in asynchronous-commit mode. After the secondary replica reconnects with the primary replica, the two resume synchronous-commit mode.
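The ping and time-out counter behavior described above can be illustrated with a small simulation. This is only a sketch of the idea, not SQL Server's internal implementation: the class `ReplicaConnection` and its methods are hypothetical names, and time is passed in explicitly as seconds so that the example is deterministic.

```python
from dataclasses import dataclass


@dataclass
class ReplicaConnection:
    """Hypothetical model of one open connection between two replicas."""
    session_timeout: float = 10.0  # seconds; the default value mentioned in the text
    last_ping: float = 0.0         # time at which the last ping was received
    state: str = "CONNECTED"

    def on_ping(self, now: float) -> None:
        # Receiving a ping resets the time-out counter for this connection.
        if self.state == "CONNECTED":
            self.last_ping = now

    def check(self, now: float) -> str:
        # No ping within the session-timeout period: declare a soft error.
        if self.state == "CONNECTED" and now - self.last_ping > self.session_timeout:
            self.state = "DISCONNECTED"
        return self.state

    def reconnect(self, now: float) -> None:
        # A re-established connection starts a fresh time-out period.
        self.state = "CONNECTED"
        self.last_ping = now


conn = ReplicaConnection()
conn.on_ping(now=3.0)
print(conn.check(now=12.0))   # CONNECTED: 9 seconds since the last ping
print(conn.check(now=13.5))   # DISCONNECTED: 10.5 seconds since the last ping
conn.reconnect(now=20.0)
print(conn.check(now=25.0))   # CONNECTED again after reconnecting
```

The sketch shows why a soft error is only ever inferred: nothing reports the failure, the replica simply notices that the expected ping did not arrive in time.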
Reference: https://msdn.microsoft.com/zh-cn/library/ff877884(v=sql.120).aspx
Summary
Failures in databases other than the primary database cannot be detected. In addition, a data disk failure is unlikely to be detected unless the database restarts because of it. The availability replica itself actively checks only for soft errors, through the session timeout; hard errors have to be reported to it by other components.
Note: pursuer.chen. Blog: http://www.cnblogs.com/chenmh. All essays on this site are original. Reprints are welcome, but a reprint must credit the source and give the link clearly at the beginning of the article. Discussion is welcome.
Possible failures during SQL Server AlwaysOn availability replica Sessions