--==============================================================
This is probably my last blog post before the Spring Festival, and probably the last time this year that I step on a landmine while on call. I can only sigh: success comes from SQL Server, and so does failure.
--==============================================================
Scenario Description:
Operating system version: Windows Server 2012 Datacenter edition
Database version: SQL Server 2012 Enterprise Edition, version number: 11.0.5582.0
Problem description: The databases run in an AlwaysOn environment. Two nodes in the same data center use synchronous commit with automatic failover, and replicas in a second data center are synchronized asynchronously. Because there are four nodes, we chose three of them as voting (quorum) nodes to keep the number of votes odd. However, when one of the nodes (voting or non-voting) restarts because of a hardware failure, the network between the cluster nodes may "jitter" and the nodes lose communication with each other. Each cluster node then starts to report "Remove cluster node XXX from active failover cluster Membership", the cluster eventually removes all of the voting nodes, and the cluster itself goes down. AlwaysOn, which depends on the cluster, can no longer provide service and stays in the "Resolving" state until the restarted node recovers: node recovers ==> cluster recovers ==> AlwaysOn recovers.
Suppose there are four nodes, A, B, C and D, with A and B in one data center and C and D in the other, and with A, B and C configured as the voting nodes. When node C fails, the cluster events show the following:
Nodes A, B and C were removed from the failover cluster one after another, quorum was then lost, and the Cluster service shut down.
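Why losing only C should not be enough to bring the cluster down can be sketched with the plain node-majority rule. This is a simplified illustration, not the cluster's actual algorithm: it ignores dynamic quorum and witnesses, and the function name has_quorum is made up for this example.

```python
def has_quorum(total_votes: int, reachable_votes: int) -> bool:
    """Simplified node-majority rule: more than half of the votes must be reachable."""
    return reachable_votes > total_votes // 2


# Voting nodes from the scenario: A, B and C (D has no vote).
voters = {"A", "B", "C"}

# C is down, but A and B still see each other: 2 of 3 votes.
ab_together = has_quorum(len(voters), 2)

# C is down and A and B have also lost contact: each side sees only its own vote.
a_alone = has_quorum(len(voters), 1)

print(ab_together, a_alone)
```

Under this rule the cluster survives the loss of C as long as A and B can still talk to each other, so the outage only makes sense if A and B also lost communication with each other at the same moment.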
--=====================================================================
According to the Microsoft expert's analysis, a network problem is suspected. Event 1135 also clearly points to a network problem, and the data center staff found that some of the servers that hit this kind of failure were using faulty AOC cables.
Fair enough, a problem is a problem. But why does the network choose to join the fun exactly when a server goes down? This Windows failover cluster ran for more than a year without any network trouble, yet the network happens to "jitter" at the very moment a server crashes? Does the network jitter out of excitement, or out of fear, when a server goes down?
The network inside a single data center should be fairly trustworthy. A server going down in the remote data center causing network jitter inside the local data center does not make much sense.
--=====================================================================
Another error message says that node A could not complete its handshake with the downed node C within 40 seconds.
Are cluster nodes really that attached to each other? Waiting that long to shake hands with a dead node? Were they planning to wait until the end of time?
A quick tip: if a similar situation occurs, the failed server cannot be restarted any time soon, and the failover cluster does not start normally, you can run net stop clussvc to stop the local Cluster service and then run net start clussvc /fq to force the local Cluster service to start, so that AlwaysOn can resume service as soon as possible.
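The two commands above can be wrapped in a small script so that they are issued in the right order under pressure. This is only a sketch: the helper name force_quorum and its dry_run flag are invented for this example, it defaults to printing the commands instead of running them, and really running them needs an elevated prompt on the surviving cluster node.

```python
import subprocess

# The two commands from the tip above; /fq is the short form of /forcequorum.
FORCE_QUORUM_STEPS = [
    ["net", "stop", "clussvc"],
    ["net", "start", "clussvc", "/fq"],
]


def force_quorum(dry_run: bool = True) -> list[str]:
    """Hypothetical helper: list (and optionally run) the force-quorum steps locally."""
    issued = []
    for cmd in FORCE_QUORUM_STEPS:
        issued.append(" ".join(cmd))
        if not dry_run:
            # net stop returns non-zero if the service is already stopped,
            # so exit codes are deliberately not enforced here.
            subprocess.run(cmd, check=False)
    return issued


for line in force_quorum(dry_run=True):
    print(line)
```

Forcing quorum overrides the cluster's own safety check, so it should be run on one node only, and only after confirming that the other voting nodes are really unavailable.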
--====================================================================
Some not-so-reliable suggestions for your reference:
1. Avoid using a voting node in a different data center whenever you can. If you need another vote, setting up a server in the same data center as a file share witness also works.
2. For a two-node failover cluster, be sure to configure a file share witness or a disk witness.
3. In the cluster properties, on the Policies tab, configure the "Maximum restarts in the specified period" setting.
--====================================================================
A bit of venting: AlwaysOn is advertised as failing over within seconds, which is very tempting, and much of the time it really does put a DBA at ease. By the time the alert arrives, the failover has already happened and service has been restored, so the DBA can calmly take a shower, brush their teeth and get dressed before dealing with the failure. But the ideal is beautiful and reality is cruel. AlwaysOn itself holds up well in most situations, and the chance of hitting a bug that prevents a normal failover is low (note: low, not zero), but it cannot withstand a shaky Windows Failover Cluster underneath it. When the foundation is weak, even the strongest building collapses easily!
I hope SQL Server rises again, and that as a SQL Server DBA I can once more say with pride, as before, "SQL Server? Definitely no problem."
Another year is coming to an end. Looking around, my friends are all anxious, and suddenly the New Year feels a little scary: those who are doing well are already sound asleep, while those who are doing poorly have made insomnia a habit.
Happy Spring Festival to all my friends. Money or no money, go home for the New Year!
See you next year, when we fight on!
SQL Server--Tricky Problems: The Treacherous Windows Failover Cluster