Original: SQL Server 2012 AlwaysOn--a problem caused by a hardware upgrade
This is a case of last week: The problem caused by the upgrade of the existing hardware, but also triggered a more serious bug, is disastrous, but fortunately, in a chain of problems arise, there is no manual error (this is often the highest risk in dealing with the failure, affect the biggest problem) and expand the scope of failure;
========================== Gorgeous split-line ==========================
Let's talk about the environment:
I'm doing it across the room. 3 Node AlwaysOn:
Deployment aspects: 3 nodes, two are located in the host room, synchronous mode, the other one is located in an offsite room, cross-subnet asynchronous mode;
Software: Windows 2012+sqlserver SP2+CU3;
Hardware: Because the system is online earlier, in addition to the local hard disk (RAID 10) to hold the necessary installation packages, each node is configured with a piece of IO card for data, log files and backup
Before the system was used, the application side often appeared commit transaction jitter (local computer room two node synchronization), to the asynchronous mode after the application side performance is good; we know that in synchronous mode, Since the application side waits for the synchronization secondary node to complete the log cure (Harden) to receive the commit or rollback information, the network environment between the two nodes and the disk IO capability become the key of the above impact;
Prior to this, we have optimized the network (see: SQL Server 2012 alwayson--Specify data synchronization links, eliminate commit delay caused by network jitter), so you can eliminate network impact; We monitor the performance of disk IO (especially the impact of checkpoint), eventually locating to disk IO is indeed a pressure, and finally decided to replace the IO card;
When we applied for the equipment, we found that since the previous IO card was the first generation, it had compatibility issues with the third generation products currently purchased (unable to install at the same time), so it was necessary to kick the secondary node out of the AlwaysOn environment first, re-initialize the data after reinstallation, and add back to the AlwaysOn environment; This step is carried out in accordance with standard steps;
Second, we are going to switch the AG to the node of the updated hardware (here we call him Node_b), and the result is that the switching process is smooth (manual failover), but cannot be backed up after switchover (due to the subsequent need to do the same update hardware operation of another node, The inability to back up means that the data cannot be initialized when it is re-added to the AlwaysOn environment, and then the service is cut back to Node_a (the original master node);
We then checked the errorlog of Node_b and found that the following error message appears:
Information in-Apr- the 3: -: -PM MSSQL$PRD9012Server there has been25958400misalignedLogIOs which required falling back toSynchronous IO. The CurrentIo is on fileW:\MOUNTLOG\PRDLOG\PRDLOG1.ldf. Information in-Apr- the 3: -: -PM MSSQL$PRD9012Server there has been25958144misalignedLogIOs which required falling back toSynchronous IO. The CurrentIo is on fileW:\MOUNTLOG\PRDLOG\PRDLOG1.ldf.
In fact, from the node_b replaced hardware, and add back to AlwaysOn environment, has been reported similar errors, just switch more smoothly, we have neglected to check errorlog this key step;
Continue to the above error message, misaligned is an IO direction for the alarm, the specific principle can refer to the following articles
http://blogs.msdn.com/b/saponsqlserver/archive/2014/10/02/ Message-misaligned-log-ios-which-required-falling-back-to-synchronous-io-in-sql-server-error-log.aspx
The cause of misaligned is due to the two-node IO card, whose physical sector size is inconsistent (node_a is 512,node_b 4096; Here the physical sector is the underlying setting of the storage device, and the format 4k~64k in the operating system is not a concept, The definition of the operating system format is the allocation unit size, or cluster. The above link in the 9012 error is analyzed in detail, no longer repeat it;
On the other hand, is it because of misaligned that the switch node cannot be backed up? The next day, I set up a similar environment to test, but the problem was not reproduced, so we are going to upgrade with another set of scenarios:
Since misaligned is caused by the physical sector size of two nodes in the AG, we are ready to add another node (Node_c) with a physical sector size of 4096 in the existing AG and then switch AG to Node_b after the node_a is kicked off. In this way, the nodes in the AG with two synchronization relationships (Node_a, Node_c, and physical sector sizes are 4096) may be able to implement backups.
========================== Gorgeous split-line ==========================
In accordance with the above plan, we arranged another stop. But this time after switching services and kicking off the node_a, not only the backup problem is not resolved, even the AG group has become an analysis of the situation
From this, only the current node can be identified in the AG Group;
However, Node_b can still be normal access (read and write normal, listener IP can be used normally), and node_c is inaccessible; This state is extremely unreasonable;
In addition, a large number of remote harden of transaction errors were found in the errorlog
Perform backup (spid=509) is blocked by checkpoint process (spid=23) and blocked by DB startup process (SPID=35)
According to the Microsoft Engineering Analysis, "This is a recently discovered SQL bug that only happens on SP2 CU3 and CU4. This kind of blockage can occur even if you do not do backup. ”
This may be due to a deadlock inside SQL Server, and it is recommended that you install the following patch on all nodes as soon as possible.
http://support.microsoft.com/en-us/kb/3033492
http://support.microsoft.com/en-us/kb/3034679
You can install hotfix individually, or install SQL Sp2cu5, and we recommend that you play CU4 for all SP2 CU3 (5556) and CU5 (5569) with AlwaysOn environments as soon as possible.
Http://support.microsoft.com/en-us/kb/3037255/en-us
However, the current situation is to ensure that AlwaysOn is back to normal, so we are prepared to migrate the database to other AlwaysOn environments by halting the replication of data files, but hang on when the SQL Server service is stopped
Helpless, only restart the server. But the magic is that restarting Dafa is the perfect solution here. After the restart, all services are returned to normal;
Summary: This case is very special, encountered another bug during the switchover, but fortunately, the internal process deadlock in the bug was released by rebooting. In addition, for the first part of the misaligned problem, it is best to install the hardware, first check the size of the physical sector is consistent, to avoid performance problems;
SQL Server 2012 AlwaysOn--a problem caused by a hardware upgrade