Production environment:
SQL Server 2008 R2 active/passive nodes,windows Server 2008 R2 SP1 Cluster, vsphere 5.x
Occurrence start
6 am received Application Team report BizTalkMsgBoxDb Enter suspect mode and cannot be accessed.
Reporting events to reduce user pressure
Simple and App Manager phone, to understand their apps level down time, in the ticket input of the approximate occurrence, event description, there has been no recent changes in the event. If there is no ticket system, please email to the relevant personnel. The benefit of the phone Incident manager managing all event updates is to make panicked people know what's going on and reduce their stress.
Tidy yourself up.
6:30 am a lot of people's phone always let oneself nervous, simple brainstorm the reason why the suspect may occur: file group (data and log) of the damage? Disk full/san error? Backup still there?
Check the error Log to locate the starting error message
6:40 am finds the initial error that occurred 1 minutes after the successful log backup, and the error message shows that OS error caused the LogWriter log flush (write log) to fail. Failure to write a log can result in data suspect.
2014-03-17 03:15:56.05 spid5s error:17053, Severity:16, State:1.
2014-03-17 03:15:56.05 spid5s logwriter:operating system error1117 (failed to retrieve text for this error. REASON:15105) encountered.
2014-03-17 03:15:56.05 spid5s Write Error during log flush.
2014-03-17 03:15:56.05 spid79 error:9001, severity:21, State:4.
2014-03-17 03:15:56.05 spid79 The log for database ' BizTalkMsgBoxDb ' isnot available. Check the event log for related error messages. Resolve anyerrors and restart the database.
2014-03-17 03:15:56.05 spid85 error:9001, severity:21, State:4.
Parse Error:
1117 OS error, about disk. The log file is still in and the disk is not full. You can consider migrating to log file.
First attempt to DBCC Repair
(The basis of any attempt is to understand that your actions do not make things worse)
Command ALTER DATABASE [Xxxxxx]set emergency;
Command error, database locked, cannot alter DATABASE, directly discard DBCC CHECKDB (N ' xxxxxxx ', Repair_allow_data_loss) with No_infomsgs, ALL_ERRORMSGS;
Why to give up: DBCC Repair requires the database to be in emergency mode, and it will attempt to restore the database to consistency (consistent recover) using the existing log. If the log has a problem, it rebuilds the log (which is personally thought to be the meaning of repair allow data loss). For a database of more than GB, rebuild log can take hours, given the recovery time object ( RTO) and SLAs (Service level agreement) do not allow the database to downtime for a long time (hindsight). Fortunately, it is not possible to alter DATABASE, where the error message directly indicates the DB log locked, implying that the log may not be corrupt, then there is no need to worry about DBCC repair.
After rethinking, arbitrarily think log file corrupted is wrong, DBCC repair as the first step of methodology is also outdated, should not confirm to the user whether can lose the past 15 minutes of active transaction ( Although the customer is still sleeping) (every 15 minutes of transaction log backups), not to mention it will make the database downtime longer, 8 points before work may not be restored, may not have database backup restore fast. The first step as a methodology should be to first confirm whether file corrupted and contact the server team for IO exceptions.
Second attempt to migrate log files
When encountering resource lock problems, the usual first reaction is to kill or reboot the resource. This is limited to their own skills or did not establish the right methodology, the first time to find the lock resources, so chose to restart the resource
Should be Windows Cluster, so do not detach/attach the database, directly failover to passive server, the database after failover equivalent restart and instance recovery. The log file is now writable and the database is restored to active.
temporarily resolves the problem and then moves the database switch over to the original Active server. There is no error, proving that it is not a problem with the disk itself. There may be a problem with the disk interface. Also viewed the Event Viewer except log backup found no other. Sp_who2 also found no suspicious database lock that excludes the DB process from locking the database or logfile.
Building problems
7 am to have the server team check the disk and suspect an Eva San problem. Now we know only the starting error and the workaround. As a problem, leave problem manager to continue to go further, to avoid the same problem in the future.
Summary: Encountered log file caused by the database hangs, the solution is first (1) to confirm the disk problem, then (2) Confirm the database process lock, and then (3) Confirm whether the corrupt, these check up is finished and then for (1) (2) (3) to propose a resolution Scheme. The higher the severity from (1) to (3), the higher the likelihood of data loss after recovery. To confirm the risk of online repair with the customer. The final straw is naturally a complete data backup scheme and a regular database recovery execution plan.