SQL Server always on FCI cluster node concurrently occupying resources and suspicious State repair

Last Update:2017-04-07 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

FCI Two-node cluster, because the network outage between the cluster nodes at night, two nodes are considered another node outage, in the cluster management of each node has seen the other side has been down.

Connect to the cluster IP to indicate a problem with the msdb database:

Discovery of the msdb database "suspicious"

Msdb is damaged, MSSQL error logs and agent logs are also not in the method of inquiry, from Windows to see the following information:

SQL Server Assertion: file: <xdes.cpp>, line =3785 failed Assertion = ' curr->getxdesid () = = M_xdesid '. This error may be time-related. If the error persists after you rerun the statement, use DBCC CHECKDB to check that the database is structurally complete, or restart the server to ensure that the data structure in memory is not corrupted. An error occurred at the logging ID (106502:3622:2) when undoing the operation under Log records in database ' msdb '. Typically, this particular failure was previously logged as an error in the Windows event Log service. Restore the database or file using a backup, or repair the database. An error occurred while processing the log for database ' msdb '. If possible, restore from a backup. If you do not have a backup available, you may need to regenerate the log. The database msdb was closed because error 3314 occurred in the routine ' xdesrmreadwrite::rollbacktolsn '. After all connections to the database are aborted, an attempt is initiated to restart the non-snapshot database. An error occurred at the logging ID (106502:3614:1) when undoing the operation under Log records in database ' msdb '. Typically, this particular failure was previously logged as an error in the Windows event Log service. Restore the database or file using a backup, or repair the database. The log scan number (106502:3144:155) passed to the log scan operation in database ' msdb ' is invalid. This error may indicate data corruption, or the log file (. ldf) does not match the data file (. mdf). If this error occurs during replication, recreate the publication. Otherwise, if the problem causes an error during startup, restore from a backup. An error occurred during recovery, causing the database "msdb" (4:0) to fail to restart. Please diagnose and correct these recovery errors, or restore from a known good backup. If you cannot correct the error, or if it is an unexpected error, contact your support personnel.

Suspicious to determine the msdb database log is corrupt.

SQL Server also has some other important error messages, such as automatic monitoring of traces, major system errors, and connection errors, as seen below:

DECLARE @path NVARCHAR () SELECT @path =path from sys.traces WHERE  id = 1SELECT * from:: fn_trace_gettable (@path, 0)

2017-04-07 01:21:36.05 spid95 Error: 3624, severity: 20, State: 1. 2017-04-07 01:21:36.05 spid95 A System Assertion Check has failed. Check the SQL Server error log for details. Typically, an assertion failure are caused by a software bug or data corruption. To check for database corruption, consider running DBCC CHECKDB. If you agreed to send dumps to Microsoft during Setup, a mini dump would be sent to Microsoft. An update might is available from Microsoft in the latest Service Pack or in a QFE from Technical. 2017-04-07 01:21:36.07 spid95 Error: 3314, severity: 21, State: 3. 2017-04-07 01:21:36.07 spid95 During Undoing of a logged operation in database ' msdb ', an error occurred at log Recor D ID (106502:3622:2). Typically, the specific failure is logged previously as a error in the Windows Event Log service.  Restore the database or file from a backup, or repair the database.2017-04-07 01:21:37.59 spid95 error: 9004, severity: 23, State: 6. 2017-04-07 01:21:37.59 spid95 An error occurred wHile processing the log for database ' msdb '. If possible, restore from backup. If a backup is not available, it might being necessary to rebuild the log.2017-04-07 01:21:39.11 spid95 error: 9004, severity: 23 , state: 6.  2017-04-07 01:21:39.11 spid95 An error occurred while processing the log for database ' msdb '. If possible, restore from backup. If a backup is not available, it might being necessary to rebuild the log.2017-04-07 01:21:40.64 spid95 error: 9004, severity: 23 , state: 6.  2017-04-07 01:21:40.64 spid95 An error occurred while processing the log for database ' msdb '. If possible, restore from backup. If a backup is not available, it might being necessary to rebuild the log.2017-04-07 01:21:40.66 spid95 error: 3314, severity: 21 , state: 5. 2017-04-07 01:21:40.66 spid95 During Undoing of a logged operation in database ' msdb ', an error occurred at log Recor D ID (106502:3614:1). Typically, the specific failure is logged previously as a error in the Windows Event Log service. Restore the database OR file from a backup, or repair the database.2017-04-07 01:21:41.43 spid22s error:9003, severity:20, state:6.2017-0 4-07 01:21:41.43 spid22s The log scan number (106502:3144:155) passed to log scan in database ' msdb ' was not valid. This error may indicate data corruption or, the log file (. ldf) does not match the data file (. mdf). If This error occurred during replication, re-create the publication. Otherwise, restore from backup if the problem results in a failure during startup.2017-04-07 01:21:41.43 spid22s Error : 3414, severity:21, state:1.2017-04-07 01:21:41.43 spid22s An error occurred during recovery, preventing the Databa Se ' msdb ' (4:0) from restarting. Diagnose the recovery errors and fix them, or restore from a known good backup. If errors is not corrected or expected, the contact Technical support.

main states: 3624, 3314, 9004, 3414
The basic reasons are as follows:

Troubleshooting Error 3313, 3314, 3414, or 3456 (SQL Server)

How to troubleshoot Error 9004 in SQL Server

Due to msdb log backups (60+gb! found in msdb data file )

Now there are problems with the cluster, the use of resources, the heartbeat does not work. Connect to the database internal view node can be connected to one of the normal.

SELECT * FROM Sys.dm_os_cluster_nodesselect * from Fn_virtualservernodes ();

More importantly: Because the storage is shared, both the system database and the user database are shared!

Two nodes with shared storage, supposedly the data is consistent. In order for the cluster to return to its normal state, it intends to transfer the cluster. Restarting the server is better, instead of having to move it one by one, so that it automatically transfers all resources.

After reboot!!

MSDB is normal!!

But there are 3 user databases that appear "suspicious"!!

No way, the appearance of the only repair it!! Set "emergency" mode repair, set "single user", result settings are deadlock, unable to execute!

It is difficult to recover multiuser mode after setting up single user mode, it seems that there are continuous processes to execute!

Simply turn off the cluster another node server (virtual machine)! It is still the same mistake to come in.

Then no cluster connection, to the node server execution, or the same!!

Then start the service with the dedicated administrator DAC to see who can connect! When you come in and you always prompt another user to run, setting the database to multi-user mode is not possible!

Good! Then change the port and restart the MSSQL service to see who is still connected! No one's robbed! Suspicious operation will not deadlock, and no one else to operate, at this time can be repaired!

Two smaller (60GB and 1GB) databases no problem; DBCC CHECKDB repair process, there is a 170GB database repair to tempdb space is not enough! Fixed the part, and stopped.

But soon, because some database files are on the same disk, disk space is not enough!! Automatically stop the MSSQL service!

All right! Delete some database dump/log files, start the service detach some unimportant database, remove it (may no longer be used), move out of the shared disk.

What the! Insufficient permissions, local administrators can not move files! Then add the permissions of the data file to the local administrator individually!

Space is enough for business, but there is still a database that hasn't been repaired!! Then move the staging database to the local disk for DBCC CHECKDB repair!

After the repair is complete, change tempdb back to the shared storage location and set it to a smaller number!

Repair complete on this cluster node!!!

However, the data may have been lost and the cluster has not dared to recover.

ALTER Database dbname Set EMERGENCY goalter database dbname set Single_user with ROLLBACK immediategoalter database dbname Set Single_user godbcc CheckDB (dbname, Repair_allow_data_loss) go--after then:alter DATABASE dbname set Multi_user GO

SQL Server always on FCI cluster node concurrently occupying resources and suspicious State repair

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More