標籤:
本周一晚上,收到一個資料庫鏡像SUSPEND和SYNCHRONIZING狀態來回切換的警示,看到這個錯誤,其實也蠻常見的。
我們的修複步驟一般是這樣的:
1. ALTER DATABASE *** SET PARTNER RESUME
如果這個步驟還不行,就需要觸發一個事務,因為偶爾我們遇到SYNCHRONIZING狀態的,可以觸發事務修複,其原因可能是鏡像由於網路原因冰了一下
2. CREATE TABLE dbo.tb_repair_mirroring_****(ID int), 然後刪除該表
當這個修複也是失敗後,我們就需要嘗試先拿掉鏡像,然後再重新應用一下,當然這個成功的前提必須保證MASTER/SLAVE的交易記錄是可以前滾的,也就是交易記錄鏈沒有丟失或者破壞
3. ALTER DATABASE *** SET PARTNER OFF, 然後重新應用ALTER DATABASE *** SET PARTNER = ‘TCP://***‘(類似語句)
4.上述若不能修複,剩下唯一的修複辦法是重建鏡像,直到現在,這個都是能夠解決問題的,可以說是最後一個招數
但遺憾的是,所有的方法都用了也沒能修複,還是不斷的出現這個錯誤。
errorlog主庫:
2014-12-17 15:57:30.640 spid33s Error: 1453, Severity: 16, State: 1.
2014-12-17 15:57:30.640 spid33s ‘TCP://***‘, the remote mirroring partner for database ‘***‘, encountered error 3624, status 1, severity 20. Database mirroring has been suspended. Resolve the error on the remote server and resume mirroring, or remove mirroring and re-establish the mirror server instance.
備庫:
2014-12-17 00:00:42.97 spid41s Error: 17066, Severity: 16, State: 1.
2014-12-17 00:00:42.97 spid41s SQL Server Assertion: File: <loglock.cpp>, line=807 Failed Assertion = ‘result == LCK_OK‘. This error may be timing-related. If the error p ersists after rerunning the statement, use DBCC CHECKDB to check the database for structural integrity, or restart the server to ensure in-memory data structures are not corr upted.
看起來像資料頁有問題,查了相關報錯又像是BUG,同時自身也產生了很多dump檔案,進一步分析情況:
主庫異常DB目前的TPS,3000左右
| begin tranDECLARE @value intDECLARE @value2 intselect @value=CONVERT(int,cntr_value) from sys.dm_os_performance_counters where object_name like ‘MSSQL$***:database%%‘and instance_name=‘***‘ and counter_name like ‘Transactions/sec%%‘ waitfor delay ‘00:00:01‘ select @value2=CONVERT(int,cntr_value) from sys.dm_os_performance_counters where object_name like ‘MSSQL$***:database%%‘and instance_name=‘***‘ and counter_name like ‘Transactions/sec%%‘ select @[email protected] commit tran |
主庫發送隊列在增加
| begin tranDECLARE @value intDECLARE @value2 intselect @value=CONVERT(int,cntr_value) from sys.dm_os_performance_counters where counter_name = ‘Log Send Queue KB‘ and instance_name=‘***‘; waitfor delay ‘00:00:01‘ select @value2=CONVERT(int,cntr_value) from sys.dm_os_performance_counters where counter_name = ‘Log Send Queue KB‘ and instance_name=‘***‘; select @value*1./1024 as first_second_MB,@value2*1./1024 as second_second_MB,(@[email protected])*1./1024 as diff_MB commit tran |
時而追上
備庫的應用隊列
| begin tranDECLARE @value intDECLARE @value2 intselect @value=CONVERT(int,cntr_value) from sys.dm_os_performance_counters where counter_name = ‘Redo Queue KB‘ and instance_name=‘***‘ waitfor delay ‘00:00:01‘ select @value2=CONVERT(int,cntr_value) from sys.dm_os_performance_counters where counter_name = ‘Redo Queue KB‘ and instance_name=‘***‘; select @value*1./1024 as first_second_MB,@value2*1./1024 as second_second_MB,(@[email protected])*1./1024 as diff_MB commit tran |
備庫的應用速度大概在 14MB
| begin tranDECLARE @value bigintDECLARE @value2 bigintselect @value=CONVERT(bigint,cntr_value)*1./1024/1024 from sys.dm_os_performance_counters where counter_name = ‘Redo Bytes/sec‘ and instance_name=‘***‘; waitfor delay ‘00:00:01‘ select @value2=CONVERT(bigint,cntr_value)*1./1024/1024 from sys.dm_os_performance_counters where counter_name = ‘Redo Bytes/sec‘ and instance_name=‘******scard‘; select (@[email protected]) as speed_MB commit tran |
假設忽略主庫新產生的日誌,追上主庫需要的時間 32min,如果再算上主庫新產生的大概在40min多
select 27338.278320/14/60 =32.545569428566
再排查
| select cntr_value,* from sys.dm_os_performance_counterswhere counter_name in(‘Log Send Flow Control Time (ms)‘,‘Bytes Sent/sec‘,‘Log Bytes Sent/sec‘,‘Log Compressed Bytes Sent/sec‘,‘Log Harden Time (ms)‘,‘Log Send Flow Control Time (ms)‘, ‘Log Send Queue KB‘,‘Mirrored Write Transactions/sec‘,‘Pages Sent/sec‘, ‘Send/Receive Ack Time‘,‘Sends/sec‘,‘Transaction Delay‘ ) and instance_name=‘***‘; |
Send/Receive Ack Time:
Milliseconds that messages waited for acknowledgement from the partner, in the last second.
This counter is helpful in troubleshooting a problem that might be caused by a network bottleneck, such as unexplained failovers, a large send queue, or high transaction latency. In such cases, you can analyze the value of this counter to determine whether the network is causing the problem.
問題是否可能出現在網路上,該主機TOA(網路組件)已經升級了,但是備機的TOA未升級
等待確認的時間穩定在800ms左右,對比升級TOA的主機在100ms左右,當然這個也跟對比執行個體的tps相關,但目前發現有問題的地方可能是這裡
| begin tranDECLARE @value bigintDECLARE @value2 bigintselect @value=CONVERT(bigint,cntr_value) from sys.dm_os_performance_counters where counter_name = ‘Send/Receive Ack Time‘ and instance_name=‘***‘; waitfor delay ‘00:00:01‘ select @value2=CONVERT(bigint,cntr_value) from sys.dm_os_performance_counters where counter_name = ‘Send/Receive Ack Time‘ and instance_name=‘***‘; select (@[email protected]) as ‘Send/Receive Ack Time‘ commit tran |
半小時候發現發送隊列又變大了,發送速度又是
備庫的應用隊列減小的也非常慢
後續我們跟使用者溝通了維護時間做了checkdb,但並未發現沒有壞頁,說明資料是ok的,那麼問題可能在日誌了。
恰好我們發現了資料日誌的一直較大,並且收縮資料庫日誌無效,查看等待的類型,發現是ACTIVE_TRANSATION, 一般,正常的情況是等待LOG BACKUP和DATABAE MIRRORING(只針對RDS環境,環境更複雜的,等待的類型更多) ,這樣我們就感覺有點眉目了。
log_reuse_wait_desc 為 ACTIVE_TRANSACTION
ACTIVE_TRANSACTION:事務處於活動狀態(所有復原模式)。
- 一個長時間啟動並執行事務可能存在於記錄備份的開頭。在這種情況下,可能需要進行另一個記錄備份才能釋放空間。
- 事務被延遲(僅適用於 SQL Server 2005 Enterprise Edition 及更高版本)。“延遲交易”是有效活動事務,因為某些資源不可用,其復原受阻。
首先我們查看是否有事務在阻塞日誌被截斷:
DBCC OPENTRAN(***‘)
發現了重大的線索,有個277號進程是阻塞元兇(使用者自己發起了一個事務),分析了這個277號進程,是使用者在查系統檢視表的指令碼,其中,一段是SELECT語句,其他的在BUFFER裡面差不多,總之就是有個事務在裡面
15號的事務今天19號,導致中間的日誌全是活動紀錄無法截斷,跟使用者確認kill掉後,再次 dbcc opentran(‘***‘)
發現結果一直在變化,可以理解為活動紀錄一直在往前走(越來越少),再次備份後活動紀錄恢複到7G左右,重搭成功。
終於在不影響使用者的情況下成功解決,我們差點就要讓他停業務做重啟了(還未必能重啟成功),也再次證明團隊的力量是巨大的,大家的細心是解決問題的關鍵。
https://connect.microsoft.com/SQLServer/feedback/details/412955/
http://stackoverflow.com/questions/4192730/sqlbulkcopy-mirroring-failed
http://msdn.microsoft.com/zh-cn/library/ms188290(v=sql.105).aspx
http://msdn.microsoft.com/zh-cn/library/ms189931.aspx
http://msdn.microsoft.com/zh-cn/library/ms176064.aspx
http://msdn.microsoft.com/zh-cn/library/ms178037(v=sql.105).aspx
一個非常罕見的SQLServer鏡像異常的排查處理過程