Monitoring SQL Server Transactional replication
In general, we can use the Replication Monitor in SSMs to monitor replication. But we can't stare at it for 24 hours, and we have to use an automated way to monitor it. Microsoft provides system stored procedure DBO.SP_REPLMONITORSUBSCRIPTIONPENDINGCMDS in the distribution database to return the number of commands waiting on a subscription, as well as the estimated time required to post all these commands to subscribers. I created a job that runs every 10 minutes, saves the state's history data to a table, and retains the data for 14 days.
This table is created in the DBA database of the Subscriber server with the following code:
CREATE TABLE dbo. Replication_qu_history (
subscriber_db varchar () NOT NULL,
Records_in_que numeric (0) NULL,
Catchuptime numeric (0) NULL,
logdate datetime not NULL,
CONSTRAINT pk_epr_replication_que_history PRIMARY KEY CLUSTERED
(
subscriber_db ASC, logdate DESC
) on PRIMARY
GO
Table data is generated through monitoring stored procedures and can be used to find problems with historical data. However, it is more necessary to monitor what is happening now.
There are three things that can help determine the health of a replication.
1. Copy the status of the related job.
2. Delay, especially the distribution delay measured by the counter dist:delivery latency.
3. Number of non-executed commands waiting for subscription.
I focused my attention on the distribution delay, because from past experience, the problem of distribution delay was more pronounced than the log read delay. Most of the time, the distribution delay is due to an increase in transaction volume. For example, an index rebuild on a large table that publishes data causes a sudden increase in transaction log volume, resulting in more data than is expected to be replicated.
If you have a large number of commands waiting to be distributed, it is sometimes possible that the Distribution Agent job is not running. On the other hand, sometimes this job is running, but not keeping up. By restarting the agent, the job starts processing the commands that are not executed.
Before we start, we need to know the information about replication, such as the name of the Publisher and Subscribers, the name of the Distribution Agent job, and so on. Microsoft provides a number of stored procedures in the distribution database to collect this information. The author's distribution database is together with the Subscriber database, so the script is simpler than the different servers.
1. First, perform sp_replmonitorhelppublisher in the distribution database to obtain monitoring information for all publishers.
2. Then, perform sp_replmonitorhelppublication on the distribution database to return monitoring information for all publications.
3. Finally, execution sp_replmonitorhelpsubscription returns monitoring information for all subscriptions.
This information contains some time-lapse indicator data, so after executing this stored procedure, I have some key information.
The following is the code used to gather information:
declare  @cmd  nvarchar (max) declare  @publisher  SYSNAME,  @publisher_db  sysname,   @publication  SYSNAME,  @pubtype  int declare  @subscriber  SYSNAME,  @subscriber _db sysname,  @subtype  int declare  @cmdcount  INT,  @processtime  int DECLARE   @ParmDefinition  nvarchar declare  @JobName  sysname declare  @minutes  int,   @threshold  INT,  @maxCommands  INT,  @mail  char (1)  =  ' N ' set @ minutes = 60 --> define how many minutes latency before  you would like to be notified set  @maxCommands  = 80000 --->  change this to represent the max number of outstanding  Commands to be proceduresed before notification set  @threshold  = @ Minutes * 60 select * into  #PublisherInfo from openrowset (' SQLOLEDB ',  ' server= (LOCAL); Trusted_connection=yes; ',  ' set fmtonly off exec distribution.dbo.sp_ Replmonitorhelppublisher ') select  @publisher  = publisher FROM  #PublisherInfo set @ cmd =  ' select * into # #PublicationInfo  from openrowset (' SQLOLEDB ', ' Server= (LOCAL); Trusted_connection=yes ', ' set fmtonly off exec distribution.dbo.sp_ replmonitorhelppublication  @publisher = ' +  @publisher  +  ') '--select  @cmd exec sp_ executesql  @cmd select  @publisher_db =publisher_db,  @publication =publication,  @pubtype = publication_type from # #PublicationInfo set  @cmd  =  ' select * into ## Subscriptioninfo from openrowset (' SQLOLEDB ', ' server= (LOCAL); Trusted_connection=yes ', ' set fmtonly off exec distribution.dbo.sp_replmonitorhelpsubscription  @publisher = ' +  @publisher  +  ', @publication_type = '  + convert ( CHAR (1), @pubtype)  +  '--select  @cmd exec sp_executesql  @cmd alter table ## Subscriptioninfo Add pendingcmdcount int null, Estimatedprocesstime int null
After you know the basic information about the Publisher and the Subscriber, then check the status of the distribution job. They should be running all the time. If it is not running, you have to start it. If I need to restart a job, I will set the identity to force send message alert.
I'm not trying to send an email alert just to check the status of all subscriptions. If the set of data exceeds the set threshold, the message alarm will be triggered. I use a cursor to iterate through all the subscriptions, which is the easiest way to gather information. I use this information as a parameter to other stored procedures to determine whether the Distribution Agent is running and to restart the agent.
Declare cur_sub cursor read_only for select  @publisher, s.publisher_db,  S.publication, s.subscriber, s.subscriber_db, s.subtype, s.distribution_agentname from  # #SubscriptionInfo  s open cur_sub fetch next from cur_sub into @ publisher,  @publisher_db,  @publication,  @subscriber,  @subscriber_db,  @subtype,  @ JobName while @ @FETCH_STATUS  = 0 BEGIN set  @cmd  =  ' select  @cmdcount = pendingcmdcount,  @processtime =estimatedprocesstime from openrowset ("SQLOLEDB", "server=" (LOCAL ); Trusted_connection=yes ', ' set fmtonly off exec distribution.dbo.sp_ replmonitorsubscriptionpendingcmds  @publisher = '  +  @publisher +  ', @publisher_db = '  +   @publisher_db  +  ', @publication = '  +  @publication +  ', @subscriber = '  + @ subscriber +  ', @subscriber_db = '  +&nbSP; @subscriber_db +  ', @subscription_type = '  + convert (CHAR (1), @subtype)  +  '; '  +  ' set  @ParmDefinition  = n ' @cmdcount  int output, @processtime  int  output '--select  @cmd exec sp_executesql  @cmd, @ParmDefinition, @cmdcount  output, @ Processtime output update # #SubscriptionInfo set pendingcmdcount =  @cmdcount,  estimatedprocesstime =  @processtime where subscriber_db =  @subscriber_db insert  Into dba.dbo.replication_que_history VALUES (@subscriber_db,  @cmdcount,  @processtime,  GETDATE ())-- find out if the distribution job with the high  Number of outstanding commands running or not-- if it is  Running then sometimes stopping and starting the agent fixes the  issue If exisTS (select * from tempdb. Information_schema. tables where table_name like  ' # #JobInfo% ') drop table # #JobInfo set  @cmd  =  ' select * into # #JobInfo  from openrowset (' SQLOLEDB ', ' server= (LOCAL); Trusted_connection=yes ', ' set fmtonly off exec msdb.dbo.sp_help_job  @job_name = ' ' +   @JobName  +  "", @job_aspect = ' job ' ") ' exec sp_executesql  @cmd if  @cmdcount  >  @maxCommands  OR  (@processtime  >  @threshold  AND  @cmdcount  > 0) BEGIN if  (select current_execution_status from # #JobInfo)  = 1
 -- this means job is currently executing so stop/start it BEGIN exec distribution.dbo.sp_msstopdistribution_agent @publisher  =  @publisher,  @publisher_ db =  @publisher_db,  @publication  =  @pubLication,  @subscriber  =  @subscriber,  @subscriber_db  =  @subscriber_db WAITFOR  DELAY  ' 00:00:05 '  ---- 5 second delay set  @mail  =  ' Y ' End end-- select name, current_execution_status from # #JobInfo if  (select current_execution _status from # #JobInfo)  <> 1 -- if the job is not  running start it BEGIN exec distribution.dbo.sp_msstartdistribution_agent @publisher  =   @publisher,  @publisher_db  =  @publisher_db,  @publication  =  @publication,   @subscriber  =  @subscriber,  @subscriber_db  =  @subscriber_db set  @mail  =   ' Y '  -- send email if job has stopped and needed to  be restarted END drop table # #JobInfo fetch next from cur_sub into  @publisher, &NBSP; @publisher_db,  @publication,  @subscriber,  @subscriber_db,  @subtype,  @JobName END Close cur_sub deallocate cur_sub
Run Sp_replmonitorsubscriptionpendingcmds to collect the command that was not executed and the expected time to keep up.
This is the information I want to store in the history table, so I can see how replication is performing.
We need to determine an acceptable latency threshold. I use it here for 6 minutes, meaning that if the replicated database is less than 6 minutes behind the publication database, it will be alerted. Also determine the maximum number of non-distributed commands. If this amount fluctuates upward, there may be a problem. You can choose to take action when the number is set to high. I chose to have 80,000 non-distributed commands for this system.
I acquired this data after having the Copy queue check job run for two weeks. Make sure that these jobs run like index rebuild jobs. I looked at the maximum number of non-distributed commands and the maximum latency for some time and determined that my settings would be larger. I don't want to be woken up at night because of a temporary system backup caused by the index rebuild job, which is automatically restored.
The following code requires the Enable AD Hoc distributed Queries server configuration option. Assuming the previous script found the problem, I created the script to send the message.
if  @mail  =  ' Y ' BEGIN declare  @msg  varchar (MAX)  =  ' replication on  '  + @ @SERVERNAME +  '  may be experiencing some problems. attempts  to restart the distribution agent have been made.  ' +  ' If  this is not the first message like this that you have
 received within the last hour, please investigate. ' declare  @body  nvarchar (max) declare  @xml1  nvarchar (max) declare  @tab1  nvarchar (max ) declare  @xml2  nvarchar (max) declare  @tab2  nvarchar (max) set  @xml1  = cast ((  SELECT subscriber AS  ' TD ', ',subscriber_db as  ' TD ', ', latency as  ' TD ' , ', pendingcmdcount as  ' TD ', ', estimatedprocesstime as  ' TD ' FROM ## Subscriptioninfo s FOR&NBSP;XML&NBsp PATH (' tr '), elements )  as nvarchar (MAX)) set  @tab1  = ' Finally, you need to periodically delete the data from the Replication state table so that the data is not too old.
DECLARE @delDate datetime = GETDATE () -10
DELETE from DBA.dbo.Replication_Que_History
WHERE logdate < @ Deldate
If any of the thresholds configured in the script match, the publishing agent associated with the subscription of the problematic counter will be restarted, and if it has stopped, the job will start. You will be notified of this action by email. In many cases, restarting the Distribution Agent resolves the issue, and the replication begins to work. If the problem is still not fixed, the next time the job runs the same action, another message is received. You need to start checking the situation.
You can call the 3rd script in your alarm system and restart the Distribution Agent job when any threshold matches. Or, run a 1th script to create the table. Create a new job, run the next 3 scripts in step 1th, and then place the 5th script in step 2nd. I currently run this schedule every 10 minutes.
This process is primarily intended to help deal with intermittent downtime of transactional replication. It is still important to monitor the replication process regularly using Replication Monitor. This process is just to stop phone harassment during off hours and can be repaired only by starting the Distribution Agent job.
This article is from the SQL Server deep Dive blog, so be sure to keep this source http://ultrasql.blog.51cto.com/9591438/1907283
Monitoring SQL Server Transactional replication