My kick for AlwaysOn performance monitoring

Source: Internet
Author: User

Latency is one of the most important enemies in AlwaysOn

Latency is one of the biggest enemies of AlwaysOn. For AlwaysOn, the primary goal is to minimize (unavoidable) the data latency of the primary replica, secondary replicas, and to achieve "data synchronization" of the primary replica and secondary replicas. The lower the synchronization latency of the primary replica and the secondary replica, the higher the reality of read-only access, and the smaller the RTO (estimating Failover Time) and RPO (Estimating potential data Loss) of the database.

However, latency may exist in all aspects of AlwaysOn synchronization, so when analyzing the current delay, you should first understand the AlwaysOn synchronization process and then slice into each process for monitoring and analysis.

6 Big steps for AlwaysOn sync

In my previous article, "AlwaysOn synchronization Principle and Synchronization mode", the synchronization process of AlwaysOn was introduced. It boils down to the following six steps:

①log Flush (primary)

②log Capture (primary)

③send (primary and secondary)

④log Receive and cache (secondary)

⑤log Hardened (secondary)

⑥redo (secondary)

The first two steps occur in the primary replica, the last three steps occur in the secondary replica, and the third step in the middle occurs between the primary replica and the secondary replica.

In addition, if it is synchronous-commit mode, one more step is required: After step 5, a secondary replica sends a (log hardening) acknowledgment to the primary replica before it can enter the redo phase.

Monitor AlwaysOn synchronization process log Flush (Primary)

Log flush simply brushes the logs in log buffer into disk. In SQL Server, the size of log buffer is fixed at 60KB, and the log is flush on the disk when a transaction commit occurs, checkpoint, or if buffer has insufficient free space.

AlwaysOn only the logs to the primary replica are brushed into the disk before continuing with the steps (in order to protect the data consistency of the primary replica). Therefore, the faster log flush is, the sooner the AlwaysOn next step enters, the smaller the delay, and vice versa.

So, how do you monitor the speed and performance of log flush in this phase? Typically, we use the following two performance counters:

    • Avg. Disk Sec/write

This is a disk performance counter, this indicator reflects the flush during the write operation of the average response time, if within 10ms, the disk performance is very good, if the 10ms to 20ms, the performance is good, if the 20ms to 50ms shows poor performance, if the 50ms above, Indicates poor performance.

    • SQL server:database > Log bytes flushed\sec

This indicator directly reflects the size of the log flush per second, because the size of the log in different time periods, different services may be different, so it is not possible to provide a standard value to measure the performance of the flush, but when this value is very large, the database operation (increase, delete, change) more frequent, need to attract attention , which is analyzed together with other indicators in the next steps.

log Capture (primary)

Log capture occurs after log flush and is the most important stage in AlwaysOn (personally). At this stage, a send queue is maintained on the primary replica for each secondary replica, the contents of which are logs buffered from the log or capture from the disk, and the size of the queue reflects the difference between the primary replica and the secondary replica data being out of sync.

Obviously, the two key factors that are most likely to affect AlwaysOn synchronization performance at this stage are the source of the capture log (disk or memory) and the size of the queue.

In both of these factors, I need to focus on the source of the capture log: disk or memory? We are self-directed, reading data from memory is much more efficient than reading from disk, so in AlwaysOn, SQL Server buffers as many logs as possible into memory. However, the size of the buffer is limited, if the log volume is too large, the buffer size is insufficient, you have to read the disk.

Let's familiarize ourselves with how to monitor the source of logs and the size of the send queue:

Source of monitoring logs:

Whether the log is read from memory or disk, the following performance counters can be glimpse:

    • SQL Server:Databases:Log pool requests/sec and SQL Server:memory Manager:log pool Memory (KB)

Explanation: The former represents the number of requests for log blocks from memory per second, the latter representing the size of the log pool memory;

Note: For these two values, we hope the higher the better, the more logs can be read from memory.

    • SQL Server:Databases:Log Pool Disk reads/sec and SQL Server:Databases:Log pool Cache misses/sec

Explanation: The two performance counters actually indicate that the log to be read is not found in memory and must be read from disk;

Note: If the two value is higher, the memory performance is insufficient, and SQL Server cannot buffer more logs.

Monitoring log Send queue:

Execute the following statement on the primary replica:

SELECT Ag.name as Ag_name, ar.replica_server_name as Ag_replica_server, dr_state.database_id as database_id,

Dr_state.log_send_queue_size, is_ag_replica_local = case

When ar_state.is_local = 1 then N ' Local '

ELSE ' REMOTE '

END,

Ag_replica_role = case

When Ar_state.role_desc was NULL then N ' Disconnected '

ELSE Ar_state.role_desc

END

From ((Sys.availability_groups as AG joins Sys.availability_replicas as ar on ag.group_id = ar.group_id)

JOIN sys.dm_hadr_availability_replica_states as ar_state on ar.replica_id = ar_state.replica_id)

JOIN sys.dm_hadr_database_replica_states dr_state on

ag.group_id = dr_state.group_id and dr_state.replica_id = ar_state.replica_id;

Log_send_queue_size represents the amount of log records (in kilobytes) that have not been sent to the secondary database in the primary database, that is, the send queue mentioned above, and if the sending queue for a secondary replica continues to increase, the greater the amount of data that this replica synchronizes with the primary replica.

In the example above, Server1 is the primary replica and Server2 is the secondary replica. The primary replica's log_send_queue_size is always null,server2 with a log_send_queue_size of 154067KB. Obviously, at this point Server2 obviously lags behind the primary replica.

Send (primary and secondary)

The Send phase is the simplest step in the AlwaysOn synchronization process (personally). In this phase, we focus primarily on network performance, because the primary replica must have a network to transfer the logs of the send queue to the corresponding secondary replicas. If the network is not good, AlwaysOn will incur a delay, and more, if it is in synchronous-commit mode, it will cause the primary replica's transaction to be delayed.

The monitoring of the network mainly through the following two performance counters:

    • Network interface:bytes sent/sec

Explanation: The number of bytes sent per second by the NIC;

Note: For this indicator we can not isolate, we need to combine the size of the sending queue, when the sending queue is large, but this performance indicator continues to be low, there may be a network problem (of course, if it is synchronous mode, you need to consider the speed of the auxiliary copy log harden to determine whether it must be related to the network).

    • Network interface:output Queue Length

Explanation: The sending queue size of the network card;

Note: In general, the network card queue is larger than 2 o'clock, indicating that the networks are congested, there may be network problems.

Log Receive and Cache (sencondary)

Log receive and cache occur on the secondary replica, indicating that the secondary replica accepts the log blocks sent from the primary replica and buffers them.

This phase is actually related to the network speed and memory size, both of which are described above, there is no too much elaboration, only to introduce you to the next phase of the AlwaysOn proprietary performance counters.

    • SQL server:availabiltiy Replica > Log Bytes received/sec

Explanation: The log size accepted per second in kilobytes (KB);

Description: The faster the acceptance, the better the network performance and memory performance.

Log Hardened (sencondary)

When I introduced log hardened, I had to talk about the first step of log flush, which is actually a thing, just a different expression. Because they are very similar in terms of their working principles and the impact of transactional submissions, the monitoring of log hardened is a straightforward reference to step one.

Redo (secondary)

The Redo object is a hardened log in the secondary replica (which may be in-memory or from disk), is the step of "implementing" Data synchronization in AlwaysOn, and the other steps are to pave the line, even if the last step of the log cure, it can only guarantee the primary and secondary replicas of the data consistent, The true data synchronization must wait for the redo of the secondary replica to complete.

In this phase, two factors determine the size of the data delay: The log size of the redo and the speed of the redo. Below we look at the performance monitoring of the next redo phase from these two perspectives:

Monitoring Redo Log Size

SELECT Ag.name as Ag_name, ar.replica_server_name as Ag_replica_server, dr_state.database_id as database_id,

Dr_state.redo_queue_size, is_ag_replica_local = case

When ar_state.is_local = 1 then N ' Local '

ELSE ' REMOTE '

END,

Ag_replica_role = case

When Ar_state.role_desc was NULL then N ' Disconnected '

ELSE Ar_state.role_desc

END

From ((Sys.availability_groups as AG joins Sys.availability_replicas as ar on ag.group_id = ar.group_id)

JOIN sys.dm_hadr_availability_replica_states as ar_state on ar.replica_id = ar_state.replica_id)

JOIN sys.dm_hadr_database_replica_states dr_state on

ag.group_id = dr_state.group_id and dr_state.replica_id = ar_state.replica_id;

Monitoring Redo the Speed

The speed of the redo can be achieved by the following performance counters:

    • SQL server:database Replica > Redone bytes/sec

Explanation: The number of bytes completed redo per second by the secondary replica;

Note: When analyzing this indicator, it must be combined with the log size of the redo to divide the results of the two to obtain a approximate synchronization delay time.

End

We know that AlwaysOn is the introduction of SQL Server 2012 technology, the current author does not have a lot of cases, the content of this article is based on my limited practical experience summed up, if there is something wrong also please criticize.

My kick for AlwaysOn performance monitoring

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.