Managing SQL Server AlwaysOn (5)-General monitoring (1)--general monitoring

Last Update:2017-03-14 Source: Internet

Author: User

Tags dba failover sql error

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

This article belongs to the managing SQL Server AlwaysOn series article

Objective:
The previous sections mentioned how to do general management of AlwaysOn, and this section and the next section specifically explain and demonstrate "monitoring". Managing and monitoring These two words are confusing in many cases, but we can probably distinguish them, for example, I do backups, count management, respond to errors, exceptions, and this is also management, but for errors, exceptions, and notifications DBA this is monitoring, and monitoring sometimes does not need to intervene, such as I monitor disk space , I can ignore it when there is plenty of space.In the Daily DBA work, I personally pay much more attention to monitoring than the so-called management, because with reasonable monitoring, you can foresee a lot of problems, the actual intervention of the workload will be much less, in many places can also be automated management. You can also find performance issues. Therefore, I think that in the DBA work, the monitoring should be the first priority. In addition, monitoring is often tied to alerts to quickly notify the responsible person when a monitoring threshold is triggered.For a small number of availability groups that need to be monitored, SSMS provides an AlwaysOn dashboard and AlwaysOn health trace, which can be used as a monitoring tool. Wordy so much, we enter the formal explanation below.

AlwaysOn Dashboards (Dashboard):
Dashboards are a report interface that has some degree of interaction to show the situation of an alwayson environment. Can be opened as shown:

When you open it, you'll see this look:

As you can see from the second figure, there are three possible things in the "Sync Status" column: Synchronized (SYNCHRONIZED), unsynchronized (not SYNCHRONIZING), synchronizing (SYNCHRONIZING). For synchronous replicas, the "synchronized" state should be, and other states are not normal. For asynchronous replicas, you should not show "synchronized", but should be "synchronizing." However, regardless of the synchronization mode, the "not synchronized" state means that it is not normal.

For synchronization states, each replica may also have the following operational states: Pending_failover, PENDING, ONLINE, OFFLINE, FAILED, failed_no_quorum, and NULL when the replica is not connected. Detailed content can be viewed from Books Online: sys.dm_hadr_availability_replica_states and roles and operational status

In the upper right corner, there are three hyperlinks: Start Failover Wizard, view AlwaysOn run status events, view cluster quorum information, and the second view AlwaysOn run status events is some built-in extended events, which are described in the next section. The third contains configuration information for the existing cluster, such as:

Next, in Add/Remove Columns to the right:

You can see the following points when you open them:

From here, you can dynamically add and remove columns that need to be displayed on the dashboard. You can also right-click in the blanks and then expand and collapse the group:

AlwaysOn Health Trace (Heath trace):
AlwaysOn Health tracking is an extended events session, which can be seen in my previous series of articles on extended events: SQL Server Extended Events (extented events) from beginner to Advanced (1)-from SQL Trace to extented events. Created automatically when an availability group is created for the first time. Can be seen and opened in the place shown:

By opening the extended event, you can see the data captured in real time, or you can get a customized configuration by modifying the configuration, or by viewing AlwaysOn health events from the second option at the top right of the AlwaysOn dashboard.

Since this feature is essentially an extended event category, there are links to the extension events that I've given earlier, so it's not cumbersome here. This feature is important, and many of our subsequent monitoring scripts are more or less borrowed from extended events. The basics of general monitoring here, the next section, "Managing SQL Server AlwaysOn (5)-General monitoring (2)--extended event monitoring" specifically describes how to use extended events for monitoring. Also, in this article, I've given as many monitoring scripts as possible for non-extended events for the reader's reference.

Monitoring Content Reference:
From the work experience and foreign experts to summarize the knowledge, for SQL Server AlwaysOn, need to monitor the content of the following, if collected, I will continue to fill in:

The health condition and configuration information of the WSFC.
SQL Server TCP listener information.
The space of the disk where the SQL Server file resides. (this part of the thematic form of presentation, after the completion of a link)
AlwaysOn listeners.
Failover (or SQL Server availability Group, instance state change).
Status changes for AlwaysOn databases.
Log transfer, redo rate.

AlwaysOn Monitoring Script Demo:
1. T-SQL finds whether the current SQL instance is the primary replica:
Since many operations can only be performed on the primary replica, and from a management point of view it is also necessary to know which is the current primary replica, so we first need to find out where the primary replica is:

IF serverproperty (' ishadrenabled ') = 1BEGINSELECT   Agc.name--Availability group name, Rcs.replica_server_name--SQL cluster node name, Ars.role_ Desc  --replica role, Agl.dns_name  --listener name from  sys.availability_groups_cluster as Agcinner JOIN sys.dm_hadr_ Availability_replica_cluster_states as Rcson rcs.group_id = Agc.group_idinner JOIN Sys.dm_hadr_availability_replica_ states as Arson ars.replica_id = Rcs.replica_idinner JOIN sys.availability_group_listeners as AGLON AGL.group_id = ARS.gro Up_idwhere  ars.role_desc = ' PRIMARY '--limits the primary replica, if not qualified, you can view all replicas end

The following two graphs are the result of a where condition and no where condition, respectively:

2. View AlwaysOn errors from sys.messages:
Here, we use the Sys.messages catalog view, the description of this view can be read Books Online: sys.messages, we use the following T-SQL statement to check the records of AlwaysOn related errors, this error can be used in the subsequent issue of warnings, This presentation is just the simplest, and the reader can make the appropriate changes to suit its needs:

SELECT message_id [Error_number], severity,--severity, between 1~25 textfrom sys.messages where text like ('%availability% ') and  is_event_logged = 1;--1= message is counted in the time log when an error occurs

3. The DMV used by AlwaysOn:

SELECT * FROM Sys.dm_hadr_clusterselect * from Sys.dm_hadr_cluster_membersselect * from Sys.dm_hadr_cluster_ Networksselect * from Sys.availability_groupsselect * from Sys.availability_groups_clusterselect * from Sys.dm_hadr_ Availability_group_statesselect * from Sys.availability_replicasselect * from Sys.dm_hadr_availability_replica_ Cluster_nodesselect * from Sys.dm_hadr_availability_replica_cluster_statesselect * from Sys.dm_hadr_availability_ Replica_statesselect * from Sys.dm_hadr_auto_page_repairselect * FROM Sys.dm_hadr_database_replica_statesselect * Sys.dm_hadr_database_replica_cluster_statesselect * FROM Sys.availability_group_listener_ip_addressesselect * FROM Sys.availability_group_listenersselect * from Sys.dm_tcp_listener_states

4. SQL Server TCP Listener information:

SELECT listener_id, IP_Address, Is_ipv4, Port, Type_desc, State_desc, Start_timefrom sys.dm_tcp_listener_states with ( NOLOCK) OPTION (RECOMPILE);

5. AG Status:

--AG Status DECLARE @HADRName VARCHAR SET @HADRName = @ @SERVERNAMESELECT N.group_name,n.replica_server_name,n.node_ Name,rs.role_desc,db_name (drs.database_id) as ' DBName ', drs.synchronization_state_desc,drs.synchronization_health _descfrom sys.dm_hadr_availability_replica_cluster_nodes Njoin Sys.dm_hadr_availability_replica_cluster_states CS On n.replica_server_name = Cs.replica_server_namejoin sys.dm_hadr_availability_replica_states rs on rs.replica_id = Cs.replica_idjoin sys.dm_hadr_database_replica_states drs on rs.replica_id = Drs.replica_idwhere N.replica_server_ Name <> @HADRName

6. SQL Server Agent alerts:

Use Mastergoselect message_id as Errornumber,textfrom sys.messageswhere TEXT like ('%availability% ') and language_id = 1033

You can see a lot of things:
Here are some of the possible error numbers that can be useful for subsequent extensions:

**about AG useful error numbers**
ErrorNumber	Warning Name	Reason
1480	Change of role of AG (Failover)	implies that the availability group has failed over, that the reason for the transfer needs to be checked and whether it needs to be transferred back.
976	Database not accessible (Accessible)	The database is not available for queries, possibly data hangs or secondary replicas are not available for read access.
983	Database not accessible (Accessible)	The replica role is parsing (resolving), the database is inaccessible, the SQL error log and the network, storage, and other related error message event logs are checked, possibly due to startup, failover, communication, or cluster errors.
3402	Database is being restored (db Restoring)	The database is being restored or other states that cannot be recovered, checking that the database is corrupt or in a pending state.
19406	AG Copy status change (AG Replica Changed States)	Replica status causes changes due to issues such as startup, failover, communication, or cluster failure. Check the log and then subdivide the processing means.
35206	Connection Timeout (Connection timeout)	Check for network and firewall issues, or if there is a copy transfer failure.
35250	Failed to join database (the connection to theprimary replicawas not active)	Check whether the SQL error log really listens on ports and all IPs because the connection to the primary replica fails to cause the command to be processed.
35264	Data movement hangs (movement Suspended)	Need to manually recover data movement, check the SQL log to find the cause.
35273	Database is inaccessible (db inaccessible)	Recovery failed because of a session break with the primary replica. may be caused by WSFC quorum, link, endpoint configuration, or permissions issues.
35274	Database recovery hangs (db Recovery Pending)	The secondary replica waits to receive the transaction log from the primary replica to return online, ensuring that the primary replica is on an online state.
35275	Database hangs (Suspect state)	The database is in a potentially corrupt state and cannot connect to the availability group. Restore the database and then connect the availability group.
35276	Database out of sync (db out of sync)	Manual intervention is required to restore synchronization.
41091	Copy offline (Replica going Offline)	If repeated occurrences need to be checked for reasons. The copy may be offline because the lease expires or the update fails. Or a network problem, or the sp_server_diagnostic query times out.
41131	AG Online failure (Failed to bring AG online)	Confirm that the WSFC node is online and that an AG resource exists in the WSFC cluster
41142	Replicas cannot be the primary role (Replica cannot become Primary)	1 or more databases are out of sync or cannot be connected to an availability group, or the cluster is started in forced quorum mode.
41406	AG is not prepared for automatic failover (ag not ready for auto Failover)	Primary and secondary replicas are configured in automatic failover mode, but secondary replicas are not ready for success. The replica may be in an unusable state. Check the secondary copy for additional information.
41414	Secondary replica not connected (secondary not Connected)	At least one secondary replica cannot connect to the primary replica. Check that the SQL error log is correct for ports and IP listening.

For example 1480, the AG role change, we can add monitoring in the SQL Agent warning, the template is as follows, once the change occurs, send the message to the DBA:

EXEC msdb.dbo.sp_add_alert@name = N ' [Name of alert] ', @message_id = 1480, @severity = 0, @enabled = 1, @delay_between_respons Es = 0, @include_event_description_in = 1; Goexec msdb.dbo.sp_add_notification@alert_name = n ' [name of alert] ', @operator_name = n ' [operator] ', @notification_ method = 1; GO

Other parts of the monitoring will be gradually improved in the follow-up.

7. Related Performance counters:The following two counters are dedicated to AlwaysOn counters to understand the health status and performance of the current availability group. Sqlserver:availability Replica and Sqlserver:database Replica

Sqlserver:availability Replica: The granularity of monitoring is the availability group level.
Sqlserver:database Replica: Monitor granularity for the database level in the availability group.

Both counters need to be differentiated between primary and secondary replicas, some are monitored, and some are only meaningful for a role. The following list is from the p135 in the SQL Server 2012 Implementation and Management Guide:

**AlwaysOn performance counters that can be used by availability replicas of different roles**
Counter Name	Primary replica	Secondary copy/secondary replica
Availability Replica:sends to Replica/sec	√	√
Availability Replica:receives from Replica/sec	√	√
Availability Replica:bytes Sent to Replica/sec	√	√
Availability Replica:bytes Received from replica/sec	√	√
Availability Replica:sends to Transport/sec	√	√
Availability Replica:bytes Sent to Transport/sec	√	√
Availability Replica:resent Messages/sec	√	√
Availability Replica:flow Control Time	√
Availability Replica:flow Control/sec	√
Database Replica:redo Bytes Remaining		√
Database Replica:log Bytes received/sec		√
Database replica:file Bytes received/sec		√
Database Replica:log Remaining to Undo		√
Database replica:total Log requiring Undo		√
Database Replica:redone bytes/sec		√
Database Replica:recovery Queue		√
Database Replica:log Send Queue		√
Database replica:transaction Delay	√
Database replica:mirrored Write transactions/sec	√

For specific explanations of the counters, please search by yourself. In addition, we can query the contents of the counter by the following T-SQL command.

Select Object_name,counter_name,instance_name,cntr_valuefrom sys.dm_os_performance_counters where object_name like ' %replica% '

Summarize:
So far, the general monitoring content of AlwaysOn is introduced, but the specific refinement content and corresponding solution need to accumulate and improve in the work continuously. So there is no way to complete the list, in my work process if there is a new method, will continue to update.

Managing SQL Server AlwaysOn (5)-General monitoring (1)--general monitoring

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More