Please do not use seconds_behind_master to measure the delay time of MySQL master standby

Source: Internet
Author: User
Tags percona

Links: http://www.woqutech.com/?p=1116

MySQL itself provides seconds_behind_master through show slave status to measure replication latency between master and slave, but today comes a scene that finds Seconds_behind_master as 0, a show that prepares the library Slave status shows that io/sql threads are normal, and changes on MySQL's main library are not synchronized to the standby for a long time. Without human intervention, until one hours later, MySQL will automatically re-connect the main library and continue to replicate the changes to the main library.

Impact Range: MySQL, Percona, all versions of MariaDB.

Although this scenario is very special, the probability of encountering it is not high, but the individual feels it necessary to remind DBAs who are using MySQL. The analysis of this scenario also helps us to understand the MySQL replication retry mechanism in more depth.

First, the steps to reproduce

Build the master-slave copy, temporarily disconnect the main library's network, and kill the Binlog dump thread of the main library MySQL.

In this case, the copy of the Repository is observed, show slave status:

Slave_io_running:yes

Slave_sql_running:yes

seconds_behind_master:0

However, after you restore the network, make any changes in the main library, the repository will not be able to obtain data updates. And the show slave status on the standby shows that the IO thread SQL thread is all right, and the replication latency is always 0.

All normal, normal monitoring software will not find the standby database has data latency.

Second, the principle analysis

MySQL's Replication is a key differentiator from other databases. is also the basis for scalability and high availability. It is already very intelligent, and only requires us to call change master to specify the Binlog file name and offset location to build the copy relationship from the main library to the standby.

The MySQL replication thread automatically logs the current copy location, automatically connects to the main library when the primary and standby replication is interrupted, and restarts the replication from the location where it was last interrupted. These operations are fully automated and do not require human intervention. This brings a lot of convenience to MySQL DBA, but it also hides a lot of details.

To really understand the truth of the problem and how to solve it, we still need to really understand the principle of MySQL replication.

2.1 "Push" or "pull"

First, the replication of MySQL is "push", not "pull". "Pull" means that the MySQL repository is constantly circulating to ask if the main library has data updates, which is a resource-intensive and inefficient way. "Push" means that MySQL's main library pushes this change to the repository when it has its own data update, which only interacts when the data is changed and consumes less resources. If you are a programmer, you will certainly choose to "push" the way.

So MySQL specifically how to "push" the column, in fact, the repository in the main library when requesting data change records, you need to specify from the main library binlog which file (master_log_file) of the specific number of bytes offset location (master_log_pos). Correspondingly, the main library initiates a thread of Binlog dump, which starts a record of the change from this location and sends a message to the repository. The repository always listens for changes from the main library and receives one to apply this data change locally.

2.2 Cause resolution

From the above analysis, we can roughly guess why the show slave status shows everything is OK, but in fact the changes in the main library are not synchronized to the standby library:

In the event of a problem, we killed the Binlog dump program. As a listener, the repository has not received any changes, it will assume that the main library for a long time without any changes, resulting in no change data pushed over. The repository is not able to determine whether the corresponding Binlog dump thread on the main library terminated unexpectedly or for a long time without any data changes. Therefore, for both cases, the repository is displayed as normal.

Of course, MySQL will try to avoid this situation. Like what:

L NOTIFY the repository thread of the kill when Binlog dump is killed. So we need to ensure that this notification is not sent to the standby library, that is, the key to reproduce the problem is that the Binlog dump was killed because of network congestion or other reasons can not be sent to the standby library.

L Standby If you have not received a change from the main library for a long time, it will re-connect the main library at intervals.

2.3 Problem Avoidance

Based on the above analysis, we know that MySQL is really unavoidable in this case, then there are ways we can avoid columns:

1. Passive processing: Modify the monitoring method of the delay, find the problem timely processing.

2. Proactive prevention: correctly set--master-retry-count,--master-connect-retry,--slave-net-timeout copy retry parameters.

L Passive Processing

Most of MySQL's latency monitoring directly captures the seconds_behind_master in show slave status. In this case, the seconds_behind_master cannot be used to truly measure the replication delay between the master and the standby. We recommend that you obtain the primary and standby delay scenario by polling the main library for insert time information and by copying it to the standby. Percona provides a similar scenario for pt-heartbeat.

After discovering this problem, we only need stop slave; Start slave; Restarting replication will solve this problem.

L Proactive Prevention

MySQL can specify three parameters for copying a line Chengtonglian Main library:--master-retry-count,--master-connect-retry,--slave-net-timeout.

where Master-connect-retry and master-retry-count need to be specified when the change master is set up for primary and standby replication, and Slave-net-timeout is a global variable that can be provisioned online at MySQL runtime.

The specific retry policy is: The Repository is too slave-net-timeout seconds have not received the data from the main library, it will start the first retry. Then every master-connect-retry second, the repository attempts to re-connect the main library again. It will not give up the retry until Master-retry-count is retried. If the main library is attached to the retry process, it considers the current main library to be good and will start waiting for slave-net-timeout seconds.

The default value for Slave-net-timeout is 3,600 seconds, master-connect-retry defaults to 60 seconds, and Master-retry-count defaults to 86,400 times. That is, if no data changes are sent to the main library for one hours, the repository will attempt to re-connect the main library. That's why, in our simulated scenario, one hours later, the repository will be re-connected to the main library, and the reason for synchronizing data changes continues.

In this case, if your main library changes more frequently, you can consider a small set of slave-net-timeout to avoid the main library Binlog dump thread terminated, unable to push the latest update.

Of course, slave-net-timeout settings too small also have problems, this will cause if the main library changes are indeed relatively small, the repository frequently reconnect to the main library, resulting in a waste of resources.

The latency monitoring of the master and standby replication in the Q monitor monitoring of the Wharton technology is not monitored by the seconds_behind_master. It uses a pt-heartbeat-like approach to monitor replication latency for the master standby.

Please do not use seconds_behind_master to measure the delay time of MySQL master standby

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.