Links: http://www.woqutech.com/?p=1116
MySQL itself provides seconds_behind_master through show slave status to measure replication latency between master and slave, but today comes a scene that finds Seconds_behind_master as 0, a show that prepares the library Slave status shows that io/sql threads are normal, and changes on MySQL's main library are not synchronized to the standby for a long time. Without human intervention, until one hours later, MySQL will automatically re-connect the main library and continue to replicate the changes to the main library.
Impact Range: MySQL, Percona, all versions of MariaDB.
Although this scenario is very special, the probability of encountering it is not high, but the individual feels it necessary to remind DBAs who are using MySQL. The analysis of this scenario also helps us to understand the MySQL replication retry mechanism in more depth.
First, the steps to reproduce
Build the master-slave copy, temporarily disconnect the main library's network, and kill the Binlog dump thread of the main library MySQL.
In this case, the copy of the Repository is observed, show slave status:
Slave_io_running:yes
Slave_sql_running:yes
seconds_behind_master:0
However, after you restore the network, make any changes in the main library, the repository will not be able to obtain data updates. And the show slave status on the standby shows that the IO thread SQL thread is all right, and the replication latency is always 0.
All normal, normal monitoring software will not find the standby database has data latency.
Second, the principle analysis
MySQL's Replication is a key differentiator from other databases. is also the basis for scalability and high availability. It is already very intelligent, and only requires us to call change master to specify the Binlog file name and offset location to build the copy relationship from the main library to the standby.
The MySQL replication thread automatically logs the current copy location, automatically connects to the main library when the primary and standby replication is interrupted, and restarts the replication from the location where it was last interrupted. These operations are fully automated and do not require human intervention. This brings a lot of convenience to MySQL DBA, but it also hides a lot of details.
To really understand the truth of the problem and how to solve it, we still need to really understand the principle of MySQL replication.
2.1 "Push" or "pull"
First, the replication of MySQL is "push", not "pull". "Pull" means that the MySQL repository is constantly circulating to ask if the main library has data updates, which is a resource-intensive and inefficient way. "Push" means that MySQL's main library pushes this change to the repository when it has its own data update, which only interacts when the data is changed and consumes less resources. If you are a programmer, you will certainly choose to "push" the way.
So MySQL specifically how to "push" the column, in fact, the repository in the main library when requesting data change records, you need to specify from the main library binlog which file (master_log_file) of the specific number of bytes offset location (master_log_pos). Correspondingly, the main library initiates a thread of Binlog dump, which starts a record of the change from this location and sends a message to the repository. The repository always listens for changes from the main library and receives one to apply this data change locally.
2.2 Cause resolution
From the above analysis, we can roughly guess why the show slave status shows everything is OK, but in fact the changes in the main library are not synchronized to the standby library:
In the event of a problem, we killed the Binlog dump program. As a listener, the repository has not received any changes, it will assume that the main library for a long time without any changes, resulting in no change data pushed over. The repository is not able to determine whether the corresponding Binlog dump thread on the main library terminated unexpectedly or for a long time without any data changes. Therefore, for both cases, the repository is displayed as normal.
Of course, MySQL will try to avoid this situation. Like what:
L NOTIFY the repository thread of the kill when Binlog dump is killed. So we need to ensure that this notification is not sent to the standby library, that is, the key to reproduce the problem is that the Binlog dump was killed because of network congestion or other reasons can not be sent to the standby library.
L Standby If you have not received a change from the main library for a long time, it will re-connect the main library at intervals.
2.3 Problem Avoidance
Based on the above analysis, we know that MySQL is really unavoidable in this case, then there are ways we can avoid columns:
1. Passive processing: Modify the monitoring method of the delay, find the problem timely processing.
2. Proactive prevention: correctly set--master-retry-count,--master-connect-retry,--slave-net-timeout copy retry parameters.
L Passive Processing
Most of MySQL's latency monitoring directly captures the seconds_behind_master in show slave status. In this case, the seconds_behind_master cannot be used to truly measure the replication delay between the master and the standby. We recommend that you obtain the primary and standby delay scenario by polling the main library for insert time information and by copying it to the standby. Percona provides a similar scenario for pt-heartbeat.
After discovering this problem, we only need stop slave; Start slave; Restarting replication will solve this problem.
L Proactive Prevention
MySQL can specify three parameters for copying a line Chengtonglian Main library:--master-retry-count,--master-connect-retry,--slave-net-timeout.
where Master-connect-retry and master-retry-count need to be specified when the change master is set up for primary and standby replication, and Slave-net-timeout is a global variable that can be provisioned online at MySQL runtime.
The specific retry policy is: The Repository is too slave-net-timeout seconds have not received the data from the main library, it will start the first retry. Then every master-connect-retry second, the repository attempts to re-connect the main library again. It will not give up the retry until Master-retry-count is retried. If the main library is attached to the retry process, it considers the current main library to be good and will start waiting for slave-net-timeout seconds.
The default value for Slave-net-timeout is 3,600 seconds, master-connect-retry defaults to 60 seconds, and Master-retry-count defaults to 86,400 times. That is, if no data changes are sent to the main library for one hours, the repository will attempt to re-connect the main library. That's why, in our simulated scenario, one hours later, the repository will be re-connected to the main library, and the reason for synchronizing data changes continues.
In this case, if your main library changes more frequently, you can consider a small set of slave-net-timeout to avoid the main library Binlog dump thread terminated, unable to push the latest update.
Of course, slave-net-timeout settings too small also have problems, this will cause if the main library changes are indeed relatively small, the repository frequently reconnect to the main library, resulting in a waste of resources.
The latency monitoring of the master and standby replication in the Q monitor monitoring of the Wharton technology is not monitored by the seconds_behind_master. It uses a pt-heartbeat-like approach to monitor replication latency for the master standby.
Please do not use seconds_behind_master to measure the delay time of MySQL master standby