MySQL replication exception big literacy: Quick traceability and troubleshooting full solution
Https://mp.weixin.qq.com/s/0Ic8BnUokyOj7m1YOrk1tA
Author Introduction
Wang Songlei, currently working in Ucloud, is engaged in the MySQL database core research and development work, mainly responsible for the Ucloud cloud database UDB kernel troubleshooting work and the new database features of the research and development work.
Replication, as the MySQL native data synchronization feature, plays a critical role in MySQL's high-availability architecture. This article combs the replication problems encountered in the daily operations of the MySQL high-availability product udb, and summarizes the methods for troubleshooting replication exceptions when replication occurs.
First, error troubleshooting
1. Collect and copy information
In the event of an exception to replication, we will first collect replication-related information and error-related information, mainly through the following means.
(1) View Show slave status
Execute the command "show slave status" To view replication-related information. Focus on the following pieces of information:
master_log_file:mysql-bin.000063
read_master_log_pos:282657539
The IO thread reads the Binlog file name of the main library and the location in the Binlog. These two fields represent the progress of the binlog transfer from the main library to the standby during the copy process.
relay_log_file:mysql-relay.000002
relay_log_pos:313885
The file name of the relay log to which the SQL thread executes and the location in the relay log.
relay_master_log_file:mysql-bin.000002
exec_master_log_pos:316585
The SQL thread executes the relay log corresponding to the Binlog file name in the main library and the location of the Binlog.
These four fields represent the progress of the Master library's data being replayed on the standby during the copy process.
Slave_io_running:yes
Slave_sql_running:no
The current problem is which thread, IO thread, or SQL thread.
retrieved_gtid_set:ed7c5ee4-762d-11e6-ab9e-6c92bf24c36a:14-3920163
Executed_gtid_set:04ffb4f5-762e-11e6-81e4-6c92bf26c5c2:1
These two fields only make sense when Gtid is turned on. Represents the gtid of transactions in the binlog that the IO thread receives corresponding to the Gtid and the transactions performed by the SQL thread.
The Gtid here will not be changed because of the replication, that is, the main library of Gtid corresponding transaction must be the main library after the execution, through replication sent over. The Gtid corresponding transaction for the standby library must be performed by the standby.
Last_errno/last_io_errno/last_sql_errno
Laset_error/last_io_error/last_sql_error
Io/sql a description of the error that occurred on the thread.
(2) Viewing the error log
The error log records the error message that occurs with MYSQLD, which is the error message of the replication, and also records information about the start and stop of the copy, which can be viewed in the following ways:
In error log, the following information is the main concern.
Start replication (start slave)
When you start replication from the library, the copy start location is recorded in error log, including the starting position of the IO thread reading the main library side Binlog and the start of the relay log executed by the SQL thread. The exact time to start copying is also recorded in error log.
Stop replication (stop slave)
When you stop copying from the library, error log records the location of the binlog of the main library that was read to when the IO thread stopped, and the time to stop copying.
Copy error message
The description of the replication error message is shown in the Last_error in show slave status, but if the error message is longer (especially in the case of multi-threaded replication), show slave status does not fully display the full information of the error, You need to view the error log to see the full error message. Like what
The above error message is not a complete description of the error message, you can see a more complete description of the information in error log, and the time the error occurred.
(3) Viewing binary log files
The binary log files here include the Binlog of the main library, the relay log from the library, the Binlog from the library.
The Binlog of the main library refers to the Binlog log of transaction records performed by the main library.
The relay log from the library refers to the Binlog log of the main library that is received from the library.
The Binlog from the library is the Binlog log of the log (Log-slave-updates on) and the transaction record that was executed from the library after the relay log is reproduced from the library SQL thread.
Logs recorded in a binary log file are recorded in event, such as a DML statement that typically consists of 4-5 event, and a DDL statement typically consists of 2 event.
Binary log files can convert Binlog logs to recognizable formats by command "show Binlog events" or tool mysqlbinlog.
Show Binlog events format is as follows:
Displays the contents of the record in the row format Binlog, which contains a DML statement and a DDL statement. DML statements contain Gtid, query, Table_map, Write_row, XID five EVENT,DDL statements containing Gtid, query two event.
The Mysqlbinlog tool can also parse Binlog, providing event information similar to the show Binlog event, as an example of one of the event examples:
Time of Event
The time at which transactions are performed for the primary library, regardless of the relay log and Binlog from the library, when the transaction is performed for the primary library
SERVER_ID of the event
Records the server_id of the database executing the transaction, which can be used to distinguish whether the transaction is a primary or a library execution
End_log_pos of the Event
The actual file location of the event from the End_log_pos in the relay log of the library corresponding to the Binlog in the main library
The End_log_pos of the main library and the Binlog from the library is the true location of the Binlog file
Event at XXX
At XXX represents the actual location of the event in the file
For the contents of the binary log files above, we need to focus on the following information:
Previous_gtids Events Records All Gtid information that was performed before the current binlog to locate the specific gtid.
The corresponding GTID in the GTID event, corresponding to the transaction is one by one, and the table name whether the transaction is performed by the main library or by the vault.
When the error occurs, the execution time of the transaction executes the specific statement of the transaction.
After the main library performs a database operation, the related logs are logged to the binlog of the main library. After the IO thread of the repository receives the Binlog log transferred from the main library, the logs are logged to the relay log, and if the log_slave_updates option is turned on by the standby, the SQL thread logs the Binlog log in the process of replaying relay log. These three binary file logs should be the same on the execution content.
(4) View other variables
View other replication-related system variables or states, such as:
Perform "Show variables like ' Gtid_mode '" to see if Gtid is turned on;
Perform "Show status Like ' Rpl_semi_sync_master_status '" To view the status of the semi-synchronous replication.
Not listed here.
Second, troubleshooting errors
After collecting the above replication information, the following is the main way to troubleshoot replication errors:
1. View Show slave status
See which thread (IO thread or SQL thread) is causing the error to see the cause of the error;
If an IO thread error occurs, record the file name and location of the Binlog received when the error occurs (if Gtid is turned on, log gtid);
If an error occurs with the SQL thread, the file name and location of the relay log to which the error occurred is logged (if Gtid is turned on, Gtid is logged).
2. Check the error log
Further confirm the cause of the error, in part because it is only logged in the error log and will not be shown in show slave status. For example, lack of space causes IO thread error, such as network interruption caused IO thread exception and so on.
See if replication is not available because other users shut down replication gracefully or the thread that is related to the kill replication.
Check to see if an error occurred, whether the statement that just started the replication and that the error occurred is the one executed by the first copy. If you are the first statement, you need to consider whether replication exceptions are caused by a replication error, or if other factors, such as unexpected downtime, cause replication-related binary log files to be incorrect.
Compare the error logs of the main and standby libraries to see if the same replication errors have occurred and whether the main library has made special error handling.
3. Compare binary log files
Compare whether the Binlog being received by the repository is in conflict with the Binlog being executed by the main library (the files and locations of Binlog received by the repository are larger than the main library).
If Gtid is turned on, it is gtid to see if the repository has performed a database operation itself, to see if the Gtid has performed more than the main library, and whether the repository has performed the gtid of other hosts.
Depending on the file and location (or Gtid) of the Binlog at the time of the error, the binary files of the primary and standby libraries are parsed, and the actions recorded in the log are compared to the same file and location (or the same gtid).
Check the binaries of the repository to see if the repository has performed any actions that conflict with the main library.
Summarize
For a copy that is in a normal state, you should be in the following state:
Check that the replication status should be normal, such as show slave status shows that both the IO thread and the SQL thread are running in Yes, such as semi-synchronous replication in show status like "rpl%" displays the semi-synchronous replication status on.
Neither the main library nor the standby is reported with a copy-related error message.
The contents of the database operations recorded in the binary log files of the master and the repository should be consistent, and the contents of the data in the master and standby repositories should be consistent.
By comparing the above information, looking at the status of an exception or logging, you can help us to troubleshoot replication-related errors.
Third, version and configuration
In general, versions and configurations differ only in the display format of various information and do not unduly affect the methods described above.
1. Version
Examples of the above information collection and analysis are examples in the mysql-5.7 version, and there may be differences in the content of the information or the way the information is stored in different large versions.
The mysql-5.6 version and the mysql-5.7 version have the following differences in replication-related information:
Log:
When mysql-5.6 stops copying, error log will have an incorrect information record:
GTID:
Mysql-5.6 's gtid_executed is presented in the form of the global system variables, and mysql-5.7 is presented as a mysql.gtid_executed table.
BINLOG:
The mysql-5.6 version uses the following event to record the self-increment ID when using the self-increment ID.
#170419 11:27:12 Server ID 30061 end_log_pos 494 CRC32 0x7a9f75c6 Intvar
SET insert_id=1/*!*/;
2. Configuration
The main embodiment of the differences in the configuration includes Gtid_mode and Binlog_format.
(1) Gtid_mode
When Gtid is turned on, Gtid is judged by who executes the transaction, whether it executes, the transaction receives, and the execution progress. At the same time can be seen through the show slave status can be visualized Gtid reception, implementation of the situation.
When Gtid is closed, file and Pos are the criteria for receiving and executing, server_id as the standard by whom the transaction is executed. However, all the server_id of the transaction are not fully displayed, so it is difficult for us to troubleshoot the problem.
(2) Binlog_format
Binlog_format affects the format of log content recorded in the Binlog, taking the same INSERT statement as an example, the format of the statement format is recorded in Binlog as follows (only the Difference section is shown):
The row format is logged to Binlog in the following format:
Iv. causes of common replication errors and analysis process
After collecting the above replication related information and error information, we need to base on the actual error information analysis, here are a few common replication errors, we can through some or all of the relevant information collected in the above section to analyze the cause of replication errors.
1. Execute statement from library conflict with Main library
(1) Cause of error
When a DML statement or DDL statement is executed from a library, there is a case of inconsistent data between the main library and the slave library. The statements that cause the main library to execute are not able to execute properly from the library.
(2) Error message
The common error message is as follows: Because of a statement that conflicts with the main library from the library, resulting in a replication error.
Failed to create library or table
INSERT statement PRIMARY KEY violation
The DELETE statement could not find the corresponding statement
Because this is a more common cause, all operations that cause master-slave conflicts cause replication errors, which are not listed here.
(3) Cause analysis process
This is an example of why a database error has been created due to the existence of a database.
View error Log
The detailed error message displayed in error log is as follows:
Display failed while executing Gtid 0c1b77a7-c113-11e6-9bd6-d4ae52a34783:6. The reason for the error is that the database already exists and cannot be created.
View Show slave status
When the error occurs, when you view the information shown in show slave status, you will find the following information:
In the information displayed in Executed_gtid_set, in addition to the master UUID corresponding to the Gtid, there is another Gtid, we can view the Gtid from the library, execute the following statement:
The discovery of additional gtid is generated from the execution of the library.
View Binlog logs from a library
The Binlog log from the library is the Binlog information of the host that the SQL thread has reproduced or the Binlog log of transactions that are performed from the library itself. These transactions can be distinguished by server_id or Gtid.
Here, for example, to create a database failure, find the 3a169e6c-f1d0-11e6-bb30-d4ae52a34783:1 corresponding transaction from the library Binlog, and find the following information:
View relay log logs from library
From the library relay log logging is the Binlog log information that the IO thread receives from the main library, and we look at the transaction information for the Gtid that failed to execute:
Summarize
Finally, you can confirm that the replication failure occurs when the SQL thread executes the statement that created the database again after the database statement was created from the library.
2, the Binlog of the main library is lost
(1) Cause of error
During the copy process, the replication exception occurs because the Binlog of the main library that needs to be read from the library is lost. The main reasons for the loss of the main library binlog are as follows:
The main library performs the Reset Master command
Main Library execution purge binary/master logs command
Expire_logs_days is set up in the main library and the Binlog is deleted automatically
Binlog of the main library was mistakenly deleted
(2) Error message
If the host binlog is not found, the error log from the library will report the following errors:
(3) Cause analysis process
View error Log
The detailed error message displayed in error log is as follows:
The error message indicates that the corresponding Binlog file could not be found.
View the main library Binlog logs
Viewing the list of Binlog log files for the main library, you may find that the binlog of the main library becomes a restart record:
or binlog that need to be copied have been deleted:
Summarize
If Binlog restarts the recording, it is usually because the main library has performed the Reset Master command, which causes all binlog to be deleted.
If Binlog continues to record, it is only removed from the binlog required by the library, usually because the main library manually executes the purge binary logs command, or the log is retained longer than the Expire_logs_days setting.
3. Statements from the library that do not have the main library replicated
Because of the nature of Gtid, SQL threads do not perform the same gtid transaction, that is, if the SQL thread discovers that the gtid corresponding to a transaction read from relay log already exists in the gtid_executed from the library, then the SQL thread does not exist.
(1) Cause of error
During the copy process, the Gtid for the transactions performed by the main library already exist in the gtid_executed from the library, and the transactions are not executed from the library, resulting in inconsistent data from the main and library. This is usually the case:
The host performed reset Master (the first binlog of the host is currently read from the library and does not cause the file to be found because of reset master)
Re-master from, from the library does not clear from the library's Binlog
(2) Error message
In the process of ignoring the transaction performed by the host from the library, no errors are reported from the library copy, so the replication exception is easily ignored and there is no way to find it in time.
Subsequent DML and DDL operations may fail with execution failures due to inconsistencies between the main library and the database from the library.
(3) Cause analysis process
For example, we cannot find the corresponding table in the INSERT statement.
View error Log
Error message is logged in the error log:
View Show slave status
Show slave status shows all of the information is normal, not binlog the database execution transaction generated. This does not preclude the possibility of performing a drop table operation from the library shutdown Binlog.
View Table
Execute commands on both the host and from the Library Show create table MYDB.MYTBL4, discovering that there is no mydb.mytbl4 from the library.
(4) Parsing Binlog log
Resolve the host Binlog log to view the transaction log for the build table:
Parse the Binlog log from the library to find out if there is a transaction log for the table:
At this point we find that the statements executed from the library and the host are not the same for the same gtid.
(5) Summary
From the above analysis, we infer that there is no execution of the table statement from the library, which causes the main library data to be inconsistent.
(6) Description
In this case, the mysql-5.7 version will have a stricter checksum on replication, and if the host sends Gtid less than the Gtid from the library, the following error will be reported:
However, even in version 5.7, if replication is initiated (after the error is restarted), the main library executes more gtid than from the library and will still report the same error.
4. The main library executes a statement that does not replicate
(1) Cause of error
Operations performed on the main library are not written to Binlog. This does not consider the case where the main library is actively shutting down Binlog.
(2) Error message
A failure from a library execution occurs because the main library and the data from the library are inconsistent, causing the operation of the main library to replicate to the slave library. Such as:
Failed to create table for federated engine
(3) Cause analysis process
Here is an example of a table using connection to create a federated engine.
View error Log
Error message is logged in the error log:
View the main library and the server table from the library
A record with the name S is present in the server table in the main library:
A record with the name S is not present from the library:
View the Create Server documentation description
The Create server statement is recorded in the document and is not recorded in the Binlog. This results in inconsistent data from the main library and from the library. Replication is not working properly.
Summarize
For operations that do not enter Binlog, the main library is required and executed from the library simultaneously in case the main library and the library are not inconsistent.
5. Repeat the relay log statement from the library (non-gtid, non-multithreaded replication)
When the variable relay_log_info_repository is set to file, the SQL thread from the library will update the corresponding file and location information to the relay_log.info after each execution of a transaction. For the SQL to continue copying from the correct location when restarting from the library.
(1) Cause of error
If a physical machine is down or an unexpected outage occurs from the library, it is possible that the SQL thread has executed a transaction in one of the relay logs, but the transaction's file and location information is not updated in a timely manner to relay_log.info. After a reboot from the library, the executed transaction is re-executed again.
(2) Error message
Recurring transactions include any transactions that are logged to the relay log, and the error messages that may occur include:
Failed to create library or table:
Insert statement PRIMARY key conflict:
The DELETE statement could not find the corresponding statement:
Because all types of transactions are possible, they are not listed here.
(3) Cause analysis process
Here is an example of a primary key violation of an INSERT statement.
View error Log
The following error message is logged in error log:
You can see that the first transaction that the SQL thread executes after it is started has a primary key violation error.
View Show slave status
Show slave status shows all of the information is normal, not binlog the database execution transaction generated.
View Table Mydb.k2
This record already exists in the table.
View relay log and Binlog from the library
View the relay log from the library, from the start location of the copy./relaylog002.000002:616 view
To view Binlog from a library:
Summarize
By analyzing the above-mentioned Binlog content, relay log does not record the same INSERT statement, and the Binlog from the library shows that the statement has been executed, and when restarted from the library, attempts to execute the same INSERT statement again, resulting in a primary key violation of the INSERT statement.
Description
If replication uses Gtid, then the Gtid feature causes the same statement to be executed from the library.
If you use multi-threaded replication with 5.7 version replication, then Mts_recovery will fix the problem.
This error can occur only in the case of non-multithreaded replication, non-gtid replication.
V. Summary
If replication errors occur, you can provide more information that can be used to troubleshoot replication errors by gathering information about replication and error-related information, and by analyzing the information that is associated with a normal replication exception.
Of course, the replication errors are various, not all errors can be detected to the specific cause. Many of the replication errors are difficult or impossible to troubleshoot, such as the main library or the Binlog log file from the library has been lost, such as the shutdown of Binlog after some operations caused replication inconsistencies, such as some kernel bugs cause the replication logic of MySQL itself has been abnormal and so on.
MySQL replication exception big literacy: Quick traceability and troubleshooting full solution