First, we know that in MySQL the binary log belongs to the server layer and is used primarily for replication and point-in-time recovery, while the transaction log (redo log) belongs to the InnoDB storage engine layer and is used to guarantee transaction safety. Now let's look at some of the details behind MySQL's master-slave replication; the replication process itself is covered in dedicated chapters.
With these basics in mind, we can understand how replication works by answering the following questions.
- Why does MySQL have both a binlog and a redo log?
- How is a transaction committed? During commit, which is written first, the binlog or the redo log? How is ordering consistency between the two logs guaranteed?
- How does crash recovery ensure the safety of master-slave replication?
- Why must the write order of the binary log be consistent with the InnoDB-level transaction commit order?
Why does MySQL have both a binlog and a redo log?
This comes from MySQL's architecture: MySQL supports multiple storage engines, so the binlog exists regardless of which storage engine is used, while a redo log is not necessarily present. Simply put, the binlog belongs to the MySQL server layer and the redo log belongs to the InnoDB layer.
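You can confirm this split on a running server; the statements below are standard MySQL commands, and the output of course depends on your build and configuration:

```sql
-- The binlog is a server-layer feature: it exists (or not) independently
-- of which storage engines are compiled in.
SHOW VARIABLES LIKE 'log_bin';                 -- ON / OFF: is the binary log enabled?
SHOW ENGINES;                                  -- which storage engines are available
SHOW VARIABLES LIKE 'innodb_log_file_size';    -- redo log sizing, an InnoDB-layer setting
```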
How is a transaction committed? During commit, which is written first, the binlog or the redo log? How is ordering consistency between the two logs guaranteed?
The MySQL binary log is used mainly for master-slave replication and point-in-time recovery. Let's review how MySQL commits a transaction with the binary log disabled and with it enabled.
The case where the binary log is not enabled
First, what does crash-safe mean? Crash-safe means that after the MySQL server restarts it can guarantee that:
- the data of all committed transactions still exists;
- all uncommitted transactions are automatically rolled back.
InnoDB guarantees both points through its redo log and undo log. To be strictly crash-safe, the redo log must be flushed to durable storage every time a transaction commits. Doing so sacrifices some performance but gives the best reliability. To balance the two, InnoDB provides the innodb_flush_log_at_trx_commit system variable, which users can tune according to the needs of their application.
innodb_flush_log_at_trx_commit = 0 | 1 | 2
- 0: records are written from the redo log buffer to the redo log file, and the file is flushed to durable storage, roughly once every N seconds; N is controlled by innodb_flush_log_at_timeout.
- 1: on every transaction commit, records are written from the redo log buffer to the redo log file and the file is flushed to durable storage.
- 2: on every transaction commit, records are only written from the redo log buffer to the redo log file; when the redo log is flushed to durable storage is decided by the operating system and innodb_flush_log_at_timeout. This setting keeps committed data safe when only the mysqld process goes down while the operating system keeps running.
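As a quick illustration, this trade-off can be inspected and adjusted at runtime, since innodb_flush_log_at_trx_commit is a dynamic global variable:

```sql
-- Check the current durability setting and the background flush interval.
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
SHOW VARIABLES LIKE 'innodb_flush_log_at_timeout';

-- Strictest crash safety: flush the redo log on every commit.
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
```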
During crash recovery, all transactions that were committed inside the storage engine are recovered by applying the redo log, and all transactions that were prepared but not committed are rolled back using the undo log. A client that reconnects will see that committed data still exists in the database, while uncommitted changes have been rolled back and need to be re-executed.
The case where the binary log is enabled
To keep the data on master and slave consistent, MySQL must keep the binlog and the InnoDB redo log consistent. (The slave replays the transactions committed on the master by reading the binary log, and the master writes the binlog before the engine commit; if the master crashed right after writing the binlog and, without any coordination, rolled the transaction back on restart, the slave would still have applied it from the binlog, leaving master and slave inconsistent.) So, with the binlog enabled, how does MySQL keep the binlog and the InnoDB redo log consistent? To do this, MySQL introduces two-phase commit (2PC) and internally handles an ordinary transaction as an XA transaction (an internal distributed transaction):
- Each transaction is automatically assigned a unique ID (XID).
- The commit is automatically split into two phases: prepare and commit.
- The binlog acts as the transaction coordinator, and the binlog events serve as the coordinator's log.
To learn more about 2PC, see https://en.wikipedia.org/wiki/Two-phase_commit_protocol
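MySQL also exposes the same prepare/commit split directly through its XA SQL interface. The sketch below is only a user-visible analogue of the internal 2PC described above; the table, data, and XID are made-up placeholders:

```sql
-- An explicit XA transaction: here the client plays the coordinator role
-- that the binlog plays for ordinary transactions.
XA START 'demo_xid';                                        -- 'demo_xid' is a placeholder XID
UPDATE account SET balance = balance - 100 WHERE id = 1;    -- hypothetical table and row
XA END 'demo_xid';
XA PREPARE 'demo_xid';    -- phase 1: durable but undecided (like the engine prepare)
XA COMMIT 'demo_xid';     -- phase 2: the decision (or XA ROLLBACK 'demo_xid')
```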
In this 2PC the binlog acts as the transaction coordinator, and the InnoDB engine is told by the binlog layer when to prepare, commit, or roll back. The entire process of committing a transaction is as follows:
As the figure above shows, committing a transaction is divided into two main steps:
1. Prepare phase (storage engine (InnoDB) transaction prepare phase)
At this point the SQL statement has executed successfully, and the XID and the in-memory redo and undo records have been generated. The prepare method is then called to complete the first phase; it does little more than set the transaction state to TRX_PREPARED and flush the redo log to disk.
2. Commit phase (storage engine (InnoDB) commit phase)
2.1 Write the coordinator log, i.e. the binlog.
If the prepare of every storage engine involved in the transaction succeeded, the TC_LOG_BINLOG::log_xid method is called to write the transaction to the binlog (write() puts the binlog data into the file system cache, and fsync() persists the cached binlog data to disk). At this point the transaction is already considered committed. Otherwise ha_rollback_trans is called to roll the transaction back, and nothing is actually written to the binlog.
2.2 Tell the storage engine to commit.
Finally, the engine's commit is called to finish the transaction: the undo information is cleaned up, the redo log is flushed, and the transaction is set to the TRX_NOT_STARTED state.
PS: It is critical that the binlog is recorded only after the InnoDB engine has prepared (that is, after the redo log has been written to disk).
As the two-phase commit process above shows, once the operation in step 2 is complete the transaction is considered committed, even if the database goes down during step 3. Note also that each step requires an fsync to keep the upper and lower layers consistent: the fsync in step 2 is controlled by sync_binlog=1, and the fsync in step 3 by innodb_flush_log_at_trx_commit=1. This combination, commonly known as "double 1", is the foundation of crash safety.
The parameters are described as follows:
innodb_flush_log_at_trx_commit (redo log)
- 0: the log buffer is written to the log file and flushed roughly once per second. The flush frequency is controlled by innodb_flush_log_at_timeout, which lets you set it to N seconds (N ranges from 1 to 2700, default 1).
- 1: the log buffer is written to the log file and flushed to disk on every transaction commit.
- 2: the log buffer is written to the log file on every transaction commit, but no flush is performed.
sync_binlog (binlog)
- 0: when the binlog cache is flushed to disk is left to the operating system.
- N: the binlog cache is flushed to disk after every N transaction commits.
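A minimal sketch of the "double 1" configuration mentioned above; both variables are dynamic, and the same values can of course be placed in my.cnf instead:

```sql
-- "Double 1": the safest, crash-safe combination.
SET GLOBAL innodb_flush_log_at_trx_commit = 1;  -- fsync the redo log on every commit
SET GLOBAL sync_binlog = 1;                     -- fsync the binlog on every commit

-- Verify the current values.
SHOW VARIABLES WHERE Variable_name IN ('innodb_flush_log_at_trx_commit', 'sync_binlog');
```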
The two-phase commit protocol guarantees that, no matter what happens, a transaction either exists in both the storage engine and the binlog, or in neither, which keeps the data on master and slave consistent. If the database crashes, crash recovery runs when it restarts: for every transaction that the storage engine finds in the prepared state, MySQL checks whether the transaction also exists in the binlog. If it does, the transaction is committed inside the storage engine (because the slave may already have fetched the corresponding binlog content); if it does not, the transaction is rolled back. For example, if the crash happens between step 1 and step 2, the prepared transaction has not yet been written to the binlog, so it is rolled back inside the storage engine and ends up in neither the engine nor the binlog. If the crash happens between step 2 and step 3, the prepared transaction does exist in the binlog, so it is committed inside the storage engine and ends up in both.
To keep the data safe, all three steps above call fsync to persist data to disk. The third fsync can usually be omitted, because a transaction that has been prepared inside the engine can be recovered from the binlog.
In addition, MySQL's internal two-phase commit requires innodb_support_xa=true, which is on by default. This parameter enables the two-phase commit that supports distributed (XA) transactions; the consistency between the redo log and the binlog depends on it, so turning it off can lead to lost transaction data.
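On MySQL versions that still expose this variable (later releases removed it and always behave as if it were enabled), you can check it like this:

```sql
-- Confirm that internal XA / two-phase commit support is enabled
-- (only meaningful on versions that still have the variable).
SHOW VARIABLES LIKE 'innodb_support_xa';
```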
How does crash recovery ensure the safety of master-slave replication?
Crash recovery with the binary log enabled works as follows. MySQL generates the XID in the prepare phase and writes it into the binlog in the commit phase; whether a transaction is committed or rolled back during recovery is decided by the binlog:
- if the transaction's Xid_log_event is present, the transaction must be committed;
- if the transaction's Xid_log_event is absent, the transaction must be rolled back.
The process of recovery is very simple:
- Read all the Xid_log_events from the binlog.
- Tell InnoDB to commit the transactions with these XIDs.
- InnoDB rolls back the other prepared transactions.
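You can see the Xid events that recovery relies on by inspecting a binary log from a client session; the file name below is a placeholder, so use one reported by SHOW BINARY LOGS:

```sql
-- Each committed transaction in the binlog ends with an Xid event carrying
-- the XID that crash recovery matches against prepared transactions in InnoDB.
SHOW BINARY LOGS;                                   -- list the available binlog files
SHOW BINLOG EVENTS IN 'binlog.000003' LIMIT 20;     -- 'binlog.000003' is a placeholder name
```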
To sum up, at most the following situations can occur:
- If the crash happens in the prepare phase, the transaction was never written to the binary log and was never committed in the storage engine, so recovery rolls it back.
- If the crash happens while writing the binlog, so that the log was not successfully flushed to disk, the transaction is rolled back on startup.
- If the crash happens after the binlog has been fsync()ed to disk but before InnoDB has committed, recovery reads the Xid_log_events from the binary log, tells InnoDB to commit those XID transactions, and rolls back the remaining prepared transactions, keeping the storage engine and the binary log consistent.
In short, if a transaction succeeded in the prepare phase and its binlog write in the MySQL server layer also succeeded, the transaction is guaranteed to end up committed.
Why must the write order of the binary log be consistent with the InnoDB-level transaction commit order?
The two-phase commit described above keeps the storage engine and the binary log consistent for a single transaction. But in a concurrent scenario, how do we ensure that transactions are committed to the InnoDB transaction log and to the binary log in the same order? And what happens if the binlog order and the storage engine commit order diverge when multiple transactions commit concurrently?
The ordering requirement comes from backup and recovery: for example, when a physical backup tool such as XtraBackup or ibbackup takes a backup and that backup is used to set up replication. Consider this case:
Suppose transactions start in the order T1, T2, T3 and write their binary logs (in the order T1, T2, T3) into the log file system buffer, after which an fsync() group commit writes the logs permanently to disk, but the storage engine commits them in the order T2, T3, T1. If, after T2 and T3 have committed, an online physical backup is taken and used to build a replica, the InnoDB storage engine layer will see that transaction T3 is already committed in both layers and needs no recovery; when the slave is built, the log offset recorded for CHANGE MASTER TO is the position after transaction T3. Transaction T1, which appears earlier in the binlog but is missing from the backup, is therefore never replayed, and master and slave become inconsistent.
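For reference, the recorded position would be applied on the new replica roughly like this; host, account, file name, and offset are all placeholders:

```sql
-- Start replication from the binlog position recorded by the physical backup.
-- If that position is "after T3" while T1 is still missing from the backup,
-- T1 is never replayed and the replica silently diverges from the master.
CHANGE MASTER TO
  MASTER_HOST = '192.0.2.10',           -- placeholder master address
  MASTER_USER = 'repl',                 -- placeholder replication account
  MASTER_PASSWORD = '***',              -- placeholder credentials
  MASTER_LOG_FILE = 'binlog.000042',    -- placeholder file from the backup metadata
  MASTER_LOG_POS = 120;                 -- placeholder offset from the backup metadata
START SLAVE;
```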
To address this, before MySQL 5.6 the prepare_commit_mutex lock was used to serialize commits so that the write order of the binary log in the MySQL server layer matched the transaction commit order in the InnoDB storage engine layer; the side effect was that group commit could not take effect. To satisfy durability, a complete transaction commit could require up to three fsync operations. To increase the number of transactions committed per unit of time with the binlog enabled, the number of fsync calls caused by each transaction commit had to be reduced. As a result, MySQL added binary log group commit in version 5.6 (MariaDB had already introduced it in 5.3).
Internally, MySQL acquired the prepare_commit_mutex lock in the redo prepare phase, and only one transaction could hold the mutex at a time. Under this notorious lock the redo log and binlog flushes were serialized; the serialization existed only to keep the redo log and the binlog in the same order, so group commit could not happen and performance was sacrificed. The whole process is as follows:
As can be seen, with prepare_commit_mutex the next transaction can begin its prepare only after the previous transaction has committed and released the lock, and the binary log fsync() in each transaction cannot be shared with other transactions. Because persisting in-memory data to disk is expensive, frequent fsync() calls make database performance drop sharply. MySQL therefore provides the sync_binlog parameter to set after how many binlog writes an fsync() is called to flush the binary log to disk, which improves overall performance.
As shown above, when the binary log is enabled, MySQL used prepare_commit_mutex and sync_binlog to keep the binary log order and the storage engine commit order consistent. The prepare_commit_mutex locking scheme performs very poorly when transactions commit with high concurrency, and the binary log cannot be group committed.
The problem was raised at the 2010 MySQL conference, where the Facebook MySQL team and Percona proposed solutions; the final "complete" solution came from Kristian Nielsen, a MariaDB developer. In that design, not only is the binary log write in the MySQL server layer group committed, the InnoDB storage engine layer is group committed as well. In addition, the original prepare_commit_mutex lock is removed, which greatly improves performance. MySQL 5.6 uses a similar implementation, called BLGC (Binary Log Group Commit), and divides the transaction commit process into three stages: the flush stage, the sync stage, and the commit stage.
How does BLGC (Binary Log Group Commit) keep the order of transactions recorded in the binlog consistent with the order in which they are committed to the storage engine?
With the BLGC technique introduced in MySQL 5.6, not only is the binary log write in the MySQL server layer group committed, the InnoDB storage engine layer is group committed as well, and the prepare_commit_mutex lock is removed, which greatly improves performance. The transaction commit process is divided into three stages: the flush stage, the sync stage, and the commit stage, as shown below:
The basic idea of binlog group commit is to introduce queues so that the InnoDB commit order stays the same as the binlog order, and to batch transactions into groups so that one binlog flush covers the whole group, which is what group commit means. When a transaction commits in the MySQL server layer, it is first placed into a queue in order; the first transaction in the queue is called the leader and the others are called followers, and the leader controls the behavior of the followers.
As can be seen, each stage has its own queue and each queue is protected by a mutex. The first thread to enter a queue becomes the leader and the other threads become followers; the leader does the work for the whole group, and when it has finished it notifies the followers that the flush is complete. BLGC splits transaction commit into three stages: the flush stage, the sync stage, and the commit stage.
Flush stage: write the binary log of each transaction into memory.
1) Hold the Lock_log mutex [the leader holds it, followers wait].
2) Gather the binlog of the whole group (all transactions currently in the queue).
3) Write the binlog buffers into the I/O cache.
4) Notify the dump threads to send the binlog.
Sync stage: flush the in-memory binary log to disk. If there are multiple transactions in the queue, a single fsync operation completes the binary log write for all of them; this is the essence of BLGC.
1) Release the Lock_log mutex and hold the Lock_sync mutex [the leader holds it, followers wait].
2) Sync the group's binlog to disk (the sync is the most time-consuming step, assuming sync_binlog is 1).
Commit stage: the leader calls the storage engine layer commit for each transaction in order. InnoDB itself supports group commit, so this fixes the problem that group commit was previously defeated by the prepare_commit_mutex lock.
1) Release the Lock_sync mutex and hold the Lock_commit mutex [the leader holds it, followers wait].
2) Iterate over the transactions in the queue and perform the InnoDB commit for each.
3) Release the Lock_commit mutex.
4) Wake up the waiting threads in the queue.
Note: there are multiple queues, each protected by its own mutex, and the stages are processed in order; the first thread to enter a queue becomes its leader, so the leader of the flush stage may become a follower in the sync stage, but a follower always remains a follower.
While one group of transactions is in the commit stage, newly arriving transactions can already enter the flush stage, so group commit keeps running continuously. Of course, the benefit of group commit depends on how many transactions are in each queue: if every queue holds only a single transaction, the effect may be no better, or even worse, than before. The more transactions commit concurrently, the more effective group commit becomes and the bigger the performance gain for the database.
MySQL 5.6 provides the parameter binlog_max_flush_queue_time (no longer effective as of MySQL 5.7.9), default 0, to control BLGC. It sets how long the flush stage of binary log group commit waits: even if the previous group of transactions has finished committing, the current group does not enter the sync stage immediately but waits at least this long, so that more transactions accumulate into a single group commit. The cost is that transaction response time becomes slower. The default of 0 means no waiting, and 0 is still the recommended setting, unless the MySQL instance has a very large number of connections (say, 100) constantly writing or updating.
MySQL 5.7 parallel replication builds multi-threaded replication on the slave on top of the master's BLGC mechanism: the binary log marks the transactions belonging to the same group with last_committed=N and records the commit order of the transactions within the group. To increase the number of transactions within a group (and thus the concurrency of the replica apply), MySQL 5.7 introduced binlog_group_commit_sync_delay=N and binlog_group_commit_sync_no_delay_count=N.
Note: binlog_max_flush_queue_time is no longer effective in MySQL 5.7.9 and later. With the new parameters, MySQL waits up to binlog_group_commit_sync_delay microseconds, or until binlog_group_commit_sync_no_delay_count transactions have accumulated, before performing a group commit.
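A small sketch of tuning these group-commit knobs at runtime; the values are arbitrary illustrations, not recommendations:

```sql
-- Wait up to 100 microseconds for more transactions to join the group,
-- but fire the group commit early once 20 transactions have queued up.
SET GLOBAL binlog_group_commit_sync_delay = 100;          -- illustrative value, in microseconds
SET GLOBAL binlog_group_commit_sync_no_delay_count = 20;  -- illustrative value

SHOW VARIABLES LIKE 'binlog_group_commit_sync%';
```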
The diagram below shows a group commit benchmark; you can see that the TPS with group commit is much higher.