Original address: http://www.cnblogs.com/liuhao/p/3714012.html
Foreword: the author's knowledge is limited and corrections are welcome. Where this article and the code disagree, the latest source code prevails.
InnoDB Redo Log
Let's start with what the InnoDB redo log is, why it is needed, and what it does. This is common knowledge, included only for completeness. InnoDB has a buffer pool (BP for short), which caches database pages. Every modification is first applied to a page in the BP; the page is then marked dirty and placed on a dedicated flush list, and the master thread or dedicated page-flushing threads periodically write these pages to disk or SSD. This avoids the large amount of random I/O that would result from going to disk on every write: periodic flushing can merge several changes to a page into a single I/O, and writing asynchronously also lowers access latency. However, if the server shuts down abnormally before the dirty pages reach disk, those modifications are lost; if a write was in progress at that moment, the data file may even be corrupted, leaving the database unusable. To avoid this, InnoDB writes every page modification to a dedicated file, the redo log file, and replays that file to recover when the database starts. This technique lets InnoDB defer flushing BP pages, which raises throughput and reduces latency. The costs are the extra overhead of writing the redo log (sequential I/O, so it is fast) and the time needed for recovery at database startup. The rest of this article looks at the structure of the log file, how log records are generated, and the recovery process at startup, with reference to the MySQL 5.6 code.
Log File Structure
The redo log consists of a group of log files that are written in a circular fashion. The size and number of the files are set by the parameters innodb_log_file_size and innodb_log_files_in_group. Each log file begins with a file header, defined in "storage/innobase/include/log0log.h". The header records the following information:
/* Offsets of a log file header */
#define LOG_GROUP_ID            0   /* log group number */
#define LOG_FILE_START_LSN      4   /* lsn of the start of data in this
                                    log file */
#define LOG_FILE_NO             12  /* 4-byte archived log file number;
                                    this field is only defined in an
                                    archived log file */
#define LOG_FILE_WAS_CREATED_BY_HOT_BACKUP 16
                                    /* a 32-byte field which contains
                                    the string 'ibbackup' and the
                                    creation time if the log file was
                                    created by ibbackup --restore;
                                    when mysqld is first time started
                                    on the restored database, it can
                                    print helpful info for the user */
#define LOG_FILE_ARCH_COMPLETED OS_FILE_LOG_BLOCK_SIZE
                                    /* this 4-byte field is TRUE when
                                    the writing of an archived log file
                                    has been completed; this field is
                                    only defined in an archived log file */
#define LOG_FILE_END_LSN        (OS_FILE_LOG_BLOCK_SIZE + 4)
                                    /* lsn where the archived log file
                                    at least extends: actually the
                                    archived log file may extend to a
                                    later lsn, as long as it is within the
                                    same log block as this lsn; this field
                                    is defined only when an archived log
                                    file has been completely written */
#define LOG_CHECKPOINT_1        OS_FILE_LOG_BLOCK_SIZE
                                    /* first checkpoint field in the log
                                    header; we write alternately to the
                                    checkpoint fields when we make new
                                    checkpoints; this field is only defined
                                    in the first log file of a log group */
#define LOG_CHECKPOINT_2        (3 * OS_FILE_LOG_BLOCK_SIZE)
                                    /* second checkpoint field in the log
                                    header */
#define LOG_FILE_HDR_SIZE       (4 * OS_FILE_LOG_BLOCK_SIZE)
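To make the layout concrete, the offsets above can be resolved to byte positions. The following is only a sketch, not InnoDB code: it assumes OS_FILE_LOG_BLOCK_SIZE is 512, and the helpers checkpoint_slot_offset and first_log_block_offset are hypothetical names. The parity rule used to alternate between the two checkpoint slots is an assumption that illustrates the "write alternately" comment.

```c
#include <assert.h>
#include <stdint.h>

/* Values mirror the macros quoted above; 512 is the assumed block size. */
#define OS_FILE_LOG_BLOCK_SIZE 512u
#define LOG_CHECKPOINT_1       OS_FILE_LOG_BLOCK_SIZE
#define LOG_CHECKPOINT_2       (3u * OS_FILE_LOG_BLOCK_SIZE)
#define LOG_FILE_HDR_SIZE      (4u * OS_FILE_LOG_BLOCK_SIZE)

/* Hypothetical helper: choose the header slot for a checkpoint so that
   consecutive checkpoints alternate between the two slots. Using the
   parity of the checkpoint number is an assumption for illustration. */
static uint32_t checkpoint_slot_offset(uint64_t checkpoint_no)
{
    return (checkpoint_no & 1u) == 0 ? LOG_CHECKPOINT_1 : LOG_CHECKPOINT_2;
}

/* Hypothetical helper: file offset of the first log block, which
   starts right after the four-block file header. */
static uint32_t first_log_block_offset(void)
{
    return LOG_FILE_HDR_SIZE;
}
```

With 512-byte blocks the two checkpoint slots sit at byte 512 and byte 1536 of the first log file, and log blocks start at byte 2048.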
The log file header occupies 4 * OS_FILE_LOG_BLOCK_SIZE bytes in total. Briefly, the main fields are:
1. LOG_GROUP_ID: the log group this file belongs to; 4 bytes, currently always 0.
2. LOG_FILE_START_LSN: the LSN of the first data recorded in this log file; 8 bytes.
3. LOG_FILE_WAS_CREATED_BY_HOT_BACKUP: a 32-byte field filled in by a backup program; for example, xtrabackup records "xtrabackup backup_time" in its xtrabackup_logfile file.
4. LOG_CHECKPOINT_1 / LOG_CHECKPOINT_2: two fields that record InnoDB checkpoint information, starting at the second and fourth blocks of the file header respectively. Only the first log file of the log group uses them. One more remark: InnoDB has to update these fields at every checkpoint (the code comment above says they are written alternately), so redo log writes are not strictly sequential.
Each log file contains many log records, and they are written to the file in units of OS_FILE_LOG_BLOCK_SIZE (512 bytes by default). Each record has its own LSN (log sequence number, the number of bytes written to the log from its creation up to that record). Each log block consists of a header, a trailer, and a set of log records.
First, the log block header. Its first 4 bytes hold the log block number, which identifies the block; it is computed from the LSN by the function log_block_convert_lsn_to_no(). The next 2 bytes give the number of bytes used in the block. The following 2 bytes give the offset within the block at which the first new MTR log record starts; this offset is needed because one block can contain log records from more than one MTR. The last 4 bytes hold the checkpoint number of the block.
The block trailer occupies 4 bytes and holds the checksum of the log block, which is used to verify its correctness. MySQL 5.6 provides several checksum algorithms, which are not discussed here. The comments in the code below explain the meaning of each field in the header and trailer.
/* Offsets of a log block header */
#define LOG_BLOCK_HDR_NO        0   /* block number which must be > 0 and
                                    is allowed to wrap around at 2G; the
                                    highest bit is set to 1 if this is the
                                    first log block in a log flush write
                                    segment */
#define LOG_BLOCK_FLUSH_BIT_MASK 0x80000000UL
                                    /* mask used to get the highest bit in
                                    the preceding field */
#define LOG_BLOCK_HDR_DATA_LEN  4   /* number of bytes of log written to
                                    this block */
#define LOG_BLOCK_FIRST_REC_GROUP 6 /* offset of the first start of an
                                    mtr log record group in this log block,
                                    0 if none; if the value is the same
                                    as LOG_BLOCK_HDR_DATA_LEN, it means
                                    that the first rec group has not yet
                                    been catenated to this log block, but
                                    if it will, it will start at this
                                    offset; an archive recovery can
                                    start parsing the log records starting
                                    from this offset in this log block,
                                    if value not 0 */
#define LOG_BLOCK_CHECKPOINT_NO 8   /* 4 lower bytes of the value of
                                    log_sys->next_checkpoint_no when the
                                    log block was last written to: if the
                                    block has not yet been written full,
                                    this value is only updated before a
                                    log buffer flush */
#define LOG_BLOCK_HDR_SIZE      12  /* size of the log block header in
                                    bytes */

/* Offsets of a log block trailer from the end of the block */
#define LOG_BLOCK_CHECKSUM      4   /* 4 byte checksum of the log block
                                    contents; in InnoDB versions
                                    < 3.23.52 this did not contain the
                                    checksum but the same value as
                                    .._HDR_NO */
#define LOG_BLOCK_TRL_SIZE      4   /* trailer size in bytes */
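As an illustration of the block header layout, the sketch below decodes the header fields from a raw 512-byte block and derives a block number from an LSN. It is not InnoDB code: struct block_hdr, parse_block_hdr, and lsn_to_block_no are hypothetical names. It assumes integers are stored high byte first, and the formula in lsn_to_block_no is how I recall log_block_convert_lsn_to_no() in the 5.6 source, so treat it as an assumption and check it against your tree.

```c
#include <assert.h>
#include <stdint.h>

#define OS_FILE_LOG_BLOCK_SIZE     512u
#define LOG_BLOCK_HDR_NO           0
#define LOG_BLOCK_FLUSH_BIT_MASK   0x80000000u
#define LOG_BLOCK_HDR_DATA_LEN     4
#define LOG_BLOCK_FIRST_REC_GROUP  6
#define LOG_BLOCK_CHECKPOINT_NO    8

/* Hypothetical struct holding the decoded header fields. */
struct block_hdr {
    uint32_t block_no;        /* block number without the flush bit */
    int      flush_bit;       /* 1 if first block of a flush write segment */
    uint16_t data_len;        /* bytes of log written to this block */
    uint16_t first_rec_group; /* offset of first MTR record group, 0 if none */
    uint32_t checkpoint_no;   /* low 4 bytes of the checkpoint number */
};

/* Read big-endian (high byte first) integers from a byte buffer. */
static uint16_t read_be16(const unsigned char *p)
{
    return (uint16_t)(((uint32_t)p[0] << 8) | p[1]);
}

static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Decode the 12-byte header at the start of a log block. */
static struct block_hdr parse_block_hdr(const unsigned char *block)
{
    struct block_hdr h;
    uint32_t raw = read_be32(block + LOG_BLOCK_HDR_NO);

    h.block_no        = raw & ~LOG_BLOCK_FLUSH_BIT_MASK;
    h.flush_bit       = (raw & LOG_BLOCK_FLUSH_BIT_MASK) != 0;
    h.data_len        = read_be16(block + LOG_BLOCK_HDR_DATA_LEN);
    h.first_rec_group = read_be16(block + LOG_BLOCK_FIRST_REC_GROUP);
    h.checkpoint_no   = read_be32(block + LOG_BLOCK_CHECKPOINT_NO);
    return h;
}

/* Assumed formula: block index within the LSN space, masked so that it
   wraps around, plus 1 so that the block number is always > 0. */
static uint32_t lsn_to_block_no(uint64_t lsn)
{
    return (uint32_t)((lsn / OS_FILE_LOG_BLOCK_SIZE) & 0x3FFFFFFFu) + 1;
}
```

Because the block number is derived from the LSN, recovery can check that a block read from the file really belongs at the position being scanned.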
Log Record Generation
Having described the structure of the log file and the log block, the next step is to describe how log records are generated inside InnoDB, and their "life cycle" from memory until they are finally written to disk. This involves two memory buffers and internal structures such as mtr and log_sys, introduced briefly below. First, log_sys. log_sys is a global in-memory InnoDB structure (the struct is named log_t, the global object log_sys). It maintains a global memory area called the log buffer (log_sys->buf), along with a number of LSN values and other state describing the log. All of its internal areas are allocated, and its variables initialized, in the log_init function. The log_t structure is very large and is not reproduced here; see "storage/innobase/include/log0log.h: struct log_t". The more important fields are described below:
log_sys->lsn | The LSN that the next log record to be generated will use
log_sys->flushed_to_disk_lsn | The redo log file has been flushed up to this LSN; log records with a smaller LSN are safely on disk
log_sys->write_lsn | The target LSN of the write operation currently in progress
log_sys->current_flush_lsn | The target LSN of the write + flush operation currently in progress; generally equal to log_sys->write_lsn
log_sys->buf | The global in-memory log buffer, as distinct from each MTR's own local buffer
log_sys->buf_size | Size of log_sys->buf
log_sys->buf_free | The offset in the buffer at which the next write starts (the first free position)
log_sys->buf_next_to_write | The offset in the buffer of the first data not yet written to the log file; the next write + flush operation starts at this offset
log_sys->max_buf_free | Determines when a flush is triggered: a flush is needed once log_sys->buf_free exceeds this value; see the log_check_margins function
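The relationship between these fields can be sketched with a toy log buffer. This is not the real log_t: struct toy_log and the toy_log_* functions are hypothetical, block headers and trailers are ignored, and the "write" only moves the bookkeeping offsets instead of touching a file.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_BUF_SIZE 64u

/* Hypothetical, heavily simplified stand-in for log_t. */
struct toy_log {
    unsigned char buf[TOY_BUF_SIZE];
    uint64_t lsn;                 /* LSN the next record will get */
    uint64_t flushed_to_disk_lsn; /* everything below this is durable */
    uint32_t buf_free;            /* first free offset in buf */
    uint32_t buf_next_to_write;   /* first offset not yet written to file */
    uint32_t max_buf_free;        /* flush wanted once buf_free exceeds this */
};

static void toy_log_init(struct toy_log *l, uint64_t start_lsn)
{
    memset(l, 0, sizeof *l);
    l->lsn = start_lsn;
    l->flushed_to_disk_lsn = start_lsn;
    l->max_buf_free = TOY_BUF_SIZE / 2;
}

/* Append len bytes to the buffer; returns the LSN just past the data. */
static uint64_t toy_log_append(struct toy_log *l, const void *data,
                               uint32_t len)
{
    assert(l->buf_free + len <= TOY_BUF_SIZE);
    memcpy(l->buf + l->buf_free, data, len);
    l->buf_free += len;
    l->lsn += len;
    return l->lsn;
}

/* Mirrors the max_buf_free rule described in the table above. */
static int toy_log_needs_flush(const struct toy_log *l)
{
    return l->buf_free > l->max_buf_free;
}

/* Pretend to write and flush buf[buf_next_to_write, buf_free). A real
   implementation would write those bytes to the log file and sync it;
   this toy never reclaims buffer space afterwards. */
static void toy_log_write_and_flush(struct toy_log *l)
{
    l->buf_next_to_write = l->buf_free;
    l->flushed_to_disk_lsn = l->lsn;
}
```

In this model the bytes between buf_next_to_write and buf_free are exactly the log records that exist only in memory, which is the data at risk if the server crashes.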
The LSN is the link between dirty pages, redo log records, and the redo log file. An LSN is assigned when a redo log record is copied into the in-memory log buffer, and every page modification produces a log record, so every database page has an associated LSN, which is recorded in the page header. To satisfy WAL (write-ahead logging), a dirty page may be flushed only after the log records up to its associated LSN have been written to the log file. Next, the MTR. MTR is short for mini-transaction. The corresponding structure in the code is mtr_t, which contains a local buffer that collects a group of log records and writes them to the log buffer in one batch. The structure of mtr_t is as follows:
/* Mini-transaction handle and buffer */
struct mtr_t{
#ifdef UNIV_DEBUG
    ulint       state;  /*!< MTR_ACTIVE, MTR_COMMITTING, MTR_COMMITTED */
#endif
    dyn_array_t memo;   /*!< memo stack for locks etc. */
    dyn_array_t log;    /*!< mini-transaction log */
    unsigned    inside_ibuf:1;
                        /*!< TRUE if inside ibuf changes */
    unsigned    modifications:1;
                        /*!< TRUE if the mini-transaction
                        modified buffer pool pages */
    unsigned    made_dirty:1;
                        /*!< TRUE if mtr has made at least
                        one buffer pool page dirty */
    ulint       n_log_recs;
                        /* count of how many page initial log records
                        have been written to the mtr log */
    ulint       n_freed_pages;
                        /* number of pages that have been freed in
                        this mini-transaction */
    ulint       log_mode; /* specifies which operations should be
                        logged; default value MTR_LOG_ALL */
    lsn_t       start_lsn;/* start lsn of the possible log entry for
                        this mtr */
    lsn_t       end_lsn;/* end lsn of the possible log entry for
                        this mtr */
#ifdef UNIV_DEBUG
    ulint       magic_n;
#endif /* UNIV_DEBUG */
};
mtr_t::log is the MTR's local buffer for log records. mtr_t::memo holds the list of pages dirtied by the operations of the MTR; they are added to the flush list when mtr_commit runs (see the mtr_memo_pop_all() function). A typical use of an MTR is:
1. Create an object of type mtr_t.
2. Call mtr_start, which initializes the mtr_t fields, including the local buffer.
3. While modifying pages in the BP, call mlog_write_ulint or similar functions to generate redo log records, which are saved in the local buffer.
4. Call mtr_commit, which copies the redo log records from the local buffer into the global log_sys->buf and adds the dirty pages to the flush list for later flushing.
mtr_commit calls mtr_log_reserve_and_write, which in turn calls log_write_low to perform the copy described above. When needed, this function starts a new log block in log_sys->buf, fills in the header and trailer, and computes the checksum.
To guarantee atomicity and durability (of the ACID properties), the redo log should in theory be safely on disk by the time a transaction commits. In MySQL, when and how the in-memory log_sys->buf is written to the redo log file on disk depends closely on the innodb_flush_log_at_trx_commit setting. DBAs and MySQL users are already familiar with this parameter, so here we go straight to how the log subsystem behaves for each value.
innodb_flush_log_at_trx_commit=1/2: the redo log is written at every transaction commit. The difference is that 1 means write + flush, while 2 means write only, with the flush performed periodically by a dedicated thread (roughly once per second). The function that performs the write is log_group_write_buf, which is called from log_write_up_to. A typical call stack is as follows:
(trx_commit_in_memory() / trx_commit_complete_for_mysql() / trx_prepare() etc.) -> trx_flush_log_if_needed() -> trx_flush_log_if_needed_low() -> log_write_up_to() -> log_group_write_buf()
log_group_write_buf then calls into InnoDB's own low-level I/O layer, whose implementation is complex and is not covered here. innodb_flush_log_at_trx_commit=0: transaction commit no longer calls the functions that write the redo log; the writing is done by the master thread instead. A typical call stack is as follows:
srv_master_thread() -> (srv_master_do_active_tasks() / srv_master_do_idle_tasks() / srv_master_do_shutdown_tasks()) -> srv_sync_log_buffer_in_background() -> log_buffer_sync_in_background() -> log_write_up_to() -> ...
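The three settings can be summarized as a small decision function. This is only a sketch of the behavior described above, not the code in trx_flush_log_if_needed_low(): the enum log_action and the function action_at_commit are hypothetical names, and the fallback for out-of-range values is an arbitrary choice made for the sketch.

```c
#include <assert.h>

/* What happens to the redo log at transaction commit (hypothetical enum). */
enum log_action {
    LOG_NONE,            /* nothing at commit; the master thread writes later */
    LOG_WRITE,           /* write to the log file, flush later in background */
    LOG_WRITE_AND_FLUSH  /* write to the log file and flush it to disk */
};

/* Hypothetical helper mapping innodb_flush_log_at_trx_commit to an action. */
static enum log_action action_at_commit(int flush_log_at_trx_commit)
{
    switch (flush_log_at_trx_commit) {
    case 0:
        return LOG_NONE;
    case 1:
        return LOG_WRITE_AND_FLUSH;
    case 2:
        return LOG_WRITE;
    default:
        /* Sketch-only fallback: treat unknown values as the safest setting. */
        return LOG_WRITE_AND_FLUSH;
    }
}
```

Only the value 1 makes the log durable on disk at commit time; with 0 and 2, a crash can lose the changes committed since the last background flush.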
Apart from this parameter, there are other scenarios in which the redo log must be written to the log file. A few examples:
1) To honor write-ahead logging (WAL), the redo log for a dirty page must be on disk before the page is flushed, so log_write_up_to is called.
2) To recycle log file space, a checkpoint (synchronous or asynchronous) is needed when the log files run short of space; it is performed by calling log_checkpoint, which also writes out the log. Checkpoints can hurt database performance considerably, which is the main reason the log files should not be set too small.
3) Some administrative operations, such as shutting down the database, require the redo log to be written out.
Here is a brief summary of the "life cycle" of a log record:
1. A redo log record is first generated by an MTR and stored in the MTR's local buffer. The record must contain all the information the recovery phase needs, and applying it must be idempotent.
2. When mtr_commit is called, the redo log record is copied into the global in-memory log buffer.
3. When required (more buffer space needed? transaction commit?), the log buffer is written (+ flushed) to the redo log file on disk, at which point the redo log is safely stored.
4. When mtr_commit runs, it assigns an LSN to each log record, which determines the record's position in the log file.
5. The LSN is also the link between the redo log and dirty pages. WAL requires the redo log to be on disk before the dirty page is flushed; once the page associated with an LSN has been written to disk, the space occupied by the corresponding log records in the redo log file can be recycled.
6. In the recovery phase, the persisted redo log is used to recover the database. The next section describes the role the redo log plays there.
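The first part of this life cycle, collecting records in a local buffer and copying them to the global buffer at commit, can be sketched as follows. This is a toy model, not the InnoDB implementation: struct toy_mtr, struct toy_log_sys, and the toy_* functions are hypothetical names, the buffers have fixed sizes, and block headers, trailers, and latching are all ignored.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_LOG_BUF_SIZE 256u
#define TOY_MTR_BUF_SIZE 64u

/* Hypothetical stand-in for the global log_sys. */
struct toy_log_sys {
    unsigned char buf[TOY_LOG_BUF_SIZE];
    uint32_t buf_free; /* first free offset in buf */
    uint64_t lsn;      /* LSN the next record will get */
};

/* Hypothetical stand-in for mtr_t with only the log-related fields. */
struct toy_mtr {
    unsigned char log[TOY_MTR_BUF_SIZE]; /* local buffer of log records */
    uint32_t len;
    uint64_t start_lsn;
    uint64_t end_lsn;
};

/* Step 2: initialize the mini-transaction, including its local buffer. */
static void toy_mtr_start(struct toy_mtr *m)
{
    memset(m, 0, sizeof *m);
}

/* Step 3: generate a log record; it only reaches the local buffer. */
static void toy_mlog_write(struct toy_mtr *m, const void *rec, uint32_t len)
{
    assert(m->len + len <= TOY_MTR_BUF_SIZE);
    memcpy(m->log + m->len, rec, len);
    m->len += len;
}

/* Step 4: copy the whole local buffer into the global log buffer in one
   batch and record the LSN range that the records were assigned. */
static void toy_mtr_commit(struct toy_mtr *m, struct toy_log_sys *sys)
{
    assert(sys->buf_free + m->len <= TOY_LOG_BUF_SIZE);
    m->start_lsn = sys->lsn;
    memcpy(sys->buf + sys->buf_free, m->log, m->len);
    sys->buf_free += m->len;
    sys->lsn += m->len;
    m->end_lsn = sys->lsn;
}
```

The global buffer stays untouched until toy_mtr_commit runs, so the records of one mini-transaction always appear in the log as one contiguous group.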
Log Recovery
The entry point for InnoDB recovery is innobase_start_or_create_for_mysql, which is called from innobase_init when MySQL starts. In the source of this function we can see the following two calls:
1. recv_recovery_from_checkpoint_start
2. recv_recovery_from_checkpoint_finish
The code comments stress that the database always attempts recovery at startup, as part of the normal startup code path. The main recovery work is done in the first function; the second does the cleanup. The comment on the first function states clearly what it does:
/** Wrapper for recv_recovery_from_checkpoint_start_func().
Recovers from a checkpoint. When this function returns, the database is able
to start processing of new user transactions, but the function
recv_recovery_from_checkpoint_finish should be called later to complete
the recovery and free the resources used in it.
@param type in: LOG_CHECKPOINT or LOG_ARCHIVE
@param lim  in: recover up to this log sequence number if possible
@param min  in: minimum flushed log sequence number from data files
@param max  in: maximum flushed log sequence number from data files
@return error code or DB_SUCCESS */
# define recv_recovery_from_checkpoint_start(type,lim,min,max) \
    recv_recovery_from_checkpoint_start_func(type,lim,min,max)
Corresponding to the log_t structure, the recovery phase has a structure called recv_sys_t, which is initialized in recv_recovery_from_checkpoint_start by the two functions recv_sys_create and recv_sys_init. recv_sys_t also has several LSN-related fields, described here:
recv_sys->limit_lsn | The maximum LSN up to which recovery should proceed; here it is LSN_MAX (the maximum value of uint64_t)
recv_sys->parse_start_lsn | The LSN at which log parsing starts during recovery; equal to the LSN of the last checkpoint
recv_sys->scanned_lsn | The LSN scanned up to so far
recv_sys->recovered_lsn | The LSN recovered up to so far; less than or equal to recv_sys->scanned_lsn
parse_start_lsn is the starting point of recovery. It is obtained by the recv_find_max_checkpoint function, which reads the LOG_CHECKPOINT_1 / LOG_CHECKPOINT_2 fields of the log file. After obtaining the start LSN, recv_recovery_from_checkpoint_start calls recv_group_scan_log_recs to read and parse the log records. Let's focus on the recv_group_scan_log_recs function:
/*******************************************************//**
Scans log from a buffer and stores new log data to the parsing buffer. Parses
and hashes the log records if new data found. */
static
void
recv_group_scan_log_recs(
/*=====================*/
    log_group_t*    group,          /*!< in: log group */
    lsn_t*          contiguous_lsn, /*!< in/out: it is known that all log
                                    groups contain contiguous log data up
                                    to this lsn */
    lsn_t*          group_scanned_lsn)/*!< out: scanning succeeded up to
                                    this lsn */
/* ... (lines omitted in the original excerpt) ... */
    while (!finished) {
        end_lsn = start_lsn + RECV_SCAN_SIZE;

        log_group_read_log_seg(LOG_RECOVER, log_sys->buf,
                               group, start_lsn, end_lsn);

        finished = recv_scan_log_recs(
            (buf_pool_get_n_pages()
            - (recv_n_pool_free_frames * srv_buf_pool_instances))
            * UNIV_PAGE_SIZE,
            TRUE, log_sys->buf, RECV_SCAN_SIZE,
            start_lsn, contiguous_lsn, group_scanned_lsn);
        start_lsn = end_lsn;
    }
The body of this function is a while loop. log_group_read_log_seg first reads log records into a memory buffer (here log_sys->buf), and then recv_scan_log_recs is called to parse them. Parsing verifies the checksum of each log block and checks that the block number matches the LSN. The parsed results are stored in the hash table recv_sys->addr_hash. The key of this hash table is computed from the space ID and page number, and the value is the list of parsed log records to apply to that page; the details are not expanded here. After these steps, recv_apply_hashed_log_recs may be called from recv_group_scan_log_recs or from recv_recovery_from_checkpoint_start. It applies the log records in addr_hash to the specific pages, calling recv_recover_page to do the actual page recovery; there, a log record is applied only if the page's LSN is smaller than the LSN of the log record.
/** Wrapper for recv_recover_page_func().
Applies the hashed log records to the page, if the page lsn is less than the
lsn of a log record. This can be called when a buffer page has just been
read in, or also for a page already in the buffer pool.
@param jri   in: TRUE if just read in (the i/o handler calls this for
a freshly read page)
@param block in/out: the buffer block
*/
# define recv_recover_page(jri, block) recv_recover_page_func(jri, block)
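The LSN comparison is what makes recovery idempotent, and a toy model shows why. The sketch below is not recv_recover_page_func(): struct toy_page, struct toy_rec, and toy_recover_page are hypothetical, a "page" is a single integer, and each record carries a single LSN, which simplifies the real start and end LSNs of a record.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical page: one value plus the LSN of its latest change. */
struct toy_page {
    uint64_t lsn;
    int      value;
};

/* Hypothetical redo record: "set the page value to new_value". */
struct toy_rec {
    uint64_t lsn;
    int      new_value;
};

/* Apply the records in log order, skipping every record whose change
   is already reflected in the page. Returns the number applied. */
static size_t toy_recover_page(struct toy_page *page,
                               const struct toy_rec *recs, size_t n)
{
    size_t applied = 0;

    for (size_t i = 0; i < n; i++) {
        if (page->lsn < recs[i].lsn) {
            page->value = recs[i].new_value;
            page->lsn = recs[i].lsn;
            applied++;
        }
    }
    return applied;
}
```

Running the function a second time over the same records applies nothing, because the page LSN has already caught up with the log. That is why a crash in the middle of recovery is harmless.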
That is the whole page recovery process. Appendix: some questions and answers; further redo log questions will be recorded here later.
1. Q: What is the relationship between log_file, log_block, and log_record?
A: A log_file consists of a sequence of log blocks, each of a fixed size. Apart from the header and trailer bytes, a log block holds log records.
2. Q: Does every commit produce a new log_block?
A: Not necessarily. Writing log blocks is driven by mtr_commit, not by transaction commit. It also depends on the size of the log records: if they fit without crossing a block boundary, writing continues in the current log block.
3. Q: What is the structure of a log_record?
A: There are many record types and I have not studied them in detail; see the brief description in Dr. He Dengcheng's diagram mentioned at the end.
4. Q: Does each block need to store the offset or block number of the next block, or are blocks simply laid out in order?
A: Blocks are fixed-size and written sequentially.
5. Q: How do you know whether a block is complete by itself or depends on another block?
A: The block header has a 2-byte field that records the position of the first MTR that starts in this block; if this value is 0, the block belongs to the same MTR as the previous block.
6. Q: Does one transaction involve multiple mtr_commit calls?
A: Yes. The "M" in MTR stands for mini.
7. Q: Are these log blocks written to the log file at commit?
A: mtr_commit writes to the log buffer; when the log buffer is written to the log file is not fixed.
8. Q: How is the LSN determined?
A: The LSN corresponds to the position in the log file, so it is determined when the record is written to the log buffer. There is currently only one log buffer, and positions in the log buffer are consistent with positions in the log file.
9. Q: What happens to the log at transaction commit?
A: The log may or may not be written; this is decided by the innodb_flush_log_at_trx_commit parameter.
10. Q: What are the two values LOG_CHECKPOINT_1 / LOG_CHECKPOINT_2 for?
A: They can be regarded as part of the log file header (they occupy the second and fourth blocks of the file header). Each checkpoint updates these fields (alternating between the two, according to the header comment). During later recovery, any page whose LSN is smaller than the checkpoint value is considered already written to disk and does not need to be recovered.
Finally, the blog of Dr. He Dengcheng of the NetEase Hangzhou Research Institute has a diagram of the log block structure. His picture is clearer than one I could draw, so it is not redrawn here; copyright of the diagram belongs to him.
Reposted from: MySQL Redo Log and Recovery Process Analysis