MySQL series: InnoDB source code analysis - redo log structure

Source: Internet
Author: User


The InnoDB engine builds a redo log system to ensure transaction durability. The redo log has two parts: the in-memory log buffer and the redo log files. The log buffer speeds up log writes, and the redo log files make the log data persistent. To achieve recoverability, safety, and persistence, the InnoDB redo log system introduces the following concepts: LSN, log block, log file group, checkpoint, and archive log. We analyze them one by one below.

1. LSN

The InnoDB redo log system defines the LSN, the log sequence number. In the engine the LSN is a value of type dulint_t, which is equivalent to uint64_t. dulint_t is defined as follows:

typedef struct dulint_struct {
    ulint high;    /* most significant 32 bits */
    ulint low;     /* least significant 32 bits */
} dulint_t;
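
The equivalence between the two-word struct and a 64-bit integer can be sketched as follows. This is a minimal illustration, not InnoDB code: the names dulint_sketch_t, dulint_to_u64, dulint_from_u64, and dulint_add are hypothetical, and fixed-width uint32_t fields are assumed in place of ulint.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for dulint_t, using fixed-width fields. */
typedef struct {
    uint32_t high;  /* most significant 32 bits */
    uint32_t low;   /* least significant 32 bits */
} dulint_sketch_t;

/* Combine the two halves into one 64-bit value. */
static uint64_t dulint_to_u64(dulint_sketch_t d)
{
    return ((uint64_t)d.high << 32) | d.low;
}

/* Split a 64-bit value into its two halves. */
static dulint_sketch_t dulint_from_u64(uint64_t v)
{
    dulint_sketch_t d = { (uint32_t)(v >> 32), (uint32_t)(v & 0xFFFFFFFFu) };

    assert(dulint_to_u64(d) == v);  /* round-trip sanity check */
    return d;
}

/* Add a byte count to an LSN; a carry out of low ends up in high. */
static dulint_sketch_t dulint_add(dulint_sketch_t d, uint32_t n)
{
    return dulint_from_u64(dulint_to_u64(d) + n);
}
```

Adding 0x20 to the value { high = 0, low = 0xFFFFFFF0 } carries into the upper word and yields { high = 1, low = 0x10 }, which is the behavior a plain uint64_t addition gives for free.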
The LSN is the total number of bytes the storage engine has written to the redo log system. This count includes the log bytes themselves plus the block header and block trailer bytes. The initial value of the LSN is LOG_START_LSN (8192). Each call to the log writing function increases the LSN by the length written. The relevant excerpt from log_write_low:

void log_write_low(byte* str, ulint str_len)
{
    log_t* log = log_sys;
    ...
part_loop:
    /* calculate part length */
    data_len = log->buf_free % OS_FILE_LOG_BLOCK_SIZE + str_len;
    ...
    /* copy the log content to the log buffer */
    ut_memcpy(log->buf + log->buf_free, str, len);
    str_len -= len;
    str = str + len;
    ...
    if (data_len == OS_FILE_LOG_BLOCK_SIZE - LOG_BLOCK_TRL_SIZE) {
        /* a whole block has been written */
        ...
        len += LOG_BLOCK_HDR_SIZE + LOG_BLOCK_TRL_SIZE;
        log->lsn = ut_dulint_add(log->lsn, len);
        ...
    } else {
        /* change lsn */
        log->lsn = ut_dulint_add(log->lsn, len);
    }
    ...
}

The LSN never decreases. It uniquely identifies a position in the log. LSNs appear in redo log writes, in checkpoints, and in page headers.

About log writing:

For example, suppose the current redo log LSN is 2048, which is exactly the length of 4 blocks, and InnoDB calls log_write_low to write a log record of 700 bytes. Because a single block can hold only 496 bytes of log data, the record needs two blocks. The new LSN is therefore 2048 + 700 + 2 * LOG_BLOCK_HDR_SIZE (12) + LOG_BLOCK_TRL_SIZE (4) = 2776.
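
The arithmetic above can be reproduced with a small sketch. It models only the LSN bookkeeping described in this article, not the real log_write_low: the function name lsn_after_write and the constant names are hypothetical, and the model assumes the starting LSN sits either on a block boundary or inside the data area of a block.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE 512u  /* size of one log block */
#define HDR_SIZE   12u   /* block header bytes, counted in the LSN */
#define TRL_SIZE   4u    /* block trailer bytes, counted in the LSN */

/* Returns the LSN after writing len bytes of log data starting at lsn. */
static uint64_t lsn_after_write(uint64_t lsn, uint64_t len)
{
    while (len > 0) {
        uint64_t pos = lsn % BLOCK_SIZE;

        if (pos == 0) {
            /* a fresh block starts here: count its header first */
            lsn += HDR_SIZE;
            pos = HDR_SIZE;
        }
        /* assumption: the position is inside the data area of the block */
        assert(pos >= HDR_SIZE && pos < BLOCK_SIZE - TRL_SIZE);

        uint64_t room = BLOCK_SIZE - TRL_SIZE - pos;  /* data bytes left */
        uint64_t n = len < room ? len : room;

        lsn += n;
        len -= n;
        if (n == room) {
            lsn += TRL_SIZE;  /* the block is full: count its trailer */
        }
    }
    return lsn;
}
```

With the numbers from the example, lsn_after_write(2048, 700) evaluates to 2776: one header, 496 data bytes, and one trailer for the first block, then one header and the remaining 204 data bytes in the second block.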

About checkpoint and log recovery:

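The LSN in the fil_header of a page is the LSN at which the page was last flushed. Suppose the database contains PAGE1 with LSN = 1024 and PAGE2 with LSN = 2048, and on restart the system finds that the last checkpoint LSN is 1024. PAGE1 is not newer than the checkpoint, so it is not redone. PAGE2 is newer than the checkpoint, so it is redone. As a rule of thumb, pages whose LSN is not greater than the checkpoint LSN need no redo, and pages whose LSN is greater than the checkpoint LSN must be redone. More precisely, recovery replays the log from the checkpoint LSN onward and applies a log record to a page only if the LSN stored in the page shows that the change has not reached the page yet.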

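2. Log block

InnoDB defines the concept of a log block in the log system. A log block is a 512-byte unit that consists of a block header, log data, and a block checksum. With the sizes used in this article, the header takes LOG_BLOCK_HDR_SIZE (12) bytes, the log data takes up to 496 bytes, and the trailer that holds the checksum takes LOG_BLOCK_TRL_SIZE (4) bytes.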

The highest bit of the block no field indicates whether the block has been flushed to disk. The block no can be derived from the LSN: it is the number of whole 512-byte units in the LSN plus one, that is, no = lsn / 512 + 1. Why add 1? The integer division rounds down, so the block counted by lsn / 512 starts at an LSN that is not greater than the LSN passed in. Adding 1 gives the number of the block that contains the LSN. In effect it is the array index of the block. The checksum is a number accumulated over every byte from the start of the block header up to, but not including, the last four bytes of the block. The code is as follows:

/* checksum over the block, excluding the 4-byte trailer */
ulint sum = 1;
ulint sh = 0;
ulint i;

for (i = 0; i < OS_FILE_LOG_BLOCK_SIZE - LOG_BLOCK_TRL_SIZE; i++) {
    sum = sum & 0x7FFFFFFF;
    sum += (((ulint)(*(block + i))) << sh) + (ulint)(*(block + i));
    sh++;
    if (sh > 24) {
        sh = 0;
    }
}
When the log is read back during recovery, InnoDB verifies the checksum of each loaded block to guard against corrupt data. Transaction logs are written in units of blocks. If a transaction's log is smaller than 496 bytes, it shares a block with the logs of other transactions. If it is larger than 496 bytes, it is split into 496-byte pieces across consecutive blocks. For example, with T1 = 700 bytes and T2 = 100 bytes, the first block holds the first 496 bytes of T1, and the second block holds the remaining 204 bytes of T1 followed by the 100 bytes of T2.
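
The way transaction logs are packed into blocks can be sketched as follows. This is an illustration under the article's assumption of 496 data bytes per block, not code from InnoDB; piece_t and split_into_blocks are hypothetical names, and block headers and trailers are left out so that only the data areas are modeled.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_PAYLOAD 496u  /* 512 minus 12-byte header and 4-byte trailer */

/* One contiguous piece of a record inside one block. */
typedef struct {
    size_t record;  /* index of the transaction log record */
    size_t block;   /* index of the block that holds this piece */
    size_t offset;  /* offset of the piece inside the block data area */
    size_t len;     /* number of bytes in the piece */
} piece_t;

/* Packs records back to back into blocks; returns the number of pieces. */
static size_t split_into_blocks(const size_t *lens, size_t n_records,
                                piece_t *out, size_t max_pieces)
{
    size_t n_pieces = 0;
    size_t pos = 0;  /* running position in the stream of data bytes */

    for (size_t r = 0; r < n_records; r++) {
        size_t left = lens[r];

        while (left > 0) {
            size_t offset = pos % BLOCK_PAYLOAD;
            size_t room = BLOCK_PAYLOAD - offset;
            size_t take = left < room ? left : room;

            assert(n_pieces < max_pieces);  /* caller must size out[] */
            out[n_pieces].record = r;
            out[n_pieces].block = pos / BLOCK_PAYLOAD;
            out[n_pieces].offset = offset;
            out[n_pieces].len = take;
            n_pieces++;

            pos += take;
            left -= take;
        }
    }
    return n_pieces;
}
```

For lens = {700, 100} the function produces three pieces: 496 bytes of T1 in block 0, 204 bytes of T1 at the start of block 1, and the 100 bytes of T2 at offset 204 of block 1.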


3. Redo log structure and relationship graph

InnoDB implements the redo log with three layers of modules: the redo log buffer, the group files, and the archive files. They are described as follows:

Redo log buffer: the in-memory buffer for the redo log. Newly written logs go here first. Data in the redo log buffer must be flushed to disk to become persistent.

Group files: the redo log file group, which consists of several files of the same size (three in the examples in this article). The files are written in a circular fashion: when one log file is full, writing moves on to the next, and when all log files are full, writing wraps around and overwrites the first file. The InnoDB design supports multiple redo log groups.

Archive files: the archive log files are incremental backups of the redo log files. They are never overwritten, so earlier log information is retained.

[Figure: relationship between the redo log buffer, group files, and archive files]

3.1 Redo log group

InnoDB can support multiple redo log groups. The purpose is that if one log group is damaged, data can be recovered from another, parallel log group. In MySQL 5.6 the number of log groups is fixed at 1, and multiple groups are not allowed. According to Jiang Chengyao of NetEase, the InnoDB authors believe that the integrity of a log group can be guaranteed by the underlying storage hardware, such as RAID disks. The main functions of a redo log group are to manage writes to the files in the group, create checkpoints in the group, save checkpoint information, and manage the state of archived logs (archiving is performed only for the first group). A log group is defined as follows:

typedef struct log_group_struct {
    ulint   id;                         /* log group id */
    ulint   n_files;                    /* number of log files in the group */
    ulint   file_size;                  /* log file size, including the file header */
    ulint   space_id;                   /* id of the fil_space corresponding to the group */
    ulint   state;                      /* log group state: LOG_GROUP_OK or LOG_GROUP_CORRUPTED */
    dulint  lsn;                        /* log group lsn */
    dulint  lsn_offset;                 /* offset of the current lsn relative to the start of the files in the group */
    ulint   n_pending_writes;           /* number of fil_flush executions in this group */
    byte**  file_header_bufs;           /* file header buffers */
    byte**  archive_file_header_bufs;   /* buffers for archive file header information */
    ulint   archive_space_id;           /* space id of the archived redo log */
    ulint   archived_file_no;           /* number of the archived log file */
    ulint   archived_offset;            /* offset of the archived log */
    ulint   next_archived_file_no;      /* file number of the next archive */
    ulint   next_archived_offset;       /* offset of the next archive */
    dulint  scanned_lsn;
    byte*   checkpoint_buf;             /* buffer in which this log group stores checkpoint information */
    UT_LIST_NODE_T(log_group_t) log_groups;
} log_group_t;

The space_id in the structure above corresponds to a fil_space_t structure in fil0fil. One fil_space_t structure can manage multiple fil_node_t files.

3.1.1 LSN and intra-group offset

An important part of the log_group_t module is the conversion between an LSN and an offset within the group. When a group is created, its lsn and the corresponding lsn_offset are set. For example, suppose the group is initialized with lsn = 1024 and lsn_offset = 2048, the group consists of three files of 10240 bytes each, and LOG_FILE_HDR_SIZE = 2048. We want the intra-group offset that corresponds to buf lsn = 11240. Following the log_group_calc_lsn_offset function:

group_size = 3 * (10240 - LOG_FILE_HDR_SIZE) = 24576;
lsn offset = (buf_lsn - group_lsn) + log_group_calc_size_offset(lsn_offset) = (11240 - 1024) + 0 = 10216;
lsn_offset = log_group_calc_real_offset(lsn offset % group_size) = 10216 + 2 * LOG_FILE_HDR_SIZE = 14312;

The resulting offset includes the file headers. Here 10216 bytes of log data fill the 8192-byte data area of the first file and reach into the second file, so two headers are counted.
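
The conversion can be sketched in a few lines. This is a simplified model written for this example rather than the InnoDB source: group_sketch_t and the three function names are hypothetical, plain uint64_t replaces dulint, and it is an assumption of the sketch that the group capacity excludes the file headers.

```c
#include <assert.h>
#include <stdint.h>

#define LOG_FILE_HDR_SIZE 2048u

/* Hypothetical, reduced view of a log group. */
typedef struct {
    uint64_t lsn;         /* LSN recorded for the group */
    uint64_t lsn_offset;  /* offset (headers included) that matches lsn */
    uint64_t file_size;   /* size of each file, header included */
    uint64_t n_files;     /* number of files in the group */
} group_sketch_t;

/* Strip the file headers from an offset that includes them. */
static uint64_t calc_size_offset(uint64_t offset, const group_sketch_t *g)
{
    return offset - LOG_FILE_HDR_SIZE * (1 + offset / g->file_size);
}

/* Add the file headers back to a header-free offset. */
static uint64_t calc_real_offset(uint64_t offset, const group_sketch_t *g)
{
    return offset + LOG_FILE_HDR_SIZE
        * (1 + offset / (g->file_size - LOG_FILE_HDR_SIZE));
}

/* Map an LSN to an offset within the group, wrapping around circularly. */
static uint64_t calc_lsn_offset(uint64_t lsn, const group_sketch_t *g)
{
    assert(g->file_size > LOG_FILE_HDR_SIZE && g->n_files > 0);

    uint64_t capacity = (g->file_size - LOG_FILE_HDR_SIZE) * g->n_files;
    uint64_t base = calc_size_offset(g->lsn_offset, g);
    uint64_t diff;

    if (lsn >= g->lsn) {
        diff = lsn - g->lsn;
    } else {
        /* LSN lies before the group LSN: step backwards around the ring */
        diff = capacity - (g->lsn - lsn) % capacity;
    }
    return calc_real_offset((base + diff) % capacity, g);
}
```

For a group { lsn = 1024, lsn_offset = 2048, file_size = 10240, n_files = 3 }, calc_lsn_offset(11240, &g) returns 14312, the value derived in the example, and calc_lsn_offset(1024, &g) returns the initial offset 2048.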

3.1.2 file_header_bufs

file_header_bufs is an array of buffers. The array length equals the number of files in the group, and each buffer is 2048 bytes long. Each buffer holds the following fields:

log_group_id: corresponds to the id in the log_group_t structure.

file_start_lsn: the LSN that corresponds to the start of the data in the current file.

file_no: the number of the current file, which is generally used in the archive file header.

hot backup str: an empty string. If the file was created by hot backup, the file suffix ibackup is filled in.

file_end_lsn: the LSN that corresponds to the end of the data in the file, which is generally used in archive files.

3.2 checkpoint

A checkpoint is a marker in the log. After the database terminates abnormally, the redo log system reads the LSN from the checkpoint information and recovers the logs and pages after the checkpoint. How is a checkpoint generated? When the gap between the LSN most recently written to the log buffer and the LSN of the last checkpoint reaches a certain size, a checkpoint is created. Creating a checkpoint first writes the dirty table data in memory to disk, and then writes to disk the logs in the redo log buffer whose LSN is smaller than the LSN of the current checkpoint. The checkpoint information is stored in the checkpoint_buf of log_group_t, with the following fields:

LOG_CHECKPOINT_NO: the checkpoint number.

LOG_CHECKPOINT_LSN: the starting LSN of this checkpoint.

LOG_CHECKPOINT_OFFSET: the start offset of this checkpoint within the group files.

LOG_CHECKPOINT_LOG_BUF_SIZE: the size of the redo log buffer. The default value is 2 MB.

LOG_CHECKPOINT_ARCHIVED_LSN: the LSN up to which the log has been archived.

LOG_CHECKPOINT_GROUP_ARRAY: an array that holds, for each log group, the file number and offset at the time of archiving.

3.3 log_t

Redo log writing, data flushing, checkpoint creation, and archiving are all controlled by a single global object, log_sys. Its type is a large and complex structure, defined as follows:

typedef struct log_struct {
    byte        pad[64];                    /* padding so that the log_struct object gets its own cache line; this relates to the CPU L1 cache and data contention */
    dulint      lsn;                        /* log sequence number, in effect an offset into the log */
    ulint       buf_free;                   /* position in buf where the next write can go */
    mutex_t     mutex;                      /* mutex protecting the log */
    byte*       buf;                        /* log buffer */
    ulint       buf_size;                   /* log buffer length */
    ulint       max_buf_free;               /* recommended maximum value of buf_free after a log buffer flush; if exceeded, a flush to disk is forced */
    ulint       old_buf_free;               /* value of buf_free at the last write, used for debugging */
    dulint      old_lsn;                    /* lsn at the last write, used for debugging */
    ibool       check_flush_or_checkpoint;  /* flag: the log needs to be flushed to disk or a checkpoint needs to be made */
    ulint       buf_next_to_write;          /* buf offset at which the next write to disk starts */
    dulint      written_to_some_lsn;        /* lsn up to which the first group to finish has flushed */
    dulint      written_to_all_lsn;         /* lsn up to which the log has been recorded in the log files */
    dulint      flush_lsn;                  /* flush lsn */
    ulint       flush_end_offset;           /* buf_free at the last log file flush, that is, the end offset of the last flush */
    ulint       n_pending_writes;           /* number of pending fil_flush calls */
    os_event_t  no_flush_event;             /* signaled only after all fil_flush calls are complete, that is, when all groups have finished flushing */
    ibool       one_flushed;                /* set to TRUE after one log group has been flushed */
    os_event_t  one_flushed_event;          /* signaled when one group has completed its flush */
    ulint       n_log_ios;                  /* number of io operations on the log system */
    ulint       n_log_ios_old;              /* io operation count at the previous statistics printout */
    time_t      last_printout_time;
    ulint       max_modified_age_async;     /* threshold for asynchronous log file flushing */
    ulint       max_modified_age_sync;      /* threshold for synchronous log file flushing */
    ulint       adm_checkpoint_interval;
    ulint       max_checkpoint_age_async;   /* threshold for creating a checkpoint asynchronously */
    ulint       max_checkpoint_age;         /* threshold for forcing a checkpoint */
    dulint      next_checkpoint_no;
    dulint      last_checkpoint_lsn;
    dulint      next_checkpoint_lsn;
    ulint       n_pending_checkpoint_writes;
    rw_lock_t   checkpoint_lock;            /* rw-lock for checkpoints; a checkpoint holds this latch in exclusive mode */
    byte*       checkpoint_buf;             /* buffer that stores checkpoint information */
    ulint       archiving_state;
    dulint      archived_lsn;
    dulint      next_archived_lsn;
    ulint       archiving_phase;
    ulint       n_pending_archive_ios;
    rw_lock_t   archive_lock;
    ulint       archive_buf_size;
    byte*       archive_buf;
    os_event_t  archiving_on;
    ibool       online_backup_state;        /* whether an online backup is in progress */
    dulint      online_backup_lsn;          /* lsn at the time of the backup */
    UT_LIST_BASE_NODE_T(log_group_t) log_groups;
} log_t;
3.3.1 Relationship between the LSNs

Understanding how these LSNs relate to each other helps greatly in understanding how the whole redo log system operates. The LSNs are:

lsn: the LSN of the last log written by the current log system.

flush_lsn: the LSN at the end of the last flush of redo log buffer data to disk. It is the starting LSN of the next flush.

written_to_some_lsn: the starting LSN of the last log flush in a single log group.

written_to_all_lsn: the starting LSN of the last log flush across all log groups.

last_checkpoint_lsn: the starting LSN of the log data when the last checkpoint was created.

next_checkpoint_lsn: the starting LSN of the log data for the next checkpoint, obtained using log_buf_pool_get_oldest_modification.

archived_lsn: the LSN at which the last log archiving started.

next_archived_lsn

[Figure: relationship between the LSNs in log_t]


3.3.2 offset Analysis

log_t has several offsets, such as max_buf_free, buf_free, flush_end_offset, and buf_next_to_write. An offset differs from an LSN: an offset is an absolute position relative to the start of the redo log buffer, whereas an LSN is a sequence number for the entire log system.

max_buf_free: the offset that writes should not exceed. If it is exceeded, the redo log buffer is forcibly written to disk.

buf_free: the offset at which the next log can be written.

buf_next_to_write: the start offset of the next write of redo log buffer data to disk. After all disk I/O operations complete, its value equals flush_end_offset.

flush_end_offset: the offset of the end of the data being flushed to disk, equal to buf_free at the time the flush started. When flush_end_offset exceeds half of max_buf_free, the unwritten data is moved to the front of the redo log buffer, and both buf_free and buf_next_to_write are adjusted accordingly.
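
The buffer adjustment described for flush_end_offset can be sketched as follows. This is a minimal model rather than InnoDB code: log_buf_sketch_t and on_flush_complete are hypothetical names, and moving whole 512-byte blocks (start aligned down, end aligned up) is an assumption of the sketch, as is a buffer whose size is a multiple of 512.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 512u

/* Hypothetical, reduced view of the log buffer bookkeeping. */
typedef struct {
    unsigned char *buf;        /* log buffer */
    size_t buf_free;           /* next free offset in buf */
    size_t buf_next_to_write;  /* offset where the next disk write starts */
    size_t flush_end_offset;   /* end offset of the flush that completed */
    size_t max_buf_free;       /* recommended upper bound for buf_free */
} log_buf_sketch_t;

/* Called once all disk I/O of a flush has completed. */
static void on_flush_complete(log_buf_sketch_t *log)
{
    assert(log->flush_end_offset <= log->buf_free);
    log->buf_next_to_write = log->flush_end_offset;

    if (log->flush_end_offset > log->max_buf_free / 2) {
        /* keep whole blocks: align the start down and the end up */
        size_t move_start = log->flush_end_offset / BLOCK_SIZE * BLOCK_SIZE;
        size_t move_end = (log->buf_free + BLOCK_SIZE - 1)
            / BLOCK_SIZE * BLOCK_SIZE;

        /* move the unwritten tail to the front of the buffer */
        memmove(log->buf, log->buf + move_start, move_end - move_start);
        log->buf_free -= move_start;
        log->buf_next_to_write -= move_start;
    }
}
```

With max_buf_free = 2048, buf_free = 1500, and flush_end_offset = 1200, the flush end lies past the halfway mark, so the data from offset 1024 onward moves to the front and the offsets become buf_free = 476 and buf_next_to_write = 176.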

[Figure: relative positions of the offsets in the redo log buffer]


3.4 Memory structure diagram

[Figure: memory structure of the redo log system]


4. Log writing and log protection mechanisms

InnoDB has four kinds of log flushing behavior: asynchronous redo log buffer flushing, synchronous redo log buffer flushing, asynchronous checkpoint creation, and synchronous checkpoint creation. Flushing to disk consumes a great deal of disk I/O, so InnoDB applies a carefully designed set of policies to it.

 

4.1 redo log disk flushing options

The InnoDB engine has a global variable, srv_flush_log_at_trx_commit, which controls the flush policy, that is, when the fsync function is called. The variable takes three values, explained as follows:

0. The master thread has the redo log module call log_flush_to_disk to flush to disk once every second. The disadvantage is that if the database crashes, up to one second of logs and data can be lost.

1. At each transaction commit, the redo log is written and fsync is called to get it onto disk. The advantage is that every log reaches the disk, which greatly improves data reliability. The disadvantage is that each fsync call generates a large amount of disk I/O, which affects database performance.

2. At each transaction commit, the redo log is written only to the page cache of the log file. In this case, if the host crashes, the logs that have not yet been synced to disk are lost.
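
The three settings can be summarized as a small decision function. This is a sketch of the policy as described in this article, not InnoDB code: commit_action_t and action_at_commit are hypothetical names, and the once-per-second work of the master thread is outside the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* What happens to the redo log at the moment a transaction commits. */
typedef struct {
    bool write_at_commit;  /* copy the log buffer to the log file (page cache) */
    bool fsync_at_commit;  /* call fsync so that the log reaches the disk */
} commit_action_t;

/* Maps a srv_flush_log_at_trx_commit value (0, 1, or 2) to an action. */
static commit_action_t action_at_commit(int flush_log_at_trx_commit)
{
    commit_action_t a = { false, false };

    assert(flush_log_at_trx_commit >= 0 && flush_log_at_trx_commit <= 2);
    switch (flush_log_at_trx_commit) {
    case 0:
        /* nothing at commit; the master thread flushes every second */
        break;
    case 1:
        a.write_at_commit = true;
        a.fsync_at_commit = true;
        break;
    case 2:
        /* written to the page cache only; fsync is left for later */
        a.write_at_commit = true;
        break;
    }
    return a;
}
```

Only value 1 pays for an fsync on every commit, which is why it is the safest and the slowest of the three settings.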

4.2 log disk flushing Protection

Because the redo log is written to the files of a group in a circular fashion, logs that have not been flushed to disk or covered by a checkpoint in time could be overwritten, which must not happen. InnoDB therefore defines a log protection mechanism. The storage engine regularly calls the log_check_margins function to check it. A brief introduction follows:

Three variables are introduced: buf_age, checkpoint_age, and the log space size.

buf_age = lsn - oldest_lsn;

checkpoint_age = lsn - last_checkpoint_lsn;

log space size = number of bytes the redo log group can store (obtained through log_group_get_capacity);

When buf_age >= 7/8 of the log space size, the redo log system flushes the redo log buffer data to disk asynchronously. Because the flush is asynchronous, it does not block data operations.

When buf_age >= 15/16 of the log space size, the redo log system flushes the redo log buffer data to disk synchronously. In this case the fsync function is called and database operations are blocked.

When checkpoint_age >= 31/32 of the log space size, the log system creates a checkpoint asynchronously, and database operations are not blocked.

When checkpoint_age reaches the log space size, the log system creates a checkpoint synchronously. A large number of dirty pages in the tablespace and dirty pages in the log files are synced to disk, which generates a large amount of disk I/O. Database operations are blocked, and all database transactions are suspended.
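
The four thresholds can be condensed into one classification function. This sketch follows the fractions given in this article and is not the log_check_margins implementation: margin_action_t and check_margins are hypothetical names, and returning only the most severe action is a simplification made for the example.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    MARGIN_OK,                /* no protective action needed */
    MARGIN_ASYNC_FLUSH,       /* buf_age >= 7/8 of the log space */
    MARGIN_SYNC_FLUSH,        /* buf_age >= 15/16 of the log space */
    MARGIN_ASYNC_CHECKPOINT,  /* checkpoint_age >= 31/32 of the log space */
    MARGIN_SYNC_CHECKPOINT    /* checkpoint_age reached the log space size */
} margin_action_t;

/* Returns the most severe action that the current ages call for. */
static margin_action_t check_margins(uint64_t lsn, uint64_t oldest_lsn,
                                     uint64_t last_checkpoint_lsn,
                                     uint64_t capacity)
{
    assert(lsn >= oldest_lsn && lsn >= last_checkpoint_lsn);

    uint64_t buf_age = lsn - oldest_lsn;
    uint64_t checkpoint_age = lsn - last_checkpoint_lsn;

    /* test the most severe condition first */
    if (checkpoint_age >= capacity) {
        return MARGIN_SYNC_CHECKPOINT;
    }
    if (checkpoint_age >= capacity - capacity / 32) {
        return MARGIN_ASYNC_CHECKPOINT;
    }
    if (buf_age >= capacity - capacity / 16) {
        return MARGIN_SYNC_FLUSH;
    }
    if (buf_age >= capacity - capacity / 8) {
        return MARGIN_ASYNC_FLUSH;
    }
    return MARGIN_OK;
}
```

With a log space of 3200 bytes the thresholds fall at 2800, 3000, 3100, and 3200, so a buf_age of 2800 with a checkpoint_age of 2900 yields MARGIN_ASYNC_FLUSH, while a checkpoint_age of 3200 yields MARGIN_SYNC_CHECKPOINT.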

5. Summary

The InnoDB redo log system is quite complete and takes care of many details for the sake of data durability. Its efficiency directly affects MySQL's write efficiency, so it is worth understanding in depth and tuning, especially for workloads that flush large amounts of data. If the database processes transactions faster than the disk I/O can flush, checkpoints end up being created synchronously, and the database stalls while it flushes dirty pages. To avoid this, increase the I/O capacity or spread the I/O load across multiple disks. A storage medium with high read/write speed, such as SSD, is also worth considering.

 

 



