MySQL Series: InnoDB source analysis Redo Log structure

Last Update:2015-01-07 Source: Internet

Author: User



InnoDB implements a redo log system to guarantee transaction durability. The redo log consists of two parts: the in-memory log buffer (redo log buffer) and the redo log files. The reason for this design is straightforward: the log buffer speeds up log writes, and the redo log files persist the log data. To make the log recoverable, safe, and durable, InnoDB's redo log system introduces the following concepts: LSN, log block, log file group, checkpoint, and archive log. We analyze each of them below.

1. LSN

InnoDB's redo log system defines an LSN, the log sequence number. In the engine the LSN is a value of type dulint_t, equivalent to uint64_t. dulint_t is defined as follows:

    typedef struct dulint_struct {
        ulint high;   /* most significant 32 bits */
        ulint low;    /* least significant 32 bits */
    } dulint_t;

What the LSN really represents is the total number of bytes the storage engine has written to the redo log system. This total counts the log data bytes plus the block header size and block trailer size of each block. The initial value of the LSN is LOG_START_LSN (equal to 8192). The log write function advances the LSN by the length of each log write, as shown here:

    void log_write_low(byte* str, ulint str_len)
    {
        log_t* log = log_sys;
        ....
    part_loop:
        /* Calculate the part length */
        data_len = log->buf_free % OS_FILE_LOG_BLOCK_SIZE + str_len;
        ....
        /* Copy the log contents to the log buffer */
        ut_memcpy(log->buf + log->buf_free, str, len);
        str_len -= len;
        str = str + len;
        ...
        if (data_len == OS_FILE_LOG_BLOCK_SIZE - LOG_BLOCK_TRL_SIZE) {
            /* A complete block has been written */
            ...
            len += LOG_BLOCK_HDR_SIZE + LOG_BLOCK_TRL_SIZE;
            log->lsn = ut_dulint_add(log->lsn, len);
            ...
        } else {
            /* Advance the LSN */
            log->lsn = ut_dulint_add(log->lsn, len);
        }
        ...
    }

The LSN never decreases; it uniquely identifies a position in the log. LSNs appear in redo log writes, in checkpoint creation, and in page headers.

About log writes:

For example, suppose the current redo log LSN = 2048 and InnoDB calls log_write_low to write a log record of length 700. An LSN of 2048 is exactly 4 blocks long, so storing 700 bytes of log requires two blocks (a single block can hold only 496 bytes of log data). The new LSN is therefore 2048 + 700 + 2 * LOG_BLOCK_HDR_SIZE (12) + LOG_BLOCK_TRL_SIZE (4) = 2776.
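The arithmetic of this example can be checked with a small model. This is only a sketch under stated assumptions, not InnoDB code: the helper name lsn_after_write is hypothetical, the constants assume a 512-byte block with a 12-byte header and a 4-byte trailer, and the model follows this article's convention that an LSN sitting exactly on a block boundary still has that block's header ahead of it.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE 512u  /* assumed value of OS_FILE_LOG_BLOCK_SIZE */
#define HDR_SIZE    12u  /* assumed value of LOG_BLOCK_HDR_SIZE */
#define TRL_SIZE     4u  /* assumed value of LOG_BLOCK_TRL_SIZE */

/* Hypothetical helper: the LSN reached after appending len bytes of
   log data starting at lsn. */
static uint64_t lsn_after_write(uint64_t lsn, uint64_t len)
{
    while (len > 0) {
        uint64_t off = lsn % BLOCK_SIZE;

        if (off == 0) {
            /* On a block boundary: account for the block header first. */
            lsn += HDR_SIZE;
            off = HDR_SIZE;
        }
        /* The position must lie inside the data area of the block. */
        assert(off >= HDR_SIZE && off < BLOCK_SIZE - TRL_SIZE);

        uint64_t room = BLOCK_SIZE - TRL_SIZE - off;
        uint64_t part = len < room ? len : room;

        lsn += part;
        len -= part;
        if (part == room) {
            /* The block became full: account for its trailer. */
            lsn += TRL_SIZE;
        }
    }
    return lsn;
}
```

Under these assumptions, lsn_after_write(2048, 700) returns 2776: two headers, one trailer, and 700 bytes of data are added to the starting LSN.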

About checkpoint and log recovery:

The LSN in a page's fil_header is the LSN of the page's last flush. Suppose the database contains PAGE1 with LSN = 1024 and PAGE2 with LSN = 2048, and on restart the system finds that the last checkpoint LSN = 1024. When the system examines PAGE1 it does not redo it, and when it examines PAGE2 it redoes it. In short, a page whose LSN is no greater than the checkpoint LSN does not need redo, while a page whose LSN is greater than the checkpoint LSN is redone.

2. Log Block

InnoDB's logging system defines the concept of a log block. A log block is a 512-byte block of data that contains a block header, the log data, and the block checksum.

The highest bit of block no is a flag that records whether the block has been flushed to disk. The block no can be derived from the LSN by counting how many whole multiples of 512 the LSN contains: no = lsn / 512 + 1. The 1 is added because the LSN calculated back from the block no must be smaller than the LSN passed in. In effect, block no is the array index of the block. The checksum is an accumulation over the bytes from the start of the block up to, but not including, its last 4 bytes, computed by the following code:

    sum = 1;
    sh = 0;
    for (i = 0; i < OS_FILE_LOG_BLOCK_SIZE - LOG_BLOCK_TRL_SIZE; i++) {
        sum = sum & 0x7FFFFFFF;
        sum += (((ulint)(*(block + i))) << sh) + (ulint)(*(block + i));
        sh++;
        if (sh > 24)
            sh = 0;
    }
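To see how such a checksum can protect a block, here is a self-contained sketch that computes the sum, stores it in the 4-byte trailer, and verifies it later. The function names are hypothetical, the shift limit of 24 and the big-endian storage of the trailer are assumptions, and a 32-bit unsigned type stands in for ulint.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 512u  /* assumed log block size */
#define TRL_SIZE     4u  /* assumed trailer size holding the checksum */

/* Accumulate every byte except the trailer, as in the loop above. */
static uint32_t block_checksum(const unsigned char *block)
{
    uint32_t sum = 1;
    unsigned sh = 0;

    assert(block != NULL);
    for (unsigned i = 0; i < BLOCK_SIZE - TRL_SIZE; i++) {
        sum &= 0x7FFFFFFFu;
        sum += ((uint32_t)block[i] << sh) + (uint32_t)block[i];
        if (++sh > 24) {
            sh = 0;
        }
    }
    return sum;
}

/* Write the checksum into the trailer (big-endian is an assumption). */
static void block_store_checksum(unsigned char *block)
{
    uint32_t sum = block_checksum(block);
    unsigned char *trl = block + BLOCK_SIZE - TRL_SIZE;

    trl[0] = (unsigned char)(sum >> 24);
    trl[1] = (unsigned char)(sum >> 16);
    trl[2] = (unsigned char)(sum >> 8);
    trl[3] = (unsigned char)sum;
}

/* Recompute the checksum and compare it with the stored one. */
static bool block_is_valid(const unsigned char *block)
{
    const unsigned char *trl = block + BLOCK_SIZE - TRL_SIZE;
    uint32_t stored = ((uint32_t)trl[0] << 24) | ((uint32_t)trl[1] << 16)
                    | ((uint32_t)trl[2] << 8) | (uint32_t)trl[3];

    return stored == block_checksum(block);
}
```

A recovery routine built this way would call block_is_valid on each block it reads and stop applying log at the first block that fails the check.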

During log recovery, InnoDB verifies the checksum of each block it loads, to avoid applying corrupted data. Transaction log writes are block-based. If a transaction's log is smaller than 496 bytes, it is combined with the logs of other transactions in one block; if it is larger than 496 bytes, it is split into pieces of 496 bytes. For example, with T1 = 700 bytes and T2 = 100 bytes, T1 fills the 496 bytes of the first block and puts its remaining 204 bytes in the second block, and the 100 bytes of T2 follow in the second block.
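The splitting of transaction logs across blocks can be sketched as follows. This is an illustrative model only: split_record and its parameters are hypothetical names, and the 496-byte payload is the 512-byte block minus an assumed 12-byte header and 4-byte trailer.

```c
#include <assert.h>
#include <stddef.h>

#define LOG_BLOCK_PAYLOAD 496u  /* assumed usable bytes per block */

/* Hypothetical helper: split a record of rec_len bytes across blocks.
   *used is the payload already occupied in the current block; it is
   updated to the fill level of the last block touched. Piece sizes are
   stored in parts[], at most max_parts of them; any remainder beyond
   max_parts pieces is not recorded. Returns the number of pieces. */
static size_t split_record(size_t *used, size_t rec_len,
                           size_t *parts, size_t max_parts)
{
    size_t n = 0;

    assert(*used < LOG_BLOCK_PAYLOAD);
    while (rec_len > 0 && n < max_parts) {
        size_t room = LOG_BLOCK_PAYLOAD - *used;
        size_t part = rec_len < room ? rec_len : room;

        parts[n++] = part;
        rec_len -= part;
        /* A block that became full wraps to an empty next block. */
        *used = (*used + part) % LOG_BLOCK_PAYLOAD;
    }
    return n;
}
```

Starting from an empty block, a 700-byte record is split into pieces of 496 and 204 bytes, and a following 100-byte record fits as a single piece in the second block, which then holds 304 bytes.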

3. Redo log structure and diagram

InnoDB's redo log implementation is designed as three layers: the redo log buffer, the group files, and the archive files. Each layer is described below:

Redo log buffer: the in-memory redo log buffer. Newly written log records go here first. Getting the redo log buffer's data onto disk requires a flush operation.

Group files: the redo log file group, typically consisting of 3 files of the same size. The 3 files are written sequentially: when one log file is full, writing moves on to the next, and when the last file is full, writing wraps around and overwrites the first. InnoDB's design allows for multiple redo log groups.

Archive files: the archived log files, an incremental backup of the redo log files. They never overwrite earlier log information.

[Figure: relationships among the redo log buffer, group files, and archive files]

3.1 Redo Log Group

InnoDB can support multiple redo log groups, presumably so that when one log group is corrupted, data can be recovered from another, parallel log group. In MySQL 5.6 the number of log groups is fixed at 1, and multiple groups are not allowed. According to Kang of NetEase, InnoDB's author believes that log group integrity is better guaranteed by the storage hardware underneath, for example a RAID array. The main job of a redo log group is to manage the files in the group, the creation of checkpoints within the group and their checkpoint information, and the archive log state (only the first group performs archiving). A log group is defined as follows:

    typedef struct log_group_struct {
        ulint    id;                        /* log group id */
        ulint    n_files;                   /* number of log files in the group */
        ulint    file_size;                 /* log file size, including the file header */
        ulint    space_id;                  /* fil_space id corresponding to the group */
        ulint    state;                     /* log group state: LOG_GROUP_OK or LOG_GROUP_CORRUPTED */
        dulint   lsn;                       /* LSN of the log group */
        dulint   lsn_offset;                /* offset of the current LSN relative to the start of the files in the group */
        ulint    n_pending_writes;          /* number of fil_flush calls this group is executing */
        byte**   file_header_bufs;          /* file header buffers */
        byte**   archive_file_header_bufs;  /* archive file header buffers */
        ulint    archive_space_id;          /* id of the archived redo log */
        ulint    archived_file_no;          /* number of the log file already archived */
        ulint    archived_offset;           /* offset up to which archiving has completed */
        ulint    next_archived_file_no;     /* number of the next file to archive */
        ulint    next_archived_offset;      /* offset of the next archive */
        dulint   scanned_lsn;
        byte*    checkpoint_buf;            /* buffer holding this log group's checkpoint information */
        UT_LIST_NODE_T(log_group_t) log_groups;
    } log_group_t;

The space_id in the structure above identifies the corresponding fil_space_t structure in fil0fil. One fil_space_t can manage multiple files, each represented by a fil_node_t.

3.1.1 LSN and intra-group offset

Within a log_group_t, the most important relationship is the conversion between an LSN and an offset within the group. When a group is created, its LSN and the corresponding lsn_offset are set. Suppose a group is initialized with group lsn = 1024 and group lsn_offset = 2048, the group consists of 3 files of size 10240, and LOG_FILE_HDR_SIZE = 2048. We want to know the offset within the group that corresponds to buf lsn = 11240. Following the log_group_calc_lsn_offset function, the calculation is:
group_size = 3 * (10240 - 2048) = 24576 (the log data capacity of the group, excluding the file headers);
LSN offset relative to the start of the group = (buf_lsn - group_lsn) + log_group_calc_size_offset(lsn_offset) = (11240 - 1024) + 0 = 10216;
lsn_offset = log_group_calc_real_offset(10216 % group_size) = 10216 + 2 * LOG_FILE_HDR_SIZE = 14312;
The resulting offset must include the length of the file headers that precede it. Here 10216 bytes of data go past the 8192 data bytes of the first file, so the position lies in the second file and two headers are added.
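The conversion can be reproduced with a short model. It is a sketch that assumes the conversion works in three steps (strip the headers from the group's reference offset, add the LSN difference modulo the data capacity, then add the headers back); the struct and function names are hypothetical and 64-bit integers stand in for dulint and ulint.

```c
#include <assert.h>
#include <stdint.h>

#define LOG_FILE_HDR_SIZE 2048u  /* file header size used in the example */

/* Hypothetical, reduced stand-in for log_group_t. */
struct log_group_model {
    uint64_t lsn;         /* reference LSN of the group */
    uint64_t lsn_offset;  /* offset in the group that matches lsn */
    uint64_t file_size;   /* size of each file, header included */
    uint64_t n_files;     /* number of files in the group */
};

/* Bytes of log data the group can hold, headers excluded. */
static uint64_t group_capacity(const struct log_group_model *g)
{
    assert(g->file_size > LOG_FILE_HDR_SIZE);
    return (g->file_size - LOG_FILE_HDR_SIZE) * g->n_files;
}

/* Convert an offset that counts headers into a pure data offset. */
static uint64_t strip_headers(const struct log_group_model *g, uint64_t off)
{
    return off - LOG_FILE_HDR_SIZE * (1 + off / g->file_size);
}

/* Convert a pure data offset back into an offset that counts headers. */
static uint64_t add_headers(const struct log_group_model *g, uint64_t off)
{
    return off + LOG_FILE_HDR_SIZE
               * (1 + off / (g->file_size - LOG_FILE_HDR_SIZE));
}

/* Offset within the group that corresponds to lsn. */
static uint64_t lsn_to_group_offset(const struct log_group_model *g,
                                    uint64_t lsn)
{
    uint64_t cap = group_capacity(g);
    uint64_t base = strip_headers(g, g->lsn_offset);
    uint64_t diff;

    if (lsn >= g->lsn) {
        diff = lsn - g->lsn;
    } else {
        /* An LSN before the reference wraps backwards in the ring. */
        diff = cap - (g->lsn - lsn) % cap;
    }
    return add_headers(g, (base + diff) % cap);
}
```

With the group from the example (lsn = 1024, lsn_offset = 2048, three files of 10240 bytes), lsn_to_group_offset returns 14312 for LSN 11240 and 2048 for the reference LSN 1024 itself.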

3.1.2 file_header_bufs

file_header_bufs is an array of buffers. The array length equals the number of files in the group, and each buffer is 2048 bytes long. The header information is laid out as follows:

log_group_id: the id from the log_group_t structure

file_start_lsn: the LSN corresponding to the data at the start of the current file

file_no: the current file number, generally used in archive file headers

hot backup str: an empty string; for a hot backup it is filled with the file suffix ibackup

file_end_lsn: the LSN corresponding to the end of the file's data, typically used in archive files

3.2 Checkpoint

A checkpoint is a check mark in the log. After the database fails abnormally, recovery reads the LSN from the checkpoint information in the redo log and redoes the log and pages that come after the checkpoint. So how is a checkpoint created? When the LSN being written by the log has moved a certain distance past the LSN of the last checkpoint, a new checkpoint is started. Creating a checkpoint first writes the dirty pages held in memory to disk, and then writes to disk the part of the redo log buffer whose LSN is smaller than this checkpoint's LSN. The checkpoint information is kept in checkpoint_buf of log_group_t. Its fields are explained below:

LOG_CHECKPOINT_NO: the checkpoint sequence number

LOG_CHECKPOINT_LSN: the LSN at which this checkpoint starts

LOG_CHECKPOINT_OFFSET: the starting offset of this checkpoint relative to the group files

LOG_CHECKPOINT_LOG_BUF_SIZE: the redo log buffer size, 2M by default

LOG_CHECKPOINT_ARCHIVED_LSN: the LSN of the current log archive

LOG_CHECKPOINT_GROUP_ARRAY: an array holding the file number and offset of each log group at archive time

3.3 log_t

Redo log writes, flushes to disk, checkpoint creation, and archiving are all controlled by log_sys, a single global object of type log_t. log_t is a very large and complex structure, defined as follows:

    typedef struct log_struct {
        byte        pad[64];                     /* padding so the log_struct data can be placed in a cache line of its own; this relates directly to the CPU L1 cache */
        dulint      lsn;                         /* log sequence number, in effect an offset into the log files */
        ulint       buf_free;                    /* position in buf where the next write goes */
        mutex_t     mutex;                       /* mutex protecting the log */
        byte*       buf;                         /* log buffer */
        ulint       buf_size;                    /* log buffer length */
        ulint       max_buf_free;                /* recommended maximum value of buf_free after a log buffer flush; beyond this value a flush is forced */
        ulint       old_buf_free;                /* value of buf_free at the last write, used for debugging */
        dulint      old_lsn;                     /* LSN at the last write, used for debugging */
        ibool       check_flush_or_checkpoint;   /* flag: a log flush to disk or a new checkpoint is needed */
        ulint       buf_next_to_write;           /* offset in buf where the next write to disk starts */
        dulint      written_to_some_lsn;         /* LSN up to which the first group has finished flushing */
        dulint      written_to_all_lsn;          /* LSN already recorded in the log files */
        dulint      flush_lsn;                   /* LSN of the flush */
        ulint       flush_end_offset;            /* buf_free at the last log flush, that is, the end offset of the last flush */
        ulint       n_pending_writes;            /* number of fil_flush calls in progress */
        os_event_t  no_flush_event;              /* signaled only after all fil_flush calls have completed; used to wait for every group to finish flushing */
        ibool       one_flushed;                 /* set to TRUE once one log group has been flushed */
        os_event_t  one_flushed_event;           /* signaled as soon as one group's flush completes */
        ulint       n_log_ios;                   /* number of I/O operations of the log system */
        ulint       n_log_ios_old;               /* number of I/O operations at the last count */
        time_t      last_printout_time;
        ulint       max_modified_age_async;      /* threshold for asynchronous log file flush */
        ulint       max_modified_age_sync;       /* threshold for synchronous log file flush */
        ulint       adm_checkpoint_interval;
        ulint       max_checkpoint_age_async;    /* threshold for creating a checkpoint asynchronously */
        ulint       max_checkpoint_age;          /* threshold for forcing a checkpoint */
        dulint      next_checkpoint_no;
        dulint      last_checkpoint_lsn;
        dulint      next_checkpoint_lsn;
        ulint       n_pending_checkpoint_writes;
        rw_lock_t   checkpoint_lock;             /* checkpoint rw_lock_t, held exclusively while a checkpoint is in progress */
        byte*       checkpoint_buf;              /* buffer storing the checkpoint information */
        ulint       archiving_state;
        dulint      archived_lsn;
        dulint      max_archived_lsn_age_async;
        dulint      max_archived_lsn_age;
        dulint      next_archived_lsn;
        ulint       archiving_phase;
        ulint       n_pending_archive_ios;
        rw_lock_t   archive_lock;
        ulint       archive_buf_size;
        byte*       archive_buf;
        os_event_t  archiving_on;
        ibool       online_backup_state;         /* whether a backup is in progress */
        dulint      online_backup_lsn;           /* backup LSN */
        UT_LIST_BASE_NODE_T(log_group_t) log_groups;
    } log_t;

3.3.1 Relationships among the various LSNs

The structure definition above contains many LSN-related fields. How do these LSNs relate to one another? Understanding their relationships goes a long way toward understanding how the whole redo log system works. The LSNs are explained below:

lsn: the LSN up to which the log system has most recently written log data

flush_lsn: the LSN at the end of the data last flushed from the redo log buffer, which serves as the starting LSN of the next flush

written_to_some_lsn: the starting LSN of the last log flush for a single log group

written_to_all_lsn: the starting LSN of the last log flush for all log groups

last_checkpoint_lsn: the LSN at which the log data of the last created checkpoint starts

next_checkpoint_lsn: the LSN at which the log data of the next checkpoint starts, obtained with log_buf_pool_get_oldest_modification

archived_lsn: the LSN at which the last archived log data starts

next_archived_lsn: the LSN of the next log data to archive
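One way to keep these LSNs apart is to write down the ordering we would expect among four of them. The check below is a sketch built on an assumption, not something taken from the InnoDB source: it assumes that a checkpoint never gets ahead of the log flushed to all groups, that all groups never get ahead of a single group, and that no flush gets ahead of the LSN being written. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the LSN fields of log_t. */
struct log_lsns {
    uint64_t lsn;
    uint64_t written_to_some_lsn;
    uint64_t written_to_all_lsn;
    uint64_t last_checkpoint_lsn;
};

/* Assumed ordering:
   last_checkpoint_lsn <= written_to_all_lsn
                       <= written_to_some_lsn <= lsn */
static bool log_lsns_ordered(const struct log_lsns *l)
{
    assert(l != NULL);
    return l->last_checkpoint_lsn <= l->written_to_all_lsn
        && l->written_to_all_lsn <= l->written_to_some_lsn
        && l->written_to_some_lsn <= l->lsn;
}
```

A debug build could call such a check after each flush or checkpoint to catch a field that was updated out of order.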

[Figure: relationships among the LSNs]

3.3.2 Analysis of offsets

log_t has several offsets, such as max_buf_free, buf_free, flush_end_offset, and buf_next_to_write. Offsets are not the same thing as LSNs: an offset is an absolute position relative to the start of the redo log buffer, whereas the LSN is the sequence number of the entire log system.

max_buf_free: log writes should not go past this offset; if they do, the redo log buffer is forcibly written to disk

buf_free: the offset at which the next log record can be written

buf_next_to_write: the offset at which the next write of redo log buffer data to disk starts; once all flush I/O has completed, its value equals flush_end_offset

flush_end_offset: the offset of the end of the data in the current flush, equal to buf_free at the time of the flush. When flush_end_offset exceeds half of max_buf_free, the data not yet written to disk is moved to the front of the redo log buffer, and buf_free and buf_next_to_write are adjusted accordingly.
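The adjustment that follows a completed flush can be sketched as below. This is a simplified model, not the InnoDB routine: the struct and function names are hypothetical, and moving whole 512-byte blocks (aligning the start down and the end up) is an assumption about how the move is done.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 512u  /* assumed log block size */

/* Hypothetical subset of the buffer fields of log_t. */
struct log_buf_model {
    unsigned char *buf;
    size_t buf_size;
    size_t max_buf_free;
    size_t buf_free;
    size_t buf_next_to_write;
    size_t flush_end_offset;
};

static size_t align_down(size_t n) { return n - n % BLOCK_SIZE; }
static size_t align_up(size_t n) { return align_down(n + BLOCK_SIZE - 1); }

/* Called when all flush I/O has completed. Returns the number of bytes
   by which the buffer content was shifted toward the front. */
static size_t log_buf_flush_completed(struct log_buf_model *lb)
{
    size_t move_start, move_end;

    lb->buf_next_to_write = lb->flush_end_offset;
    if (lb->flush_end_offset <= lb->max_buf_free / 2) {
        return 0;  /* still enough room; nothing to move */
    }

    move_start = align_down(lb->flush_end_offset);
    move_end = align_up(lb->buf_free);
    assert(move_end <= lb->buf_size);

    /* Move the data not yet written to disk to the front of the buffer. */
    memmove(lb->buf, lb->buf + move_start, move_end - move_start);
    lb->buf_free -= move_start;
    lb->buf_next_to_write -= move_start;
    /* flush_end_offset is left alone in this model; the caller sets it
       again when the next flush starts. */
    return move_start;
}
```

For example, with max_buf_free = 2048, buf_free = 1600, and flush_end_offset = 1100, the content is shifted by 1024 bytes, leaving buf_free = 576 and buf_next_to_write = 76.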

[Figure: relative sizes of the offsets]

3.4 Memory Structure diagram

4. Log write and log protection mechanism

InnoDB has four kinds of log flush behavior: asynchronous redo log buffer flush, synchronous redo log buffer flush, asynchronous checkpoint creation, and synchronous checkpoint creation. Flushing to disk consumes a great deal of disk I/O, so InnoDB applies a carefully designed set of policies to it.

4.1 Redo log flush options

The InnoDB engine has a global variable, srv_flush_log_at_trx_commit, that controls the flush-to-disk policy: it determines whether fsync is called and when. The variable has three possible values, interpreted as follows:

0: Once per second, the master thread has the redo log module call log_flush_to_disk to flush the log to disk. The advantage is higher efficiency; the disadvantage is that if the database crashes, up to 1 second of logs and data is lost.

1: Each time the redo log is written, fsync is called to write the log to disk. The advantage is that every log write reaches the disk, so data reliability is much higher; the disadvantage is that every fsync call generates a lot of disk I/O and hurts database performance.

2: Each time the redo log is written, the log is written only to the page cache of the log file. In this case, logs that have not yet reached the disk are lost if the physical machine crashes.
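The three settings can be summarized as a small decision function. This is an illustrative sketch: the struct and function names are hypothetical, and it models only what happens at transaction commit, leaving out the once-per-second flush by the master thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of what a commit does with the redo log. */
struct commit_action {
    bool write_log;  /* write the log buffer to the log file (page cache) */
    bool fsync_log;  /* additionally call fsync to force it to disk */
};

/* Map a value of srv_flush_log_at_trx_commit to the commit-time action. */
static struct commit_action action_at_commit(int flush_log_at_trx_commit)
{
    struct commit_action a = { false, false };

    assert(flush_log_at_trx_commit >= 0 && flush_log_at_trx_commit <= 2);
    switch (flush_log_at_trx_commit) {
    case 0:
        /* Nothing at commit; the periodic flush takes care of the log. */
        break;
    case 1:
        a.write_log = true;
        a.fsync_log = true;
        break;
    case 2:
        a.write_log = true;
        break;
    }
    return a;
}
```

The table makes the trade-off explicit: only value 1 pays for an fsync on every commit, which is where both its durability and its I/O cost come from.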

4.2 Log flush protection

Because the redo log cycles through the files of a group, log data can be overwritten if dirty data is not written out and checkpoints are not created in time, which must not happen. InnoDB therefore defines a log protection mechanism: the storage engine periodically calls the log_check_margins function to check it. A brief introduction follows.

Three quantities are involved: buf_age, checkpoint_age, and the log space size.

buf_age = lsn - oldest_lsn;

checkpoint_age = lsn - last_checkpoint_lsn;

log space size = the number of log bytes the redo log group can hold (obtained via log_group_get_capacity);

When buf_age >= 7/8 of the log space size, the redo log system flushes the redo log buffer asynchronously. Because the flush is asynchronous, it does not block data operations.

When buf_age >= 15/16 of the log space size, the redo log system flushes the redo log buffer synchronously. This calls fsync, and database operations are blocked.

When checkpoint_age >= 31/32 of the log space size, the log system creates a checkpoint asynchronously, and database operations are not blocked.

When checkpoint_age == the log space size, the log system creates a checkpoint synchronously. A large number of dirty tablespace pages and dirty log pages are flushed to disk synchronously, producing heavy disk I/O. Database operations are blocked and all database transactions are suspended.
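The four thresholds can be expressed as two small classification functions. This is a sketch of the rules as stated in this article, not the body of log_check_margins: the enum and function names are hypothetical, the fractions use integer division, and the synchronous checkpoint is triggered with >= rather than == so that an age beyond the capacity is also caught.

```c
#include <assert.h>
#include <stdint.h>

enum flush_action { FLUSH_NONE, FLUSH_ASYNC, FLUSH_SYNC };
enum ckpt_action  { CKPT_NONE, CKPT_ASYNC, CKPT_SYNC };

/* buf_age = lsn - oldest_lsn; capacity = log space size in bytes. */
static enum flush_action flush_action_for(uint64_t buf_age, uint64_t capacity)
{
    assert(capacity > 0);
    if (buf_age >= capacity / 16 * 15) {
        return FLUSH_SYNC;   /* 15/16 reached: blocking flush */
    }
    if (buf_age >= capacity / 8 * 7) {
        return FLUSH_ASYNC;  /* 7/8 reached: background flush */
    }
    return FLUSH_NONE;
}

/* checkpoint_age = lsn - last_checkpoint_lsn. */
static enum ckpt_action checkpoint_action_for(uint64_t checkpoint_age,
                                              uint64_t capacity)
{
    assert(capacity > 0);
    if (checkpoint_age >= capacity) {
        return CKPT_SYNC;    /* log space exhausted: blocking checkpoint */
    }
    if (checkpoint_age >= capacity / 32 * 31) {
        return CKPT_ASYNC;   /* 31/32 reached: background checkpoint */
    }
    return CKPT_NONE;
}
```

For a log space size of 24576 bytes, the asynchronous flush starts at a buf_age of 21504, the synchronous flush at 23040, the asynchronous checkpoint at a checkpoint_age of 23808, and the synchronous checkpoint at 24576.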

5. Summary

InnoDB's redo log system is quite complete and embodies many careful design decisions for data durability. Its efficiency directly affects MySQL's write efficiency, so a deep understanding of it helps us tune it, especially when large amounts of data are being flushed to disk. If the database processes transactions faster than the disk can absorb the I/O, synchronous checkpoint creation kicks in: the database blocks while it flushes dirty pages to disk. The way to avoid this is to increase I/O capacity, for example by spreading the I/O load over multiple disks. A storage medium with high read/write throughput, such as SSD, is another option.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
