MySQL Engine Features · InnoDB Crash Recovery process


In the previous two monthly reports, we covered InnoDB redo logs and undo logs in detail; this article describes the main steps InnoDB takes during crash recovery.

The code analysis in this article is based on MySQL 5.7.7-RC. The entry point is innobase_start_or_create_for_mysql , a very long function; this article covers only the code related to crash recovery.

Before reading this article, we strongly recommend reading our previous two monthly reports:
1. MySQL Engine Features · InnoDB Undo Log Roaming
2. MySQL Engine Features · InnoDB Redo Log Roaming

Initializing crash recovery

First, the memory objects needed for crash recovery are initialized:

    recv_sys_create();
    recv_sys_init(buf_pool_get_curr_size());

When InnoDB shuts down normally, it flushes the redo log and the dirty pages, performs a fully synchronized checkpoint, and writes the checkpoint LSN to the first page of ibdata ( fil_write_flushed_lsn ).

When the instance is restarted, the system tablespace ibdata is opened and the LSN stored in it is read:

    err = srv_sys_space.open_or_create(
            false, &sum_of_new_sizes, &flushed_lsn);

The above call stores the LSN read from ibdata into the variable flushed_lsn. It represents the checkpoint at the time of the last shutdown and is used later during crash recovery. In addition, the pages stored in the double write buffer are loaded into memory ( buf_dblwr_init_or_load_pages ); if the first page of ibdata is damaged, it is recovered from the double write buffer.

Tips: In versions before MySQL 5.6.16, if the first page of an InnoDB tablespace was damaged, InnoDB could not determine the tablespace's space ID and therefore could not decide which page in the double write buffer to use for recovery, so crash recovery failed (bug#70087).
Since the tablespace ID is stored on every data page, the logic was changed to read a few more pages and try different page sizes until a good data page is found (see the function Datafile::find_space_id() ). Therefore, to safely rely on the double write buffer to protect data, MySQL 5.6.16 or later is recommended.
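To illustrate the idea, here is a minimal C++ sketch, not the InnoDB implementation, of probing pages for a consistent space ID in the spirit of Datafile::find_space_id(); the caller would repeat it for each supported page size. The 4-byte big-endian space-ID field at offset 34 of the page header follows the on-disk format, while the voting threshold is an arbitrary choice for this sketch:

    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <vector>

    static uint32_t read_u32_be(const unsigned char* p) {
        return (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16)
             | (uint32_t(p[2]) << 8)  |  uint32_t(p[3]);
    }

    // Probe the first pages_to_probe pages of an open .ibd file, assuming a
    // given page size, and return the space ID that most readable pages agree
    // on (0 if no agreement). FIL_PAGE_SPACE_ID lives at offset 34.
    uint32_t probe_space_id(std::FILE* f, size_t page_size, size_t pages_to_probe) {
        std::vector<unsigned char> page(page_size);
        std::map<uint32_t, int> votes;
        for (size_t i = 0; i < pages_to_probe; i++) {
            if (std::fseek(f, long(i * page_size), SEEK_SET) != 0) break;
            if (std::fread(page.data(), 1, page_size, f) != page_size) break;
            votes[read_u32_be(page.data() + 34)]++;
        }
        uint32_t best = 0;
        int best_votes = 0;
        for (const auto& kv : votes)
            if (kv.second > best_votes) { best = kv.first; best_votes = kv.second; }
        return best_votes >= 2 ? best : 0;  // require at least two matching pages
    }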

Recovering TRUNCATE operations

To ensure the atomicity of truncate operations on standalone undo log tablespaces and file-per-table user tablespaces, InnoDB uses a file-based log: a unique file is created for each truncate operation. If such a file still exists at restart, the last truncate had not yet completed when the instance crashed, and the truncate must be completed during reboot.

This part of crash recovery is independent of the redo log system.

Recovery of undo log tablespaces is completed when the undo subsystem is initialized:

    err = srv_undo_tablespaces_init(
            create_new_db,
            srv_undo_tablespaces,
            &srv_undo_tablespaces_open);

For user tablespaces, the data directory is scanned to locate truncate log files. If a file contains no data, the truncate never started; if a magic number has already been written to the file, the truncate operation completed. Neither case requires further processing (a small code sketch of this check appears at the end of this section).

    err = TruncateLogParser::scan_and_parse(srv_log_group_home_dir);

However, recovery of user-tablespace truncate operations is performed only after the redo log has been applied, mainly because redoing a truncate may involve updating system tables (for example, rebuilding indexes), which must happen after redo apply is complete.
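The state classification described above can be pictured with a small C++ sketch; the file layout assumed here (a 4-byte magic value written at the start of the file only once the truncate completes) is a simplification, not the real truncate log format:

    #include <cstdint>
    #include <cstdio>

    enum class TruncateState { NOT_STARTED, IN_PROGRESS, DONE };

    // Classify a truncate log file: absent or empty means the truncate never
    // started; the magic value means it completed; anything else means the
    // truncate must be redone during startup.
    TruncateState classify_truncate_log(const char* path, uint32_t magic) {
        std::FILE* f = std::fopen(path, "rb");
        if (f == nullptr) return TruncateState::NOT_STARTED;  // file never created
        uint32_t word = 0;
        size_t n = std::fread(&word, 1, sizeof word, f);
        std::fclose(f);
        if (n == 0) return TruncateState::NOT_STARTED;        // empty file
        if (n == sizeof word && word == magic)
            return TruncateState::DONE;                       // magic written: done
        return TruncateState::IN_PROGRESS;                    // redo the truncate
    }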

Entering the redo crash recovery logic

Entry function:

    err = recv_recovery_from_checkpoint_start(flushed_lsn);

The parameter flushed_lsn is the LSN read from the first page of ibdata. The function consists mainly of the following steps:

Step 1: Create a red-black tree for each buffer pool instance, pointed to by buf_pool_t::flush_rbt , mainly to accelerate insertions into the flush list ( buf_flush_init_flush_rbt );
Step 2: Read the checkpoint LSN stored in the header of the first redo log file and, based on that LSN, seek to the corresponding position in the redo log file to start scanning from the checkpoint.

recv_group_scan_log_recs is called here to scan the redo log files up to three times:

1. The goal of the first scan is to find the MLOG_CHECKPOINT record

An MLOG_CHECKPOINT record stores a checkpoint LSN. When a record is found whose stored LSN matches the checkpoint LSN read from the log header, the corresponding MLOG_CHECKPOINT has been located, and the LSN the scan has reached is recorded in recv_sys->mlog_checkpoint_lsn. (This scan does not exist in version 5.6.)

MLOG_CHECKPOINT was introduced in WL#7142 to simplify InnoDB crash recovery logic. According to the WL#7142 description, it brings several improvements:

    1. Crash recovery no longer needs to read the first page of every ibd file to confirm its space ID;
    2. There is no need to check $DATADIR/*.isl files: the new log type records the full path of the file, eliminating problems caused by an ISL file being inconsistent with the actual ibd location;
    3. ibd files that have not yet been imported into InnoDB are ignored automatically (for example, after a crash during an import tablespace operation);
    4. A new log type, MLOG_FILE_DELETE, was introduced to track the deletion of ibd files.

A potential problem here is that if the gap between the MLOG_CHECKPOINT record and the checkpoint LSN recorded in the file header is too large, the first scan may spend a lot of time on meaningless parsing; there still seems to be room for optimization.

In my test instance, because of the heavy load applied at the time of the crash, the LSN gap between the MLOG_CHECKPOINT record and the checkpoint was approximately 1 GB of redo log.
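Conceptually, the first pass looks like the following C++ sketch; the record type and the flat record stream are simplifications, not the InnoDB parser:

    #include <cstdint>
    #include <vector>

    enum class LogType { MLOG_CHECKPOINT, MLOG_FILE_NAME, OTHER };

    struct LogRecord {
        LogType  type;
        uint64_t lsn;          // LSN at which this record starts
        uint64_t payload_lsn;  // for MLOG_CHECKPOINT: the checkpoint LSN it names
    };

    // First pass: walk the records forward from the checkpoint and return the
    // LSN of the MLOG_CHECKPOINT record that names checkpoint_lsn, or 0 if no
    // such record exists (a clean shutdown leaves none after its checkpoint).
    uint64_t scan_for_mlog_checkpoint(const std::vector<LogRecord>& stream,
                                      uint64_t checkpoint_lsn) {
        for (const LogRecord& rec : stream) {
            if (rec.type == LogType::MLOG_CHECKPOINT
                && rec.payload_lsn == checkpoint_lsn) {
                return rec.lsn;  // what recv_sys->mlog_checkpoint_lsn records
            }
        }
        return 0;
    }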

2. The second scan re-scans from the checkpoint and stores log objects

Each parsed log record is stored as a recv_t object, which contains the log type, length, data, and start and end LSN. The log objects are stored in a hash structure whose hash value is computed from the space ID and page number, and changes to the same page are chained together as list nodes. The approximate structure can be expressed as shown in the sketch below.
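Below is a minimal C++ sketch of that layout, using simplified stand-ins for recv_t and recv_sys->addr_hash rather than the actual InnoDB definitions:

    #include <cstdint>
    #include <functional>
    #include <list>
    #include <unordered_map>
    #include <vector>

    struct RecvRecord {                 // stand-in for recv_t
        uint8_t              type;      // redo record type (MLOG_*)
        uint64_t             start_lsn; // LSN where the record starts
        uint64_t             end_lsn;   // LSN where the record ends
        std::vector<uint8_t> body;      // record payload to apply later
    };

    struct PageKey {                    // hash key: (space ID, page number)
        uint32_t space_id;
        uint32_t page_no;
        bool operator==(const PageKey& o) const {
            return space_id == o.space_id && page_no == o.page_no;
        }
    };

    struct PageKeyHash {
        size_t operator()(const PageKey& k) const {
            return std::hash<uint64_t>()((uint64_t(k.space_id) << 32) | k.page_no);
        }
    };

    // All buffered changes for one page are chained in the order they must be
    // applied; the map plays the role of recv_sys->addr_hash.
    using AddrHash =
        std::unordered_map<PageKey, std::list<RecvRecord>, PageKeyHash>;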

During the scan, MLOG_FILE_NAME and MLOG_FILE_DELETE records are used to build recv_spaces , a mapping from space IDs to the file information of tablespaces that may require crash recovery ( fil_name_parse -> fil_name_process ). (In fact, entries are also inserted into recv_spaces during the first scan, but only until the MLOG_CHECKPOINT record is seen.)

Tips: The first time a table's data is modified after a checkpoint, an MLOG_FILE_NAME record is always written. This log type makes it possible to track which tablespaces were modified after the checkpoint, avoiding the need to open all tables.
During the second scan, every tablespace about to be modified is checked against recv_spaces ; if it is not there, this is treated as a serious error and the server refuses to start ( recv_parse_or_apply_log_rec_body ).

By default, the redo log is read into log_sys->buf in batches of 64 KB (RECV_SCAN_SIZE), and recv_scan_log_recs is then called to process the log blocks. It checks each block's validity: whether it was completely written and whether the block checksum is correct, and it also makes judgments based on several flag bits (a sketch of these checks follows the list below):

    • Every time the redo log is written, the flush bit in the header of the first block is set to true, marking the start position of the write; when the log is scanned after a restart, the scan LSN is advanced according to this flush bit;
    • Every redo write also records the next checkpoint number on each block (incremented each time a checkpoint is done). Because the log files are recycled, the checkpoint number is used to decide whether a block belongs to an old pass over the redo log.
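A C++ sketch of these block checks follows; the field layout and the checksum function are stand-ins rather than the InnoDB on-disk format, and the real wrap-around comparison on the checkpoint number is more careful than shown here:

    #include <cstdint>

    struct LogBlock {
        uint32_t data_len;       // bytes of log data actually used in the block
        uint32_t checksum;       // stored checksum of the block body
        uint32_t checkpoint_no;  // checkpoint counter at the time of the write
    };

    // Stand-in checksum; InnoDB uses its own log block checksum algorithms.
    static uint32_t compute_checksum(const LogBlock& b) {
        return (b.data_len * 2654435761u) ^ b.checkpoint_no;
    }

    // A block is accepted only if it was completely written, its checksum
    // holds, and it does not come from an older lap of the circular log files.
    bool block_is_valid(const LogBlock& b, uint32_t block_capacity,
                        uint32_t expected_checkpoint_no) {
        if (b.data_len > block_capacity) return false;        // torn/partial write
        if (compute_checksum(b) != b.checksum) return false;  // corrupted block
        if (b.checkpoint_no < expected_checkpoint_no) return false;  // stale lap
        return true;
    }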

Valid log content is copied into the buffer recv_sys->buf, and recv_parse_log_recs is called to parse the log records. Records are handled according to their type, and application is attempted via the following call stack:

    recv_parse_log_recs
        --> recv_parse_log_rec
            --> recv_parse_or_apply_log_rec_body

To understand how InnoDB performs crash recovery for each log type, reading the function recv_parse_or_apply_log_rec_body is essential: it is the entry point of redo log apply.

For example, if the parsed log type is MLOG_UNDO_HDR_CREATE, the transaction ID is extracted from the log and the undo log header is rebuilt ( trx_undo_parse_page_header ); for an insert (MLOG_REC_INSERT or MLOG_COMP_REC_INSERT), the index information ( mlog_parse_index ) and record information ( page_cur_parse_insert_rec ) must be parsed; for an in-place update (MLOG_REC_UPDATE_IN_PLACE), btr_cur_parse_update_in_place is called.
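The dispatch can be pictured like this; the enum values and the parser names in the comments are real InnoDB identifiers, but the printf bodies are stand-ins for the actual parse-and-apply calls:

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    enum LogRecType : uint8_t {
        MLOG_UNDO_HDR_CREATE,
        MLOG_REC_INSERT,
        MLOG_COMP_REC_INSERT,
        MLOG_REC_UPDATE_IN_PLACE
        // ... dozens more record types in the real enum
    };

    void apply_log_rec_body(LogRecType type, const uint8_t* body, size_t len) {
        (void)body;  // payload parsing omitted in this sketch
        switch (type) {
        case MLOG_UNDO_HDR_CREATE:      // rebuild an undo log header
            std::printf("trx_undo_parse_page_header (%zu bytes)\n", len);
            break;
        case MLOG_REC_INSERT:           // redundant-format insert
        case MLOG_COMP_REC_INSERT:      // compact-format insert
            std::printf("mlog_parse_index + page_cur_parse_insert_rec\n");
            break;
        case MLOG_REC_UPDATE_IN_PLACE:  // non-size-changing update
            std::printf("btr_cur_parse_update_in_place\n");
            break;
        }
    }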

The second scan applies only MLOG_FILE_* records (recording them in recv_spaces ); all other record types are stored in the hash after parsing. The function recv_init_crash_recovery_spaces is then called to initialize the tablespaces involved:

    • First, two very familiar log messages are printed:

        [Note] InnoDB: Database was not shutdown normally!
        [Note] InnoDB: Starting crash recovery.
    • If a tablespace in recv_spaces has not been deleted and its ibd file exists, this is a normal case; the tablespace is added to the list fil_system->named_spaces ( fil_names_dirty ), and redo may be applied to it later;

    • For tablespaces that have been deleted, log apply can be skipped: the record entries for the corresponding space ID in recv_sys->addr_hash are marked RECV_DISCARDED;

    • buf_dblwr_process() is invoked to check every page recorded in the double write buffer; if the corresponding data file page is damaged, it is recovered directly from the double write buffer;

    • Finally, a temporary background thread running recv_writer_thread is created. Working together with the page cleaner thread, it notifies the page cleaner to flush the dirty pages produced by crash recovery, until all the redo records stored in recv_sys have been applied and fully released ( recv_sys->heap == NULL ).

3. If during the second scan the hash runs out of memory and records cannot be stored, a third scan is initiated: the hash is emptied and scanning starts again from the checkpoint.

The maximum memory available for the hash is generally the buffer pool size minus 512 pages.

The third scan does not try to hold everything in the hash at once: as soon as the hash is about to fill up, the buffered redo log is applied immediately. Still, if the space needed is only slightly larger than the maximum available, the overhead of the extra scan is noticeable.

In short, the first scan finds the MLOG_CHECKPOINT position; the second parses the redo log and stores it in the hash; if the hash space is insufficient, another round starts over, parsing a batch and applying a batch at a time (see the sketch below).
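The overall loop can be sketched as follows; the two callbacks stand in for recv_group_scan_log_recs filling the hash and recv_apply_hashed_log_recs draining it, and the memory accounting is simplified:

    #include <cstddef>
    #include <functional>

    // Parse-a-batch / apply-a-batch: whenever the hash reaches its memory
    // budget (roughly buffer pool size minus 512 pages), apply what has been
    // collected, empty the hash, and keep scanning.
    void scan_and_apply_in_batches(
            std::function<bool(size_t&)> parse_some,  // updates heap use; false at end of log
            std::function<void()> apply_and_empty_hash,
            size_t heap_limit) {
        size_t heap_used = 0;
        bool more = true;
        while (more) {
            more = parse_some(heap_used);
            if (heap_used >= heap_limit) {  // hash full: apply now, then go on
                apply_and_empty_hash();
                heap_used = 0;
            }
        }
        // Whatever remains in the hash is applied later, after
        // recv_sys->apply_log_recs has been set to true.
    }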

After the scans, the hash usually still contains redo records that have not been applied. This work is deferred: recv_sys->apply_log_recs is set to true and the function recv_recovery_from_checkpoint_start returns.

In a normal shutdown, no MLOG_CHECKPOINT record is written after the final checkpoint completes. If the scan finds no such record, the previous shutdown is considered clean and no crash recovery is performed.

Tips: Occasionally we see a message like this in the log:
"The log sequence number XXX in the system tablespace does not match the log sequence number XXX in the ib_logfiles!"
The internal logic is that the LSN recorded in ibdata is inconsistent with the checkpoint LSN recorded in the ib_logfiles, while the system has also determined that no crash recovery is needed, so only this message is reported. Because making a checkpoint and updating ibdata is not an atomic operation, this message is generally negligible.

Initializing the transaction subsystem ( trx_sys_init_at_db_start )

This step reads undo-related system pages. In the crash recovery state, every page read must first have its redo log applied before the caller may use it, as in the following stack:

    trx_sys_init_at_db_start
        --> trx_sysf_get
            --> ... --> buf_page_io_complete --> recv_recover_page

Therefore, when the rollback segments are initialized, reading the rollback segment pages (with redo applied) restores them to a consistent state. This "revives" the transactions that were active when the system crashed and keeps them in the read-write transaction list. Transactions in the prepared state need additional handling later.

For details on how transactions are resurrected during crash recovery, see the last section of the April monthly report "MySQL Engine Features · InnoDB Undo Log Roaming".

Applying the redo log ( recv_apply_hashed_log_recs )

Based on the log records previously collected in recv_sys->addr_hash, pages are read into memory, and crash recovery is performed on each page ( recv_recover_page_func ):

    • Log records belonging to deleted tablespaces are skipped directly;

    • When reading a file page that needs recovery, read-ahead is attempted ( recv_read_in_area ): up to 32 consecutive page numbers awaiting recovery (RECV_READ_AHEAD_AREA) are collected and an asynchronous read request is issued. As each page arrives in the buffer pool, the crash recovery logic is triggered;

    • Only records whose LSN is greater than the LSN on the data page are applied; redo for tables being truncated is ignored;

    • No new redo log is generated while recovering a data page;

    • After a page has been repaired, the dirty page must be added to the buffer pool's flush list. InnoDB must keep the flush list ordered, but during crash recovery a page's modification LSN comes from the redo records rather than the current global LSN, so insertion order is not guaranteed. InnoDB therefore maintains a red-black tree to preserve ordering: on each insertion into the flush list, the tree is searched for the appropriate position and the page is inserted there ( buf_flush_recv_note_modification ). A sketch of this per-page apply and ordered insertion follows this list.
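A simplified C++ sketch of the two rules above; the types are stand-ins for buf_page_t and the flush list, not InnoDB's structures:

    #include <cstdint>
    #include <list>
    #include <vector>

    struct Page {
        uint64_t page_lsn;             // FIL_PAGE_LSN stamped on the page
        uint64_t oldest_modification;  // LSN of the first redo applied here
    };

    struct RedoRec { uint64_t start_lsn, end_lsn; };

    // Apply only records newer than the page itself; older records were
    // already on disk before the crash. No new redo is generated here.
    void recover_page(Page& p, const std::vector<RedoRec>& recs) {
        for (const RedoRec& r : recs) {
            if (r.end_lsn <= p.page_lsn) continue;  // already durable: skip
            // ... replay the record body onto the page ...
            if (p.oldest_modification == 0) p.oldest_modification = r.start_lsn;
            p.page_lsn = r.end_lsn;
        }
    }

    // Ordered insert into the flush list (kept sorted by oldest_modification,
    // newest first); this linear search is what buf_pool_t::flush_rbt
    // accelerates in buf_flush_recv_note_modification.
    void flush_list_insert(std::list<Page*>& flush_list, Page* p) {
        auto it = flush_list.begin();
        while (it != flush_list.end()
               && (*it)->oldest_modification > p->oldest_modification) ++it;
        flush_list.insert(it, p);
    }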

Completing crash recovery ( recv_recovery_from_checkpoint_finish )

Once all redo has been applied, the core crash recovery is complete and resources can be freed: wait for the recv writer thread to exit (all dirty pages generated by crash recovery have been flushed), release the red-black tree, and roll back all non-prepared transactions created by data dictionary operations ( trx_rollback_or_clean_recovered ).

Invalid data cleanup and transaction rollback:

The function recv_recovery_rollback_active is called to complete the following work:

    • Drop temporarily created indexes, such as temporary indexes left over from a crash during index-creating DDL ( row_merge_drop_temp_indexes() );
    • Clean up InnoDB temporary tables ( row_mysql_drop_temp_tables );
    • Clean up orphaned auxiliary tables of full-text indexes ( fts_drop_orphaned_tables() );
    • Create a background thread whose thread function is trx_rollback_or_clean_all_recovered ; unlike the call made in recv_recovery_from_checkpoint_finish, this background thread rolls back all transactions that are not in the prepared state.

At this point, InnoDB-level crash recovery is finished; only transactions in the prepared state remain, and handling them requires coordinated crash recovery with the binlog at the server layer.

Binlog/InnoDB XA Recovery

Back at the server layer, after all storage engines have been initialized, if the binlog is enabled it is used for XA recovery:

    • First, the last binlog file is scanned, all XID events are found, and the XIDs are recorded in a hash structure ( MYSQL_BIN_LOG::recover );
    • Then the interface function xarecover_handlerton is called for each engine to obtain the XIDs of prepared transactions in that engine; if an XID exists in the binlog, the transaction is committed, otherwise it is rolled back (see the sketch below).
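The decision rule can be sketched as follows; the XID type and containers are simplifications of the server's handlerton interface:

    #include <cstdint>
    #include <unordered_set>
    #include <utility>
    #include <vector>

    enum class Fate { COMMIT, ROLLBACK };

    // xids_in_binlog: every XID event harvested from the last binlog file by
    // MYSQL_BIN_LOG::recover. prepared_in_engine: prepared-state XIDs the
    // engine reports, as collected through xarecover_handlerton.
    std::vector<std::pair<uint64_t, Fate>> resolve_xa(
            const std::unordered_set<uint64_t>& xids_in_binlog,
            const std::vector<uint64_t>& prepared_in_engine) {
        std::vector<std::pair<uint64_t, Fate>> decisions;
        for (uint64_t xid : prepared_in_engine) {
            // In the binlog => the transaction was made visible to replication
            // and must commit; otherwise it is rolled back.
            decisions.emplace_back(xid, xids_in_binlog.count(xid) != 0
                                            ? Fate::COMMIT : Fate::ROLLBACK);
        }
        return decisions;
    }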

Obviously, if durability is weakened in the configuration ( innodb_flush_log_at_trx_commit != 1 or sync_binlog != 1 ), a crash can cause data loss in two scenarios:

    1. The engine layer committed, but the binlog was not written: the standby loses the transaction;
    2. The engine layer did not even prepare, but the binlog was written: the primary loses the transaction.

Even with innodb_flush_log_at_trx_commit = 1 and sync_binlog = 1, we still face one situation: when the primary crashes, some binlog may not yet have been delivered to the standby. If we promote the standby directly, the two will diverge, and the old primary will have to be rebuilt from the new primary to return to a consistent state. This scenario can be addressed by enabling semisync; one workable approach is described below:

    1. Use the double-1 durable configuration;
    2. Set the semisync timeout to a very large value and use the semisync AFTER_SYNC mode, in which the user thread waits for the standby's ACK after writing the binlog but before the engine layer commits;
    3. With this configuration, when the primary crashes, every transaction the old primary has beyond the standby is guaranteed to be in the prepared state;
    4. After the standby has fully applied its relay logs, record the relay log position it executed to, and then promote the standby to be the new primary;
    5. Truncate the last binlog of the old primary at the position recorded in step 4;
    6. Start the old primary: transactions already delivered to the standby are committed, and binlog events not delivered to the standby are rolled back.
