Database failures and their recovery strategies

Last Update:2018-07-26 Source: Internet

Author: User

Tags commit rollback


While a database is running, a variety of failures can occur. They fall into three categories: transaction failure, system failure, and media failure. Each type of failure calls for a different recovery strategy.
1. Transaction failure and its recovery
A transaction failure is a failure caused by a program terminating unexpectedly and abnormally.
Causes of abnormal termination include data errors, arithmetic overflow, storage protection violations, and deadlock among concurrent transactions.
When a transaction fails, the interrupted transaction may already have modified the database. To eliminate its effect, the system uses the information recorded in the log file to forcibly roll back (ROLLBACK) the transaction and restore the database to its state before the modifications.
To do this, the system checks the log file for the changes made by the unfinished transaction and cancels them.
This type of recovery operation is called transaction undo (UNDO). It proceeds as follows.
(1) Scan the log file backward to find the update operations of the transaction.
(2) Perform the inverse of each update operation: delete a record that was inserted, re-insert a record that was deleted, and replace the new value of a modified item with its old value. Continue scanning backward, handling each update operation of the transaction in the same way, until the transaction's start marker is reached. At that point the recovery of the failed transaction is complete.
A transaction is therefore both a unit of work and a unit of recovery. The shorter a transaction is, the easier it is to undo. If an application runs for a long time, it should be split into multiple transactions, each ended with an explicit COMMIT statement.
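The undo procedure described above can be sketched in Python. This is only a minimal illustration, not how any particular DBMS implements rollback: the LogRecord layout, the dict standing in for the database, and the undo_transaction name are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class LogRecord:
    """Hypothetical log record layout, assumed for this sketch."""
    txn: str                   # transaction identifier
    op: str                    # "begin", "insert", "delete", "update" or "commit"
    key: Optional[str] = None  # item the operation touched
    old: Any = None            # value before the change (unused for insert)
    new: Any = None            # value after the change (unused for delete)


def undo_transaction(db: dict, log: list, txn: str) -> None:
    """Roll back one transaction by scanning the log backward."""
    for rec in reversed(log):
        if rec.txn != txn:
            continue
        if rec.op == "begin":
            break                    # start marker reached: undo is complete
        if rec.op == "insert":
            db.pop(rec.key, None)    # inverse of insert: delete the record
        elif rec.op in ("delete", "update"):
            db[rec.key] = rec.old    # put the old value back


# The database held {"a": 1, "b": 2} before T1 started. T1 updated a,
# inserted c and deleted b, then failed before committing.
db = {"a": 10, "c": 3}
log = [
    LogRecord("T1", "begin"),
    LogRecord("T1", "update", "a", old=1, new=10),
    LogRecord("T1", "insert", "c", new=3),
    LogRecord("T1", "delete", "b", old=2),
]

undo_transaction(db, log, "T1")
print(db)
```

After the call, db holds the same contents it had before T1 started. A shorter transaction leaves fewer records between its start marker and the end of the log, which is why short transactions are cheaper to undo.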
2. System failure and its recovery

A system failure is an event that, for whatever reason, stops the system while it is running, so that all running transactions terminate abnormally and the system must be restarted. Possible causes include a hardware error (such as a CPU failure), an operating system or DBMS code error, and a sudden power outage.
When this happens, the contents of the in-memory database buffers are lost. The database on external storage is not physically damaged, but its contents are no longer reliable. A system failure can affect the database in two ways.
In the first case, some updates made by unfinished transactions have already been written to the database. After the system restarts, all such transactions must be forcibly undone (UNDO) so that their changes are removed from the database. These unfinished transactions have only a BEGIN TRANSACTION marker in the log file, with no COMMIT marker.
In the second case, some updates made by committed transactions are still in the buffer and have not been written to the physical database on disk. This also leaves the database in an inconsistent state, so the results of those transactions must be written to the database again. This type of recovery operation is called transaction redo (REDO). Such committed transactions have both a BEGIN TRANSACTION marker and a COMMIT marker in the log file.
Recovery from a system failure therefore has two parts: undoing all unfinished transactions and redoing all committed transactions. Only then is the database truly restored to a consistent state. The procedure is as follows.
(1) Scan the log file. Put the identifier of each transaction that has not yet committed into the undo queue, and put the identifier of each committed transaction into the redo queue.
(2) Undo each transaction in the undo queue, using the undo method described for transaction failure.
(3) Redo each transaction in the redo queue by scanning the log file forward and re-executing the logged operations, which restores the database to the most recent usable state.
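The three steps above can be sketched as a single function. This is a simplified illustration under stated assumptions: the log is a list of (txn, op, key, old, new) tuples, the database is a dict, and the recover_after_crash name is hypothetical. A real DBMS works on pages and on-disk logs rather than Python objects.

```python
def recover_after_crash(db, log):
    """Undo unfinished transactions, then redo committed ones.

    Each log record is (txn, op, key, old, new), where op is "begin",
    "write" or "commit". old=None means the key did not exist before the
    write; new=None means the write deleted the key.
    """
    # Step 1: scan the log and build the undo and redo queues.
    started, committed = [], set()
    for txn, op, key, old, new in log:
        if op == "begin":
            started.append(txn)
        elif op == "commit":
            committed.add(txn)
    undo_queue = [t for t in started if t not in committed]
    redo_queue = [t for t in started if t in committed]

    # Step 2: undo, scanning backward and restoring old values.
    for txn, op, key, old, new in reversed(log):
        if op == "write" and txn in undo_queue:
            if old is None:
                db.pop(key, None)
            else:
                db[key] = old

    # Step 3: redo, scanning forward and reapplying new values.
    for txn, op, key, old, new in log:
        if op == "write" and txn in redo_queue:
            if new is None:
                db.pop(key, None)
            else:
                db[key] = new

    return undo_queue, redo_queue


log = [
    ("T1", "begin", None, None, None),
    ("T1", "write", "x", 0, 5),
    ("T2", "begin", None, None, None),
    ("T2", "write", "y", 0, 7),
    ("T1", "commit", None, None, None),
    ("T2", "write", "z", 0, 3),
]
# Disk state at the crash: T1's committed write to x never reached disk,
# while T2's uncommitted write to y did.
disk = {"x": 0, "y": 7, "z": 0}

undo_queue, redo_queue = recover_after_crash(disk, log)
print(undo_queue, redo_queue, disk)
```

In this example T2 lands in the undo queue and T1 in the redo queue, so y is reset to its old value and x receives the committed value.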
After a system failure, there is no way to tell which unfinished transactions have already written updates to the database and which committed transactions have not yet had their updates written. After the restart, the system would therefore have to undo every unfinished transaction and redo every committed transaction.
However, of the transactions that finished before the failure, some ended normally and some ended abnormally, so not all of them need to be undone or redone.
The checkpoint (CHECKPOINT) method is usually used to determine whether a transaction ended normally. At regular intervals, say every 5 minutes, the system takes a checkpoint and does the following: (a) writes the contents remaining in the log buffer to the log file; (b) writes a checkpoint record to the log file; (c) writes the contents of the database buffer, that is, the updated data, to the physical database; (d) writes the address of the checkpoint record in the log file to the "restart file".
Each checkpoint record contains a list of all transactions active at the time of the checkpoint and the address of the most recent log record of each of those transactions.
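As a rough illustration of steps (a) to (d), the following toy model keeps the log file, database, and restart file as in-memory Python objects. The MiniDBMS class, its method names, and the use of list indexes as log addresses are all assumptions made for this sketch, not part of any real system.

```python
class MiniDBMS:
    """Toy model: lists and dicts stand in for disk files and buffers."""

    def __init__(self):
        self.log_buffer = []      # log records not yet written to the log file
        self.log_file = []        # stands in for the on-disk log
        self.db_buffer = {}       # updated values not yet on disk
        self.db_disk = {}         # stands in for the physical database
        self.active = {}          # txn -> address of its most recent log record
        self.restart_file = None  # address of the latest checkpoint record

    def _log(self, txn, op, key=None, old=None, new=None):
        # The address is the index the record will have in the log file.
        addr = len(self.log_file) + len(self.log_buffer)
        self.log_buffer.append((txn, op, key, old, new))
        if op == "commit":
            self.active.pop(txn, None)  # a committed transaction is no longer active
        else:
            self.active[txn] = addr

    def begin(self, txn):
        self._log(txn, "begin")

    def write(self, txn, key, value):
        old = self.db_buffer.get(key, self.db_disk.get(key))
        self._log(txn, "write", key, old, value)
        self.db_buffer[key] = value

    def commit(self, txn):
        self._log(txn, "commit")

    def checkpoint(self):
        # (a) write the log buffer to the log file
        self.log_file.extend(self.log_buffer)
        self.log_buffer.clear()
        # (b) write a checkpoint record listing the active transactions
        addr = len(self.log_file)
        self.log_file.append(("checkpoint", dict(self.active)))
        # (c) write the database buffer to the physical database
        self.db_disk.update(self.db_buffer)
        self.db_buffer.clear()
        # (d) record the checkpoint record's address in the restart file
        self.restart_file = addr


dbms = MiniDBMS()
dbms.begin("T1")
dbms.write("T1", "x", 1)
dbms.begin("T2")
dbms.commit("T1")
dbms.write("T2", "y", 2)
dbms.checkpoint()
print(dbms.restart_file, dbms.log_file[dbms.restart_file])
```

Here only T2 is still active when the checkpoint is taken, so the checkpoint record lists T2 together with the address of its latest log record.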
On restart, the recovery manager reads the address of the checkpoint record from the restart file, locates the checkpoint record in the log file, and then goes through the log to determine which transactions must be undone (returned to their initial state) and which must be redone. Using checkpoint information in this way makes recovery timely, efficient, and correct.
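One common way to make this decision, shown here as an assumption rather than as the method the article has in mind, is to start from the checkpoint's active-transaction list and scan the log forward from the checkpoint record. The record layout and the classify_after_checkpoint name are hypothetical.

```python
def classify_after_checkpoint(log, restart_addr):
    """Split transactions into an undo list and a redo list.

    log[restart_addr] is assumed to be ("checkpoint", [active txns]).
    All other records are assumed to be (txn, op) pairs.
    """
    _, active = log[restart_addr]
    undo_list = list(active)  # provisionally, every active transaction is undone
    redo_list = []
    for txn, op in log[restart_addr + 1:]:
        if op == "begin":
            undo_list.append(txn)          # started after the checkpoint
        elif op == "commit" and txn in undo_list:
            undo_list.remove(txn)          # committed before the crash: redo it
            redo_list.append(txn)
    return undo_list, redo_list


log = [
    ("T1", "begin"),
    ("T1", "commit"),                 # finished before the checkpoint
    ("T2", "begin"),
    ("T3", "begin"),
    ("checkpoint", ["T2", "T3"]),     # address 4, stored in the restart file
    ("T2", "commit"),
    ("T4", "begin"),
    ("T5", "begin"),
    ("T4", "commit"),
]

undo_list, redo_list = classify_after_checkpoint(log, 4)
print(undo_list, redo_list)
```

T1 committed before the checkpoint, so its updates were already written to the database at checkpoint time and it appears in neither list. That is the saving a checkpoint provides.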
3. Media failure and its recovery

A media failure is the loss of some or all of the data on external storage because the secondary storage medium is damaged while the system is running.
Such failures are less likely than transaction and system failures, but they are the most severe and can be devastating: both the physical data and the log files on disk may be destroyed. Recovery requires loading the most recent backup copy of the database made before the media failure and then using the log file to redo all transactions completed after that copy was made.
The procedure is as follows.
(1) Load the most recent backup copy of the database, which restores the database to its usable state at the time of the most recent dump.
(2) Load the most recent copy of the log file and redo the completed transactions based on its contents. First scan the log file to identify the transactions that had committed by the time of the failure and record them in the redo queue. Then scan the log file forward and redo each transaction in the redo queue, re-performing its logged operations by writing the "updated value" in each log record to the database.
This restores the database to a consistent state as of some point before the failure.
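The two steps above can be sketched as follows. The sketch assumes the log copy contains only records written after the dump, uses a dict as the database, and uses (txn, op, key, new) tuples as log records. The function name is hypothetical.

```python
import copy


def recover_from_media_failure(backup, log_copy):
    """Rebuild the database from a backup copy and a copy of the log."""
    # Step 1: load the most recent backup copy.
    db = copy.deepcopy(backup)

    # Step 2a: first scan, collect the committed transactions (redo queue).
    redo_queue = {txn for txn, op, key, new in log_copy if op == "commit"}

    # Step 2b: second scan, forward, writing each updated value to the database.
    for txn, op, key, new in log_copy:
        if op == "write" and txn in redo_queue:
            db[key] = new
    return db


backup = {"x": 1, "y": 1}
log_copy = [
    ("T1", "begin", None, None),
    ("T1", "write", "x", 2),
    ("T2", "begin", None, None),
    ("T2", "write", "y", 9),
    ("T1", "commit", None, None),  # T2 had not committed when the media failed
]

restored = recover_from_media_failure(backup, log_copy)
print(restored)
```

Only T1's update is reapplied. T2 never committed, and because the backup copy predates its write, there is nothing of T2 to undo.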

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
