SQLServer database Damage Detection

Last Update:2018-06-09 Source: Internet

Author: User

Tags server error log

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

In an ideal world, there will be no database corruption, just as we will not include some serious accidents in our daily lives. Once such a thing happens, it will definitely have a very significant impact on our lives. This is also true in SQL Server. Maybe you haven't met this kind of situation in the database in a few years. Once you have met this kind of situation, data loss, downtime, and serious impacts may even affect your career. Therefore, for such cases, we need to know about database damages so that we can prepare and handle them afterwards. This article will discuss the causes and phenomena of database damage, the methods for handling database damage beforehand and afterwards, and the simple methods for restoring database damage.

Why is the database damaged?

Before learning about database corruption, we should first understand how SQL Server saves data to data files (MDF, NDF, etc ). Whether updating or inserting data, data must first reside in the Buffer Pool in the memory, and then the data in the memory is persisted to the disk through the CheckPoint and Lazy Writer processes. In this process, dirty pages of data are written into persistent IO subsystems by memory. During this period, data may pass through these layers according to the IO subsystems:

Windows (WINDOWS APIs must be called for Data Writing)

Underlying Windows middle layer (anti-virus software, disk encryption system)

Nic, vro, vswitch, optical brazing, and network cable (if the I/O subsystem is not directly connected)

SAN controller (if SAN is used)

RAID Controller (I/O Sub-system RAID)

Persistent storage such as disk or SSD

Therefore, data pages may pass through several items in the above list when they are written to persistent storage. In the above process, the hardware environment will be affected by many factors, such as stability of the voltage, power failure, high or low temperature, humidity, etc. In terms of software, because the software is written by people, there may be bugs, which may lead to errors during data page transmission.

In addition, factors that affect the disk include voltage stability, dust, and other factors, which may also cause bad track or overall damage to the disk.

All the factors mentioned above can be attributed to the I/O subsystem. Therefore, the vast majority of data corruption is caused by the I/O subsystem, and a very small probability that the memory chip will also cause data page damage, but this part of the situation is minimal, therefore, it is not discussed in this article.

The above-mentioned causes of data damage are all natural disasters and man-made disasters. For example, when you manually edit data files, such as editors, and there are transactions in the database that require Redo and Undo (that is, there is no Clean Shutdown), the log files are deleted (which usually leads to database challenges ).

Database corruption discovered

After we know the cause of database damage, let's take a look at how SQL Server monitors database page damage.

At the database level of SQL Server, you can set the page protection type. There are three options: None, CheckSum, Torn_Page_Detection, as shown in 1:

. Three options for page Protection

First, ignore None. Do not select this option in any scenario. This option means that SQL Server does not protect the page.

Second, TORN_PAGE_DETECTION. in SQL Server, the minimum unit of data is page, each page is 8 K, but the disk usually has 16 512-byte sectors, if only half of a page is written to persistent storage, TORN_PAGE_DETECTION is called. SQL Server uses the first two bytes of each slice as the metadata, A total of 16 sectors of 32-bit 4-byte metadata (marked as m_tornBits in the page header). This metadata is used to detect the existence of partially written TORN_PAGE, however, this type of page verification cannot detect write errors on the page. Therefore, in SQL Server 2005 and later versions, select CheckSum as much as possible.

In SQL Server 2005 and later versions, CheckSum is introduced. CheckSum can be interpreted as a CheckSum. When data pages are written to persistent storage, based on the page value, a 4-byte CheckSum is calculated and stored in the page header (marked as m_tornBits in the header). The CheckSum and data are saved in the database on the same page. When data is read from the IO subsystem to the memory, SQL Server calculates the CheckSum again based on the value in the page and compares it with the CheckSum stored in the page header, if the comparison fails, SQL Server considers the page damaged.

The CheckSum process shows that the CheckSum is calculated only when the page is written to SQL Server. Therefore, if you only change the database option, the metadata in the page header does not change.

IO-related errors

According to the principle of CheckSum, SQL Server can detect page corruption. At this time, the specific format may be one of the following three errors:

The 823 error, also known as a hard IO error, can be understood as the SQL Server wants to read the page, while Windows tells SQL Server that the page cannot be read.

824 error, also known as soft IO error. It can be understood that SQL Server has read this page, but does not find any matching by calculating the CheckSum equivalent, therefore, SQL Server deems the page damaged.

825 error, also known as Retry error.

Among them, the 823 and 824 errors are both critical errors with the error level of 24, so they are recorded in the Windows Application Log and SQL Server Error Log, the page that causes this error will be recorded in msdb. dbo. suspect_pages. The error page number is also recorded in the SQL Server Error Log, as shown in 2.

. 824 description of errors in SQL Server Error records

Therefore, if we have a complete backup, we can use the backup to restore the page (here, I would like to emphasize that for DBA, there is no "backup" problem ), A simple page restore code is shown in code list 1.

USE [master]

Restore database [Corrupt_DB] PAGE = '1: 155'

From disk = n' C: xxx. Bak'

With file = 1, NORECOVERY, NOUNLOAD, STATS = 5

Code List 1. A simple page restore code that restores page 155th in file ID1 from the backup

Remember what we mentioned earlier, when reading page calculation verification and error, this may be an error in the pages written to persistent storage, or an error in the page reading process, at this time, SQL Server will try to read this page again from the IO subsystem, which may be up to four attempts. If the verification passes during the four attempts, the error 825 will occur, otherwise, the error is 824. Note that, unlike errors 823 and 824, A 825 error is a 10-level message.

Therefore, due to a fixed error number, you can set alarms for 823 and 824 on the SQL Server Agent.

Backup CheckSum

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More