Original: http://www.talkwithtrend.com/Article/216335 Preface
The most serious failure of a database is database corruption. Database Bad page is a kind of database corruption, if there is data page corruption in the database, when the bad page is not accessed, the database can provide the service normally, when using the table with the bad page, it can cause the database to crash.
There are many possibilities for a bad page in the database, such as sudden power-down, host down, or storage, disk failure, and so on.
When the database appears bad page, first use the operating system command to determine whether a hardware failure, repair hardware failure, you can try to use the "DB2 Restart DB dbname" command to get the database to perform crash recovery, this method may restore the database is normal. If not, the best way to do this is to restore the database from backup.
If it is not possible to recover from a backup, the database can be rebuilt using the Db2look export table structure and export exported data in case the database can be connected. Of course, for corrupted tables, you can only use the Db2dart tool to export data from these tables offline, and the database on the bad page is most likely to be lost.
If the database cannot be connected, you can only export data for the entire database using Db2dart offline. After exporting the data, the rest is rebuilding the library and importing the data.
This paper records the damaged database page header in a production environment, and after the twists repair process, the majority of the data is restored and the database is restored to normal operation.
Environment Introduction
Operating system version AIX 6100-07-04-1216
Database version DB2 v9.7.0.7
The database is relatively large, about 2.2T, the number of bad page table data is also relatively large, about 30G.
First, the problem found
Business colleagues query XXX database for a report query, the following error occurred:
Second, preliminary analysis
Check the explanation for this error:
DB2 "? SQL1655C "
To view the NFY log for a database:
More Db2npsh.nfy
2017-08-07-14.14.44.308129 Instance:db2npsh node:000
pid:12517626 (db2agent (NPSH) 0) tid:25497 appid:*local.db2npsh.170807055005
Buffer Pool Services Sqlbreadpage probe:1199 DATABASE:NPSH
ADM6006E DB2 encountered an error while reading page ' 114482656 ' from table
Space "3" for the Object "1859" (located at offset "114482688" of Container "/dev/rlvts_npsh_d8").
2017-08-07-14.14.55.313501+480 e1900111a837 level:critical
pid:12517626 tid:11548 PROC:DB2SYSC 0
Instance:db2npsh node:000
eduid:11548 EDUNAME:DB2PFCHR (NPSH) 0
FUNCTION:DB2 UDB, buffer pool services, sqlblogreadattemptfailure, probe:10
Message:adm14001c an unexpected and critical error has occurred: "Badpage".
The instance may have been shutdown as a result. "Automatic" FODC
(first occurrence Data Capture) has been invoked and diagnostic
Information have been recorded in directory
"/db2/dump/npsh/fodc_badpage_2017-08-07-10.09.10.302693_0000/".
Detailed evidence about
happened and contact IBM support if necessary to diagnose the
Problem.
At the same time, also found in the dump directory generated fodc_badpage files.
Third, further analysis
Try to locate the table with the bad page:
According to the information in the Db2npsh.nfy Log, table space "3" for Object "1859", the query is as follows:
DB2 "Select Tabname,status from Syscat.datapartitions where Tbspaceid =3 and tableid=1859"
There is no output, it is suspected that a partition of the partition table is broken, the query is as follows:
DB2 "Select Tabname,datapartitionname from Syscat.datapartitions where tbspaceid=3 and partitionobjectid=1859"
The output is as follows:
NXXXXTRFH PART201707
It can be judged that there is a bad page in this partition.
Query the status of the table, Normal:
DB2 "Select Tabname,status,tbspaceid,tableid from Syscat.tables where tabname= ' NXXXXTRFH '"
Query data from this table, normal:
DB2 "SELECT * from DBNPSH. NXXXXTRFH fetch first Rows only "
So the bad page of the database is more complex, sometimes the bad page causes the whole table to fail, but this time, the surface looks up, the table data all normal. A failure is reported when the running report accesses a specific page.
Try to export the data from the table partition of the bad page through the load from cursor method, and report the same error to the query statement:
Try exporting data through export:
DB2 "Export TO/DB2/ARCHLOG/NPSH/NXXXXTRFH_PART201707.IXF of IXF select * from DB2NPSH. nxxxxtrfh_part201707 "
Export hang, after two hours, export more than 4 million data, the same error occurred, and the data volume is obviously wrong
Collect all the log information and send it to IBM800 to assist in analyzing the problem.
Db2support. -D Npsh-c-S
After analysis, IBM second-line confirmed that the database bad page, we recommend the use of Db2dart export table data:
Iv. Data Repair Steps
Here is the process of retrieving data:
Unloading table partitions
Because the table is 30G in size, attempt to unload the table partition and export it separately:
DB2 ALTER TABLE DBNPSH. NXXXXTRFH Detach partition PART201707 into nxxxxtrfh_part201707
Uninstall succeeded
Db2dart preparatory work
Before executing Db2dart, you need to prepare two points:
1. Disconnect all data connections
DB2 Force Applicaitons All
2, the expansion of the dump directory to be able to drop the entire partition data
Db2dart Exporting data
Db2dart Exporting data
Start execution of Db2dart export data:
Db2dart Npsh/ddel
Then type: 1859,3,0,999999
(The meanings of four parameters are: Object ID, tablespace ID, start page number, export total pages)
IBM recommends exporting total pages set to 999999 pages
Db2dart operation 2 hours, found that the number of exported pages has reached 999998, suspect that the number of exported pages exceeds the set 999999 limit. It seems that IBM's recommendations are not entirely accurate.
Change the number of pages to 99999999 and re-execute the Db2dart export (you will need to delete the previously exported Dart directory first):
Db2dart Npsh/ddel
Then type: 1859,3,0,99999999
After three hours, the export was successful, with a total of more than 1.1 million pages exported.
Export file name default Ts3t1859.del.
Put dart-out data into a temp table
New temporary table nxxxxtrfh_part20170702, note that the specified tablespace is consistent with the NXXXXTRFH master table.
Import Dart-out data into a staging table:
Db2-v "LOAD from ts3t1859. Del of Del INSERT into Db2npsh. nxxxxtrfh_part20170702 nonrecoverable "
Import "field type Mismatch" error, after troubleshooting, because the table contains Dbclob, this field dart command cannot be exported. Query the original table, found that the field is empty, glad not to lose large pieces of data, but because the data file and table structure does not match, imported into a problem.
There are two options for solving this problem:
First, edit the data file, navigate to the column, insert a comma, insert a column of empty columns.
Second, create a new table that does not contain large columns, and then add more fields after importing the data.
New nxxxxtrfh_part20170703 does not contain DBCLOB columns and imports data:
Db2-v "LOAD from ts3t1859. Del of Del INSERT into Db2npsh. nxxxxtrfh_part20170703 nonrecoverable "
Number of rows read = 23771029
Number of rows skipped = 0
Number of rows loaded = 23770943
Number of rows rejected = 86
Number of rows deleted = 0
Number of rows committed = 23771029
Adding columns
DB2 "ALTER TABLE DB2NPSH. nxxxxtrfh_part20170703 Add Column URL Dbclob (2048) logged not COMPACT "
Successful execution. To mount a temporary table on the primary table
To mount a temporary table on the main table:
DB2 "ALTER TABLE DBNPSH. NXXXXTRFH Attach partition PART201707 starting (' 20170701 ') ending (' 20170731 ') from DB2NPSH. nxxxxtrfh_part20170703 "
Mount failed:
sql20408n Table "Db2npsh. nxxxxtrfh_part20170703 "cannot be attached to
Table "Dbnpsh. NXXXXTRFH "because column" Payeenm "of the source table and
Its associated column "URL" of the target table does not match. Reason code = "8". Sqlstate=428ge
analyzed because the column order of the staging table is inconsistent with the primary table.
Workaround:
Insert the data for the 03 temporary table into the 02 temporary table by specifying the columns:
db2-v insert INTO DB2NPSH. nxxxxtrfh_part20170702 SELECT Msgid,instgbkid,msgcd,seqnb,instgdrctpty,biztyp,bizkind,syscd,regtime,tranchnltyp, CREDTTM,INSTDDRCTPTY,INSTDPTY,RMK, Payernm,payerdpstbknm,dbtrdpshdl,payertele,payeracctno,payeracctyp,pwd, Pymtagrmt,authpymtbizkind,ptcid,url,payeenm,cdtrdpshdl,payeedpstbkid,payeedpstbknm,payeeacctno,acctltd, AGRINEFFCTVDT,AGREFFCTVDT,SNGLTXAMTLMT,ORIGMSGID,CSTMRNB,CDTRCSTMRID,DLTTLCNT,DAYAMTUPPERLMT,CURCD,MSLTTLCNT, Mthamtupperlmt,magetyp,acqfee,clrdate,bizprcscd,bizrjctcd,bizretsts,rjctresn,bizsts,ctrlind,bizbigtyp,addinfo1 , Addinfo2,addinfo3,addinfo4,addtbnm,signature,tradeflag,rpliracct,rplirnm,rplirdpsnm,rplirdpshdl,rpliraccttp, QRISTACCT,QRISTNM,QRISTPTYNM,QRISTPTYHDL from DB2NPSH. nxxxxtrfh_part20170703 "
Mount the 02 temp table to the main table:
DB2 "ALTER TABLE DBNPSH. NXXXXTRFH Attach partition PART201707 starting (' 20170701 ') ending (' 20170731 ') from DB2NPSH. nxxxxtrfh_part20170702 "
Mount succeeded
SQL3601W The statement caused one or more tables to automatically be placed
In the Set Integrity Pending state. sqlstate=01586
Consistency check
To perform a consistency check:
DB2 set integrity for DBNPSH. NXXXXTRFH Immediate Checked
Execute about 10 minutes, success.
db20000i the SQL command completed successfully.
Data Volume Confirmation
Query the amount of data for the 201707 partition:
DB2 "SELECT COUNT (*) from DBNPSH. NXXXXTRFH where clrdate>= ' 20170701 ' and clrdate<= ' 20170731 ' "
Total number of data imports: 23770943
Number of original data bars:
DB2 "SELECT COUNT (*) from DB2NPSH. nxxxxtrfh_part201707 "
Total number of raw data: 23771388
Number of corrupted data bars: 23771388-23770943 = 445 strips
At this point, the data recovery is complete, retrieve 23,770,943 data, damage 445.
Go DB2 database A production failure detail record---database Bad page