Recovery process for system crashes caused by power off of Oracle RAC database

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Here a simple note, the National Day overtime recovery of a customer's 2 sets of Oracle RAC database, the entire recovery process, 2 sets of RAC similar, so here is a set of database recovery process as an example of simple analysis. Because the database is not archived, the system will not open properly due to power off and reboot.

In the normal open process, the error is as follows:

sql> ALTER DATABASE open;
ALTER DATABASE Open
*
ERROR at line 1:
Ora-00600:internal error code, arguments: [KCRATR1_LASTBWR], [], [], [], [],
[], [], []

sql> shutdown Immediate
Ora-01109:database not open

Database dismounted.
ORACLE instance shut down.
Sql> Startup
ORACLE instance started.

Total System Global area 4294967296 bytes
Fixed Size 2089576 bytes
Variable Size 2751466904 bytes
Database buffers 1526726656 bytes
Redo buffers 14684160 bytes
Database mounted.
Ora-00600:internal error code, arguments: [KCRATR1_LASTBWR], [], [], [], [],
[], [], []
For this error, the online solution is also a lot of, unfortunately, not useful. In this case, it is often necessary to force open the database, first need to do an incomplete recovery, as follows:

sql> Recover Database
Ora-00279:change 236912204 generated at 09/29/2015 12:49:13 for thread
1
Ora-00289:suggestion:/xxxx/1_5112_877094801.dbf
Ora-00280:change 236912204 for thread 1 was in sequence #5112

Specify log: {<ret>=suggested | AUTO | CANCEL}
Auto
Ora-00308:cannot Open archived log '/xxxx/1_5112_877094801.dbf '
Ora-27037:unable to obtain file status
IBM AIX RISC system/6000 error:2: No such file or directory
Additional Information:3
After the relevant operation, I backed up the current control file information to facilitate the following if there is a problem, easy to handle. In the process of forcing open, we found the following error:

Sat Oct 3 11:49:31 2015
Instance recovery:looking for dead threads
Instance recovery:lock domain Invalid but no dead threads
Sat Oct 3 11:49:33 2015
Errors in FILE/ORACLE/ORACLE/ADMIN/CWDB/UDUMP/CWDB1_ORA_6029586.TRC:
Ora-00600:internal error code, arguments: [Kclchkblk_4], [1], [18446744072394632417], [1], [18446744072392296306], [], [ ], []
Sat Oct 3 11:49:34 2015
Errors in FILE/ORACLE/ORACLE/ADMIN/XXXX/UDUMP/XXXX1_ORA_6029586.TRC:
Ora-00600:internal error code, arguments: [Kclchkblk_4], [1], [18446744072394632417], [1], [18446744072392296306], [], [ ], []
Sat Oct 3 11:49:34 2015
Error happened during DB Open, shutting down database
User:terminating instance due to error 600
Instance terminated by USER, PID = 6029586
This mistake has been handled many times. Similarly, Baidu, will find that a lot of people have written related articles, including Oracle MOS article explanation is also said that this is a temporary block of SCN too large, through the drop Tempfile can bypass the problem. In fact, in this case, it will not work at all.
But no matter what, the problem is obviously related to the SCN of the block. Since it is related to the SCN, then processing is not difficult, by pushing the SCN can be.
After pushing the SCN, open Resetlogs successfully opened the database, but alert log reported a bunch of errors, as follows:

Sat Oct 3 13:10:34 2015
Errors in FILE/ORACLE/ORACLE/ADMIN/XXXX/BDUMP/XXXX1_SMON_10420246.TRC:
Ora-00600:internal error code, arguments: [4137], [], [], [], [], [], [], []
Opening with internal Resource Manager plan
where NUMA PG = 1, CPUs = 40
Sat Oct 3 13:10:35 2015
ORACLE Instance xxxx1 (PID = +)-Error encountered while recovering transaction (23, 85).
Sat Oct 3 13:10:35 2015
Errors in FILE/ORACLE/ORACLE/ADMIN/XXXX/BDUMP/XXXX1_SMON_10420246.TRC:
Ora-00600:internal error code, arguments: [4137], [], [], [], [], [], [], []
Sat Oct 3 13:10:35 2015
Trace dumping is performing id=[cdmp_20151003131035]
Sat Oct 3 13:10:35 2015
Replication_dependency_tracking turned off (no async multimaster replication found)
Sat Oct 3 13:10:36 2015
Instance recovery:looking for dead threads
Instance recovery:lock domain Invalid but no dead threads
Sat Oct 3 13:10:36 2015
Errors in FILE/ORACLE/ORACLE/ADMIN/XXXX/BDUMP/XXXX1_SMON_10420246.TRC:
Ora-00600:internal error code, arguments: [4137], [], [], [], [], [], [], []
Sat Oct 3 13:10:37 2015
Starting background Process QMNC
Sat Oct 3 13:10:37 2015
ORACLE Instance xxxx1 (PID = +)-Error encountered while recovering transaction (23, 85).
Sat Oct 3 13:10:37 2015
Errors in FILE/ORACLE/ORACLE/ADMIN/XXXX/BDUMP/XXXX1_SMON_10420246.TRC:
Ora-00600:internal error code, arguments: [4137], [], [], [], [], [], [], []
QMNC started with pid=53, OS id=7536816
Sat Oct 3 13:10:41 2015
Logstdby:validating Controlfile with logical metadata
Sat Oct 3 13:10:41 2015
Logstdby:validation Complete
Sat Oct 3 13:10:46 2015
Errors in FILE/ORACLE/ORACLE/ADMIN/XXXX/BDUMP/XXXX1_MMON_9110004.TRC:
Ora-00600:internal error code, arguments: [Qertbfetchbyrowid], [], [], [], [], [], [], []
Sat Oct 3 13:10:48 2015
Errors in FILE/ORACLE/ORACLE/ADMIN/XXXX/UDUMP/XXXX1_ORA_6619434.TRC:
Ora-00600:internal error code, arguments: [KDSGRP1], [], [], [], [], [], [], []
Completed:alter Database Open Resetlogs
This part of the error handling is actually not difficult. For the first ora-00600 [4137] Error, obviously this is related to undo, in which the 23 performance of the number 23rd rollback segment (23,85), by shielding the 23rd number of rollback segment can easily resolve the error, of course, now cause the inconsistency of the transaction, this is no way, With the undo exception, Oracle has no way to perform a normal transaction recovery.
Second, for the 2nd ora-00600 [Qertbfetchbyrowid] Error, the processing is also very simple, the general meaning is through the ROWID access to get the data there is an exception, obviously this is related to index, by rebuilding the index can solve the problem, followed by the last one [ KDSGRP1] Error is more common, and is often the problem of index, can be rebuilt.
It seems that all the recovery process is simple, very smooth, but the real problem here, the real problems began.
This is the last seemingly simple error ora-00600 [KDSGRP1] error, which has great difficulties for us. First, let's look at the objects that were involved when the error occurred:

Validate Domain 0
Validated domain 0, flags = 0x0
Kwqmnich:current Time:: 13:31:34
Kwqmnich:instance no 0 check_only flag 1
Kwqmnich:initialized Job Cache structure
Row 0041edda.2e Continuation at
file# 1 block# 126426 slot found
**************************************************
Kdstabn_get:0 ... ntab:1
Curslot:47 ... nrows:175
**************************************************
2015-10-03 13:31:40.864
Ksedmp:internal or fatal error
Ora-00600:internal error code, arguments: [KDSGRP1], [], [], [], [], [], [], []
Current SQL statement for this session:
SELECT Owner,name,type,count (*) from Dba_source WHERE SUBSTR (owner,1,4) = ' FMIS ' GROUP by Owner,name,type having COUNT (*) & gt;1000
-----Pl/sql Call Stack-----

Object ID on block? Y
seg/obj:0x12 csc:0x01.b1957474 itc:3 flg:o typ:1-DATA
fsl:0 fnx:0x0 ver:0x01

Itl Xid Uba Flag Lck SCN/FSC
0x01 0x0009.00e.0003968d 0x0c07f50f.1ea7.5d--u-1 FSC 0x0053.b1957475
0x02 0x0017.005.0006755c 0x0c0774de.1be3.38 C---0 SCN 0x0001.b1957451
0x03 0x0002.042.00010d39 0x0080b268.0cf1.22 C---0 SCN 0x0001.b1957429

Data_block_dump,data header at 0x110f46074
===============
Tsiz:0x3f88
hsiz:0x170
pbl:0x110f46074
Bdba:0x0041edda
76543210
flag=--------
Ntab=1
nrow=175
Frre=0
fsbo=0x170
fseo=0x783
Avsp=0x3cbe
Tosp=0x3d13
0xe:pti[0] nrow=175 offs=0
0x12:pri[0] Sfll=1
0X14:PRI[1] sfll=2

We can see that, apart from other Non-core objects, there is also a obj#=18, that is, the core Data dictionary table obj$. And the index on the Data dictionary table,
I_obj1,i_obj2,i_obj3 is the core object of object_id less than 57, which is the core data dictionary object belonging to bootstrap$. That is, index cannot be rebuilt either through the
rebuild,38003 event or in upgrade mode.
Of course, this is not to say that there is absolutely no way to recreate the data dictionary, and I have an article later that will believe how to reconstruct it.
During the analysis, I found that the previous 2 index was problematic, as follows:

sql> Analyze table obj$ validate structure;

Table analyzed.

Sql> Select Index_ Name from dba_indexes where table_name= ' obj$ ';

index_name
------------------------------
i_obj1
i_obj2
i_obj3

sql> Analyze index i_obj1 validate structure;
Analyze index I_OBJ1 validate structure
*
ERROR at line 1:
Ora-08100:index are not valid-see trace file For diagnostics
&NBSP
sql> Analyze index I_OBJ2 validate structure;
Analyze index I_obj2 Validate Re
*
ERROR at line 1:
Ora-08100:index isn't valid-see trace file for diagnostics

Sql> A Alyze index I_OBJ3 validate structure;

Index analyzed.

But here we also have to pay attention to, although the Front 2 index all have problems, but the above mentioned error when the index is not 2 are used, in fact, just use the first index on
The error ora-00600 wrong. Because the customer wants to export the data through the EXPDP schema, it finds that the execution times is wrong ora-00600 [KDSGRP1], including the exp execution
Error, but exp tables do not complain; because there are too many objects, nearly 500,000 objects (including tables, index, and others). Obviously, only through user-level exports.
Then it means we have to fix this mistake.
With the dump-related block, we found the error to be very strange, as follows:

Session ID: (1052.1243) 2015-10-03 17:56:59.784
Block Checking:dba = 4684328, block Type = ktb-managed data block
Previous block DBA Dba:477a2adoes not match me previous block DBA Dba:477a26
Row 0:key out of order
----End Index Block validation
----for block 4684328 = 0x00477a28
Block Header Dump:0x00477a28
Object ID on block? Y
seg/obj:0x24 csc:0x02.80026908 itc:19 flg:o typ:2-INDEX
fsl:0 Fnx:0x477a1c ver:0x01

Itl Xid Uba Flag Lck SCN/FSC
0x01 0x0015.01f.00020a0a 0x00867c38.0862.04 C---0 SCN 0x0001.83f32656
0x02 0x002e.044.00002415 0x0082989a.0477.52 C---0 SCN 0x0001.74d90109
0x03 0x002c.00a.000012c5 0x0b804b78.012b.70 C---0 SCN 0x0001.74d9011c
0x04 0x0002.001.00003a72 0x00861aab.024b.0a C---0 SCN 0x0001.74fa8cf1
0x05 0x002f.02e.000065eb 0x0b816866.0217.5b C---0 SCN 0x0001.74fb0205
0x06 0x0026.011.00002000 0x0b814f02.016d.14 C---0 SCN 0x0001.74fc3dbf
0x07 0x0006.011.00003b1c 0x0084c65b.01f9.10 C---0 SCN 0x0001.74fde478
0x08 0x0029.05f.00003258 0x0084faa6.017c.2f C---0 SCN 0x0001.74fdedad
0x09 0x001b.023.00003fe6 0x008a8718.025b.3b C---0 SCN 0x0001.74fdfedc
0x0a 0x001b.006.00004057 0x0b804f24.025c.33 C---0 SCN 0x0001.751a25ed
0x0b 0x001b.02a.00004039 0x0b804f24.025c.49 C---0 SCN 0x0001.751a2609
0x0c 0x001b.024.0000404f 0x0b804f24.025c.63 C---0 SCN 0x0001.751a2627
0x0d 0x000a.018.00021b0b 0x0b821c8b.04cc.45 C---0 SCN 0x0001.751a263f
0x0e 0x000a.055.00021b0a 0x0b821c8b.04cc.5e C---0 SCN 0x0001.751a265c
0x0f 0X000A.031.00021AEC 0x0b821c8c.04cc.09 C---0 SCN 0x0001.751a2678
0x10 0x0022.05e.00001dbd 0x008a88df.01e0.2e C---0 SCN 0x0001.74d8ff09
0x11 0x0025.010.00001ead 0x00860446.01ac.20 C---0 SCN 0x0001.74d8ff7d
0x12 0x002b.043.00001228 0x0084dc65.0164.11 C---0 SCN 0x0001.74d8ffe2
0x13 0x001f.015.00001de8 0x0b81a76a.0183.4e C---0 SCN 0x0001.74d90071

Leaf Block Dump
===============
Header Address 504403185228956148=0x70000064725c1f4
Kdxcolev 0
Kdxcolev Flags =--
Kdxcolok 0
KDXCOOPC 0x80:opcode=0:iot flags=---is converted=y
Kdxconco 1
KDXCOSDC 2
Kdxconro 0
KDXCOFBO 36=0x24
Kdxcofeo 15816=0x3dc8
Kdxcoavs 15780
KDXLESPL 0
Kdxlende 0
KDXLENXT 4684330=0X477A2A
KDXLEPRV 4684326=0x477a26
KDXLEDSZ 6
Kdxlebksz 15816
Dummy Key * * *
ROW#0[9549] Flag:---D--, lock:0, len=14, Data: (6): 7c
Col 0; Len 5; (5): C4 2f
-----end of the leaf block dump-----
Dumping parent of corrupted subtree, row # = 1202
Dumping parent of corrupted subtree, row # = 3

For this error of the index, I met with the old bear for the first time, and he thought it might be the index split. I used bbed when I was a remote.
After analyzing the nearly 10 index blocks, we found that they did not match by comparing the index list.
For this case, want to go through bbed to fix index, difficult to imagine, so decisively give up this way. Finally helpless, only by processing the data dictionary table
Way to dispose of these 2 index I_OBJ1,I_OBJ2. Finally let the customer export exp User level, but this export time is relatively long

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Recovery process for system crashes caused by power off of Oracle RAC database

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

Recovery process for system crashes caused by power off of Oracle RAC database

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support