Recovery process of system crash caused by power loss of oracle Database

Source: Internet
Author: User
Tags current time dba error code error handling session id rollback valid oracle database

Here is a simple record of the two Oracle RAC databases of a customer who worked overtime for the National Day. During the recovery process, the two rac databases were similar, therefore, we will take the restoration process of a set of databases as an example for a brief analysis. Because the database is not archived, the system cannot open normally after restart due to power failure.
During normal open, the following error is reported:

SQL> alter database open;
Alter database open
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [kcratr1_lastbwr], [], [], [], [], [],
[], [], []
 
SQL> shutdown immediate
ORA-01109: database not open
 
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.
 
Total System Global Area 4294967296 bytes
Fixed Size 2089576 bytes
Variable Size 2751466904 bytes
Database Buffers 1526726656 bytes
Redo Buffers 14684160 bytes
Database mounted.
ORA-00600: internal error code, arguments: [kcratr1_lastbwr], [], [], [], [], [],
[], [], []

There are also many solutions for this error on the Internet, but it is a pity that they will not work. In this case, it is often necessary to forcibly open the database. The first step is to do an incomplete recovery, as shown below:

SQL> recover database
ORA-00279: change 236912204 generated at 12:49:13 needed for thread
1
ORA-00289: suggestion:/xxxx/rj00002_877094801.dbf
ORA-00280: change 236912204 for thread 1 is in sequence #5112
 
Specify log: {<RET> = suggested | filename | AUTO | CANCEL}
Auto
ORA-00308: cannot open archived log '/xxxx/commandid 2_877094801.dbf'
ORA-27037: unable to obtain file status
Ibm aix risc System/6000 Error: 2: No such file or directory
Additional information: 3

After the related operations, I backed up the current control file information to facilitate subsequent troubleshooting. The following error is reported during the forced open process:


Sat Oct 3 11:49:31 2015
Instance recovery: looking for dead threads
Instance recovery: lock domain invalid but no dead threads
Sat Oct 3 11:49:33 2015
Errors in file/oracle/admin/cwdb/udump/cwdb1_ora_6029586.trc:
ORA-00600: internal error code, arguments: [kclchkblk_4], [1], [18446744072394632417], [1], [18446744072392296306], [], [], []
Sat Oct 3 11:49:34 2015
Errors in file/oracle/admin/xxxx/udump/xxxx1_ora_6029586.trc:
ORA-00600: internal error code, arguments: [kclchkblk_4], [1], [18446744072394632417], [1], [18446744072392296306], [], [], []
Sat Oct 3 11:49:34 2015
Error 600 happened during db open, shutting down database
USER: terminating instance due to error 600.
Instance terminated by USER, pid = 6029586
This error has been handled multiple times. Similarly, Baidu may find that many people have written related articles, including Oracle mos, which also explains that this problem is caused by the large scn of the temporary block and can be bypassed by dropping tempfile. In fact, in this case, it does not work at all.
But in any case, this problem is obviously related to the block scn. Since it is related to scn, it is not difficult to handle it. Simply push the scn.
After advancing the scn, open resetlogs again to open the database. Unfortunately, alert log reports a bunch of errors, as shown below:


Sat Oct 3 13:10:34 2015
Errors in file/oracle/admin/xxxx/bdump/xxxx1_smon_000020246.trc:
ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
Opening with internal Resource Manager plan
Where numa pg = 1, CPUs = 40
Sat Oct 3 13:10:35 2015
ORACLE Instance xxxx1 (pid = 25)-Error 600 encountered while recovering transaction (23, 85 ).
Sat Oct 3 13:10:35 2015
Errors in file/oracle/admin/xxxx/bdump/xxxx1_smon_000020246.trc:
ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
Sat Oct 3 13:10:35 2015
Trace dumping is refreshing Ming id = [cdmp_20151003131035]
Sat Oct 3 13:10:35 2015
Replication_dependency_tracking turned off (no async multimaster replication found)
Sat Oct 3 13:10:36 2015
Instance recovery: looking for dead threads
Instance recovery: lock domain invalid but no dead threads
Sat Oct 3 13:10:36 2015
Errors in file/oracle/admin/xxxx/bdump/xxxx1_smon_000020246.trc:
ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
Sat Oct 3 13:10:37 2015
Starting background process QMNC
Sat Oct 3 13:10:37 2015
ORACLE Instance xxxx1 (pid = 25)-Error 600 encountered while recovering transaction (23, 85 ).
Sat Oct 3 13:10:37 2015
Errors in file/oracle/admin/xxxx/bdump/xxxx1_smon_000020246.trc:
ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
QMNC started with pid = 53, OS id = 7536816
Sat Oct 3 13:10:41 2015
LOGSTDBY: Validating controlfile with logical metadata
Sat Oct 3 13:10:41 2015
LOGSTDBY: Validation complete
Sat Oct 3 13:10:46 2015
Errors in file/oracle/admin/xxxx/bdump/xxxx1_mmon_9110004.trc:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], [], []
Sat Oct 3 13:10:48 2015
Errors in file/oracle/admin/xxxx/udump/xxxx1_ora_6619434.trc:
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [], []
Completed: alter database open resetlogs
This part of error handling is not difficult. For the first ora-00600 [4137] error, it is obvious that it has something to do with undo, where 23 in (23rd) shows the rollback segment; this error can be easily solved by blocking the rollback segment No. 23rd. Of course, this will cause transaction inconsistency, which is no way, and an undo exception has been thrown, oracle has no way to recover normal transactions.
Secondly, for the 2nd ora-00600 [qertbFetchByRowID] error, the processing is also very simple, it roughly means that through the rowid access to obtain data has an exception, it is obviously related to the index, you can solve this problem by rebuilding the index. The last [kdsgrp1] error is more common. It is usually also an Index problem. Just re-create the index.
It seems that the recovery process is very simple and smooth. However, the real problem begins only when it comes to the real problem.
That is, the last seemingly very simple error ora-00600 [kdsgrp1] error, has produced great difficulties for us. First, let's take a look at the objects involved in this error:


Validate domain 0
Validated domain 0, flags = 0x0
Kwqmnich: current time: 13: 31: 34
Kwqmnich: instance no 0 check_only flag 1
Kwqmnich: initialized job cache structure
Row 0041edda. 2e continuation
File #1 block #126426 slot 47 not found
**************************************** **********
KDSTABN_GET: 0... ntab: 1
CurSlot: 47... nrows: 175
**************************************** **********
*** 2015-10-03 13:31:40. 864
Ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [], []
Current SQL statement for this session:
Select owner, NAME, TYPE, COUNT (*) FROM DBA_SOURCE where substr (OWNER, 1000) = 'fmis 'group by owner, NAME, type having count (*)>
----- PL/SQL Call Stack -----
 
Object id on Block? Y
Seg/obj: 0x12 csc: 0x01. b1957474 itc: 3 flg: O typ: 1-DATA
Fsl: 0 fnx: 0x0 ver: 0x01
 
Itl Xid Uba Flag Lck Scn/Fsc
0x01 0x0009. 00e. 0000000d 0x0c07f50f. 1ea7. 5d -- U-1 fsc 0x0053. b1957475
0x02 0x0017. 005.0006755c 0x0c0774de. 1be3. 38 C --- 0 scn 0x0001. b1957451
0x03 0x0002. 042.00010d39 0x0080b268. 0cf1. 22 C --- 0 scn 0x0001. b1957429
 
Data_block_dump, data header at 0x110f46074
====================
Tsiz: 0x3f88
Hsiz: 0x170
Pbl: 0x110f46074
Bdba: 0x0041edda
76543210
Flag = --------
Ntab = 1
Nrow= 175
Frre = 0
Fsbo = 0x170
Fsearch = 0x783
Avsp = 0x3cbe
Tosp = 0x3d13
0xe: pti [0] nrow = 175 offs = 0
0x12: pri [0] sfll = 1
0x14: pri [1] sfll = 2

We can find that, apart from other non-core objects, an obj # = 18, that is, the obj $ core data dictionary table is also involved. And several indexes in the data dictionary table,
I _obj1, I _obj2, and I _obj3 are all core objects with object_id less than 57. These objects are the core data dictionary objects of bootstrap $. That is, Index cannot pass.
Rebuild, 38003 event or reconstruction in upgrade mode.
Of course, this does not mean that the above data dictionary table cannot be rebuilt at all. I will explain how to recreate it in a later article.
During the analysis, I found that the first two indexes have problems, as shown below:


SQL> analyze table obj $ validate structure;
 
Table analyzed.
 
SQL> select index_name from dba_indexes where table_name = 'obj $ ';
 
INDEX_NAME
------------------------------
I _OBJ1
I _OBJ2
I _OBJ3
 
SQL> analyze index I _OBJ1 validate structure;
Analyze index I _OBJ1 validate structure
*
ERROR at line 1:
ORA-08100: index is not valid-see trace file for diagnostics
 
SQL> analyze index I _OBJ2 validate structure;
Analyze index I _OBJ2 validate structure
*
ERROR at line 1:
ORA-08100: index is not valid-see trace file for diagnostics
 
SQL> analyze index I _OBJ3 validate structure;
 
Index analyzed.

However, we should also note that although there are problems with the first two indexes, not both indexes involved in the above errors are used, in fact, only the first index is used.
Error ora-00600 error. Because the customer wants to export data through the expdp schema method, but found that the execution of the error ora-00600 [kdsgrp1], including exp also reported when the execution
Errors, but the exp tables method does not report errors. Because there are too many objects, there are nearly 0.5 million objects (including tables, indexes, and others ). Obviously, it can only be exported at the user level.
This means that we must fix this error.
Through dump-related blocks, we find the error is very strange, as shown below:


* ** Session id: (1052.1243) 17:56:59. 784
Block Checking: DBA = 4684328, Block Type = KTB-managed data block
* ** Previous block dba: 477a2adoes not match my previous block dba: 477a26
* *** Row 0: key out of order
---- End index block validation
---- For block 4684328 = 0x00477a28
Block header dump: 0x00477a28
Object id on Block? Y
Seg/obj: 0x24 csc: 0x02.80026908 itc: 19 flg: O typ: 2-INDEX
Fsl: 0 fnx: 0x477a1c ver: 0x01
 
Itl Xid Uba Flag Lck Scn/Fsc
0x01 0x0015. 01f. 00020a0a 0x00867c38. 0862.04 C --- 0 scn 0x0001. 83f32656
0x02 0x002e. 044.00002415 0x0082989a. 0477.52 C --- 0 scn 0x0001. 74d90109
0x03 0x002c. 00a. 201712c5 0x0b804b78. 012b. 70 C --- 0 scn 0x0001. 74d9011c
0x04 0x0002. 001.20.3a72 0x00861aab. 024b. 0a C --- 0 scn 0x0001. 74fa8cf1
0x05 0x002f. 02e. 10965eb 0x0b816866. 0217.5b C --- 0 scn 0x0001. 74fb0205
0x06 0x0026. 011.00002000 0x0b814f02. 016d. 14 C --- 0 scn 0x0001. 74fc3dbf
0x07 0x0006. 011.20.3b1c 0x0084c65b. 01f9. 10 C --- 0 scn 0x0001. 74fde478
0x08 0x0029. 05f. 00003258 0x0084faa6. 017c. 2f C --- 0 scn 0x0001. 74 fdedad
0x09 0x001b. 023.20.3fe6 0x008a8718. 025b. 3b C --- 0 scn 0x0001. 74 fdfedc
0x0a 0x001b. 006.00004057 0x0b804f24. 025c. 33 C --- 0 scn 0x0001. 751a25ed
0x0b 0x001b. 02a. 00004039 0x0b804f24. 025c. 49 C --- 0 scn 0x0001. 751a2609
0x0c 0x001b. 024.415404f 0x0b804f24. 025c. 63 C --- 0 scn 0x0001. 751a2627
0x0d 0x000a. 018.00021b0b 0x0b821c8b. 04cc. 45 C --- 0 scn 0x0001. 751a263f
0x0e 0x000a. 055.00021b0a 0x0b821c8b. 04cc. 5e C --- 0 scn 0x0001. 751a265c
0x0f 0x000a. 031.00021aec 0x0b821c8c. 04cc. 09 C --- 0 scn 0x0001. 751a2678
0x10 0x0022. 05e. 1271dbd 0x008a88df. 01e0. 2e C --- 0 scn 0x0001. 74d8ff09
0x11 0x0025. 010.20.1ead 0x00860446. 01ac. 20 C --- 0 scn 0x0001. 74d8ff7d
0x12 0x002b. 043.00001228 0x0084dc65. 0164.11 C --- 0 scn 0x0001. 74d8ffe2
0x13 0x001f. 015.1_1de8 0x0b81a76a. 0183.4e C --- 0 scn 0x0001. 74d90071
 
Leaf block dump
====================
Header address 504403185228956148 = 0x70000064725c1f4
Kdxcolev0
Kdxcolevflags = ---
Kdxcolok 0
Kdxcoopc 0x80: opcode = 0: iot flags = --- is converted = Y
Kdxconco 1
Kdxcosdc 2
Kdxconro 0
Kdxcofbo 36 = 0x24
Kdxcofeo 15816 = 0x3dc8
Kdxcoavs 15780
Kdxlespl 0
Kdxlende 0
Kdxlenxt 4684330 = 0x477a2a
Kdxleprv 4684326 = 0x477a26
Kdxledsz 6
Kdxlebksz15816
* ** Dummy key ***
Row #0 [9549] flag: --- D --, lock: 0, len = 14, data :( 6): 00 44 03 10 00 7c
Col 0; len 5; (5): c4 04 24 25 2f
----- End of leaf block dump -----
Dumping parent of upted subtree, row # = 1202
Dumping parent of upted subtree, row # = 3

I met this index error for the first time and discussed it with old bear. He thinks it may be caused by index split. I also used bbed remotely.
We analyzed nearly 10 index blocks before and after, and compared the index linked list to find that they do not match.
In this case, it is very difficult to use bbed to fix the index. Therefore, you must give up this method. Finally, you can only process the data dictionary table.
I _obj1 and I _obj2 indexes. Finally, let the customer export exp at the user level, but this export takes a long time.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.