Oracle RAC (ASM) recovery case (a very large database)


A few days ago a customer's database failed and needed emergency support. When I learned about the environment I was taken aback: the data volume was about 55 TB. First, let's look at the fault information:
 

Fri Mar 25 22:57:10 2016
Errors in file /oracle/app/oracle/diag/rdbms/njsdb/njsdb1/trace/xxxx1_dia0_30474350.trc (incident=640081):
ORA-32701: Possible hangs up to hang ID=80 detected
Incident details in: /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/incident/incdir_640081/xxxx1_dia0_30474350_i640081.trc
DIA0 requesting termination of session sid:10347 with serial #16419 (ospid:29884706) on instance 2
due to a LOCAL, HIGH confidence hang with ID=80.
Hang Resolution Reason: Although the number of affected sessions did not
justify automatic hang resolution initially, this previously ignored
hang was automatically resolved.
DIA0: Examine the alert log on instance 2 for session termination status of hang with ID=80.
Fri Mar 25 22:59:26 2016
minact-scn: useg scan erroring out with error e:12751
Suspending MMON action 'Block Cleanout Optim, Undo Segment Scan' for 82800 seconds
Fri Mar 25 22:59:35 2016
Sweep [inc][640081]: completed
Sweep [inc2][640081]: completed
Fri Mar 25 22:59:57 2016
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_dbw7_27263388.trc:
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 78: Connection timed out
Additional information: -1
Additional information: 8192
WARNING: Write Failed. group:3 disk:35 AU:241007 offset:262144 size:8192
Fri Mar 25 22:59:57 2016
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_ora_10289540.trc:
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 78: Connection timed out
Additional information: 7
Additional information: 41229344
Additional information: -1
WARNING: Read Failed. group:3 disk:35 AU:20131 offset:540672 size:8192
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_dbw7_27263388.trc:
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 78: Connection timed out
Additional information: -1
Additional information: 8192
WARNING: Write Failed. group:3 disk:35 AU:241007 offset:237568 size:8192
Fri Mar 25 22:59:57 2016
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_dbw3_30212292.trc:
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 78: Connection timed out
Additional information: -1
Additional information: 8192
......
WARNING: failed to write mirror side 1 of virtual extent 5096 logical extent 0 of file 583 in group 3 on disk 35 allocation unit 241007
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_dbw3_30212292.trc:
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 78: Connection timed out
Additional information: -1
Additional information: 8192
......
ORA-15080: synchronous I/O operation to a disk failed
WARNING: failed to write mirror side 1 of virtual extent 5096 logical extent 0 of file 583 in group 3 on disk 35 allocation unit 241007
WARNING: failed to write mirror side 1 of virtual extent 5002 logical extent 0 of file 585 in group 3 on disk 35 allocation unit 242538
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_dbw3_30212292.trc:
ORA-15080: synchronous I/O operation to a disk failed

From the logs above we can see that the node began reporting errors on the 25th: some sessions appeared to hang, followed by write failures, all against disk 35 of disk group 3. These are only warnings, though, so we could not yet determine whether the disk itself was faulty. We later learned from the customer that the storage link had failed, causing the database I/O errors, which is consistent with the symptoms above.
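As a cross-check in a case like this, you can ask ASM directly about the disk named in the WARNING lines. This is a minimal sketch, not the exact query used in the incident; the group and disk numbers are taken from the log above, and it should be run against the ASM instance:

```sql
-- State of disk 35 in disk group 3, as ASM sees it
SELECT group_number, disk_number, name, path, state, mode_status
FROM   v$asm_disk
WHERE  group_number = 3
  AND  disk_number  = 35;
```

If MODE_STATUS shows OFFLINE, ASM has already taken the disk out of service; if the disk still looks healthy, a transient storage-path problem (as turned out to be the case here) becomes the more likely explanation.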
Next, let's analyze what operations the customer performed afterwards.
 

Sat Mar 26 01:13:51 2016
ALTER DATABASE OPEN
This instance was first to open
Abort recovery for domain 0
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_ora_27853116.trc:
ORA-01113: file 763 needs media recovery
ORA-01110: data file 763: '+DATA/xxxx/datafile/ts_icp_bill.1028.8820.3603'
ORA-1113 signalled during: alter database open...
Sat Mar 26 01:14:14 2016
......
......
Sat Mar 26 02:02:14 2016
alter database recover database
Media Recovery Start
Started logmerger process
Sat Mar 26 02:02:18 2016
WARNING! Recovering data file 763 from a fuzzy backup. It might be an online
backup taken without entering the begin backup command.
......
WARNING! Recovering data file 779 from a fuzzy backup. It might be an online
backup taken without entering the begin backup command.
Sat Mar 26 02:02:18 2016
......
Sat Mar 26 02:04:15 2016
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_pr00_36110522.trc:
ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '+DATA/xxxx/datafile/system.261.880958693'
Slave exiting with ORA-1547 exception
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_pr00_36110522.trc:
ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '+DATA/xxxx/datafile/system.261.880958693'
ORA-1547 signalled during: alter database recover cancel...
Sat Mar 26 02:04:43 2016
 
We can see that the customer attempted a normal alter database open, but Oracle reported that some files needed media recovery. When they then ran recover database, Oracle warned that some files appeared to come from a "fuzzy" backup. What does this mean? In essence, the checkpoints of these files are very old, and recovering them would require redo generated long ago. Because this database runs in noarchivelog mode, the redo these files need has very likely already been overwritten, and by comparing SCNs we confirmed that it was indeed gone. A reminder here: do not judge the situation from the alert log alone. Reading only the alert log, you might think just a few files have problems, and reason: if only a few files are damaged, can't I skip them during recovery and restore as much data as possible? With that idea, the following command was executed:
 

run {
  set until scn 14586918070973;
  recover database skip forever tablespace ts_icp_bill, ts_icp_bill_idx, ts_wj_bill, ts_wj_bill_idx, ts_js_bill;
}
The command above is actually fatal: for every tablespace listed after skip forever, RMAN takes the data files offline with the drop option. So this approach is flawed.
I then compared each data file's checkpoint SCN with the SCN information in v$log, and found that roughly 605 of the approximately 2000 data files in the database might still require recovery.
Based on this rough SCN comparison I generated two scripts for file-level recovery. The script generation looked roughly like this:


spool recover1.sh
set pagesize 160 long 9999
select 'recover datafile ' || file# || ';' from v$datafile_header where checkpoint_change# < xxx;
spool off
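The SCN comparison itself can be sketched as follows. This is illustrative rather than the exact script used: in a noarchivelog database, any file whose checkpoint SCN is older than the oldest first_change# still present in the online redo can no longer be rolled forward.

```sql
-- What redo is still available in the online logs?
SELECT group#, thread#, sequence#, first_change#, status
FROM   v$log
ORDER  BY first_change#;

-- Data files whose checkpoint predates the oldest available redo
SELECT file#, checkpoint_change#
FROM   v$datafile_header
WHERE  checkpoint_change# < (SELECT MIN(first_change#) FROM v$log);
```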
 
After recovering the files that could still be recovered normally, we tried to open the database, and it opened without problems. The story might have ended here, but it did not.
Think about it: although the database is open, the 605 data files we could not recover may still be in RECOVER or offline status, and the business cannot access them. I checked: about 540 files were still in RECOVER status, so we still had to find a way to bring them online.
The fix itself is not conceptually difficult. For example, you could patch the checkpoint information in each data file header with bbed. But with 540 files in an ASM environment, the workload of patching by hand is prohibitive. Instead, I generated a script: start the database in mount state, bring all the files previously in RECOVER status online, perform recover database using backup controlfile, and then run alter database open resetlogs directly.
Unfortunately, the database did not open directly and reported the following error. This error is fairly common, and according to MOS it can be related to the temp files.
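The script-based approach described above could look roughly like this. It is a sketch under the assumptions of this case; the generated file name is illustrative:

```sql
-- With the database in MOUNT state, generate ONLINE commands
-- for every data file still stuck in RECOVER status
set pagesize 0 feedback off
spool online_files.sql
select 'alter database datafile ' || file# || ' online;'
from   v$datafile
where  status = 'RECOVER';
spool off

@online_files.sql

-- Then recover with the backup controlfile and open with resetlogs
recover database using backup controlfile until cancel;
alter database open resetlogs;
```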
 

  
 
 
Here I simply dropped the temp files, re-created the control file, recovered again, and then opened the database directly.
Checking the alert log, I found the following errors still present:
 


Recovery of Online Redo Log: Thread 1 Group 1 Seq 1 Reading mem 0
  Mem# 0: +DATA/xxxx/onlinelog/group_1.2177.907527869
  Mem# 1: +FRA/xxxx/onlinelog/group_1.263.907527881
Block recovery completed at rba 1.46.16, scn 3396.1209160607
ORACLE Instance xxxx1 (pid = 26) - Error 607 encountered while recovering transaction (12, 33) on object 143154.
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_smon_10289168.trc:
ORA-00607: Internal error occurred while making a change to a data block
ORA-00600: internal error code, arguments: [6856], [0], [1], [], [], [], [], [], [], [], [], [], []
Sat Mar 26 19:05:43 2016
Dumping diagnostic data in directory=[cdmp_201603261_543], requested by (instance=1, osid=10289168 (SMON)), summary=[incident=2400209].
Starting background process QMNC
Sat Mar 26 19:05:44 2016
QMNC started with pid=40, OS id=17432578
LOGSTDBY: Validating controlfile with logical metadata
LOGSTDBY: Validation complete
Sat Mar 26 19:05:45 2016
Sweep [inc][2400209]: completed
Sweep [inc2][2400209]: completed
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_smon_10289168.trc (incident=2400210):
ORA-00600: internal error code, arguments: [6856], [0], [13], [], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/incident/incdir_2400210/xxxx1_smon_10289168_i2400210.trc
Dumping diagnostic data in directory=[cdmp_20160326190545], requested by (instance=1, osid=10289168 (SMON)), summary=[abnormal process termination].
Starting background process CJQ0
Sat Mar 26 19:05:46 2016
CJQ0 started with pid=43, OS id=11010116
Sat Mar 26 19:05:47 2016
db_recovery_file_dest_size of 2047887 MB is 0.21% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Block recovery from logseq 1, block 60 to scn 14586918097855
Recovery of Online Redo Log: Thread 1 Group 1 Seq 1 Reading mem 0
  Mem# 0: +DATA/xxxx/onlinelog/group_1.2177.907527869
  Mem# 1: +FRA/xxxx/onlinelog/group_1.263.907527881
Dumping diagnostic data in directory=[cdmp_201603261_547], requested by (instance=1, osid=10289168 (SMON)), summary=[incident=2400210].
Block recovery completed at rba 1.91.16, scn 3396.1209160640
ORACLE Instance xxxx1 (pid = 26) - Error 607 encountered while recovering transaction (15, 3) on object 143247.
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_smon_10289168.trc:
ORA-00607: Internal error occurred while making a change to a data block
ORA-00600: internal error code, arguments: [6856], [0], [13], [], [], [], [], [], [], [], [], [], []
Starting background process SMCO
Dumping diagnostic data in directory=[cdmp_201603261_timeout], requested by (instance=1, osid=10289168 (SMON)), summary=[abnormal process termination].
Sat Mar 26 19:05:49 2016
SMCO started with pid=46, OS id=2949376
Setting Resource Manager plan SCHEDULER[0x3198]:DEFAULT_MAINTENANCE_PLAN via scheduler window
Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter
Sat Mar 26 19:05:49 2016
 
Clearly, these errors show that the SMON process failed to recover two transactions: (12, 33) on object 143154 and (15, 3) on object 143247.
Seeing this, some might assume the undo is damaged and that is why the transactions cannot be recovered. That is not the case here: I verified with dbv that the undo data files are all intact.
Either way, this problem is relatively simple to resolve: identify the objects involved and recreate them.
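Mapping the object IDs from the alert log back to schema objects is straightforward; the IDs below are the ones reported in the ORA-00607 messages above:

```sql
SELECT owner, object_name, object_type
FROM   dba_objects
WHERE  object_id IN (143154, 143247);
```

Once the objects are identified (presumably segments that can be rebuilt from other data, such as indexes), they can be dropped and recreated.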
