1. Environment Description
OS: AIX 6.1
Oracle: 11.2.0.3.0 RAC
2. Accident Description
The minicomputer hosting database node2 went down unexpectedly. The workload should have failed over to node1; however, the switchover failed, and the system had to be restarted to restore service.
3. Accident Analysis
On the day after the accident, the database alert log was analyzed. From the log we can see that after the instance on node2 went down, RAC performed the instance switchover steps, but during the process it hit ORA-00240 and ORA-29770 errors, so the switchover did not complete successfully. The following is a detailed log analysis.
At the time of the failure, the database went through the following important steps:
1. Beginning instance recovery of 1 threads
The surviving instance begins recovery of the crashed instance (one redo thread).
2. Started redo application
Thread 2: logseq 4556, block 368380
The database starts applying redo from online log sequence 4556 of thread 2.
3. Completed instance recovery
Thread 2: logseq 4556, block 376983, scn 3502123313
Redo application for log sequence 4556 completes and instance recovery finishes successfully.
4. Redo thread 2 internally disabled at seq 4557 (SMON)
A failure occurs when the database is preparing to process log sequence 4557: redo thread 2 is internally disabled by SMON.
5. ORA-00240: control file enqueue held for more than 120 seconds
The ORA-00240 error appears in the log, indicating that the control file enqueue was held for more than 120 seconds.
6. ORA-29770: global enqueue process DIA0 (OSID 12517556) is hung for more than 300 seconds
Incident details in:
The ORA-29770 error then occurs: the DIA0 background process has been hung for more than 5 minutes.
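For reference, the following is a minimal sketch of how the redo thread and log sequence state described above could be checked from the surviving instance; it uses the standard v$thread and v$log views and is not taken from the original environment.

-- Status and current log sequence of each redo thread (thread 2 belongs to the failed node).
SELECT thread#, status, enabled, sequence#
  FROM v$thread;

-- Online redo log groups of thread 2 around sequences 4556/4557.
SELECT group#, thread#, sequence#, status, first_change#
  FROM v$log
 WHERE thread# = 2
 ORDER BY sequence#;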
During the roughly 20 minutes after the instance went down, the system monitoring and other scheduled scripts did not run. Analysis of the Oracle AWR report showed that on the day of the incident, the CPU resources of node1, the node that stayed up, were almost exhausted.
Based on analysis and some speculation: the DIA0 process is mainly responsible for detecting database deadlocks and hung processes. The log shows that the control file enqueue was held for more than 120 seconds, so DIA0 started to handle the problem; but because the system CPU resources were exhausted, DIA0 itself took more than 5 minutes, which produced the ORA-29770 error.
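As an illustration only (not taken from the original logs), the DIA0 background process and the OS process id reported in ORA-29770 can be mapped with a query like the following, assuming the standard v$bgprocess and v$process views:

-- Map the DIA0 diagnostic process to its OS PID (the OSID shown in ORA-29770).
SELECT b.name, b.description, p.spid AS os_pid
  FROM v$bgprocess b
  JOIN v$process   p ON p.addr = b.paddr
 WHERE b.name = 'DIA0';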
It is therefore determined that the root cause of the failed instance switchover is the ORA-00240 error. The specific cause of the error can be seen in the trace file referenced in the alert log.
According to the trace file, the main reason the control file enqueue was held for more than 120 seconds was a 'KSV master wait', which lasted about 2 minutes and 3 seconds.
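To quantify such waits after the fact, a query along the following lines can be used; this is a sketch that assumes the AWR/ASH history views are available (Diagnostics Pack) and that the event names match those listed in v$event_name:

-- Count ASH samples of the two wait events around the incident window.
SELECT event, COUNT(*) AS samples
  FROM dba_hist_active_sess_history
 WHERE event IN ('ksv master wait', 'ASM file metadata operation')
   AND sample_time > SYSTIMESTAMP - INTERVAL '1' DAY  -- adjust to the incident window
 GROUP BY event
 ORDER BY samples DESC;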
4. Accident Analysis Conclusion
Based on the above symptoms and the evidence in the logs, a search of the official Oracle MetaLink turned up a matching known bug, documented in [Doc ID 1308282.1].
The following is the explanation of this bug in the official MetaLink document:
High 'ksv master wait' And 'ASM File Metadata Operation' Waits In Non-Exadata 11g
Symptoms
High waits for 'ksv master wait' while doing an ASM file metadata operation were reported when a data migration utility was running. This wait was also seen for a drop of a tablespace.
The AWR showed the top events were CPU (> 100%), with 'ASM file metadata operation' (7%).
Cause
Event 'ksv master wait' indicates that the process on the RDBMS side is waiting for a reply from a process on the ASM side. In 11g, the parameter cell_offload_processing is set to TRUE by default. Although that parameter is not applicable to non-Exadata databases, it caused ASM to try to deliver smart-scan results. The issue was reported in Bug 11800170 - ASM IN KSV WAIT AFTER APPLICATION OF 11.2.0.2 GRID PSU.
After applying the workaround for this issue (see Solution below), a drop of a tablespace that used to take 13 minutes took 4 seconds.
Solution
The following solutions are available for non-Exadata databases:
For the quickest solution, use the workaround. The workaround does not negatively impact non-Exadata databases. This parameter is to be set on the database instance:
ALTER SYSTEM SET cell_offload_processing = FALSE;
Upgrade to 12.1, when available. OR
Apply the 11.2.0.3 patch set OR
Apply one-off Patch 11800170, if available for your RDBMS and Grid Homes
Note: At the time this note was written (March 2011), neither 12.1 nor 11.2.0.3 were available.
The fastest solution provided in the official documentation is to set the Oracle parameter cell_offload_processing to FALSE.
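A hedged sketch of applying and verifying that workaround on a RAC database follows; the SCOPE/SID clauses are an assumed typical choice for an spfile-based RAC setup and are not part of the note itself.

-- Check the current value on all RAC instances.
SELECT inst_id, name, value
  FROM gv$parameter
 WHERE name = 'cell_offload_processing';

-- Apply the workaround cluster-wide, in memory and in the spfile.
ALTER SYSTEM SET cell_offload_processing = FALSE SCOPE = BOTH SID = '*';

-- Verify the change took effect on both instances.
SELECT inst_id, name, value
  FROM gv$parameter
 WHERE name = 'cell_offload_processing';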