ORA-00240 error caused by instance downtime

1. Environment Description

OS: AIX 6.1

Oracle: 11.2.0.3.0 RAC

2. Accident Description

The minicomputer hosting database instance node2 went down. The workload should have failed over to node1; however, the failover failed, and service was only restored by restarting the system.

3. Accident Analysis

On the second day after the incident, the database alert log was analyzed. The log shows that after the instance went down on node2, RAC began the instance failover steps, but during the failover it hit ORA-00240 and ORA-29770 errors, so the failover was never completed successfully. The following is a detailed log analysis.
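As a small aside (not part of the original analysis): on 11g the alert log and the trace files it references live under the ADR, and their locations can be looked up with a query like the following.

-- locate the alert log and trace directories via the ADR (11g and later)
SELECT name, value
FROM v$diag_info
WHERE name IN ('Diag Alert', 'Diag Trace', 'Default Trace File');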

At the time, the database went through the following key steps:


1. Beginning instance recovery of 1 threads

The surviving instance begins instance recovery for the redo thread of the crashed instance.

2. Started redo application

Thread 2: logseq 4556, block 368380

The database starts applying redo from thread 2, log sequence 4556.

3. Completed instance recovery

Thread 2: logseq 4556, block 376983, scn 3502123313

Redo application for thread 2, log sequence 4556 completes, and instance recovery finishes successfully.

4. Redo thread 2 internally disabled at seq 4557 (SMON)

A failure occurs while the database is preparing to process log sequence 4557.

5. ORA-00240: control file enqueue held for more than 120 seconds

The ORA-00240 error appears in the log, indicating that the control file enqueue was held for more than 120 seconds.

6. ORA-29770: global enqueue process DIA0 (OSID 12517556) is hung for more than 300 seconds

Incident details in:

Then the ORA-29770 error occurs: the database process DIA0 has been hung for more than 5 minutes. (A generic diagnostic query for the control file enqueue follows this list.)
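When an ORA-00240 situation like this is caught live, one generic way to see which sessions hold or wait for the control file (CF) enqueue is to query gv$lock from a surviving instance. This is only an illustrative sketch, not a query taken from the incident:

-- CF is the control file enqueue; lmode > 0 means held, request > 0 means waited for
SELECT inst_id, sid, type, lmode, request, ctime
FROM gv$lock
WHERE type = 'CF';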

In the roughly 20 minutes after the instance went down, system monitoring and the other scheduled scripts did not run. Analysis of the Oracle AWR report for that day showed that while node2 was down, the CPU resources of the surviving node1 were almost completely exhausted.
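The CPU exhaustion can also be cross-checked from the AWR repository itself, for example via the OS statistics captured in each snapshot. A minimal sketch (the snapshot range would need to be restricted to the incident window):

-- CPU-related OS statistics recorded by AWR snapshots
SELECT snap_id, instance_number, stat_name, value
FROM dba_hist_osstat
WHERE stat_name IN ('NUM_CPUS', 'BUSY_TIME', 'IDLE_TIME', 'LOAD')
ORDER BY snap_id, instance_number, stat_name;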

Analysis suggests the following chain of events. The DIA0 process is mainly responsible for detecting database deadlocks and hung processes. The log shows that the control file enqueue had been held for more than 120 seconds, so DIA0 began handling that problem; but with the system's CPU resources exhausted, DIA0 itself spent more than 5 minutes on the fault, which produced the ORA-29770 error.

It is therefore determined that the root cause of this failover failure is the ORA-00240 error. The specific cause of that error can be found in the trace file referenced in the alert log.

According to the trace file, the main reason the control file enqueue was held for more than 120 seconds was a 'KSV master wait', which lasted about 2 minutes and 3 seconds.
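As a general cross-check (again, not from the original trace), the cumulative time an instance has spent in this wait can be read from v$system_event:

-- cumulative waits since instance startup; event name case can vary by version,
-- so check v$event_name if no rows come back
SELECT event, total_waits,
       ROUND(time_waited_micro / 1e6) AS seconds_waited
FROM v$system_event
WHERE event IN ('KSV master wait', 'ASM file metadata operation');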

4. Accident Analysis Conclusion

Based on the above symptoms and the evidence in the logs, a search of the official Oracle Metalink (My Oracle Support) shows that this matches a known bug [Doc ID 1308282.1].

The following is the explanation of this bug in the official Metalink document:


High 'ksv master wait' And 'ASM File Metadata Operation' Waits In Non-Exadata 11g

Symptoms

High waits for 'ksv master wait' while doing an ASM file metadata operation were reported when a data migration utility was running. This wait was also seen for a drop of a tablespace.

The AWR showed the top events were CPU (> 100%) and 'ASM file metadata operation' (7%).

Cause

The 'ksv master wait' event indicates that a process on the RDBMS side is waiting for a reply from a process on the ASM side. In 11g, the parameter cell_offload_processing is set to TRUE by default. Although that parameter is not applicable to non-Exadata databases, it caused ASM to try to deliver smart-scan results. The issue was reported in Bug 11800170 - ASM in KSV wait after application of 11.2.0.2 Grid PSU.

After applying the workaround for this issue (see Solution below), a drop of a tablespace that used to take 13 minutes took 4 seconds.

Solution

The following solutions are available for non-Exadata databases:

For the quickest solution, use the workaround. The workaround does not negatively impact non-Exadata databases. This parameter is to be set on the database instance.

ALTER SYSTEM SET cell_offload_processing = FALSE;

Upgrade to 12.1, when available, OR

Apply the 11.2.0.3 patch set, OR

Apply one-off Patch 11800170, if available for your RDBMS and Grid Homes.

Note: At the time this note was written (March 2011), neither 12.1 nor 11.2.0.3 was available.

The quickest solution provided in the official documentation is therefore to set the Oracle parameter cell_offload_processing to FALSE.
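A minimal sketch of checking and applying the workaround on a RAC database follows; the SCOPE and SID clauses are standard ALTER SYSTEM syntax added here for completeness, not quoted from the note:

-- check the current setting first
SELECT name, value, isdefault
FROM v$parameter
WHERE name = 'cell_offload_processing';

-- apply the workaround on all RAC instances and persist it in the spfile
ALTER SYSTEM SET cell_offload_processing = FALSE SCOPE=BOTH SID='*';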
