Problem description:
When I arrived at the office in the morning, I received an email from the monitoring staff: the ndmcb407 database had been restarted during the night. Because a business release went live last night, SMS alerting had been turned off, so no alert reached my phone and no one informed me of the fault. Let me analyze the cause of the restart.
1. Check alert logs
From the alert log, we can see that at 03:29:29 a job failed to run:

Fri Aug 22 03:29:29 2014
Errors in file /opt/oracle/diag/rdbms/ndmcdb/NDMCDB/trace/NDMCDB_j000_28856.trc:
ORA-12012: error on auto execute of job 31
ORA-04023: Object NDMC.DELETE_ANONY_RSHARE_INFO could not be validated or authorized
ORA-06512: at "NDMC.PROC_NDMC_CANCEL_OPEN", line 5
ORA-06512: at line 1

Then, starting at 03:49:43, connection timeout failures appear and continue until 05:00:08:

Fri Aug 22 03:49:43 2014
***********************************************************************
Fatal NI connect error 12170.
  VERSION INFORMATION:
    TNS for Linux: Version 11.1.0.7.0 - Production
    Oracle Bequeath NT Protocol Adapter for Linux: Version 11.1.0.7.0 - Production
    TCP/IP NT Protocol Adapter for Linux: Version 11.1.0.7.0 - Production
  Time: 22-AUG-2014 03:49:43
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535
TNS-12535: TNS:operation timed out
    ns secondary err code: 12606
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.130.87)(PORT=36628))
WARNING: inbound connection timed out (ORA-3136)
Fri Aug 22 03:49:44 2014
......

In addition, the number of connections was exhausted:

Fri Aug 22 03:49:50 2014
ORA-00020: maximum number of processes 0 exceeded
    ns secondary err code: 12560
    ns secondary err code: 12560
    ns main err code: 12537
Fri Aug 22 03:49:50 2014
......

Fri Aug 22 03:51:48 2014
***********************************************************************
Fatal NI connect error 12537, connecting to:
 (LOCAL=NO)
  VERSION INFORMATION:
    TNS for Linux: Version 11.1.0.7.0 - Production
    Oracle Bequeath NT Protocol Adapter for Linux: Version 11.1.0.7.0 - Production
    TCP/IP NT Protocol Adapter for Linux: Version 11.1.0.7.0 - Production
  Time: 22-AUG-2014 03:51:48
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12537
TNS-12537: TNS:connection closed
    ns secondary err code: 12560
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
ORA-609: opiodr aborting process unknown ospid (random)
Fri Aug 22 04:14:15 2014
ORA-28: opiodr aborting process unknown ospid (random)
Fri Aug 22 04:16:27 2014
ORA-28: opiodr aborting process unknown ospid (updated)
Fri Aug 22 04:16:28 2014
ORA-28: opiodr aborting process unknown ospid (updated)
Fri Aug 22 04:16:29 2014
ORA-28: opiodr aborting process unknown ospid (updated)
ORA-28: opiodr aborting process unknown ospid (updated)
......
Fri Aug 22 05:00:05 2014
ORA-28: opiodr aborting process unknown ospid (unknown)
Fri Aug 22 05:00:08 2014
ORA-28: opiodr aborting process unknown ospid (unknown)

The database was then shut down, and the shutdown itself was normal, so it is suspected that the database was stopped manually or that the VCS cluster stopped it automatically:

Fri Aug 22 05:04:10 2014
Stopping background process SMCO
Stopping background process FBDA
Shutting down instance: further logons disabled
Fri Aug 22 05:04:12 2014
Stopping background process CJQ0
Stopping background process QMNC
Stopping background process MMNL
Stopping background process MMON
Shutting down instance (immediate)
License high water mark = 1220
Stopping Job queue slave processes, flags = 7
Fri Aug 22 05:04:20 2014
Waiting for Job queue slaves to complete
Job queue slave processes stopped
Fri Aug 22 05:09:11 2014
License high water mark = 1220
USER (ospid: 25110): terminating the instance
Termination issued to instance processes. Waiting for the processes to exit
Fri Aug 22 05:09:21 2014
Instance termination failed to kill one or more processes
Instance terminated by USER, pid = 25110
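As a side note, the trace file path in the errors above sits under the 11g ADR. If you need to locate the alert log and trace directory on a system like this, a generic query (my own sketch, not part of the original troubleshooting) is:

-- Locate the ADR trace directory that contains alert_<SID>.log and the
-- NDMCDB_j000_*.trc files referenced above (Oracle 11g and later).
SELECT name, value
  FROM v$diag_info
 WHERE name IN ('Diag Trace', 'Default Trace File');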
2. Check the messages log
At about 05:03:51, an attempt was made to switch the service group over to the standby server:
Aug 22 05:03:51 NDMCDB11 user_cmd: 2014-08-22 05:03:51 hagrp -switch RCS_DB_SG -to system by root from [oracle pts/9 Aug 22 04:29 (192.168.128.142)]
Aug 22 05:04:01 NDMCDB11 /usr/sbin/cron[15348]: (root) CMD (su - root -c '/opt/watchdog/watchdog_schedule -n OS,oracle' > /dev/null 2>&1)
Aug 22 05:04:01 NDMCDB11 su: (to root) root on none
Aug 22 05:04:03 NDMCDB11 su: (to oracle) root on none
Aug 22 05:04:09 NDMCDB11 user_cmd: 2014-08-22 05:04:09 hagrp -switch RCS_DB_SG -to NDMCDB12 by root from [oracle pts/9 Aug 22 04:29 (192.168.128.142)]
Aug 22 05:04:09 NDMCDB11 su: (to oracle) root on none
However, the switchover failed. In the end, the cluster was simply stopped and VCS was restarted:
Aug 22 05:06:18 NDMCDB11 user_cmd: 2014-08-22 05:06:18 hastop -all by root from [oracle pts/9 Aug 22 04:29 (192.168.128.142)]
......
Aug 22 05:07:02 NDMCDB11 user_cmd: 2014-08-22 05:07:02 hastat by root from [oracle pts/9 Aug 22 04:29 (192.168.128.142)]
This confirms that the database was restarted because the VCS cluster was manually restarted. So why did the VCS cluster have to be restarted? Was there a problem with the database? Let's take a look.
Finally, the upgrade personnel confirmed that a stored procedure had to be run during the release. After it was started, however, the database became very slow to respond even to basic requests, and the procedure kept running. This is why the following error occurred:
Fri Aug 22 03:29:29 2014
Errors in file /opt/oracle/diag/rdbms/ndmcdb/NDMCDB/trace/NDMCDB_j000_28856.trc:
ORA-12012: error on auto execute of job 31
ORA-04023: Object NDMC.DELETE_ANONY_RSHARE_INFO could not be validated or authorized
ORA-06512: at "NDMC.PROC_NDMC_CANCEL_OPEN", line 5
ORA-06512: at line 1
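ORA-04023 indicates that the object could not be revalidated at parse time. A quick way to check whether the objects involved had been left INVALID by the release (a hedged sketch; the owner and object names are taken from the error stack above) is:

-- Check the validity of the objects named in the ORA-04023/ORA-06512 stack.
SELECT owner, object_name, object_type, status
  FROM dba_objects
 WHERE owner = 'NDMC'
   AND object_name IN ('DELETE_ANONY_RSHARE_INFO', 'PROC_NDMC_CANCEL_OPEN');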
3. View System Load
CPU load:
Memory load:
It can be seen that the system was running out of CPU and memory during this period. Meanwhile, the database was hit by a large number of connection timeout failures, and the number of connections even exceeded the configured limit:
Fri Aug 22 03:49:50 2014
ORA-00020: maximum number of processes 0 exceeded
    ns secondary err code: 12560
    ns secondary err code: 12560
    ns main err code: 12537
Fri Aug 22 03:49:50 2014
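When ORA-00020 shows up, process slot usage can be compared against the configured limit. A generic sketch (not something run during the original incident):

-- Compare current and high-water process/session usage with the limits.
SELECT resource_name, current_utilization, max_utilization, limit_value
  FROM v$resource_limit
 WHERE resource_name IN ('processes', 'sessions');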
4. AWR Analysis
From this point of view, the database was very busy during the problem window, yet the system load shown earlier indicates that CPU and memory usage were not especially high for part of that time. Next, let's dig deeper:
No indicator stands out as abnormally high.
The Top 5 wait events show a large number of cursor: pin S wait on X waits, which means mutex contention was occurring. This, however, is usually just a symptom, not the root cause.
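Had the hang been caught live, the session holding the mutex could have been identified from V$SESSION: for cursor: pin S wait on X, the high-order half of P2RAW encodes the holder's SID on 64-bit platforms. A commonly used sketch (assuming a 64-bit platform; this was not run during the incident):

-- Decode the blocking SID from P2RAW for sessions stuck on the mutex.
SELECT sid, sql_id, p1 AS mutex_idn,
       to_number(substr(rawtohex(p2raw), 1, 8), 'XXXXXXXX') AS holder_sid
  FROM v$session
 WHERE event = 'cursor: pin S wait on X';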
The real anomaly behind the database hang is that most of the time was spent on SQL parsing. Normally a database spends most of its time executing SQL, so the largest component of the time model should be 'sql execute elapsed time'.
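The time model makes this visible. On a live system it can be read from V$SYS_TIME_MODEL (for a past window, deltas between DBA_HIST_SYS_TIME_MODEL snapshots serve the same purpose); a generic sketch:

-- Compare parse time against execute time; in a healthy system
-- 'sql execute elapsed time' dominates 'parse time elapsed'.
SELECT stat_name, ROUND(value / 1000000) AS seconds
  FROM v$sys_time_model
 WHERE stat_name IN ('DB time', 'sql execute elapsed time',
                     'parse time elapsed', 'hard parse elapsed time');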
There is no abnormally high version count either, so excessive child cursors can be ruled out.
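A high version count is the usual first suspect behind cursor: pin S wait on X, so it is worth showing how it can be ruled out. A generic sketch (the threshold of 100 is arbitrary):

-- Look for cursors with many child versions.
SELECT sql_id, version_count, executions
  FROM v$sqlarea
 WHERE version_count > 100
 ORDER BY version_count DESC;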
So when did the database start failing to parse SQL?
I checked DBA_HIST_ACTIVE_SESS_HISTORY to analyze the historical session activity, and found that the problem appeared between 02:57:00 and 03:00:00:
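A query along the following lines can pull that window out of ASH (a hedged sketch; the sample window brackets the times just mentioned, and the column list is illustrative):

-- Historical session activity around the suspect window.
SELECT sample_time, session_id, sql_id, event, blocking_session
  FROM dba_hist_active_sess_history
 WHERE sample_time BETWEEN TIMESTAMP '2014-08-22 02:50:00'
                       AND TIMESTAMP '2014-08-22 03:10:00'
 ORDER BY sample_time;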
After confirmation, this coincides almost exactly with the execution time of the stored procedure.
To summarize: the database had been abnormal since around 03:00. Sessions were constantly parsing SQL that never reached execution, the database was effectively unresponsive, and incoming connections were blocked and piled up until the maximum number of processes was reached, at which point the upgrade operator restarted the VCS cluster.
5. Analysis conclusion
(1) The manual VCS switchover failed, which is why the VCS restart operation ended up being performed on the node where the database had gone down.
(2) The root cause of this database failure is whatever caused SQL parsing to stall at around 03:00. Based on the logs currently available, that reason cannot be determined.
-- Bosco ---- END ----