Analysis of hang failures in the ndmcb database, ndmcbhang

Source: Internet
Author: User

Analysis of hang failures in the ndmcb database, ndmcbhang

Problem description:

When I arrived at the office in the morning, I received an email from the monitoring staff. The ndmcb407 database was restarted last night. Let me analyze the cause of the database restart. Because a business version was launched last night, the text message warning was closed, so no text message was sent to my mobile phone, And no one informed me of the fault.

1. Check alert logs

From the alert Log, we can see that at, a job failed to run: fri Aug 22 03:29:29 2014 Errors in file/opt/oracle/diag/rdbms/ndmcb/trace/ndmcb_j000_28856.trc: ORA-12012: error on auto execute of job 31ORA-04023: ObjectNDMC. delete_anony_r1__info cocould not be validated or authorizedORA-06512: at "NDMC. PROC_NDMC_CANCEL_OPEN ", line 5ORA-06512: at line 1 and then at, a connection timeout failure occurs and continues until 05: 00: 08: fri Aug 22 03:49:43 2014 ******** **************************************** * ********************** Fatal NI connect error 12170. version information: TNS for Linux: Version 11.1.0.7.0-Production Oracle Bequeath NT Protocol Adapter for Linux: Version 11.1.0.7.0-Production TCP/IP NT Protocol Adapter for Linux: Version 11.1.0.7.0-Production Time: 22-AUG-2014 03:49:43 Tracing not turned on. tnserror struct: ns main err code: 12535 TNS-12535: TNS: operation timed out ns secondary err code: 12606 nt main err code: 0 nt secondary err code: 0 nt OS err code: 0 Client address: (ADDRESS = (PROTOCOL = tcp) (HOST = 192.168.130.87) (PORT = 36628) WARNING: inbound connection timed out (ORA-3136) Fri Aug 22 03:49:44 2014 ...... In addition, the number of connections has been exhausted: Fri Aug 22 03:49:50 2014ORA-00020: maximum number of processes 0 exceeded ns secondary err code: 12560 ns secondary err code: 12560 ns main err code: 12537Fri Aug 22 03:49:50 2014 ...... Fri Aug 22 03:51:48 2014 *********************************** * *********************************** Fatal NI connect error 12537, connectingto: (LOCAL = NO) version information: TNS for Linux: Version 11.1.0.7.0-Production Oracle Bequeath NT Protocol Adapter for Linux: Version 11.1.0.7.0-Production TCP/IP NT Protocol Adapter for Linux: version 11.1.0.7.0-Production Time: 22-AUG-2014 03:51:48 Tracin G not turned on. tnserror struct: ns main err code: 12537 TNS-12537: TNS: connection closedns secondaryerr code: 12560 nt main err code: 0 nt secondary err code: 0 nt OS err code: 0ORA-609: opiodr aborting process unknownospid (random) Fri Aug 22 04:14:15 2014ORA-28: opiodr aborting process unknownospid (random) Fri Aug 22 04:16:27 2014ORA-28: opiodr aborting process un Knownospid (updated) Fri Aug 22 04:16:28 2014ORA-28: opiodr aborting process finished (updated) Fri Aug 22 04:16:29 2014ORA-28: opiodr aborting process finished (updated) ORA-28: opiodr aborting process finished (updated) ...... Fri Aug 22 05:00:05 2014ORA-28: opiodr aborting process unknownospid (unknown) Fri Aug 22 05:00:08 2014ORA-28: opiodr aborting process unknownospid (unknown) So the database is shut down, this is a normal shutdown. It is suspected that the database is manually disabled or that the dual-host VCS has automatically disabled the database: Fri Aug 22 05:04:10 2014 Stopping background process SMCOStopping background process FBDAShutting down instance: further login Aug 22 05:04:12 2014 Stopping background process CJQ0Stopping background process QMNCStopping background process MMNLStopping background process MMONShutting down instance (immediate) License high water mark = 1220 Stopping Job queue slave processes, flags = 7Fri Aug 22 05:04:20 2014 Waiting for Job queue slaves to completeJob queue slave processes stoppedFri Aug 22 05:09:11 2014 License high water mark = 1220 USER (ospid: 25110 ): terminating theinstanceTermination issued to instance processes. waiting for the processes to exitFri Aug 22 05:09:21 2014 Instance termination failed to kill one ormore processesInstance terminated by USER, pid = 25110

2. Check the messages log

At about 05:03:51, I tried to switch the dual-host to the standby server:

Aug 22 05:03:51 NDMCDB11 user_cmd:2014-08-22 05:03:51 hagrp -switch RCS_DB_SG -to system by root from [oraclepts/9 Aug 22 04:29 (192.168.128.142)]Aug 22 05:04:01 NDMCDB11/usr/sbin/cron[15348]: (root) CMD (su - root -c'/opt/watchdog/watchdog_schedule -n OS,oracle' >/dev/null 2>&1)Aug 22 05:04:01 NDMCDB11 su: (to root) rooton noneAug 22 05:04:03 NDMCDB11 su: (to oracle)root on noneAug 22 05:04:09 NDMCDB11 user_cmd:2014-08-22 05:04:09 hagrp -switch RCS_DB_SG -to NDMCDB12 by root from [oraclepts/9 Aug 22 04:29 (192.168.128.142)]Aug 22 05:04:09 NDMCDB11 su: (to oracle)root on none

However, the dual-machine switchover fails. In the end, the dual-machine is stopped directly and the VCS is restarted:

Aug 22 05:06:18 NDMCDB11 user_cmd:2014-08-22 05:06:18 hastop -all by root from [oracle pts/9 Aug 22 04:29(192.168.128.142)]……Aug 22 05:07:02 NDMCDB11 user_cmd:2014-08-22 05:07:02 hastat by root from [oracle pts/9 Aug 22 04:29(192.168.128.142)]

Therefore, we have confirmed that the database has been restarted because of manual restart of the VCS cluster. So why do we need to restart the VCS cluster? Is there a problem with the database? Let's take a look.

 

Finally, the upgrading personnel confirmed that there was a stored procedure to be run during the upgrade, but after the execution, the basic response of the database was very slow and it was running until around, this is why the following error occurs:

Fri Aug 22 03:29:29 2014Errors in file/opt/oracle/diag/rdbms/ndmcdb/NDMCDB/trace/NDMCDB_j000_28856.trc:ORA-12012: error on auto execute of job 31ORA-04023: ObjectNDMC.DELETE_ANONY_RSHARE_INFO could not be validated or authorizedORA-06512: at"NDMC.PROC_NDMC_CANCEL_OPEN", line 5ORA-06512: at line 1

3. View System Load

CPU load:


Memory load:


It can be seen that the system is running out of CPU and memory at around. During this period, a large number of connection timeout failures occur in the database, or even the number of connections exceeds the threshold value:

Fri Aug 22 03:49:50 2014ORA-00020: maximum number of processes 0exceeded   ns secondary err code: 12560   ns secondary err code: 12560   ns main err code: 12537Fri Aug 22 03:49:50 2014

4. AWR Analysis


From this point of view, the database is very busy from to, but from the previous system load, from to, the CPU and memory usage is not very high. Next, let's see:


There are no very high indicators.


From the top 5 events, we can see that there are a large number of cursor: pin S wait on X waits. We can see that mutex contention occurs, but this is usually just a representation, not a root cause.


The root cause of database hang is that SQL parsing has failed most of the time. Normally, most of the time of a database is used for SQL Execution, so this takes the most time: SQL execute elapsedtime.


There is no higher versioncount.

 

When does the database fail to parse SQL statements?

I checked DBA_HIST_ACTIVE_SESS_HISTORY, analyzed the historical session information, and found the problems that occurred between 02:57:00 and 03:00:00:


After confirmation, it happens to be about the execution time of the stored procedure.

So far, the database has been abnormal since, and the database is constantly parsing SQL and SQL has not yet been executed. The database is already in a unresponsive state, connection sessions are blocked until the number of connections reaches the maximum number of connections, and the upgrade operator restarts the VCS cluster.

 

5 analysis conclusion

(1) the reason for the failure of manual VCS switching is that the VCS restart operation is performed on the database down machine.

(2) the root cause of this database failure is why the database failed to parse SQL at around. According to the current log analysis, there is no reason.


-- Bosco ---- END ----

Bytes -------------------------------------------------------------------------------------------------------

All rights reserved. reprinted articles are allowed, but source addresses must be indicated by links. Otherwise, the documents will be held legally responsible!


 


If a hardware or software fault occurs in the database, analyze the cause of the fault and how to recover the fault

Is that a problem...
[Software fault] is often referred to as a system fault. It refers to any event that causes the database system to stop running, so that the system must be restarted. Possible causes: Are there specific types of hardware errors (such as cpu faults), operating system faults, DBMS code errors, and system power outages?
[Hardware fault] is also called a media fault. It refers to an external memory fault, such as a magnetic head collision, disk damage, and instantaneous strong magnetic field interference.

The database recovery technology is required for recovery after a database fault occurs!
You can use "data dump" or "register Log Files". In fact, the recovery technologies for different faults are similar, but the technologies used are similar and the principles are the same, but the specific implementation method is different.

Hope to help you o (∩ _ ∩) o
 
A computer cannot access the Home Page. The network structure diagram on the left and the logical structure diagram on the right are used to analyze the fault factors from the network, system, and database.

Check whether the IP address of the ping domain name is correct!
 

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.