Customer feedback system Hang live, can not be archived, we need urgent intervention analysis
Node 1st log
Redo cannot be archived, redo logs have been filled, and the database has been manually executing alter SYSTEM ARCHIVE log current, but the subsequent redo cannot be archived, and when redo is fully written, Database has a large number of log file switch (archiving needed) waiting
The code is as follows |
Copy Code |
Tue Sep 24 22:05:37 2013 Thread 1 advanced to log sequence 47282 (LGWR switch) Current log# 6 seq# 47282 mem# 0: +data/q9db/onlinelog/group_6.1244.818697409 Tue Sep 24 22:07:31 2013 ORACLE Instance Q9db1-can not allocate log, archival required Thread 1 cannot allocate new log, sequence 47283 All online logs needed archiving Current log# 6 seq# 47282 mem# 0: +data/q9db/onlinelog/group_6.1244.818697409 Tue Sep 24 22:28:17 2013 ALTER SYSTEM ARCHIVE LOG Archived Log Entry 259646 added for thread 1 sequence 47266 ID 0x354620c2 dest 1: Tue Sep 24 22:28:18 2013 Thread 1 advanced to log sequence 47283 (LGWR switch) Current log# 7 seq# 47283 mem# 0: +data/q9db/onlinelog/group_7.1243.818697415 Archived Log Entry 259647 added for thread 1 sequence 47267 ID 0x354620c2 dest 1: Archived Log Entry 259648 added for thread 1 sequence 47268 ID 0x354620c2 dest 1: Archived Log Entry 259649 added for thread 1 sequence 47269 ID 0x354620c2 dest 1: Archived Log Entry 259650 added for thread 1 sequence 47270 ID 0x354620c2 dest 1: Archived Log Entry 259651 added for thread 1 sequence 47271 ID 0x354620c2 dest 1: Archived Log Entry 259652 added for thread 1 sequence 47272 ID 0x354620c2 dest 1: Tue Sep 24 22:28:28 2013 Archived Log Entry 259653 added for thread 1 sequence 47273 ID 0x354620c2 dest 1: Archived Log Entry 259654 added for thread 1 sequence 47274 ID 0x354620c2 dest 1: Archived Log Entry 259655 added for thread 1 sequence 47275 ID 0x354620c2 dest 1: Archived Log Entry 259656 added for thread 1 sequence 47276 ID 0x354620c2 dest 1: Archived Log Entry 259657 added for thread 1 sequence 47277 ID 0x354620c2 dest 1: Archived Log Entry 259658 added for thread 1 sequence 47278 ID 0x354620c2 dest 1: Archived Log Entry 259659 added for thread 1 sequence 47279 ID 0x354620c2 dest 1: Tue Sep 24 22:28:39 2013 Archived Log Entry 259660 added for thread 1 sequence 47280 ID 0x354620c2 dest 1: Archived Log Entry 259661 added for thread 1 sequence 47281 ID 0x354620c2 dest 1: Archived Log Entry 259662 added for thread 1 sequence 47282 ID 0x354620c2 dest 1: Tue Sep 24 22:29:39 2013 Thread 1 advanced to log sequence 47284 (LGWR switch) Current log# 8 seq# 47284 mem# 0: +data/q9db/onlinelog/group_8.1242.818697417 Tue Sep 24 22:31:18 2013 Thread 1 advanced to log sequence 47285 (LGWR switch) Current log# seq# 47285 mem# 0: +data/q9db/onlinelog/group_16.1884.827003545 Thread 1 advanced to log sequence 47286 (LGWR switch) Current log# seq# 47286 mem# 0: +data/q9db/onlinelog/group_17.1885.827003587 |
Node 2nd log
A large number of IPC Send timeout appear in Node 2
The code is as follows |
Copy Code |
Tue Sep 24 15:22:19 2013 IPC Send timeout detected. Sender:ospid 4008 [Oracle@q9db02.800best.com (PING)] ..... Tue Sep 18:51:55 2013 IPC Send timeout detected. Sender:ospid 4008 [Oracle@q9db02.800best.com (PING)] Tue Sep 18:57:54 2013 IPC Send timeout detected. Sender:ospid 4008 [Oracle@q9db02.800best.com (PING)] Receiver:inst 1 binc 464003926 ospid 1566 Tue SEP 24 19:0 3:57 2013 IPC Send timeout detected. Sender:ospid 4008 [Oracle@q9db02.800best.com (PING)] Receiver:inst 1 binc 464003926 ospid 1566 Tue SEP 24 19:0 9:53 2013 IPC Send timeout detected. Sender:ospid 4008 [Oracle@q9db02.800best.com (PING)] ..... Tue Sep 20:22:00 2013 IPC Send timeout detected. Sender:ospid 4008 [Oracle@q9db02.800best.com (PING)] |
Node 1 because the hang can not be archived, Node 2 will then hang live. When the node 1hang live to two nodes to do systemstate dump, using ass to analyze the node 1 and Node 2 records are as follows:
Node 1
The code is as follows |
Copy Code |
393:waiting for ' log file switch (archiving needed) ' 394:waiting for ' log file switch (archiving needed) ' Cmd:insert 395:waiting for ' log file switch (archiving needed) ' Cmd:insert 397:waiting for ' log file switch (archiving needed) ' Cmd:insert 398:waiting for ' log file switch (archiving needed) ' Cmd:insert 451:waiting for ' sql*net Message from client ' 469:waiting for ' log file switch (archiving needed) ' Cmd:insert 470:waiting for ' log file switch (archiving needed) ' Cmd:insert 471:waiting for ' log file switch (archiving needed) ' Cmd:insert 618:waiting for ' log file switch (archiving needed) ' Cmd:insert 626:waiting for ' log file switch (archiving needed) ' Cmd:insert
NO BLOCKING Processes FOUND |
Node 2
The code is as follows |
Copy Code |
515:waiting for ' GC buffer busy acquire ' Cmd:insert 516:waiting for ' GC buffer busy acquire ' Cmd:insert 517:waiting for ' GC buffer busy acquire ' Cmd:insert 518:waiting for ' GC buffer busy acquire ' Cmd:insert 519:waiting for ' GC buffer busy acquire ' Cmd:insert 520:waiting for ' GC buffer busy acquire ' Cmd:select 521:waiting for ' GC current request ' Cmd:insert 522:waiting for ' Enq:tx-row lock contention ' [Enq tx-00ba0020-001e3e3c] Cmd:select 523:waiting for ' GC buffer busy acquire ' Cmd:insert 524:waiting for ' sql*net Message from client ' 525:waiting for ' GC buffer busy acquire ' Cmd:insert 526:waiting for ' GC buffer busy acquire ' Cmd:insert 527:waiting for ' Enq:tx-row lock contention ' [Enq tx-00ba0020-001e3e3c] Cmd:select 528:waiting for ' sql*net Message from client ' 529:waiting for ' GC buffer busy acquire ' Cmd:select
Resource Holder State Enq tx-0005001e-0022374f 223:waiting for ' GC current request ' Enq tx-0047001b-002bceb2 247:waiting for ' GC current request ' Enq tx-015b001e-000041ff 330:waiting for ' GC current request ' Enq TX-00010010-002EA7CD 179:waiting for ' GC current request ' Enq tx-00ba0020-001e3e3c??? Blocker
Object Names ~~~~~~~~~~~~ Enq tx-0005001e-0022374f Enq TX-0047001B-002BCEB2 Enq tx-015b001e-000041ff Enq TX-00010010-002EA7CD Enq tx-00ba0020-001e3e3c |
From here, we can see that many of the transactions of Node 2 hang because of the request of GC current requests, which is because Node 1 cannot be archived, some blocks are not transferred to Node 2 normally, causing node 2 to be hang here, and then the IPC Send Timeout Node 1 is blocked or even hang to live because it cannot be archived. The question to locate is why Node 1 cannot be archived
Continue profiling Node 1 alert log
The code is as follows |
Copy Code |
Tue Sep 24 15:18:20 2013 Opidrv aborting process O000 ospid (7332) as a result of ORA-28 Immediate Kill session#: 1904, serial#: 1065 Immediate Kill session:sess:0x24a2522a38 OS pid:7338 Immediate Kill session#: 3597, serial#: 11107 Immediate Kill session:sess:0x24c27cf498 OS pid:7320 Tue Sep 24 15:18:23 2013 Opidrv aborting process W000 ospid (7980) as a result of ORA-28 Tue Sep 24 15:18:23 2013 Opidrv aborting process W001 ospid (8560) as a result of ORA-28 Tue Sep 24 15:18:35 2013 lgwr:detected ARCH Process Failure lgwr:detected ARCH Process Failure lgwr:detected ARCH Process Failure lgwr:detected ARCH Process Failure Lgwr:starting ARCH Processes Tue Sep 24 15:18:35 2013 ARC0 started with pid=66, OS id=10793 Tue Sep 24 15:18:35 2013 Errors in FILE/U01/APP/ORACLE/DIAG/RDBMS/Q9DB/Q9DB1/TRACE/Q9DB1_NSA2_12635.TRC: Ora-00028:your session has been killed Lns:failed to archive log 8 thread 1 sequence 47156 (28) Arc0:archival started Lgwr:starting ARCH Processes COMPLETE Thread 1 advanced to log sequence 47157 (LGWR switch) Arc0:starting ARCH Processes Current log# 9 seq# 47157 mem# 0: +data/q9db/onlinelog/group_9.1241.818697421 Tue Sep 24 15:18:36 2013 ARC1 started with pid=81, OS id=10805 Tue Sep 24 15:18:36 2013 ARC2 started with pid=84, OS id=10807 Tue Sep 24 15:18:36 2013 ARC3 started with pid=87, OS id=10809 Arc1:archival started Arc2:archival started Arc2:becoming the ' no FAL ' ARCH Arc2:becoming the ' no SRL ' ARCH Arc1:becoming the Heartbeat ARCH Error 1031 received logging on to the standby PING[ARC1]: Heartbeat failed to connect to standby ' Q9ADGDG '. Error is 1031. Arc3:archival started Arc0:starting ARCH Processes COMPLETE Archived Log Entry 259135 added for thread 1 sequence 47156 ID 0x354620c2 dest 1: Error 1031 received logging on to the standby Fal[server, ARC3]: Error 1031 Creating remote Archivelog file ' Q9ADGDG ' Fal[server, ARC3]: FAL archive failed, and then trace file. Arch:fal Archive failed. Archiver continuing ORACLE Instance q9db1-archival Error. Archiver continuing. Tue Sep 24 15:18:46 2013 Opidrv aborting process O001 ospid (9605) as a result of ORA-28 Tue Sep 24 15:18:46 2013 Opidrv aborting process O000 ospid (10813) as a result of ORA-28 Tue Sep 24 15:18:46 2013 Immediate Kill session#: 2909, serial#: 369 Immediate Kill session:sess:0x24226c7200 OS pid:9091 Immediate Kill session#: 3380, serial#: 30271 Immediate Kill session:sess:0x2422782c58 OS pid:10265 Immediate Kill session#: 3597, serial#: 11109 Immediate Kill session:sess:0x24c27cf498 OS pid:10267 Tue Sep 24 15:20:14 2013 Restarting dead background process DIAG Tue Sep 24 15:20:14 2013 DIAG started with pid=64, OS id=20568 Restarting dead background process PING Tue Sep 24 15:20:14 2013 PING started with pid=68, OS id=20570 Restarting dead background process LMHB Tue Sep 24 15:20:14 2013 LMHB started with pid=70, OS id=20572 Restarting dead background process SMCO ............ Tue Sep 24 15:23:13 2013 arc0:detected ARCH Process Failure Tue Sep 24 15:23:13 2013 Thread 1 advanced to log sequence 47158 (LGWR switch) Current log# seq# 47158 mem# 0: +data/q9db/onlinelog/group_10.1240.818697423 Arc0:starting ARCH Processes Arc0:starting ARCH Processes COMPLETE Arc0:becoming the Heartbeat ARCH Arch:archival stopped, error occurred. would continue retrying ORACLE Instance Q9db1-archival Error Ora-00028:your session has been killed |
View the ARCN process
The code is as follows |
Copy Code |
[Oracle@q9db01 ~]$ Ps-ef|grep Ora_ar Oracle 20718 12870 0 22:07 pts/14 00:00:00 grep ora_ar [Oracle@q9db01 ~]$ Ps-ef|grep Ora_ar Oracle 25998 12870 0 22:07 pts/14 00:00:00 grep ora_ar |
Basically, because the client's system starts at 15:15 because of the middleware exception, which causes a large number of sessions to connect to the database, then the DBA, in order to prevent other businesses from being affected, then starts to kill a lot of system processes by using alter system killing sessions, Includes the ARCN (0,1,2,3) process, where ARCN processes fail to start properly for some reason, redo cannot be archived, all redo groups are filled with the system that is hang, and the system has caused the instance itself to be abnormal due to a large number of kill sessions ( Normal situation ARCN after the process kill automatically restart), processing scheme: First increase redo group with the time manual archiving, waiting for business low peak Restart Node 1, solve the problem