This article shares a troubleshooting experience: an Oracle RAC database hung because the archiver (ARCn) processes were killed by mistake. Below I analyze the cause and offer a solution for reference.
The customer reported that the system was hanging and redo could not be archived, so urgent intervention and analysis were needed.
Node 1 log
With redo not being archived, the online redo logs filled up completely. After alter system archive log current was executed manually, the database began to archive all the pending redo logs, but subsequent redo still could not be archived. Once all the redo groups were full again, the database accumulated a large number of 'log file switch (archiving needed)' waits.
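To confirm this state, the archive status of each online redo group can be checked from v$log; a minimal sketch:

SELECT group#, thread#, sequence#, archived, status
  FROM v$log
 ORDER BY thread#, sequence#;

When every group shows ARCHIVED = NO and none is INACTIVE, the next log switch will hang on 'log file switch (archiving needed)'.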
The node 1 alert log shows:
Tue Sep 24 22:05:37 2013
Thread 1 advanced to log sequence 47282 (LGWR switch)
  Current log# 6 seq# 47282 mem# 0: +DATA/q9db/onlinelog/group_6.1244.818697409
Tue Sep 24 22:07:31 2013
ORACLE Instance q9db1 - Can not allocate log, archival required
Thread 1 cannot allocate new log, sequence 47283
All online logs needed archiving
  Current log# 6 seq# 47282 mem# 0: +DATA/q9db/onlinelog/group_6.1244.818697409
Tue Sep 24 22:28:17 2013
ALTER SYSTEM ARCHIVE LOG
Archived Log entry 259646 added for thread 1 sequence 47266 ID 0x354620c2 dest 1:
Tue Sep 24 22:28:18 2013
Thread 1 advanced to log sequence 47283 (LGWR switch)
  Current log# 7 seq# 47283 mem# 0: +DATA/q9db/onlinelog/group_7.1243.818697415
Archived Log entry 259647 added for thread 1 sequence 47267 ID 0x354620c2 dest 1:
Archived Log entry 259648 added for thread 1 sequence 47268 ID 0x354620c2 dest 1:
Archived Log entry 259649 added for thread 1 sequence 47269 ID 0x354620c2 dest 1:
Archived Log entry 259650 added for thread 1 sequence 47270 ID 0x354620c2 dest 1:
Archived Log entry 259651 added for thread 1 sequence 47271 ID 0x354620c2 dest 1:
Archived Log entry 259652 added for thread 1 sequence 47272 ID 0x354620c2 dest 1:
Tue Sep 24 22:28:28 2013
Archived Log entry 259653 added for thread 1 sequence 47273 ID 0x354620c2 dest 1:
Archived Log entry 259654 added for thread 1 sequence 47274 ID 0x354620c2 dest 1:
Archived Log entry 259655 added for thread 1 sequence 47275 ID 0x354620c2 dest 1:
Archived Log entry 259656 added for thread 1 sequence 47276 ID 0x354620c2 dest 1:
Archived Log entry 259657 added for thread 1 sequence 47277 ID 0x354620c2 dest 1:
Archived Log entry 259658 added for thread 1 sequence 47278 ID 0x354620c2 dest 1:
Archived Log entry 259659 added for thread 1 sequence 47279 ID 0x354620c2 dest 1:
Tue Sep 24 22:28:39 2013
Archived Log entry 259660 added for thread 1 sequence 47280 ID 0x354620c2 dest 1:
Archived Log entry 259661 added for thread 1 sequence 47281 ID 0x354620c2 dest 1:
Archived Log entry 259662 added for thread 1 sequence 47282 ID 0x354620c2 dest 1:
Tue Sep 24 22:29:39 2013
Thread 1 advanced to log sequence 47284 (LGWR switch)
  Current log# 8 seq# 47284 mem# 0: +DATA/q9db/onlinelog/group_8.1242.818697417
Tue Sep 24 22:31:18 2013
Thread 1 advanced to log sequence 47285 (LGWR switch)
  Current log# 16 seq# 47285 mem# 0: +DATA/q9db/onlinelog/group_16.1884.827003545
Thread 1 advanced to log sequence 47286 (LGWR switch)
  Current log# 17 seq# 47286 mem# 0: +DATA/q9db/onlinelog/group_17.1885.827003587
Node 2 log
A large number of IPC send timeouts appear on node 2.
The node 2 alert log shows:
Tue Sep 24 15:22:19 2013
IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)]
............
Tue Sep 24 18:51:55 2013
IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)]
Tue Sep 24 18:57:54 2013
IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)]
Receiver: inst 1 binc 464003926 ospid 1566
Tue Sep 24 19:03:57 2013
IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)]
Receiver: inst 1 binc 464003926 ospid 1566
Tue Sep 24 19:09:53 2013
IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)]
............
Tue Sep 24 20:22:00 2013
IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)]
Because node 1 hung on archiving, node 2 soon hung as well. While node 1 was hanging, we took systemstate dumps on both nodes and analyzed them with the ass script (ass109.awk); the records obtained for node 1 and node 2 are shown below.
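For reference, a systemstate dump is typically taken via oradebug; a minimal sketch (the dump level, 266 here, is a common choice but varies by need):

sqlplus / as sysdba
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug dump systemstate 266

The resulting trace file can then be summarized with ass109.awk.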
Node 1
The ass summary shows:
393: waiting for 'log file switch (archiving needed)'
394: waiting for 'log file switch (archiving needed)'      Cmd: Insert
395: waiting for 'log file switch (archiving needed)'      Cmd: Insert
397: waiting for 'log file switch (archiving needed)'      Cmd: Insert
398: waiting for 'log file switch (archiving needed)'      Cmd: Insert
451: waiting for 'SQL*Net message from client'
469: waiting for 'log file switch (archiving needed)'      Cmd: Insert
470: waiting for 'log file switch (archiving needed)'      Cmd: Insert
471: waiting for 'log file switch (archiving needed)'      Cmd: Insert
618: waiting for 'log file switch (archiving needed)'      Cmd: Insert
626: waiting for 'log file switch (archiving needed)'      Cmd: Insert

NO BLOCKING PROCESSES FOUND
Node 2
The ass summary shows:
515: waiting for 'gc buffer busy acquire'      Cmd: Insert
516: waiting for 'gc buffer busy acquire'      Cmd: Insert
517: waiting for 'gc buffer busy acquire'      Cmd: Insert
518: waiting for 'gc buffer busy acquire'      Cmd: Insert
519: waiting for 'gc buffer busy acquire'      Cmd: Insert
520: waiting for 'gc buffer busy acquire'      Cmd: Select
521: waiting for 'gc current request'      Cmd: Insert
522: waiting for 'enq: TX - row lock contention' [Enq TX-00BA0020-001E3E3C]      Cmd: Select
523: waiting for 'gc buffer busy acquire'      Cmd: Insert
524: waiting for 'SQL*Net message from client'
525: waiting for 'gc buffer busy acquire'      Cmd: Insert
526: waiting for 'gc buffer busy acquire'      Cmd: Insert
527: waiting for 'enq: TX - row lock contention' [Enq TX-00BA0020-001E3E3C]      Cmd: Select
528: waiting for 'SQL*Net message from client'
529: waiting for 'gc buffer busy acquire'      Cmd: Select

Resource Holder State
Enq TX-0005001E-0022374F    223: waiting for 'gc current request'
Enq TX-0047001B-002BCEB2    247: waiting for 'gc current request'
Enq TX-015B001E-000041FF    330: waiting for 'gc current request'
Enq TX-00010010-002EA7CD    179: waiting for 'gc current request'
Enq TX-00BA0020-001E3E3C    ??? Blocker

Object Names
~~~~~~~~~~~~
Enq TX-0005001E-0022374F
Enq TX-0047001B-002BCEB2
Enq TX-015B001E-000041FF
Enq TX-00010010-002EA7CD
Enq TX-00BA0020-001E3E3C
From this it is clear that many of the hung transactions on node 2 are waiting on 'gc current request': because node 1 could not archive, some blocks could not be shipped to node 2 normally, so node 2 stayed hung there and the IPC send timeouts appeared. The transaction blocking, and eventually the hang, on node 1 was caused by the archiving failure, so the question to answer is why node 1 could not archive.
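Incidentally, such waits can also be watched live across both instances from gv$session; a minimal sketch, restricted to the events seen above:

SELECT inst_id, sid, event, sql_id
  FROM gv$session
 WHERE event IN ('gc buffer busy acquire',
                 'gc current request',
                 'log file switch (archiving needed)');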
Continue analyzing the node 1 alert log.
The alert log shows:
Tue Sep 24 15:18:20 2013
opidrv aborting process O000 ospid (7332) as a result of ORA-28
Immediate Kill Session#: 1904, Serial#: 1065
Immediate Kill Session: sess: 0x24a2522a38  OS pid: 7338
Immediate Kill Session#: 3597, Serial#: 11107
Immediate Kill Session: sess: 0x24c27cf498  OS pid: 7320
Tue Sep 24 15:18:23 2013
opidrv aborting process W000 ospid (7980) as a result of ORA-28
Tue Sep 24 15:18:23 2013
opidrv aborting process W001 ospid (8560) as a result of ORA-28
Tue Sep 24 15:18:35 2013
LGWR: Detected ARCH process failure
LGWR: Detected ARCH process failure
LGWR: Detected ARCH process failure
LGWR: Detected ARCH process failure
LGWR: STARTING ARCH PROCESSES
Tue Sep 24 15:18:35 2013
ARC0 started with pid = 66, OS id = 10793
Tue Sep 24 15:18:35 2013
Errors in file /u01/app/oracle/diag/rdbms/q9db/q9db1/trace/q9db1_nsa2_12635.trc:
ORA-00028: your session has been killed
LNS: Failed to archive log 8 thread 1 sequence 47156 (28)
ARC0: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
Thread 1 advanced to log sequence 47157 (LGWR switch)
ARC0: STARTING ARCH PROCESSES
  Current log# 9 seq# 47157 mem# 0: +DATA/q9db/onlinelog/group_9.1241.818697421
Tue Sep 24 15:18:36 2013
ARC1 started with pid = 81, OS id = 10805
Tue Sep 24 15:18:36 2013
ARC2 started with pid = 84, OS id = 10807
Tue Sep 24 15:18:36 2013
ARC3 started with pid = 87, OS id = 10809
ARC1: Archival started
ARC2: Archival started
ARC2: Becoming the 'no FAL' ARCH
ARC2: Becoming the 'no SRL' ARCH
ARC1: Becoming the heartbeat ARCH
Error 1031 received logging on to the standby
PING[ARC1]: Heartbeat failed to connect to standby 'q9adgdg'. Error is 1031.
ARC3: Archival started
ARC0: STARTING ARCH PROCESSES COMPLETE
Archived Log entry 259135 added for thread 1 sequence 47156 ID 0x354620c2 dest 1:
Error 1031 received logging on to the standby
FAL[server, ARC3]: Error 1031 creating remote archivelog file 'q9adgdg'
FAL[server, ARC3]: FAL archive failed, see trace file.
ARCH: FAL archive failed. Archiver continuing
ORACLE Instance q9db1 - Archival Error. Archiver continuing.
Tue Sep 24 15:18:46 2013
opidrv aborting process O001 ospid (9605) as a result of ORA-28
Tue Sep 24 15:18:46 2013
opidrv aborting process O000 ospid (10813) as a result of ORA-28
Tue Sep 24 15:18:46 2013
Immediate Kill Session#: 2909, Serial#: 369
Immediate Kill Session: sess: 0x24226c7200  OS pid: 9091
Immediate Kill Session#: 3380, Serial#: 30271
Immediate Kill Session: sess: 0x2422782c58  OS pid: 10265
Immediate Kill Session#: 3597, Serial#: 11109
Immediate Kill Session: sess: 0x24c27cf498  OS pid: 10267
Tue Sep 24 15:20:14 2013
Restarting dead background process DIAG
Tue Sep 24 15:20:14 2013
DIAG started with pid = 64, OS id = 20568
Restarting dead background process PING
Tue Sep 24 15:20:14 2013
PING started with pid = 68, OS id = 20570
Restarting dead background process LMHB
Tue Sep 24 15:20:14 2013
LMHB started with pid = 70, OS id = 20572
Restarting dead background process SMCO
............
Tue Sep 24 15:23:13 2013
ARC0: Detected ARCH process failure
Tue Sep 24 15:23:13 2013
Thread 1 advanced to log sequence 47158 (LGWR switch)
  Current log# 10 seq# 47158 mem# 0: +DATA/q9db/onlinelog/group_10.1240.818697423
ARC0: STARTING ARCH PROCESSES
ARC0: STARTING ARCH PROCESSES COMPLETE
ARC0: Becoming the heartbeat ARCH
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance q9db1 - Archival Error
ORA-00028: your session has been killed
View ARCn Processes
Only the grep itself appears in the output, so no ARCn process is alive:
[oracle@q9db01 ~]$ ps -ef | grep ora_ar
oracle   20718  12870  0 00:00:00 pts/14  grep ora_ar
[oracle@q9db01 ~]$ ps -ef | grep ora_ar
oracle   25998  12870  0 00:00:00 pts/14  grep ora_ar
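The same check can be made from inside the database; a minimal sketch querying v$process by background process name:

SELECT pid, spid, pname
  FROM v$process
 WHERE pname LIKE 'ARC%';

No rows returned would confirm from the database side that the archiver processes are gone.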
It is basically clear here that the customer's system has been connected to the database since due to exceptions in the middleware program, resulting in a large number of sessions. In order to prevent other services from being affected, the dba, then, a large number of system processes, including the ARCn (,) process, were killed by mistake through alter system kill sessions. The subsequent ARCn process cannot be started properly for some reason, resulting in redo being unable to archive, all the redo groups are written into the system, that is, the hang. The system has caused the instance to be abnormal due to a large number of kill sessions (normally, the ARCn process will automatically restart after kill). solution: add a redo group to work with scheduled manual archiving, and wait for the business to restart Node 1 during off-peak hours to solve the problem.
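A minimal sketch of that workaround, assuming the same +DATA disk group; the group numbers and size here are illustrative, not taken from this system:

-- give LGWR more headroom before the next wrap of the redo groups
ALTER DATABASE ADD LOGFILE THREAD 1 GROUP 18 ('+DATA') SIZE 512M;
-- run periodically (manually or from a job) until node 1 can be restarted
ALTER SYSTEM ARCHIVE LOG CURRENT;

After the off-peak restart of node 1, the ARCn processes come back with the instance and normal automatic archiving resumes.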