一.故障處理過程
資料庫版本:Oracle10.2.0.5.4
作業系統: Redhat5.4
中午同事對錶添加欄位,然後造成70多個儲存等對象無效,系統不能正常運行,查看了一下同事添加欄位的表,並沒有鎖沒有釋放。
嘗試編譯無效對象,部分可以編譯,部分不能編譯,對象一直處於編譯狀態。Kill掉其他的session 後也是無法編譯,因為影響業務,決定重啟資料庫。
--關閉DB:
[oracle@qs-xezf-db1 ~]$ sqlplus / as sysdba
SQL*Plus: Release 10.2.0.5.0 - Productionon Thu Mar 29 13:23:14 2012
Copyright (c) 1982, 2010, Oracle. All Rights Reserved.
Connected to:
Oracle Database 10g Enterprise EditionRelease 10.2.0.5.0 - 64bit Production
With the Partitioning, OLAP, Data Miningand Real Application Testing options
SYS@xezf(qs-xezf-db1)> alter systemswitch logfile;
System altered.
SYS@xezf(qs-xezf-db1)> alter systemswitch logfile;
System altered.
SYS@xezf(qs-xezf-db1)> alter systemswitch logfile;
System altered.
--個人習慣,關閉db前先切換下日誌。
SYS@xezf(qs-xezf-db1)> shutdownimmediate
--一直掛在這
--跟蹤LOG:
[oracle@qs-xezf-db1 bdump]$ tail -falert_xezf.log
Stopping background process QMNC
Thu Mar 29 13:24:08 CST 2012
Stopping background process CJQ0
Thu Mar 29 13:24:10 CST 2012
Stopping background process MMNL
Thu Mar 29 13:24:11 CST 2012
Stopping background process MMON
License high water mark = 201
Thu Mar 29 13:24:12 CST 2012
Job queue slave processes stopped
Thu Mar 29 13:29:11 CST 2012
Active call for process 11403 user 'oracle'program 'oracle@qs-xezf-db1'
System State dumped to trace file/u01/app/oracle/admin/xezf/udump/xezf_ora_13569.trc
Thu Mar 29 13:29:29 CST 2012
SHUTDOWN: waiting for active calls tocomplete.
Thu Mar 29 13:44:15 CST 2012
MMNL absent for 1207 secs; Foregroundstaking over
--等了20分鐘的時間,最後還是掛住了,在這個狀態等了10分鐘,還是不能結束。
決定強制關閉DB:
[oracle@qs-xezf-db1 ~]$ sqlplus / as sysdba
SQL*Plus: Release 10.2.0.5.0 - Productionon Thu Mar 29 13:53:13 2012
Copyright (c) 1982, 2010, Oracle. All Rights Reserved.
Connected.
SQL> shutdown abort
ORACLE instance shut down.
--啟動DB:
SQL> startup
ORACLE instance started.
Total System Global Area 5251268608 bytes
Fixed Size 2103600 bytes
Variable Size 1275070160 bytes
Database Buffers 3959422976 bytes
Redo Buffers 14671872 bytes
Database mounted.
Database opened.
SQL>
--在編譯無效對象,系統復原正常。
二.分析過程2.1 分析system state dump
在關閉DB的時候,自動產生了一個system dump 檔案,使用ass.awk 分析:
Oracle 使用ass.awk 工具查看system state dump 說明
http://blog.csdn.net/tianlesoftware/article/details/7237729
[oracle@qs-xezf-db1 src]$ awk -f ass109.awk/u01/app/oracle/admin/xezf/udump/xezf_ora_13569.trc
Starting Systemstate 1
.............................................
Ass.Awk Version 1.0.9 - Processing/u01/app/oracle/admin/xezf/udump/xezf_ora_13569.trc
System State 1
~~~~~~~~~~~~~~~~
1:
2: waiting for 'pmon timer'
3: waiting for 'rdbms ipc message'
4: waiting for 'rdbms ipc message'
5: waiting for 'rdbms ipc message'
6: waiting for 'rdbms ipc message'
7: waiting for 'rdbms ipc message'
8: last wait for 'rdbms ipc message'
9: last wait for 'smon timer'
10: waiting for 'rdbms ipc message'
14:
15:
17: waiting for 'rdbms ipc message'
18: waiting for 'rdbms ipc message'
23: waiting for 'SQL*Net message fromclient'
Cmd: PL/SQL Execute
26: last wait for 'ksdxexeotherwait'
38: waiting for 'SQL*Net message fromclient'
Cmd: PL/SQL Execute
39: waiting for 'SQL*Net message fromclient'
40: waiting for 'SQL*Net message fromclient'
41: waiting for 'SQL*Net message fromclient'
75: waiting for 'SQL*Net message fromclient'
88: waiting for 'SQL*Net message fromclient'
Cmd: PL/SQL Execute
95: waiting for 'SQL*Net message fromclient'
96: waiting for 'SQL*Net message fromclient'
97: waiting for 'SQL*Net message fromclient'
99: waiting for 'SQL*Net message fromclient'
104:waiting for 'SQL*Net message fromclient'
Cmd: PL/SQL Execute
106:waiting for 'SQL*Net message fromclient'
107:waiting for 'SQL*Net message fromclient'
108:waiting for 'SQL*Net message fromclient'
117:waiting for 'SQL*Net message fromclient'
Cmd: Update
129:waiting for 'SQL*Net message fromclient'
Cmd: Update
130:waiting for 'SQL*Net message fromclient'
133:waiting for 'SQL*Net message fromclient'
138:waiting for 'SQL*Net message fromclient'
140:last wait for 'db file sequential read'(1,974b,1)
Cmd: Insert
154:waiting for 'SQL*Net message fromclient'
Cmd: PL/SQL Execute
159:waiting for 'SQL*Net message fromclient'
160:waiting for 'SQL*Net message fromclient'
162:waiting for 'SQL*Net message fromclient'
171:waiting for 'SQL*Net message fromclient'
196:waiting for 'SQL*Net message fromclient'
204:waiting for 'SQL*Net message fromclient'
Cmd: PL/SQL Execute
216:waiting for 'SQL*Net message fromclient'
Cmd: PL/SQL Execute
220:waiting for 'SQL*Net message from client'[Latchfor]
Blockers
~~~~~~~~
Above is a list of all the processes. If they are waiting for a resource
then it will be given in square brackets. Below is a summary of the
waited upon resources, together with the holder of that resource.
Notes:
~~~~~
o A process id of '???' implies that the holder was not found in the
systemstate.
Resource Holder State
Latch for ??? Blocker
Object Names
~~~~~~~~~~~~
Latch for SHUTDOWN: waiting foractive calls to complete.
4767 Lines Processed.
[oracle@qs-xezf-db1 src]$
2.2 分析:SHUTDOWN: waiting for active calls tocomplete.
在此次shutdown過程中,在這一步停了近15分鐘。
Thu Mar 29 13:29:11 CST 2012
Active call for process 11403 user 'oracle'program 'oracle@qs-xezf-db1'
System State dumped to trace file/u01/app/oracle/admin/xezf/udump/xezf_ora_13569.trc
Thu Mar 29 13:29:29 CST 2012
SHUTDOWN: waiting for active calls tocomplete.
Thu Mar 29 13:44:15 CST 2012
MMNL absent for 1207 secs; Foregroundstaking over
MOS上解釋如下:
Alert Log: Shutdown Waiting for ActiveCalls to Complete [ID 1039389.6]
2.2.1 Problem Description
You areattempting to shut down the database and the database hangs. The
alert log contains the following message:
--我們嘗試關閉DB時,資料庫hang住了,然後alert log 就會出現如下資訊:
SHUTDOWN: Waiting for active calls to complete
There are noother error messages in the alert log.
並且沒有其他的錯誤資訊了。
2.2.2 Solution Description: --解決方案描述
--------------------------------------------------------
Caveat: This note is intended for Unix platforms primarily, but we shouldaddress this issue on Windows platforms as well. On Windows, stopping the service is a validworkaround for these errors.
--------------------------------------------------------
警告:這個方式針對Unix 平台,但是windows 平台同樣適用。
Locate and kill any client connections tothe database at the Unix level, as follows:
解決方案是在Unix 系統層級尋找並kill 掉任何用戶端的串連,具體方法如下:
1. Locate any client connections to thedatabase using ps, and grep for any processes belonging to this .
Example: ps -ef | grep SID
2. Look for processes that include a'Local=No' designation.
Example:osupport 6235 1 0 Nov 24 0:01 oracleSID (LOCAL=NO)
更多內容,參考:
Oracle 伺服器 進程中的 LOCAL=NO 和 LOCAL=YES
http://blog.csdn.net/tianlesoftware/article/details/5675404
3. Kill the Unix process(es) with the'Local=No' designation.
--kill 掉Local=no 的進程
Example: Kill -9 6235
2.2.3 Explanation –造成這個現象的原因
The database iswaiting for pmon to clean up processes, but pmon is unable to clean them. Theclient connections to the server are causing the shutdown immediate or normalto hang. Killing them allows pmon toclean up and release the associated Oracle processes and resources.
--資料庫等待PMON 進程清理這些LOCAL=NO的進程,但是PMON 進程不能清理他們,這些進程就會導致資料庫shutdown 時hang住,kill 掉這些進程,然後讓PMON 來清理和釋放其他的資料庫進程和資源,完成資料庫關閉操作。
關於PMON進程的更多內容,參考:
Oracle 後台進程 說明
http://blog.csdn.net/tianlesoftware/article/details/5587788
What resources are we talking about?
--組成DB 關閉的資源套件括:
1) Any non committed transactions must berolled back
2) Any temporary space (sort segments /lobs / session temporary tables) must be freed
3) The session itself and any associatedmemory consumed by the session.
4) Internal locks / enqueues must becleaned up
Often Oracle(SMON or PMON depending on whether Shared Server is used) will wait for the OSto terminate the process(es) associated with the session. I the OS never returns, or fails to terminatethem, then the instance shutdown will hang with this message (Shutdown Waitingfor Active Calls to Complete)
Other meansexist to achieve a quick shutdown, as outlined in Note386408.1 - What Is The Fastest Way To Cleanly Shutdown An Oracle Database?
Note that inE-Business Environments, the same messages can be produced if the databaseshutdown process starts before the concurrent manager stops in the applicationserver .
2.3 分析:MMNL absent for 1207 secs; Foregroundstaking over
在我遇到的這次故障中,我等待了10分鐘,依舊掛在這,之後就shutdown Abort db了。
MOS 說明:
"MMNL absent for %u secs; Foregroundstaking over" Messages in Alert.log [ID 465891.1]
When there is ahang situation for a long period of time, oracle writes this message tothe alert.log: "MMNL absent for %u secs; Foregrounds takingover".
--當資料庫hang住一段時間後,Oracle 就會寫這個資訊到alert log。
Thu Nov 1 15:20:56 2007
MMNL absent for 3805 secs; Foregrounds taking over
Thu Nov 1 15:58:13 2007
Starting ORACLE instance (normal)
If this is ahang situation during shutdown (after MMNL is stopped) or during databaseopen (when MMNL is not yet started), messages are written to the alert.logevery 5 minutes.
--在資料庫關閉過程中(MMNL 已經停止)或者在資料庫open時(MMNL 還沒有啟動),這個訊息會每隔5分鐘寫入一次alert log。
Any completehang situations which take a long time trigger these messages.
2.3.1 Cause –導致原因
These messagesare for informational purposes only to indicate a hang situation which lasted along period of time and MMNL is unable to run its job at that moment. When thehang lasts for several hours before the instance is completely stopped/killed,the tail of the alert.log may flood with these messages. Although the messagetext contains MMNL process, in this case MMNL is actually the victim of thehang issue rather than the cause.
--這個資訊唯一的作用是用來說明資料庫已經hang住了,並且MMNL 也不能執行它的job。
2.3.2 Solution --解決方案
In case of ahang taking a very long time, there is no way to prevent these messages to bewritten to alert.log. Hang situation must be resolved to stop the messages.Starting with 11.1 version, messages are written to a trace file rather thanthe alert.log to prevent messages to be written to alert.log repeatedly manytimes.
--在某些情況下,可能會hang住很長時間,並且沒有方法阻止資料庫往alert log裡寫入這些資訊。從Oracle 11gR1開始,該訊息將寫入tracefile,而不是alert log。從而避免重複的往alert log裡寫資訊。
三.小結
在我這次遇到的故障,有2個問題:
3.1 添加欄位導導致對象無效
因為系統正在運行,修改的對象也是核心表,所以造成了很多問題,這個歸根到底是規範問題,不多說。這裡為了盡量減少時間,我重啟了DB,但是重啟過程導中導致資料庫hang住了。
這裡可選的操作有hanganalyze,system dump,檢查鎖,等待情況:
Oracleoradebug 命令 使用說明
http://blog.csdn.net/tianlesoftware/article/details/6525628
OracleHANGANALYZE 功能診斷 DBhanging
http://blog.csdn.net/tianlesoftware/article/details/6321961
Oracle 一次 鎖表 處理小記
http://blog.csdn.net/tianlesoftware/article/details/6679014
Oracle 查看 對象 持有鎖的情況
http://blog.csdn.net/tianlesoftware/article/details/6822321
3.2 db 關閉過程中hang
我在操作中採用的解決方案是shutdown abort,MOS上提供的解決方案是在作業系統層級kill 掉LOCAL=NO的進程,因為這些進程不能被PMON進程清理導致shutdown 過程hang住。
-------------------------------------------------------------------------------------------------------
著作權,文章允許轉載,但必須以連結方式註明源地址,否則追究法律責任!
Skype: tianlesoftware
Email: tianlesoftware@gmail.com
Blog: http://www.tianlesoftware.com
Weibo: http://weibo.com/tianlesoftware
Twitter: http://twitter.com/tianlesoftware
Facebook: http://www.facebook.com/tianlesoftware
Linkedin: http://cn.linkedin.com/in/tianlesoftware
-------加群需要在備忘說明Oracle資料表空間和資料檔案的關係,否則拒絕申請----
DBA1 群:62697716(滿); DBA2 群:62697977(滿) DBA3 群:62697850(滿)
DBA 超級群:63306533(滿); DBA4 群:83829929 DBA5群: 142216823
DBA6 群:158654907 DBA7 群:172855474 DBA總群:104207940