【轉自mos文章】資料庫 hang問題的診斷資訊收集方法

來源:互聯網
上載者:User

標籤:

資料庫 hang問題的診斷資訊收集方法

來源於:
How to Collect Diagnostics for Database Hanging Issues (文檔 ID 452358.1)


適用於:
Oracle Database - Enterprise Edition - Version 9.0.1.0 and later
Oracle Database - Personal Edition - Version 9.0.1.0 and later
Oracle Database - Standard Edition - Version 9.0.1.0 and later
Information in this document applies to any platform.

目標:
當一個資料庫hang時,為了確定hang的根本原因,進行資料庫的資訊收集是很用的。hang的根本原因經常被隔離並通過診斷資訊收集來解決。
作為一個可選的方案,如果不可行,我們可以使用收集到的資訊來防止將來再次方式此類事件。

Performance Service Request Diagnostic Collection (SRDC) documents

Service Request Data Collection (SRDC) documents have been specially designed to provide the reader with the necessary instructions to provide a step by step guide to collecting information for a various common Database Performance Issues.

Document 1938786.1 List of Available Database Performance Related SRDC Documents

 

What is needed to diagnose "Database Hang" issues?  為了診斷“database hang”問題, 什麼是需要做的?

 

資料庫hang的特指是:許多進程在等待一些其他activities 完成。典型情況是:有一個或者多個阻塞者,這些阻塞者被卡住(stuck)或者 正在努力工作並且沒有快速釋放資源。
為了判斷這個問題,需要做下面的診斷:
A.Hanganalyze and Systemstate Dumps
B.AWR/Statspack snapshots of General database performance
C.Up to date RDA

注意:多租戶資料庫中的Hang如果你正在運行多租戶資料庫,那麼你需要確定你遇到的hang問題是container層級的還是一個特定的pluggable database(PDB)層級的。Once this is established then connect to and collect the diagnostics only within the PDB that is experiencing the issue. 如果不明確在哪裡hang住了,那麼從診斷的觀點來說,最好是串連到Root container中收集診斷資訊,以便於所有PDB中的所有的processes被覆蓋到,而不會忽略掉有用的資訊。但是,如果你有大量的PDB,並且只是他們其中的一個 hang住了,那麼這回導致收集到很多不相關的資料。有了這個想法,請付出每一個努力以辨別是哪個database hang住了,並在那個hang住的database中收集資訊。

A. Dumps and Traces
Hanganalyze and Systemstate Dumps

Hanganalyze 和 Systemstate Dumps 提供了在某一個特定的時間點上,資料庫裡邊進程的資訊。Hanganalyze 提供了捲入到hang chain中的所有的process資訊,而systemstate 提供了資料庫裡邊所有的process的資訊。
當尋找一個潛在的hang situation的時候,你需要確定一個process是否被卡住(stuck)或者 緩慢的移動。通過連續收集兩個間隔的dump,這個可以被看出來。
如果一個process 被卡住了,這些traces會提供資訊以啟動更進一步的診斷,並協助提供解決方案。
  Hanganalyze 是一個概括,將會確認(confirm)db是否是真的hang住,還是只是慢(slow),並提供一個連續的snapshot
  Systemstate dump 顯示了資料庫中的process正在幹些什麼。
 
收集Hanganalyze and Systemstate Dumps

登陸進入系統:
使用下列命令登陸:

sqlplus '/ as sysdba'

若是如上方式串連資料庫有問題,那麼在10gR2以及更高版本中,可以使用 sqlplus preliminary connection方式:

sqlplus -prelim '/ as sysdba'

注意:從11.2.0.2開始到更高的版本,sqlplus "preliminary connection" 下的hanganalyze 不會產生輸出,因為,hanganalyze需要 a process state object and a session state object。
如果  hanganalyze被嘗試執行,雖然hanganalyze 表面上是成功的:

SQL>  oradebug hanganalyze 3Statement processed.

tracefile將會包括下面的輸出:

HANG ANALYSIS:ERROR: Can not perform hang analysis dump without a process state object and a session state object.( process=(nil), sess=(nil) )

For more about connecting with a preliminary connection, see:

Document 986640.1 How To Connect Using A Sqlplus Preliminary Connection

 

非rac時, Hanganalyze and Systemstate的收集命令

 

有些時候, database 實際上是很慢,並不是真正的hang。
It is therefore recommended,  where possible to get 2 hanganalyze and 2 systemstate dumps in order to determine whether processes are moving at all or whether they are "frozen".

Hanganalyze

sqlplus '/ as sysdba'oradebug setmypidoradebug unlimitoradebug hanganalyze 3-- Wait one minute before getting the second hanganalyzeoradebug hanganalyze 3oradebug tracefile_nameexit


 

Systemstate

sqlplus '/ as sysdba'oradebug setmypidoradebug unlimitoradebug dump systemstate 266oradebug dump systemstate 266oradebug tracefile_nameexit

 

rac時, Hanganalyze and Systemstate的收集命令

 

若是沒有打相關的patch,會有兩個影響rac的bug,make using level 266 or 267 very costly。因此,沒有打這兩個補丁的話,不推薦使用這些level

有關這些補丁的資訊,請見:

Document 11800959.8 Bug 11800959 - A SYSTEMSTATE dump with level >= 10 in RAC dumps huge BUSY GLOBAL CACHE ELEMENTS - can hang/crash instances Document 11827088.8 Bug 11827088 - Latch 'gc element' contention, LMHB terminates the instance 

注意: 這兩個bug 在11.2.0.3中被fix掉了。

 

當bug 11800959 和 bug 11827088被fix掉時,收集rac的Hanganalyze and Systemstate:

 For 11g:sqlplus '/ as sysdba'oradebug setorapname recooradebug  unlimitoradebug -g all hanganalyze 3oradebug -g all hanganalyze 3oradebug -g all dump systemstate 266oradebug -g all dump systemstate 266exit

 

當bug 11800959 和 bug 11827088沒有被fix掉時,收集rac的Hanganalyze and Systemstate:

sqlplus '/ as sysdba'oradebug setorapname recooradebug unlimitoradebug -g all hanganalyze 3oradebug -g all hanganalyze 3oradebug -g all dump systemstate 258oradebug -g all dump systemstate 258exit


 

對於10g,用 oradebug setmypid  來替代oradebug setorapname reco

sqlplus '/ as sysdba'oradebug setmypidoradebug unlimitoradebug -g all hanganalyze 3oradebug -g all hanganalyze 3oradebug -g all dump systemstate 258oradebug -g all dump systemstate 258exit


 

在rac環境中,一個dump file 被建立在所有的rac執行個體上的diag trace file中。

解釋Hanganalyze and Systemstate Levels

Hanganalyze levels:
  Level 3: In 11g onwards, level 3 also collects a short stack for relevant processes in hang chain

Systemstate levels:
  Level 258 is a fast alternative but we‘d lose some lock element data
  Level 267 can be used if additional buffer cache / lock element data is needed with an understanding of the cost

其他方法:
If connection to the system is not possible in any form, then please refer to the following article which describes how to collect systemstates in that situation:

Document 121779.1 Taking a SYSTEMSTATE dump when you cannot CONNECT to Oracle.

On RAC Systems, hanganalyze, systemstates and some other RAC information can be collected using the ‘racdiag.sql‘ script, see:

Document 135714.1 Script to Collect RAC Diagnostic Information (racdiag.sql)

Sometimes you may wish to collect systemstate information at the time that a particular error occurs. This can be done by setting an event in the session or system wide to trigger based upon the detection of an error. For example, if a hang was being encountered that was related to an ORA-00054 error, then you could capture a systemstate when the ORA-00054 occurs using the following command:

sqlplus '/ as sysdba'ALTER SYSTEM SET events '54 trace name systemstate level 258';

The next time an ORA-00054 is encountered, a systemstate will be dumped.

The tracing can be disabled with :

ALTER SYSTEM SET events '54 trace name context off';

You can also set such events in the spfile. See:

Document 160178.1 How To Set EVENTS In The SPFILE 


Note: be aware that this will produce a trace for every occurrence of the error.


v$wait_chain
從11gR1開始,dia0後台進程會啟動收集hanganalyze information 並把這些資訊存在記憶體中的 "hang analysis cache"區中。
Oracle每3秒做一次本地 hanganalyze information ,每10秒做一次 global (RAC) hanganalyze information.
當hang正在發生的時候,這個資訊可以提供一個對hang chain occurring的quick view

更多資訊,請見:

Document 1428210.1 Troubleshooting Database Contention With V$Wait_Chains


B. Provide AWR/Statspack snapshots of General database performance


Hangs are a visible effect of a number of potential causes, this can range from a single process issue to something brought on by a global problem.

Collecting information about the general performance of the database in the build up to, during and after the problem is of primary importance since these snapshots can help to determine the nature of the load on the database at these times and can provide vital diagnostic information. This may prove invaluable in identifying the area of the problem and ultimately resolving the issue.

To do this, please take and upload snapshot reports of database performance (AWR (or statspack) reports) immediately before, during and after the hang..
Please refer to the following article for details of what to collect:

Document 781198.1 Diagnostics for Database Performance Issues

 

C. Gather an up to date RDA

An up to date current RDA provides a lot of additional information about the configuration of the database and performance metrics and can be examined to spot background issues that may impact performance.

See the following note on My Oracle Support:

Document 314422.1 Remote Diagnostic Agent (RDA) 4 - Getting Started


 

Proactive Methods to gather information on a Hanging System

On some systems a hang can occur when the DBA is not available to run diagnostics or at times it may be too late to collect the relevant diagnostics. In these cases, the following methods may be used to gather diagnostics:

  • As an alternative to the manual collection method notes above, it is also possible to use the HANGFG script as described in the following note to collect the information:
    Document 362094.1 HANGFG User GuideAdditionally, this script can collect information with lower impact on the target database.
  • LTOM
    The Lite Onboard Monitor (LTOM) is a java program designed as a real-time diagnostic platform for deployment to a customer site.LTOM proactively provides real-time automatic problem detection and data collection.
    For more information see:
    Document 352363.1 LTOM - The On-Board Monitor User Guide
  • Procwatcher
    Procwatcher is a tool that examines and monitors Oracle database and/or clusterware processes at a specific interval
    The following notes explain how to use Procwatcher:
    Document 459694.1 Procwatcher: Script to Monitor and Examine Oracle DB and Clusterware Processes
    Document 1352623.1 How To Troubleshoot Database Contention With Procwatcher
  • OSWatcher contains a built in analyzer that allows the data that has been collected to be automatically analyzed, pro-actively looking for cpu, memory, io and network issues. It is recommended that all users install and run OSW since it is invaluable for looking at issues on the OS and has very little overhead. It can also be extremely useful for looking at OS performance degradation that may be seen when a hang situation occurs.

    Refer to the following for download, user guide and usage videos on OSWatcher:Document 301137.1 OSWatcher User Guide (Includes: [Video])

Oracle Enterprise Manager 12c Real-Time ADDM

Real-Time ADDM is a feature of Oracle Enterprise Manager Cloud Control 12c that allows you to analyze database performance automatically when you cannot logon to the database because it is hung or performing very slowly due to a performance issue. It analyzes current performance when database is hanging or running slow and reports sources of severe contention.

For more information see the following video:

Oracle Enterprise Manager 12c Real-Time ADDM

Retroactive Information Collection

Sometimes we may only notice a hang after it has occurred. In this case the following information may help with Root Cause Analysis:

  1. A series of AWR/Statspack reports leading up to and during the hang
  2. ASH reports - one can obtain more granular reports during the time of the hang - even up to
    one minute in time.
  3. Raw ASH information. This can be obtained by issuing an  ashdump trac. See:
    Document 243132.1 10g and above Active Session History (Ash) And Analysis Of Ash Online And Offline
    Document 555303.1 ashdump* scripts and post-load processing of MMNL traces
  4. Alert log and any traces created at time of hang
    On a RAC specifically check the following traces files as well: dia0, lmhb, diag and lmd0 traces
  5. RDA as above


 

【轉自mos文章】資料庫 hang問題的診斷資訊收集方法

聯繫我們

該頁面正文內容均來源於網絡整理,並不代表阿里雲官方的觀點,該頁面所提到的產品和服務也與阿里云無關,如果該頁面內容對您造成了困擾,歡迎寫郵件給我們,收到郵件我們將在5個工作日內處理。

如果您發現本社區中有涉嫌抄襲的內容,歡迎發送郵件至: info-contact@alibabacloud.com 進行舉報並提供相關證據,工作人員會在 5 個工作天內聯絡您,一經查實,本站將立刻刪除涉嫌侵權內容。

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.