hanganalyze 3 分析總結(測),

來源:互聯網
上載者:User

hanganalyze 3 分析總結(測),
hanganalyze 3 分析總結
一種是會話層級的:
ALTER SESSION SET EVENTS 'immediate trace name HANGANALYZE level <level>'; 

一種是執行個體層級:
ORADEBUG hanganalyze <level> 


一種是叢集範圍的:
ORADEBUG setmypid 
ORADEBUG setinst all 
ORADEBUG -g def hanganalyze <level> 


先解釋下各個level的含義: 

1-2:只有hanganalyze輸出,不dump任何進程
3:Level2+Dump出在IN_HANG狀態的進程   【最簡單,最容易判斷問題】
4:Level3+Dump出在等待鏈裡面的blockers(狀態為LEAF/LEAF_NW/IGN_DMP)
5:Level4+Dump出所有在等待鏈中的進程(狀態為NLEAF)

hanganalyze報告會分作許多片斷,會話片斷資訊總是由一個header詳盡描述被提取的的會話資訊。Oracle8i和9i的資訊略有不同:

sid是 Session ID
sess_srno是serial#
proc_ptr是Process Pointer
ospid 是OS Process ID
cnode是Node Id,Oracle9i才用
Nodenum是hanganalyze
自己為了記錄這些會話而定製的編號,從0開始排起。
State 是node的狀態
Adjlist是臨近的node(通常代表一個blocker node)
Predecessor是Predecessor node ,通常代表一個 waiter node 

 
接著解釋一下比較重要的一些node state:
IN_HANG:這表示該node處於死結狀態,通常還有其他node(blocker)也處於該狀態
LEAF/LEAF_NW:該node通常是blocker。通過條目的”predecessor”列可以判斷這個node是否是blocker。
LEAF說明該NODE沒有等待其他資源,而
LEAF_NW則可能是沒有等待其他資源或者是在使用CPU.
如下的執行個體說明了node16阻塞了node19的資源:
 
nodenum]/cnode/sid/sess_srno/session/
ospid/state/start/finish/[adjlist]/predecessor 
[16]/0/17/154/0x24617be0/26800/LEAF/29/30//19 
[19]/0/20/13/0x24619830/26791/NLEAF/33/34/[16]/186 


NLEAF:通常可以看作這些會話是被阻塞的資源。發生這種情況一般說明資料庫發生效能問題而不是資料庫hang
IGN/IGN_DMP:這類會話通常被認為是空閑會話,除非其adjlist列裡存在node。
如果是非空閑會話則說明其adjlist裡的node正在等待其他node釋放資源。
SINGLE_NODE/SINGLE_NODE_NW:近似於空閑會話
 
實戰hanganalyze:
 

測試案例

C:\Users\Administrator>SQLPLUS /  AS SYSDBA 
SQL*Plus: Release 10.2.0.4.0 - Production on Sat May 28 15:30:03 2011
Copyright (c) 1982, 2007, Oracle.  All Rights Reserved.
資料庫已經hang 了登入失敗下面來做個 systemdump 吧 看看是什麼原因


C:\Users\Administrator>sqlplus -prelim '/ as sysdba'
SQL*Plus: Release 10.2.0.4.0 - Production on Sat May 28 15:31:13 2011
Copyright (c) 1982, 2007, Oracle.  All Rights Reserved.
Enter password:
SQL> oradebug setmypid
Statement processed.
SQL> oradebug unlimit;
Statement processed.
SQL> oradebug dump systemstate 10
Statement processed.
SQL> oradebug tracefile_name
c:\oracle\product\10.2.0\admin\systex\udump\systex_ora_4636.trc
SQL> oradebug hanganalyze 3


Hang Analysis in c:\oracle\product\10.2.0\admin\systex\udump\systex_ora_6824.trc


先來看看 hanganalyze 的trace 
Chain 1 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/139/105/0x18263170/2932/latch: shared pool>   //// 看來是 139 阻塞了一下的會話了.
 -- <0/164/1/0x1825f030/4592/latch: library cache>
Other chains found:


Chain 2 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/135/1/0x18269350/3956/Streams AQ: waiting for time man>
Chain 3 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/142/2/0x18266260/3816/Streams AQ: qmn slave idle wait>
Chain 4 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/146/90/0x182641c0/5944/latch: shared pool>
Chain 5 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/151/9/0x18265a38/6060/Streams AQ: qmn coordinator idle>
Chain 6 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/159/1138/0x18262948/6100/latch: shared pool>
Chain 7 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/161/1/0x182608a8/4856/latch: shared pool>
Chain 8 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/162/1/0x18260080/5904/latch: shared pool>
OK ,我們在 systemdump 裡面搜尋  2932


O/S info: user: SYSTEM, term: WILLSON-LI, ospid: 2932 
    OSD pid info: Windows threadid: 2932, image: ORACLE.EXE (SHAD)
Dump of memory from 0x000007FF18249EC0 to 0x000007FF1824A0C8
7FF18249EC0 00000006 00000000 17615728 000007FF  [........(Wa.....]
7FF18249ED0 00000010 000313A7 183ADF60 000007FF  [........`.:.....]
7FF18249EE0 00000003 000313A7 1798B0E8 000007FF  [................]
7FF18249EF0 0000000B 000313A7 1836A1C0 000007FF  [..........6.....]
7FF18249F00 00000004 0003129B 16835088 000007FF  [.........P......]
7FF18249F10 00000007 000313A7 16836020 000007FF  [........ `......]
7FF18249F20 00000007 000313A7 00000000 00000000  [................]
7FF18249F30 00000000 00000000 00000000 00000000  [................]
        Repeat 24 times
7FF1824A0C0 00000000 00000000                    [........]        
    (FOB) flags=2 fib=000007FF169BB290 incno=0 pending i/o cnt=0
     fname=C:\ORACLE\PRODUCT\10.2.0\ORADATA\SYSTEX\CONTROL04.CTL
     fno=0 lblksz=16384 fsiz=430
    (FOB) flags=2 fib=000007FF169BB630 incno=0 pending i/o cnt=0
     fname=C:\ORACLE\PRODUCT\10.2.0\ORADATA\SYSTEX\SYSTEM01.DBF
     fno=1 lblksz=8192 fsiz=64000
    ----------------------------------------
    SO: 000007FF1836A1C0, type: 4, owner: 000007FF18263170, flag: INIT/-/-/0x00
    (session) sid: 139 trans: 000007FF16ADE418, creator: 000007FF18263170, flag: (100041) USR/- BSY/-/-/-/-/-
              DID: 0001-0010-00000029, short-term DID: 0000-0000-00000000
              txn branch: 0000000000000000
              oct: 0, prv: 0, sql: 0000000000000000, psql: 000007FF17EABB40, user: 0/SYS
    service name: SYS$USERS
    O/S info: user: WILLSON-LI\Administrator, term: WILLSON-LI, ospid: 5588:5372, machine: WORKGROUP\WILLSON-LI
              program: sqlplus.exe
    application name: sqlplus.exe, hash value=254292535
   waiting for 'latch: shared pool' blocking sess=0x0000000000000000 seq=273 wait_time=0 seconds since wait started=129
                address=c972f38, number=d6, tries=0    ////--- 在等待 latch    c972f38
    Dumping Session Wait History
     for 'SQL*Net message from client' count=1 wait_time=61829702
                driver id=42455100, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=3
                driver id=42455100, #bytes=1, =0
     for 'ksdxexeotherwait' count=1 wait_time=2
                =0, =0, =0
     for 'ksdxexeotherwait' count=1 wait_time=2
省略... 一直在等待 ksdxexeotherwait
---我們再搜尋  c972f38
SO: 000007FF18263170, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
  (process) Oracle pid=16, calls cur/top: 000007FF183ADF60/000007FF183ADF60, flag: (0) -
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 0
              last post received-location: No post
              last process to post me: none
              last post sent: 0 0 0
              last post sent-location: No post
              last process posted by me: none
    (latch info) wait_event=0 bits=20
        Location from where call was made: kghalo: 
      waiting for c972f38 Child shared pool level=7 child#=1 
        Location from where latch is held: kghupr1: Chunk Header
        Context saved from call: 343209960
        state=busy, wlstate=free
          waiters [orapid (seconds since: put on list, posted, alive check)]:
           11 (283, 1306567910, 283)
           15 (268, 1306567910, 268)
           10 (268, 1306567910, 268)
           16 (129, 1306567910, 129)
           23 (120, 1306567910, 120)
           24 (108, 1306567910, 108)
           waiter count=6
          gotten 586960 times wait, failed first 238 sleeps 8
          gotten 0 times nowait, failed: 0
      on wait list for c972f38----------------------------------------------
     holding    (efd=7) 15e479a0 Child library cache level=5 child#=2 
        Location from where latch is held: kglobpn: child:: latch
        Context saved from call: 1
        state=busy, wlstate=free


可以看出來 SO : 000007FF18263170  holding 了  Child library cache  在等待  Child shared pool


SQL> oradebug poke 0xc972f38  4 0
BEFORE: [00C972F38, 00C972F3C) = 00000001
AFTER:  [00C972F38, 00C972F3C) = 00000000

其他:

1、10G以前,使用DBX,GDB之類的工具:
  找到一個前台進程的PID(ps -ef|grep LOCAL),然後用gdb $ORACLE_HOME/bin/oracle進行跟蹤
  oracle@gurufl02:~> ps -ef|grep LOCAL
  oracle   26358 26357  0 09:52 ?        00:00:00 oracleora9 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
  oracle   27848 27810  0 10:41 pts/3    00:00:00 grep LOCAL
  oracle@gurufl02:~> gdb $ORACLE_HOME/bin/oracle 26358


  然後執行:print ksudss(10)
  ( gdb) print ksudss(10)
  [Switching to Thread 4154533552 (LWP 26358)]
  warning: Unable to restore previously selected frame.
  $1 = 181773020
  Current language:  auto; currently asm
  這個時候在UDUMP或者BDUMP下會產生一個TRACE檔案。裡麵包含SYSTEM STATE DUMP
  除了通過print ksudss(10)進行systemstate dump,還可以進行下面的dump
  print ksdhng(3,1,0) 相當於oradebug hanganalyze 3
  print ksudps(10) 相當於oradebug dump processstate 10
  print curdmp() 相當於oradebug call curdmp(也就是oradebug dump cursordump)
  print ksdtrc(4) 相當於oradebug dump events 4(這裡參數表示level,1–session,2–process,4–system)
  print ksdsel(10046,12) –相當於為attach的進程設定10046事件level 12
  print skdxipc() –相當於oradebug ipc
  print skdxprst() –相當於oradebug procstat
  當然如果能用oradebug,就應該使用oradebug,畢竟方便得多,也更安全。


2、10G下sqlplus -prelim
如果10G,可以使用sqlplus -prelim選項強制登入
export ORACLE_SID=ora9                              
sqlplus -prelim '/ as sysdba'
oradebug setmypid
oradebug unlimit;
oradebug dump systemstate 10


3、如果上述都不行,可以使用pstack把堆棧資訊取出來,另外也可以使用truss, tusc,strace 之類的工具看看到底在等什麼
資料庫HANG住是計較頭痛的事情,如何找到HANG住的原因,是DBA必須面臨的課題




相關文章

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.