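ASM heartbeat timeout detection: Delayed ASM PST heart beats
Recently I have received a run of ASM disk group dismount cases, all with the error "Waited 15 secs for write IO to PST". This is a heartbeat timeout check specific to ASM: the ASM instance periodically checks whether each ASM disk responds normally. So I decided to write a short summary of the issue.
The document ASM diskgroup dismount with "Waited 15 secs for write IO to PST" (Doc ID 1581684.1) contains the following description: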
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Generally this kind messages comes in ASM alertlog file on below situations,
Delayed ASM PST heart beats on ASM disks in normal or high redundancy diskgroup,
thus the ASM instance dismount the diskgroup.By default, it is 15 seconds.
By the way the heart beat delays are sort of ignored for external redundancy diskgroup.
ASM instance stop issuing more PST heart beat until it succeeds PST revalidation,
but the heart beat delays do not dismount external redundancy diskgroup directly.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
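The description above boils down to the following points:
1. The ASM instance periodically checks the state of the disks in every disk group to confirm that communication is normal;
2. A delayed heartbeat dismounts only normal and high redundancy disk groups; for external redundancy the delay does not directly dismount the disk group;
3. The default timeout is 15 seconds: if the disks have still not responded to the ASM instance after 15 seconds, the disk group is dismounted.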
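The three points above can be sketched as a small decision function. This is only a toy model of the behavior described in the quoted note, not ASM's actual implementation: the function name `pst_heartbeat_outcome`, its return strings, and the choice to treat a delay equal to the timeout as expired are all assumptions made for illustration.

```python
def pst_heartbeat_outcome(redundancy: str, delay_secs: float,
                          timeout_secs: float = 15.0) -> str:
    """Toy model of the behavior described in Doc ID 1581684.1.

    delay_secs stands for how long the PST heartbeat write has been
    outstanding for the disk group (a simplification; hypothetical).
    """
    redundancy = redundancy.lower()
    if redundancy not in ("external", "normal", "high"):
        raise ValueError(f"unknown redundancy: {redundancy!r}")
    if delay_secs < timeout_secs:
        # The heartbeat was acknowledged in time.
        return "ok"
    if redundancy == "external":
        # Per the note: no direct dismount, but no further heartbeats
        # are issued until PST revalidation succeeds.
        return "heartbeat paused until PST revalidation"
    # Normal or high redundancy: the delay leads to a dismount.
    return "dismount"


print(pst_heartbeat_outcome("normal", 20))
print(pst_heartbeat_outcome("external", 20))
```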
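The customers who hit this problem all use Fibre Channel network storage, and the error is triggered when the storage network has problems. In other words, when ASM issues its periodic check, a disk that does not respond within 15 seconds is considered inaccessible.
I tried to reproduce the error in a test environment, but because the test environment is a VMware virtual machine, removing a disk at the physical level does not trigger it. The reason is that once a disk on the same host is removed abnormally, ASM's read operations immediately return an OS-level I/O error, so there is no wait for the "Waited 15 secs for write IO to PST" timeout.
So my conclusion is that this error only occurs with shared ASM disks that are not local to the physical host but sit on a storage network, where the checks ASM sends out are not answered in time. The cause may be the storage host, the storage network, or even the storage disks. Anyway, from ASM's point of view, it did not receive the acknowledgment it needs, so it considers the disk faulty; if enough disks are faulty to affect data integrity, ASM dismounts the disk group.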
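When narrowing down which disks are slow to respond, it can help to count these warnings per disk group and disk. The sketch below is a minimal example under assumptions: the "disk N in group M" suffix in the pattern and the sample lines are illustrative rather than copied from a real alert log, and `count_pst_waits` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed message shape, built around the error text quoted in this post.
PST_WAIT = re.compile(
    r"Waited (\d+) secs for write IO to PST disk (\d+) in group (\d+)"
)


def count_pst_waits(lines):
    """Count delayed PST heartbeat warnings per (group, disk) pair."""
    counts = Counter()
    for line in lines:
        match = PST_WAIT.search(line)
        if match:
            _secs, disk, group = match.groups()
            counts[(int(group), int(disk))] += 1
    return counts


# Illustrative sample lines, not taken from a real system.
sample = [
    "WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.",
    "WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.",
    "WARNING: Waited 15 secs for write IO to PST disk 1 in group 2.",
    "NOTE: some unrelated alert log line",
]

for (group, disk), hits in sorted(count_pst_waits(sample).items()):
    print(f"group {group} disk {disk}: {hits} delayed heartbeat(s)")
```

In practice the lines would come from the ASM alert log file, whose location depends on the installation.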
According to Doc ID 1581684.1, the "Waited 15 secs for write IO to PST" message appears from 11.2.0.3.0 onward. The document also describes how to change the timeout manually through the parameter _asm_hbeatiowait:
alter system set "_asm_hbeatiowait"=<value> scope=spfile sid='*';
<ASM/CRS must be restarted for the change to take effect.>
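As a small convenience, the statement above can be generated with the value filled in, so that the quoting of the hidden parameter name is not mistyped. This is only a sketch: `hbeatiowait_statement` is a hypothetical helper, and the check for a positive integer is an assumption, since the valid range of the parameter is not stated here.

```python
def hbeatiowait_statement(seconds: int) -> str:
    """Build the ALTER SYSTEM statement shown above for a given timeout."""
    # Assumption: only positive whole seconds make sense; bool is rejected
    # explicitly because it is a subclass of int in Python.
    if isinstance(seconds, bool) or not isinstance(seconds, int) or seconds <= 0:
        raise ValueError("seconds must be a positive integer")
    return f'alter system set "_asm_hbeatiowait"={seconds} scope=spfile sid=\'*\';'


print(hbeatiowait_statement(120))
```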
To confirm that this parameter first appeared in 11.2.0.3, I queried every database version I had; the details are below:
======================10.2=====================
SQL> select * from v$version;
BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - Prod
PL/SQL Release 10.2.0.5.0 - Production
CORE 10.2.0.5.0 Production
TNS for Linux: Version 10.2.0.5.0 - Production
NLSRTL Version 10.2.0.5.0 - Production
SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%asm%' order by ksppinm;
hidden parameter value
-------------------------------------------------------------------------------- ----------
_asm_acd_chunks 1
_asm_allow_only_raw_disks TRUE
_asm_allow_resilver_corruption FALSE
_asm_ausize 1048576
_asm_blksize 4096
_asm_direct_con_expire_time 120
_asm_disk_repair_time 14400
_asm_droptimeout 60
_asm_emulmax 10000
_asm_emultimeout 0
_asm_fob_tac_frequency 3
hidden parameter value
-------------------------------------------------------------------------------- ----------
_asm_instlock_quota 0
_asm_kfdpevent 0
_asm_libraries ufs
_asm_maxio 1048576
_asm_skip_resize_check FALSE
_asm_stripesize 131072
_asm_stripewidth 8
_asm_wait_time 18
_asmlib_test 0
_asmsid asm
21 rows selected.
======================11.2.0.1=====================
$ sqlplus / as sysdba
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%asm_hb%' order by ksppinm;
hidden parameter value
--------------------------------------------------------------------------------
_asm_hbeatwaitquantum 2
======================11.2.0.2=====================
$ sqlplus / as sysdba
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Oracle Label Security, OLAP, Data Mining
and Real Application Testing options
SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%asm_hb%' order by ksppinm;
hidden parameter value
--------------------------------------------------------------------------------
_asm_hbeatwaitquantum 2
The parameter appears only from 11.2.0.3.0 onward; in other words, the ASM instance's disk timeout check was introduced in 11.2.0.3.
======================11.2.0.3=====================
sys@R11203> select * from v$version;
BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%asm_hb%' order by ksppinm;
hidden parameter value
-------------------------------------------------- --------------------
_asm_hbeatiowait 15
_asm_hbeatwaitquantum 2
======================11.2.0.4=====================
SQL> select * from v$version;
BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - Production
SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%asm_hb%' order by ksppinm;
hidden parameter value
-------------------------------------------------------------------------------- ---------
_asm_hbeatiowait 15 <<<<<<<<<<<<<<<<<<<<
_asm_hbeatwaitquantum 2
======================12.1.0.1=====================
$ sqlplus / as sysdba
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%asm_hb%' order by ksppinm;
hidden parameter value
--------------------------------------------------------------------------------
_asm_hbeatiowait 15
_asm_hbeatwaitquantum 2
Starting with 12.1.0.2, the default value of this parameter was raised to 120 seconds.
======================12.1.0.2=====================
$ sqlplus / as sysdba
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%asm_hb%' order by ksppinm;
hidden parameter value
--------------------------------------------------------------------------------
_asm_hbeatiowait 120
_asm_hbeatwaitquantum 2
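The query results above can be collected into a small lookup for quick reference. The values are only the ones observed in the output shown in this post (with `None` where the parameter was not listed); `observed_default` is a hypothetical helper name, and versions not tested here are deliberately rejected rather than guessed.

```python
# Default of _asm_hbeatiowait as observed in the output in this post.
# None means the parameter did not appear in that version's listing.
OBSERVED_HBEATIOWAIT = {
    "10.2.0.5": None,
    "11.2.0.1": None,
    "11.2.0.2": None,
    "11.2.0.3": 15,
    "11.2.0.4": 15,
    "12.1.0.1": 15,
    "12.1.0.2": 120,
}


def observed_default(version: str):
    """Return the observed default for a 4-part or 5-part version string."""
    key = ".".join(version.split(".")[:4])
    if key not in OBSERVED_HBEATIOWAIT:
        raise KeyError(f"version {version} was not tested in this post")
    return OBSERVED_HBEATIOWAIT[key]


for v in ("11.2.0.2.0", "11.2.0.4.0", "12.1.0.2.0"):
    print(v, observed_default(v))
```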
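I hope this summary is helpful. In day-to-day work I often find that a problem looks simple but I am not sure about it; after testing, I write it down for future reference.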