現在它還提供了新的低成本1/8機架配置。 藉助Oracle Exadata,客戶可通過整合降低IT成本,使儲存能力增加10倍,提升所有應用的效能,因消除系統整合的試錯過程而加快上市速度,以及即時做出更好的業務決策
某客戶的exadata環境,由於大量硬體故障,導致資料庫宕機。經過分析確認是由於某個cell節點的某個disk出現異常,導致部分
cell節點的disk 離線後無法進行asm reblance;最終導致diskgroup 被強制dismount了。如下是alert log
Fri Oct 09 15:44:26 2015
NOTE: process _user4655_+asm1 (4655) initiating offline of disk 46.3915916191 (DATA_KLYX_CD_10_KLYXCEL02) with mask 0x7e in group 1
NOTE: checking PST: grp = 1
ERROR: Disk 18 cannot be offlined, since all the disks [18, 49] with mirrored data would be offline.
ERROR: too many offline disks in PST (grp 2)
GMON checking disk modes for group 1 at 55 for pid 41, osid 4655
Fri Oct 09 15:44:26 2015
NOTE: cache dismounting (not clean) group 2/0x4AD8D253 (DBFS_DG)
ERROR: too many offline disks in PST (grp 1)
NOTE: checking PST for grp 1 done.
WARNING: Disk 46 (DATA_KLYX_CD_10_KLYXCEL02) in group 1 in mode 0x7f is now being taken offline on ASM inst 1
NOTE: initiating PST update: grp = 1, dsk = 46/0xe968239f, mode = 0x6a, op = 4
GMON updating disk modes for group 1 at 56 for pid 41, osid 4655
ERROR: Disk 46 cannot be offlined, since all the disks [46, 72] with mirrored data would be offline.
ERROR: too many offline disks in PST (grp 1)
WARNING: Offline of disk 46 (DATA_KLYX_CD_10_KLYXCEL02) in group 1 and mode 0x7f failed on ASM inst 1
WARNING: Disk 18 (DBFS_DG_CD_10_KLYXCEL02) in group 2 mode 0x7f offline is being aborted
Fri Oct 09 15:44:26 2015
NOTE: process _user4663_+asm1 (4663) initiating offline of disk 46.3915916191 (DATA_KLYX_CD_10_KLYXCEL02) with mask 0x7e in group 1
NOTE: checking PST: grp = 1
WARNING: Offline of disk 18 (DBFS_DG_CD_10_KLYXCEL02) in group 2 and mode 0x7f failed on ASM inst 1
Fri Oct 09 15:44:26 2015
NOTE: halting all I/Os to diskgroup 2 (DBFS_DG)
NOTE: unable to offline disks after getting write error for diskgroup DBFS_DG
NOTE: write to disk 32 succeeded
NOTE: disk 18 had IO error
NOTE: write to disk 6 succeeded
NOTE: halting all I/Os to diskgroup 1 (DATA_KLYX)
NOTE: LGWR doing non-clean dismount of group 1 (DATA_KLYX)
NOTE: LGWR sync ABA=21.5465 last written ABA 21.5465
GMON checking disk modes for group 1 at 57 for pid 44, osid 4663
ERROR: too many offline disks in PST (grp 1)
NOTE: checking PST for grp 1 done.
我們可以看到,cell 2節點的第10號disk存在異常,通過cellcli命令列也可以進一步確認:
CellCLI> list physicaldisk;
20:0 E07Q5N normal
20:1 E05SMD normal
20:2 E07GV6 normal
20:3 E060YD normal
20:4 E05ZQE normal
20:5 E07L9M normal
20:6 E07FED normal
20:7 E05N88 normal
20:8 E08DY5 normal
20:9 E06ADQ normal
20:10 E08E5C critical
20:11 E08DYF normal
通過對該磁碟進行修複檢查,確實也發現了壞道,如下所示:
SEAGATE ST360057SSUN600G --> SEAGATE ST360057SSUN600G
Sector 0 --> Sector 0
1,172,123,568 Sectors
Sectors that could not be read:
102,578,137
102,578,139
102,578,140
102,578,141
102,578,142
102,578,149
102,578,152
......
1,172,123,561 sector(s) successfully copied.
7 bad source sectors encountered.
Corresponding destination sectors filled with: UNREADABLESECTOR
修複完畢後,成功mount diskgroup。但是最後檢查資料庫存在部分資料壞塊,如下所示:
SQL> select * from V$DATABASE_BLOCK_CORRUPTION;
FILE# BLOCK# BLOCKS CORRUPTION_CHANGE# CORRUPTION_TYPE
---------- ---------- ---------- ------------------ -----------------
1 5186816 3 0 CORRUPT
1 5186813 1 0 CORRUPT
1 3343734 2 0 CORRUPT
25 2391372 1 1.4561E+13 CORRUPT
35 538916 1 0 CORRUPT
36 543804 3 0 CORRUPT
對於資料壞塊的修複相對比較簡單,只是由於是asm環境處理起來相對麻煩一些,加上客戶沒有備份,所以無法進行blockcover。對於system的幾個資料壞塊,經過檢查發現為審計對象,因此通過truncate然後insert資料即可格式化掉;
另外第25號檔案的資料壞塊為邏輯壞塊,通過查詢確認為index,直接rebuild index即可。
至於說最後第第35,36號檔案,由於均為資料檔案,涉及的對象均為分區表的partiiton,因此處理相對麻煩一些。
這裡我通過建立測試表,然後清理掉資料,構造一個資料空塊,然後將該資料區塊複製到asm diskgroup中替換到這幾個壞塊即可,注意替換之前,需要修改資料壞塊的rdba地址和obj id資訊,如下是資料區塊的複製替換過程:
SQL> @copy_block_to_asm.sql
Enter value for file_with_patched_block: /u01/.../blk.dbf
old 15: v_FsFileName := '&file_with_patched_block';
new 15: v_FsFileName := '/u01/.../blk.dbf';
Enter value for file_to_patch_in_asm: +DATA_KLYX/xxxx1/datafile/epmrange107.dbf
old 16: v_AsmFileName := '&file_to_patch_in_ASM';
new 16: v_AsmFileName := '+DATA_KLYX/xxxx1/datafile/epmrange107.dbf';
Enter value for block_to_patch: 543804
old 17: v_offstart := '&block_to_patch';
new 17: v_offstart := '543804';
PL/SQL procedure successfully completed.
最後通過rman檢查,確認一切正常。如下是rman的檢測結果:
.......
RMAN> backup validate datafile 36;
Starting backup at 16-OCT-15
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=1337 instance=xxx1 device type=DISK
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00036 name=+DATA_KLYX/xxx1/datafile/epmrange107.dbf
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:15
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
36 OK 0 315258 3276800 14561048845049
File Name: +DATA_KLYX/xxx1/datafile/epmrange107.dbf
Block Type Blocks Failing Blocks Processed
---------- -------------- ----------------
Data 0 2926354
Index 0 20
Other 0 35168
......