oracle資料庫TB層級的Exadata資料庫恢複

來源:互聯網
上載者:User

某客戶的exadata環境,由於大量硬體故障,導致資料庫宕機。經過分析確認是由於某個cell節點的某個disk出現異常,導致部分

cell節點的disk 離線後無法進行asm reblance;最終導致diskgroup 被強制dismount了。如下是alert log:

Fri Oct 09 15:44:26 2015
NOTE: process _user4655_+asm1 (4655) initiating offline of disk 46.3915916191 (DATA_KLYX_CD_10_KLYXCEL02) with mask 0x7e in group 1
NOTE: checking PST: grp = 1
ERROR: Disk 18 cannot be offlined, since all the disks [18, 49] with mirrored data would be offline.
ERROR: too many offline disks in PST (grp 2)
GMON checking disk modes for group 1 at 55 for pid 41, osid 4655
Fri Oct 09 15:44:26 2015
NOTE: cache dismounting (not clean) group 2/0x4AD8D253 (DBFS_DG)
ERROR: too many offline disks in PST (grp 1)
NOTE: checking PST for grp 1 done.
WARNING: Disk 46 (DATA_KLYX_CD_10_KLYXCEL02) in group 1 in mode 0x7f is now being taken offline on ASM inst 1
NOTE: initiating PST update: grp = 1, dsk = 46/0xe968239f, mode = 0x6a, op = 4
GMON updating disk modes for group 1 at 56 for pid 41, osid 4655
ERROR: Disk 46 cannot be offlined, since all the disks [46, 72] with mirrored data would be offline.
ERROR: too many offline disks in PST (grp 1)
WARNING: Offline of disk 46 (DATA_KLYX_CD_10_KLYXCEL02) in group 1 and mode 0x7f failed on ASM inst 1
WARNING: Disk 18 (DBFS_DG_CD_10_KLYXCEL02) in group 2 mode 0x7f offline is being aborted
Fri Oct 09 15:44:26 2015
NOTE: process _user4663_+asm1 (4663) initiating offline of disk 46.3915916191 (DATA_KLYX_CD_10_KLYXCEL02) with mask 0x7e in group 1
NOTE: checking PST: grp = 1
WARNING: Offline of disk 18 (DBFS_DG_CD_10_KLYXCEL02) in group 2 and mode 0x7f failed on ASM inst 1
Fri Oct 09 15:44:26 2015
NOTE: halting all I/Os to diskgroup 2 (DBFS_DG)
NOTE: unable to offline disks after getting write error for diskgroup DBFS_DG
NOTE: write to disk 32 succeeded
NOTE: disk 18 had IO error
NOTE: write to disk 6 succeeded
NOTE: halting all I/Os to diskgroup 1 (DATA_KLYX)
NOTE: LGWR doing non-clean dismount of group 1 (DATA_KLYX)
NOTE: LGWR sync ABA=21.5465 last written ABA 21.5465
GMON checking disk modes for group 1 at 57 for pid 44, osid 4663
ERROR: too many offline disks in PST (grp 1)
NOTE: checking PST for grp 1 done.

我們可以看到,cell 2節點的第10號disk存在異常,通過cellcli命令列也可以進一步確認:


CellCLI> list physicaldisk;
         20:0            E07Q5N          normal
         20:1            E05SMD          normal
         20:2            E07GV6          normal
         20:3            E060YD          normal
         20:4            E05ZQE          normal
         20:5            E07L9M          normal
         20:6            E07FED          normal
         20:7            E05N88          normal
         20:8            E08DY5          normal
         20:9            E06ADQ          normal
         20:10           E08E5C          critical
         20:11           E08DYF          normal
通過對該磁碟進行修複檢查,確實也發現了壞道,如下所示:


SEAGATE ST360057SSUN600G --> SEAGATE ST360057SSUN600G
Sector 0 --> Sector 0
1,172,123,568 Sectors
Sectors that could not be read:
102,578,137
102,578,139
102,578,140
102,578,141
102,578,142
102,578,149
102,578,152
......
1,172,123,561 sector(s) successfully copied.
7 bad source sectors encountered.
Corresponding destination sectors filled with: UNREADABLESECTOR

修複完畢後,成功mount diskgroup。但是最後檢查資料庫存在部分資料壞塊,如下所示:


SQL>  select * from  V$DATABASE_BLOCK_CORRUPTION;
 
     FILE#     BLOCK#     BLOCKS CORRUPTION_CHANGE# CORRUPTION_TYPE
---------- ---------- ---------- ------------------ -----------------
         1    5186816          3                  0 CORRUPT
         1    5186813          1                  0 CORRUPT
         1    3343734          2                  0 CORRUPT
        25    2391372          1         1.4561E+13 CORRUPT
        35     538916          1                  0 CORRUPT
        36     543804          3                  0 CORRUPT

對於資料壞塊的修複相對比較簡單,只是由於是asm環境處理起來相對麻煩一些,加上客戶沒有備份,所以無法進行blockcover。對於system的幾個資料壞塊,經過檢查發現為審計對象,因此通過truncate然後insert資料即可格式化掉;
另外第25號檔案的資料壞塊為邏輯壞塊,通過查詢確認為index,直接rebuild index即可。
至於說最後第第35,36號檔案,由於均為資料檔案,涉及的對象均為分區表的partiiton,因此處理相對麻煩一些。

這裡我通過建立測試表,然後清理掉資料,構造一個資料空塊,然後將該資料區塊複製到asm diskgroup中替換到這幾個壞塊即可,注意替換之前,需要修改資料壞塊的rdba地址和obj id資訊,如下是資料區塊的複製替換過程:


SQL> @copy_block_to_asm.sql
Enter value for file_with_patched_block: /u01/.../blk.dbf
old  15: v_FsFileName := '&file_with_patched_block';
new  15: v_FsFileName := '/u01/.../blk.dbf';
Enter value for file_to_patch_in_asm: +DATA_KLYX/xxxx1/datafile/epmrange107.dbf
old  16: v_AsmFileName := '&file_to_patch_in_ASM';
new  16: v_AsmFileName := '+DATA_KLYX/xxxx1/datafile/epmrange107.dbf';
Enter value for block_to_patch: 543804
old  17: v_offstart := '&block_to_patch';
new  17: v_offstart := '543804';
PL/SQL procedure successfully completed.

最後通過rman檢查,確認一切正常。如下是rman的檢測結果:


.......
RMAN> backup validate datafile 36;
 
Starting backup at 16-OCT-15
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=1337 instance=xxx1 device type=DISK
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00036 name=+DATA_KLYX/xxx1/datafile/epmrange107.dbf
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:15
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
36   OK     0              315258       3276800         14561048845049
  File Name: +DATA_KLYX/xxx1/datafile/epmrange107.dbf
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data       0              2926354
  Index      0              20
  Other      0              35168
......

聯繫我們

該頁面正文內容均來源於網絡整理,並不代表阿里雲官方的觀點,該頁面所提到的產品和服務也與阿里云無關,如果該頁面內容對您造成了困擾,歡迎寫郵件給我們,收到郵件我們將在5個工作日內處理。

如果您發現本社區中有涉嫌抄襲的內容,歡迎發送郵件至: info-contact@alibabacloud.com 進行舉報並提供相關證據,工作人員會在 5 個工作天內聯絡您,一經查實,本站將立刻刪除涉嫌侵權內容。

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.