Oracle RAC (ASM)恢複案例(超大資料庫)

來源:互聯網
上載者:User

前幾天某客戶的一個資料庫出現故障,需要我們緊急救援支援。瞭解了一下環境,著實也嚇了一跳,資料量55TB左右。首先我們來看看故障的資訊:
 

Fri Mar 25 22:57:10 2016
Errors in file /oracle/app/oracle/diag/rdbms/njsdb/njsdb1/trace/xxxx1_dia0_30474350.trc  (incident=640081):
ORA-32701: Possible hangs up to hang ID=80 detected
Incident details in: /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/incident/incdir_640081/xxxx1_dia0_30474350_i640081.trc
DIA0 requesting termination of session sid:10347 with serial # 16419 (ospid:29884706) on instance 2
    due to a LOCAL, HIGH confidence hang with ID=80.
    Hang Resolution Reason: Although the number of affected sessions did not
    justify automatic hang resolution initially, this previously ignored
    hang was automatically resolved.
DIA0: Examine the alert log on instance 2 for session termination status of hang with ID=80.
Fri Mar 25 22:59:26 2016
minact-scn: useg scan erroring out with error e:12751
Suspending MMON action 'Block Cleanout Optim, Undo Segment Scan' for 82800 seconds
Fri Mar 25 22:59:35 2016
Sweep [inc][640081]: completed
Sweep [inc2][640081]: completed
Fri Mar 25 22:59:57 2016
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_dbw7_27263388.trc:
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 78: Connection timed out
Additional information: -1
Additional information: 8192
WARNING: Write Failed. group:3 disk:35 AU:241007 offset:262144 size:8192
Fri Mar 25 22:59:57 2016
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_ora_10289540.trc:
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 78: Connection timed out
Additional information: 7
Additional information: 41229344
Additional information: -1
WARNING: Read Failed. group:3 disk:35 AU:20131 offset:540672 size:8192
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_dbw7_27263388.trc:
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 78: Connection timed out
Additional information: -1
Additional information: 8192
WARNING: Write Failed. group:3 disk:35 AU:241007 offset:237568 size:8192
Fri Mar 25 22:59:57 2016
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_dbw3_30212292.trc:
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 78: Connection timed out
Additional information: -1
Additional information: 8192
。。。。。。
WARNING: failed to write mirror side 1 of virtual extent 5096 logical extent 0 of file 583 in group 3 on disk 35 allocation unit 241007
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_dbw3_30212292.trc:
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 78: Connection timed out
Additional information: -1
Additional information: 8192
。。。。。。
ORA-15080: synchronous I/O operation to a disk failed
WARNING: failed to write mirror side 1 of virtual extent 5096 logical extent 0 of file 583 in group 3 on disk 35 allocation unit 241007
WARNING: failed to write mirror side 1 of virtual extent 5002 logical extent 0 of file 585 in group 3 on disk 35 allocation unit 242538
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_dbw3_30212292.trc:
ORA-15080: synchronous I/O operation to a disk failed

從前面的日誌可以看出,該資料庫節點從25號22:57開始報錯,開始可能是出現了部分session hung的情況,接著出現了寫失敗的操作。而其中寫失敗的是第35個磁碟。
當然,這裡僅僅是一個warning,因此我們還不能判斷是磁碟是否有問題。
後面我們跟客戶瞭解,當時的現象應該是儲存鏈路出現了異常,導致資料庫IO出現異常。這也符合之前的現象描述。
那麼我們進一步分析後面客戶的操作,看看之前他們都進行了哪些相關的操作?
 

Sat Mar 26 01:13:51 2016
ALTER DATABASE OPEN
This instance was first to open
Abort recovery for domain 0
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_ora_27853116.trc:
ORA-01113: file 763 needs media recovery
ORA-01110: data file 763: '+DATA/xxxx/datafile/ts_icp_bill.1028.881083603'
ORA-1113 signalled during: ALTER DATABASE OPEN...
Sat Mar 26 01:14:14 2016
......
......
Sat Mar 26 02:02:14 2016
ALTER DATABASE RECOVER  database
Media Recovery Start
 started logmerger process
Sat Mar 26 02:02:18 2016
WARNING! Recovering data file 763 from a fuzzy backup. It might be an online
backup taken without entering the begin backup command.
。。。。。
WARNING! Recovering data file 779 from a fuzzy backup. It might be an online
backup taken without entering the begin backup command.
Sat Mar 26 02:02:18 2016
。。。。。。
Sat Mar 26 02:04:15 2016
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_pr00_36110522.trc:
ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '+DATA/xxxx/datafile/system.261.880958693'
Slave exiting with ORA-1547 exception
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_pr00_36110522.trc:
ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '+DATA/xxxx/datafile/system.261.880958693'
ORA-1547 signalled during: alter database recover cancel...
Sat Mar 26 02:04:43 2016
 
我們可以看到,客戶進行了正常的alter database open,但是Oracle提示有部分檔案需要recover。那麼進行recover database操作呢,則提示有部分檔案可能來自fuzzy backup.
這是什麼意思呢? 這其實是說這幾個檔案的檢查點比較舊,需要很早之前的日誌來進行recover。
由於這是一個非歸檔的資料庫,因此很可能有問題的這幾個檔案需要日誌已經被覆蓋。
通過比較scn,我們可以判斷這幾個檔案需要的redo資訊已經被覆蓋了。這裡我要提醒大家的是,不要僅僅只看alert log就輕易下判斷。
僅僅看alert log我們可能認為只有幾個檔案問題。後續我想,如果是僅僅有幾個檔案有問題,那麼我跳過這部分檔案進行recover 不就行了嗎? 因為這樣可以實現資料的最大程度恢複。
於是我執行了下面的命令:
 

run {
set until scn 14586918070973;
recover database skip forever tablespace ts_icp_bill,ts_icp_bill_idx,ts_wj_bill,ts_wj_bill_idx,ts_js_bill;
}
上面這個命令,其實是比較致命的,因為Oracle 會將這裡skip的資料表空間裡面的檔案全部進行offline drop。
所以這裡其實上述的做法是有些欠妥的。
我進一步根據檔案的scn和v$log的scn 資訊進行比較,發現其實有605個檔案可能都需要進行recover;因為全庫已經有2000個左右的資料檔案。
這裡我根據scn進行大致判斷然後產生2個指令碼進行檔案層級的recover,大致擷取指令碼如下:


spool recover1.sh
set pagesize 160 long 9999
select 'recover datafile '||file#||';'  from v$datafile_header where checkpoint_change# < xxx;
spool off
 
通過將其他能夠進行正常recover的檔案進行恢複之後,嘗試開啟資料庫。居然能夠正常open資料庫。有些人可能已經到此結束了吧,其實並不然。
大家想一下?雖然資料庫開啟了,我們不能正常recover的605個資料檔案中可能還有部分資料檔案狀態是recover,也就是還不是online的狀態。
這種情況之下,業務是無法訪問的。實際上我這裡查了一下,大概有540個檔案狀態仍然是recover。因此我們現在還需要想辦法怎麼去講這部分檔案online。
處理方法其實並不難,比如通過bbed簡單修改下資料檔案頭的checkpoint資訊,就可以完成了。但是有540個檔案,而且都是ASM環境。
這個修改的工作量可想而已。後面再產生一個指令碼,將資料庫啟動到mount狀態,然後將之前狀態為recover的檔案全部online。
然後進行recover database using backup controlfile操作。接著直接進行alter database open resetlogs。
遺憾的是沒有能夠直接開啟資料庫,報了一個如下的錯誤,該錯誤很常見,mos有問題也提到,可能跟temp有關係。
 

  
 
 
這裡我這裡直接將tempfile 進行drop,然後再重建控制檔案,進行recover後,居然直接開啟資料庫了。
檢查alert log,我發現還存在一個如下的錯誤:
 


Recovery of Online Redo Log: Thread 1 Group 1 Seq 1 Reading mem 0
  Mem# 0: +DATA/xxxx/onlinelog/group_1.2177.907527869
  Mem# 1: +FRA/xxxx/onlinelog/group_1.263.907527881
Block recovery completed at rba 1.46.16, scn 3396.1209160607
ORACLE Instance xxxx1 (pid = 26) - Error 607 encountered while recovering transaction (12, 33) on object 143154.
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_smon_10289168.trc:
ORA-00607: Internal error occurred while making a change to a data block
ORA-00600: internal error code, arguments: [6856], [0], [1], [], [], [], [], [], [], [], [], []
Sat Mar 26 19:05:43 2016
Dumping diagnostic data in directory=[cdmp_20160326190543], requested by (instance=1, osid=10289168 (SMON)), summary=[incident=2400209].
Starting background process QMNC
Sat Mar 26 19:05:44 2016
QMNC started with pid=40, OS id=17432578
LOGSTDBY: Validating controlfile with logical metadata
LOGSTDBY: Validation complete
Sat Mar 26 19:05:45 2016
Sweep [inc][2400209]: completed
Sweep [inc2][2400209]: completed
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_smon_10289168.trc  (incident=2400210):
ORA-00600: internal error code, arguments: [6856], [0], [13], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/incident/incdir_2400210/xxxx1_smon_10289168_i2400210.trc
Dumping diagnostic data in directory=[cdmp_20160326190545], requested by (instance=1, osid=10289168 (SMON)), summary=[abnormal process termination].
Starting background process CJQ0
Sat Mar 26 19:05:46 2016
CJQ0 started with pid=43, OS id=11010116
Sat Mar 26 19:05:47 2016
db_recovery_file_dest_size of 2047887 MB is 0.21% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Block recovery from logseq 1, block 60 to scn 14586918097855
Recovery of Online Redo Log: Thread 1 Group 1 Seq 1 Reading mem 0
  Mem# 0: +DATA/xxxx/onlinelog/group_1.2177.907527869
  Mem# 1: +FRA/xxxx/onlinelog/group_1.263.907527881
Dumping diagnostic data in directory=[cdmp_20160326190547], requested by (instance=1, osid=10289168 (SMON)), summary=[incident=2400210].
Block recovery completed at rba 1.91.16, scn 3396.1209160640
ORACLE Instance xxxx1 (pid = 26) - Error 607 encountered while recovering transaction (15, 3) on object 143247.
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_smon_10289168.trc:
ORA-00607: Internal error occurred while making a change to a data block
ORA-00600: internal error code, arguments: [6856], [0], [13], [], [], [], [], [], [], [], [], []
Starting background process SMCO
Dumping diagnostic data in directory=[cdmp_20160326190549], requested by (instance=1, osid=10289168 (SMON)), summary=[abnormal process termination].
Sat Mar 26 19:05:49 2016
SMCO started with pid=46, OS id=2949376
Setting Resource Manager plan SCHEDULER[0x3198]:DEFAULT_MAINTENANCE_PLAN via scheduler window
Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter
Sat Mar 26 19:05:49 2016
 
很明顯,上述錯誤是指smon進程在進行事務恢複時,發現有2個事務無法進行恢複。
看到上述的錯誤,或許有人會說可能是undo出現損壞,導致無法進行事務恢複。實際上這裡並不是,我通過dbv檢查發現undo檔案都是完好的。
無論怎講,這裡要解決這個問題,相對簡單,定位到是什麼對象,重建就好。

聯繫我們

該頁面正文內容均來源於網絡整理,並不代表阿里雲官方的觀點,該頁面所提到的產品和服務也與阿里云無關,如果該頁面內容對您造成了困擾,歡迎寫郵件給我們,收到郵件我們將在5個工作日內處理。

如果您發現本社區中有涉嫌抄襲的內容,歡迎發送郵件至: info-contact@alibabacloud.com 進行舉報並提供相關證據,工作人員會在 5 個工作天內聯絡您,一經查實,本站將立刻刪除涉嫌侵權內容。

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.