oracle恢複之11gR2 rac恢複案例

來源:互聯網
上載者:User


這是昨天節假如接到的某客戶的緊急救援資料恢複案例。大致的情況是由於掉電導致資料庫無法open。經過初步排查,確認資料庫版本為Oracle 11.2.0.3(linux RAC),資料量比較小,
大約200G左右。整個恢複過程開始看上去很順利,僅30分鐘就順利開啟了資料庫,後續發現其中確實有少坑,這裡跟大家簡單分享一下這個清明節加班的恢複case。
首先我們來看下資料庫無法open所報的錯誤是什嗎?


Sun Apr 03 20:55:36 2016
SMON: enabling cache recovery
ORA-01555 caused by SQL statement below (SQL ID: 4krwuz0ctqxdt, SCN: 0x0000.5edc85a7):
select ctime, mtime, stime from obj$ where obj# = :1
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_19990.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 29 with name "_SYSSMU29_3872709797$" too small
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_19990.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 29 with name "_SYSSMU29_3872709797$" too small
Error 704 happened during db open, shutting down database
USER (ospid: 19990): terminating the instance due to error 704
Instance terminated by USER, pid = 19990
ORA-1092 signalled during: alter database open...
opiodr aborting process unknown ospid (19990) as a result of ORA-1092
這個錯誤其實很常見,已經遇到很多次了,處理方式也不難;大致上有兩種.
1、通過10046 trace定位到有問題的資料區塊,然後手工去屏蔽事務;
2、推進資料庫SCN
這裡我選擇使用推進scn的方式來進行處理。
直接通過oradebug poke修改scn;第一次修改可能是增加的scn不夠大;第一次報錯一樣;第二次報錯改變了;變成我們更加熟悉的錯誤:
 

Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_23188.trc  (incident=2108431):
ORA-00600: internal error code, arguments: [2662], [0], [2200563965], [0], [2200568242], [20971648], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl1/incident/incdir_2108431/orcl1_ora_23188_i2108431.trc
Sun Apr 03 21:09:46 2016
Dumping diagnostic data in directory=[cdmp_20160403210946], requested by (instance=1, osid=23188), summary=[incident=2108431].
Sun Apr 03 21:09:46 2016
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_23188.trc:
ORA-00600: internal error code, arguments: [2662], [0], [2200563965], [0], [2200568242], [20971648], [], [], [], [], [], []
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_23188.trc:
ORA-00600: internal error code, arguments: [2662], [0], [2200563965], [0], [2200568242], [20971648], [], [], [], [], [], []
Error 600 happened during db open, shutting down database
USER (ospid: 23188): terminating the instance due to error 600
上述的這個錯誤處理方式其實也有2種,大致如下:
1、由於scn差距很小,因此直接適當推進scn即可。
2、bbed修改dba地址20971648 中的事務來繞過該錯誤。
很明顯,這裡我選擇第1種方法更簡單;這裡我再次修改scn,稍微增加大一點即可;很順利的開啟了資料庫。


SQL> startup mount pfile='/tmp/pfile.ora';
ORACLE instance started.
 
Total System Global Area 2.0243E+10 bytes
Fixed Size          2237088 bytes
Variable Size        7449087328 bytes
Database Buffers     1.2751E+10 bytes
Redo Buffers           41189376 bytes
Database mounted.
SQL> oradebug setmypid
Statement processed.
SQL> alter system set job_queue_processes=0;
 
System altered.
 
SQL> oradebug poke 0x060019598  4 0x832B8852
BEFORE: [060019598, 06001959C) = 00000000
AFTER:  [060019598, 06001959C) = 832B8852
SQL> alter database open;
 
Database altered.
 
SQL>

看上去整個恢複過程很簡單,也就不到半小時就開啟了資料庫。可是當我檢查資料庫檔案狀態時,整個資料庫一共有23個資料檔案,其中有11個資料檔案狀態為missing,
這也就是說都無法識別到資料庫檔案。實際上此時資料庫alert log中也在報如下的錯誤,告訴我們這部分資料檔案無法識別:


Sun Apr 03 21:26:09 2016
minact-scn: Inst 1 is now the master inc#:2 mmon proc-id:24523 status:0x7
minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x0000.00000000 gcalc-scn:0x0000.00000000
[24583] Successfully onlined Undo Tablespace 2.
Undo initialization finished serial:0 start:123081664 end:123083044 diff:1380 (13 seconds)
Dictionary check beginning
Tablespace 'NORMING_DATA' #10 found in data dictionary,
but not in the controlfile. Adding to controlfile.
Tablespace 'NORMING_TEMP' #11 found in data dictionary,
but not in the controlfile. Adding to controlfile.
Tablespace 'NORMINGTEST_TEMP' #12 found in data dictionary,
but not in the controlfile. Adding to controlfile.
Tablespace 'NORMINGTEST_DATA' #13 found in data dictionary,
but not in the controlfile. Adding to controlfile.
Tablespace 'NORMINGLJ_TEMP' #14 found in data dictionary,
but not in the controlfile. Adding to controlfile.
Tablespace 'NORMINGLJ_DATA' #15 found in data dictionary,
but not in the controlfile. Adding to controlfile.
Tablespace 'TABLESPACE_XYZH' #16 found in data dictionary,
but not in the controlfile. Adding to controlfile.
File #13 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00013' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #14 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00014' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #15 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00015' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #16 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00016' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #17 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00017' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #18 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00018' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #19 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00019' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #20 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00020' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #21 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00021' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #22 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00022' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #23 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00023' in the controlfile.
This file can no longer be recovered so it must be dropped.
Dictionary check complete


由於此時資料庫已經開啟了,因此為產生了一個重建控制檔案的指令碼,發現指令碼內容如下:
 

STARTUP NOMOUNT
CREATE CONTROLFILE REUSE DATABASE "ORCL" RESETLOGS  ARCHIVELOG
    MAXLOGFILES 192
    MAXLOGMEMBERS 3
    MAXDATAFILES 1024
    MAXINSTANCES 32
    MAXLOGHISTORY 9344
LOGFILE
  GROUP 1 '+DATA/orcl/onlinelog/group_1.273.850670135'  SIZE 50M BLOCKSIZE 512,
  GROUP 2 '+DATA/orcl/onlinelog/group_2.274.850670135'  SIZE 50M BLOCKSIZE 512
-- STANDBY LOGFILE
DATAFILE
  '+DATA/orcl/datafile/system.268.850670033',
  '+DATA/orcl/datafile/sysaux.269.850670033',
  '+DATA/orcl/datafile/undotbs1.270.850670033',
  '+DATA/orcl/datafile/users.271.850670033',
  '+DATA/orcl/datafile/undotbs2.276.850670237',
  '+DATA/orcl/datafile/datacenter',
  '+DATA/orcl/datafile/partner_platform',
  '+DATA/orcl/datafile/sw_portal',
  '+DATA/orcl/datafile/system.dbf',
  '+DATA/orcl/datafile/system_02.dbf',
  '+DATA/orcl/datafile/user_02.dbf',
  '+DATA/orcl/datafile/user_03.dbf',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00013',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00014',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00015',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00016',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00017',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00018',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00019',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00020',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00021',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00022',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00023'
CHARACTER SET ZHS16GBK;

實際上我問客戶,他們的反饋是之前由於控制檔案損壞,客戶也重建了控制檔案,進行了多次恢複,而且也進行了resetlogs操作。
從上面的資訊來看,不難看出客戶重建控制檔案的時候漏掉了11個資料檔案。由於這部分檔案的資訊在資料字典中存在,因此在open的時候Oracle 會自動進行offline drop。
或許有人要說,直接找到檔案然後重建控制檔案不就行了嗎?確實如此,然而實際上這裡卻並沒有這麼簡單。
我進入到asm磁碟組檢查檔案發現有幾個檔案名稱很奇怪,例如user_02.dbf 實際上link到了system,類似這樣的情況。
這種情況下極容易出錯。爭取的做法查詢dba_data_files進行資料檔案的挨個確認。
確認好asm磁碟組漏掉的4個檔案之後,還有7個檔案位於檔案系統中。全部添加到指令碼中進行建立時發現這些檔案和之前到檔案到resetlogs已經完全不同了。
其實建立控制檔案會報錯ora-01189。
因此這裡還必須手工去修改這11個資料檔案頭的resetlogs資訊;等我將resetlogs資訊全部修改完畢後,可以順利建立控制檔案。
但是當我進行reconver時卻發現需要之前等archivelog,進一步檢查發現歸檔日誌都全部被刪掉了。
因此最後還必須的再次修改這部分資料檔案的checkpoint資訊,將其改成與其他正常的檔案一致,最後可以順利開啟資料庫,
且檢查所有的資料庫檔案狀態均為online狀態,如下所示:

最後再將檔案系統的檔案遷移到asm磁碟組,然後添加redo資訊,啟動rac節點2.

聯繫我們

該頁面正文內容均來源於網絡整理,並不代表阿里雲官方的觀點,該頁面所提到的產品和服務也與阿里云無關,如果該頁面內容對您造成了困擾,歡迎寫郵件給我們,收到郵件我們將在5個工作日內處理。

如果您發現本社區中有涉嫌抄襲的內容,歡迎發送郵件至: info-contact@alibabacloud.com 進行舉報並提供相關證據,工作人員會在 5 個工作天內聯絡您,一經查實,本站將立刻刪除涉嫌侵權內容。

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.