OGG的extract進程checkpoint時間點回到1988-01-01 00:00:00故障處理,checkpoint
1、故障現象
Extract進程(SEXTR01)狀態為running,但是Lag at Chkpt卻達到5個多小時,且時間一直在增長,根本就不抽取新日誌,狀態資訊如下:
GGSCI (calladgdb) 21> info all Program Status Group Lag at Chkpt Time Since Chkpt MANAGER RUNNING EXTRACT RUNNING DPEYWGL 00:00:00 00:00:04 EXTRACT RUNNING SEXTR01 05:14:58 00:00:03 |
2、故障分析2.1 debug一下進程狀態資訊
GGSCI (calladgdb) 23> info sextr01, showch debug EXTRACT SEXTR01 Last Started 2015-01-25 22:51 Status RUNNING Checkpoint Lag 05:21:26 (updated 00:00:04 ago) Log Read Checkpoint Oracle Redo Logs 2015-01-25 18:05:16 Thread 1, Seqno 5554, RBA 611373536 SCN 3126.136656462 (13426204423758) Log Read Checkpoint Oracle Redo Logs 1988-01-01 00:00:00 Thread 2, Seqno 4016, RBA 1024 SCN 44527.110828288 (191242119617280) Current Checkpoint Detail: Read Checkpoint #1 …… Read Checkpoint #2 Oracle Threaded Redo Log Startup Checkpoint (starting position in the data source): Thread #: 2 Sequence #: 4015 RBA: 640984080 Timestamp: 2015-01-25 18:03:51.000000 SCN: 3126.136602577 (13426204369873) Redo File: +DGROUP1/calldb/onlinelog/group_15.267.799673791 Recovery Checkpoint (position of oldest unprocessed transaction in the data source): Thread #: 2 Sequence #: 4015 RBA: 640984080 Timestamp: 2015-01-25 18:03:51.000000 SCN: 3126.136602577 (13426204369873) Redo File: +DGROUP1/calldb/onlinelog/group_15.267.799673791 Current Checkpoint (position of last record read in the data source): Thread #: 2 Sequence #: 4016 RBA: 1024 Timestamp: 1988-01-01 00:00:00.000000 SCN: 44527.110828288 (191242119617280) Redo File: +DGROUP1/calldb/onlinelog/group_16.266.799673793 …… |
從這裡發現一個不可思議的問題:
Thread #: 2的current checkpoint 的timestamp竟然是1988-01-0100:00:00.000000,SCN的十進位數,竟然為191242119617280,遠遠大於當前資料庫的SCN,多出一位元。
時間倒退,SCN號反而大漲,時間還那麼有個性。這個問題挺有意思。
2.2 view extract進程的report
使用view report sextr01查看進程的報告,提示有ERROR,thread#2 4016 archive log 不能開啟。(由於筆者沒有將資訊複製下來,所以當時的狀態只能這麼描述了),但是4016號archivelog實際物理檔案是存在的,而且許可權正常。
2.3 將進程停止下來,showch其狀態
GGSCI (calladgdb) 24>stop sextr01 GGSCI (calladgdb) 25> info SEXTR01, showch EXTRACT SEXTR01 Last Started 2015-01-26 01:00 Status STOPPED Checkpoint Lag 06:56:55 (updated 00:00:58 ago) Log Read Checkpoint Oracle Redo Logs 2015-01-25 18:05:16 Thread 1, Seqno 5554, RBA 611373536 SCN 3126.136656462 (13426204423758) Log Read Checkpoint Oracle Redo Logs 1988-01-01 00:00:00 Thread 2, Seqno 4016, RBA 1024 SCN 44527.110828288 (191242119617280) Current Checkpoint Detail: Read Checkpoint #1 …… Read Checkpoint #2 Oracle Threaded Redo Log Startup Checkpoint (starting position in the data source): Thread #: 2 Sequence #: 4015 RBA: 640984080 Timestamp: 2015-01-25 18:03:51.000000 SCN: 3126.136602577 (13426204369873) Redo File: +DGROUP1/calldb/onlinelog/group_15.267.799673791 Recovery Checkpoint (position of oldest unprocessed transaction in the data source): Thread #: 2 Sequence #: 4015 RBA: 640984080 Timestamp: 2015-01-25 18:03:51.000000 SCN: 3126.136602577 (13426204369873) Redo File: +DGROUP1/calldb/onlinelog/group_15.267.799673791 Current Checkpoint (position of last record read in the data source): Thread #: 2 Sequence #: 4016 RBA: 1024 Timestamp: 1988-01-01 00:00:00.000000 SCN: 44527.110828288 (191242119617280) Redo File: +DGROUP1/calldb/onlinelog/group_16.266.799673793 …… |
3 分析原因
進程的thread#2的checkpoint time竟然是1988-01-01 00:00:00.000000,肯定是不正常,結合error資訊為找不到4016歸檔記錄檔,以及進程的current checkpoint thread#2的sequence#:4016,而RBA卻為1024。
既然RBA都已經運行取1024了,就不應該找不到archivelog的情況,如果4016號歸檔記錄檔還沒有開始抽取,RBA號就應該是0。
根據此分析,決定手動的將extract進程的chkeckpoint恢得到recover chkeckpoint點,以及next extseqno調整到4016的0號RBA試一下,看能否解決。
4、解決辦法4.1 調整netxt sequno為4016,RBA為0
GGSCI (calladgdb)2> alter SEXTR01, thread 2, extseqno 4016, extrba 0
4.2 調整ioextseqno與ioextrba到進程的rechvercheckpoint狀態
GGSCI (calladgdb)3> alter SEXTR01, thread 2, ioextseqno 4015, ioextrba 640984080
4.3 驗證調整結果
GGSCI (calladgdb)4> info sextr01, showch
EXTRACT SEXTR01 Initialized 2015-01-26 01:48 Status STOPPED Checkpoint Lag 07:43:41 (updated 00:00:51 ago) Log Read Checkpoint Oracle Redo Logs 2015-01-25 18:05:16 Thread 1, Seqno 5554, RBA 611373536 SCN 3126.136656462 (13426204423758) Log Read Checkpoint Oracle Redo Logs 1988-01-01 00:00:00 Thread 2, Seqno 4016, RBA 0 SCN 0.0 (0) Current Checkpoint Detail: Read Checkpoint #1 …… Read Checkpoint #2 Oracle Threaded Redo Log Startup Checkpoint (starting position in the data source): Thread #: 2 Sequence #: 4016 RBA: 0 Timestamp: 1988-01-01 00:00:00.000000 SCN: Not available Redo File: +DGROUP1/calldb/onlinelog/group_16.266.799673793 Recovery Checkpoint (position of oldest unprocessed transaction in the data source): Thread #: 2 Sequence #: 4015 RBA: 640984080 Timestamp: SCN: Not available Redo File: Current Checkpoint (position of last record read in the data source): Thread #: 2 Sequence #: 4016 RBA: 0 Timestamp: 1988-01-01 00:00:00.000000 SCN: Not available Redo File: +DGROUP1/calldb/onlinelog/group_16.266.799673793 …… |
Start checkpoint和current chkeckpoint的sequence#都變成了4016,RBA號都變成了0
4.4 啟動extract進程sextr01,並做brreset操作
GGSCI (calladgdb)6> startSEXTR01, brreset
GGSCI (calladgdb) 6> start SEXTR01, brreset Sending START request to MANAGER ... EXTRACT SEXTR01 starting GGSCI (calladgdb) 7> info all Program Status Group Lag at Chkpt Time Since Chkpt MANAGER RUNNING EXTRACT RUNNING DPEYWGL 00:00:00 00:00:01 EXTRACT STARTING SEXTR01 07:43:41 00:03:47 |
4.5 多做幾次進程狀態資訊顯示,驗證延遲下降效果
GGSCI (calladgdb) 10> info all Program Status Group Lag at Chkpt Time Since Chkpt MANAGER RUNNING EXTRACT RUNNING DPEYWGL 04:46:20 00:00:01 EXTRACT RUNNING SEXTR01 04:23:13 00:00:01 GGSCI (calladgdb) 11> info all Program Status Group Lag at Chkpt Time Since Chkpt MANAGER RUNNING EXTRACT RUNNING DPEYWGL 01:53:22 00:00:09 EXTRACT RUNNING SEXTR01 01:27:25 00:00:08 GGSCI (calladgdb) 12> info all Program Status Group Lag at Chkpt Time Since Chkpt MANAGER RUNNING EXTRACT RUNNING DPEYWGL 00:00:00 00:00:01 EXTRACT RUNNING SEXTR01 00:00:02 00:00:00 |
通過幾次info all查看進程狀態,看到lag at chkpt時間快速下降。問題解決。
本文作者:黎俊傑(網名:踩點),從事”系統架構、作業系統、存放裝置、資料庫、中介軟體、應用程式“六個層面系統性的效能最佳化工作
歡迎加入 系統效能最佳化專業群,共同探討效能最佳化技術。群號:258187244