OGG的extract進程checkpoint時間點回到1988-01-01 00:00:00故障處理,checkpoint

來源:互聯網
上載者:User

OGG的extract進程checkpoint時間點回到1988-01-01 00:00:00故障處理,checkpoint
1、故障現象

    Extract進程(SEXTR01)狀態為running,但是Lag at Chkpt卻達到5個多小時,且時間一直在增長,根本就不抽取新日誌,狀態資訊如下:

GGSCI (calladgdb) 21> info all

 

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

 

MANAGER     RUNNING                                          

EXTRACT     RUNNING     DPEYWGL     00:00:00      00:00:04   

EXTRACT     RUNNING     SEXTR01     05:14:58      00:00:03   

 

2、故障分析2.1 debug一下進程狀態資訊

GGSCI (calladgdb) 23> info sextr01, showch debug

 

EXTRACT    SEXTR01   Last Started 2015-01-25 22:51   Status RUNNING

Checkpoint Lag       05:21:26 (updated 00:00:04 ago)

Log Read Checkpoint  Oracle Redo Logs

                     2015-01-25 18:05:16  Thread 1, Seqno 5554, RBA 611373536

                     SCN 3126.136656462 (13426204423758)

Log Read Checkpoint  Oracle Redo Logs

                     1988-01-01 00:00:00  Thread 2, Seqno 4016, RBA 1024

                     SCN 44527.110828288 (191242119617280)

 

 

Current Checkpoint Detail:

 

Read Checkpoint #1

……

 

Read Checkpoint #2

 

  Oracle Threaded Redo Log

 

  Startup Checkpoint (starting position in the data source):

    Thread #: 2

    Sequence #: 4015

    RBA: 640984080

    Timestamp: 2015-01-25 18:03:51.000000

    SCN: 3126.136602577 (13426204369873)

    Redo File: +DGROUP1/calldb/onlinelog/group_15.267.799673791

 

  Recovery Checkpoint (position of oldest unprocessed transaction in the data source):

    Thread #: 2

    Sequence #: 4015

    RBA: 640984080

    Timestamp: 2015-01-25 18:03:51.000000

    SCN: 3126.136602577 (13426204369873)

    Redo File: +DGROUP1/calldb/onlinelog/group_15.267.799673791

 

  Current Checkpoint (position of last record read in the data source):

    Thread #: 2

    Sequence #: 4016

    RBA: 1024

    Timestamp: 1988-01-01 00:00:00.000000

    SCN: 44527.110828288 (191242119617280)

    Redo File: +DGROUP1/calldb/onlinelog/group_16.266.799673793

……

 

從這裡發現一個不可思議的問題:

      Thread #: 2的current checkpoint 的timestamp竟然是1988-01-0100:00:00.000000,SCN的十進位數,竟然為191242119617280,遠遠大於當前資料庫的SCN,多出一位元。

      時間倒退,SCN號反而大漲,時間還那麼有個性。這個問題挺有意思。

2.2 view extract進程的report

    使用view report sextr01查看進程的報告,提示有ERROR,thread#2 4016 archive log 不能開啟。(由於筆者沒有將資訊複製下來,所以當時的狀態只能這麼描述了),但是4016號archivelog實際物理檔案是存在的,而且許可權正常。

2.3 將進程停止下來,showch其狀態

GGSCI (calladgdb) 24>stop sextr01

GGSCI (calladgdb) 25> info SEXTR01, showch

EXTRACT    SEXTR01   Last Started 2015-01-26 01:00   Status STOPPED

Checkpoint Lag       06:56:55 (updated 00:00:58 ago)

Log Read Checkpoint  Oracle Redo Logs

                     2015-01-25 18:05:16  Thread 1, Seqno 5554, RBA 611373536

                     SCN 3126.136656462 (13426204423758)

Log Read Checkpoint  Oracle Redo Logs

                     1988-01-01 00:00:00  Thread 2, Seqno 4016, RBA 1024

                     SCN 44527.110828288 (191242119617280)

 

Current Checkpoint Detail:

 

Read Checkpoint #1

……

 

Read Checkpoint #2

 

  Oracle Threaded Redo Log

 

  Startup Checkpoint (starting position in the data source):

    Thread #: 2

    Sequence #: 4015

    RBA: 640984080

    Timestamp: 2015-01-25 18:03:51.000000

    SCN: 3126.136602577 (13426204369873)

    Redo File: +DGROUP1/calldb/onlinelog/group_15.267.799673791

 

  Recovery Checkpoint (position of oldest unprocessed transaction in the data source):

    Thread #: 2

    Sequence #: 4015

    RBA: 640984080

    Timestamp: 2015-01-25 18:03:51.000000

    SCN: 3126.136602577 (13426204369873)

    Redo File: +DGROUP1/calldb/onlinelog/group_15.267.799673791

 

  Current Checkpoint (position of last record read in the data source):

    Thread #: 2

    Sequence #: 4016

    RBA: 1024

    Timestamp: 1988-01-01 00:00:00.000000

    SCN: 44527.110828288 (191242119617280)

    Redo File: +DGROUP1/calldb/onlinelog/group_16.266.799673793

……


3 分析原因

       進程的thread#2的checkpoint time竟然是1988-01-01 00:00:00.000000,肯定是不正常,結合error資訊為找不到4016歸檔記錄檔,以及進程的current checkpoint thread#2的sequence#:4016,而RBA卻為1024。

既然RBA都已經運行取1024了,就不應該找不到archivelog的情況,如果4016號歸檔記錄檔還沒有開始抽取,RBA號就應該是0。

      根據此分析,決定手動的將extract進程的chkeckpoint恢得到recover chkeckpoint點,以及next extseqno調整到4016的0號RBA試一下,看能否解決。


4、解決辦法4.1 調整netxt sequno為4016,RBA為0

GGSCI (calladgdb)2> alter SEXTR01, thread 2, extseqno 4016, extrba 0

4.2 調整ioextseqno與ioextrba到進程的rechvercheckpoint狀態

GGSCI (calladgdb)3> alter SEXTR01, thread 2, ioextseqno 4015, ioextrba 640984080

4.3 驗證調整結果

GGSCI (calladgdb)4> info sextr01, showch

EXTRACT    SEXTR01   Initialized   2015-01-26 01:48   Status STOPPED

Checkpoint Lag       07:43:41 (updated 00:00:51 ago)

Log Read Checkpoint  Oracle Redo Logs

                     2015-01-25 18:05:16  Thread 1, Seqno 5554, RBA 611373536

                     SCN 3126.136656462 (13426204423758)

Log Read Checkpoint  Oracle Redo Logs

                     1988-01-01 00:00:00  Thread 2, Seqno 4016, RBA 0

                     SCN 0.0 (0)

 

Current Checkpoint Detail:

 

Read Checkpoint #1

……

 

Read Checkpoint #2

 

  Oracle Threaded Redo Log

 

  Startup Checkpoint (starting position in the data source):

    Thread #: 2

    Sequence #: 4016

    RBA: 0

    Timestamp: 1988-01-01 00:00:00.000000

    SCN: Not available

    Redo File: +DGROUP1/calldb/onlinelog/group_16.266.799673793

 

  Recovery Checkpoint (position of oldest unprocessed transaction in the data source):

    Thread #: 2

    Sequence #: 4015

    RBA: 640984080

    Timestamp:

    SCN: Not available

    Redo File:

 

  Current Checkpoint (position of last record read in the data source):

    Thread #: 2

    Sequence #: 4016

    RBA: 0

    Timestamp: 1988-01-01 00:00:00.000000

    SCN: Not available

    Redo File: +DGROUP1/calldb/onlinelog/group_16.266.799673793

……

   Start checkpoint和current chkeckpoint的sequence#都變成了4016,RBA號都變成了0

4.4 啟動extract進程sextr01,並做brreset操作

GGSCI (calladgdb)6> startSEXTR01, brreset

GGSCI (calladgdb) 6> start SEXTR01, brreset

 

Sending START request to MANAGER ...

EXTRACT SEXTR01 starting

 

GGSCI (calladgdb) 7> info all

 

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

 

MANAGER     RUNNING                                           

EXTRACT     RUNNING     DPEYWGL     00:00:00      00:00:01   

EXTRACT     STARTING    SEXTR01     07:43:41      00:03:47   

 

4.5 多做幾次進程狀態資訊顯示,驗證延遲下降效果

GGSCI (calladgdb) 10> info all

 

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

 

MANAGER     RUNNING                                          

EXTRACT     RUNNING     DPEYWGL     04:46:20      00:00:01   

EXTRACT     RUNNING     SEXTR01     04:23:13      00:00:01   

 

GGSCI (calladgdb) 11> info all

 

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

 

MANAGER     RUNNING                                          

EXTRACT     RUNNING     DPEYWGL     01:53:22      00:00:09   

EXTRACT     RUNNING     SEXTR01     01:27:25      00:00:08   

 

GGSCI (calladgdb) 12> info all

 

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

 

MANAGER     RUNNING                                          

EXTRACT     RUNNING     DPEYWGL     00:00:00      00:00:01   

EXTRACT     RUNNING     SEXTR01     00:00:02      00:00:00   

 

   通過幾次info all查看進程狀態,看到lag at chkpt時間快速下降。問題解決。

 

 

 

本文作者:黎俊傑(網名:踩點),從事”系統架構、作業系統、存放裝置、資料庫、中介軟體、應用程式“六個層面系統性的效能最佳化工作

歡迎加入 系統效能最佳化專業群,共同探討效能最佳化技術。群號:258187244

相關文章

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.