GoldenGate Extract進程hang問題解決一例
一直運行正常的extract進程突然hang住了,起初懷疑是不是找不到歸檔所致,但是細細的檢查了下,extract所需的歸檔都還在,而且日誌中也沒報任何錯。而且將extract進程停掉後,也能正常起來,沒有任何錯誤或者警示:
// ggserr.log
2015-05-25 11:03:27 INFO OGG-01513 Oracle GoldenGate Capture for Oracle, elis.prm: Positioning to (Thread 1) Sequence 5824, RBA 290544656, SCN 1415.2504802626.
2015-05-25 11:03:27 INFO OGG-01516 Oracle GoldenGate Capture for Oracle, elis.prm: Positioned to (Thread 1) Sequence 5824, RBA 290544656, SCN 1415.2504802626, May 20, 2015 4:50:25 PM.
2015-05-25 11:03:27 INFO OGG-01513 Oracle GoldenGate Capture for Oracle, elis.prm: Positioning to (Thread 2) Sequence 4798, RBA 18781712, SCN 1415.2504799356.
2015-05-25 11:03:27 INFO OGG-01516 Oracle GoldenGate Capture for Oracle, elis.prm: Positioned to (Thread 2) Sequence 4798, RBA 18781712, SCN 1415.2504799356, May 20, 2015 4:50:19 PM.
2015-05-25 11:03:27 INFO OGG-00993 Oracle GoldenGate Capture for Oracle, elis.prm: EXTRACT ELIS started.
//extract進程的rpt檔案
2015-05-25 11:03:27 INFO OGG-00546 Default thread stack size: 196608.
2015-05-25 11:03:27 INFO OGG-00547 Increasing thread stack size from 196608 to 1048576.
2015-05-25 11:03:27 INFO OGG-01513 Positioning to (Thread 1) Sequence 5824, RBA 290544656, SCN 1415.2504802626.
2015-05-25 11:03:27 INFO OGG-01516 Positioned to (Thread 1) Sequence 5824, RBA 290544656, SCN 1415.2504802626, May 20, 2015 4:50:25 PM.
2015-05-25 11:03:27 INFO OGG-01513 Positioning to (Thread 2) Sequence 4798, RBA 18781712, SCN 1415.2504799356.
2015-05-25 11:03:27 INFO OGG-01516 Positioned to (Thread 2) Sequence 4798, RBA 18781712, SCN 1415.2504799356, May 20, 2015 4:50:19 PM.
2015-05-25 11:03:27 INFO OGG-01517 Position of first record processed for Thread 1, Sequence 5824, RBA 290544656, SCN 1415.2504802623, May 20, 2015 4:50:25 PM.
2015-05-25 11:03:27 INFO OGG-01644 BOUNDED RECOVERY: COMPLETE: for object pool 1: p33619994_Redo Thread 1 at SeqNo: 5824, RBA: 290545168, SCN: 1415.2504802626 (6079883526466), Thread: 1.
2015-05-25 11:03:27 INFO OGG-01055 Recovery initialization completed for target file /ogg/ggate/data/dirext/lis/ea002534, at RBA 1109.
2015-05-25 11:03:27 INFO OGG-01478 Output file /ogg/ggate/data/dirext/lis/ea is using format RELEASE 11.2.
2015-05-25 11:03:27 INFO OGG-01517 Position of first record processed for Thread 2, Sequence 4798, RBA 18781712, SCN 1415.2504799356, May 20, 2015 4:50:19 PM.
2015-05-25 11:03:27 INFO OGG-01026 Rolling over remote file /ogg/ggate/data/dirext/lis/ea002534.
2015-05-25 11:03:27 INFO OGG-01053 Recovery completed for target file /ogg/ggate/data/dirext/lis/ea002535, at RBA 1109.
2015-05-25 11:03:27 INFO OGG-01057 Recovery completed for all targets.
從日誌中看,extract進程沒有問題,但是extract進程實際上是運行不正常的:
GGSCI (hxddlis01) 8> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
JAGENT RUNNING
EXTRACT RUNNING ELIS 96:14:58 00:00:04
EXTRACT RUNNING PL2IS 00:00:00 00:00:00
EXTRACT RUNNING PLIS 00:00:00 00:00:09
extract進程的lag越來越大,而且trail檔案也不再增長。這情況真是讓人一頭霧水。我們先來看看當前系統的運行環境:
GoldenGate:11.2.1.0.27
Database: 11.2.0.3 兩節點的RAC,使用ASM
OS: AIX 6.1
然後搜尋Oracle Support,找到一篇文章1432994.1,和碰到的情形有點相像。在我們的環境中也是使用了DBLOGREADER,但是不同的是DBLOGREADERBUFSIZE已經配置成1M:
...
TRANLOGOPTIONS BUFSIZE 1048576
TRANLOGOPTIONS DBLOGREADER, DBLOGREADERBUFSIZE 1048576
--TRANLOGOPTIONS _NOREADAHEAD ANY
EXTTRAIL /ogg/ggate/data/dirext/lis/ea
...
這也是AIX下能配置的最大值。難道該值還是太大的緣故?所以就嘗試將該值改成了512K:
...
TRANLOGOPTIONS BUFSIZE 1048576
TRANLOGOPTIONS DBLOGREADER, DBLOGREADERBUFSIZE 524288
--TRANLOGOPTIONS _NOREADAHEAD ANY
EXTTRAIL /ogg/ggate/data/dirext/lis/ea
...
然後重啟extract進程,結果就正常了。估計使用DBLOGREADER還是有不少的bug。