Suppose you have a huge table: hundreds of millions of rows, well over 100 GB of raw data as text. How do you bulk-read (unload) it?
SQL*Loader is one option for the loading side: turn on parallelism, and the table itself may already be partitioned. But what if you have to bulk-unload the data yourself? Multi-threaded parallelism is a good idea, but how do you split the work? If you are thinking of dividing a numeric primary key into equal ranges, you probably won't like the result. The reason is that primary-key values are not necessarily evenly distributed, so some worker threads stay busy while others finish almost immediately. The deeper problem is that different threads can end up contending for the same data file. That is not a "might happen"; it is practically guaranteed. Is there a more direct approach? As it turns out, there is.
The idea is that Oracle keeps full bookkeeping for its data files in the data dictionary: file numbers, block counts, and so on. All you have to do is work out start and end ROWIDs that bound ranges within the relevant data files. Since a table's extents are generally allocated automatically in uniform sizes (say, roughly 20,000 rows per extent), this lets us split the work almost evenly while also avoiding I/O contention on the same data file.
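As a minimal sketch of where that bookkeeping lives, the extent map of a segment can be read straight from DBA_EXTENTS. Here :owner and :object_name are assumed to be bind variables holding the (upper-case) owner and table name:

-- List the extents that make up the table segment: which data file
-- each extent lives in, where it starts, and how many blocks it spans.
SELECT e.file_id, e.relative_fno, e.block_id, e.blocks
  FROM dba_extents e
 WHERE e.owner        = :owner        -- table owner, upper case
   AND e.segment_name = :object_name  -- table name, upper case
 ORDER BY e.file_id, e.block_id;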
Here are two SQL statements: one is taken from Quest's Oracle connector for Sqoop, the other comes from other people's prior work (you can probably guess what I'm up to).
SELECT data_object_id,
       file_id,
       relative_fno,
       file_batch,
       MIN (start_block_id) start_block_id,
       MAX (end_block_id) end_block_id,
       SUM (blocks) blocks
  FROM (SELECT o.data_object_id,
               e.file_id,
               e.relative_fno,
               e.block_id start_block_id,
               e.block_id + e.blocks - 1 end_block_id,
               e.blocks,
               CEIL (
                  SUM (e.blocks)
                     OVER (PARTITION BY o.data_object_id, e.file_id
                           ORDER BY e.block_id ASC)
                  / (SUM (e.blocks)
                        OVER (PARTITION BY o.data_object_id, e.file_id)
                     / :numchunks))
                  file_batch
          FROM dba_extents e, dba_objects o, dba_tab_subpartitions tsp
         WHERE     o.owner = :owner
               AND o.object_name = :object_name
               AND e.owner = :owner
               AND e.segment_name = :object_name
               AND o.owner = e.owner
               AND o.object_name = e.segment_name
               AND (o.subobject_name = e.partition_name
                    OR (o.subobject_name IS NULL AND e.partition_name IS NULL))
               AND o.owner = tsp.table_owner(+)
               AND o.object_name = tsp.table_name(+)
               AND o.subobject_name = tsp.subpartition_name(+))
 GROUP BY data_object_id, file_id, relative_fno, file_batch
 ORDER BY data_object_id, file_id, relative_fno, file_batch;
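Note that this first query only hands back block ranges per data file and batch; to actually bound a scan you still need to turn each (data_object_id, relative_fno, start_block_id, end_block_id) row into a pair of ROWIDs. A sketch of that step with DBMS_ROWID.ROWID_CREATE, mirroring what the second query below does (the row numbers 0 and 10000 are the same "below any row" / "above any row" convention it uses):

-- Build an inclusive ROWID range from one row of the query above.
SELECT dbms_rowid.rowid_create (1, :data_object_id, :relative_fno, :start_block_id, 0)     AS min_rid,
       dbms_rowid.rowid_create (1, :data_object_id, :relative_fno, :end_block_id,   10000) AS max_rid
  FROM dual;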
The SQL below is even kinder: it computes the start and end ROWIDs for you. What more could you ask for?
select grp,
       dbms_rowid.rowid_create( 1, data_object_id, lo_fno, lo_block, 0 ) min_rid,
       dbms_rowid.rowid_create( 1, data_object_id, hi_fno, hi_block, 10000 ) max_rid
  from (select distinct grp,
               first_value(relative_fno) over
                 (partition by grp order by relative_fno, block_id
                  rows between unbounded preceding and unbounded following) lo_fno,
               first_value(block_id) over
                 (partition by grp order by relative_fno, block_id
                  rows between unbounded preceding and unbounded following) lo_block,
               last_value(relative_fno) over
                 (partition by grp order by relative_fno, block_id
                  rows between unbounded preceding and unbounded following) hi_fno,
               last_value(block_id+blocks-1) over
                 (partition by grp order by relative_fno, block_id
                  rows between unbounded preceding and unbounded following) hi_block,
               sum(blocks) over (partition by grp) sum_blocks
          from (select relative_fno,
                       block_id,
                       blocks,
                       trunc( (sum(blocks) over (order by relative_fno, block_id) - 0.01)
                              / (sum(blocks) over () / :numchunks) ) grp
                  from dba_extents
                 where segment_name = upper(:object_name)
                   and owner = :owner
                 order by block_id)),
       (select data_object_id
          from dba_objects
         where owner = :owner
           and object_name = upper(:object_name))
 order by grp;
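Each worker then binds one (min_rid, max_rid) pair from that result and pulls only its own slice of the table. A minimal sketch, where my_schema.my_big_table is a placeholder for the actual table:

-- Run one of these per worker, each with its own ROWID pair, so no two
-- workers scan the same blocks (or contend for the same data file batch).
SELECT t.*
  FROM my_schema.my_big_table t
 WHERE t.ROWID BETWEEN :min_rid AND :max_rid;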
References:
http://agstamy.blogspot.sg/2011/11/spliting-very-large-table-to-pieces.html
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:29986713177238
http://www.quest.com/data-connector-for-oracle-and-hadoop/
https://github.com/QuestSoftwareTCD/OracleSQOOPconnector