Implementing MapReduce in Oracle Database

Last updated: 2014-10-11. Source: Internet

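The Map-Reduce model is becoming popular among programmers who write parallel programs. Map-reduce programs are typically used to process large volumes of data in parallel. This article demonstrates how to implement a Map-Reduce program in Oracle Database using parallel pipelined table functions and parallel operations. (Translator's note: table() is an Oracle function that lets a SQL query select from the result returned by a function defined as PIPELINED.)

Background:

Pipelined table functions were introduced in Oracle 9i as a way of embedding procedural logic within a data flow. Logically, a table function is a function that can appear in the FROM clause and returns multiple rows, just as a table does. A table function can also take multiple rows as its input. Because a pipelined table function is embedded in the data flow, it lets data "stream" into the SQL statement, which in most cases avoids intermediate materialization. In addition, pipelined table functions can be parallelized.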
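A minimal sketch of a pipelined table function may make this concrete. The names num_list_t and first_n are hypothetical and chosen only for illustration; they do not appear in omr.sql. The function pipes each row to the consumer as soon as it is produced, and the query reads the function in the FROM clause through table().

```sql
-- Hypothetical illustration; num_list_t and first_n are not part of omr.sql.
CREATE TYPE num_list_t AS TABLE OF NUMBER;
/

CREATE OR REPLACE FUNCTION first_n(n IN NUMBER) RETURN num_list_t
  PIPELINED
IS
BEGIN
  FOR i IN 1 .. n LOOP
    PIPE ROW (i);  -- hand the row to the calling query without building the whole collection first
  END LOOP;
  RETURN;          -- a pipelined function returns no value; RETURN only ends the stream
END first_n;
/

-- The function is queried like a table; a collection of scalars exposes its values as COLUMN_VALUE.
SELECT column_value FROM TABLE(first_n(3));
```

The mapper and reducer later in this article follow the same pattern, with a cursor as input so that rows can stream in as well as out.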

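To parallelize a table function, the developer specifies a key by which the input data is repartitioned. Table functions can be implemented natively in PL/SQL, Java, and C. More information about table functions, examples, and the features mentioned above can be found at: http://download.oracle.com/docs/cd/B10501_01/appdev.920/a96624/08_subs.htm#19677

Pipelined table functions have been used by customers for several releases and are a core part of Oracle's extensibility infrastructure. Both external users and Oracle's own development teams use table functions as an efficient and simple way of extending the database kernel.

Table functions are also used inside Oracle itself, where they implement a number of features in Oracle Spatial and Oracle Warehouse Builder. Oracle Spatial uses them for spatial joins and for several spatial data mining operations. Oracle Warehouse Builder lets users apply table functions to parallelize procedural logic in data flows, such as the Match-Merge algorithm and other row-by-row processing algorithms.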

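Step-by-step example

All of the examples are in the omr.sql file.

To illustrate how parallelism and pipelined table functions can be used to write a Map-Reduce algorithm inside Oracle Database, we implement the canonical map-reduce example: a word count. A word counter is a program that returns every distinct word appearing in a set of documents together with the number of times it occurs, in other words the word frequencies.

The sample code is written in PL/SQL but, as noted above, Oracle lets you choose other languages to implement the procedural logic.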

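1. Setting up the environment

We will be working with a set of documents. These documents can be files outside the database, or they can be stored in SecureFiles/CLOB columns inside the database. In this example, the table that stores the documents plays much the same role as a file system.

Here we create a table inside the database with the following statement: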

CREATE TABLE documents (a CLOB)       
  LOB(a) STORE AS SECUREFILE(TABLESPACE sysaux);

Each row of this table corresponds to one document. We insert three simple documents into the table with the following statements:

INSERT INTO documents VALUES ('abc def');       
INSERT INTO documents VALUES ('def ghi');        
INSERT INTO documents VALUES ('ghi jkl');        
commit;

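The map code and the reduce code are both kept in a single package to keep the code tidy. To show the individual steps, the code snippets are taken out of the package and discussed in the sections below. The actual package also has to define a few types. All code was tested on Oracle Database 11g (11.1.0.6).

2. Creating the mapper and the reducer

First we create a generic map function that tokenizes the documents. Keep in mind that the goal is not to show how good this map function is, but to show how the approach works in the database. This map function is very basic, and better implementations may well exist elsewhere.

You can obtain the final result using only the map function together with the database's aggregation engine. Such a query is the first select statement in omr.sql: SQL performs the aggregation, so no reducer function is needed.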
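As a sketch of this map-only approach, the query below reuses the first select statement from the omr.sql listing at the end of this article. It assumes that the documents table and the oracle_map_reduce package from that listing already exist. The counts in the closing comment follow from the three sample documents; the order of the rows is not guaranteed.

```sql
-- Map only: the mapper tokenizes the documents, and SQL GROUP BY does the counting.
select word, count(*)
from (
  select value(map_result).word word
  from table(oracle_map_reduce.mapper(cursor(select a from documents), ' ')) map_result)
group by (word);

-- With the three sample documents ('abc def', 'def ghi', 'ghi jkl') the counts are:
--   abc 1, def 2, ghi 2, jkl 1
```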

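Of course, you can also write your own aggregation table function to count the word occurrences. If you do not use Oracle's aggregation engine, you have to write the complete map-reduce program yourself. This aggregation table function is the reducer part of map-reduce.

The table function requires its input to be grouped by word, so the data must either be sorted (by the Oracle execution engine) or clustered by word. This article shows a simple counting program.

3. Running map-reduce in the database

Once you have written the mapper and the reducer, you can run map-reduce in the database. Running a query that uses these table functions gives a parallel workload over the documents that does what a typical map-reduce program does.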
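A sketch of such a query, taken from the second select statement in the omr.sql listing at the end of this article, is shown below. It assumes the documents table and the oracle_map_reduce package from that listing are already in place. The mapper's output is fed to the reducer through a cursor, and the reducer's parallel_enable and cluster clauses ask the database to route and group identical words together before they are counted.

```sql
-- Map and reduce: the inner table() call tokenizes the documents,
-- the outer table() call counts the words it receives.
select *
from table(oracle_map_reduce.reducer(
       cursor(select value(map_result).word word
              from table(oracle_map_reduce.mapper(
                     cursor(select a from documents), ' ')) map_result)));

-- With the three sample documents the word counts match the GROUP BY variant:
--   abc 1, def 2, ghi 2, jkl 1   (row order is not guaranteed)
```

Whether the statement actually runs in parallel depends on how parallel execution is configured for the session and the underlying table; the query text itself is the same either way.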

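Summary

Oracle table functions are a proven technology, used widely both inside and outside Oracle to extend Oracle Database 11g.

Oracle table functions are a robust and scalable way to implement Map-Reduce inside Oracle Database, and they take advantage of the scalability of Oracle's parallel execution framework. Used from SQL, they give database developers an efficient and simple mechanism for implementing Map-Reduce in an environment and language they already know.

You can download omr.sql; it does not require any special privileges.

Appendix: omr.sql code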

CREATE TABLE documents (a CLOB)
LOB(a) STORE AS SECUREFILE(TABLESPACE sysaux);

INSERT INTO documents VALUES ('abc def');
INSERT INTO documents VALUES ('def ghi');
INSERT INTO documents VALUES ('ghi jkl');
commit;

create or replace
package oracle_map_reduce is

  type word_t is record (word varchar2(4000));
  type words_t is table of word_t;

  type word_cur_t is ref cursor return word_t;
  type wordcnt_t is record (word varchar2(4000), count number);
  type wordcnts_t is table of wordcnt_t;

  function mapper(doc in sys_refcursor, sep in varchar2) return words_t
    pipelined parallel_enable (partition doc by any);

  function reducer(in_cur in word_cur_t) return wordcnts_t
    pipelined parallel_enable (partition in_cur by hash(word))
    cluster in_cur by (word);

end;
/

create or replace
package body oracle_map_reduce is

  --
  -- The mapper is a simple tokenizer that tokenizes the input documents
  -- and emits individual words
  --
  function mapper(doc in sys_refcursor, sep in varchar2) return words_t
    pipelined parallel_enable (partition doc by any)
  is
    document clob;
    istart   number;
    pos      number;
    len      number;
    word_rec word_t;
  begin

    -- for every document
    loop

      fetch doc into document;
      exit when doc%notfound;

      istart := 1;
      len := length(document);

      -- For every word within a document
      while (istart <= len) loop
        pos := instr(document, sep, istart);

        if (pos = 0) then
          -- no further separator: the rest of the document is the last word
          word_rec.word := substr(document, istart);
          pipe row (word_rec);
          istart := len + 1;
        else
          -- skip empty tokens produced by consecutive separators
          if (pos > istart) then
            word_rec.word := substr(document, istart, pos - istart);
            pipe row (word_rec);
          end if;
          -- step over the whole separator, which may be longer than one character
          istart := pos + length(sep);
        end if;

      end loop; -- end loop for a single document

    end loop; -- end loop for all documents

    return;

  end mapper;

  --
  -- The reducer emits words and the number of times they're seen.
  -- It relies on equal words arriving next to each other (cluster by word).
  --
  function reducer(in_cur in word_cur_t) return wordcnts_t
    pipelined parallel_enable (partition in_cur by hash(word))
    cluster in_cur by (word)
  is
    word_count wordcnt_t;
    next       varchar2(4000);
  begin

    word_count.count := 0;

    loop

      fetch in_cur into next;
      exit when in_cur%notfound;

      if (word_count.word is null) then

        -- first word seen by this reducer
        word_count.word := next;
        word_count.count := word_count.count + 1;

      elsif (next <> word_count.word) then

        -- the word changed: emit the finished count and start a new one
        pipe row (word_count);
        word_count.word := next;
        word_count.count := 1;

      else

        word_count.count := word_count.count + 1;

      end if;

    end loop;

    -- emit the count for the last word, if any input was seen
    if word_count.count <> 0 then
      pipe row (word_count);
    end if;

    return;

  end reducer;

end;
/

-- Select statements

select word, count(*)
from (
  select value(map_result).word word
  from table(oracle_map_reduce.mapper(cursor(select a from documents), ' ')) map_result)
group by (word);

select *
from table(oracle_map_reduce.reducer(
       cursor(select value(map_result).word word
              from table(oracle_map_reduce.mapper(
                     cursor(select a from documents), ' ')) map_result)));

English original: In-Database MapReduce (Map-Reduce)
