Original English text: http://www.mongodb.org/display/DOCS/MapReduce
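MapReduce in MongoDB is mainly used for batch processing of data and for aggregation operations. It resembles Hadoop in that all input comes from one collection and all output goes to one collection. Where you would use a GROUP BY aggregation in a traditional relational database, map/reduce is often the right tool in MongoDB.
Indexing and standard queries in MongoDB are separate from map/reduce. If you have used CouchDB in the past, note that this is a big difference between CouchDB and MongoDB: indexing and querying in MongoDB work more like they do in MySQL.
map/reduce is invoked through a database command. The database typically creates a collection to hold the output of the operation. The map and reduce functions are written in JavaScript and run on the server. The command syntax is as follows: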
db.runCommand(
 { mapreduce : <collection>,
   map : <mapfunction>,
   reduce : <reducefunction>
   [, query : <query filter object>]
   [, sort : <sorts the input objects using this key. Useful for optimization, like sorting by the emit key for fewer reduces>]
   [, limit : <number of objects to return from collection>]
   [, out : <see output options below>]
   [, keeptemp: <true|false>]
   [, finalize : <finalizefunction>]
   [, scope : <object where fields go into javascript global scope >]
   [, jsMode : true]
   [, verbose : true]
 }
);
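As a rough illustration of how these options fit together, the command can be assembled as an ordered document on the client side. The helper name build_mapreduce_command and the sample collection, query and output names are hypothetical; only the option keys come from the syntax above. With pymongo, a document like this could be handed to db.command(...), but that call is left out so the sketch runs without a server.

```python
def build_mapreduce_command(collection, map_js, reduce_js, **options):
    """Assemble a mapreduce command document (hypothetical helper).

    The command name has to be the first key, so the dict is filled in
    order (dicts preserve insertion order in Python 3.7+).
    """
    allowed = ("query", "sort", "limit", "out", "keeptemp",
               "finalize", "scope", "jsMode", "verbose")
    command = {"mapreduce": collection, "map": map_js, "reduce": reduce_js}
    for name in allowed:
        if name in options:
            command[name] = options.pop(name)
    if options:
        # Anything left over is not part of the documented syntax.
        raise ValueError("unknown mapreduce options: %s" % sorted(options))
    return command


# JavaScript source for the two functions, kept as plain strings here.
map_js = "function () { this.tags.forEach(function (z) { emit(z, 1); }); }"
reduce_js = ("function (key, values) {"
             "  var total = 0;"
             "  for (var i = 0; i < values.length; i++) { total += values[i]; }"
             "  return total;"
             "}")

command = build_mapreduce_command(
    "things", map_js, reduce_js,
    query={"x": {"$lt": 3}},          # only process documents with x < 3
    out={"replace": "myresults"},     # overwrite the output collection
)
print(list(command))
```

Rejecting unknown option names early is a design choice of this sketch, not something the server requires of the client.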
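Incremental map-reduce
If the data set you want to aggregate keeps growing, incremental map/reduce is a good fit: you do not have to aggregate over the entire data set every time you want to see up-to-date results. An incremental map/reduce run takes the following steps:
1. First, run a job over the existing collection and write the results to an output collection.
2. When more data has arrived, run a second job and use the query option to filter the input down to the new documents.
3. Use the reduce output option, which applies the reduce function to merge the new results into the existing output collection.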
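The three steps above can be sketched in plain Python, with no server involved. This is only an in-memory model under stated assumptions: map_reduce, reduce_into, map_tags and sum_values are hypothetical stand-ins, a list of dicts plays the collection, a list comprehension plays the query option, and the field x is assumed to increase with every new document.

```python
from collections import defaultdict


def map_reduce(docs, map_fn, reduce_fn):
    """Group the emitted (key, value) pairs by key and reduce each group."""
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def reduce_into(output, new_results, reduce_fn):
    """Model the reduce output option on a plain dict."""
    for key, value in new_results.items():
        if key in output:
            # Key exists on both sides: reduce the old and the new value.
            output[key] = reduce_fn(key, [output[key], value])
        else:
            output[key] = value
    return output


def map_tags(doc):
    """Emit one (tag, 1) pair per tag of the document."""
    return [(tag, 1) for tag in doc["tags"]]


def sum_values(key, values):
    return sum(values)


things = [
    {"x": 1, "tags": ["dog", "cat"]},
    {"x": 2, "tags": ["cat"]},
]

# Step 1: first job over the data that exists so far.
output = map_reduce(things, map_tags, sum_values)
last_seen = max(doc["x"] for doc in things)

# More documents arrive later.
things += [{"x": 3, "tags": ["mouse", "cat", "dog"]}, {"x": 4, "tags": []}]

# Step 2: second job, restricted to the new documents only.
new_docs = [doc for doc in things if doc["x"] > last_seen]
new_results = map_reduce(new_docs, map_tags, sum_values)

# Step 3: fold the new results into the existing output with reduce.
reduce_into(output, new_results, sum_values)
print(output)
```

In this model the incrementally maintained totals equal what a single job over all four documents would produce, which is the point of the technique.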
Output options
"collectionName" - By default the output will be of type "replace".
{ replace : "collectionName" } - The output will be inserted into a collection which will atomically replace any existing collection with the same name.
{ merge : "collectionName" } - This option will merge new data into the old output collection. In other words, if the same key exists in both the result set and the old collection, the new key will overwrite the old one.
{ reduce : "collectionName" } - If documents exist for a given key in the result set and in the old collection, then a reduce operation (using the specified reduce function) will be performed on the two values and the result will be written to the output collection. If a finalize function was provided, this will be run after the reduce as well.
{ inline : 1 } - With this option, no collection will be created, and the whole map-reduce operation will happen in RAM. Also, the results of the map-reduce will be returned within the result object. Note that this option is possible only when the result set fits within the 16MB limit of a single document.
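To make the difference between the replace, merge and reduce modes concrete, here is a small in-memory model. It is a sketch only: write_output is a hypothetical function, plain dicts stand in for the old output collection and the new result set, and finalize and inline are left out.

```python
def write_output(existing, results, mode, reduce_fn=None):
    """Return the output contents after a job, for one output mode."""
    if mode == "replace":
        # The old collection is discarded entirely.
        return dict(results)
    if mode == "merge":
        # New keys overwrite old ones; untouched old keys survive.
        merged = dict(existing)
        merged.update(results)
        return merged
    if mode == "reduce":
        if reduce_fn is None:
            raise ValueError("reduce mode needs a reduce function")
        merged = dict(existing)
        for key, value in results.items():
            if key in merged:
                # Key on both sides: reduce the old and the new value.
                merged[key] = reduce_fn(key, [merged[key], value])
            else:
                merged[key] = value
        return merged
    raise ValueError("unsupported mode: %r" % mode)


def sum_values(key, values):
    return sum(values)


old = {"cat": 3, "dog": 2, "mouse": 1}   # earlier output collection
new = {"cat": 2, "dog": 1}               # result set of the latest job

replaced = write_output(old, new, "replace")
merged = write_output(old, new, "merge")
reduced = write_output(old, new, "reduce", sum_values)

print(replaced)  # {'cat': 2, 'dog': 1}
print(merged)    # {'cat': 2, 'dog': 1, 'mouse': 1}
print(reduced)   # {'cat': 5, 'dog': 3, 'mouse': 1}
```

Note how mouse disappears under replace, keeps its old value under merge and reduce, and how only reduce combines the old and new counts for cat and dog.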
Result object
{ [results : <document_array>,]
  [result : <collection_name> | {db: <db>, collection: <collection_name>},]
  timeMillis : <job_time>,
  counts : {
    input : <number of objects scanned>,
    emit : <number of times emit was called>,
    output : <number of items in output collection>
  },
  ok : <1_if_ok>
  [, err : <errmsg_if_error>]
}
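Map function
Inside the map function, the variable this refers to the document currently being processed. The map function calls emit(key, value) any number of times to pass data to the reduce function. In most cases emit is called once per document, but in some cases a single document may trigger several emit calls, or none.
Reduce function
During a map/reduce operation, the reduce function collects the values that map emitted for a given key and computes a single value from them.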
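One practical consequence is worth a quick sketch: with the reduce output option, the reduce function is applied to values that were themselves produced by an earlier reduce, so its result has to be usable as one of its own inputs. The Python functions below (emit_tags, reduce_sum, reduce_len) are hypothetical stand-ins for the JavaScript map and reduce functions and compare a reduce that adds the values with a flawed one that merely counts them.

```python
def emit_tags(doc):
    """Stand-in for map: one (tag, 1) pair per tag, none for an empty list."""
    return [(tag, 1) for tag in doc["tags"]]


def reduce_sum(key, values):
    """Adds the values, so partial results can be reduced again safely."""
    return sum(values)


def reduce_len(key, values):
    """Flawed: counts the values, which breaks once inputs are partial sums."""
    return len(values)


docs = [
    {"x": 1, "tags": ["dog", "cat"]},
    {"x": 2, "tags": ["cat"]},
    {"x": 3, "tags": ["mouse", "cat", "dog"]},
    {"x": 4, "tags": []},
]

emitted = [pair for doc in docs for pair in emit_tags(doc)]
cat_values = [value for key, value in emitted if key == "cat"]

# Reduce everything in one call ...
one_pass = reduce_sum("cat", cat_values)
# ... or reduce two partial batches first and then reduce the partial results.
two_pass = reduce_sum("cat", [reduce_sum("cat", cat_values[:2]),
                              reduce_sum("cat", cat_values[2:])])

bad_one_pass = reduce_len("cat", cat_values)
bad_two_pass = reduce_len("cat", [reduce_len("cat", cat_values[:2]),
                                  reduce_len("cat", cat_values[2:])])

print(one_pass, two_pass)          # 3 3
print(bad_one_pass, bad_two_pass)  # 3 2
```

The summing version gives the same answer either way, while the counting version silently loses data once its inputs are partial results.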
Below is a map-reduce example that uses the Python MongoDB client:
#!/usr/bin/env python
# coding=utf-8
# Targets Python 2 and a pymongo release older than 3.0
# (Connection, insert and remove are legacy APIs).
from bson.code import Code
from pymongo import Connection

connection = Connection('localhost', 27017)
db = connection.map_reduce_example

# Start from an empty collection, then insert four sample documents.
db.things.remove({})
db.things.insert({"x": 1, "tags": ["dog", "cat"]})
db.things.insert({"x": 2, "tags": ["cat"]})
db.things.insert({"x": 3, "tags": ["mouse", "cat", "dog"]})
db.things.insert({"x": 4, "tags": []})

# map: emit (tag, 1) once for every tag of the current document.
mapfun = Code("function () {this.tags.forEach(function(z) {emit(z, 1);});}")

# reduce: add up all the values emitted for one tag.
reducefun = Code("function (key, values) {"
                 "  var total = 0;"
                 "  for (var i = 0; i < values.length; i++) {"
                 "    total += values[i];"
                 "  }"
                 "  return total;"
                 "}")

# Count the tags over the whole collection; results go to "myresults".
result = db.things.map_reduce(mapfun, reducefun, "myresults")
for doc in result.find():
    print doc
print "#################################################################"

# Same job, restricted by a query to the documents with x < 3.
result = db.things.map_reduce(mapfun, reducefun, "myresults", query={"x": {"$lt": 3}})
for doc in result.find():
    print doc
print "#################################################################"
The output is as follows:
{u'_id': u'cat', u'value': 3.0}
{u'_id': u'dog', u'value': 2.0}
{u'_id': u'mouse', u'value': 1.0}
#################################################################
{u'_id': u'cat', u'value': 2.0}
{u'_id': u'dog', u'value': 1.0}
#################################################################