Getting familiar with MongoDB mapreduce

Last Update:2018-12-05 Source: Internet

Author: User



Topics on MongoDB data aggregation

http://blog.nosqlfan.com/html/3548.html

Mapreduce

Mapreduce is a computing model. Put simply, a large job (a large data set) is split into many small pieces that are processed independently (map), and the partial results are then merged into the final result (reduce). The advantage is that once the task is decomposed, the pieces can be computed in parallel on many machines, which shortens the total running time.

For programmers with a formal computer science background, the most familiar example is merge sort. The merge sort process can indeed be viewed as a mapreduce, although the merge sort programs we wrote in school probably did not involve parallel computing.

That covers the theory of mapreduce. The rest of this article looks at a practical application, using MongoDB's mapreduce as the example.
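
To make the analogy concrete, here is a minimal sketch of merge sort in plain JavaScript, with the split step playing the role of map and the merge step playing the role of reduce. The function names mergeSort and merge are illustrative choices, and the sketch is sequential; it only shows where parallel work could be slotted in.

```javascript
// Merge two already-sorted arrays into one sorted array (the "reduce" step).
function merge(left, right) {
  const out = [];
  let i = 0, j = 0;
  while (i < left.length && j < right.length) {
    out.push(left[i] <= right[j] ? left[i++] : right[j++]);
  }
  // One side is exhausted; append whatever remains of the other.
  return out.concat(left.slice(i), right.slice(j));
}

// Split the input into halves (the "map" step), sort each half
// independently, then merge the partial results.
function mergeSort(items) {
  if (items.length <= 1) return items.slice();
  const mid = Math.floor(items.length / 2);
  return merge(mergeSort(items.slice(0, mid)), mergeSort(items.slice(mid)));
}

const sorted = mergeSort([5, 2, 9, 1, 5, 6]);
console.log(sorted); // [ 1, 2, 5, 5, 6, 9 ]
```

The two recursive calls do not depend on each other, which is exactly the property that lets a mapreduce system hand them to different machines.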

The following is an official MongoDB example:

$ ./mongo
> db.things.insert( { _id : 1, tags : ['dog', 'cat'] } );
> db.things.insert( { _id : 2, tags : ['cat'] } );
> db.things.insert( { _id : 3, tags : ['mouse', 'cat', 'dog'] } );
> db.things.insert( { _id : 4, tags : []  } );
> // map function
> m = function(){
...    this.tags.forEach(
...        function(z){
...            emit( z , { count : 1 } );
...        }
...    );
...};
> // reduce function
> r = function( key , values ){
...    var total = 0;
...    for ( var i=0; i<values.length; i++ )
...        total += values[i].count;
...    return { count : total };
...};
> res = db.things.mapReduce(m,r);
> res
{"timeMillis.emit" : 9 , "result" : "mr.things.1254430454.3" , "numObjects" : 4 , "timeMillis" : 9 , "errmsg" : "" , "ok" : 0}
> db[res.result].find()
{"_id" : "cat" , "value" : {"count" : 3}}
{"_id" : "dog" , "value" : {"count" : 2}}
{"_id" : "mouse" , "value" : {"count" : 1}}
> db[res.result].drop()

The example is simple: it counts how many times each tag appears in a tagging system.

Everything here except the emit function is standard JS syntax, and you can use any standard JS function you know. The emit function is the important part: it puts a piece of data into a group, and the group is determined by key, the first parameter of emit. In other words, after the map function has run over all the rows to be processed, you have a set of key-values pairs, where key is the key passed to emit and values is the collection of the second parameters of every emit call made with that key.
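
The grouping done by emit can be imitated in plain JavaScript without a database. The sketch below is only an illustration: the emit defined here and the groups object are hypothetical stand-ins for what MongoDB does internally, while the map function and the sample documents are the ones from the example above.

```javascript
// Sample documents matching the example collection above.
const things = [
  { _id: 1, tags: ["dog", "cat"] },
  { _id: 2, tags: ["cat"] },
  { _id: 3, tags: ["mouse", "cat", "dog"] },
  { _id: 4, tags: [] },
];

// Hypothetical stand-in for MongoDB's emit: append value to the group for key.
const groups = {};
function emit(key, value) {
  if (!Object.prototype.hasOwnProperty.call(groups, key)) groups[key] = [];
  groups[key].push(value);
}

// Same map function as in the example; `this` is the current document.
const m = function () {
  this.tags.forEach(function (z) {
    emit(z, { count: 1 });
  });
};

// Run map once per document, binding `this` to that document.
things.forEach(function (doc) {
  m.call(doc);
});

// Each key now holds the list of values emitted for it.
console.log(JSON.stringify(groups));
```

After the loop, groups holds three values under cat, two under dog, and one under mouse. The document with an empty tags array emits nothing, so it contributes no key.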

Now our task is to turn each key-values pair into a key-value pair, that is, to collapse the collection into a single value. This operation is reduce.

This matches the theory described earlier. When the values collection of a key-values pair is too large, it is split into many smaller key-values chunks, and the reduce function runs on each chunk separately. The results of those chunks are then combined into a new collection and passed as the second parameter of the reduce function, and the reduce operation continues. If the initial values collection is very large, the collection formed after the first round of chunk calculations may itself be reduced again. This is similar to a multi-pass merge sort; the number of passes depends on the data volume.

We do not need to know this internal mechanism in detail, but we must understand the rule it imposes on us. The second parameter of emit in the map function is what arrives, collected into an array, as the second parameter of the reduce function. Because the return value of the reduce function may be fed back in as input to another reduce call, the return value of the reduce function must have the same structure as the elements of its second parameter.

With that in place, the following describes the call parameters and return value of MongoDB's mapreduce.
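
The rule is easier to see when reduce is run by hand in plain JavaScript. In this sketch, the chunk boundaries and the badReduce function are made up for illustration; MongoDB decides the real chunking itself. The point is only that a reduce whose output has the same shape as its input gives the same answer in one pass or in two, while one that changes the shape does not.

```javascript
// Same reduce function as in the example above.
const r = function (key, values) {
  let total = 0;
  for (let i = 0; i < values.length; i++) total += values[i].count;
  return { count: total };
};

// Five emitted values for one key, as map would produce them.
const emitted = [
  { count: 1 }, { count: 1 }, { count: 1 }, { count: 1 }, { count: 1 },
];

// Reduce everything in a single pass.
const onePass = r("cat", emitted);

// Reduce two chunks separately, then reduce the partial results again.
const partialA = r("cat", emitted.slice(0, 2));
const partialB = r("cat", emitted.slice(2));
const twoPass = r("cat", [partialA, partialB]);

// A reduce that returns a bare number breaks on the second pass,
// because its own output no longer has a `count` field.
const badReduce = function (key, values) {
  let total = 0;
  for (let i = 0; i < values.length; i++) total += values[i].count;
  return total;
};
const badTwoPass = badReduce("cat", [
  badReduce("cat", emitted.slice(0, 2)),
  badReduce("cat", emitted.slice(2)),
]);

console.log(onePass.count, twoPass.count, badTwoPass); // 5 5 NaN
```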

The parameter table is as follows:

db.runCommand(
 { mapreduce : <collection>,
   map : <mapfunction>,
   reduce : <reducefunction>
   [, query : <query filter object>]
   [, sort : <sort the query.  useful for optimization>]
   [, limit : <number of objects to return from collection>]
   [, out : <output-collection name>]
   [, keeptemp: <true|false>]
   [, finalize : <finalizefunction>]
   [, scope : <object where fields go into javascript global scope >]
   [, verbose : true]
 }
);

mapreduce: The collection that mapreduce will process.
map: The map function.
reduce: The reduce function.
query: A filter condition. Only rows that match it are passed to mapreduce. The filtering is done before the mapreduce process itself starts.
sort: The sort order used together with query. This is the only option for optimizing the grouping mechanism.
limit: Used together with query in the same way; it caps the number of objects taken from the collection.
out: Name of the output collection. If it is not specified, a collection with a random name is created by default.
keeptemp: true or false, indicating whether the output collection is kept or treated as temporary. To keep the collection after the connection is closed, set keeptemp to true; otherwise it is deleted when the client connection ends. If you are connected through the mongo shell client, it is deleted only after exit. If a script is running the job, the result collection is deleted automatically when the script exits or calls close.
finalize: A function, like map and reduce. It is called with the key and value after reduce has produced its result, and it returns the final result.
scope: Sets variables. The values set here are visible in the map, reduce, and finalize functions.
verbose: Prints debugging information during execution.
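
As a rough illustration of how these parameters fit together, the sketch below builds a command document for the things collection used earlier. The output collection name tag_counts, the minCount threshold, the query filter, and the finalize logic are all assumptions made up for this example. The object is only constructed here, not sent to a server.

```javascript
// Map: emit one { count: 1 } per tag of the current document.
// emit is supplied by MongoDB when the function runs on the server.
const mapFn = function () {
  this.tags.forEach(function (z) {
    emit(z, { count: 1 });
  });
};

// Reduce: sum the counts; the output has the same shape as the emitted values.
const reduceFn = function (key, values) {
  let total = 0;
  for (let i = 0; i < values.length; i++) total += values[i].count;
  return { count: total };
};

// Finalize: runs once per key after reduce. It adds a flag based on
// minCount, a variable made visible to the function through scope.
const finalizeFn = function (key, value) {
  value.popular = value.count >= minCount;
  return value;
};

const command = {
  mapreduce: "things",
  map: mapFn,
  reduce: reduceFn,
  query: { _id: { $lte: 3 } }, // only documents with _id <= 3 are processed
  out: "tag_counts",           // hypothetical output collection name
  finalize: finalizeFn,
  scope: { minCount: 2 },      // hypothetical threshold used by finalizeFn
  verbose: true,
};

// In the mongo shell this would be submitted with: db.runCommand(command);
console.log(Object.keys(command).join(", "));
```

Passing minCount through scope rather than hard-coding it in finalizeFn keeps the function reusable with different thresholds.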

The returned result structure is as follows:

{ result : <collection_name>,
  counts : {
       input :  <number of objects scanned>,
       emit  : <number of times emit was called>,
       output : <number of items in output collection>
  } ,
  timeMillis : <job_time>,
  ok : <1_if_ok>,
  [, err : <errmsg_if_error>]
}

result: Name of the collection that stores the results.
input: Number of rows that matched the conditions.
emit: Number of times emit was called, that is, the total number of values across all groups.
output: Number of results written to the output collection.
timeMillis: Execution time, in milliseconds.
ok: Whether the operation succeeded. The value is 1 on success.
err: On failure, the reason can be found here. In practice, though, the message tends to be vague and is of limited help.
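
A caller would typically check ok before reading anything else. The sketch below shows one possible way to do that in plain JavaScript. The summarize helper is a hypothetical name, and the res object is a hand-written sample in the shape described above, with values chosen to match the tag example (4 documents, 6 emit calls, 3 distinct tags), not real server output.

```javascript
// Hand-written sample in the result shape described above.
const res = {
  result: "tag_counts",
  counts: { input: 4, emit: 6, output: 3 },
  timeMillis: 9,
  ok: 1,
};

// Summarize a mapreduce result, or throw with the reported error message.
function summarize(result) {
  if (result.ok !== 1) {
    throw new Error("mapreduce failed: " + (result.err || "unknown error"));
  }
  const c = result.counts;
  return c.input + " documents scanned, " + c.emit + " emits, " +
    c.output + " keys written to " + result.result +
    " in " + result.timeMillis + " ms";
}

console.log(summarize(res));
```

Comparing emit with input gives a quick sanity check on the map function: here six emits from four documents means the average document carried one and a half tags.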

This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.
