Topics on MongoDB data aggregation
http://blog.nosqlfan.com/html/3548.html
MapReduce
MapReduce is a computing model. Simply put, it decomposes a large job over the data into many small map tasks, then merges their intermediate results into the final result (reduce). The advantage is that once a task is decomposed this way, it can be executed in parallel across many machines, reducing the total running time.
For programmers, the most familiar example from an algorithms course is merge sort. Indeed, the merge sort process can be viewed as a MapReduce, even though the merge sort we wrote in school involved no parallel computation.
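To make the analogy concrete, here is a minimal plain-JavaScript sketch (not MongoDB code) that treats merge sort as a map phase that splits the input into trivially sorted chunks, followed by repeated pairwise merges that play the role of reduce:

```javascript
// Merge two already-sorted arrays into one sorted array
// (the role that "reduce" plays in the analogy).
function merge(a, b) {
  const out = [];
  let i = 0, j = 0;
  while (i < a.length && j < b.length) {
    out.push(a[i] <= b[j] ? a[i++] : b[j++]);
  }
  return out.concat(a.slice(i), b.slice(j));
}

function mergeSort(arr) {
  // "map" phase: each element becomes a trivially sorted one-element chunk
  let chunks = arr.map(x => [x]);
  // "reduce" phase: repeatedly merge pairs of chunks until one remains
  while (chunks.length > 1) {
    const next = [];
    for (let k = 0; k < chunks.length; k += 2) {
      next.push(k + 1 < chunks.length ? merge(chunks[k], chunks[k + 1]) : chunks[k]);
    }
    chunks = next;
  }
  return chunks[0] || [];
}

console.log(mergeSort([5, 3, 8, 1, 2]));  // [ 1, 2, 3, 5, 8 ]
```

Each merge of two chunks is independent of the others, which is exactly the property that lets a real MapReduce run its combine steps in parallel.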
That covers the theory of MapReduce. The rest of this article looks at its practical application, using MongoDB's mapReduce as the example.
Here is the official MongoDB example:
```javascript
$ ./mongo
> db.things.insert( { _id : 1, tags : ['dog', 'cat'] } );
> db.things.insert( { _id : 2, tags : ['cat'] } );
> db.things.insert( { _id : 3, tags : ['mouse', 'cat', 'dog'] } );
> db.things.insert( { _id : 4, tags : [] } );

> // map function
> m = function(){
...     this.tags.forEach(
...         function(z){
...             emit( z , { count : 1 } );
...         }
...     );
... };

> // reduce function
> r = function( key , values ){
...     var total = 0;
...     for ( var i=0; i<values.length; i++ )
...         total += values[i].count;
...     return { count : total };
... };

> res = db.things.mapReduce(m,r);
> res
{ "timeMillis.emit" : 9 , "result" : "mr.things.1254430454.3" , "numObjects" : 4 , "timeMillis" : 9 , "errmsg" : "" , "ok" : 0 }
> db[res.result].find()
{ "_id" : "cat" , "value" : { "count" : 3 } }
{ "_id" : "dog" , "value" : { "count" : 2 } }
{ "_id" : "mouse" , "value" : { "count" : 1 } }
> db[res.result].drop()
```
The example is simple: it counts the number of times each tag appears in a tagging system.
Everything except the emit function is standard JavaScript syntax, and of course you can use any standard JavaScript function you know. The emit function is the important part: it puts a piece of data into a group, where the grouping is keyed on emit's first argument. You can understand it this way: after the map function has run over every row to be processed, you are left with a set of key-values pairs, where each key is a key passed to emit, and values is the collection of second arguments from every emit call made with that key.
Our task now is to turn each key-values pair into a key-value pair, that is, to collapse the collection into a single value. That operation is reduce.
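The flow just described can be simulated in plain JavaScript, outside MongoDB. The sketch below is illustrative only (it is not the server's actual engine): a hand-rolled emit collects values into groups by key, and reduce then collapses each group.

```javascript
// The same documents as in the official example.
const things = [
  { _id: 1, tags: ['dog', 'cat'] },
  { _id: 2, tags: ['cat'] },
  { _id: 3, tags: ['mouse', 'cat', 'dog'] },
  { _id: 4, tags: [] },
];

// emit groups each value under its key: key -> array of emitted values.
const groups = {};
function emit(key, value) {
  (groups[key] = groups[key] || []).push(value);
}

// map: runs once per document, with the document bound as `this`.
const m = function () {
  this.tags.forEach(z => emit(z, { count: 1 }));
};

// reduce: collapses the values emitted for one key into a single value.
const r = function (key, values) {
  let total = 0;
  for (let i = 0; i < values.length; i++) total += values[i].count;
  return { count: total };
};

things.forEach(doc => m.call(doc));       // map phase: fill the groups
const result = {};
for (const key of Object.keys(groups)) {  // reduce phase: one call per key
  result[key] = r(key, groups[key]);
}
console.log(result);
// { dog: { count: 2 }, cat: { count: 3 }, mouse: { count: 1 } }
```

After the map phase, `groups` holds the key-values pairs (e.g. `cat -> [{count:1},{count:1},{count:1}]`); the reduce phase turns each into a key-value pair, matching the output of `db[res.result].find()` above.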
This matches the theory above exactly. When the values collection for a key is too large, it is split into many smaller key-values chunks and the reduce function is run on each chunk separately; the chunk results are then combined into a new collection and passed as the second argument for another round of reduce. Foreseeably, if the initial values collection is very large, the collection formed after the first round of chunked reduces may itself need to be reduced again. This resembles a multi-way merge sort; how many rounds occur depends on the data volume.
We don't need to know much about this internal mechanism, but we must understand the rule it imposes on us. When we write a map function, the form of emit's second argument determines the elements of reduce's second argument; and because the return value of reduce may be fed back in as input to another reduce pass, the return value of reduce must have the same structure as the elements of its second argument.
Next, let's look at the call parameters and return value of MongoDB's mapReduce command.
The parameter table is as follows:
```javascript
db.runCommand(
  { mapreduce : <collection>,
    map : <mapfunction>,
    reduce : <reducefunction>
    [, query : <query filter object>]
    [, sort : <sort the query. useful for optimization>]
    [, limit : <number of objects to return from collection>]
    [, out : <output-collection name>]
    [, keeptemp : <true|false>]
    [, finalize : <finalizefunction>]
    [, scope : <object where fields go into javascript global scope>]
    [, verbose : true]
  }
);
```
- mapreduce: the collection to run mapReduce over.
- map: the map function.
- reduce: the reduce function.
- query: a filter condition; only rows matching it enter the mapReduce input. This filtering runs before the whole mapReduce process.
- sort: a sort applied together with query, useful for optimizing the grouping phase.
- limit: also applied together with query, limiting the number of input objects taken from the collection.
- out: the name of the output collection. If not specified, a collection with a randomly generated name is created by default.
- keeptemp: true or false, controlling whether the output collection outlives the connection. If false, the collection is treated as temporary and deleted when the client connection closes: with the interactive mongo shell it is dropped on exit, and with a script it is dropped when the script finishes or close is called.
- finalize: like map and reduce, a function. After reduce has produced a result, finalize can take the key and value, perform further computation, and return the final result.
- scope: sets variable values; variables set here are visible inside the map, reduce, and finalize functions.
- verbose: prints debugging information during execution.
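Among these options, finalize is the least obvious. The plain-JavaScript sketch below illustrates how it post-processes each reduced value; the `avgPerDoc` field and the `totalDocs` count are made-up names for illustration, not part of any MongoDB API (in real MongoDB, the function would be passed via the `finalize` option of the command above).

```javascript
// Values as they stand after the reduce phase of the tag example.
const reduced = { dog: { count: 2 }, cat: { count: 3 }, mouse: { count: 1 } };
const totalDocs = 4;  // assumed input size, for illustration only

// finalize: runs once per key after reduce, adding a derived field.
const finalize = function (key, value) {
  return { count: value.count, avgPerDoc: value.count / totalDocs };
};

const finalized = {};
for (const key of Object.keys(reduced)) {
  finalized[key] = finalize(key, reduced[key]);
}
console.log(finalized.cat);  // { count: 3, avgPerDoc: 0.75 }
```

Note that, unlike reduce, finalize runs exactly once per key, so its return value is free to change shape.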
The returned result structure is as follows:
```javascript
{
  result : <collection_name>,
  counts : {
    input : <number of objects scanned>,
    emit : <number of times emit was called>,
    output : <number of items in output collection>
  },
  timeMillis : <job_time>,
  ok : <1_if_ok>,
  [err : <errmsg_if_error>]
}
```
- result: the name of the collection that stores the results.
- input: the number of rows that matched the condition.
- emit: the number of emit calls, i.e. the total amount of data across all groups.
- output: the number of rows in the result.
- timeMillis: execution time, in milliseconds.
- ok: whether the operation succeeded; 1 on success.
- err: on failure, the reason can be found here, though in practice the message tends to be vague and not very helpful.