About MongoDB's MapReduce

Source: Internet
Author: User
Tags emit

Because of limitations in Node.js itself, doing high-volume computation in JS inside the application is inefficient. The V8 engine also caps memory by default (about 1.4 GB on 64-bit systems), which in turn limits how much data can be processed.

Therefore, rather than pulling the data out of MongoDB and computing in Node.js, it is better to sidestep Node.js's limits by doing the computation inside MongoDB, using aggregation or another built-in method.

MongoDB supports aggregation functions similar to SQL's. The syntax differs, but the underlying idea is the same.

In MongoDB's own shell interface, aggregate is used to run aggregation queries:

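rec = db.Library.aggregate([{
    // Keep only records whose Date falls inside the range.
    // The range operators were lost in the original listing; $gte/$lte are assumed.
    $match: {
        Date: {
            $gte: '2220832129000',
            $lte: '3330918529000'
        }
    }
}, {
    // Group by book_id and compute one accumulator per output field.
    $group: {
        _id: '$book_id',
        maxscience: {
            $max: '$shelf.Science'
        },
        Maxmath: {
            $max: '$data.Math'
        },
        Avgalgebra: {
            $avg: '$data.Algebra'
        }
    }
}])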

Various operators can be used inside $group to implement different aggregation operations.

If the goal is to generate a simple data structure with few fields, an aggregation can do almost everything in one step.

It is important to note that, without an explicit format conversion, JS draws only a vague distinction between strings and numbers. If you take the maximum of string values, the comparison is character by character, so "999" > "1234". If the data stored in MongoDB is not consistently typed, you may not get the desired result.
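The string-versus-number pitfall above can be reproduced in plain JS. This is a minimal sketch; the sample values and the helper name maxOf are made up for illustration.

```javascript
// Scores stored as strings, as can happen when documents are inserted without conversion.
const asStrings = ["999", "1234", "58"];

// Picks the largest element using ">", which compares strings character by character.
const maxOf = (values) => values.reduce((a, b) => (b > a ? b : a));

const stringMax = maxOf(asStrings);              // maximum under string comparison
const numericMax = maxOf(asStrings.map(Number)); // maximum after explicit conversion

console.log(stringMax, numericMax);
```

Converting the values once, before they are stored or before they are compared, avoids the problem.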

For complex calculations, you can use the MapReduce method.

MongoDB natively provides a MapReduce method. It traverses all data matching the query condition; for each record, the map method emits a key-value pair, and all values for the current key are collected in an array until the key changes.

Exactly when the reduce method is triggered is something I have not yet determined: it may fire during traversal each time the key changes, or it may run over the collected key-value pairs after traversal is complete.

The parameters passed to the reduce method are the key and an array of all values collected for that key so far. When reduce finishes, its result is returned and the MapReduce process ends.
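The flow described above can be imitated in memory. The sketch below is not MongoDB's implementation: runMapReduce is a hypothetical helper, emit is passed in as an argument (in MongoDB it is built in), and reduce is called exactly once per key, which a real server does not guarantee.

```javascript
// In-memory stand-in for the map -> collect-by-key -> reduce flow.
function runMapReduce(docs, map, reduce) {
  const grouped = new Map(); // key -> array of emitted values

  // emit stores each value under its key.
  const emit = (key, value) => {
    if (!grouped.has(key)) grouped.set(key, []);
    grouped.get(key).push(value);
  };

  // map runs once per record, with `this` bound to that record.
  for (const doc of docs) map.call(doc, emit);

  // Simplification: reduce runs once per key over all collected values.
  const result = {};
  for (const [key, values] of grouped) result[key] = reduce(key, values);
  return result;
}

// Hypothetical sample records.
const docs = [
  { book_id: "a", score: 3 },
  { book_id: "b", score: 5 },
  { book_id: "a", score: 4 },
];

const totals = runMapReduce(
  docs,
  function (emit) { emit(this.book_id, this.score); },
  (key, values) => values.reduce((sum, v) => sum + v, 0)
);

console.log(totals);
```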

MapReduce is a very powerful method. For now I can only record the parts I have worked with; the rest will be filled in later.

First come the core pieces: the map method and the reduce method.

※map method

It processes the result set selected by the query condition, executing the map method once for each MongoDB record.

The map method generally takes no parameters. In the method body, this points to the record currently being processed, and the record's fields can be read as this.columnname.

The map method does not need a return value. In the method body it calls the built-in emit method, passing a key-value pair to MapReduce in the form emit(key, val). The format of the pair is free: even a variable that does not exist in the current record can be emitted normally, as long as the syntax is valid. In most online examples the value passed to emit is a number or a single variable, but in fact the value can be a complex JSON structure or even an external method. Paired with a matching reduce method, this allows more complex and more useful processing.
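As an illustration of emitting a structured value, here is a small sketch. The emit function below is a stand-in that only records what it receives (MongoDB supplies the real one), and the document fields are hypothetical, modeled on the aggregate example.

```javascript
// Stand-in for MongoDB's built-in emit: it just records the pairs it is given.
const emitted = [];
function emit(key, value) {
  emitted.push({ key, value });
}

// map takes no parameters; `this` is the record being processed.
function map() {
  // The emitted value is an object rather than a bare number.
  emit(this.book_id, {
    count: 1,
    maxMath: Number(this.data.Math),       // explicit conversion avoids string comparison
    sumAlgebra: Number(this.data.Algebra),
  });
}

// Hypothetical record.
const doc = { book_id: "b1", data: { Math: "90", Algebra: "70" } };
map.call(doc);

console.log(emitted);
```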

※reduce method

When exactly the reduce method is called is something I am currently unsure of. The basic principle is that it receives two parameters, key and values. key is the key passed to emit; values is an array that gathers the values emitted for that same key. Even if there is only one value, values is still an array.

Before this step, identical values under the same key are not merged. The word-frequency example cited in introductions to the map reduce concept shows the merged result after the whole process has finished; before the reduce method is actually executed, each value for a key remains exactly as it was emitted.

In the reduce method, you traverse the values array and, following the structure of the value passed to emit, process the values uniformly and return the result. In the return value of the MapReduce method, the value for each key is whatever the reduce method returned. In summary, the result of a MapReduce call is a group of JSON key-value pairs of the form {key from emit: return value of reduce}.
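A reduce method that walks the values array and returns a result in the same structure as the emitted values might look like the sketch below. The field names count, maxMath, and sumAlgebra are assumptions carried over for illustration, not taken from the original code.

```javascript
// reduce receives the key and the array of values collected for it.
function reduce(key, values) {
  const out = { count: 0, maxMath: -Infinity, sumAlgebra: 0 };
  for (const v of values) {
    out.count += v.count;                          // add up the carried counts
    out.maxMath = Math.max(out.maxMath, v.maxMath);
    out.sumAlgebra += v.sumAlgebra;
  }
  return out; // same structure as each input value
}

// Hypothetical values for one key.
const values = [
  { count: 1, maxMath: 90, sumAlgebra: 70 },
  { count: 1, maxMath: 75, sumAlgebra: 80 },
];

const reduced = reduce("b1", values);
const avgAlgebra = reduced.sumAlgebra / reduced.count; // average derived afterwards

console.log(reduced, avgAlgebra);
```

Carrying a sum and a count, and dividing only at the end, keeps the average correct no matter how the values are batched.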

In MongoDB's own JS interface, the direct return value of the mapReduce method is a summary of the run's statistics:

> rec = db.Data.mapReduce(m, r, {query: {time: {$gte: '1450832129000'}}, sort: {id: 1}, out: "Result"})
{
    "result": "Result",
    "timeMillis": 50948,
    "counts": {
        "input": 490672,
        "emit": 490672,
        "reduce": 4931,
        "output": 26
    },
    "ok": 1
}
>

With this interface, you need to set the out parameter of the mapReduce method if you want to see the detailed calculation results.

When the value of the out parameter is a string, the result property of the return value is that string. This is sufficient if you only need to view the statistics.

If you need to cache the calculation results, set the out parameter to a JSON structure of the form {<option>: temporary table name}. The results of the MapReduce calculation are then stored in a temporary table in MongoDB and can be viewed with the find method.

In the JSON structure of the out parameter, <option> defines how an identical key already in the temporary table is handled. There are three options:

replace: the results already in the temporary table are replaced by the current reduce results.

merge: if the key of the current reduce result is not in the temporary table, the result is saved to the table; if the same key exists, its value in the table is overwritten with the current value.

reduce: if the current key already exists in the temporary table, the current and existing results are passed through the reduce method again to consolidate them.
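The three options can be compared side by side with a small in-memory model. applyOut is a hypothetical helper written only to mirror the descriptions above. It is not a MongoDB function, and plain objects stand in for the temporary table.

```javascript
// Combines what the table already holds (`existing`) with a new run's results (`fresh`).
function applyOut(mode, existing, fresh, reduce) {
  if (mode === "replace") return { ...fresh }; // previous contents are dropped

  const table = { ...existing };
  for (const [key, value] of Object.entries(fresh)) {
    const present = Object.prototype.hasOwnProperty.call(existing, key);
    if (mode === "reduce" && present) {
      table[key] = reduce(key, [existing[key], value]); // consolidate old and new
    } else {
      table[key] = value; // merge: the new value overwrites, or the key is added
    }
  }
  return table;
}

const sum = (key, values) => values.reduce((s, v) => s + v, 0);
const existing = { a: 1, b: 2 };
const fresh = { b: 10, c: 20 };

const replaced = applyOut("replace", existing, fresh, sum);
const merged = applyOut("merge", existing, fresh, sum);
const rereduced = applyOut("reduce", existing, fresh, sum);

console.log(replaced, merged, rereduced);
```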

It is important to note that when the query result set is large, the same key may be reduced more than once. In that case the out option should be set to reduce; otherwise the temporary table keeps the result of only one reduce pass rather than the combined result of all of them. It is also best to keep the return value of the reduce method in the same structure as the value emitted by the map method, so that later passes can feed earlier reduce results back into the same reduce method without problems.
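Why the return structure matters shows up as soon as a partial result is fed back into reduce. The sketch below simulates a two-pass reduction with two hypothetical reduce functions: one that adds up the counts carried in each value, and one that simply counts array entries.

```javascript
// Safe: adds up the count carried inside each value, so partial results can be re-reduced.
const reduceOk = (key, values) => ({
  count: values.reduce((n, v) => n + v.count, 0),
});

// Fragile: counts array entries, which is wrong once an entry is itself a partial result.
const reduceBad = (key, values) => ({ count: values.length });

const emittedValues = [{ count: 1 }, { count: 1 }, { count: 1 }];

// Reduce the first two values, then reduce that partial result together with the third.
const twoPass = (reduce) =>
  reduce("k", [reduce("k", emittedValues.slice(0, 2)), emittedValues[2]]);

const okCount = twoPass(reduceOk).count;   // matches the three emitted values
const badCount = twoPass(reduceBad).count; // loses one of them

console.log(okCount, badCount);
```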

In my case I am not using MongoDB's own JS interface but Mongoose, a third-party MongoDB library, which is somewhat more convenient than the native interface.

Mongoose's Model.mapReduce method takes two parameters: a MapReduce configuration object and a callback function (because it is an async method).

In the MapReduce object, mapReduceObj.map specifies the map method and mapReduceObj.reduce specifies the reduce method. As for the out parameter, Mongoose's mapReduce method defaults to {inline: 1}, which returns the results directly as a JS object in JSON format, with one reduce result per key. If the expected result is small, you do not need to set up a temporary table as with the native method.

If the map or reduce method needs an external JS variable, set mapReduceObj.scope to {key: external variable, ...}. Inside map or reduce, the defined key can then be used directly to read the value of the external variable.
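Put together, a configuration object might look like the sketch below. It assumes a Mongoose release that provides Model.mapReduce as described in the text. The model name Book, the fields book_id and score, and the scope variable minScore are all hypothetical. The map function is only defined here, not executed: emit and minScore would be supplied on the MongoDB side.

```javascript
// Configuration object in the shape described above.
const mapReduceObj = {
  // Executed by MongoDB, where emit and the scope variable minScore are available.
  map: function () {
    if (this.score >= minScore) emit(this.book_id, this.score);
  },
  // Adds up all values collected for one key.
  reduce: function (key, values) {
    return values.reduce((sum, v) => sum + v, 0);
  },
  query: { time: { $gte: "1450832129000" } }, // same filter as the shell example
  scope: { minScore: 60 },                    // exposes minScore to map and reduce
  out: { inline: 1 },                         // the default per the text, written out for clarity
};

// With a Mongoose model (hypothetical `Book`), the call would look like:
// Book.mapReduce(mapReduceObj, function (err, results) { console.log(err, results); });

console.log(Object.keys(mapReduceObj));
```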

※Optimization of MapReduce

When processing large amounts of data, sort the result set after the query so that records with the same key arrive consecutively when the map method runs. This reduces the number of reduce calls and can significantly improve MapReduce efficiency. Because a sort is involved, it is best to give the MongoDB fields that make up the key a compound or single-field index, so that the MongoDB engine does not fall back to its default sort and raise an error.

External variables can also be used to filter what the map method emits, keeping only the required data and reducing redundant computation.

If the language you are using supports it, you can also consider multithreaded execution: the limit and skip parameters of the MapReduce method make it convenient to split the result set into blocks.
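Splitting a result set into blocks comes down to computing skip and limit pairs. The sketch below assumes, as the text does, that both parameters are available for the call; chunkRanges is a hypothetical helper, and the block size is arbitrary.

```javascript
// Splits `total` records into consecutive { skip, limit } windows of at most `chunkSize`.
function chunkRanges(total, chunkSize) {
  const ranges = [];
  for (let skip = 0; skip < total; skip += chunkSize) {
    ranges.push({ skip, limit: Math.min(chunkSize, total - skip) });
  }
  return ranges;
}

// 490672 is the input count from the statistics shown earlier.
const ranges = chunkRanges(490672, 200000);

console.log(ranges);
```

Each window could then be handed to a separate worker, with the out option set to reduce so that the partial results are consolidated.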

In addition, I recently read the butterfly book, and its explanation of recursion gave me some insight into how the MapReduce method might be implemented. The reduce method uses a concept similar to recursion. Working backwards from the result, the input data is divided into blocks according to key; the reduce method is then called again on each block, recursively, until each block holds only separate records and cannot be split further. The same map and reduce methods are invoked at every level.
