MongoDB Aggregation Operations

Source: Internet
Author: User

In MongoDB, there are two ways to compute aggregations: Pipeline and MapReduce. Pipeline queries are faster than MapReduce, but the strength of MapReduce is that it can execute complex aggregation logic on multiple servers in parallel. MongoDB does not allow a pipeline to consume too much system memory for a single aggregation operation: if an aggregation operation consumes more than 20% of memory, MongoDB stops the operation and returns an error message to the client.

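Part 1: computing aggregations with the Pipeline method

The Pipeline method uses the db.collection.aggregate() function to perform aggregation operations. It is fast and simple to use, but it has two limitations: a single aggregation operation cannot consume more than 20% of system memory, and the result set returned by the aggregation operation must fit within 16 MB. (These are the limits of older releases. Since MongoDB 2.6, each pipeline stage is limited to 100 MB of RAM unless allowDiskUse is enabled, and results are returned through a cursor, so the 16 MB limit applies to each returned document rather than to the whole result set.)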

Create the sample data: insert 10,000 documents into the collection foo, each with three fields: idx, name, and age.

for (var i = 0; i < 10000; i++) { db.foo.insert({idx: i, name: "user" + i, age: i}); }

1. Use the $match stage to filter the documents in the collection so that only qualifying documents enter the pipeline. This reduces the memory consumed by the aggregation operation and makes it more efficient.

db.foo.aggregate([{$match: {age: {$lte: 25}}}])
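To make the effect of $match concrete, here is a minimal plain-JavaScript sketch that models the stage as a filter over an in-memory array. It runs in Node without MongoDB and is not how the server implements the stage; the docs array and the matchAgeLte helper are hypothetical names introduced only for this illustration.

```javascript
// Plain-JavaScript model of {$match: {age: {$lte: 25}}}; not MongoDB code.
// Hypothetical sample data shaped like the foo collection (idx, name, age).
const docs = [];
for (let i = 0; i < 100; i++) {
  docs.push({ idx: i, name: "user" + i, age: i });
}

// Keep only the documents whose age is at most maxAge.
function matchAgeLte(input, maxAge) {
  return input.filter((doc) => doc.age <= maxAge);
}

const matched = matchAgeLte(docs, 25);
console.log(matched.length); // 26 (ages 0 through 25)
```

Every later stage now has to handle 26 documents instead of 100, which is why placing $match first saves memory.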

2. Use the $project stage to pass only some of each document's fields to the next stage of the pipeline.

db.foo.aggregate([{$match: {age: {$lte: 25}}}, {$project: {age: 1, idx: 1, _id: 0}}])

The $project stage can select fields, rename fields, and derive new fields.

2.1 Select fields

In the $project stage, field: 1 selects a field and field: 0 excludes it. Fields that are not selected are filtered out of the pipeline, which reduces the memory consumed by the aggregation operation.

db.foo.aggregate([{$match: {age: {$lte: 25}}}, {$project: {age: 1, idx: 1, _id: 0}}])

2.2 Renaming a field to produce a new field

A field reference has the format "$field" and evaluates to the value of that field in the current document. To reference a field inside an embedded document, use "$field1.field2", which evaluates to the value of field2 inside the embedded document stored in field1.
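As a rough illustration of how such a reference is resolved, the following plain-JavaScript sketch walks a dotted path through a document. The resolveFieldPath helper and the sample document (including its address.city field) are hypothetical and exist only for this example; MongoDB's own path resolution has more rules (for instance for arrays) that are not modelled here.

```javascript
// Resolve a "$field" or "$field1.field2" reference against a document.
// Hypothetical helper for illustration; not part of any MongoDB API.
function resolveFieldPath(doc, ref) {
  if (typeof ref !== "string" || !ref.startsWith("$")) {
    throw new Error("a field reference must be a string starting with $");
  }
  let value = doc;
  for (const part of ref.slice(1).split(".")) {
    if (value === null || value === undefined) {
      return undefined; // the path does not exist in this document
    }
    value = value[part];
  }
  return value;
}

// Hypothetical document with one embedded document.
const sample = { idx: 7, address: { city: "Hangzhou" } };

console.log(resolveFieldPath(sample, "$idx"));          // 7
console.log(resolveFieldPath(sample, "$address.city")); // Hangzhou
```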

Example: create a new field, preidx, that has the same value as the idx field.

db.foo.aggregate([{$match: {age: {$lte: 25}}}, {$project: {age: 1, preidx: "$idx", idx: 1, _id: 0}}])

2.3 Derived fields

In $project, a new field can be derived by evaluating an expression over the field values of the document.

Example: preidx is the current document's idx minus 1.

db.foo.aggregate([{$match: {age: {$lte: 25}}}, {$project: {age: 1, preidx: {$subtract: ["$idx", 1]}, idx: 1, _id: 0}}])
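The three uses of $project (select, exclude, derive) can be modelled in plain JavaScript as a function applied to every document. This is only a sketch of the observable result, assuming two hypothetical input documents; the project function is an illustrative name, not a MongoDB API.

```javascript
// Plain-JavaScript model of
//   {$project: {age: 1, idx: 1, preidx: {$subtract: ["$idx", 1]}, _id: 0}}
// Hypothetical input documents.
const input = [
  { _id: 1, idx: 10, name: "user10", age: 10 },
  { _id: 2, idx: 11, name: "user11", age: 11 },
];

function project(doc) {
  return {
    age: doc.age,        // age: 1 keeps the field
    idx: doc.idx,        // idx: 1 keeps the field
    preidx: doc.idx - 1, // derived field: $subtract ["$idx", 1]
    // name is not listed and _id is set to 0, so both are dropped
  };
}

const projected = input.map(project);
console.log(projected[0]); // { age: 10, idx: 10, preidx: 9 }
```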

Arithmetic operators available in $project: $add (+), $multiply (*), $divide (/), $mod (%), and $subtract (-).

For string data: $substr: [expr, start, length] returns a substring; $concat: [expr1, expr2, ..., exprN] concatenates the expressions; $toLower: expr and $toUpper: expr return the lowercase or uppercase form of expr.
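For readers more familiar with JavaScript strings, the sketch below shows approximate JavaScript counterparts of these string operators. It assumes plain ASCII input (MongoDB's $substr counts bytes, while JavaScript counts UTF-16 code units), and the substr and concat helpers are hypothetical names for this example only.

```javascript
// Approximate JavaScript counterparts of the $project string operators,
// valid for ASCII strings only.
function substr(text, start, length) {
  // $substr: [expr, start, length]
  return text.substring(start, start + length);
}

function concat(...parts) {
  // $concat: [expr1, expr2, ..., exprN]
  return parts.join("");
}

const userName = "User42";
const examples = {
  substr: substr(userName, 0, 4),       // "User"
  concat: concat(userName, "-", "x"),   // "User42-x"
  lower: userName.toLowerCase(),        // $toLower -> "user42"
  upper: userName.toUpperCase(),        // $toUpper -> "USER42"
};
console.log(examples);
```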

2.4 Grouping operations

Use $group to group documents by the value of a specific field: documents with the same value in the grouping field are placed in the same group for the aggregation calculation. If no grouping field is specified (for example, _id: null), all documents are treated as one group. For each group, specific aggregate values can be computed according to the business logic. Both the group operation and the sort operation are non-streaming operators. A streaming operator can process each document as soon as it arrives, whereas a non-streaming operator cannot produce output until it has received all of its input documents. The $group operator therefore waits until every document has arrived, groups them, and only then sends the groups to the next operator in the pipeline.
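The non-streaming behaviour is easy to see in a plain-JavaScript model of a counting group. This is a sketch under assumptions, not the server's implementation: the groupCountByAge helper and the people array are hypothetical.

```javascript
// Plain-JavaScript model of {$group: {_id: "$age", count: {$sum: 1}}}.
function groupCountByAge(docs) {
  const counts = new Map();
  // Every input document has to be consumed first ...
  for (const doc of docs) {
    counts.set(doc.age, (counts.get(doc.age) || 0) + 1);
  }
  // ... because any later document could still change a group's count.
  // Only now can the groups be handed to the next stage.
  return Array.from(counts, ([age, count]) => ({ _id: age, count: count }));
}

// Hypothetical input documents.
const people = [{ age: 20 }, { age: 21 }, { age: 20 }, { age: 22 }, { age: 20 }];
const groups = groupCountByAge(people);
console.log(groups);
// [ { _id: 20, count: 3 }, { _id: 21, count: 1 }, { _id: 22, count: 1 } ]
```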

Example: group by age and count the number of documents in each group.

db.foo.aggregate([
  {$match: {age: {$lte: 25}}},
  {$project: {age: 1, preidx: {$subtract: ["$idx", 1]}, idx: 1, _id: 0}},
  {$group: {_id: "$age", count: {$sum: 1}}}
])

To group by multiple fields, use a document as the _id. The example below groups by age and age2; it uses the same field twice for demonstration purposes only, and a real application would group by different fields.

db.foo.aggregate([
  {$match: {age: {$lte: 25}}},
  {$project: {age: 1, preidx: {$subtract: ["$idx", 1]}, idx: 1, _id: 0}},
  {$group: {_id: {age: "$age", age2: "$age"}, count: {$sum: 1}}}
])

For each group, the count field is the number of documents in the group, idxtotal is the sum of the idx values in the group, and idxmax is the largest idx value in the group. idxfirst is the idx value of the first document in the group, which is not necessarily the smallest.

db.foo.aggregate([
  {$match: {age: {$lte: 25}}},
  {$project: {age: 1, preidx: {$subtract: ["$idx", 1]}, idx: 1, _id: 0}},
  {$group: {
    _id: {age: "$age", age2: "$age"},
    count: {$sum: 1},
    idxtotal: {$sum: "$idx"},
    idxmax: {$max: "$idx"},
    idxfirst: {$first: "$idx"}
  }}
])
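To show how the four accumulators differ, here is a plain-JavaScript sketch that computes the same values over a small in-memory array. The rows data is hypothetical and deliberately chosen so that the first idx in a group is not the smallest; groupWithAccumulators is an illustrative helper, not a MongoDB function.

```javascript
// Plain-JavaScript model of the $sum, $max and $first accumulators in $group.
function groupWithAccumulators(docs) {
  const groups = new Map();
  for (const doc of docs) {
    let group = groups.get(doc.age);
    if (!group) {
      group = {
        _id: { age: doc.age, age2: doc.age },
        count: 0,
        idxtotal: 0,
        idxmax: -Infinity,
        idxfirst: doc.idx, // $first: value from the first document seen in the group
      };
      groups.set(doc.age, group);
    }
    group.count += 1;                                // {$sum: 1}
    group.idxtotal += doc.idx;                       // {$sum: "$idx"}
    group.idxmax = Math.max(group.idxmax, doc.idx);  // {$max: "$idx"}
  }
  return Array.from(groups.values());
}

// Hypothetical input: in the age-20 group the first idx (5) is not the smallest (2).
const rows = [
  { idx: 5, age: 20 },
  { idx: 2, age: 20 },
  { idx: 9, age: 20 },
  { idx: 4, age: 21 },
];
const summary = groupWithAccumulators(rows);
console.log(summary[0]);
// { _id: { age: 20, age2: 20 }, count: 3, idxtotal: 16, idxmax: 9, idxfirst: 5 }
```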

2.5 Sort, limit, and skip operations

Sort the results of the aggregation operation, skip the first 10 documents, and then take the first 10 documents of the remaining result set. After $group, the age value is part of the group key, so the sort refers to it as "_id.age".

db.foo.aggregate([
  {$match: {age: {$lte: 25}}},
  {$project: {age: 1, preidx: {$subtract: ["$idx", 1]}, idx: 1, _id: 0}},
  {$group: {
    _id: {age: "$age", age2: "$age"},
    count: {$sum: 1},
    idxtotal: {$sum: "$idx"},
    idxmax: {$max: "$idx"},
    idxfirst: {$first: "$idx"}
  }},
  {$sort: {"_id.age": -1}},
  {$skip: 10},
  {$limit: 10}
])
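The combined effect of sorting, skipping 10, and limiting to 10 can be checked with a plain-JavaScript sketch. It assumes the grouped results are sorted by the age in the group key, descending, and uses a hypothetical grouped array with one group per age from 0 to 25.

```javascript
// Plain-JavaScript model of sort (descending by group age), skip 10, limit 10.
// Hypothetical grouped results: one group per age 0..25.
const grouped = [];
for (let age = 0; age <= 25; age++) {
  grouped.push({ _id: { age: age }, count: 1 });
}

const skip = 10;
const limit = 10;
const page = grouped
  .slice()                                // copy, so the input stays untouched
  .sort((a, b) => b._id.age - a._id.age)  // descending: 25, 24, ..., 0
  .slice(skip, skip + limit);             // drop the first 10, keep the next 10

console.log(page.map((g) => g._id.age)); // [ 15, 14, 13, 12, 11, 10, 9, 8, 7, 6 ]
```

Because the sort has to see every group before it can emit the first one, the order of the three stages matters: skipping or limiting before the sort would select a different set of groups.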

Part 2: computing aggregations with the MapReduce method

MapReduce is very flexible and can compute very complex aggregation logic, but it is very slow and should not be used for real-time data analysis. MapReduce can run in parallel on multiple servers: each server completes only part of the workload and sends its partial result to the master server, which merges the partial results, computes the final result set, and returns it to the client.

MapReduce is divided into two phases: map and reduce. As an analogy, suppose a train has 10 carriages and we want to count the men and women on board. The serial approach counts one carriage after another until the whole train has been counted: 50 men, 40 women.

The MapReduce approach sends one person to each carriage to count. Each person returns a document such as keyN: {female: num1, male: num2}, where keyN is the carriage number. The 10 people work at the same time, so each completes only 10% of the total workload, and 10 documents, key1 to key10, come back very quickly. Adding up the female and male values of the 10 documents gives the totals for the whole train: 50 men, 40 women.
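The merge step of the carriage analogy can be written as a short plain-JavaScript sketch. The per-carriage numbers (5 men and 4 women in each carriage) are an assumption chosen only so that the totals match the 50 men and 40 women of the analogy.

```javascript
// One partial result per carriage, as each counter would report it.
// Assumed split: 5 men and 4 women in every carriage.
const carriages = [];
for (let n = 1; n <= 10; n++) {
  carriages.push({ key: "key" + n, female: 4, male: 5 });
}

// Merge step: add up the partial counts from all carriages.
const trainTotal = carriages.reduce(
  (acc, carriage) => ({
    female: acc.female + carriage.female,
    male: acc.male + carriage.male,
  }),
  { female: 0, male: 0 }
);

console.log(trainTotal); // { female: 40, male: 50 }
```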

Computing an aggregation with MapReduce consists of three main steps: map, shuffle, and reduce. The map and reduce functions must be defined explicitly; shuffle is implemented by MongoDB.

    • Map: apply the map operation to each document to generate a key and a value. For example, mapping one document generates (female, {count: 1}), where female is the key and {count: 1} is the value.
    • Shuffle: group by key and collect the values that share a key into an array, for example (female, [{count: 1}, {count: 1}, {count: 1}, {count: 1}, ...]).
    • Reduce: reduce the array of values to a single value, for example (female, {count: 21}).
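The three steps above can be sketched in plain JavaScript over an in-memory array. This is a simplified model for illustration, not MongoDB's implementation; the passengers data and the mapPhase, shufflePhase, and reducePhase helpers are hypothetical names.

```javascript
// Hypothetical input documents.
const passengers = [
  { gender: "female" },
  { gender: "male" },
  { gender: "female" },
];

// Map: produce one (key, value) pair per document.
function mapPhase(docs) {
  return docs.map((doc) => [doc.gender, { count: 1 }]);
}

// Shuffle: collect the values that share a key into one array per key.
function shufflePhase(pairs) {
  const byKey = new Map();
  for (const [key, value] of pairs) {
    if (!byKey.has(key)) {
      byKey.set(key, []);
    }
    byKey.get(key).push(value);
  }
  return byKey;
}

// Reduce: collapse each array of values into a single value.
function reducePhase(byKey) {
  const result = {};
  for (const [key, values] of byKey) {
    result[key] = { count: values.reduce((sum, v) => sum + v.count, 0) };
  }
  return result;
}

const genderCounts = reducePhase(shufflePhase(mapPhase(passengers)));
console.log(genderCounts); // { female: { count: 2 }, male: { count: 1 } }
```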

MapReduce is best suited to aggregations whose partial results can be combined, for example maximum/minimum, sum, and average. The average is computed from each server's partial sums sum1, sum2, ..., sumN and counts num1, num2, ..., numN: avg = (sum1 + sum2 + ... + sumN) / (num1 + num2 + ... + numN).
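A short plain-JavaScript sketch shows why the average has to be carried as a sum and a count rather than as per-server averages. The partials array is hypothetical data for three servers.

```javascript
// Hypothetical partial results from three servers.
const partials = [
  { sum: 30, num: 3 },
  { sum: 50, num: 2 },
  { sum: 20, num: 5 },
];

// Correct combination: total sum divided by total count.
function combineAverage(parts) {
  let totalSum = 0;
  let totalNum = 0;
  for (const part of parts) {
    totalSum += part.sum;
    totalNum += part.num;
  }
  return totalSum / totalNum;
}

// Incorrect shortcut, for comparison: the average of the per-server averages.
function averageOfAverages(parts) {
  const averages = parts.map((part) => part.sum / part.num);
  return averages.reduce((a, b) => a + b, 0) / averages.length;
}

console.log(combineAverage(partials));    // 10  (100 / 10)
console.log(averageOfAverages(partials)); // 13  ((10 + 25 + 4) / 3)
```

The two results differ because the servers hold different numbers of documents, which is why each server reports both its sum and its count.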

Example 1: use MapReduce to simulate count by counting, for each field name (key), how many documents in the collection contain it.

Step 1: define the map function and the reduce function

For each field of each document, emit the field name as the key and the document {count: 1} as the value.

map = function () {
  // emit one {count: 1} per field name in the current document
  for (var key in this) {
    emit(key, {count: 1});
  }
};
reduce = function (key, emits) {
  var total = 0;
  for (var i in emits) {
    total += emits[i].count;
  }
  return {count: total};
};

Step 2: run the MapReduce operation

Run a MapReduce operation on the collection foo and store the returned object in mr.

mr = db.runCommand({mapReduce: "foo", map: map, reduce: reduce, out: "Count Doc"})

Step 3: view the results of the MapReduce calculation

db[mr.result].find()
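To try the key-counting map and reduce functions without a server, the following plain-JavaScript harness runs them over an in-memory array. It is a rough sketch: runMapReduce, the module-level emit variable, and sampleDocs are hypothetical stand-ins, and unlike this harness a real server may skip reduce for keys with a single value or call it several times per key.

```javascript
let emit; // stand-in for MongoDB's built-in emit, assigned by the harness below

const map = function () {
  for (const key in this) {
    emit(key, { count: 1 });
  }
};

const reduce = function (key, emits) {
  let total = 0;
  for (const value of emits) {
    total += value.count;
  }
  return { count: total };
};

// Hypothetical harness: map every document, group by key, reduce each group once.
function runMapReduce(docs, mapFn, reduceFn) {
  const byKey = new Map();
  emit = (key, value) => {
    if (!byKey.has(key)) {
      byKey.set(key, []);
    }
    byKey.get(key).push(value);
  };
  for (const doc of docs) {
    mapFn.call(doc); // "this" inside map is the current document
  }
  const results = [];
  for (const [key, values] of byKey) {
    results.push({ _id: key, value: reduceFn(key, values) });
  }
  return results;
}

// Hypothetical documents; the last one has no age field.
const sampleDocs = [
  { idx: 0, name: "user0", age: 0 },
  { idx: 1, name: "user1", age: 1 },
  { idx: 2, name: "user2" },
];
const keyCounts = runMapReduce(sampleDocs, map, reduce);
console.log(keyCounts);
// idx and name are counted 3 times each, age 2 times
```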

Example 2: count the number of documents for each distinct age in the collection foo.

Step 1: define the map and reduce functions

The map function maps each document once, emitting its age as the key and {count: 1} as the value.

After the shuffle, each age has a list: [{count: 1}, {count: 1}, {count: 1}, {count: 1}, ...]. MongoDB calls the reduce function for each distinct age, and each call receives a different key.

The reduce function is called by MongoDB to aggregate the list belonging to one age.

map = function () {
  emit(this.age, {count: 1});
};
reduce = function (key, emits) {
  var total = 0;
  for (var i in emits) {
    total += emits[i].count;
  }
  return {age: key, count: total};
};

Step 2: run the MapReduce aggregation operation

mr = db.runCommand({mapReduce: "foo", map: map, reduce: reduce, out: "Count Doc"})

Step 3: view the results of the aggregation operation

db[mr.result].find()

Example 3: study the properties of the reduce function

The reduce function is cumulative: it can be invoked several times, including on its own earlier results, to produce the final accumulated value. For example, the following reduce function computes the total count for any given key.

reduce = function (key, emits) {
  var total = 0;
  for (var i in emits) {
    total += emits[i].count;
  }
  return {key: key, count: total};
};

Invocation example: the key is the same ("x") in every call and each emits argument is an array. Calling the reduce function repeatedly finally yields the cumulative count for the key.

r1 = reduce("x", [{count: 1}, {count: 2}])
r2 = reduce("x", [{count: 3}, {count: 5}])
r3 = reduce("x", [r1, r2])
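The property can be verified directly in plain JavaScript: reducing the two partial results gives the same count as reducing all four values in one call. The sketch below restates the reduce function so that it is self-contained; the variable allAtOnce is introduced only for the comparison.

```javascript
const reduce = function (key, emits) {
  let total = 0;
  for (const value of emits) {
    total += value.count;
  }
  return { key: key, count: total };
};

const r1 = reduce("x", [{ count: 1 }, { count: 2 }]); // count: 3
const r2 = reduce("x", [{ count: 3 }, { count: 5 }]); // count: 8
const r3 = reduce("x", [r1, r2]);                     // count: 11

// Reducing every value in a single call gives the same result.
const allAtOnce = reduce("x", [{ count: 1 }, { count: 2 }, { count: 3 }, { count: 5 }]);

console.log(r3.count, allAtOnce.count); // 11 11
```

This only works because the value returned by reduce has the same shape (a count field) as the values emitted by map, so a result can be fed back in as an input.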
