Aggregation is one of the most important tools MongoDB provides; it plays a role similar to the `GROUP BY` clause in SQL. The aggregation tools let developers manipulate data in the database directly using native MongoDB commands and aggregate it as required.
MongoDB provides three ways to perform aggregation: the aggregation pipeline, the map-reduce function, and single-purpose aggregation operations.
Some of the functions used for aggregation operations are:

- `aggregate(pipeline, options)` — specify the keys to group by and perform simple reductions with operators such as `$push`/`$addToSet`/`$sum`; custom functions and variables are not supported
- `group({ key, reduce, initial [, keyf] [, cond] [, finalize] })` — a cut-down version of `mapReduce` that supports functions (`keyf`)
- `mapReduce`
- `count(query)`
- `distinct(field, query)`
1. Aggregation pipeline
MongoDB's aggregation framework is modeled on the concept of data-processing pipelines: documents enter a multi-stage pipeline that transforms them into an aggregated result.
As with pipes in *nix, each stage passes its output as the input to the next stage. Pipeline aggregation in MongoDB is very practical, provides efficient data aggregation, and is the preferred method for aggregating data in MongoDB.
The official example:

[{$match: {status: 'A'}}, {$group: {_id: '$cust_id', total: {$sum: '$amount'}}}]
`aggregate` takes an array containing multiple objects (stages) and manipulates the data in the collection by running it through the pipeline array.
$match: Query criteria
$group: Aggregation Configuration
- `_id` is the key to group by. In the data above, we want to sum `amount` over all entries with the same `cust_id`, so `_id` is set to `$cust_id`. `_id` is required, but it may be set to `null` (to aggregate over the whole collection).
- `total` is the name of a field in the output; here `total` is the sum of `amount` for each group.
- `$sum` is an accumulator operator; the other operators can be found in the official documentation. With `{$sum: '$amount'}` the values of `amount` are summed under the same key; if you want to count the number of occurrences of each key instead, write `{$sum: 1}`.
The aggregation process
Looking at the example above: `$match` filters first, keeping only documents whose `status` is 'A'; then the filtered documents are grouped, summing the `amount` values that share the same `cust_id`; finally the result is output.
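The two stages above can be sketched in plain JavaScript. This is only an illustration of the semantics, not MongoDB's API; the sample documents are made up.

```javascript
// Sample documents (illustrative, not from any real collection)
const docs = [
  { cust_id: "A123", amount: 500, status: "A" },
  { cust_id: "A123", amount: 250, status: "A" },
  { cust_id: "B212", amount: 200, status: "A" },
  { cust_id: "A123", amount: 300, status: "D" }, // removed by $match
];

// $match: keep only documents whose status is 'A'
const matched = docs.filter((d) => d.status === "A");

// $group: sum amount per cust_id
const groups = {};
for (const d of matched) {
  if (!(d.cust_id in groups)) groups[d.cust_id] = { _id: d.cust_id, total: 0 };
  groups[d.cust_id].total += d.amount;
}
console.log(Object.values(groups));
// A123 has total 750, B212 has total 200
```

Note how the document with `status: "D"` never reaches `$group`: each stage only sees the previous stage's output.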
2. aggregate in detail
`aggregate` accepts two parameters, `pipeline` and `options`. `pipeline` is an array, and the same stage operator can be used multiple times.
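Conceptually, a pipeline is just an array of stages applied in order, which is why the same stage type can appear more than once. A minimal plain-JavaScript sketch (the stage functions are illustrative stand-ins, not MongoDB operators):

```javascript
// Each "stage" is a function from an array of documents to an array of documents
const pipeline = [
  (docs) => docs.filter((d) => d.n > 1),         // like $match
  (docs) => docs.map((d) => ({ n: d.n * 10 })),  // like a $project computing a value
  (docs) => docs.map((d) => ({ n: d.n + 1 })),   // the same stage type used again
];

// Run the documents through every stage in order
const out = pipeline.reduce((docs, stage) => stage(docs), [{ n: 1 }, { n: 2 }, { n: 3 }]);
console.log(out); // [ { n: 21 }, { n: 31 } ]
```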
Stages supported by the pipeline:
- `$geoNear` — returns documents together with the distance from the query point to each result
- `$group` — specify the grouping `_id` (key/keys) and accumulate with operators (`$push`/`$sum`/`$addToSet`/…)
- `$limit` — limit the number of documents
- `$match` — filter criteria
- `$out` — save the output to a collection
- `$project` — modify the structure of the documents in the data flow
- `$redact` — a merge of `$project` and `$match` functionality
- `$skip` — skip documents
- `$sort` — sort the results
- `$unwind` — split array data apart
`$group` allows the accumulator operators `$addToSet`/`$avg`/`$first`/`$last`/`$max`/`$min`/`$push`/`$sum`; operators such as `$each` are not allowed. By default a stage may use at most 100MB of RAM; setting `allowDiskUse` lets `$group` operate on more data.
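The difference between the `$push` and `$addToSet` accumulators is easy to miss: both collect values into an array, but `$addToSet` keeps only distinct values. A plain-JavaScript sketch of the two behaviors (illustrative only):

```javascript
const values = ["a", "b", "a", "c"]; // values seen for one group

// $push collects every value, duplicates included
const pushed = [];
for (const v of values) pushed.push(v);

// $addToSet collects only distinct values
const set = [];
for (const v of values) if (set.indexOf(v) === -1) set.push(v);

console.log(pushed); // [ 'a', 'b', 'a', 'c' ]
console.log(set);    // [ 'a', 'b', 'c' ]
```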
Here is an example of aggregate usage:
db.newtest.aggregate([
    {$match: {}},
    {$skip: 10}, // skip the first 10 documents of the collection
    {$project: {group: 1, datetime: 1, category: 1, count: 1}},
    // if {count: 1} is not selected, count_all / count_avg = 0 in the final result
    {$redact: { // simple redact usage: filter out lines where group != 'A'
        $cond: [{$eq: ["$group", "A"]}, "$$DESCEND", "$$PRUNE"]
    }},
    {$group: {
        _id: {year: {$year: "$datetime"}, month: {$month: "$datetime"}, day: {$dayOfMonth: "$datetime"}},
        group_unique: {$addToSet: "$group"},
        category_first: {$first: "$category"},
        category_last: {$last: "$category"},
        count_all: {$sum: "$count"},
        count_avg: {$avg: "$count"},
        rows: {$sum: 1}
    }},
    // split group_unique; if this stage is enabled it causes _id to repeat, so the result
    // cannot be written to the collection specified by $out unless $group is applied again
    // {$unwind: "$group_unique"},
    // keep only these two fields
    {$project: {group_unique: 1, rows: 1}},
    // sort results by _id
    {$sort: {"_id": 1}},
    // keep only 50 results
    // {$limit: 50},
    // save the results
    {$out: "data_agg_out"},
], {
    explain: true,
    allowDiskUse: true,
    cursor: {batchSize: 0}
})
db.data_agg_out.find()
db.data_agg_out.aggregate([
    {$group: {
        _id: null,
        rows: {$sum: '$rows'}
    }}
])
db.data_agg_out.drop()
- `$match` — filter the data before aggregation
- `$skip` — skip the first n documents before aggregation; with `{$skip: 10}`, the final `rows = 5000000 - 10`
- `$project` — select the fields you want; every field other than `_id` takes the value 1
- `$redact` — the documentation is unclear about its actual use case; here it simply filters the data before aggregation
- `$group` — specify the accumulation method for each field
- `$unwind` — split the values of an array field, which causes `_id` to be duplicated
- `$project` — can be used multiple times; use it last to filter down to the fields you want to store
- `$out` — raises an error if the `_id` values produced by `$group`/`$project`/`$redact` are not unique
- the stages `$project`/`$redact`/`$group`/`$unwind` above can each be used multiple times
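The `$unwind`-causes-duplicate-`_id` point can be seen in a plain-JavaScript sketch (the sample document is illustrative): one output document is produced per array element, all sharing the original `_id`.

```javascript
// A grouped document with an array field, as $group with $addToSet might produce
const doc = { _id: 1, group_unique: ["a", "b"], rows: 10 };

// $unwind "$group_unique": one copy of the document per array element
const unwound = doc.group_unique.map((g) => ({
  _id: doc._id,          // _id is repeated across the outputs
  group_unique: g,       // the array field becomes a single value
  rows: doc.rows,
}));
console.log(unwound);
// two documents, both with _id: 1
```

This is why a pipeline that ends in `$out` cannot have `$unwind` as its last transforming stage unless another `$group` restores `_id` uniqueness.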
3. group
One way `group` is better than `aggregate` is that, as in map/reduce, it supports `function` definitions. Here are the supported options:
- `ns` — when called via `db.runCommand({group: {}})`, `ns` specifies the collection
- `cond` — pre-aggregation filter
- `key` — aggregation key
- `initial` — initial value of the accumulated result
- `$reduce` — accepts `(curr, result)` parameters and accumulates `curr` into `result`
- `keyf` — a function that replaces `key` to compute the grouping key
- `finalize` — post-processing of each result
You must ensure the output is less than 16MB, because `group` provides no option to write results to a collection.
db.data.group({
    cond: {'group': 'A'},
    // key: {'group': 1, 'category': 1},
    keyf: function(doc) {
        var dt = new Date(doc.created);
        // or var dt = doc.datetime;
        return {
            year: doc.datetime.getFullYear(),
            month: doc.datetime.getMonth() + 1,
            day: doc.datetime.getDate()
        };
    },
    initial: {count: 0, category: []},
    $reduce: function(curr, result) {
        result.count += curr.count;
        if (result.category.indexOf(curr.category) == -1) {
            result.category.push(curr.category);
        }
    },
    finalize: function(result) {
        result.category = result.category.join();
    }
})
If you want to aggregate large amounts of data, you need to use `mapReduce`.
4. mapReduce
- `query` — pre-aggregation filter
- `sort` — sort the data before aggregation, to optimize reduce
- `limit` — limit the number of documents passed into map
- `map` — (function) call `emit(key, value)` inside it to specify the key/value pair to aggregate
- `reduce` — (function) receives `(key, values)`; `key` is defined in map, and `values` is an array of all the values emitted under that key
- `finalize` — process the final result
- `out` — result dump; a different db can be selected
- `scope` — set global variables
- `jsMode` (false) — whether to skip converting the map/reduce intermediate results to BSON format; the BSON format can take advantage of disk space, so very large datasets can be processed
- `verbose` (true) — include more information
If `jsMode` is set to true, the BSON conversion is skipped, which can speed up the execution of reduce, but memory limits apply and it is only available when map emits fewer than 500,000 results.
Things to be aware of when writing mapReduce:
- the value structure passed to emit must be consistent with the value structure returned by reduce
- reduce functions must be idempotent
- see Troubleshoot the Reduce Function
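A quick way to check idempotency is to feed reduce its own output: `reduce(key, [reduce(key, values)])` must equal `reduce(key, values)`. A minimal sketch with a simplified value structure (the function and data here are illustrative, not the full example below):

```javascript
// A reduce function over values shaped like {count, rows}
function reduceFn(key, vals) {
  var reducedVal = { count: 0, rows: 0 }; // same structure as each input value
  for (var i = 0; i < vals.length; i++) {
    reducedVal.count += vals[i].count;
    reducedVal.rows += vals[i].rows;
  }
  return reducedVal;
}

var vals = [{ count: 2, rows: 1 }, { count: 3, rows: 1 }];
var once = reduceFn("k", vals);          // { count: 5, rows: 2 }
var twice = reduceFn("k", [once]);       // re-reducing the result changes nothing
console.log(once, twice);
```

MongoDB relies on this property because reduce may be called repeatedly on partial results, with earlier outputs mixed back into later inputs.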
db.data.mapReduce(function() {
    var d = this.datetime;
    var key = {
        year: d.getFullYear(),
        month: d.getMonth() + 1,
        day: d.getDate()
    };
    var value = {
        count: this.count,
        rows: 1,
        groups: [this.group]
    };
    emit(key, value);
}, function(key, vals) {
    var reducedVal = {count: 0, groups: [], rows: 0};
    for (var i = 0; i < vals.length; i++) {
        var v = vals[i];
        reducedVal.count += v.count;
        reducedVal.rows += v.rows;
        for (var j = 0; j < v.groups.length; j++) {
            if (reducedVal.groups.indexOf(v.groups[j]) == -1) {
                reducedVal.groups.push(v.groups[j]);
            }
        }
    }
    return reducedVal;
}, {
    query: {},
    sort: {datetime: 1}, // requires an index, otherwise the result returns empty
    limit: 50000,
    finalize: function(key, reducedVal) {
        reducedVal.avg = reducedVal.count / reducedVal.rows;
        return reducedVal;
    },
    out: {
        inline: 1,
        // replace: "",
        // merge: "",
        // reduce: ""
    },
    scope: {},
    jsMode: true
})
5. Summary
| Method | allowDiskUse | out | function |
| --- | --- | --- | --- |
| aggregate | true | pipeline/collection | false |
| group | false | pipeline | true |
| mapReduce | jsMode | pipeline/collection | true |
- `aggregate` — pipeline-based aggregation; `$project`/`$group` can be reused layer by layer to aggregate data; usable for large amounts of data (a single output must be less than 16MB); not available for sharded data
- `mapReduce` — can handle very large datasets; requires strict adherence to structure consistency and idempotency when writing map/reduce; can output/merge incrementally, see the `out` options
- `group` — usable for simple requirements that would be `GROUP BY` in an RDB (only inline output); generates a read lock
Introduction and use of aggregation tools aggregate in MongoDB