Aggregation is one of the most important tools MongoDB provides; it plays a role similar to the `GROUP BY` clause in SQL. The aggregation tools let developers manipulate data in the database directly using native MongoDB commands and aggregate it as required.
MongoDB provides three ways to perform aggregation: the aggregation pipeline, the map-reduce function, and single-purpose aggregation operations.
Some of the functions used for aggregation operations are:

- `aggregate(pipeline, options)` — specify the keys to group by and perform simple reductions with operators such as `$push`/`$addToSet`/`$sum`; custom functions and variables are not supported
- `group({ key, reduce, initial [, keyf] [, cond] [, finalize] })` — a cut-down version of `mapReduce` that supports functions (`keyf`)
- `mapReduce`
- `count(query)`
- `distinct(field, query)`
1. Aggregation pipeline
MongoDB's aggregation framework is modeled on the concept of data-processing pipelines: documents enter a multi-stage pipeline that transforms them into an aggregated result.
As with pipes in *nix, each stage passes its output as the input to the next stage. Pipeline aggregation in MongoDB is very practical, provides efficient data aggregation, and is the preferred method for aggregating data in MongoDB.
The official example:

[{$match: {status: 'A'}}, {$group: {_id: '$cust_id', total: {$sum: '$amount'}}}]
`aggregate` takes an array containing multiple objects (stages) and manipulates the data in the collection by running it through the pipeline array.
$match: Query criteria
$group: Aggregation Configuration
- `_id` is the key to group by. In the data above, we want to sum `amount` over all entries with the same `cust_id`, so `_id` is set to `$cust_id`. `_id` is required, but it may be set to `null` (to aggregate over the whole collection).
- `total` is the name of a field in the output; here `total` is the sum of `amount` for each group.
- `$sum` is an accumulator operator; the other operators can be found in the official documentation. With `{$sum: '$amount'}` the values of `amount` are summed under the same key; if you want to count the number of occurrences of each key instead, write `{$sum: 1}`.
The aggregation process
Looking at the example above: `$match` filters first, keeping only documents whose `status` is 'A'; then the filtered documents are grouped, summing the `amount` values that share the same `cust_id`; finally the result is output.
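The two stages above can be sketched in plain JavaScript. This is only an illustration of the semantics, not MongoDB's API; the sample documents are made up.

```javascript
// Sample documents (illustrative, not from any real collection)
const docs = [
  { cust_id: "A123", amount: 500, status: "A" },
  { cust_id: "A123", amount: 250, status: "A" },
  { cust_id: "B212", amount: 200, status: "A" },
  { cust_id: "A123", amount: 300, status: "D" }, // removed by $match
];

// $match: keep only documents whose status is 'A'
const matched = docs.filter((d) => d.status === "A");

// $group: sum amount per cust_id
const groups = {};
for (const d of matched) {
  if (!(d.cust_id in groups)) groups[d.cust_id] = { _id: d.cust_id, total: 0 };
  groups[d.cust_id].total += d.amount;
}
console.log(Object.values(groups));
// A123 has total 750, B212 has total 200
```

Note how the document with `status: "D"` never reaches `$group`: each stage only sees the previous stage's output.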
2. aggregate in detail
`aggregate` accepts two parameters, `pipeline` and `options`. `pipeline` is an array, and the same stage operator can be used multiple times.
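Conceptually, a pipeline is just an array of stages applied in order, which is why the same stage type can appear more than once. A minimal plain-JavaScript sketch (the stage functions are illustrative stand-ins, not MongoDB operators):

```javascript
// Each "stage" is a function from an array of documents to an array of documents
const pipeline = [
  (docs) => docs.filter((d) => d.n > 1),         // like $match
  (docs) => docs.map((d) => ({ n: d.n * 10 })),  // like a $project computing a value
  (docs) => docs.map((d) => ({ n: d.n + 1 })),   // the same stage type used again
];

// Run the documents through every stage in order
const out = pipeline.reduce((docs, stage) => stage(docs), [{ n: 1 }, { n: 2 }, { n: 3 }]);
console.log(out); // [ { n: 21 }, { n: 31 } ]
```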
Stages supported by the pipeline:
- `$geoNear` — returns documents together with the distance from the query point to each result
- `$group` — specify the grouping `_id` (key/keys) and accumulate with operators (`$push`/`$sum`/`$addToSet`/…)
- `$limit` — limit the number of documents
- `$match` — filter criteria
- `$out` — save the output to a collection
- `$project` — modify the structure of the documents in the data flow
- `$redact` — a merge of `$project` and `$match` functionality
- `$skip` — skip documents
- `$sort` — sort the results
- `$unwind` — split array data apart
`$group` allows the accumulator operators `$addToSet`/`$avg`/`$first`/`$last`/`$max`/`$min`/`$push`/`$sum`; operators such as `$each` are not allowed. By default a stage may use at most 100MB of RAM; setting `allowDiskUse` lets `$group` operate on more data.
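The difference between the `$push` and `$addToSet` accumulators is easy to miss: both collect values into an array, but `$addToSet` keeps only distinct values. A plain-JavaScript sketch of the two behaviors (illustrative only):

```javascript
const values = ["a", "b", "a", "c"]; // values seen for one group

// $push collects every value, duplicates included
const pushed = [];
for (const v of values) pushed.push(v);

// $addToSet collects only distinct values
const set = [];
for (const v of values) if (set.indexOf(v) === -1) set.push(v);

console.log(pushed); // [ 'a', 'b', 'a', 'c' ]
console.log(set);    // [ 'a', 'b', 'c' ]
```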
Here is an example of aggregate usage:
db.newtest.aggregate([
    {$match: {}},
    {$skip: 10}, // skip the first 10 documents of the collection
    {$project: {group: 1, datetime: 1, category: 1, count: 1}},
    // if {count: 1} is not selected, count_all / count_avg = 0 in the final result
    {$redact: { // simple redact usage: filter out lines where group != 'A'
        $cond: [{$eq: ["$group", "A"]}, "$$DESCEND", "$$PRUNE"]
    }},
    {$group: {
        _id: {year: {$year: "$datetime"}, month: {$month: "$datetime"}, day: {$dayOfMonth: "$datetime"}},
        group_unique: {$addToSet: "$group"},
        category_first: {$first: "$category"},
        category_last: {$last: "$category"},
        count_all: {$sum: "$count"},
        count_avg: {$avg: "$count"},
        rows: {$sum: 1}
    }},
    // split group_unique; if this stage is enabled it causes _id to repeat, so the result
    // cannot be written to the collection specified by $out unless $group is applied again
    // {$unwind: "$group_unique"},
    // keep only these two fields
    {$project: {group_unique: 1, rows: 1}},
    // sort results by _id
    {$sort: {"_id": 1}},
    // keep only 50 results
    // {$limit: 50},
    // save the results
    {$out: "data_agg_out"},
], {
    explain: true,
    allowDiskUse: true,
    cursor: {batchSize: 0}
})
db.data_agg_out.find()
db.data_agg_out.aggregate([
    {$group: {
        _id: null,
        rows: {$sum: '$rows'}
    }}
])
db.data_agg_out.drop()
- `$match` — filter the data before aggregation
- `$skip` — skip the first n documents before aggregation; with `{$skip: 10}`, the final `rows = 5000000 - 10`
- `$project` — select the fields you want; every field other than `_id` takes the value 1
- `$redact` — the documentation is unclear about its actual use case; here it simply filters the data before aggregation
- `$group` — specify the accumulation method for each field
- `$unwind` — split the values of an array field, which causes `_id` to be duplicated
- `$project` — can be used multiple times; use it last to filter down to the fields you want to store
- `$out` — raises an error if the `_id` values produced by `$group`/`$project`/`$redact` are not unique
- the stages `$project`/`$redact`/`$group`/`$unwind` above can each be used multiple times
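The `$unwind`-causes-duplicate-`_id` point can be seen in a plain-JavaScript sketch (the sample document is illustrative): one output document is produced per array element, all sharing the original `_id`.

```javascript
// A grouped document with an array field, as $group with $addToSet might produce
const doc = { _id: 1, group_unique: ["a", "b"], rows: 10 };

// $unwind "$group_unique": one copy of the document per array element
const unwound = doc.group_unique.map((g) => ({
  _id: doc._id,          // _id is repeated across the outputs
  group_unique: g,       // the array field becomes a single value
  rows: doc.rows,
}));
console.log(unwound);
// two documents, both with _id: 1
```

This is why a pipeline that ends in `$out` cannot have `$unwind` as its last transforming stage unless another `$group` restores `_id` uniqueness.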
3. group
One way `group` is better than `aggregate` is that, as in map/reduce, it supports `function` definitions. Here are the supported options:
- `ns` — when called via `db.runCommand({group: {}})`, `ns` specifies the collection
- `cond` — pre-aggregation filter
- `key` — aggregation key
- `initial` — initial value of the accumulated result
- `$reduce` — accepts `(curr, result)` parameters and accumulates `curr` into `result`
- `keyf` — a function that replaces `key` to compute the grouping key
- `finalize` — post-processing of each result
You must ensure the output is less than 16MB, because `group` provides no option to write results to a collection.
db.data.group({
    cond: {'group': 'A'},
    // key: {'group': 1, 'category': 1},
    keyf: function(doc) {
        var dt = new Date(doc.created);
        // or var dt = doc.datetime;
        return {
            year: doc.datetime.getFullYear(),
            month: doc.datetime.getMonth() + 1,
            day: doc.datetime.getDate()
        };
    },
    initial: {count: 0, category: []},
    $reduce: function(curr, result) {
        result.count += curr.count;
        if (result.category.indexOf(curr.category) == -1) {
            result.category.push(curr.category);
        }
    },
    finalize: function(result) {
        result.category = result.category.join();
    }
})
If you want to aggregate large amounts of data, you need to use `mapReduce`.
4. mapReduce
- `query` — pre-aggregation filter
- `sort` — sort the data before aggregation, to optimize reduce
- `limit` — limit the number of documents passed into map
- `map` — (function) call `emit(key, value)` inside it to specify the key/value pair to aggregate
- `reduce` — (function) receives `(key, values)`; `key` is defined in map, and `values` is an array of all the values emitted under that key
- `finalize` — process the final result
- `out` — result dump; a different db can be selected
- `scope` — set global variables
- `jsMode` (false) — whether to skip converting the map/reduce intermediate results to BSON format; the BSON format can take advantage of disk space, so very large datasets can be processed
- `verbose` (true) — include more information
If `jsMode` is set to true, the BSON conversion is skipped, which can speed up the execution of reduce, but memory limits apply and it is only available when map emits fewer than 500,000 results.
Things to be aware of when writing mapReduce:
- the value structure passed to emit must be consistent with the value structure returned by reduce
- reduce functions must be idempotent
- see Troubleshoot the Reduce Function
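A quick way to check idempotency is to feed reduce its own output: `reduce(key, [reduce(key, values)])` must equal `reduce(key, values)`. A minimal sketch with a simplified value structure (the function and data here are illustrative, not the full example below):

```javascript
// A reduce function over values shaped like {count, rows}
function reduceFn(key, vals) {
  var reducedVal = { count: 0, rows: 0 }; // same structure as each input value
  for (var i = 0; i < vals.length; i++) {
    reducedVal.count += vals[i].count;
    reducedVal.rows += vals[i].rows;
  }
  return reducedVal;
}

var vals = [{ count: 2, rows: 1 }, { count: 3, rows: 1 }];
var once = reduceFn("k", vals);          // { count: 5, rows: 2 }
var twice = reduceFn("k", [once]);       // re-reducing the result changes nothing
console.log(once, twice);
```

MongoDB relies on this property because reduce may be called repeatedly on partial results, with earlier outputs mixed back into later inputs.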
db.data.mapReduce(function() {
    var d = this.datetime;
    var key = {
        year: d.getFullYear(),
        month: d.getMonth() + 1,
        day: d.getDate()
    };
    var value = {
        count: this.count,
        rows: 1,
        groups: [this.group]
    };
    emit(key, value);
}, function(key, vals) {
    var reducedVal = {count: 0, groups: [], rows: 0};
    for (var i = 0; i < vals.length; i++) {
        var v = vals[i];
        reducedVal.count += v.count;
        reducedVal.rows += v.rows;
        for (var j = 0; j < v.groups.length; j++) {
            if (reducedVal.groups.indexOf(v.groups[j]) == -1) {
                reducedVal.groups.push(v.groups[j]);
            }
        }
    }
    return reducedVal;
}, {
    query: {},
    sort: {datetime: 1}, // requires an index, otherwise the result returns empty
    limit: 50000,
    finalize: function(key, reducedVal) {
        reducedVal.avg = reducedVal.count / reducedVal.rows;
        return reducedVal;
    },
    out: {
        inline: 1,
        // replace: "",
        // merge: "",
        // reduce: ""
    },
    scope: {},
    jsMode: true
})
5. Summary
| Method | allowDiskUse | out | function |
| --- | --- | --- | --- |
| aggregate | true | pipeline/collection | false |
| group | false | pipeline | true |
| mapReduce | jsMode | pipeline/collection | true |
- `aggregate` — pipeline-based aggregation; `$project`/`$group` can be reused layer by layer to aggregate data; usable for large amounts of data (a single output must be less than 16MB); not available for sharded data
- `mapReduce` — can handle very large datasets; requires strict adherence to structure consistency and idempotency when writing map/reduce; can output/merge incrementally, see the `out` options
- `group` — usable for simple requirements that would be `GROUP BY` in an RDB (only inline output); generates a read lock
Introduction and use of aggregation tools aggregate in MongoDB