Introduction and Use of the Aggregation Tool aggregate in MongoDB




aggregate is one of the more important tools that MongoDB provides, similar to GROUP BY in SQL. The aggregation tool lets developers manipulate the data in the database directly with native MongoDB commands and aggregate it as required.
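For readers coming from SQL, a rough sketch of the correspondence (the orders collection and its cust_id/amount fields are hypothetical placeholders, not part of any fixed schema):

// SQL: SELECT cust_id, SUM(amount) AS total FROM orders GROUP BY cust_id
// Roughly equivalent aggregation in the mongo shell:
db.orders.aggregate([
    {$group: {_id: "$cust_id", total: {$sum: "$amount"}}}
])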



MongoDB provides three ways to perform aggregation: the aggregation pipeline, the map-reduce function, and single purpose aggregation operations.



Some of the functions used to perform aggregation operations are:

    • aggregate(pipeline, options) — specify the group keys and perform simple reductions with operators ($push/$addToSet/$sum, ...); does not support functions or custom variables
    • group({ key, reduce, initial [, keyf] [, cond] [, finalize] }) — supports functions (keyf); a stripped-down version of mapReduce
    • mapReduce
    • count(query)
    • distinct(field, query)
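A quick sketch of the two single-purpose helpers (assuming the same hypothetical orders collection with status and cust_id fields):

// number of documents whose status is "A"
db.orders.count({status: "A"})
// unique cust_id values among those documents
db.orders.distinct("cust_id", {status: "A"})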

1. Aggregation Pipeline

MongoDB's aggregation framework is modeled on the concept of data processing pipelines: documents enter a multi-stage pipeline that transforms them into an aggregated result.


Like a *nix pipe, the pipeline passes the output of the previous stage as input to the next. Pipeline aggregation in MongoDB is very practical: it provides efficient data aggregation and is the preferred method for aggregating data in MongoDB.



The official chart (the diagram in the documentation shows documents flowing through a $match stage and then a $group stage to produce the aggregated result):

[
    {$match: {status: "A"}},
    {$group: {_id: "$cust_id", total: {$sum: "$amount"}}}
]


The argument to aggregate is an array containing multiple objects (stages); the data in the collection is manipulated by traversing this pipeline array.



$match: Query criteria



$group: Aggregation Configuration


    • _id represents the key of the data you want to group on. In the data above, we want to sum amount over all entries with the same cust_id, so _id is set to cust_id. _id is mandatory, but you can set it to a null value.

    • total represents one of the fields you want in the output; here total is the sum of amount within each group.

    • $sum is an accumulator operator; the other operators can be found in the official documentation. The command above sums $amount under the same key. If you want to count the number of occurrences of each key instead, write it as {$sum: 1}, as in the sketch below.
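A minimal sketch of the counting form (reusing the hypothetical orders data from above):

db.orders.aggregate([
    {$match: {status: "A"}},
    // count how many documents share each cust_id instead of summing amount
    {$group: {_id: "$cust_id", count: {$sum: 1}}}
])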

The aggregation process


Looking at the case above, all the data is first filtered by $match, leaving only documents whose status is "A"; the filtered data is then aggregated, computing the sum of amount for documents with the same cust_id; finally the result is output.


Aggregate in detail


aggregate accepts two parameters, pipeline and options. pipeline is an array, and the same stage operator can be used multiple times.
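For instance, a small sketch in which $match appears twice, once before the $group and once after it, where the second occurrence plays the role of SQL's HAVING (the field names are again the hypothetical orders schema):

db.orders.aggregate([
    {$match: {status: "A"}},                               // filter the input documents
    {$group: {_id: "$cust_id", total: {$sum: "$amount"}}},
    {$match: {total: {$gte: 100}}}                         // filter the aggregated results
])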



Stages supported by the pipeline:

  • $geoNear — returns the distance from the query point to each point in the query results
  • $group — specify the group _id (key/keys) and accumulations based on operators ($push/$sum/$addToSet/...)
  • $limit — limit the number of results
  • $match — filter the input documents
  • $out — save the output to a collection
  • $project — modify the structure of documents in the data flow
  • $redact — a merger of $project and $match functionality
  • $skip — skip documents
  • $sort — sort the results
  • $unwind — split array fields apart
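$unwind is the least SQL-like of these; a minimal sketch (assuming hypothetical documents with an array field tags) shows how one input document becomes one output document per array element:

// {_id: 1, tags: ["a", "b"]} is split into
// {_id: 1, tags: "a"} and {_id: 1, tags: "b"} -- note the duplicated _id
db.items.aggregate([
    {$unwind: "$tags"}
])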


Accumulator operators allowed in $group: $addToSet/$avg/$first/$last/$max/$min/$push/$sum. Operators such as $each are not allowed. By default at most 100MB of RAM can be used; adding allowDiskUse lets $group operate on more data.



Here is an example of aggregate usage:


db.newtest.aggregate([
    {$match: {}},
    {$skip: 10}, // skip the first 10 documents of the collection
    {$project: {group: 1, datetime: 1, category: 1, count: 1}},
    // if {count: 1} is not selected, count_all/count_avg = 0 in the final result
    {$redact: { // simple $redact usage: filter out lines where group != "A"
        $cond: [{$eq: ["$group", "A"]}, "$$DESCEND", "$$PRUNE"]
    }},
    {$group: {
        _id: {year: {$year: "$datetime"}, month: {$month: "$datetime"}, day: {$dayOfMonth: "$datetime"}},
        group_unique: {$addToSet: "$group"},
        category_first: {$first: "$category"},
        category_last: {$last: "$category"},
        count_all: {$sum: "$count"},
        count_avg: {$avg: "$count"},
        rows: {$sum: 1}
    }},
    // split group_unique; if this stage is turned on it causes _id to repeat,
    // and the result cannot be written to the collection specified by $out unless it is $group-ed again
    // {$unwind: "$group_unique"},
    // keep only these two fields
    {$project: {group_unique: 1, rows: 1}},
    // sort results by _id
    {$sort: {"_id": 1}},
    // keep only 50 results
    // {$limit: 50},
    // save the results
    {$out: "data_agg_out"},
], {
    explain: true,
    allowDiskUse: true,
    cursor: {batchSize: 0}
})
db.data_agg_out.find()
db.data_agg_out.aggregate([
    {$group: {
        _id: null,
        rows: {$sum: "$rows"}
    }}
])
db.data_agg_out.drop()
  • $match — filter the data before aggregation
  • $skip — skips the first n documents of the data set before aggregation; with {$skip: 10}, the final rows = 5000000 - 10
  • $project — select the fields you want; apart from _id, the value of each selected field is 1
  • $redact — the documentation is unclear about its actual use cases; here it simply filters the data before aggregation
  • $group — specify the accumulation method for each field
  • $unwind — splits the values of the array field, which causes _id duplicates
  • $project — can be reused multiple times; used last to filter the fields you want to store
  • $out — writes the result to a collection; reports an error if the _id produced by $group/$project/$redact is duplicated
  • The stages above ($project/$redact/$group/$unwind) can be used multiple times

2. group


Where group is better than aggregate is that it supports map/reduce-style function definitions. Here are the supported options:


    • ns — when invoked as db.runCommand({group: {...}}), ns specifies the collection
    • cond — pre-aggregation filter
    • key — aggregation key
    • initial — initial value of the accumulated result
    • $reduce — accepts (curr, result) parameters and accumulates curr into result
    • keyf — replaces key; computes the grouping key with a function
    • finalize — final result processing


The output must be guaranteed to be less than 16MB, because group provides no option to dump results to a collection.


db.data.group({
    cond: {'group': 'A'},
    //key: {'group': 1, 'category': 1},
    keyf: function(doc) {
        var dt = new Date(doc.created);
        // or
        // var dt = doc.datetime;
        return {
            year: doc.datetime.getFullYear(),
            month: doc.datetime.getMonth() + 1,
            day: doc.datetime.getDate()
        };
    },
    initial: {count: 0, category: []},
    $reduce: function(curr, result) {
        result.count += curr.count;
        if (result.category.indexOf(curr.category) == -1) {
            result.category.push(curr.category);
        }
    },
    finalize: function(result) {
        result.category = result.category.join();
    }
})


If you want to aggregate large amounts of data, you need to use mapReduce.


3. mapReduce
  • query — pre-aggregation filter
  • sort — sort the data before aggregation, to optimize reduce
  • limit — limit the number of documents fed into map
  • map — (function) call emit(key, value) inside the function to specify the k/v pairs to aggregate
  • reduce — (function) takes parameters (key, values); key is defined in map, and values is the array of all values emitted under that key
  • finalize — process the final result
  • out — dump the result; a different db can be selected
  • scope — set global variables
  • jsMode — (default false) whether to skip converting the intermediate map/reduce results to BSON; the BSON format can take advantage of disk space, so large datasets can be processed
  • verbose — (default true) include more information in the result


If jsMode is set to true, the BSON conversion is skipped, which can optimize the execution speed of reduce; but because of the memory limit it can only be used when the number of emit calls is less than 500,000.



Things to be aware of when writing mapReduce:


    • the value structure emitted by emit must be consistent with the value returned by reduce
    • the reduce function must be idempotent
    • see Troubleshoot the Reduce Function
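Idempotent here means reduce may be re-applied to its own output; a quick shell sanity check (assuming the reduce function from the example below has been saved as reduceFn, with made-up test values):

var k = {year: 2015, month: 1, day: 1};
var v1 = {count: 2, rows: 1, groups: ["A"]};
var v2 = {count: 3, rows: 2, groups: ["B"]};
printjson(reduceFn(k, [v1, v2]));
// re-reducing the already-reduced value should print the same result
printjson(reduceFn(k, [reduceFn(k, [v1, v2])]));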
db.data.mapReduce(function() {
    var d = this.datetime;
    var key = {
        year: d.getFullYear(),
        month: d.getMonth() + 1,
        day: d.getDate(),
    };
    var value = {
        count: this.count,
        rows: 1,
        groups: [this.group],
    };
    emit(key, value);
}, function(key, vals) {
    var reducedVal = {count: 0, groups: [], rows: 0};
    for (var i = 0; i < vals.length; i++) {
        var v = vals[i];
        reducedVal.count += v.count;
        reducedVal.rows += v.rows;
        for (var j = 0; j < v.groups.length; j++) {
            if (reducedVal.groups.indexOf(v.groups[j]) == -1) {
                reducedVal.groups.push(v.groups[j]);
            }
        }
    }
    return reducedVal;
}, {
    query: {},
    sort: {datetime: 1}, // requires an index, otherwise the result comes back empty
    limit: 50000,
    finalize: function(key, reducedVal) {
        reducedVal.avg = reducedVal.count / reducedVal.rows;
        return reducedVal;
    },
    out: {
        inline: 1,
        //replace: "",
        //merge: "",
        //reduce: "",
    },
    scope: {},
    jsMode: true
})
4. Summary
Method      allowDiskUse    function    out
aggregate   true            false       pipeline/collection
group       false           true        pipeline
mapReduce   jsMode          true        pipeline/collection
    • aggregate — pipeline-based aggregation; $project/$group can be reused to aggregate the data layer by layer; usable for large amounts of data (a single output under 16MB)
    • mapReduce — can handle very large datasets; requires strict adherence to consistent value structure and idempotency when writing the map/reduce functions; can output/merge incrementally, see the out options
    • group — fine for simple GROUP BY-style requirements from the RDB world (only inline output); generates a read lock and is not available for sharded data



