MongoDB Advanced, Part Two: MongoDB Aggregation


The previous article covered advanced MongoDB queries: http://blog.csdn.net/stronglyh/article/details/46817789

This article is about MongoDB aggregation.

One: The aggregation framework

MongoDB provides an aggregation framework for transforming and combining the documents in a collection. Its main components are:

Component category         Operator
Filter (filtering)         $match
Projection (projecting)    $project
Group (grouping)           $group
Sort (sorting)             $sort
Limit (limiting)           $limit
Skip (skipping)            $skip


To aggregate data, call the aggregate method:

db.collection.aggregate(pipeline);

For a single stage, pass one JSON object as the aggregation condition, for example:

db.users.aggregate({
    $project: {
        _id: 0,
        name: 1
    }
})

If more than one stage is required, pass an array of stages, for example:

db.users.aggregate([
    {$skip: 5},
    {$project: {_id: 0, name: 1}}
])

1.1: $match Match
$match filters the documents in the collection so that later stages aggregate only the matching subset.

For example, to compute statistics on users in Beijing only, use {$match: {"area": "BJ"}}. $match accepts the usual query operators ($gt, $lt, $in, and so on). There is one exception: geospatial operators are restricted in $match (for instance, $near and $nearSphere cannot be used there).

In practice, place $match as early in the pipeline as possible. This has two benefits:

First, it quickly filters out unwanted documents, which reduces the work the rest of the pipeline has to do.

Second, a $match that runs before any projecting or grouping can use an index.
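As a rough sketch of the "filter first" advice, the pipeline below puts $match ahead of $group. The users collection, its area and age fields, and the age threshold are assumptions made for illustration. The last lines only run where a shell db object is defined.

```javascript
// Hypothetical pipeline: keep only Beijing users aged 18 or over,
// then count them per age. Field names and values are illustrative.
const matchFirstPipeline = [
    // $match runs first, so it can use an index on "area" if one exists.
    { $match: { area: "BJ", age: { $gte: 18 } } },
    // Only the filtered subset reaches the grouping stage.
    { $group: { _id: "$age", count: { $sum: 1 } } }
];

// Run the pipeline only when a shell "db" object is available.
if (typeof db !== "undefined") {
    db.users.aggregate(matchFirstPipeline).forEach(printjson);
}
```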


1.2: $project Projection
Projection in a pipeline is more powerful than projection in a "normal" query. With $project you can extract fields from subdocuments, rename fields, and compute new values from those fields.

The simplest $project operation selects the fields you want from the document. You specify whether each field is included or excluded, using the same syntax as the second argument of a query. Running the following on the original collection returns documents that contain only an "author" field.

db.articles.aggregate({"$project": {"author": 1, "_id": 0}})
By default, the "_id" field is returned whenever it is present in the input document, which is why the example above excludes it explicitly.

Try it yourself and look at the results.

You can also rename a projected field. For example, you can return the "_id" of each user document as "userId":

db.users.aggregate({"$project": {"userId": "$_id", "_id": 0}});
The "$fieldname" syntax refers to the value of the fieldname field ("_id" in the example above) in the aggregation framework. For example, "$age" is replaced with the contents of the "age" field (possibly a number, possibly a string), and "$tags.3" is meant to refer to the 4th element of the tags array (on current servers, positional access inside aggregation expressions is done with $arrayElemAt rather than a numeric path component). Therefore, the "$_id" in the example above is replaced by the value of the "_id" field of each document that enters the pipeline.

Note that "_id" must be explicitly excluded; otherwise the value is returned twice: once as "userId" and once as "_id". You can use this technique to generate multiple copies of a field for use in a later $group.
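To make the "$fieldname" reference more concrete, here is a small sketch. The address.city field is a hypothetical addition, and resolveFieldPath is only a plain-JavaScript illustration of what a reference to a nested subdocument field means; it is not how the server implements it and it does not model arrays.

```javascript
// Projection that renames _id and pulls a field out of a subdocument.
const projectPipeline = [
    { $project: {
        _id: 0,                // exclude _id so the value is not returned twice
        userId: "$_id",        // copy the value of _id into userId
        city: "$address.city"  // hypothetical nested field
    } }
];

// Illustration only: look up "$a.b" in a plain object, one path part at a time.
function resolveFieldPath(doc, ref) {
    return ref.slice(1).split(".").reduce(
        (value, part) => (value == null ? undefined : value[part]),
        doc
    );
}

const sampleUser = { _id: 7, address: { city: "BJ" } };
const projected = {
    userId: resolveFieldPath(sampleUser, "$_id"),
    city: resolveFieldPath(sampleUser, "$address.city")
};

// Run the real projection only when a shell "db" object is available.
if (typeof db !== "undefined") {
    db.users.aggregate(projectPipeline).forEach(printjson);
}
```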



1.3: $group Grouping
$group groups documents by the value of a specific field. Example:

If you have a students collection and want to divide the students into groups by grade, you can group on the "grade" field.

Once you have chosen the field to group on, pass it to $group as the "_id" field. For the example above, the corresponding code is as follows:

{"$group": {"_id": "$grade"}}
The result of grouping students by grade might be:

{"result": [{"_id": "A+"}, {"_id": "A"}, {"_id": "A-"}, ..., {"_id": "F"}], "ok": 1}
Grouping operators

Grouping operators compute a value for each group, so that every group produces a corresponding result.
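As an illustration of grouping operators, the stage below uses $sum and $avg to produce a count and an average per grade. The score field and the sample data are assumptions. The groupByGrade function is a plain-JavaScript emulation of what that stage computes on a small in-memory array, not the server's implementation.

```javascript
// Stage as it would appear in a pipeline: one output document per grade.
const groupStage = {
    $group: { _id: "$grade", count: { $sum: 1 }, avgScore: { $avg: "$score" } }
};

// Emulation of that stage for small in-memory arrays.
function groupByGrade(students) {
    const groups = new Map();
    for (const s of students) {
        const g = groups.get(s.grade) || { _id: s.grade, count: 0, total: 0 };
        g.count += 1;          // what {$sum: 1} accumulates
        g.total += s.score;    // running total used for the average
        groups.set(s.grade, g);
    }
    return Array.from(groups.values()).map(
        g => ({ _id: g._id, count: g.count, avgScore: g.total / g.count })
    );
}

const sampleStudents = [
    { name: "a", grade: "A", score: 95 },
    { name: "b", grade: "B", score: 85 },
    { name: "c", grade: "A", score: 91 }
];
const gradeSummary = groupByGrade(sampleStudents);
// gradeSummary has one entry for "A" (count 2, avgScore 93) and one for "B" (count 1, avgScore 85)
```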


1.4: $unwind Split
$unwind splits each element of an array into a separate document.

For example, if a blog post has several comments, you can use $unwind to turn each comment into its own document:

> db.blog.findOne()
{
    "_id": ObjectId("5359f6f6ec7452081a7873d7"),
    "author": "Tom",
    "comments": [
        {
            "author": "Mark",
            "date": ISODate("2014-01-01T17:52:04.148Z"),
            "text": "Nice post"
        },
        {
            "author": "Bill",
            "date": ISODate("2014-01-01T17:52:04.148Z"),
            "text": "I agree"
        }
    ]
}
> db.blog.aggregate({"$unwind": "$comments"})
{
    "results": [
        {
            "_id": ObjectId("5359f6f6ec7452081a7873d7"),
            "author": "Tom",
            "comments": {
                "author": "Mark",
                "date": ISODate("2014-01-01T17:52:04.148Z"),
                "text": "Nice post"
            }
        },
        {
            "_id": ObjectId("5359f6f6ec7452081a7873d7"),
            "author": "Tom",
            "comments": {
                "author": "Bill",
                "date": ISODate("2014-01-01T17:52:04.148Z"),
                "text": "I agree"
            }
        }
    ]
}
This operator is useful when you want to return specific subdocuments from a query: use $unwind to get all the subdocuments, and then use $match to keep the ones you want. For example, to get all the comments written by a particular user (only the comments, not the posts they belong to), a normal query is not enough. With project, unwind, and match, it is easy:

db.blog.aggregate({"$project": {"comments": "$comments"}},
    {"$unwind": "$comments"},
    {"$match": {"comments.author": "Mark"}})
Since the final result is still a "comments" subdocument, you may want to project once more to make the output cleaner.
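The unwind-then-match idea can be sketched in plain JavaScript on an in-memory array. The unwind helper below is a hypothetical emulation written for this illustration; as a simplification it skips documents whose field is not an array.

```javascript
// Emulate $unwind: emit one copy of the document per array element.
function unwind(docs, field) {
    const out = [];
    for (const doc of docs) {
        const values = doc[field];
        if (!Array.isArray(values)) continue; // simplification for this sketch
        for (const value of values) {
            // Copy the document and replace the array with a single element.
            out.push(Object.assign({}, doc, { [field]: value }));
        }
    }
    return out;
}

const posts = [{
    _id: 1,
    author: "Tom",
    comments: [
        { author: "Mark", text: "Nice post" },
        { author: "Bill", text: "I agree" }
    ]
}];

const unwound = unwind(posts, "comments");
// Emulate {"$match": {"comments.author": "Mark"}} on the unwound documents.
const marksComments = unwound.filter(d => d.comments.author === "Mark");
```

Each element of unwound keeps the post's _id and author, which is why a final projection is often used to return only the comment itself.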


1.5: $sort Sort
You can sort by any field (or by several fields), using the same syntax as in a normal query. If you need to sort a large number of documents, it is strongly recommended to put the sort at the start of the pipeline, where it can use an index. Otherwise the sort is slow and consumes a lot of memory.

You can sort on fields that actually exist in the documents or on fields that were created or renamed by a projection:

db.employees.aggregate(
    {
        "$project": {
            "compensation": {
                "$add": ["$salary", "$bonus"]
            },
            "name": 1
        }
    },
    {
        "$sort": {"compensation": -1, "name": 1}
    }
)
This example sorts employees by compensation from highest to lowest, and then by name from A to Z.

The sort direction can be 1 (ascending) or -1 (descending).

Like $group, $sort cannot run in streaming mode: it must receive all documents before it can sort them. In a sharded environment, each shard sorts first, and then the sorted results of each shard are sent to mongos for further processing.


1.6: $limit and $skip
$limit accepts a number n and returns the first n documents in the result set.

$skip also accepts a number n, discards the first n documents in the result set, and returns the remaining documents. In a "normal" query, skipping a large number of documents is inefficient. The same is true in aggregation, because $skip must first match all the documents that are to be skipped before discarding them.
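A common use of these two stages together is paging. The helper below is a minimal sketch; pagePipeline, the name sort key, and the users collection are assumptions made for illustration.

```javascript
// Build a pipeline that returns one page of results.
// pageNumber is 1-based; a stable sort keeps pages consistent.
function pagePipeline(pageNumber, pageSize) {
    return [
        { $sort: { name: 1 } },
        { $skip: (pageNumber - 1) * pageSize }, // discard earlier pages
        { $limit: pageSize }                    // keep one page
    ];
}

const thirdPage = pagePipeline(3, 10);

// Run the pipeline only when a shell "db" object is available.
if (typeof db !== "undefined") {
    db.users.aggregate(thirdPage).forEach(printjson);
}
```

Because $skip still has to walk over the skipped documents, deep pages get slower as pageNumber grows.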


1.7: Using Pipelines
Try to filter out as many documents and fields as possible at the beginning of the pipeline (before performing $project, $group, or $unwind). Once the pipeline is no longer working on data taken directly from the original collection, filtering and sorting cannot use indexes. Where possible, the aggregation pipeline tries to reorder operations so that indexes can be used effectively.


Two: Aggregation commands

2.1: count

count is the simplest aggregation tool. It returns the number of documents in a collection:

> db.users.count()
0
> db.users.insert({"x": 1})
> db.users.count()
1
count returns the total number of documents quickly, regardless of how large the collection is.

You can also pass a query document to count, and MongoDB will count the documents that match the query:

> db.users.insert({"x": 2})
> db.users.count()
2
> db.users.count({"x": 1})
1
The total count is essential for paginated displays: "439 results, showing 0~10." However, adding a query condition makes count slower. count can use an index, but the index does not hold enough metadata to answer the count by itself, so counting with a condition is no faster than actually running the query.


2.2: distinct

distinct finds all the different values for a given key. You must specify the collection and the key when you use it.

> db.runCommand({"distinct": "people", "key": "age"})
Assume the collection contains the following documents:

{"name": "Ada", "age": 20}
{"name": "Fred", "age": 35}
{"name": "Susan", "age": 60}
{"name": "Andy", "age": 35}
If you use distinct on the "age" key, you get all the different ages:

> db.runCommand({"distinct": "people", "key": "age"})
{"values": [20, 35, 60], "ok": 1}
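The same result can be reproduced on an in-memory copy of the sample documents, which may help when checking what distinct should return. distinctValues is a hypothetical helper written for this sketch; the guarded call at the end uses the shell's distinct helper on the collection.

```javascript
// In-memory copy of the sample documents.
const people = [
    { name: "Ada", age: 20 },
    { name: "Fred", age: 35 },
    { name: "Susan", age: 60 },
    { name: "Andy", age: 35 }
];

// Collect each value of the key once, in first-seen order.
function distinctValues(docs, key) {
    return Array.from(new Set(docs.map(d => d[key])));
}

const ages = distinctValues(people, "age");

// Ask the server for the same thing when a shell "db" object is available.
if (typeof db !== "undefined") {
    printjson(db.people.distinct("age"));
}
```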


2.3: group

You can use group to perform more complex aggregations. You choose the key to group by, and MongoDB divides the collection into groups according to the different values of that key. You can then aggregate the documents within each group to get one result document per group.

If you are familiar with SQL, this group is similar to GROUP BY in SQL.
Suppose you have a site that tracks stock prices. The price of a stock is updated every few minutes from 10 o'clock in the morning to 4 o'clock in the afternoon and stored in MongoDB. Now a reporting program needs the closing price for each of the last 30 days. This is easy to do with group.

The stocks collection contains thousands of documents of the following form:

{"day": "2010/10/03", "time": "10/3/2010 03:57:01 GMT-400", "price": 4.23}
{"day": "2010/10/04", "time": "10/4/2010 11:28:39 GMT-400", "price": 4.27}
{"day": "2010/10/03", "time": "10/3/2010 05:00:23 GMT-400", "price": 4.10}
{"day": "2010/10/06", "time": "10/6/2010 05:27:58 GMT-400", "price": 4.30}
{"day": "2010/10/04", "time": "10/4/2010 08:34:50 GMT-400", "price": 4.01}
The list of results we need should contain the last trading time and price for each day, like this:

[
    {"time": "10/3/2010 05:00:23 GMT-400", "price": 4.10},
    {"time": "10/4/2010 11:28:39 GMT-400", "price": 4.27},
    {"time": "10/6/2010 05:27:58 GMT-400", "price": 4.30}
]
The collection is grouped by the "day" field, the document with the latest "time" is found in each group, and that document is added to the result set. The whole process looks like this:

> db.runCommand({"group": {
... "ns": "stocks",
... "key": {"day": true},
... "initial": {"time": 0},
... "$reduce": function(doc, prev) {
...     if (doc.time > prev.time) {
...         prev.price = doc.price;
...         prev.time = doc.time;
...     }
... }}})
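To see how "key", "initial", and "$reduce" interact, the following sketch emulates the group command on an in-memory array. emulateGroup is a hypothetical helper, and the sample data stores "time" as a number (milliseconds since the epoch) so that the comparison in the reduce function is well defined; both are assumptions made for this illustration.

```javascript
// Emulate group: one accumulator per key value, seeded from "initial",
// updated in place by the reduce function for every document in the group.
function emulateGroup(docs, keyField, initial, reduceFn) {
    const groups = new Map();
    for (const doc of docs) {
        const k = doc[keyField];
        if (!groups.has(k)) {
            groups.set(k, Object.assign({ [keyField]: k }, initial));
        }
        reduceFn(doc, groups.get(k));
    }
    return Array.from(groups.values());
}

const stocks = [
    { day: "2010/10/03", time: Date.UTC(2010, 9, 3, 7, 57, 1), price: 4.23 },
    { day: "2010/10/04", time: Date.UTC(2010, 9, 4, 15, 28, 39), price: 4.27 },
    { day: "2010/10/03", time: Date.UTC(2010, 9, 3, 9, 0, 23), price: 4.10 }
];

// Same reduce logic as in the command: keep the latest time and its price.
const closing = emulateGroup(stocks, "day", { time: 0 }, function (doc, prev) {
    if (doc.time > prev.time) {
        prev.price = doc.price;
        prev.time = doc.time;
    }
});
```

On servers where the group command is no longer available (it was removed in MongoDB 4.2), the usual replacement is an aggregation pipeline that sorts by time and then takes the first price per day in a $group stage.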


Three: MapReduce

At this point it is tempting to discuss MapReduce in Hadoop as well, but that would be a digression; the underlying principle is the same.

MapReduce is a programming model for parallel computation over large datasets (larger than 1 TB). The concepts "map" and "reduce", and the main ideas behind them, are borrowed from functional programming languages, with some features borrowed from vector programming languages. It makes it much easier for programmers who have no experience with distributed parallel programming to run their programs on a distributed system. In the usual implementation, you specify a map function that maps a set of key-value pairs to a new set of key-value pairs, and a concurrent reduce function that combines all the mapped values that share the same key.

3.1: Find all the keys in the collection
MongoDB has no schema, so you cannot know in advance which keys each document contains. A common way to find all the keys in a collection is MapReduce. In the map phase, we want to get every key in the document. The map function uses emit to return the values to be processed. emit gives MapReduce a key and a value. Here, emit returns a count ({count: 1}) for each key in the document. We want a separate count for each key, so we call emit once for every key in the document. "this" refers to the current document:

> map = function() {
... for (var key in this) {
...     emit(key, {count: 1});
... }};
This produces a large number of {count: 1} documents, each associated with a key in the collection. Arrays of one or more of these {count: 1} documents are passed to the reduce function. The reduce function takes two parameters: the key, which is the first value passed to emit, and an array of one or more {count: 1} documents that correspond to that key.

> reduce = function(key, emits) {
... var total = 0;
... for (var i in emits) {
...     total += emits[i].count;
... }
... return {count: total};
... }
reduce can be called repeatedly, on results of the map phase or on results of a previous reduce phase. The document that reduce returns must therefore be usable as an element of reduce's second parameter. Suppose the key x is mapped to three documents, {"count": 1, id: 1}, {"count": 1, id: 2}, and {"count": 1, id: 3}, where the id key is only there to tell them apart. MongoDB might call reduce like this:

> r1 = reduce("x", [{"count": 1, id: 1}, {"count": 1, id: 2}])
{count: 2}
> r2 = reduce("x", [{"count": 1, id: 3}])
{count: 1}
> reduce("x", [r1, r2])
{count: 3}
You cannot assume that the second argument is always one of the original documents (for example {count: 1}) or that it has a fixed length. reduce should be able to handle any combination of emitted documents and earlier reduce results.
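One quick way to check this requirement is to reduce the same values in one step and in two steps and compare the results. The sketch below does that in plain JavaScript; countReduce is a renamed copy of the reduce function so that it can run outside the shell.

```javascript
// Same logic as the reduce function for counting keys.
function countReduce(key, emits) {
    let total = 0;
    for (const e of emits) {
        total += e.count;
    }
    return { count: total };
}

const emitted = [{ count: 1 }, { count: 1 }, { count: 1 }];

// Reduce everything in a single call.
const allAtOnce = countReduce("x", emitted);

// Reduce in two batches, then reduce the partial results again.
const r1 = countReduce("x", emitted.slice(0, 2));
const r2 = countReduce("x", emitted.slice(2));
const inTwoSteps = countReduce("x", [r1, r2]);
```

Both paths should give the same count. A reduce function that returned, say, a bare number instead of {count: n} would fail this check, because its output could not be fed back in as input.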

Putting it all together, the MapReduce call might look like this:

> mr = db.runCommand({"mapreduce": "foo", "map": map, "reduce": reduce})
{
    "result": "tmp.mr.mapreduce_1266787811_1", // name of the collection that stores the MapReduce results; this temporary collection is deleted automatically when the connection closes
    "timeMillis": 12, // time the operation took, in milliseconds
    "counts": {
        "input": 6, // number of documents sent to the map function
        "emit": 14, // number of times emit was called in the map function
        "output": 5 // number of documents in the result collection
    },
    "ok": true
}
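The flow of map, emit, and reduce can also be traced without a server. The following sketch runs the key-counting functions over a small in-memory array; the emit function, the emittedByKey map, and the sample documents are stand-ins written for this illustration, not part of MongoDB.

```javascript
// Stand-in for the server-side emit: collect values per key.
const emittedByKey = new Map();
function emit(key, value) {
    if (!emittedByKey.has(key)) {
        emittedByKey.set(key, []);
    }
    emittedByKey.get(key).push(value);
}

// Map and reduce functions for counting keys.
const map = function () {
    for (var key in this) {
        emit(key, { count: 1 });
    }
};
const reduce = function (key, emits) {
    var total = 0;
    for (var i in emits) {
        total += emits[i].count;
    }
    return { count: total };
};

// Hypothetical sample documents with different sets of keys.
const docs = [
    { _id: 1, x: 1 },
    { _id: 2, x: 2, y: 3 }
];

// Map phase: call map with "this" bound to each document.
docs.forEach(doc => map.call(doc));

// Reduce phase: one reduce call per key.
const keyCounts = {};
for (const [key, values] of emittedByKey) {
    keyCounts[key] = reduce(key, values).count;
}
// keyCounts is { _id: 2, x: 2, y: 1 }
```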



3.2: Web page classification
We have a website where users can submit links to their favorite URLs, such as http://www.hubwiz.com. Submitters can add tags to a URL as topics, and other users can rate the submission. We have a collection that stores this information, and we want to see which topics are the most popular, where popularity is determined by how recent the rating is and by the score given.

First, create a map function that emits each tag together with a value based on the popularity and recency of the document.

> map = function() {
... for (var i in this.tags) {
...     var recency = 1 / (new Date() - this.date);
...     var score = recency * this.score;
...     emit(this.tags[i], {"urls": [this.url], "score": score});
... }
... };
Now reduce all the values emitted for the same tag to get a single score for that tag:

> reduce = function(key, emits) {
... var total = {"urls": [], "score": 0};
... for (var i in emits) {
...     emits[i].urls.forEach(function(url) {
...         total.urls.push(url);
...     });
...     total.score += emits[i].score;
... }
... return total;
... };


3.3: MongoDB and MapReduce
The previous two examples used only the mapreduce, map, and reduce keys. These three keys are required, but the MapReduce command also has a number of optional keys.

"finalize": function

The result of reduce is passed to this function, which is the last step of the process.

"keeptemp": boolean

If the value is true, the temporary result collection is kept when the connection is closed; otherwise it is deleted.

"out": string

The name of the output collection. If this is set, the system automatically sets keeptemp: true.

"query": document

Filters the documents with the given criteria before they are sent to the map function.

"sort": document

Sorts the documents before they are sent to the map function (useful together with limit).

"limit": integer

The maximum number of documents that are sent to the map function.

"scope": document

Variables that can be used in the JavaScript code.

"verbose": boolean

Whether to write more detailed output to the server log.
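As a sketch of how several of these optional keys fit together, the command document below combines query, sort, limit, scope, and finalize. The links collection, its tags, score, and date fields, and the weight variable are hypothetical. The command is only sent when a shell db object is available, and current servers expect an "out" value such as {inline: 1}.

```javascript
// Count how often each tag appears.
const tagMap = function () {
    for (var i in this.tags) {
        emit(this.tags[i], { count: 1 });
    }
};
const tagReduce = function (key, emits) {
    var total = 0;
    for (var i in emits) {
        total += emits[i].count;
    }
    return { count: total };
};
// finalize runs once per key on the reduced value.
const tagFinalize = function (key, reduced) {
    reduced.weighted = reduced.count * weight; // "weight" is supplied by scope
    return reduced;
};

const mapReduceCommand = {
    mapReduce: "links",            // hypothetical collection name
    map: tagMap,
    reduce: tagReduce,
    finalize: tagFinalize,
    out: { inline: 1 },            // return results directly instead of a collection
    query: { score: { $gt: 0 } },  // filter before map
    sort: { date: -1 },            // order of documents sent to map
    limit: 1000,                   // at most 1000 documents reach map
    scope: { weight: 2 }           // variables visible to the JavaScript functions
};

// Send the command only when a shell "db" object is available.
if (typeof db !== "undefined") {
    printjson(db.runCommand(mapReduceCommand));
}
```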






Thanks to Hubwiz: http://hubwiz.com/


Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
