MongoDB Advanced 2: MongoDB Aggregation

Source: Internet
Author: User
Tags: class, operator, stock prices

In the previous article we covered MongoDB's advanced queries: http://blog.csdn.net/stronglyh/article/details/46817789

This article introduces MongoDB aggregation.

I. MongoDB provides an aggregation framework for transforming and combining documents. It is made up of the following components:

Component     Operator
Filtering     $match
Projecting    $project
Grouping      $group
Sorting       $sort
Limiting      $limit
Skipping      $skip


To aggregate data, use the aggregate method.

db.collection.aggregate(aggregation conditions);

For a single operation, pass one JSON object as the condition:

db.users.aggregate({
    $project: {
        _id: 0,
        name: 1
    }
})

If several operators are required, pass an array of conditions:

db.users.aggregate([
    {$skip: 5},
    {$project: {_id: 0, name: 1}}
])

1.1: $match
$match filters the document set so that the following stages aggregate only the matching subset.

For example, to compute statistics on users in Beijing (BJ), use {$match: {"area": "BJ"}}. "$match" accepts all the usual query operators ("$gt", "$lt", "$in", and so on). Note that some geospatial operators, such as "$near", cannot be used in "$match".

In practice, put "$match" as early in the pipeline as possible. This has two advantages:

First, unneeded documents are filtered out quickly, which reduces the work done by the rest of the pipeline;

Second, a "$match" that runs before any projection or grouping can use indexes for its query.
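
As a rough illustration of this advice, the pipeline below filters first and groups afterwards. It is only a sketch: the "users" collection and the "age" field are assumptions made for the example, and the actual call is left as a comment because it needs a running mongo shell.

```javascript
// Pipeline stages are plain objects, so the pipeline can be built
// and inspected before it is sent to the server.
const pipeline = [
    // 1. Filter first: only users in Beijing reach the later stages,
    //    and an index on "area" (if one exists) can be used here.
    { $match: { area: "BJ", age: { $gt: 18 } } },
    // 2. Then group the remaining documents and count them.
    { $group: { _id: "$area", count: { $sum: 1 } } }
];

// In the mongo shell (assumed collection name "users"):
// db.users.aggregate(pipeline)
console.log(Object.keys(pipeline[0])[0]);
```

Placing "$group" before "$match" would be legal as well, but the filter would then run on the grouped output rather than on the original documents.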


1.2: $project (projection)
Projection in a pipeline is more powerful than in a "normal" query. With "$project" you can extract fields from subdocuments, rename fields, and perform other interesting operations on them.

The simplest "$project" operation selects the desired fields from the document. You can include or exclude a field, using the same syntax as the second parameter of a query. Run against the original collection, the following code returns documents that contain only the "author" field.

db.articles.aggregate({"$project": {"author": 1, "_id": 0}})
By default the "_id" field is returned whenever it exists in the document, which is why it is excluded explicitly here.


You can also rename projected fields. For example, to return the "_id" of each user document as "userId":

db.articles.aggregate({"$project": {"userId": "$_id", "_id": 0}});
The "$fieldname" syntax references the value of the fieldname field ("_id" in the preceding example) inside the aggregation framework. For example, "$age" is replaced with the content of the "age" field (which may be a number or a string). A path such as "$tags.3", however, is not read as "the 4th element of the tags array" in aggregation expressions, even though that dot notation works in ordinary queries. In the example above, "$_id" is therefore replaced with the value of the "_id" field of each document that enters the pipeline.

Note: "_id" must be excluded explicitly. Otherwise its value is returned twice: once as "userId" and once as "_id". You can use this technique to make multiple copies of a field for later use in "$group".



1.3: $group
$group groups documents by the distinct values of specific fields. For example:

To divide students into groups by score level, group them on the "grade" field.

Pass the field you want to group on as the "_id" field of "$group". For the example above, the code is:

{"$group": {"_id": "$grade"}}
The result of grouping the students by grade might be:

{"result": [{"_id": "A+"}, {"_id": "A"}, {"_id": "A-"}, ..., {"_id": "F"}], "ok": 1}
Grouping operators

Grouping operators compute a result for each group.
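
Some of the grouping operators MongoDB provides are "$sum", "$avg", "$max", "$min", "$first", "$last", "$addToSet" and "$push". The sketch below shows a "$group" stage that uses two of them, followed by an in-memory function that computes the same values so the effect can be checked without a server. The student documents, the "score" field and the helper groupByGrade are assumptions made for this illustration.

```javascript
// Hypothetical student documents; "grade" comes from the example above,
// "score" is an assumed field.
const students = [
    { name: "Ann", grade: "A", score: 95 },
    { name: "Bob", grade: "B", score: 84 },
    { name: "Cid", grade: "A", score: 91 }
];

// The stage as it would be passed to aggregate() in the mongo shell:
// db.students.aggregate(groupStage)
const groupStage = {
    $group: {
        _id: "$grade",                // one output document per grade
        count: { $sum: 1 },           // number of students in the group
        avgScore: { $avg: "$score" }  // mean of "score" within the group
    }
};

// In-memory equivalent of that stage, to show what the operators compute.
function groupByGrade(docs) {
    const groups = new Map();
    for (const doc of docs) {
        const g = groups.get(doc.grade) || { _id: doc.grade, count: 0, total: 0 };
        g.count += 1;
        g.total += doc.score;
        groups.set(doc.grade, g);
    }
    return [...groups.values()].map(g => ({
        _id: g._id,
        count: g.count,
        avgScore: g.total / g.count
    }));
}

const grouped = groupByGrade(students);
console.log(grouped);
```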


1.4: $unwind (splitting)
$unwind splits each value of an array into a separate document.

For example, given a blog post with several comments, $unwind turns each comment into its own document:

> db.blog.findOne()
{
    "_id": ObjectId("5359f6f6ec7452081a7873d7"),
    "author": "Tom",
    "comments": [
        {
            "author": "Mark",
            "date": ISODate("2014-01-01T17:52:04.148Z"),
            "text": "Nice post"
        },
        {
            "author": "Bill",
            "date": ISODate("2014-01-01T17:52:04.148Z"),
            "text": "I agree"
        }
    ]
}
> db.blog.aggregate({"$unwind": "$comments"})

{
    "results": [
        {
            "_id": ObjectId("5359f6f6ec7452081a7873d7"),
            "author": "Tom",
            "comments": {
                "author": "Mark",
                "date": ISODate("2014-01-01T17:52:04.148Z"),
                "text": "Nice post"
            }
        },
        {
            "_id": ObjectId("5359f6f6ec7452081a7873d7"),
            "author": "Tom",
            "comments": {
                "author": "Bill",
                "date": ISODate("2014-01-01T17:52:04.148Z"),
                "text": "I agree"
            }
        }
    ]
}
This operator is very useful when you want specific subdocuments from a query: first use "$unwind" to get all subdocuments, then use "$match" to keep the ones you want. For example, returning all comments by a particular user (only the comments, not the posts they belong to) is impossible with an ordinary query. With project, unwind, and match it is easy:

db.blog.aggregate({"$project": {"comments": "$comments"}},
    {"$unwind": "$comments"},
    {"$match": {"comments.author": "Mark"}})
Since each result is still wrapped in a "comments" subdocument, you may want one more projection to tidy up the output.
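
The extra projection could look like the last stage in the sketch below. The output field names "author" and "text" are one possible choice, not something the original prescribes. The in-memory part underneath imitates unwind, match and project on the sample post so the shape of the result can be seen without a server; the helper unwind is hypothetical and the ObjectId and dates are simplified.

```javascript
// The pipeline as it would be passed to db.blog.aggregate(...).
const pipeline = [
    { $project: { comments: "$comments" } },
    { $unwind: "$comments" },
    { $match: { "comments.author": "Mark" } },
    // Final projection: lift the comment fields to the top level.
    { $project: { _id: 0, author: "$comments.author", text: "$comments.text" } }
];

// Simplified copy of the sample post (plain string instead of ObjectId).
const post = {
    _id: "5359f6f6ec7452081a7873d7",
    author: "Tom",
    comments: [
        { author: "Mark", text: "Nice post" },
        { author: "Bill", text: "I agree" }
    ]
};

// $unwind: one output document per array element.
function unwind(doc, field) {
    return doc[field].map(item => ({ ...doc, [field]: item }));
}

const flattened = unwind(post, "comments")
    .filter(d => d.comments.author === "Mark")                          // $match
    .map(d => ({ author: d.comments.author, text: d.comments.text })); // final $project

console.log(flattened);
```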


1.5: $sort (sorting)
You can sort by any field (or several fields), using the same syntax as in ordinary queries. If you need to sort a large number of documents, it is strongly recommended that you sort in the first stage of the pipeline, where the sort can use an index. Otherwise sorting is slow and uses a lot of memory.

Sorting can use fields that actually exist in the document, or fields renamed during projection:

db.employees.aggregate(
    {
        "$project": {
            "compensation": {
                "$add": ["$salary", "$bonus"]
            },
            "name": 1
        }
    },
    {
        "$sort": {"compensation": -1, "name": 1}
    }
)
This example sorts employees by compensation from high to low, and then by name from A to Z.

The sort direction is either 1 (ascending) or -1 (descending).

Like "$group" above, "$sort" cannot work in a streaming fashion: it must receive all documents before it can sort them. In a sharded environment, each shard sorts first and then sends its sorted results to mongos for further processing.


1.6: $limit and $skip
$limit accepts a number n and returns the first n documents of the result set.

$skip also accepts a number n. It discards the first n documents of the result set and returns the rest. In a "normal" query this operator is inefficient when a large number of documents has to be skipped. The same holds in aggregation, because every skipped document must first be matched and then discarded.
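
A common use of the two operators together is paging. The helper below is a hypothetical sketch (the function name, the "name" sort key and the "users" collection are assumptions) that builds the stages for one page; "$skip" has to come before "$limit", otherwise the limit would be applied before the earlier pages are discarded.

```javascript
// Builds the stages for one page of results.
function pageStages(pageNumber, pageSize) {
    return [
        { $sort: { name: 1 } },                  // fixed order so pages do not overlap
        { $skip: (pageNumber - 1) * pageSize },  // discard the earlier pages
        { $limit: pageSize }                     // keep exactly one page
    ];
}

const stages = pageStages(3, 10);
console.log(JSON.stringify(stages));

// In the mongo shell (assumed collection name "users"):
// db.users.aggregate(pageStages(3, 10))
```

As the text notes, the cost of "$skip" grows with the number of skipped documents, so this approach suits the first few pages better than very deep ones.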


1.7: Using pipelines
Filter out as many documents and fields as possible at the start of the pipeline, before any "$project", "$group", or "$unwind" operation. Once the pipeline no longer works on data taken directly from the original collection, indexes cannot be used for filtering and sorting. Where possible, the aggregation pipeline tries to reorder operations so that indexes can be used effectively.


II. Aggregation commands

2.1: count

count is the simplest aggregation tool. It returns the number of documents in a collection:

> db.users.count()
0
> db.users.insert({"x": 1})
> db.users.count()
1
No matter how large the collection is, count returns the total number of documents quickly.

You can also pass a query document to count, and MongoDB counts the documents that match the query:

> db.users.insert({"x": 2})
> db.users.count()
2
> db.users.count({"x": 1})
1
Counting is useful for pagination displays such as "showing 0 to 10 of 439". Adding a query condition does slow count down, however. count can use indexes, but an index does not hold enough metadata to make counting cheaper, so a count with a condition costs about as much as running the query itself.


2.2: distinct

distinct finds all the different values of a given key. You must specify both the collection and the key.

> db.runCommand({"distinct": "people", "key": "age"})
Assume the collection contains the following documents:

{"name": "Ada", "age": 20}
{"name": "Fred", "age": 35}
{"name": "Susan", "age": 60}
{"name": "Andy", "age": 35}
Running distinct on the "age" key returns each different age once:

> db.runCommand({"distinct": "people", "key": "age"})
{"values": [20, 35, 60], "ok": 1}


2.3: group

group performs more complex aggregation. You choose a key to group on, and MongoDB divides the collection into groups according to the different values of that key. The documents in each group are then aggregated into one result document.

If you are familiar with SQL, group is similar to SQL's GROUP BY.
Suppose a site tracks stock prices. The price of a stock is updated every few minutes from 10 a.m. to 4 p.m. and saved in MongoDB. A reporting program now needs the closing price for each of the last 30 days. With group this is easy.

The stocks collection contains thousands of documents of the following form:

{"day": "2010/10/03", "time": "10/3/2010 03:57:01 GMT-400", "price": 4.23}
{"day": "2010/10/04", "time": "10/4/2010 11:28:39 GMT-400", "price": 4.27}
{"day": "2010/10/03", "time": "10/3/2010 05:00:23 GMT-400", "price": 4.10}
{"day": "2010/10/06", "time": "10/6/2010 05:27:58 GMT-400", "price": 4.30}
{"day": "2010/10/04", "time": "10/4/2010 08:34:50 GMT-400", "price": 4.01}
The result we need is a list with the time and price of the last trade of each day, as shown below:

[
    {"time": "10/3/2010 05:00:23 GMT-400", "price": 4.10},
    {"time": "10/4/2010 11:28:39 GMT-400", "price": 4.27},
    {"time": "10/6/2010 05:27:58 GMT-400", "price": 4.30}
]
First, the collection is grouped by the "day" field. Then the document with the largest "time" value in each group is found and added to the result set. The process looks like this:

> db.runCommand({"group": {
...     "ns": "stocks",
...     "key": {"day": true},
...     "initial": {"time": 0},
...     "$reduce": function(doc, prev) {
...         if (doc.time > prev.time) {
...             prev.price = doc.price;
...             prev.time = doc.time;
...         }
...     }}})
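
To see what the reduce function does for each group, the sketch below replays it in plain JavaScript on the five sample documents. The helper named group is a hypothetical stand-in for the server-side command, not MongoDB's implementation. One assumption matters here: "time" is stored as a Date, because in JavaScript a non-numeric string compared with the initial value 0 using ">" evaluates to false, so string timestamps would never replace the initial document.

```javascript
// Sample documents modelled on the stock data above, with "time" as a Date.
const stocks = [
    { day: "2010/10/03", time: new Date("2010-10-03T03:57:01-04:00"), price: 4.23 },
    { day: "2010/10/04", time: new Date("2010-10-04T11:28:39-04:00"), price: 4.27 },
    { day: "2010/10/03", time: new Date("2010-10-03T05:00:23-04:00"), price: 4.10 },
    { day: "2010/10/06", time: new Date("2010-10-06T05:27:58-04:00"), price: 4.30 },
    { day: "2010/10/04", time: new Date("2010-10-04T08:34:50-04:00"), price: 4.01 }
];

// Same reduce function as in the group command: keep the latest trade.
function reduce(doc, prev) {
    if (doc.time > prev.time) {
        prev.price = doc.price;
        prev.time = doc.time;
    }
}

// In-memory stand-in for "group": one accumulator per key value,
// each starting from a copy of the initial document.
function group(docs, key, initial, reduceFn) {
    const accumulators = new Map();
    for (const doc of docs) {
        if (!accumulators.has(doc[key])) {
            accumulators.set(doc[key], { [key]: doc[key], ...initial });
        }
        reduceFn(doc, accumulators.get(doc[key]));
    }
    return [...accumulators.values()];
}

const closing = group(stocks, "day", { time: 0 }, reduce);
console.log(closing.map(c => [c.day, c.price]));
```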
 

III. MapReduce

I would have liked to cover MapReduce in Hadoop here as well, but I had better not, or I will get it wrong. The principle is the same in any case.

MapReduce is a programming model for parallel operations on large-scale datasets (larger than 1 TB). The concepts "Map" and "Reduce", and the main ideas behind them, are borrowed from functional programming languages, with some features borrowed from vector programming languages. It lets programmers run their programs on a distributed system without writing distributed parallel code themselves. In current implementations you specify a Map function, which maps a set of key-value pairs to a new set of key-value pairs, and a concurrent Reduce function, which combines all the mapped values that share the same key.

3.1: Finding all keys in a collection
MongoDB has no schema, so it does not keep track of which keys each document has. A common way to find all the keys in a collection is MapReduce. In the map phase we want to get every key of the document. The map function uses emit to return the values to be processed; emit gives MapReduce a key and a value. Here emit returns a count ({count: 1}) for each key in the document. Each key is counted separately, so emit is called once per key. Inside the map function, this refers to the current document:

> map = function() {
...     for (var key in this) {
...         emit(key, {count: 1});
...     }};
This produces many {count: 1} documents, each associated with a key in the collection. An array of one or more of these {count: 1} documents is passed to the reduce function. The reduce function takes two arguments: the key, which is the first value passed to emit, and an array of one or more {count: 1} documents that were emitted for that key.

> reduce = function(key, emits) {
...     var total = 0;
...     for (var i in emits) {
...         total += emits[i].count;
...     }
...     return {count: total};
... }
reduce must be able to run repeatedly, on output from either the map phase or an earlier reduce step. The document returned by reduce must therefore be usable as an element of reduce's second argument. For example, suppose the key x is mapped to three documents: {"count": 1, id: 1}, {"count": 1, id: 2}, {"count": 1, id: 3} (the id key is only there to tell them apart). MongoDB may call reduce as follows:

> r1 = reduce("x", [{"count": 1, id: 1}, {"count": 1, id: 2}])
{count: 2}
> r2 = reduce("x", [{"count": 1, id: 3}])
{count: 1}
> reduce("x", [r1, r2])
{count: 3}
You cannot assume that the second argument always holds the initial documents ({count: 1} in this case) or that it has a fixed length. reduce should be able to handle any combination of emitted documents and return values of earlier reduce calls.
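
This requirement can be tried out without a server. The sketch below is a small in-memory driver, not MongoDB's implementation: it collects the emitted values per key, reduces them in two batches, and then reduces the two partial results again. The names mapReduce and the sample documents are assumptions, and map receives the document and emit as parameters because "this" and the global emit only exist inside MongoDB.

```javascript
// map as in the text, adapted to take the document and emit as arguments.
function map(doc, emit) {
    for (const key in doc) {
        emit(key, { count: 1 });
    }
}

// reduce as in the text.
function reduce(key, emits) {
    let total = 0;
    for (const e of emits) {
        total += e.count;
    }
    return { count: total };
}

// Tiny driver: group the emits by key, reduce them in two batches,
// then feed the partial results back into reduce (a "re-reduce").
function mapReduce(docs) {
    const emitted = new Map();
    for (const doc of docs) {
        map(doc, (key, value) => {
            if (!emitted.has(key)) emitted.set(key, []);
            emitted.get(key).push(value);
        });
    }
    const result = {};
    for (const [key, values] of emitted) {
        const half = Math.ceil(values.length / 2);
        const partial1 = reduce(key, values.slice(0, half));
        const partial2 = reduce(key, values.slice(half));
        result[key] = reduce(key, [partial1, partial2]);
    }
    return result;
}

const keyCounts = mapReduce([
    { _id: 1, x: 1 },
    { _id: 2, x: 2, y: 3 },
    { _id: 3, x: 5 }
]);
console.log(keyCounts);
```

Because reduce returns a document of the same shape as the values it receives, the batching order does not change the final counts.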

Putting it together, the MapReduce call and its output may look like this:

> mr = db.runCommand({"mapreduce": "foo", "map": map, "reduce": reduce})
{
    "result": "tmp.mr.mapreduce_1266787811_1", // name of the collection that stores the MapReduce results; it is a temporary collection that is deleted automatically when the connection is closed
    "timeMillis": 12, // time the operation took, in milliseconds
    "counts": {
        "input": 6, // number of documents sent to the map function
        "emit": 14, // number of emit calls in the map function
        "output": 5 // number of documents in the result collection
    },
    "ok": true
}



3.2: Categorizing web pages
We have a website where users can submit links to their favorite URLs, such as http://www.hubwiz.com. The submitter can add tags to a URL as its topics, and other users can score the submission. We have a collection that stores this information, and we want to see which topics are the most popular. Popularity is determined by how recent the scores are and how high they are.

First, create a map function that emits each tag together with a value based on popularity and recency.

> map = function() {
...     for (var i in this.tags) {
...         var recency = 1 / (new Date() - this.date);
...         var score = recency * this.score;
...         emit(this.tags[i], {"urls": [this.url], "score": score});
...     }
... };
Now reduce all the values emitted for the same tag to get a single score for that tag:

> reduce = function(key, emits) {
...     var total = {"urls": [], "score": 0};
...     for (var i in emits) {
...         emits[i].urls.forEach(function(url) {
...             total.urls.push(url);
...         });
...         total.score += emits[i].score;
...     }
...     return total;
... };


3.3: MongoDB and MapReduce
The two preceding examples use only the mapreduce, map, and reduce keys. These three keys are required, but the MapReduce command also accepts many optional keys.

"finalize": function

A function that receives the output of reduce; it is the last step in the process.

"keeptemp": boolean

If true, the temporary result collection is kept when the connection is closed. Otherwise it is not kept.

"out": string

Name of the output collection. Setting this option implies keeptemp: true.

"query": document

Filters the documents with the given condition before they are sent to the map function.

"sort": document

Sorts the documents before they are sent to the map function (useful together with limit).

"limit": integer

The maximum number of documents sent to the map function.

"scope": document

Variables that can be used in the JavaScript code.

"verbose": boolean

Whether to write detailed server logs.
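
As an illustration of how several of these keys fit together, the sketch below builds one command document. The collection name "foo" comes from the earlier example; the filter, sort key, limit, output collection name "key_counts" and the finalize logic are made-up values for this sketch. The command is only constructed here, and the runCommand call is left as a comment because it needs a running server.

```javascript
// map and reduce from section 3.1 (emit is provided by MongoDB at run time).
const map = function () {
    for (var key in this) {
        emit(key, { count: 1 });
    }
};
const reduce = function (key, emits) {
    var total = 0;
    for (var i in emits) {
        total += emits[i].count;
    }
    return { count: total };
};

const command = {
    mapreduce: "foo",
    map: map,
    reduce: reduce,
    query: { x: { $gt: 0 } },   // only matching documents are sent to map
    sort: { x: 1 },             // order in which documents are sent to map
    limit: 100,                 // at most 100 documents are mapped
    // finalize runs once per key on the final reduce output.
    finalize: function (key, value) {
        value.common = value.count > 1;
        return value;
    },
    out: "key_counts"           // hypothetical name of the output collection
};

// In the mongo shell:
// db.runCommand(command)
console.log(Object.keys(command).join(", "));
```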






Thanks to huizhi: http://hubwiz.com/


Copyright Disclaimer: This article is an original article by the blogger and cannot be reproduced without the permission of the blogger.
