Pitfalls in MongoDB's MapReduce

Last Update:2014-09-18 Source: Internet

Author: User

Tags emit


It has been a long time. Life here is at a new beginning. I have wanted to post this for a while but never found the right moment (haha, in fact I was just lazy). The post covers some hidden MapReduce problems I ran into when using MongoDB:

1. The counting problem in reduce

2. The data extraction problem in reduce

In addition, a small tip: when choosing among indexes, MongoDB gives priority to an index on a fixed value over an index on a range.

First, the counting problem in MapReduce

This problem mainly occurs when a "+1" approach is used to count occurrences. If a key ends up with too many records after the map step, the count comes out wrong.

This is illustrated below:

Raw data (the results collection holds 400 identical records): {"grade": 1, "name": "lekko", "score": 95}

For MapReduce:

db.runCommand({
    mapreduce: "results",
    // map: emit one value per record, keyed by grade
    map: function Map() {
        emit(
            {grade: this.grade},
            {recnum: 1, score: this.score}
        );
    },
    // reduce: sum the scores and count records with "+1" per value
    reduce: function Reduce(key, values) {
        var reduced = {recnum: 0, score: 0};
        values.forEach(function (val) {
            reduced.score += val.score;
            ++reduced.recnum;
        });
        return reduced;
    },
    finalize: function Finalize(key, reduced) {
        return reduced;
    },
    out: {inline: 1}
});
I expected value.recnum to be 400, but the result was 101! The value.score output was correct: 38000 (95 * 400). This puzzled me for a long time. After changing the reduce function to function Reduce(key, values) { return {test: values}; }, I found that the values arrive nested in layers rather than as one flat array of 400 items.

The forEach in the original Reduce function only traversed the first layer of data, which holds 101 items, so the ++ operation ran only 101 times!

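After some thought, the root cause lies in how MapReduce hands the emitted BSON data to reduce. When a key has more than 100 values, they are not passed as one flat array: MongoDB can call reduce more than once for the same key, and the output of an earlier call becomes one of the values of the next call. The structure is therefore nested rather than linear.

This also explains why the score sum is correct. The output of each earlier reduce call already carries its accumulated score, so "reduced.score += val.score;" adds up the totals of all the nested levels, whereas "++reduced.recnum;" counts each earlier output as a single record.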
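
The 101 can be reproduced without a database. The sketch below is plain JavaScript and rests on an assumption made only for illustration: reduce is called on batches of 100 values, and from the second batch on, the previous output is passed in as one extra value. The driver simulateReReduce is hypothetical and is not MongoDB's real scheduler.

```javascript
// Reducer from the post: counts with "++".
function reduceWithIncrement(key, values) {
    var reduced = {recnum: 0, score: 0};
    values.forEach(function (val) {
        reduced.score += val.score;
        ++reduced.recnum;
    });
    return reduced;
}

// Hypothetical driver (an assumption for illustration): call reduceFn on
// batches of batchSize values, prepending the previous output to each
// later batch.
function simulateReReduce(reduceFn, key, values, batchSize) {
    var partial = null;
    for (var i = 0; i < values.length; i += batchSize) {
        var batch = values.slice(i, i + batchSize);
        if (partial !== null) {
            batch.unshift(partial);
        }
        partial = reduceFn(key, batch);
    }
    return partial;
}

// 400 emitted values, as produced by the map function for grade 1.
var emitted = [];
for (var n = 0; n < 400; n++) {
    emitted.push({recnum: 1, score: 95});
}

var result = simulateReReduce(reduceWithIncrement, {grade: 1}, emitted, 100);
console.log(result.recnum, result.score); // 101 38000
```

Under this assumption the last call sees 100 emitted values plus one earlier output, which gives 101 increments, while the score still adds up because each earlier output carries its running total.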

Finally, here is an alternative way to count: replace the ++ in Reduce with a += operation:

function Reduce(key, values) {
    var reduced = {recnum: 0, score: 0};
    values.forEach(function (val) {
        reduced.score += val.score;
        // add the carried count instead of counting each value as 1
        reduced.recnum += val.recnum;
    });
    return reduced;
}
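
One way to check a reducer before running it on real data is to compare a single call over all values with a call that receives an earlier output as one of its values. The sketch below does this in plain JavaScript for the += version; the name reduceWithSum and the 100/300 split are chosen here for illustration only.

```javascript
// The "+=" reducer from the post, under a hypothetical name.
function reduceWithSum(key, values) {
    var reduced = {recnum: 0, score: 0};
    values.forEach(function (val) {
        reduced.score += val.score;
        reduced.recnum += val.recnum;
    });
    return reduced;
}

var key = {grade: 1};
var first = [];
var rest = [];
for (var i = 0; i < 100; i++) first.push({recnum: 1, score: 95});
for (var j = 0; j < 300; j++) rest.push({recnum: 1, score: 95});

// One call over all 400 values.
var oneCall = reduceWithSum(key, first.concat(rest));
// Two calls: the output for the first 100 is fed back in as a value.
var reReduced = reduceWithSum(key, [reduceWithSum(key, first)].concat(rest));

console.log(JSON.stringify(oneCall));   // {"recnum":400,"score":38000}
console.log(JSON.stringify(reReduced)); // {"recnum":400,"score":38000}
```

Both paths agree because the returned object has the same shape as an emitted value, so feeding it back in loses nothing.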

Second, extracting data in Reduce to build an Array

The cause of this problem is similar to the one above: during reduce, the emitted data is not linear (it has a hierarchical structure), so extracting data fields goes wrong. To test this, insert three more records into the collection mentioned above:

{"grade": 1, "name": "monkey", "score": 95}, {"grade": 2, "name": "sudan", "score": 95}, {"grade": 2, "name": "xiaoyan", "score": 95}

Then compile a list of all the distinct names in each grade:

db.runCommand({
    mapreduce: "results",
    map: function Map() {
        emit(
            {grade: this.grade},
            {name: this.name}
        );
    },
    reduce: function Reduce(key, values) {
        var reduced = {names: []};
        values.forEach(function (val) {
            // add val.name only if it is not in the list yet
            var isExist = false;
            for (var i = 0; i < reduced.names.length; i++) {
                var cur = reduced.names[i];
                if (cur == val.name) {
                    isExist = true;
                    break;
                }
            }
            if (!isExist)
                reduced.names.push(val.name);
        });
        return reduced;
    },
    finalize: function Finalize(key, reduced) {
        return reduced;
    },
    out: {inline: 1}
});
The returned result is:

{"_id": {"grade": 1},
 "value": {"names": [null, "lekko"]}
},
{"_id": {"grade": 2},
 "value": {"names": ["xiaoyan", "sudan"]}
}
The two newly inserted grade = 2 records come out fine, but monkey in grade = 1 is gone! Following the reasoning from problem 1, Reduce must again have iterated over an object that holds an array instead of a name: its name value is empty, that empty value was added to the list, and the monkey object was never visited at all.

The solution to this problem is to abandon MapReduce and use Group instead:

db.results.group({
    key: {"grade": true},
    initial: {names: []},
    reduce: function Reduce(val, out) {
        // add val.name only if it is not in the list yet
        var isExist = false;
        for (var i = 0; i < out.names.length; i++) {
            var cur = out.names[i];
            if (cur == val.name) {
                isExist = true;
                break;
            }
        }
        if (!isExist)
            out.names.push(val.name);
    },
    finalize: function Finalize(out) {
        return out;
    }
});
This way you get the set of distinct names for grade = 1! Although MapReduce is more powerful and faster than Group, extracting data like this from a key with many items (more than 100) is very risky. So when using MapReduce, try to stick to basic operations such as cumulative addition and multiplication, and avoid risky operations such as ++, push, and delete!
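
If staying with MapReduce is preferred, one possible approach (a sketch, not something the original post tested) is to give the value emitted by map the same shape as the value returned by reduce, so that an earlier reduce output can be merged like any other value. The snippet below simulates this in plain JavaScript without a database; mapValue, reduceNames and the two-step call pattern are hypothetical and only stand in for what MongoDB would do.

```javascript
// Hypothetical map output: every record emits {names: [name]}, which is
// the same shape that reduceNames returns.
function mapValue(record) {
    return {names: [record.name]};
}

// Merge the name lists of all values, keeping each name once.
function reduceNames(key, values) {
    var reduced = {names: []};
    values.forEach(function (val) {
        val.names.forEach(function (name) {
            if (reduced.names.indexOf(name) === -1) {
                reduced.names.push(name);
            }
        });
    });
    return reduced;
}

// Grade 1 data from the post: 400 x "lekko" plus one "monkey".
var records = [];
for (var i = 0; i < 400; i++) {
    records.push({grade: 1, name: "lekko", score: 95});
}
records.push({grade: 1, name: "monkey", score: 95});

var emitted = records.map(mapValue);
// Assumed call pattern: reduce the first 100 values, then feed that
// output back in together with the remaining values.
var partial = reduceNames({grade: 1}, emitted.slice(0, 100));
var merged = reduceNames({grade: 1}, [partial].concat(emitted.slice(100)));

console.log(merged.names.join(", ")); // lekko, monkey
```

A side effect of this shape is that a key with a single record already carries a names array, even if reduce never runs for it.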

Third, a few additional tips

1. When using Group or MapReduce, if a category contains only one element, the Reduce function is not executed, but the Finalize function still is. In Finalize you therefore have to make sure the result for a single element is consistent with the result for multiple elements (for example, if you insert one grade = 3 record in problem two, does the returned grade = 3 result still have a names set?).
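
For the grade = 3 case, a Finalize function can bring the single-element result into the same shape as the others. A minimal sketch, assuming the map function from problem two emits {name: this.name} and that Reduce was skipped for the key; finalizeNames and the sample name "solo" are made up for illustration.

```javascript
// Return a value with a names array whether or not reduce ran.
function finalizeNames(key, reduced) {
    if (!reduced.names) {
        // reduce was skipped: the value is still the raw emitted {name: ...}
        return {names: [reduced.name]};
    }
    return reduced;
}

// Single-element key: the raw emitted value reaches finalize.
var single = finalizeNames({grade: 3}, {name: "solo"});
// Multi-element key: reduce already produced a names array.
var multi = finalizeNames({grade: 2}, {names: ["xiaoyan", "sudan"]});

console.log(JSON.stringify(single)); // {"names":["solo"]}
console.log(JSON.stringify(multi));  // {"names":["xiaoyan","sudan"]}
```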

2. Index efficiency for range queries. If a query condition is a range of values, the index on that field gets very low priority. For example, a collection test holds a large number of documents with the fields 'committime' and 'author', and two indexes are defined on it: {author: 1} (named author_1) and {committime: -1, author: 1}. The following test shows the difference:

db.test.find({'committime': {'$gt': 910713600000, '$lte': 1410192000000}, 'author': 'lekko'}).hint({committime: -1, author: 1}).explain()   // "millis": 49163
db.test.find({'committime': {'$gt': 910713600000, '$lte': 1410192000000}, 'author': 'lekko'}).explain()   // uses author_1

Please indicate the original address for reprinting: http://www.cnblogs.com/lekko/p/3963418.html


