Pitfalls I ran into with MongoDB's MapReduce




It's been a long time, and life here is at a new beginning. I have wanted to write this post for a while but never found the right moment (in truth, I was lazy). It covers some hidden MapReduce problems I ran into while using MongoDB:



1. The counting problem in reduce



2. The data extraction problem in reduce



In addition, a small tip: when building a compound index in MongoDB, put fields matched by a fixed value ahead of fields matched by a range.






First, the counting problem in MapReduce



This problem mainly occurs when occurrences are counted with the "+1" approach. If a single key receives too many records after the map phase, the count comes out wrong.



This is illustrated below:



Raw data (the results collection holds 400 identical documents): {"grade": 1, "name": "lekko", "score": 95}



The MapReduce command:

    db.runCommand({
        mapreduce: "results",
        // Emit one value per document, keyed by grade.
        map: function Map() {
            emit(
                {grade: this.grade},
                {recnum: 1, score: this.score}
            );
        },
        reduce: function Reduce(key, values) {
            var reduced = {recnum: 0, score: 0};
            values.forEach(function (val) {
                reduced.score += val.score;
                ++reduced.recnum;  // counts array entries, not records
            });
            return reduced;
        },
        finalize: function Finalize(key, reduced) {
            return reduced;
        },
        out: {inline: 1}
    });

I expected value.recnum to be 400, but the result is 101! value.score is correct: 38000 (95 * 400). This puzzled me for a long time. After changing the reduce function to function Reduce(key, values) { return {test: values}; }, I could see that the values array is nested rather than flat.

forEach in the original reduce function only traverses the top level of that array, which has 101 entries, so the ++ operation runs only 101 times!

The key reason lies in how MapReduce hands the emitted data to reduce. When a key has more than about 100 values, reduce does not receive them as one flat array. MongoDB may call reduce several times for the same key and pass the output of an earlier call back in as one of the values of the next call.

This also explains why the score total is correct: the partial result from an earlier call already carries the accumulated score, so "reduced.score += val.score;" picks up the scores of everything beneath it. The ++ operation, by contrast, counts that whole partial result as a single record.
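
The 101 can be reproduced outside MongoDB. The sketch below is plain JavaScript and only an approximation: it assumes values are reduced in batches of 100 and that each partial result is passed back in as the first value of the next call. The batch size and the helper name simulateMapReduce are illustrative assumptions, not MongoDB API.

```javascript
// Hypothetical helper: reduces `values` in batches and feeds each partial
// result back in as the first value of the next call (a "re-reduce").
// The batch size is an assumption chosen to match the observed 101.
function simulateMapReduce(key, values, reduce, batchSize) {
  var partial = null;
  for (var i = 0; i < values.length; i += batchSize) {
    var batch = values.slice(i, i + batchSize);
    if (partial !== null) {
      batch.unshift(partial);
    }
    partial = reduce(key, batch);
  }
  return partial;
}

// The reduce function from the post, counting with ++.
function countingReduce(key, values) {
  var reduced = {recnum: 0, score: 0};
  values.forEach(function (val) {
    reduced.score += val.score;
    ++reduced.recnum;
  });
  return reduced;
}

// 400 emitted values, one per document with grade 1 and score 95.
var emitted = [];
for (var n = 0; n < 400; n++) {
  emitted.push({recnum: 1, score: 95});
}

var result = simulateMapReduce({grade: 1}, emitted, countingReduce, 100);
console.log(result); // { recnum: 101, score: 38000 }
```

The last call sees one partial result plus 100 fresh values, which is where the 101 comes from, while the score carried inside the partial result keeps the sum intact.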

Finally, here is the fix for counting: replace the ++ in reduce with a += on the emitted recnum:

    function Reduce(key, values) {
        var reduced = {recnum: 0, score: 0};
        values.forEach(function (val) {
            reduced.score += val.score;
            reduced.recnum += val.recnum;  // add the count each value carries
        });
        return reduced;
    }
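
One way to catch this kind of bug before running it on the server is to check that a reduce function gives the same answer when part of its input has already been reduced. The following is a minimal sketch in plain JavaScript; isReReduceSafe and the sample data are hypothetical names introduced here for illustration.

```javascript
// Broken variant: counts array entries with ++.
function reduceWithIncrement(key, values) {
  var reduced = {recnum: 0, score: 0};
  values.forEach(function (val) {
    reduced.score += val.score;
    ++reduced.recnum;
  });
  return reduced;
}

// Corrected variant: adds the count carried by each value.
function reduceWithSum(key, values) {
  var reduced = {recnum: 0, score: 0};
  values.forEach(function (val) {
    reduced.score += val.score;
    reduced.recnum += val.recnum;
  });
  return reduced;
}

// Hypothetical check: reducing everything at once must give the same result
// as reducing the first half, then reducing that partial result together
// with the remaining values.
function isReReduceSafe(reduce, key, values) {
  var half = Math.floor(values.length / 2);
  var direct = reduce(key, values);
  var partial = reduce(key, values.slice(0, half));
  var staged = reduce(key, [partial].concat(values.slice(half)));
  return JSON.stringify(direct) === JSON.stringify(staged);
}

var sample = [];
for (var n = 0; n < 10; n++) {
  sample.push({recnum: 1, score: 95});
}

console.log(isReReduceSafe(reduceWithIncrement, {grade: 1}, sample)); // false
console.log(isReReduceSafe(reduceWithSum, {grade: 1}, sample));       // true
```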

Second, extracting data into an array in reduce



The cause is similar to the one above: during reduce, the emitted data is not flat (it is hierarchical), so extracting fields from it goes wrong. To test this, insert three more documents into the collection above:

{"grade": 1, "name": "monkey", "score": 95}, {"grade": 2, "name": "sudan", "score": 95}, {"grade": 2, "name": "xiaoyan", "score": 95}

Compile a list of all the distinct names in each grade:

    db.runCommand({
        mapreduce: "results",
        map: function Map() {
            emit(
                {grade: this.grade},
                {name: this.name}
            );
        },
        reduce: function Reduce(key, values) {
            var reduced = {names: []};
            values.forEach(function (val) {
                // Add val.name only if it is not in the list yet.
                var isExist = false;
                for (var i = 0; i < reduced.names.length; i++) {
                    var cur = reduced.names[i];
                    if (cur == val.name) {
                        isExist = true;
                        break;
                    }
                }
                if (!isExist)
                    reduced.names.push(val.name);
            });
            return reduced;
        },
        finalize: function Finalize(key, reduced) {
            return reduced;
        },
        out: {inline: 1}
    });

The result is:

    {"_id": {"grade": 1},
     "value": {"names": [null, "lekko"]}
    },
    {"_id": {"grade": 2},
     "value": {"names": ["xiaoyan", "sudan"]}
    }

The two newly inserted grade = 2 documents come out fine, but monkey from grade = 1 is gone! By the same reasoning as in the first problem, reduce must have hit a nested object left by an earlier reduce call. That object has no name field, so an empty value was added to the list, and the object holding monkey was never visited.

The solution to this problem is to abandon MapReduce and use group instead:

    db.results.group({
        key: {"grade": true},
        initial: {names: []},
        reduce: function Reduce(val, out) {
            // Add val.name only if it is not in the list yet.
            var isExist = false;
            for (var i = 0; i < out.names.length; i++) {
                var cur = out.names[i];
                if (cur == val.name) {
                    isExist = true;
                    break;
                }
            }
            if (!isExist)
                out.names.push(val.name);
        },
        finalize: function Finalize(out) {
            return out;
        }
    });

In this way, you get the set of distinct names for grade = 1! Although MapReduce is more powerful and faster than group, extracting data like this from a large number of items (more than 100) is very risky. Therefore, when using MapReduce, try to stick to basic operations such as addition and multiplication on the emitted values, and avoid risky operations such as ++, push, and delete!
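
An alternative that stays within MapReduce, not taken from the original post, is to make map emit values in the same shape that reduce returns, so a partial result can be processed like any other value. This may be useful where group is unavailable (it was removed in MongoDB 4.2). The sketch below is plain JavaScript that imitates the batching. The batch size of 100 and the names mapValue and mergeNames are assumptions for illustration. In the shell, the map function would call emit({grade: this.grade}, {names: [this.name]}) instead.

```javascript
// Map side: every emitted value already has the shape reduce returns.
function mapValue(doc) {
  return {names: [doc.name]};
}

// Reduce side: merge the name arrays, skipping duplicates. This works the
// same whether `val` came from map or from an earlier reduce call.
function mergeNames(key, values) {
  var reduced = {names: []};
  values.forEach(function (val) {
    val.names.forEach(function (name) {
      if (reduced.names.indexOf(name) === -1) {
        reduced.names.push(name);
      }
    });
  });
  return reduced;
}

// 400 "lekko" documents plus one "monkey", all with grade 1.
var docs = [];
for (var n = 0; n < 400; n++) {
  docs.push({grade: 1, name: "lekko", score: 95});
}
docs.push({grade: 1, name: "monkey", score: 95});

// Assumed batching: reduce 100 values at a time and feed the partial
// result back in as the first value of the next call.
var partial = null;
for (var i = 0; i < docs.length; i += 100) {
  var batch = docs.slice(i, i + 100).map(mapValue);
  if (partial !== null) {
    batch.unshift(partial);
  }
  partial = mergeNames({grade: 1}, batch);
}

console.log(partial.names); // [ 'lekko', 'monkey' ]
```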

 

Third, a few tips

1. When using group or MapReduce, if a key has only one element, the reduce function is not executed, but the finalize function still is. The finalize function therefore has to return a consistent result for keys with one element and keys with several. For example, if you insert a single grade = 3 document in the second problem, does the returned grade = 3 value still contain a names set?
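
As a minimal sketch of the consistency check in tip 1, assuming the map function from the second problem (which emits {name: ...}), a finalize function can normalize both shapes. The name finalizeNames and the sample value "solo" are hypothetical.

```javascript
// Finalize that returns {names: [...]} whether or not reduce ran.
function finalizeNames(key, reduced) {
  if (Array.isArray(reduced.names)) {
    return reduced;                // reduce ran: already in the final shape
  }
  return {names: [reduced.name]};  // reduce skipped: raw emitted value
}

// A key with several elements, as returned by reduce.
var several = finalizeNames({grade: 2}, {names: ["xiaoyan", "sudan"]});
// A key with one element, passed straight from map to finalize.
var single = finalizeNames({grade: 3}, {name: "solo"});

console.log(several); // { names: [ 'xiaoyan', 'sudan' ] }
console.log(single);  // { names: [ 'solo' ] }
```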

2. Index efficiency for range queries. A range condition gets very low priority when MongoDB chooses an index. For example, a collection test holds a large number of documents with the fields 'committime' and 'author', and has two indexes: author_1 and the compound index {committime: -1, author: 1}. The following test shows the difference:

    db.test.find({'committime': {'$gt': 910713600000, '$lte': 1410192000000}, 'author': 'lekko'}).hint({committime: -1, author: 1}).explain()

With the hint forcing the compound index, explain() reports "millis": 49163.

    db.test.find({'committime': {'$gt': 910713600000, '$lte': 1410192000000}, 'author': 'lekko'}).explain()

Without the hint, explain() reports that author_1 is used.
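
To see why the order of fields in a compound index matters, here is a toy model in plain JavaScript. It is a simplified sketch, not how MongoDB's B-tree actually works: each index is a sorted array of key pairs, and the cost is the number of keys that lie between the lower and upper bound of the query. The data, the helper names, and the ascending sort order are all assumptions for illustration.

```javascript
// Compare two [first, second] index keys field by field.
function compareKeys(a, b) {
  for (var i = 0; i < 2; i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return 0;
}

// Position of the first key that is >= target (binary search).
function lowerBound(index, target) {
  var lo = 0, hi = index.length;
  while (lo < hi) {
    var mid = (lo + hi) >> 1;
    if (compareKeys(index[mid], target) < 0) lo = mid + 1; else hi = mid;
  }
  return lo;
}

// Toy index: key pairs sorted by the first field, then the second.
function buildIndex(docs, first, second) {
  return docs.map(function (d) { return [d[first], d[second]]; }).sort(compareKeys);
}

// 1000 documents, 10 authors, integer timestamps 0..999.
var docs = [];
for (var i = 0; i < 1000; i++) {
  docs.push({committime: i, author: "a" + (i % 10)});
}
var rangeFirst = buildIndex(docs, "committime", "author");
var equalityFirst = buildIndex(docs, "author", "committime");

// Query: author == "a3" and 100 < committime <= 200.
// Range-first index: every key in the time range has to be walked.
var start = lowerBound(rangeFirst, [101, ""]);
var end = lowerBound(rangeFirst, [201, ""]);
var scannedRangeFirst = end - start;
var matchedRangeFirst = rangeFirst.slice(start, end).filter(function (k) {
  return k[1] === "a3";
}).length;

// Equality-first index: the matching keys form one contiguous run.
var scannedEqualityFirst =
  lowerBound(equalityFirst, ["a3", 201]) - lowerBound(equalityFirst, ["a3", 101]);

console.log(scannedRangeFirst, matchedRangeFirst); // 100 10
console.log(scannedEqualityFirst);                 // 10
```

In the shell, the equality-first layout would correspond to an index such as db.test.createIndex({author: 1, committime: -1}).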

When reprinting, please cite the original address: http://www.cnblogs.com/lekko/p/3963418.html
