How to query for duplicate data records in MongoDB using aggregate

Last Update:2017-01-18 Source: Internet

Author: User

Tags mongodb

MongoDB aggregation (aggregate) is used primarily to process data (for example, to compute averages and sums) and to return the computed results. It is somewhat similar to aggregate functions such as count(*) in an SQL statement.

aggregate() method

Aggregation in MongoDB is performed with the aggregate() method.

Syntax

The basic syntax of the aggregate() method is as follows:

>db.COLLECTION_NAME.aggregate(AGGREGATE_OPERATION)
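
To make the idea of an aggregation stage concrete, here is a minimal sketch in plain JavaScript of what a $group stage with $sum computes. It does not call MongoDB at all; the function groupSum, the sample array orders, and the field names status and amount are hypothetical and exist only for illustration.

```javascript
// Plain JavaScript stand-in for the pipeline
//   [{$group: {_id: "$status", total: {$sum: "$amount"}}}]
// `status` and `amount` are hypothetical field names.
function groupSum(docs, keyField, valueField) {
  const totals = new Map();
  for (const doc of docs) {
    const key = doc[keyField];
    // Accumulate the running sum for this key.
    totals.set(key, (totals.get(key) || 0) + doc[valueField]);
  }
  // Emit one result document per distinct key, as $group does.
  return Array.from(totals, ([key, total]) => ({_id: key, total: total}));
}

const orders = [
  {status: 'paid', amount: 10},
  {status: 'paid', amount: 5},
  {status: 'open', amount: 7}
];
const summary = groupSum(orders, 'status', 'amount');
```

The sketch returns groups in first-seen order because it uses a Map; MongoDB itself makes no promise about the order of $group output.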

MongoDB is a document database, and its documents are stored as JSON-like objects. That is why MongoDB is a common choice for data access in Node.js. However, because Node.js handles requests asynchronously, a check followed by a save is not an atomic operation. If the client fires the same event twice in a row and each request stores its data, the data is likely to be saved twice. Under high concurrency this risk remains even with strict validation in the code, such as checking whether the data already exists before inserting it: with asynchronous execution you cannot know which request's operations run first, and the requests sent by the client are not processed strictly one after another as we might assume. A better solution is to create a unique index on the relevant fields of each collection in the MongoDB database. In fact, MongoDB creates a unique index on the _id field of every collection by default. If you want to create an index from Node.js through a Mongoose schema, you can refer to the following code:

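var mongoose = require('mongoose');
var Schema = mongoose.Schema;
var customerSchema = new Schema({
    cname: String,
    cellPhone: String,
    sender: String,
    tag: String,
    behaviour: Number,
    createTime: {
        type: Date,
        default: Date.now
    },
    current: {
        type: Boolean,
        default: true
    }
}, {
    versionKey: false
});
// Unique compound index: inserting a document that repeats all five values fails.
customerSchema.index({cname: 1, cellPhone: 1, sender: 1, tag: 1, behaviour: 1}, {unique: true});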

In the model above we define the structure of the customer collection and create a unique index on the fields cname, cellPhone, sender, tag and behaviour through the index() method, so that the database throws an exception when a document that duplicates these fields is inserted. With Mongoose, if the collection already exists and the program is already running, then after we modify the model to add an index and restart the app, Mongoose automatically detects this and creates the index when the model is accessed. Of course, if the collection already contains duplicate data, index creation fails. In that case, we can have the database delete the duplicate data automatically by adding the dropDups option when we create the index, such as:

customerSchema.index({cname: 1, cellPhone: 1, sender: 1, tag: 1, behaviour: 1}, {unique: true, dropDups: true});
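
Once a unique index exists, a duplicate insert is rejected by the server, and the error carries MongoDB's duplicate key code 11000. The following is a minimal sketch of how application code might tell that case apart from other failures. The helper names isDuplicateKeyError and saveIgnoringDuplicates are hypothetical, and doc is assumed to be any object whose save() returns a promise, such as a Mongoose document.

```javascript
// MongoDB reports unique-index violations with error code 11000 (E11000).
const DUPLICATE_KEY = 11000;

// Hypothetical helper: true only for duplicate key errors.
function isDuplicateKeyError(err) {
  return Boolean(err) && err.code === DUPLICATE_KEY;
}

// Hypothetical wrapper: `doc` is assumed to expose a promise-returning save().
async function saveIgnoringDuplicates(doc) {
  try {
    await doc.save();
    return true;       // the document was stored
  } catch (err) {
    if (isDuplicateKeyError(err)) {
      return false;    // the same data already exists; skip it
    }
    throw err;         // any other error is a real failure
  }
}
```

Treating a duplicate as "already saved" is only one possible policy; an application may prefer to report it to the caller instead.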

However, according to the MongoDB documentation, the dropDups option has been removed since version 3.0, and no alternative is provided. It seems that MongoDB no longer offers a way to delete duplicate records automatically when creating an index. So how can we find duplicate records quickly and efficiently and delete them? First we have to find the records, and then delete them with the remove() method. The following query finds records that have duplicate values for the given fields:

db.collection.aggregate([
    {$group: {
        _id: {firstField: "$firstField", secondField: "$secondField"},
        uniqueIds: {$addToSet: "$_id"},
        count: {$sum: 1}
    }},
    {$match: {
        count: {$gt: 1}
    }}
])

Replace the value of the _id property to specify the fields you want to check for duplicates. The corresponding code in Node.js is as follows:

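var Q = require('q');
// Model is a compiled Mongoose model, for example mongoose.model('Customer', customerSchema)
var deferred = Q.defer();
var group = {firstField: "$firstField", secondField: "$secondField"};
Model.aggregate().group({
    _id: group,
    uniqueIds: {$addToSet: '$_id'},
    count: {$sum: 1}
}).match({count: {$gt: 1}}).exec(deferred.makeNodeResolver());
// deferred.promise resolves with the array of duplicate groups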

The preceding code uses Q in place of a plain callback for the query: makeNodeResolver() returns a Node.js-style callback that settles deferred.promise. In asynchronous Node.js programming, using Q to handle callbacks is a good choice.
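
As an alternative to Q, the same query can be sketched with native promises, assuming a Mongoose version in which exec() returns a promise when called without a callback. The function names buildDuplicatePipeline and findDuplicates are hypothetical, and Model stands for whatever compiled Mongoose model is being checked.

```javascript
// Build the $group/$match pipeline for an arbitrary list of field names.
function buildDuplicatePipeline(fields) {
  const groupId = {};
  fields.forEach(function (field) {
    groupId[field] = '$' + field;   // e.g. {cellPhone: "$cellPhone"}
  });
  return [
    {$group: {
      _id: groupId,
      uniqueIds: {$addToSet: '$_id'},
      count: {$sum: 1}
    }},
    {$match: {count: {$gt: 1}}}     // keep only groups with more than one record
  ];
}

// Assumes Model.aggregate(pipeline).exec() returns a promise.
async function findDuplicates(Model, fields) {
  return Model.aggregate(buildDuplicatePipeline(fields)).exec();
}
```

A call such as findDuplicates(Model, ['firstField', 'secondField']) would resolve with the array of duplicate groups.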

The following is the returned result:

/* 1 */
{
    "result" : [
        {
            "_id" : {
                "cellPhone" : "15827577345",
                "actId" : ObjectId("5694565fa50fea7705f01789")
            },
            "uniqueIds" : [
                ObjectId("569b5d03b3d206f709f97685"),
                ObjectId("569b5d01b3d206f709f97684")
            ],
            "count" : 2.0000000000000000
        },
        {
            "_id" : {
                "cellPhone" : "18171282716",
                "actId" : ObjectId("566b0d8dc02f61ae18e68e48")
            },
            "uniqueIds" : [
                ObjectId("566d16e6cf86d12d1abcee8b"),
                ObjectId("566d16e6cf86d12d1abcee8a")
            ],
            "count" : 2.0000000000000000
        }
    ],
    "ok" : 1.0000000000000000
}

As the result shows, two groups of records contain duplicates, so the returned result array has length 2. The uniqueIds property is an array that holds the _id values of the duplicate records in each group; with these values we can use the remove() method to find and delete the corresponding data.
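
The deletion step can be sketched as follows, assuming the goal is to keep one record from each group and remove the rest. The function idsToRemove is hypothetical, and plain strings stand in for ObjectId values so that the example runs without a database.

```javascript
// `groups` has the shape of the aggregation result: each entry lists the
// _id values of one set of duplicate records in `uniqueIds`.
function idsToRemove(groups) {
  const ids = [];
  for (const group of groups) {
    // Keep the first id of each group and mark the others for deletion.
    // $addToSet does not guarantee order, so which copy survives is arbitrary.
    ids.push(...group.uniqueIds.slice(1));
  }
  return ids;
}

// Strings stand in for ObjectId values in this sample.
const sample = [
  {
    _id: {cellPhone: '15827577345'},
    uniqueIds: ['569b5d03b3d206f709f97685', '569b5d01b3d206f709f97684'],
    count: 2
  },
  {
    _id: {cellPhone: '18171282716'},
    uniqueIds: ['566d16e6cf86d12d1abcee8b', '566d16e6cf86d12d1abcee8a'],
    count: 2
  }
];
const doomed = idsToRemove(sample);

// With a real model, the ids could then be passed to a query such as
//   Model.remove({_id: {$in: doomed}})
```

If it matters which copy is kept (for example the oldest one), the ids would need to be sorted by a suitable field before the first one is spared; the sketch makes no such choice.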

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
