How to query for duplicate data records in MongoDB using aggregate


MongoDB aggregation (aggregate) is mainly used to process data (for example, computing averages and sums) and return a computed result, somewhat like count(*) in an SQL statement.

The aggregate() method

Aggregation in MongoDB is performed with the aggregate() method.

Syntax

The basic syntax of the aggregate() method is as follows:

> db.collection_name.aggregate(aggregate_operation)
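
For example, a minimal aggregation run from the mongo shell (the orders collection and its cust_id and amount fields are hypothetical, used only for illustration) groups documents by customer and sums an amount per group:

> db.orders.aggregate([
    {$group: {_id: "$cust_id", total: {$sum: "$amount"}}}
  ])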

We know that MongoDB is a document database whose stored documents are JSON objects. Because of this, MongoDB is often used for data access in Node.js. But since Node.js executes asynchronously, we cannot guarantee that each database save operation is atomic. That is, if the client fires the same event twice in a row and both requests store data in the database, the data is likely to be saved twice. Under high concurrency, even very strict validation in the application code (for example, checking whether the data to be saved already exists before inserting it) still leaves a risk of duplicate saves, because with asynchronous execution you cannot control which request is handled first; requests initiated by the client are not processed in the order we expect. A better solution is to create a unique index on the collection in MongoDB. In fact, MongoDB creates a unique index on the _id field of every collection by default. If you want to create an index from Node.js through a Mongoose schema, you can refer to the following code:

var mongoose = require('mongoose');
var Schema = mongoose.Schema;
var customerSchema = new Schema({
  cname: String,
  cellPhone: String,
  sender: String,
  tag: String,
  behaviour: Number,
  createTime: {
    type: Date,
    default: Date.now
  },
  current: {
    type: Boolean,
    default: true
  }
}, {
  versionKey: false
});

In the model above we defined the structure of the customer collection, and we can create a unique compound index on the fields cname, cellPhone, sender, tag and behaviour through the index() method, so that when a document duplicating those fields is inserted, the database throws an exception. With Mongoose, if the collection already existed and the program was running, then after we modify the model to add an index and restart the app, Mongoose automatically detects the change and creates the index the next time the model is accessed. Of course, if duplicate data already exists, index creation fails. In that case we can have the database automatically delete the duplicate data by adding the dropDups option when creating the index, for example:

customerSchema.index({cname: 1, cellPhone: 1, sender: 1, tag: 1, behaviour: 1}, {unique: true, dropDups: true});

However, according to the MongoDB documentation, the dropDups option has been removed since version 3.0 with no replacement; MongoDB no longer offers the ability to automatically delete duplicate records when creating an index. So how can you quickly and efficiently find duplicate records and delete them? First we have to find these records, and then delete them with the remove() method. The following query finds records whose values are duplicated for the given fields:

db.collection.aggregate([
  {$group: {
    _id: {firstField: "$firstField", secondField: "$secondField"},
    uniqueIds: {$addToSet: "$_id"},
    count: {$sum: 1}
  }},
  {$match: {
    count: {$gt: 1}
  }}
])

Replace the value of the _id property with the fields you want to check for duplicates. The equivalent code in Node.js is as follows:

var deferred = Q.defer();
var group = {firstField: "$firstField", secondField: "$secondField"};
Model.aggregate().group({
  _id: group,
  uniqueIds: {$addToSet: '$_id'},
  count: {$sum: 1}
}).match({count: {$gt: 1}}).exec(deferred.makeNodeResolver());

The preceding code uses Q instead of a plain callback. In Node.js asynchronous programming, using Q promises to handle callbacks is a good choice.
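
As a brief illustration of how the resulting promise might then be consumed (the handler bodies below are placeholders, not part of the original article):

// Assumes var Q = require('q') at the top of the file; deferred comes from
// Q.defer() above, and its promise resolves with the aggregation result
// once exec() invokes deferred.makeNodeResolver().
deferred.promise.then(function (duplicates) {
  console.log('Found %d groups of duplicate documents', duplicates.length);
}).catch(function (err) {
  console.error('Aggregation failed:', err);
});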

The returned result is as follows:

/* 1 */
{
  "result": [
    {
      "_id": {
        "cellPhone": "15827577345",
        "actId": ObjectId("5694565fa50fea7705f01789")
      },
      "uniqueIds": [
        ObjectId("569b5d03b3d206f709f97685"),
        ObjectId("569b5d01b3d206f709f97684")
      ],
      "count": 2.0000000000000000
    },
    {
      "_id": {
        "cellPhone": "18171282716",
        "actId": ObjectId("566b0d8dc02f61ae18e68e48")
      },
      "uniqueIds": [
        ObjectId("566d16e6cf86d12d1abcee8b"),
        ObjectId("566d16e6cf86d12d1abcee8a")
      ],
      "count": 2.0000000000000000
    }
  ],
  "ok": 1.0000000000000000
}

As you can see from the results, two groups of documents contain the same record, so the length of the returned result array is 2. The uniqueIds property is an array holding the _id values of the duplicated documents; with these we can use the remove() method to find and delete the corresponding data.
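
To finish the job, here is a minimal sketch of that deletion step run in the mongo shell. The choice to keep the first _id of each group, and the placeholder collection and field names, are assumptions for illustration and not part of the original article:

// Find each group of duplicates, keep one document per group, and remove the rest.
db.collection.aggregate([
  {$group: {
    _id: {firstField: "$firstField", secondField: "$secondField"},
    uniqueIds: {$addToSet: "$_id"},
    count: {$sum: 1}
  }},
  {$match: {count: {$gt: 1}}}
]).forEach(function (group) {
  // Keep the first _id and delete the remaining duplicates in this group.
  var idsToRemove = group.uniqueIds.slice(1);
  db.collection.remove({_id: {$in: idsToRemove}});
});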
