Handling duplicate data on insert in MongoDB

Source: Internet
Author: User

I recently wrote a crawler that stores the data it scrapes from a website in MongoDB. Because the data contained duplicates, I added an index when setting up the database. My steps are below; the collection is named Drugitem.

Here is the collection:


I wanted to create a unique index on the name field so that name values cannot be duplicated:
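The command used at this point is not preserved in the post. As a minimal sketch of the step, assuming the crawler talks to MongoDB through PyMongo (an assumption; the post does not name a driver), a unique index on name could be created as follows. The helper name ensure_unique_name_index is hypothetical.

```python
from typing import Any


def ensure_unique_name_index(collection: Any) -> str:
    """Create an ascending unique index on the `name` field.

    `collection` is assumed to be a PyMongo Collection, for example the
    Drugitem collection of some database. MongoDB refuses to build the
    index if the collection already holds duplicate names.
    """
    # 1 means ascending order; unique=True makes MongoDB reject any later
    # insert whose name matches an existing document.
    return collection.create_index([("name", 1)], unique=True)
```

The call returns the name of the index, which is the value needed later if the index has to be dropped again.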


When I ran the program with this index in place, I ended up with far fewer documents than before the unique index was added. On closer inspection I found that the program stopped at the first duplicate name value. This is not the result I wanted, because the remaining data was never fetched. So I dropped the name index I had created:
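The early stop is consistent with an insert failing on the first duplicate name and the resulting error ending the run. One possible way to keep going instead, sketched here for a PyMongo-style collection (an assumption), is to catch the duplicate-key error per document and continue. The function insert_skipping_duplicates and its duplicate_error parameter are hypothetical; with PyMongo the exception to pass would be pymongo.errors.DuplicateKeyError.

```python
from typing import Any, Dict, Iterable, Tuple, Type


def insert_skipping_duplicates(
    collection: Any,
    docs: Iterable[Dict[str, Any]],
    duplicate_error: Type[BaseException],
) -> Tuple[int, int]:
    """Insert documents one by one, skipping unique-index violations.

    Returns (inserted, skipped). `duplicate_error` is the exception type
    the driver raises when a unique index rejects a document.
    """
    inserted = 0
    skipped = 0
    for doc in docs:
        try:
            collection.insert_one(doc)
            inserted += 1
        except duplicate_error:
            # A document with this name already exists; move on to the
            # next one instead of aborting the whole crawl.
            skipped += 1
    return inserted, skipped
```

With this shape the unique index can stay in place, and the skipped count shows how many duplicates the crawl produced.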


I then cleared the data and re-crawled it the old way. This retrieved everything, but the underlying problem was not solved: the collection again contained many duplicates. So I used a unique index combined with a duplicate-removal operation to get the final result:
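The exact command is not preserved here. On MongoDB versions before 3.0 this kind of cleanup was commonly done with the dropDups index option, which was removed in 3.0. On newer servers, one possible equivalent is to group documents by name, delete all but one document per group, and only then build the unique index. The sketch below assumes a PyMongo-style collection; the helper name remove_duplicate_names is hypothetical, and which document of each group survives is arbitrary.

```python
from typing import Any


def remove_duplicate_names(collection: Any) -> int:
    """Delete documents with a repeated `name`, then add a unique index.

    Returns the number of documents deleted.
    """
    # Group by name and keep only the groups that contain duplicates.
    pipeline = [
        {"$group": {"_id": "$name",
                    "ids": {"$push": "$_id"},
                    "count": {"$sum": 1}}},
        {"$match": {"count": {"$gt": 1}}},
    ]
    deleted = 0
    for group in collection.aggregate(pipeline):
        # Keep the first _id in each group and delete the rest.
        extra_ids = group["ids"][1:]
        result = collection.delete_many({"_id": {"$in": extra_ids}})
        deleted += result.deleted_count
    # With the duplicates gone, the unique index can be built.
    collection.create_index([("name", 1)], unique=True)
    return deleted
```

The return value corresponds to the drop in document count described for the Drugitem collection.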


This assumes the collection already contains all the data. After the operation, the number of documents in the Drugitem collection goes down, which shows that the duplicate documents have been removed.

The problem is solved, but the approach still does not feel right. Do we have to run this cleanup every time to get the final data? Can the program check for duplicates at insert time (which may slow inserts down), or can the collection be configured in advance? I am new to MongoDB and would appreciate guidance from someone more experienced.
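One possible answer to the insert-time question, offered as a sketch rather than a definitive solution, is an upsert keyed on name: the write inserts the document only when no document with that name exists and otherwise leaves the stored one untouched. The example assumes a PyMongo-style collection and uses update_one with the $setOnInsert operator and upsert=True; the helper name insert_if_new is hypothetical.

```python
from typing import Any, Dict


def insert_if_new(collection: Any, doc: Dict[str, Any]) -> bool:
    """Insert `doc` unless a document with the same name already exists.

    Returns True if a new document was inserted, False if one with that
    name was already present (in which case nothing is modified).
    """
    result = collection.update_one(
        {"name": doc["name"]},   # look up by the deduplication key
        {"$setOnInsert": doc},   # applied only when the upsert inserts
        upsert=True,
    )
    # upserted_id is set only when the upsert created a new document.
    return result.upserted_id is not None
```

Keeping the unique index on name alongside this is still a reasonable safeguard, since two concurrent upserts for the same name can race. The lookup on each write is served by that same index, so the extra cost per insert is an index probe rather than a collection scan.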

Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
