I recently wrote a crawler and stored the scraped website data in MongoDB. Because the data contains duplicates, I created an index when setting up the database. Below are my steps; the collection is named Drugitem.
Here is the collection:
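As a rough sketch of the setup, here is how the collection might be accessed with pymongo; the connection string, database name, and every field other than name are assumptions, since only the Drugitem collection and the name field are described in the text:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")  # assumed connection string
db = client["drugs"]                                # assumed database name
drugitem = db["Drugitem"]

# Hypothetical document shape: only "name" is known from the text,
# the other fields stand in for whatever the crawler extracts.
drugitem.insert_one({
    "name": "some drug name",
    "detail": "whatever the crawler scraped",
})
print(drugitem.count_documents({}))
```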
I wanted to create a unique index on the name field to ensure that name values are not duplicated:
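A minimal sketch of creating that unique index with pymongo (connection details are assumed as above; the mongo shell equivalent would be `db.Drugitem.createIndex({name: 1}, {unique: true})`):

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017/")  # assumed connection string
drugitem = client["drugs"]["Drugitem"]              # assumed database name

# Unique index on "name": any insert whose name value already exists
# is rejected with a duplicate key error.
drugitem.create_index([("name", ASCENDING)], unique=True)
```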
Running the crawler this way, I found that far less data ended up in the collection than before I created the unique index. On closer inspection I discovered that the program stopped as soon as it hit a document whose name already existed, which is not the result I wanted, because the rest of the data never got crawled. So I dropped the name index I had created:
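What likely happened is that the first duplicate insert raised a duplicate-key error that the crawler did not catch, aborting the run. Dropping the index again might look like this sketch, assuming the index got MongoDB's default name name_1:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")  # assumed connection string
drugitem = client["drugs"]["Drugitem"]              # assumed database name

# Drop the unique index; "name_1" is the default name MongoDB gives
# an ascending single-field index on "name".
drugitem.drop_index("name_1")
```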
I then cleared the data and re-crawled it the old way, so all the data was stored again, but the root of the problem was still unsolved: the collection contained a lot of duplicates. So I used a unique index together with a deduplication operation to get the final result:
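In older MongoDB versions one way to do this in a single step was ensureIndex with the dropDups: true option, which discarded duplicate documents while building the unique index; that option was removed in MongoDB 3.0. Here is a sketch of an equivalent two-step approach in pymongo that works on current versions: group documents by name, keep one per group, then build the unique index (connection details assumed as above):

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017/")  # assumed connection string
drugitem = client["drugs"]["Drugitem"]              # assumed database name

# Collect the _ids of every group of documents that share the same name.
pipeline = [
    {"$group": {"_id": "$name", "ids": {"$push": "$_id"}, "count": {"$sum": 1}}},
    {"$match": {"count": {"$gt": 1}}},
]
for group in drugitem.aggregate(pipeline, allowDiskUse=True):
    # Keep the first document in each duplicate group, delete the rest.
    drugitem.delete_many({"_id": {"$in": group["ids"][1:]}})

# With the duplicates gone, the unique index can now be built safely.
drugitem.create_index([("name", ASCENDING)], unique=True)
```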
The prerequisite is that the collection already contains all the data. After this step you will find that the number of documents in the Drugitem collection has dropped, which shows that the duplicate documents were removed.
The problem was solved, but this still does not feel like the right approach. Do we really have to post-process the data like this every time to get the final result? Is it possible to check for duplicates at insert time in the program (which may increase insertion time), or to configure the collection appropriately in advance? Since this is my first contact with MongoDB, I hope someone more experienced can offer some guidance.
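One possible answer to that question, sketched with pymongo: keep the unique index on name in place from the start, and have the crawler either swallow the duplicate-key error or upsert, so a duplicate is rejected per document instead of aborting the whole run. The function names and connection details below are illustrative assumptions:

```python
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

client = MongoClient("mongodb://localhost:27017/")  # assumed connection string
drugitem = client["drugs"]["Drugitem"]              # assumed database name

def save_item(item):
    """Insert one crawled item; assumes a unique index on "name" exists."""
    try:
        drugitem.insert_one(item)
    except DuplicateKeyError:
        pass  # duplicate name: skip this item and keep crawling

def upsert_item(item):
    """Alternative: refresh the stored copy when the same name is seen again."""
    drugitem.update_one({"name": item["name"]}, {"$set": item}, upsert=True)
```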