Comparing solutions for eliminating duplicates in massive data sets
Recently a friend in Beijing who does email marketing came to me for help: he had several million email addresses on hand and needed to deduplicate them.
Here are the solutions I tried along the way, for your reference:
1: Write your own program
This is doable, but the work involved is tedious and time-consuming. You need:
1 basic knowledge of set operations
2 multi-threaded processing
3 text file read/write operations
4 basic operations on collections or arrays
...and then debugging the program.
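The steps above can be sketched in a few lines of Python; the built-in set type already covers the "set operations" and "collection operations" the post lists. The sample addresses below are hypothetical, not from the original post; a real run would read each list from a text file, one address per line.

```python
def dedupe_sets(list_a, list_b):
    """Compare two mailing lists using Python's built-in set type.

    Returns (intersection, a_minus_b, b_minus_a, union), with
    addresses normalized to lowercase and blank lines skipped.
    """
    a = {m.strip().lower() for m in list_a if m.strip()}
    b = {m.strip().lower() for m in list_b if m.strip()}
    return a & b, a - b, b - a, a | b

# Hypothetical sample data standing in for the two imported text files.
a_lines = ["x@example.com", "y@example.com", "X@example.com"]
b_lines = ["y@example.com", "z@example.com"]

both, only_a, only_b, all_mails = dedupe_sets(a_lines, b_lines)
print(sorted(both))    # ['y@example.com']
print(sorted(only_a))  # ['x@example.com']
print(sorted(only_b))  # ['z@example.com']
print(len(all_mails))  # 3
```

For a few million addresses this fits comfortably in memory on an ordinary machine, though the multi-threading the post mentions would only matter if the files were far larger.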
2: Look for mature off-the-shelf deduplication software
I found that none of the off-the-shelf tools met the requirements, and quotes for custom development from software companies were fairly high: more than 1000 yuan per working day.
3: Use a SQL script to remove duplicates
First import the two text files into the SQL Server database as tables A and B, then run the following scripts:
1 Query the mail addresses that appear in both A and B, i.e. the intersection A ∩ B:
Select mail from A where mail in (select mail from B)
2 Query the mail in A that is not in B, i.e. the difference A - B:
Select mail from A where mail not in (select mail from B)
3 Query the mail in B that is not in A, i.e. the difference B - A:
Select mail from B where mail not in (select mail from A)
4 Query all the mail in A and B combined, i.e. the union A ∪ B (UNION removes duplicate rows by itself):
Select mail from A
Union
Select mail from B
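The four queries above can be tried out without a SQL Server installation; here is a sketch against an in-memory SQLite database using Python's standard sqlite3 module. The table layout and sample rows are assumptions for illustration; the SQL itself matches the scripts above almost verbatim.

```python
import sqlite3

# In-memory stand-in for the SQL Server tables A and B.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE A (mail TEXT)")
cur.execute("CREATE TABLE B (mail TEXT)")
cur.executemany("INSERT INTO A VALUES (?)",
                [("x@example.com",), ("y@example.com",)])
cur.executemany("INSERT INTO B VALUES (?)",
                [("y@example.com",), ("z@example.com",)])

# 1. Intersection: mail present in both A and B
both = cur.execute(
    "SELECT mail FROM A WHERE mail IN (SELECT mail FROM B)").fetchall()
# 2. Difference A - B: mail in A but not in B
only_a = cur.execute(
    "SELECT mail FROM A WHERE mail NOT IN (SELECT mail FROM B)").fetchall()
# 3. Difference B - A: mail in B but not in A
only_b = cur.execute(
    "SELECT mail FROM B WHERE mail NOT IN (SELECT mail FROM A)").fetchall()
# 4. Union: every address once (UNION discards duplicate rows)
all_rows = cur.execute(
    "SELECT mail FROM A UNION SELECT mail FROM B").fetchall()

print(both, only_a, only_b, len(all_rows))
conn.close()
```

Note that NOT IN misbehaves if the subquery can return NULL, so in practice the import step should filter out empty lines first.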
During execution: memory usage 1.3 GB, CPU usage 99%.
This method puts fairly high demands on the computer's hardware; more than 2 GB of RAM is advisable.
In my tests, solution 3 proved to be the fast and workable one.