Solutions for comparing massive data sets and eliminating duplicates


Recently, a friend in Beijing who does email marketing needed to remove duplicates from the many millions of records he has on hand.

Here are the solutions I found while exploring the problem, for your reference:

1: Write your own program

This can be done, but the techniques involved are cumbersome and time-consuming:

1 Basic knowledge of set operations

2 Multi-threaded processing

3 Reading and writing text files

4 Basic operations on collections or arrays

Then comes debugging the software ...
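The set-operation and file-handling parts of this approach can be sketched in a few lines of Python. This is only a minimal single-threaded sketch, not the program described above: the function names are hypothetical, and it assumes one address per line, UTF-8 files, case-insensitive matching, and enough memory to hold both files.

```python
def load_addresses(path):
    """Read one address per line into a set, which drops duplicates."""
    addresses = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            address = line.strip().lower()  # assumption: case-insensitive match
            if address:  # skip blank lines
                addresses.add(address)
    return addresses


def compare(a, b):
    """Return the four set results for address sets a and b."""
    return {
        "in_both": a & b,    # intersection
        "only_in_a": a - b,  # difference A minus B
        "only_in_b": b - a,  # difference B minus A
        "all": a | b,        # union
    }


def write_addresses(path, addresses):
    """Write addresses to a text file, one per line, in sorted order."""
    with open(path, "w", encoding="utf-8") as handle:
        for address in sorted(addresses):
            handle.write(address + "\n")
```

A typical call would be `compare(load_addresses("a.txt"), load_addresses("b.txt"))`, where the two file names are placeholders. Files too large for memory would need a different design, such as sorting on disk and merging.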

2: Look for mature deduplication software on the market

Off-the-shelf software could not meet the requirements, and custom development is expensive: software companies quoted more than 1,000 yuan per working day.

3: Use SQL scripts to remove duplicates

First, import the two text files into a SQL Server database as tables A and B, then run the following scripts:

1 Query the mail addresses that appear in both A and B, that is, A intersect B:

SELECT mail FROM A WHERE mail IN (SELECT mail FROM B)

2 Query the mail addresses in A that are not in B, that is, A minus B:

SELECT mail FROM A WHERE mail NOT IN (SELECT mail FROM B)

3 Query the mail addresses in B that are not in A, that is, B minus A:

SELECT mail FROM B WHERE mail NOT IN (SELECT mail FROM A)

4 Query all the mail addresses in A and B, that is, A union B:

SELECT col001 FROM A

UNION

SELECT col001 FROM B
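The four set queries can be tried on a small scale before loading millions of rows. The sketch below is an illustration under stated assumptions, not the setup used here: it runs on an in-memory SQLite database through Python's standard `sqlite3` module instead of SQL Server, the sample rows are made up, the column is assumed to be named `mail` in both tables, and `DISTINCT` is added so that an address repeated within one table is reported once.

```python
import sqlite3

# Hypothetical sample data; the real tables would hold the imported files.
ROWS_A = ["x@example.com", "y@example.com", "y@example.com"]
ROWS_B = ["y@example.com", "z@example.com"]

QUERIES = {
    "in_both": "SELECT DISTINCT mail FROM A WHERE mail IN (SELECT mail FROM B)",
    "only_in_a": "SELECT DISTINCT mail FROM A WHERE mail NOT IN (SELECT mail FROM B)",
    "only_in_b": "SELECT DISTINCT mail FROM B WHERE mail NOT IN (SELECT mail FROM A)",
    "all": "SELECT mail FROM A UNION SELECT mail FROM B",
}


def run_set_queries(rows_a, rows_b):
    """Load two address lists into tables A and B and run the four queries."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE A (mail TEXT)")
        conn.execute("CREATE TABLE B (mail TEXT)")
        conn.executemany("INSERT INTO A (mail) VALUES (?)", [(m,) for m in rows_a])
        conn.executemany("INSERT INTO B (mail) VALUES (?)", [(m,) for m in rows_b])
        # Each result is a sorted list of addresses.
        return {
            name: sorted(row[0] for row in conn.execute(sql))
            for name, sql in QUERIES.items()
        }
    finally:
        conn.close()


if __name__ == "__main__":
    for name, addresses in run_set_queries(ROWS_A, ROWS_B).items():
        print(name, addresses)
```

One point to watch with `NOT IN`: if the subquery returns a NULL, the condition is never true and the query returns no rows, so NULL addresses should be filtered out or cleaned up on import.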

While the scripts were running on SQL Server: memory usage 1.3 GB, CPU usage 99%.

This method places high demands on the computer's hardware and requires more than 2 GB of memory.

The results show that solution 3 is a fast and workable approach.
