How to remove duplicate records using SQL statements

Source: Internet
Author: User

Suppose a table holds a massive amount of data (more than one million records), some of which match on certain fields and some of which are identical in every field. How can the duplicates be removed efficiently?

To delete records whose mobile phone, office phone, and email are all the same, I had always used one of these statements:

delete from table where id not in (select max(id) from table group by mobilePhone, officePhone, email)

or

delete from table where id not in (select min(id) from table group by mobilePhone, officePhone, email)
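As a small illustration of the statement above, the following sketch runs the max(id) variant against an in-memory SQLite database from Python. SQLite only stands in for the production database here, and the table name customers and the sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, mobilePhone TEXT, officePhone TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO customers (mobilePhone, officePhone, email) VALUES (?, ?, ?)",
    [
        ("111", "222", "a@example.com"),
        ("111", "222", "a@example.com"),  # duplicate of row 1
        ("333", "444", "b@example.com"),
        ("111", "222", "a@example.com"),  # duplicate of row 1
    ],
)

# Keep the row with the largest id in each
# (mobilePhone, officePhone, email) group and delete the rest.
conn.execute(
    "DELETE FROM customers WHERE id NOT IN ("
    "SELECT max(id) FROM customers GROUP BY mobilePhone, officePhone, email)"
)
conn.commit()

remaining = conn.execute("SELECT id, email FROM customers ORDER BY id").fetchall()
print(remaining)
```

Swapping max(id) for min(id) keeps the oldest row of each group instead of the newest.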

The latter is the faster of the two. This approach is acceptable for fewer than one million rows: with a few hundred thousand duplicates, it takes anywhere from several minutes to a few dozen minutes. However, once the duplicates exceed about 1/5 of the data, efficiency drops sharply, and with even more duplicates the statement may need dozens of hours to finish. Sometimes the table stays locked and the statement still has not completed after running all night. I had to find another workable method, and today I finally made some progress:

-- Find the IDs of the unique rows and store them in the temporary table tmp.
select min(id) as mid into tmp from table group by mobilePhone, officePhone, email

-- Select the rows with those IDs and insert them into the table finally.
insert into finally select (fields other than ID) from table where id in (select mid from tmp)
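The two steps above can be sketched as follows, again using Python with an in-memory SQLite database as a stand-in. SQLite has no select ... into, so the sketch uses CREATE TEMP TABLE ... AS SELECT instead; the source table name customers and the sample rows are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY, mobilePhone TEXT, officePhone TEXT, email TEXT);
    CREATE TABLE finally (
        id INTEGER PRIMARY KEY, mobilePhone TEXT, officePhone TEXT, email TEXT);
    INSERT INTO customers (mobilePhone, officePhone, email) VALUES
        ('111', '222', 'a@example.com'),
        ('111', '222', 'a@example.com'),
        ('333', '444', 'b@example.com');

    -- Step 1: collect one id per group of identical rows.
    CREATE TEMP TABLE tmp AS
        SELECT min(id) AS mid FROM customers
        GROUP BY mobilePhone, officePhone, email;

    -- Step 2: copy only those rows (all fields except id) into the new table.
    INSERT INTO finally (mobilePhone, officePhone, email)
        SELECT mobilePhone, officePhone, email FROM customers
        WHERE id IN (SELECT mid FROM tmp);
    """
)

copied = conn.execute(
    "SELECT mobilePhone, officePhone, email FROM finally ORDER BY email"
).fetchall()
print(copied)
```

Because nothing is deleted from the source table, the work is one grouped scan plus one insert, which is the reason this route tends to scale better than the not in delete.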

Efficiency comparison: deduplicating 5 million rows (1/2 of them duplicates) with the delete method took about 4 hours, which is a long time.

Deduplicating 5 million rows (1/2 of them duplicates) by inserting into a temporary table took less than 10 minutes.

Removing duplicate records with SQL statements and finding duplicate records

Find the rows in a table that duplicate one another on several fields, and delete them according to their insertion time. Which rows are deleted depends on the order by clause and the resulting row_num.

Method 1: deduplicate based on multiple conditions:

delete tmp from (select row_num = row_number() over (partition by field, field order by time desc) from table where time > getdate() - 1) tmp where row_num > 1
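A minimal sketch of the row_number() idea, using Python with an in-memory SQLite database as a stand-in (window functions need SQLite 3.25 or later). SQLite cannot delete through a derived table, so the sketch deletes by id instead. The table customers, its columns, and the sample rows are assumptions, and the time > getdate() - 1 filter is left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, mobilePhone TEXT, email TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO customers (mobilePhone, email, created_at) VALUES (?, ?, ?)",
    [
        ("111", "a@example.com", "2024-01-01 09:00:00"),
        ("111", "a@example.com", "2024-01-02 09:00:00"),  # newest of its group
        ("333", "b@example.com", "2024-01-01 10:00:00"),
    ],
)

# Number the rows in each group from newest to oldest, then delete every
# row that is not the first (newest) one in its group.
conn.execute(
    """
    DELETE FROM customers WHERE id IN (
        SELECT id FROM (
            SELECT id, row_number() OVER (
                PARTITION BY mobilePhone, email ORDER BY created_at DESC
            ) AS row_num
            FROM customers
        ) WHERE row_num > 1
    )
    """
)
conn.commit()

kept = conn.execute("SELECT id, created_at FROM customers ORDER BY id").fetchall()
print(kept)
```

Changing the order by direction to ascending would keep the oldest row of each group instead.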

Method 2: deduplicate based on a single condition:

delete from table where primary key ID not in (select max(primary key ID) from table group by field to be de-duplicated having count(field to be de-duplicated) >= 1)

Note: To improve efficiency, both of the preceding methods can use temporary tables. The subquery inside not in can first be extracted into a temporary table #tmp, and not exists can then be used in place of not in. To avoid deleting too many rows at once, use top to delete in batches.

delete top (2) from table where not exists (select primary key ID from #tmp where #tmp.primary key ID = table.primary key ID)
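The batched not exists delete can be sketched as a loop that repeats until nothing is left to remove. This uses Python with an in-memory SQLite database as a stand-in; since SQLite has no delete top, the batch is limited with a LIMIT subquery instead. The table customers, the batch size, and the sample rows are assumptions for the example.

```python
import sqlite3

BATCH_SIZE = 2  # hypothetical batch size, mirroring top (2) in the statement above

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, mobilePhone TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO customers (mobilePhone, email) VALUES (?, ?)",
    [("111", "a@example.com")] * 5 + [("333", "b@example.com")] * 2,
)

# Temporary table holding the ids of the rows to keep (one per group).
conn.execute(
    "CREATE TEMP TABLE tmp AS "
    "SELECT max(id) AS id FROM customers GROUP BY mobilePhone, email"
)

batches = 0
while True:
    # Delete at most BATCH_SIZE rows whose id does not appear in tmp.
    cur = conn.execute(
        "DELETE FROM customers WHERE id IN ("
        "SELECT c.id FROM customers AS c "
        "WHERE NOT EXISTS (SELECT 1 FROM tmp WHERE tmp.id = c.id) "
        "LIMIT ?)",
        (BATCH_SIZE,),
    )
    conn.commit()  # committing per batch keeps each transaction short
    if cur.rowcount == 0:
        break
    batches += 1

kept_ids = [row[0] for row in conn.execute("SELECT id FROM customers ORDER BY id")]
print(batches, kept_ids)
```

Committing after every batch is a design choice aimed at the locking problem described earlier: each transaction touches only a few rows, so other sessions are blocked for a shorter time.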
