Remove duplicate records with SQL statements

Last Update:2015-07-15 Source: Internet

Author: User


With a very large table (more than a million rows), some rows are complete duplicates and others are duplicated only on certain fields. How can the duplicates be removed efficiently?

To delete rows that share the same mobile phone (mobilephone), office phone (officephone), and email (email) values, I had been using the following statements:

SQL code

-- Keep the row with the largest ID in each group
DELETE FROM table WHERE id NOT IN
(SELECT MAX(id) FROM table GROUP BY mobilephone, officephone, email)
-- or keep the row with the smallest ID in each group
DELETE FROM table WHERE id NOT IN
(SELECT MIN(id) FROM table GROUP BY mobilephone, officephone, email)
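
The statement above can be tried on a small scale before running it against a large table. The following is a minimal sketch, assuming an in-memory SQLite database as a stand-in for the real server; the table name customers and the sample rows are hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for the production database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, mobilephone TEXT, officephone TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO customers (mobilephone, officephone, email) VALUES (?, ?, ?)",
    [
        ("111", "010-1", "a@example.com"),  # id 1
        ("111", "010-1", "a@example.com"),  # id 2, duplicate of id 1
        ("222", "010-2", "b@example.com"),  # id 3
        ("111", "010-1", "a@example.com"),  # id 4, duplicate of id 1
        ("222", "010-2", "b@example.com"),  # id 5, duplicate of id 3
    ],
)

# Keep the smallest id in each (mobilephone, officephone, email) group
# and delete every other row.
conn.execute(
    """
    DELETE FROM customers
    WHERE id NOT IN (
        SELECT MIN(id) FROM customers
        GROUP BY mobilephone, officephone, email
    )
    """
)

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM customers ORDER BY id")]
print(remaining_ids)  # [1, 3]
```

Replacing MIN with MAX keeps the newest row of each group instead of the oldest.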


Of the two, the second statement is slightly faster. This approach is acceptable for about 1 million rows: with roughly one fifth of the rows duplicated, it takes from a few minutes to a few tens of minutes. Once the table exceeds 3 million rows, however, performance drops sharply. With a higher share of duplicates it often runs for tens of hours, and sometimes it locks the table and still has not finished after running all night. I had no choice but to look for a more workable method, and today I finally made some progress:

SQL code

-- Select the ID of one row per group and store the IDs in the temporary table tmp
SELECT MIN(id) AS mid INTO tmp FROM table GROUP BY mobilephone, officephone, email
-- Select the deduplicated rows and insert them into the table finally
-- (Customers_1 is the source table, written as "table" above)
INSERT INTO finally SELECT (fields other than ID) FROM Customers_1 WHERE id IN (SELECT mid FROM tmp)


Efficiency comparison: deduplicating 5 million rows (half of them duplicates) with the DELETE method took about 4 hours, which is a long time.

Deduplicating 5 million rows (half of them duplicates) by inserting through a temporary table took less than 10 minutes.

The DELETE method is probably slow because it searches and deletes at the same time. With the temporary table, you first select the IDs of the rows to keep into the temporary table, then use those IDs to look up the rows and insert them into a new table, and finally delete the original table. This makes deduplication much faster.
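
The whole temporary-table workflow can be sketched end to end as follows. This is only an illustration under stated assumptions: it uses an in-memory SQLite database, so CREATE TEMP TABLE ... AS SELECT replaces the SELECT ... INTO form shown earlier, the table names customers and customers_dedup are hypothetical, and the final rename step is an assumption about how the new table takes over from the original.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY, mobilephone TEXT, officephone TEXT, email TEXT);
    INSERT INTO customers (mobilephone, officephone, email) VALUES
        ('111', '010-1', 'a@example.com'),
        ('111', '010-1', 'a@example.com'),
        ('222', '010-2', 'b@example.com'),
        ('222', '010-2', 'b@example.com'),
        ('333', '010-3', 'c@example.com');

    -- Step 1: store one id per group in a temporary table.
    CREATE TEMP TABLE tmp AS
        SELECT MIN(id) AS mid FROM customers
        GROUP BY mobilephone, officephone, email;

    -- Step 2: copy only the selected rows (all fields except id) to a new table.
    CREATE TABLE customers_dedup (
        id INTEGER PRIMARY KEY, mobilephone TEXT, officephone TEXT, email TEXT);
    INSERT INTO customers_dedup (mobilephone, officephone, email)
        SELECT mobilephone, officephone, email FROM customers
        WHERE id IN (SELECT mid FROM tmp)
        ORDER BY id;

    -- Step 3: delete the original table and (assumption) rename the new one.
    DROP TABLE customers;
    ALTER TABLE customers_dedup RENAME TO customers;
    """
)

rows = conn.execute(
    "SELECT mobilephone, officephone, email FROM customers ORDER BY id"
).fetchall()
print(rows)
```

No row is deleted one by one here: the surviving rows are written once in bulk and the original table is dropped in a single statement.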

Removing duplicate records and finding duplicate records with SQL statements

To find rows in a table that are duplicated on certain fields and delete them according to the order in which they were inserted, use ORDER BY together with a row number (row_num).

Method one: deduplicate on multiple fields:

SQL code

-- Number the rows in each group, newest first, and delete all but the first
DELETE tmp FROM (
SELECT row_num = ROW_NUMBER() OVER (PARTITION BY field, field ORDER BY time DESC)
FROM table WHERE time > GETDATE() - 1
) tmp
WHERE row_num > 1
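
The row-numbering idea can be tried out with the sketch below. It assumes an in-memory SQLite database (version 3.25 or later, for window functions). SQLite cannot delete through a derived table the way the statement above does, so the sketch deletes by id instead; it also omits the time filter. The table events and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (
        id INTEGER PRIMARY KEY, field1 TEXT, field2 TEXT, created_at TEXT);
    INSERT INTO events (field1, field2, created_at) VALUES
        ('a', 'x', '2015-07-01 10:00:00'),
        ('a', 'x', '2015-07-01 12:00:00'),
        ('a', 'x', '2015-07-01 11:00:00'),
        ('b', 'y', '2015-07-01 09:00:00'),
        ('b', 'y', '2015-07-01 09:30:00');
    """
)

# Number the rows of each (field1, field2) group from newest to oldest,
# then delete every row that is not the newest of its group.
conn.execute(
    """
    DELETE FROM events
    WHERE id IN (
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY field1, field2
                       ORDER BY created_at DESC
                   ) AS row_num
            FROM events
        )
        WHERE row_num > 1
    )
    """
)

kept_ids = [row[0] for row in conn.execute("SELECT id FROM events ORDER BY id")]
print(kept_ids)  # [2, 5]
```

Changing DESC to ASC in the ORDER BY keeps the oldest row of each group instead.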


Method two: deduplicate on a single field:

SQL code

-- primary_key_id and dedup_field are placeholders for the real column names
DELETE FROM table WHERE primary_key_id NOT IN (
SELECT MAX(primary_key_id) FROM table GROUP BY dedup_field HAVING COUNT(dedup_field) >= 1
)
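
Before deleting anything, it can help to list which values are actually duplicated. The following is a minimal sketch, assuming an in-memory SQLite database; the table contacts and its email column are hypothetical. It groups on the single field and reports only groups with more than one row, together with the ID that the delete above would keep.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE contacts (id INTEGER PRIMARY KEY, email TEXT);
    INSERT INTO contacts (email) VALUES
        ('a@example.com'), ('b@example.com'), ('a@example.com'),
        ('c@example.com'), ('a@example.com'), ('c@example.com');
    """
)

# One result row per duplicated value: the value, how many copies exist,
# and the largest id (the row a MAX-based delete would keep).
duplicates = conn.execute(
    """
    SELECT email, COUNT(*) AS copies, MAX(id) AS kept_id
    FROM contacts
    GROUP BY email
    HAVING COUNT(*) > 1
    ORDER BY email
    """
).fetchall()
print(duplicates)  # [('a@example.com', 3, 5), ('c@example.com', 2, 6)]
```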


Note: to make the two methods above more efficient, you can use a temporary table. First extract the IDs used by the NOT IN subquery into a temporary table #tmp, and then use NOT EXISTS instead of NOT IN. To avoid deleting too many rows at once, you can use TOP to limit the number of rows deleted per statement.

SQL code

-- Delete at most 2 rows whose primary key is not in #tmp
DELETE TOP (2) FROM table
WHERE NOT EXISTS (SELECT primary_key_id
FROM #tmp WHERE #tmp.primary_key_id = table.primary_key_id)
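
A statement like this is normally repeated until nothing is left to delete. The loop below is a sketch of that pattern, assuming an in-memory SQLite database. SQLite has no DELETE TOP, so a LIMIT inside an id subquery is used as a stand-in; the names contacts, keep_ids, and BATCH_SIZE are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE contacts (id INTEGER PRIMARY KEY, email TEXT);
    INSERT INTO contacts (email) VALUES
        ('a@example.com'), ('a@example.com'), ('a@example.com'),
        ('b@example.com'), ('b@example.com'), ('c@example.com');

    -- Extract the ids to keep once, into a temporary table.
    CREATE TEMP TABLE keep_ids AS
        SELECT MAX(id) AS id FROM contacts GROUP BY email;
    """
)

BATCH_SIZE = 2  # rows deleted per statement, like TOP (2) above
batches = 0
while True:
    cur = conn.execute(
        """
        DELETE FROM contacts
        WHERE id IN (
            SELECT c.id FROM contacts AS c
            WHERE NOT EXISTS (SELECT 1 FROM keep_ids AS k WHERE k.id = c.id)
            LIMIT ?
        )
        """,
        (BATCH_SIZE,),
    )
    conn.commit()  # commit each batch so locks are held only briefly
    if cur.rowcount == 0:
        break  # nothing left to delete
    batches += 1

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM contacts ORDER BY id")]
print(batches, remaining_ids)  # 2 [3, 5, 6]
```

Committing after every batch keeps each transaction small, which is the point of limiting the number of rows per delete.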


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
