Remove duplicate records with SQL statements

Source: Internet
Author: User

With a huge amount of data (more than a million rows), some rows are complete duplicates and others duplicate only certain fields. How can the duplicates be removed efficiently?

To delete rows whose mobile phone (mobilephone), office phone (officephone), and email (email) are all identical, I had been using this statement for deduplication:

SQL code
  delete from table where id not in
  (select max(id) from table group by mobilephone, officephone, email)
  -- or
  delete from table where id not in
  (select min(id) from table group by mobilephone, officephone, email)
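
As a runnable illustration of the DELETE ... NOT IN statement above, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for the real database. The table name customers, the helper dedupe_in_place, and the sample rows are assumptions made for the example; only the three grouping columns come from the text.

```python
import sqlite3

def dedupe_in_place(conn, keep="max"):
    """Delete duplicates, keeping the row with the max (or min) id per group."""
    agg = "MAX" if keep == "max" else "MIN"  # only these two fixed values are used
    conn.execute(
        f"""
        DELETE FROM customers
        WHERE id NOT IN (
            SELECT {agg}(id) FROM customers
            GROUP BY mobilephone, officephone, email
        )
        """
    )
    conn.commit()

# Hypothetical sample data: rows 1, 2 and 4 share the same three fields.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, mobilephone TEXT, officephone TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO customers (mobilephone, officephone, email) VALUES (?, ?, ?)",
    [
        ("111", "222", "a@example.com"),
        ("111", "222", "a@example.com"),
        ("333", "444", "b@example.com"),
        ("111", "222", "a@example.com"),
    ],
)

dedupe_in_place(conn, keep="min")
remaining_ids = [row[0] for row in conn.execute("SELECT id FROM customers ORDER BY id")]
print(remaining_ids)
```

With keep="min" the earliest row of each group survives; with keep="max" the latest one does.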

The second statement is slightly faster. For up to about 1 million rows the efficiency is acceptable: with about 1/5 of the rows duplicated, it takes from a few minutes to a few tens of minutes. Once the data exceeds 3 million rows, however, performance drops sharply. With a higher share of duplicates it often runs for tens of hours, and sometimes it locks the table and still has not finished after running all night. With no other choice, I had to look for a new workable method, and today I finally made some progress:

SQL code
    -- Query the IDs of the unique rows and store them in the temporary table tmp
    select min(id) as mid into tmp from table group by mobilephone, officephone, email
    -- Query the deduplicated rows and insert them into the table finally
    insert into finally select (fields other than ID) from Customers_1 where id in (select mid from tmp)

Efficiency comparison: deduplicating 5 million rows (1/2 of them duplicates) with the DELETE method takes about 4 hours, which is a long time.

Deduplicating the same 5 million rows (1/2 of them duplicates) by inserting through a temporary table takes less than 10 minutes.

The DELETE method is probably slow because it searches for rows and deletes them at the same time. With a temporary table, you first select the IDs of the non-duplicate rows into the temporary table, then use those IDs to look up the matching rows in the source table and insert them into a new table, and finally delete the original table. That is enough to deduplicate the data quickly.
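
The three steps just described can be sketched end to end as follows. This is a minimal example under stated assumptions: it runs on SQLite through Python's sqlite3 module, where CREATE TEMP TABLE ... AS SELECT plays the role of SELECT ... INTO; the tables customers and customers_dedup, the name column, and the sample rows are hypothetical; and the final rename is an extra step that the text does not mention.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        name TEXT, mobilephone TEXT, officephone TEXT, email TEXT
    );
    INSERT INTO customers (name, mobilephone, officephone, email) VALUES
        ('Ann', '111', '222', 'a@example.com'),
        ('Ann', '111', '222', 'a@example.com'),
        ('Bob', '333', '444', 'b@example.com'),
        ('Ann', '111', '222', 'a@example.com');

    -- Step 1: keep one id per distinct (mobilephone, officephone, email).
    CREATE TEMP TABLE tmp AS
    SELECT MIN(id) AS mid
    FROM customers
    GROUP BY mobilephone, officephone, email;

    -- Step 2: copy only the selected rows (all fields except id) into a new table.
    CREATE TABLE customers_dedup (
        id INTEGER PRIMARY KEY,
        name TEXT, mobilephone TEXT, officephone TEXT, email TEXT
    );
    INSERT INTO customers_dedup (name, mobilephone, officephone, email)
    SELECT name, mobilephone, officephone, email
    FROM customers
    WHERE id IN (SELECT mid FROM tmp)
    ORDER BY id;

    -- Step 3: drop the original table; the rename is an added convenience.
    DROP TABLE customers;
    ALTER TABLE customers_dedup RENAME TO customers;
    """
)

rows = conn.execute("SELECT name, email FROM customers ORDER BY id").fetchall()
print(rows)
```

Because the surviving rows are written once with plain inserts, no row-by-row delete against the large table is needed.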

Removing and finding duplicate records with SQL statements

Finding the rows that duplicate certain fields in a table, and deleting them according to the order in which they were inserted, relies mainly on ORDER BY and ROW_NUMBER.

Method one: handle duplicates based on multiple conditions:

SQL code
  delete tmp from (
    select row_num = row_number() over (partition by field, field order by time desc)
    from table where time > getdate() - 1
  ) tmp
  where row_num > 1
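
To make the ROW_NUMBER technique above concrete, here is a small sketch that runs on SQLite (version 3.25 or later, which added window functions) through Python's sqlite3 module. SQLite cannot delete through a derived table the way the statement above does, so this version selects the ids with row_num > 1 and deletes those instead. The events table, its column names, and the sample rows are assumptions, and the time filter from the original statement is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (
        id INTEGER PRIMARY KEY, field_a TEXT, field_b TEXT, created_at TEXT
    );
    INSERT INTO events (field_a, field_b, created_at) VALUES
        ('x', 'y', '2024-01-01 10:00:00'),
        ('x', 'y', '2024-01-01 12:00:00'),
        ('x', 'z', '2024-01-01 11:00:00'),
        ('x', 'y', '2024-01-01 11:00:00');
    """
)

# Number the rows inside each (field_a, field_b) group, newest first,
# then delete everything except row number 1 of each group.
conn.execute(
    """
    DELETE FROM events
    WHERE id IN (
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY field_a, field_b
                       ORDER BY created_at DESC
                   ) AS row_num
            FROM events
        )
        WHERE row_num > 1
    )
    """
)
conn.commit()

kept_ids = [row[0] for row in conn.execute("SELECT id FROM events ORDER BY id")]
print(kept_ids)
```

Changing the ORDER BY direction inside the window decides whether the newest or the oldest row of each group is kept.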

Method two: deduplicate based on a single condition:

SQL code
    delete from table where primary_key_id not in (
      select max(primary_key_id) from table group by dedup_field having count(dedup_field) >= 1
    )

Note: to improve efficiency, both methods above can use a temporary table. First extract the result of the NOT IN subquery into the temporary table #tmp, then use NOT EXISTS instead of NOT IN. To avoid deleting too many rows at once, use TOP to limit the number of rows removed per statement:

SQL code
    delete top (2) from table
    where not exists (select primary_key_id
      from #tmp where #tmp.primary_key_id = table.primary_key_id)
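
The batched NOT EXISTS delete can be sketched as a loop that repeats until nothing is left to remove. This is only an illustration under assumptions: it uses SQLite through Python's sqlite3 module, where a LIMIT inside a subquery stands in for DELETE TOP (n); the temporary table keep_ids stands in for #tmp; and the customers table, the helper delete_in_batches, and the sample rows are hypothetical.

```python
import sqlite3

def delete_in_batches(conn, batch_size):
    """Delete rows whose id is absent from keep_ids, batch_size rows at a time.

    Returns the number of batches that actually deleted something.
    """
    batches = 0
    while True:
        cur = conn.execute(
            """
            DELETE FROM customers
            WHERE id IN (
                SELECT c.id FROM customers AS c
                WHERE NOT EXISTS (SELECT 1 FROM keep_ids AS k WHERE k.id = c.id)
                LIMIT ?
            )
            """,
            (batch_size,),
        )
        conn.commit()  # committing per batch keeps each transaction short
        if cur.rowcount == 0:
            return batches
        batches += 1

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO customers (email) VALUES (?)",
    [("a@example.com",)] * 4 + [("b@example.com",)] * 2,
)
# The ids to keep: the highest id for each email.
conn.execute(
    "CREATE TEMP TABLE keep_ids AS "
    "SELECT MAX(id) AS id FROM customers GROUP BY email"
)

batches = delete_in_batches(conn, batch_size=2)
remaining_ids = [row[0] for row in conn.execute("SELECT id FROM customers ORDER BY id")]
print(batches, remaining_ids)
```

Smaller batches hold locks for less time per statement, at the cost of more round trips.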
