How do you efficiently deduplicate a huge table (over a million rows) in which some rows are complete duplicates and others match only on certain fields?
To delete rows whose mobile phone (mobilephone), office phone (officephone), and email (email) are all identical, I had been using these statements for deduplication:
SQL code
delete from table where id not in
    (select max(id) from table group by mobilephone, officephone, email)
-- or
delete from table where id not in
    (select min(id) from table group by mobilephone, officephone, email)
One of the two runs slightly faster than the other. On about 1 million rows the efficiency is acceptable: with roughly 1/5 of the rows duplicated, it finishes in a few minutes to a few tens of minutes. But once the table grows past 3 million rows, efficiency drops sharply; with a higher duplicate ratio it often takes tens of hours, and sometimes it locks the table and still has not finished after running all night. I had no choice but to look for a new, workable method, and today I finally made some progress:
SQL code
-- Select the id of each unique row into the temp table tmp
select min(id) as mid into tmp from table group by mobilephone, officephone, email
-- Insert the deduplicated rows into the final table
insert into finally select (fields other than id) from Customers_1 where id in (select mid from tmp)
Efficiency comparison: deduplicating 5 million rows (about 1/2 duplicates) with the delete method takes about 4 hours. Four hours — a long time.
Deduplicating the same 5 million rows (1/2 duplicates) with the temp-table insert method takes less than 10 minutes.
The delete method is presumably slow because it has to search for duplicates while deleting them. With the temp-table approach, you first select the ids of the non-duplicate rows into a temporary table, then use those ids to look up the corresponding rows and insert them into a new table, and finally drop the original table.
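The temp-table variant can be sketched the same way in SQLite (all names here — `customers_1`, `finally_tbl`, the columns — are illustrative, not from any real schema; SQLite's `CREATE TABLE AS` replaces SQL Server's `SELECT ... INTO`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers_1 (
        id INTEGER PRIMARY KEY,
        mobilephone TEXT, officephone TEXT, email TEXT
    );
    INSERT INTO customers_1 (mobilephone, officephone, email) VALUES
        ('111', '222', 'a@x.com'),
        ('111', '222', 'a@x.com'),
        ('333', '444', 'b@x.com'),
        ('333', '444', 'b@x.com');

    -- Step 1: the id to keep for each duplicate group goes into a temp table.
    CREATE TEMP TABLE tmp AS
        SELECT MIN(id) AS mid FROM customers_1
        GROUP BY mobilephone, officephone, email;

    -- Step 2: copy only the kept rows into the final table.
    CREATE TABLE finally_tbl AS
        SELECT mobilephone, officephone, email FROM customers_1
        WHERE id IN (SELECT mid FROM tmp);
""")
kept = conn.execute("SELECT COUNT(*) FROM finally_tbl").fetchone()[0]
print(kept)  # 2 deduplicated rows
```

After verifying `finally_tbl`, the original table would be dropped and the new one renamed, as the text describes.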
Removing and finding duplicate records with SQL
Finding and deleting rows that are duplicated on certain fields, while keeping the one you want in insertion order, relies on ORDER BY and ROW_NUMBER().
Method one: deduplicate on multiple columns:
SQL code
delete tmp from (
    select row_num = row_number() over (partition by field1, field2 order by time desc)
    from table where time > getdate() - 1
) tmp
where row_num > 1
Method two: deduplicate on a single column:
SQL code
delete from table where [primary key id] not in (
    select max([primary key id]) from table
    group by [dedup column] having count([dedup column]) >= 1
)
Note: to improve the efficiency of the two methods above, you can again use a temporary table: first extract the ids from the NOT IN subquery into a temp table #tmp,
then rewrite the condition with NOT EXISTS. To avoid deleting too many rows at once, use TOP to throttle the batch size:
SQL code
delete top (2) from table
where not exists (
    select [primary key id] from #tmp
    where #tmp.[primary key id] = table.[primary key id]
)
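`DELETE TOP (n)` is SQL Server syntax; a sketch of the same batched idea in SQLite picks the victim ids with `LIMIT` inside a subquery and loops until nothing is left to delete (names and the tiny batch size of 2 are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY, email TEXT
    );
    INSERT INTO customers (email) VALUES
        ('a@x.com'), ('a@x.com'), ('a@x.com'), ('b@x.com');
    -- #tmp equivalent: the ids to keep (smallest id per email).
    CREATE TEMP TABLE tmp AS
        SELECT MIN(id) AS mid FROM customers GROUP BY email;
""")
batches = 0
while True:
    # Each pass deletes at most 2 rows that are NOT among the kept ids.
    cur = conn.execute("""
        DELETE FROM customers WHERE id IN (
            SELECT c.id FROM customers c
            WHERE NOT EXISTS (SELECT 1 FROM tmp WHERE tmp.mid = c.id)
            LIMIT 2
        )
    """)
    conn.commit()
    if cur.rowcount == 0:
        break
    batches += 1
remaining = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(remaining, batches)  # 2 rows kept, duplicates removed in 1 batch
```

Committing between batches keeps each transaction small, which is the point of throttling with TOP in the first place.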