MySQL does not allow a statement to modify a table and read from that same table in a subquery, so the subquery and the DELETE cannot both target the same table. The workaround used below is to go through a temporary table.
1. Single-field duplicates
Generate the temporary tables, where uid is the field to be deduplicated:
CREATE TABLE tmp_uid AS (SELECT uid FROM user_info GROUP BY uid HAVING COUNT(uid) > 1);
CREATE TABLE tmp_id AS (SELECT MIN(id) AS id FROM user_info GROUP BY uid HAVING COUNT(uid) > 1);
When the data volume is large, be sure to create indexes on the temporary tables (here uid in tmp_uid and id in tmp_id):
ALTER TABLE table_name ADD INDEX index_name (field_name);
Delete the redundant duplicate rows, keeping the row with the smallest id in each group of duplicates:
DELETE FROM user_info WHERE id NOT IN (SELECT id FROM tmp_id) AND uid IN (SELECT uid FROM tmp_uid);
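The single-field steps above can be tried end to end on a toy table. The following is only a minimal sketch: it assumes Python's built-in sqlite3 module as a stand-in for a MySQL server (so it runs without any setup), and the sample rows and index names are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_info (id INTEGER PRIMARY KEY, uid INTEGER)")

# Made-up sample rows: uid 10 appears three times, uid 20 twice, uid 30 once.
conn.executemany(
    "INSERT INTO user_info (id, uid) VALUES (?, ?)",
    [(1, 10), (2, 10), (3, 20), (4, 10), (5, 30), (6, 20)],
)

# Temporary tables: the duplicated uid values, and the smallest id of each group.
conn.execute(
    "CREATE TABLE tmp_uid AS "
    "SELECT uid FROM user_info GROUP BY uid HAVING COUNT(uid) > 1"
)
conn.execute(
    "CREATE TABLE tmp_id AS "
    "SELECT MIN(id) AS id FROM user_info GROUP BY uid HAVING COUNT(uid) > 1"
)

# Index the lookup columns (CREATE INDEX spelling; hypothetical index names).
conn.execute("CREATE INDEX idx_tmp_uid ON tmp_uid (uid)")
conn.execute("CREATE INDEX idx_tmp_id ON tmp_id (id)")

# Delete every duplicated row except the one with the smallest id in its group.
conn.execute(
    "DELETE FROM user_info "
    "WHERE id NOT IN (SELECT id FROM tmp_id) "
    "AND uid IN (SELECT uid FROM tmp_uid)"
)

remaining = conn.execute("SELECT id, uid FROM user_info ORDER BY id").fetchall()
print(remaining)
```

With these sample rows, only ids 1, 3 and 5 remain: one row per uid, and for the duplicated uids it is the row with the smallest id.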
2. Multi-field duplicates
In the case above, the duplicated uid values indirectly caused duplicate records in the relationship table as well, so deduplication continues there.
2.1 General method
Essentially the same as above.
Generate the temporary tables:
CREATE TABLE tmp_relation AS (SELECT source, target FROM relationship GROUP BY source, target HAVING COUNT(*) > 1);
CREATE TABLE tmp_relationship_id AS (SELECT MIN(id) AS id FROM relationship GROUP BY source, target HAVING COUNT(*) > 1);
Create an index:
ALTER TABLE table_name ADD INDEX index_name (field_name);
Delete:
DELETE FROM relationship WHERE id NOT IN (SELECT id FROM tmp_relationship_id) AND (source, target) IN (SELECT source, target FROM tmp_relation);
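As a worked illustration of the general multi-field method, here is a small sketch under the same assumptions as before: Python's built-in sqlite3 module stands in for MySQL, and the rows and index name are invented. The pair match is written with EXISTS rather than a row-value IN, purely so the sketch also runs on older SQLite builds; the intent is the same.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE relationship "
    "(id INTEGER PRIMARY KEY, source INTEGER, target INTEGER)"
)

# Made-up rows: pair (1, 2) appears three times, (3, 1) twice, (2, 3) once.
conn.executemany(
    "INSERT INTO relationship (id, source, target) VALUES (?, ?, ?)",
    [(1, 1, 2), (2, 1, 2), (3, 2, 3), (4, 1, 2), (5, 3, 1), (6, 3, 1)],
)

# Temporary tables: the duplicated pairs, and the smallest id of each pair.
conn.execute(
    "CREATE TABLE tmp_relation AS "
    "SELECT source, target FROM relationship "
    "GROUP BY source, target HAVING COUNT(*) > 1"
)
conn.execute(
    "CREATE TABLE tmp_relationship_id AS "
    "SELECT MIN(id) AS id FROM relationship "
    "GROUP BY source, target HAVING COUNT(*) > 1"
)
conn.execute("CREATE INDEX idx_tmp_relationship_id ON tmp_relationship_id (id)")

# Delete rows whose pair is duplicated, except the smallest id of each pair.
# Both subqueries read only the temporary tables, never relationship itself.
conn.execute(
    "DELETE FROM relationship "
    "WHERE id NOT IN (SELECT id FROM tmp_relationship_id) "
    "AND EXISTS (SELECT 1 FROM tmp_relation AS t "
    "            WHERE t.source = relationship.source "
    "              AND t.target = relationship.target)"
)

remaining = conn.execute(
    "SELECT id, source, target FROM relationship ORDER BY id"
).fetchall()
print(remaining)
```

The row with id 3 is never part of a duplicated pair, so it is untouched; of the duplicated pairs, only ids 1 and 5 survive.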
2.2 Quick method
In practice, the method above turned out to be very inefficient for multi-field deduplication on large data volumes, because I found no way to build an index that helped across multiple fields; it was unbearably slow. In the end I could not stand waiting so long with no response, so I decided to take a different path.
The number of times the same record is repeated is expected to be low, typically 2 or 3, and fairly concentrated. So one can simply delete the row with the largest id in each group of duplicates and repeat until no duplicates remain; the row that survives is then the one that had the smallest id in its group.
The approximate process is as follows:
(1) Select the record with the largest id in each group of duplicates:
CREATE TABLE tmp_relation_id2 AS (SELECT MAX(id) AS id FROM relationship GROUP BY source, target HAVING COUNT(*) > 1);
(2) Create an index (only needed the first time):
ALTER TABLE tmp_relation_id2 ADD INDEX index_name (id);
(3) Delete the record with the largest id in each group of duplicates:
DELETE FROM relationship WHERE id IN (SELECT id FROM tmp_relation_id2);
(4) Drop the temporary table:
DROP TABLE tmp_relation_id2;
Repeat steps (1), (2), (3) and (4) until the temporary table created in step (1) contains no records. (This approach is more efficient when each record is repeated only a few times.)
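Because steps (1) to (4) have to be repeated by hand, a small driver loop can help. The following is a hedged sketch rather than the author's script: the function name dedupe_by_max_id is hypothetical, Python's built-in sqlite3 module stands in for MySQL, and the sample rows are made up. In this sketch the index is simply re-created each round, since the temporary table is dropped and rebuilt.

```python
import sqlite3


def dedupe_by_max_id(conn):
    """Repeat steps (1)-(4) until no duplicated (source, target) pair remains.

    Returns the number of rounds in which rows were deleted.
    """
    rounds = 0
    while True:
        # (1) Largest id of every duplicated (source, target) group.
        conn.execute(
            "CREATE TABLE tmp_relation_id2 AS "
            "SELECT MAX(id) AS id FROM relationship "
            "GROUP BY source, target HAVING COUNT(*) > 1"
        )
        pending = conn.execute(
            "SELECT COUNT(*) FROM tmp_relation_id2"
        ).fetchall()[0][0]
        if pending == 0:
            # Stop condition: the temporary table is empty, nothing is duplicated.
            conn.execute("DROP TABLE tmp_relation_id2")
            return rounds
        # (2) Index the id column of the temporary table.
        conn.execute("CREATE INDEX idx_tmp_relation_id2 ON tmp_relation_id2 (id)")
        # (3) Delete the largest-id row of each duplicated group.
        conn.execute(
            "DELETE FROM relationship "
            "WHERE id IN (SELECT id FROM tmp_relation_id2)"
        )
        # (4) Drop the temporary table (its index goes with it).
        conn.execute("DROP TABLE tmp_relation_id2")
        rounds += 1


# In-memory SQLite database standing in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE relationship "
    "(id INTEGER PRIMARY KEY, source INTEGER, target INTEGER)"
)
# Made-up rows: pair (1, 2) appears three times, (3, 1) twice, (2, 3) once.
conn.executemany(
    "INSERT INTO relationship (id, source, target) VALUES (?, ?, ?)",
    [(1, 1, 2), (2, 1, 2), (3, 2, 3), (4, 1, 2), (5, 3, 1), (6, 3, 1)],
)

rounds = dedupe_by_max_id(conn)
remaining = conn.execute(
    "SELECT id, source, target FROM relationship ORDER BY id"
).fetchall()
print(rounds, remaining)
```

Each round removes exactly one row from every group that is still duplicated, so the number of rounds equals the highest repetition count minus one. Here the most repeated pair appears three times, so two rounds are needed, and the surviving rows are those with the smallest id in each group.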
This article was reposted from http://www.cnblogs.com/rainduck/archive/2013/05/15/3079868.html
MySQL data deduplication and related optimizations (repost)