Data Deduplication and Optimization in MySQL
After changing the primary key of user_info from uid to an auto-increment id, I forgot to mark the original uid column as unique; as a result, duplicate uid records were inserted. The later duplicates now need to be cleared out.
The basic approach follows the reference queries at the end of this article. However, MySQL does not allow a statement to modify a table while selecting from that same table in a subquery, so a temporary table is needed to stage the intermediate data.
Preface: when the data volume is large, be sure to create indexes on the key fields!!! Otherwise it will be unbearably slow.
1. Duplicates on a single field
Create temporary tables, where uid is the field to deduplicate:
Create table tmp_uid as (select uid from user_info group by uid having count(uid) > 1);
Create table tmp_id as (select min(id) as id from user_info group by uid having count(uid) > 1);
When the data volume is large, be sure to index these temporary tables:
Create index index_uid on tmp_uid (uid);
Create index index_id on tmp_id (id);
Delete the redundant duplicates, keeping the row with the smallest id in each group:
Delete from user_info where id not in (select id from tmp_id) and uid in (select uid from tmp_uid);
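As a sanity check, the single-field procedure above can be sketched end to end. This is a minimal sketch using SQLite through Python's sqlite3 module as a stand-in for MySQL (SQLite does not share MySQL's same-table subquery restriction, but the staging-table pattern is identical); the table contents are made-up sample data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical user_info table: auto-increment id, uid mistakenly left non-unique.
cur.execute("CREATE TABLE user_info (id INTEGER PRIMARY KEY AUTOINCREMENT, uid INTEGER)")
cur.executemany("INSERT INTO user_info (uid) VALUES (?)",
                [(100,), (101,), (100,), (102,), (101,), (100,)])

# Stage the duplicated uids and the smallest id of each duplicate group.
cur.execute("CREATE TABLE tmp_uid AS SELECT uid FROM user_info "
            "GROUP BY uid HAVING COUNT(uid) > 1")
cur.execute("CREATE TABLE tmp_id AS SELECT MIN(id) AS id FROM user_info "
            "GROUP BY uid HAVING COUNT(uid) > 1")

# Delete every row of a duplicate group except the one with the smallest id.
cur.execute("DELETE FROM user_info WHERE id NOT IN (SELECT id FROM tmp_id) "
            "AND uid IN (SELECT uid FROM tmp_uid)")

print(cur.execute("SELECT id, uid FROM user_info ORDER BY id").fetchall())
# -> [(1, 100), (2, 101), (4, 102)]
```

Note that the row with uid 102 is untouched: its uid never appears in tmp_uid, so the second condition excludes it from the delete.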
2. Duplicates on multiple fields
The duplicate uid values indirectly produced duplicate records in the relationship table as well, so deduplication continues there. The general procedure is introduced first, followed by a more efficient method based on the characteristics of my own data.
2.1 General method
The steps are essentially the same as above.
Create temporary tables:
Create table tmp_relation as (select source, target from relationship group by source, target having count(*) > 1);
Create table tmp_relation_id as (select min(id) as id from relationship group by source, target having count(*) > 1);
Create an index
Create index index_id on tmp_relation_id (id);
Delete
Delete from relationship where id not in (select id from tmp_relation_id) and (source, target) in (select source, target from tmp_relation);
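The multi-field variant can be sketched the same way, again using SQLite via Python's sqlite3 module with made-up sample data (the row-value comparison `(source, target) IN (...)` needs SQLite 3.15 or newer):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical relationship table with duplicated (source, target) pairs.
cur.execute("CREATE TABLE relationship (id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "source INTEGER, target INTEGER)")
cur.executemany("INSERT INTO relationship (source, target) VALUES (?, ?)",
                [(1, 2), (1, 3), (1, 2), (2, 3), (1, 2), (2, 3)])

# Stage the duplicated pairs and the smallest id of each duplicate group.
cur.execute("CREATE TABLE tmp_relation AS SELECT source, target FROM relationship "
            "GROUP BY source, target HAVING COUNT(*) > 1")
cur.execute("CREATE TABLE tmp_relation_id AS SELECT MIN(id) AS id FROM relationship "
            "GROUP BY source, target HAVING COUNT(*) > 1")

# Keep only the smallest id within each duplicated (source, target) group.
cur.execute("DELETE FROM relationship "
            "WHERE id NOT IN (SELECT id FROM tmp_relation_id) "
            "AND (source, target) IN (SELECT source, target FROM tmp_relation)")

print(cur.execute("SELECT id, source, target FROM relationship ORDER BY id").fetchall())
# -> [(1, 1, 2), (2, 1, 3), (4, 2, 3)]
```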
2.2 Practice
In practice, this general method turned out to be unacceptably slow on large data volumes, because the multi-field comparison could not make effective use of an index. After waiting far too long, I decided to find another way.
The observation is that each duplicated record repeats only a few times, generally 2 or 3, and the repetition counts are concentrated. So we can repeatedly delete the row with the largest id in each duplicate group until no duplicates remain; the surviving row is then naturally the one with the smallest id.
The general process is as follows:
1) select the record with the largest id in each repeated item
Create table tmp_relation_id2 as (select max(id) as id from relationship group by source, target having count(*) > 1);
2) create an index on it (recreated each round, since dropping the table in step 4 also drops its index)
Create index index_id2 on tmp_relation_id2 (id);
3) delete the record with the largest id in the repeated items
Delete from relationship where id in (select id from tmp_relation_id2);
4) drop the temporary table
Drop table tmp_relation_id2;
Repeat steps 1) through 4) until the newly created temporary table is empty. For data whose duplicates repeat only a few times, this is considerably more efficient.
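The repeat-until-empty loop above can be sketched as follows, again using SQLite through Python's sqlite3 module with made-up sample data where each duplicate appears at most three times:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical relationship table; each duplicate appears only 2-3 times.
cur.execute("CREATE TABLE relationship (id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "source INTEGER, target INTEGER)")
cur.executemany("INSERT INTO relationship (source, target) VALUES (?, ?)",
                [(1, 2), (1, 2), (1, 2), (2, 3), (2, 3), (3, 4)])

while True:
    # 1) collect the largest id of every duplicated (source, target) group
    cur.execute("CREATE TABLE tmp_relation_id2 AS SELECT MAX(id) AS id "
                "FROM relationship GROUP BY source, target HAVING COUNT(*) > 1")
    # 2) index the staging table (cheap here; it matters at scale)
    cur.execute("CREATE INDEX index_id2 ON tmp_relation_id2 (id)")
    n_dups = cur.execute("SELECT COUNT(*) FROM tmp_relation_id2").fetchone()[0]
    if n_dups == 0:
        # staging table came back empty: no duplicates left, we are done
        cur.execute("DROP TABLE tmp_relation_id2")
        break
    # 3) delete the largest-id row of each duplicate group
    cur.execute("DELETE FROM relationship WHERE id IN (SELECT id FROM tmp_relation_id2)")
    # 4) drop the staging table and repeat
    cur.execute("DROP TABLE tmp_relation_id2")

print(cur.execute("SELECT id, source, target FROM relationship ORDER BY id").fetchall())
# -> [(1, 1, 2), (4, 2, 3), (6, 3, 4)]
```

With a group repeated three times, the loop needs two passes before the third pass finds no duplicates and stops; only the smallest id of each group survives.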
How to query and delete duplicate records
1. Find redundant duplicate records in the table, where duplicates are judged on a single field (peopleId):
select * from people where peopleId in (select peopleId from people group by peopleId having count(peopleId) > 1);
2. Delete the redundant duplicate records (single field peopleId), keeping only the record with the smallest rowid:
delete from people where peopleId in (select peopleId from people group by peopleId having count(peopleId) > 1) and rowid not in (select min(rowid) from people group by peopleId having count(peopleId) > 1);
3. Find redundant duplicate records on multiple fields:
select * from vitae a where (a.peopleId, a.seq) in (select peopleId, seq from vitae group by peopleId, seq having count(*) > 1);
4. Delete redundant duplicate records on multiple fields, keeping only the records with the smallest rowid:
delete from vitae a where (a.peopleId, a.seq) in (select peopleId, seq from vitae group by peopleId, seq having count(*) > 1) and rowid not in (select min(rowid) from vitae group by peopleId, seq having count(*) > 1);
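These rowid-based queries assume a database that exposes a physical row identifier (Oracle, for example; MySQL has no rowid, which is why the temporary-table approach is used in the main text). SQLite also provides an implicit rowid, so query 4 can be demonstrated directly through Python's sqlite3 module with made-up sample data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical vitae table; SQLite gives every ordinary table an implicit rowid.
cur.execute("CREATE TABLE vitae (peopleId INTEGER, seq INTEGER)")
cur.executemany("INSERT INTO vitae (peopleId, seq) VALUES (?, ?)",
                [(1, 1), (1, 1), (2, 1), (2, 1), (2, 1), (3, 2)])

# Query 4 above: delete multi-field duplicates, keeping the smallest rowid.
# (Row-value comparisons need SQLite 3.15+; SQLite allows the same-table subquery.)
cur.execute("DELETE FROM vitae WHERE (peopleId, seq) IN "
            "(SELECT peopleId, seq FROM vitae GROUP BY peopleId, seq "
            "HAVING COUNT(*) > 1) "
            "AND rowid NOT IN (SELECT MIN(rowid) FROM vitae "
            "GROUP BY peopleId, seq HAVING COUNT(*) > 1)")

print(cur.execute("SELECT rowid, peopleId, seq FROM vitae ORDER BY rowid").fetchall())
# -> [(1, 1, 1), (3, 2, 1), (6, 3, 2)]
```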