Data Deduplication and Optimization in MySQL

Source: Internet
Author: User

After changing the primary key of table user_info from uid to an auto-increment id, I forgot to add a UNIQUE constraint to the original key uid. As a result, duplicate uid records were inserted, and the later-inserted records need to be cleared; the appendix at the end gives the basic queries. However, MySQL does not allow a statement to modify a table while a subquery in the same statement reads from that same table, so a temporary table is needed to stage the data.

Preface: when the data volume is large, you must create indexes on the key fields! Otherwise it will be slow, slow, and unbearably slow.

1. A single duplicated field

Create temporary tables, where uid is the field to deduplicate:

    create table tmp_uid as (select uid from user_info group by uid having count(uid) > 1);
    create table tmp_id as (select min(id) as id from user_info group by uid having count(uid) > 1);

When the data volume is large, index the key columns:

    create index index_uid on tmp_uid (uid);
    create index index_id on tmp_id (id);

Delete the redundant duplicate records, keeping the row with the smallest id for each uid:

    delete from user_info where id not in (select id from tmp_id) and uid in (select uid from tmp_uid);

2. Multiple duplicated fields

The duplicate uid records indirectly produced duplicate records in the relationship table, so deduplication continues there. The normal procedure is introduced first, followed by a more effective method based on the characteristics of my own data.
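As an illustration, the single-field procedure above can be sketched in Python against an in-memory SQLite database (an assumption made for portability: SQLite does not share MySQL's restriction on subquerying the table being modified, but the temporary-table workflow is reproduced as described, and the sample data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Hypothetical sample data: uid 100 and 102 are duplicated.
cur.execute("CREATE TABLE user_info (id INTEGER PRIMARY KEY, uid INTEGER)")
cur.executemany("INSERT INTO user_info (uid) VALUES (?)",
                [(100,), (100,), (101,), (102,), (102,), (102,)])

# Step 1: stage the duplicated uid values and the smallest id per uid.
cur.execute("CREATE TABLE tmp_uid AS SELECT uid FROM user_info "
            "GROUP BY uid HAVING COUNT(uid) > 1")
cur.execute("CREATE TABLE tmp_id AS SELECT MIN(id) AS id FROM user_info "
            "GROUP BY uid HAVING COUNT(uid) > 1")

# Step 2: index the key fields (essential at scale, per the text).
cur.execute("CREATE INDEX index_uid ON tmp_uid (uid)")
cur.execute("CREATE INDEX index_id ON tmp_id (id)")

# Step 3: delete every duplicate row except the earliest one per uid.
cur.execute("DELETE FROM user_info "
            "WHERE id NOT IN (SELECT id FROM tmp_id) "
            "AND uid IN (SELECT uid FROM tmp_uid)")
conn.commit()

print(cur.execute("SELECT id, uid FROM user_info ORDER BY id").fetchall())
# → [(1, 100), (3, 101), (4, 102)]: each uid kept once, smallest id wins
```

In MySQL the same statements apply, but the DELETE must go through the two temporary tables exactly as shown, since the subqueries may not reference user_info directly.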
2.1 Basic method

    create table tmp_relation as (select source, target from relationship group by source, target having count(*) > 1);
    create table tmp_relationship_id as (select min(id) as id from relationship group by source, target having count(*) > 1);
    create index index_id on tmp_relationship_id (id);
    delete from relationship where id not in (select id from tmp_relationship_id) and (source, target) in (select source, target from tmp_relation);

2.2 Practical method

The basic method above has to match the duplicated fields for every candidate row, and since there is no way here to index the combination of fields, it is extremely slow on a large data volume, which is unacceptable. After waiting a long time, I decided to find another way. The same record is usually repeated only a few times, typically 2 or 3, and the repetition counts are concentrated. So we can repeatedly delete the row with the largest id in each duplicate group until no duplicates remain; the surviving row is then naturally the one with the smallest id. The general process:

1) Select the largest id of each duplicate group into a temporary table:
       create table tmp_relation_id2 as (select max(id) as id from relationship group by source, target having count(*) > 1);
2) Create an index (only required the first time):
       create index index_id2 on tmp_relation_id2 (id);
3) Delete those rows:
       delete from relationship where id in (select id from tmp_relation_id2);
4) Drop the temporary table:
       drop table tmp_relation_id2;

Repeat steps 1) through 4) until the table created in step 1) contains no records. For data with few repeats per record, this is more efficient.

Appendix: querying and deleting duplicate records

1. Find the redundant duplicate records in a table, where duplicates are judged by a single field (peopleId):

       select * from people where peopleId in (select peopleId from people group by peopleId having count(peopleId) > 1);

2. Delete the redundant duplicate records judged by a single field (peopleId), keeping only the record with the smallest rowid:

       delete from people where peopleId in (select peopleId from people group by peopleId having count(peopleId) > 1) and rowid not in (select min(rowid) from people group by peopleId having count(peopleId) > 1);

3. Find the redundant duplicate records judged by multiple fields (peopleId, seq):

       select * from vitae a where (a.peopleId, a.seq) in (select peopleId, seq from vitae group by peopleId, seq having count(*) > 1);

4. Delete the redundant duplicate records judged by multiple fields (peopleId, seq), keeping only the record with the smallest rowid:

       delete from vitae a where (a.peopleId, a.seq) in (select peopleId, seq from vitae group by peopleId, seq having count(*) > 1) and rowid not in (select min(rowid) from vitae group by peopleId, seq having count(*) > 1);

Note that rowid is not a MySQL pseudo-column (it exists in Oracle and SQLite); in MySQL, substitute an explicit unique column such as the auto-increment id.
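Since SQLite, unlike MySQL, really does expose an implicit rowid, the appendix queries can be exercised verbatim there. A minimal sketch, with an invented people table and sample rows, of appendix queries 1 and 2:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Hypothetical sample data: peopleId 1 and 3 are duplicated.
cur.execute("CREATE TABLE people (peopleId INTEGER, name TEXT)")
cur.executemany("INSERT INTO people VALUES (?, ?)",
                [(1, "a"), (1, "b"), (2, "c"), (3, "d"), (3, "e"), (3, "f")])

# Appendix query 1: list every row whose peopleId occurs more than once.
dups = cur.execute(
    "SELECT * FROM people WHERE peopleId IN "
    "(SELECT peopleId FROM people GROUP BY peopleId HAVING COUNT(peopleId) > 1)"
).fetchall()
print(dups)
# → [(1, 'a'), (1, 'b'), (3, 'd'), (3, 'e'), (3, 'f')]

# Appendix query 2: delete the duplicates, keeping the smallest rowid per group.
cur.execute(
    "DELETE FROM people WHERE peopleId IN "
    "(SELECT peopleId FROM people GROUP BY peopleId HAVING COUNT(peopleId) > 1) "
    "AND rowid NOT IN "
    "(SELECT MIN(rowid) FROM people GROUP BY peopleId HAVING COUNT(peopleId) > 1)"
)
print(cur.execute("SELECT peopleId, name FROM people ORDER BY rowid").fetchall())
# → [(1, 'a'), (2, 'c'), (3, 'd')]: one row per peopleId remains
```

To port this to MySQL, replace rowid with the table's auto-increment id column and stage the subquery results in temporary tables, as in the main text.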
