MySQL data deduplication and related optimizations

MySQL does not allow a statement to modify a table while selecting from that same table in a subquery, so the table in the subquery and the table being modified cannot be the same. The values are therefore passed through the temporary tables below.

1. Single-field duplicates

Create the temporary tables, where uid is the field to be deduplicated

create table tmp_uid as (select uid from user_info group by uid having count(uid) > 1);
create table tmp_id as (select min(id) as id from user_info group by uid having count(uid) > 1);

When the data volume is large, be sure to create an index on uid

alter table tmp_uid add index index_name (field_name);
alter table tmp_id add index index_name (field_name);

Delete the redundant duplicates, keeping the record with the smallest id in each group of duplicates

delete from user_info where id not in (select id from tmp_id) and uid in (select uid from tmp_uid);
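
The single-field procedure above can be tried end to end on a few rows. The following is a minimal sketch, not the author's code: it assumes Python's built-in sqlite3 module as a stand-in for a MySQL connection, the sample rows are made up, and SQLite's CREATE INDEX is used where MySQL would use ALTER TABLE ... ADD INDEX. The index names are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Made-up sample data: uid 10 appears twice, uid 30 three times.
    CREATE TABLE user_info (id INTEGER PRIMARY KEY, uid INTEGER);
    INSERT INTO user_info (id, uid) VALUES
        (1, 10), (2, 10), (3, 20), (4, 30), (5, 30), (6, 30);

    -- Temporary tables: the duplicated uids, and the smallest id per duplicated uid.
    CREATE TABLE tmp_uid AS
        SELECT uid FROM user_info GROUP BY uid HAVING COUNT(uid) > 1;
    CREATE TABLE tmp_id AS
        SELECT MIN(id) AS id FROM user_info GROUP BY uid HAVING COUNT(uid) > 1;

    -- Index the lookup columns (hypothetical index names).
    CREATE INDEX idx_tmp_uid ON tmp_uid (uid);
    CREATE INDEX idx_tmp_id ON tmp_id (id);

    -- Delete every duplicated row except the one with the smallest id.
    DELETE FROM user_info
    WHERE id NOT IN (SELECT id FROM tmp_id)
      AND uid IN (SELECT uid FROM tmp_uid);
""")

remaining = conn.execute("SELECT id, uid FROM user_info ORDER BY id").fetchall()
print(remaining)  # [(1, 10), (3, 20), (4, 30)]
```

The row with uid 20 is untouched because its uid never enters tmp_uid; only rows whose uid is duplicated are candidates for deletion.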

2. Multi-field duplicates

If the duplicated uids above have in turn produced duplicate records in the relationship table, deduplication has to continue there.

2.1 General Method

Basically the same as above:

Create the temporary tables

create table tmp_relation as (select source, target from relationship group by source, target having count(*) > 1);
create table tmp_relationship_id as (select min(id) as id from relationship group by source, target having count(*) > 1);

Create an index

alter table table_name add index index_name (field_name);

Delete the duplicates

delete from relationship where id not in (select id from tmp_relationship_id) and (source, target) in (select source, target from tmp_relation);
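
The general multi-field method can be sketched the same way. This is an illustration under assumptions rather than the author's code: Python's built-in sqlite3 module stands in for MySQL, the sample rows and the index name are invented, and the column names id, source and target are taken from the statements above.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Made-up sample data: (a, b) appears three times, (b, c) twice.
    CREATE TABLE relationship (id INTEGER PRIMARY KEY, source TEXT, target TEXT);
    INSERT INTO relationship (id, source, target) VALUES
        (1, 'a', 'b'), (2, 'a', 'b'), (3, 'a', 'c'),
        (4, 'b', 'c'), (5, 'b', 'c'), (6, 'a', 'b');

    -- Duplicated (source, target) pairs, and the smallest id of each pair.
    CREATE TABLE tmp_relation AS
        SELECT source, target FROM relationship
        GROUP BY source, target HAVING COUNT(*) > 1;
    CREATE TABLE tmp_relationship_id AS
        SELECT MIN(id) AS id FROM relationship
        GROUP BY source, target HAVING COUNT(*) > 1;

    -- Hypothetical index on the id lookup table.
    CREATE INDEX idx_tmp_relationship_id ON tmp_relationship_id (id);

    -- Keep the smallest id of each duplicated pair, delete the rest.
    DELETE FROM relationship
    WHERE id NOT IN (SELECT id FROM tmp_relationship_id)
      AND (source, target) IN (SELECT source, target FROM tmp_relation);
""")

rows = conn.execute(
    "SELECT id, source, target FROM relationship ORDER BY id"
).fetchall()
print(rows)  # [(1, 'a', 'b'), (3, 'a', 'c'), (4, 'b', 'c')]
```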

2.2 Quick Method

In practice, the method above for removing multi-field duplicates turned out to be very inefficient on large data volumes, unbearably so, apparently because the multi-field comparison could not make use of an index. After waiting a long time with no response, I decided to take a different approach.

The number of copies of any one record is expected to be low, typically 2 or 3, and fairly concentrated around those values. So you can simply delete the record with the largest id in each group of duplicates and repeat until no duplicates remain; the record left in each group is then the one that had the smallest id.

The approximate process is as follows:

(1) Select the record with the largest id in each group of duplicates

create table tmp_relation_id2 as (select max(id) as id from relationship group by source, target having count(*) > 1);

(2) Create an index (only needs to be executed the first time)

alter table table_name add index index_name (field_name);

(3) Delete the record with the largest id in each group of duplicates

delete from relationship where id in (select id from tmp_relation_id2);

(4) Drop the temporary table

drop table tmp_relation_id2;

Repeat steps (1), (2), (3), (4) until the temporary table created in step (1) contains no records (this is more efficient when each record is repeated only a few times)
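
The repeat-until-empty loop can be sketched as a small driver script. The sketch below is an assumption-laden illustration, not the author's tooling: it uses Python's built-in sqlite3 module in place of a MySQL connection, the function name dedupe_by_max_id and the index name are hypothetical, and the sample rows are made up. It stops when the delete in step (3) removes nothing, which is equivalent to the temporary table being empty.

```python
import sqlite3

def dedupe_by_max_id(conn):
    """Run steps (1)-(4) repeatedly; return how many passes deleted rows."""
    passes = 0
    while True:
        # (1) Largest id in each group of duplicated (source, target) pairs.
        conn.execute(
            "CREATE TABLE tmp_relation_id2 AS "
            "SELECT MAX(id) AS id FROM relationship "
            "GROUP BY source, target HAVING COUNT(*) > 1"
        )
        # (2) Index the temporary table. It is recreated on every pass here
        #     because the table is dropped in step (4) of this sketch.
        conn.execute("CREATE INDEX idx_tmp_relation_id2 ON tmp_relation_id2 (id)")
        # (3) Delete those rows; rowcount is 0 when the temporary table is empty.
        deleted = conn.execute(
            "DELETE FROM relationship "
            "WHERE id IN (SELECT id FROM tmp_relation_id2)"
        ).rowcount
        # (4) Drop the temporary table.
        conn.execute("DROP TABLE tmp_relation_id2")
        if deleted == 0:
            conn.commit()
            return passes
        passes += 1

# Made-up sample data: (a, b) appears three times, (b, c) twice.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE relationship (id INTEGER PRIMARY KEY, source TEXT, target TEXT);
    INSERT INTO relationship (id, source, target) VALUES
        (1, 'a', 'b'), (2, 'a', 'b'), (3, 'a', 'b'),
        (4, 'a', 'c'), (5, 'b', 'c'), (6, 'b', 'c');
""")
passes = dedupe_by_max_id(conn)
survivors = conn.execute(
    "SELECT id, source, target FROM relationship ORDER BY id"
).fetchall()
print(passes)     # 2
print(survivors)  # [(1, 'a', 'b'), (4, 'a', 'c'), (5, 'b', 'c')]
```

Each pass removes one copy from every duplicated group, so the number of deleting passes equals the largest number of copies minus one; with records repeated 2 or 3 times, that is one or two passes plus a final pass that finds nothing.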

This article was reposted from http://www.cnblogs.com/rainduck/archive/2013/05/15/3079868.html
