MySQL data deduplication and related optimizations

Last Update: 2014-11-02 Source: Internet

Author: User

MySQL does not allow a statement to modify a table while reading from that same table in a subquery, so the subquery and the table being modified cannot be the same. The data therefore has to pass through temporary tables, as described below.

1. Duplicates on a single field

Generate the temporary tables, where uid is the field to be deduplicated

CREATE TABLE tmp_uid AS (SELECT uid FROM user_info GROUP BY uid HAVING COUNT(uid) > 1);
CREATE TABLE tmp_id AS (SELECT MIN(id) AS id FROM user_info GROUP BY uid HAVING COUNT(uid) > 1);

When the data volume is large, be sure to create an index on each temporary table

ALTER TABLE tmp_uid ADD INDEX index_name (field_name);
ALTER TABLE tmp_id ADD INDEX index_name (field_name);

Delete the redundant duplicates, keeping the row with the smallest ID in each set of duplicates

DELETE FROM user_info WHERE id NOT IN (SELECT id FROM tmp_id) AND uid IN (SELECT uid FROM tmp_uid);
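The three steps above can be sketched end to end as follows. This is a minimal illustration rather than the author's exact script: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server, and the two-column layout of user_info (id, uid), the sample rows, and the index names are assumptions made for the example. SQLite does not share MySQL's restriction on reading from the table being modified, but the temporary-table pattern is the same.

```python
import sqlite3


def dedupe_user_info(conn):
    """Keep only the row with the smallest id for each duplicated uid.

    Assumption: user_info has the columns id and uid.
    Returns the number of rows deleted.
    """
    cur = conn.cursor()
    # Step 1: temporary tables holding the duplicated uids and the ids to keep.
    cur.execute(
        "CREATE TABLE tmp_uid AS "
        "SELECT uid FROM user_info GROUP BY uid HAVING COUNT(uid) > 1"
    )
    cur.execute(
        "CREATE TABLE tmp_id AS "
        "SELECT MIN(id) AS id FROM user_info GROUP BY uid HAVING COUNT(uid) > 1"
    )
    # Step 2: index the temporary tables (index names are hypothetical).
    cur.execute("CREATE INDEX idx_tmp_uid ON tmp_uid (uid)")
    cur.execute("CREATE INDEX idx_tmp_id ON tmp_id (id)")
    # Step 3: delete every duplicated row except the smallest id per uid.
    cur.execute(
        "DELETE FROM user_info "
        "WHERE id NOT IN (SELECT id FROM tmp_id) "
        "AND uid IN (SELECT uid FROM tmp_uid)"
    )
    deleted = cur.rowcount
    # Clean up the temporary tables.
    cur.execute("DROP TABLE tmp_uid")
    cur.execute("DROP TABLE tmp_id")
    conn.commit()
    return deleted


def demo_single_field():
    """Run the sketch on a small in-memory table with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE user_info (id INTEGER PRIMARY KEY, uid INTEGER)")
    conn.executemany(
        "INSERT INTO user_info (id, uid) VALUES (?, ?)",
        [(1, 10), (2, 10), (3, 20), (4, 30), (5, 30), (6, 30)],
    )
    deleted = dedupe_user_info(conn)
    remaining = conn.execute(
        "SELECT id, uid FROM user_info ORDER BY id"
    ).fetchall()
    conn.close()
    return deleted, remaining


if __name__ == "__main__":
    print(demo_single_field())
```

In the sample data, uid 10 and uid 30 are duplicated, so only their rows are touched; the row for uid 20 never appears in tmp_uid and is left alone.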

2. Duplicates across multiple fields

The duplicated UIDs above indirectly caused duplicate records in the relationship table, so deduplication has to continue there.

2.1 General Methods

This is basically the same as above:

Generate the temporary tables

CREATE TABLE tmp_relation AS (SELECT source, target FROM relationship GROUP BY source, target HAVING COUNT(*) > 1);
CREATE TABLE tmp_relationship_id AS (SELECT MIN(id) AS id FROM relationship GROUP BY source, target HAVING COUNT(*) > 1);

Create an index

ALTER TABLE table_name ADD INDEX index_name (field_name);

Delete the duplicates

DELETE FROM relationship WHERE id NOT IN (SELECT id FROM tmp_relationship_id) AND (source, target) IN (SELECT source, target FROM tmp_relation);
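The general method for multiple fields can be sketched in the same way. This is an illustration under stated assumptions, not the author's script: Python's built-in sqlite3 module stands in for MySQL, the relationship table is assumed to have the columns id, source, and target, and the sample rows and the index name are invented for the example. The row-value comparison (source, target) IN (...) needs a reasonably recent SQLite (3.15 or later).

```python
import sqlite3


def dedupe_relationship(conn):
    """Keep the smallest id for each duplicated (source, target) pair.

    Assumption: relationship has the columns id, source and target.
    Returns the number of rows deleted.
    """
    cur = conn.cursor()
    # Temporary table of the duplicated (source, target) pairs.
    cur.execute(
        "CREATE TABLE tmp_relation AS "
        "SELECT source, target FROM relationship "
        "GROUP BY source, target HAVING COUNT(*) > 1"
    )
    # Temporary table of the ids to keep, one per duplicated pair.
    cur.execute(
        "CREATE TABLE tmp_relationship_id AS "
        "SELECT MIN(id) AS id FROM relationship "
        "GROUP BY source, target HAVING COUNT(*) > 1"
    )
    # Index on the id table (index name is hypothetical).
    cur.execute(
        "CREATE INDEX idx_tmp_relationship_id ON tmp_relationship_id (id)"
    )
    # Delete rows that belong to a duplicated pair but are not the kept id.
    cur.execute(
        "DELETE FROM relationship "
        "WHERE id NOT IN (SELECT id FROM tmp_relationship_id) "
        "AND (source, target) IN (SELECT source, target FROM tmp_relation)"
    )
    deleted = cur.rowcount
    cur.execute("DROP TABLE tmp_relation")
    cur.execute("DROP TABLE tmp_relationship_id")
    conn.commit()
    return deleted


def demo_multi_field():
    """Run the sketch on a small in-memory table with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE relationship "
        "(id INTEGER PRIMARY KEY, source TEXT, target TEXT)"
    )
    conn.executemany(
        "INSERT INTO relationship (id, source, target) VALUES (?, ?, ?)",
        [
            (1, "a", "b"), (2, "a", "b"), (3, "a", "c"),
            (4, "b", "a"), (5, "b", "a"), (6, "a", "b"),
        ],
    )
    deleted = dedupe_relationship(conn)
    remaining = conn.execute(
        "SELECT id, source, target FROM relationship ORDER BY id"
    ).fetchall()
    conn.close()
    return deleted, remaining


if __name__ == "__main__":
    print(demo_multi_field())
```

Note that ("a", "b") and ("b", "a") are treated as different pairs here, because the grouping is on the ordered columns source and target.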

2.2 Quick Method

In practice, the method above for removing multi-field duplicates turned out to be unbearably slow on large data volumes, because there was no way to build an index over the multiple fields. After waiting a long time with no response, I decided to take a different path.

The number of times any one record is duplicated is probably low, typically 2 or 3, and fairly concentrated. So you can simply delete the record with the largest ID in each set of duplicates, and repeat until no duplicates remain. The record left in each set is then the one whose ID was the smallest.

The approximate process is as follows:

(1) Select the record with the largest ID in each set of duplicates

CREATE TABLE tmp_relation_id2 AS (SELECT MAX(id) AS id FROM relationship GROUP BY source, target HAVING COUNT(*) > 1);

(2) Create an index (only needs to be done the first time)

ALTER TABLE table_name ADD INDEX index_name (field_name);

(3) Delete the record with the largest ID in each set of duplicates

DELETE FROM relationship WHERE id IN (SELECT id FROM tmp_relation_id2);

(4) Drop the temporary table

DROP TABLE tmp_relation_id2;

Repeat steps (1), (2), (3), and (4) until the temporary table created in step (1) contains no records (more efficient for repeated data)
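The repeated rounds can be sketched as a loop. Again this is only an illustration under assumptions: Python's built-in sqlite3 module stands in for MySQL, the relationship table is assumed to have the columns id, source, and target, and the sample rows and the index name are invented. One deliberate difference from the steps as written: because this sketch drops the temporary table every round, it rebuilds the index on it each round as well.

```python
import sqlite3


def dedupe_by_rounds(conn):
    """Delete the largest id of each duplicated (source, target) pair,
    round after round, until no duplicates remain.

    Assumption: relationship has the columns id, source and target.
    Returns the number of rounds that actually deleted rows.
    """
    cur = conn.cursor()
    rounds = 0
    while True:
        # (1) Largest id in every pair that is still duplicated.
        cur.execute(
            "CREATE TABLE tmp_relation_id2 AS "
            "SELECT MAX(id) AS id FROM relationship "
            "GROUP BY source, target HAVING COUNT(*) > 1"
        )
        pending = cur.execute(
            "SELECT COUNT(*) FROM tmp_relation_id2"
        ).fetchone()[0]
        if pending:
            # (2) Index the temporary table (index name is hypothetical).
            cur.execute(
                "CREATE INDEX idx_tmp_relation_id2 ON tmp_relation_id2 (id)"
            )
            # (3) Delete the largest id of each duplicated pair.
            cur.execute(
                "DELETE FROM relationship "
                "WHERE id IN (SELECT id FROM tmp_relation_id2)"
            )
        # (4) Drop the temporary table (this also drops its index).
        cur.execute("DROP TABLE tmp_relation_id2")
        conn.commit()
        if not pending:
            # An empty temporary table means no duplicates are left.
            return rounds
        rounds += 1


def demo_rounds():
    """Run the sketch on a small in-memory table with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE relationship "
        "(id INTEGER PRIMARY KEY, source TEXT, target TEXT)"
    )
    conn.executemany(
        "INSERT INTO relationship (id, source, target) VALUES (?, ?, ?)",
        [
            (1, "a", "b"), (2, "a", "b"), (3, "a", "c"),
            (4, "b", "a"), (5, "b", "a"), (6, "a", "b"),
        ],
    )
    rounds = dedupe_by_rounds(conn)
    remaining = conn.execute(
        "SELECT id, source, target FROM relationship ORDER BY id"
    ).fetchall()
    conn.close()
    return rounds, remaining


if __name__ == "__main__":
    print(demo_rounds())
```

The number of deleting rounds equals the largest duplicate count minus one, which is why the approach suits data where each record is repeated only two or three times.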

This article was transferred from http://www.cnblogs.com/rainduck/archive/2013/05/15/3079868.html

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
