Original: Delete MySQL duplicate data
Delete MySQL duplicate data project background
In a recent Linux performance acquisition project, found that the thread of the program storage is slow, and then carefully locate, found in the database a lot of redundant data. Because in the acquisition, for the same device, the same point in time should have only one data, however, the database is stored in a number of data. For how to cause this result, for a while do not think clearly, but in order to solve the problem of slow storage, first to delete redundant data.
Problem description
The table structure of the database is simple, as follows:
+----------------+--------------+------+-----+---------+-------+| Field | Type | Null | Key | Default | Extra |+----------------+--------------+------+-----+---------+-------+| id | varchar(255) | NO | PRI | NULL | || conf_id | varchar(255) | NO | MUL | NULL | || insert_time | datetime | YES | | NULL | || cpu_usage | float(11,2) | YES | | NULL | || memory_usage | float(11,2) | YES | | NULL | || io_usage_write | float(11,2) | YES | | NULL | || io_usage_read | float(11,2) | YES | | NULL | |+----------------+--------------+------+-----+---------+-------+
Querying all data volumes
select count(*) from perf_linux;
Output 427366
Query the amount of data for different devices at all points in time
select count(distinct conf_id, insert_time) from perf_linux ;
Output 42387
From the above data can be seen, data redundancy of about 10 times times.
Then take a look at the time group:
select id, conf_id ,insert_time from perf_linux order by insert_time, conf_id;
Output:
| 2a79f7cd-43a9-4c7b-adb2-316b6c04283e | 1 | 2014-12-09 15:09:14 | | 50D6F6C2-9C8B-45FD-98FD-2BE211221CFD | 1 | 2014-12-09 15:09:14 | | 740b52e1-e868-4074-ba36-74e2634401b3 | 1 | 2014-12-09 15:09:14 | | 8b0096a4-9e85-417b-a131-e3505ca79a9c | 1 | 2014-12-09 15:09:14 | | 90a9e882-5220-4508-a56f-8d4ab4a7929b | 1 | 2014-12-09 15:09:14 | | D17403ed-24a4-45e8-b51b-2a95118383d9 | 1 | 2014-12-09 15:09:14 | | 0C2DA917-579B-4080-857D-7159F38B44AC | 2 | 2014-12-09 15:09:14 | | 263083eb-8f63-4d2b-a03f-3320aa678735 | 2 | 2014-12-09 15:09:14 | | d6c57a38-080b-465a-a55a-beafd9daf32d | 2 | 2014-12-09 15:09:14 | | f672227b-1fb8-4b85-880d-2cc34b02880d | 2 | 2014-12-09 15:09:14 | | F80020fe-6cb5-48ec-beb0-4e8ebeb0ca57 | 2 | 2014-12-09 15:09:14 | | ff633a35-824d-49ba-b78c-5bcc5df8d1cc | 2 | 2014-12-09 15:09:14 | | 5c41e48a-abfc-4108-a00e-ca7def7d5a5a | 3 | 2014-12-09 15:09:14 | | 60B7AB9E-C91A-4020-A6D3-7BCEB1DC47C5 | 3 | 2014-12-09 on 15: 09:14 | | 7b6cd2b8-ac6d-43eb-8858-e15885e676c8 | 3 | 2014-12-09 15:09:14 | | D53a3df5-08c4-4604-8fac-cb51077935f6 | 3 | 2014-12-09 15:09:14 | | d9e4ba14-f98d-42a8-b3bc-2879d58aa797 | 3 | 2014-12-09 15:09:14 | | f56f82f6-32a7-47f7-ae07-b13168743884 | 3 | 2014-12-09 15:09:14 | | 076c4c1b-0028-4a9c-a8c4-de655bd6ab6b | 4 | 2014-12-09 15:09:14 | | 2A90AD9E-11A5-4707-95E8-78491DA658AD | 4 | 2014-12-09 15:09:14 | | 3b17ad1d-e589-4b65-93a7-d61fc99b4071 | 4 | 2014-12-09 15:09:14 | | 6988d6cf-44ef-47f7-808d-09791caf2d90 | 4 | 2014-12-09 15:09:14 | | 8404d281-f9e5-4153-a47e-128c05386758 | 4 | 2014-12-09 15:09:14 | | e042e310-7ff2-4e4d-8c98-71e3e4d57828 | 4 | 2014-12-09 15:09:14 |+--------------------------------------+---------+---------------------+
From the visible, the same time point of the same device data redundancy, now we want to remove these redundant data.
Workaround
The idea is this: first of all, according to CONF_ID and time point to judge, group by query, each group to take one more. Grouping is very simple, but how to take a group? I used the form of an intermediate table.
Create an intermediate table and import the data into the intermediate table
create table perf_linux_t like perf_linux;
insert into perf_linux_t select * from perf_linux;
Add a field to the intermediate table, which is self-growing.
ALTER TABLE `perf_linux_t` ADD COLUMN `auto_id` INT NOT NULL AUTO_INCREMENT ,DROP PRIMARY KEY,ADD PRIMARY KEY (`auto_id`);
Delete useless data
Check it out first.
select min(auto_id) as auto_id from perf_linux_t group by insert_time ;
Delete the incorrect data
delete from perf_linux_t where auto_id not in (select min(auto_id) as auto_id from perf_linux_t group by insert_time);
Wait, output error:
You can ' t specify target table ' perf_linux_t ' for update in FROM clause
Can not delete Ah, it can only build another intermediate table.
Re-build the intermediate table
create table tmp like perf_linux_t;
Instead of deleting non-conforming data, the data is stored in the new table.
insert into tmp select * from perf_linux_t where auto_id in (select min(auto_id) as auto_id from perf_linux_t group by insert_time,conf_id );
Remove the useless columns from this table
ALTER TABLE `tmp` DROP COLUMN `auto_id`,DROP PRIMARY KEY;
Data back
Delete the original data
truncate table perf_linux;
Inserting data
insert into perf_linux select * from tmp;
Delete an intermediate table
drop table tmp;
drop table perf_linux_t;
Summarize
By this method, the data is changed to 42,387 and the redundant data is deleted. But in fact the problem of the program is not fully positioned, but also need to observe to locate the problem.
And I think that this method should be compared to the method of soil, if you have a better way, I hope the generous enlighten.
Reference:
MySQL method for deleting duplicate record statements
Delete MySQL duplicate data