Fast deduplication of mysql tables with tens of millions of data records _ MySQL

Source: Internet
Author: User
Quick de-duplication analysis of mysql tables with tens of millions of levels bitsCN.com quick de-duplication analysis of mysql tables with tens of millions of levels target: existing tables a and B, merge the data in the two tables to the c table. The data volume in Table a and Table B is about 2 million. Basic information operating system version: CentOS release 5.6 64-bit operating system Memory: 8 GB database version: 5.1.56-community 64-bit database initialization parameters: default database table and data table: mysql> desc a2kw; + ------- + ------------- + ------ + ----- + --------- + ------- + | Field | Type | Null | Key | Default | Extra | + ------- + ------------- + ------ + ----- + --------- + ------- + | c1 | varchar (20) | YES | MUL | NULL | c2 | varchar (30) | YES | NULL | c3 | varchar (12) | YES | NULL | c4 | varchar (20) | YES | NULL | + ------- + ------------- + ------ + ----- + --------- + ------- + 4 rows in set (0.00 sec) table bmysql> desc b2kw; + ------- + ------------- + ------ + ----- + --------- + ------- + | Field | Type | Null | Key | Default | Extra | + ------- + ------------- + ------ + ----- + --------- + ------- + | c1 | varchar (20) | YES | NULL | c2 | varchar (30) | YES | NULL | c3 | varchar (12) | YES | NULL | c4 | varchar (20) | YES | NULL | + ------- + ------------- + ------ + ----- + --------- + ------- + 4 rows in set (0.00 sec) the data in tables a and B is as follows: mysql> select * from a2kw limit 10; + ----------- + ------ + ---------- + | c1 | c2 | c3 | c4 | + ----------- + ------ + ---------- + | 662164461 | 131545534 | TOM0 | 20120520 | 226662142 | 605685564 | | TOM0 | 20120516 | 527008225 | 172557633 | TOM0 | 20120514 | 574408183 | 350897450 | TOM0 | 20120510 | 781619324 | 583989494 | TOM0 | 20120510 | 158872754 | 775676430 | TOM0 | 20120512 | 815875622 | 631631832 | TOM0 | 20120514 | 905943640 | 477433083 | TOM0 | 20120514 | 660790641 | 616774715 | TOM0 | 20120512 | 999083595 | 953186525 | TOM0 | 20120513 | + ----------- + ----------- + ------ + ---------- + 10 rows in set (0.01 sec) basic Step 1: create an index on Table B mysql> select count (*) from b2kw; + ---------- + | count (*) | + ---------- + | 20000002 | + ---------- + 1 row in set (0.00 sec) mysql> create index ind_b2kw_c1 on b2kw (c1); Query OK, 20000002 rows affected (1 min 2.94 sec) Records: 20000002 Duplicates: 0 Warnings: 0 data volume: 20000002, time: 1 min 2.94 sec2, insert a and B into the temp table of the intermediate table, create mysql> create table temp select * from c2kw where 1 = 2; Query OK, 0 rows affected (0.00 sec) Records: 0 Duplicates: 0 Warnings: 0 insert data mysql> insert into temp select * from a2kw; Query OK, 20000002 rows affected (13.23 sec) Records: 20000002 Duplicates: 0 Warnings: 0 mysql> insert into temp select * from b2kw; Query OK, 20000002 rows affected (13.27 sec) Records: 20000002 Duplicates: 0 Warnings: 0 mysql> select count (*) from temp; + ---------- + | count (*) | + ---------- + | 40000004 | + ---------- + 1 row in set (0.00 sec) the data volume is: 40000004, Time: 26.50 sec3, temp joint index, force index remove duplicate data mysql> create index ind_temp_c123 on temp (c1, c2, c3); Query OK, 40000004 rows affected (3 min 43.87 sec) Records: 40000004 Duplicates: 0 Warnings: 0 View execution plan mysql> explain select c1, c2, c3, max (c4) from temp FORCE INDEX
(Ind_temp_c123) group by c1, c2, c3; + ---- + ------------- + ------- + --------------- + ----------
----- + --------- + ------ + ---------- + ------- + | Id | select_type | table | type | possible_keys | key
| Key_len | ref | rows | Extra | + ---- + ------------- + ------- + --------------- + -------------
-- + --------- + ------ + ---------- + ------- + | 1 | SIMPLE | temp | index | NULL | ind_temp_c123 | 71
| NULL | 40000004 | + ---- + ------------- + ------- + --------------- + --------
------- + --------- + ------ + ---------- + ------- + 1 row in set (0.05 sec) mysql> insert into c2kw select c1, c2, c3, max (c4) from temp
Force index (ind_temp_c123) group by c1, c2, c3; Query OK, 20000004 rows affected (2 min 0.85 sec) Records: 20000004 Duplicates: 0 Warnings: 0 the actual cost is: 6 min
4. delete the intermediate table mysql> drop table temp; Query OK, 0 rows affected (0.99 sec). the actual cost is: 1 sec.
5. create a c index mysql> create index ind_c2kw_c1 on c2kw (c1); Query OK, 20000004 rows affected (49.74 sec) Records: 20000004 Duplicates: 0 Warnings: 0 mysql> create index ind_c2kw_c2 on c2kw (c2); Query OK, 20000004 rows affected (1 min 47.20 sec) Records: 20000004 Duplicates: 0 Warnings: 0 mysql> create index ind_c2kw_c3 on c2kw (c3); Query OK, 20000004 rows affected (2 min 42.02 sec) Records: 20000004 Duplicates: 0 Warnings: 0 the actual cost is: 5 minutes
6. clear mysql tables a and B> truncate table a2kw; Query OK, 0 rows affected (1.15 sec) mysql> truncate table b2kw; Query OK, 0 rows affected (1.34 sec) the actual cost is about: the total time spent by 3sec is about 15 minutes. The author RuleV5bitsCN.com

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.