Fast deduplication of mysql tables with tens of millions of data records

Source: Internet
Author: User

Quick de-duplication Analysis of tens of millions of mysql merged tables target: existing tables a and B, merge the data in the two tables to the c table. The data volume in Table a and table B is about 2 million. Basic information operating system version: CentOS release 5.6 64-bit operating system memory: 8 GB database version: 5.1.56-community 64-bit database initialization parameters: default database table and data table: www.2cto.com mysql> desc a2kw; + ------- + ------------- + ------ + ----- + --------- + ------- + | Field | Type | Null | Key | Default | Extra | + ------- + ------------- + ------ + ----- + --------- + ------- + | c1 | varchar (20) | YES | MUL | NULL | c2 | varchar (30) | YES | NULL | c3 | varchar (12) | YES | NULL | c4 | varchar (20) | YES | NULL | + ------- + ------------- + ------ + ----- + --------- + ------- + 4 rows in set (0.00 sec) Table bmysql> desc b2kw; + ------- + ------------- + ------ + ----- + --------- + ------- + | Field | Type | Null | Key | Default | Extra | + ------- + ------------- + ------ + ----- + --------- + ------- + | c1 | varchar (20) | YES | NULL | c2 | varchar (30) | YES | NULL | c3 | varchar (12) | YES | NULL | c4 | varchar (20) | YES | NULL | + ------- + ------------- + ------ + ----- + --------- + ------- + 4 rows in set (0.00 sec) the data in tables a and B is as follows: mysql> select * from a2kw limit 10; + ----------- + ------ + ---------- + | c1 | c2 | c3 | c4 | + ----------- + ------ + ---------- + | 662164461 | 131545534 | TOM0 | 20120520 | 226662142 | 605685564 | | TOM0 | 20120516 | 527008225 | 172557633 | TOM0 | 20120514 | 574408183 | 350897450 | TOM0 | 20120510 | 781619324 | 583989494 | TOM0 | 20120510 | 158872754 | 775676430 | TOM0 | 20120512 | 815875622 | 631631832 | TOM0 | 20120514 | 905943640 | 477433083 | TOM0 | 20120514 | 660790641 | 616774715 | TOM0 | 20120512 | 999083595 | 953186525 | TOM0 | 20120513 | + ----------- + ----------- + ------ + ---------- + 10 rows in set (0.01 sec) basic Steps: www.2cto.com 1. Create an index on Table B. mysql> select count (*) from b2kw; + ---------- + | count (*) | + ---------- + | 20000002 | + ---------- + 1 row in set (0.00 sec) mysql> create index ind_b2kw_c1 on b2kw (c1); Query OK, 20000002 rows affected (1 min 2.94 sec) Records: 20000002 Duplicates: 0 Warnings: 0 data volume: 20000002, time: 1 min 2.94 sec2, insert a and B into the temp table of the intermediate table, create mysql> create table temp select * from c2kw where 1 = 2; Query OK, 0 rows affected (0.00 sec) Records: 0 Duplicates: 0 Warnings: 0 insert data mysql> insert into temp select * from a2kw; Query OK, 20000002 rows affected (13.23 sec) Records: 20000002 Duplicates: 0 Warnings: 0 mysql> insert into temp select * from b2kw; Query OK, 20000002 rows affected (13.27 sec) Records: 20000002 Duplicates: 0 Warnings: 0 www.2cto.com mysql> select count (*) from temp; + ---------- + | count (*) | + ---------- + | 40000004 | + ---------- + 1 row in set (0.00 sec) data volume: 40000004, time: 26.50 sec3, temp Joint index, forced index remove duplicate data mysql> create index ind_temp_c123 on temp (c1, c2, c3); Query OK, 40000004 rows affected (3 min 43.87 sec) Records: 40000004 Duplicates: 0 Warnings: 0 view execution plan mysql> explain select c1, c2, c3, max (c4) from temp FORCE INDEX
(Ind_temp_c123) group by c1, c2, c3; + ---- + ------------- + ------- + --------------- + ----------
----- + --------- + ------ + ---------- + ------- + | Id | select_type | table | type | possible_keys | key
| Key_len | ref | rows | Extra | + ---- + ------------- + ------- + --------------- + -------------
-- + --------- + ------ + ---------- + ------- + | 1 | SIMPLE | temp | index | NULL | ind_temp_c123 | 71
| NULL | 40000004 | + ---- + ----------- + ------- + www.2cto.com ------- + --------------- + --------
------- + --------- + ------ + ---------- + ------- + 1 row in set (0.05 sec) mysql> insert into c2kw select c1, c2, c3, max (c4) from temp
Force index (ind_temp_c123) group by c1, c2, c3; Query OK, 20000004 rows affected (2 min 0.85 sec) Records: 20000004 Duplicates: 0 Warnings: 0 the actual cost is: 6 min
4. Delete the intermediate table mysql> drop table temp; Query OK, 0 rows affected (0.99 sec). The actual cost is: 1 sec.
5. create a c index mysql> create index ind_c2kw_c1 on c2kw (c1); Query OK, 20000004 rows affected (49.74 sec) Records: 20000004 Duplicates: 0 Warnings: 0 mysql> create index ind_c2kw_c2 on c2kw (c2); Query OK, 20000004 rows affected (1 min 47.20 sec) Records: 20000004 Duplicates: 0 Warnings: 0 mysql> create index ind_c2kw_c3 on c2kw (c3); Query OK, 20000004 rows affected (2 min 42.02 sec) Records: 20000004 Duplicates: 0 Warnings: 0 the actual cost is: 5 minutes www.2cto.com
6. Clear mysql tables a and B> truncate table a2kw; Query OK, 0 rows affected (1.15 sec) mysql> truncate table b2kw; Query OK, 0 rows affected (1.34 sec) the actual cost is: the total time spent by 3sec is about 15 minutes. The author RuleV5

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.