Loading half a billion rows into MySQL---reprint

Last Update:2015-10-28 Source: Internet

Author: User

Tags redis version percona

Background

We have a legacy system in our production environment that keeps track of when a user takes an action on causes.com (joins a cause, recruits a friend, etc). I say legacy, but I really mean a prematurely-optimized system. This 500m record database is split across monthly sharded tables. Seems like a great solution to scaling (and it is), except that we don't need it. And based on our usage pattern (e.g. to count a user's total number of actions, we need to query N tables), this leads to pretty severe performance degradation issues. Even with a memcache layer sitting in front of old month tables, new features keep discovering new N-query performance problems. Noticing that we had another database happily chugging along with a million records, I decided to migrate the existing system into a single table setup. The goals were:

Reduce complexity. Querying one table is simpler than N tables.
Push as much complexity as possible to the database. The wrappers around the month-sharding logic in Rails are slow and buggy.
Increase performance. Also related to one table query being simpler than N.

Alternative proposed solutions

MySQL Partitioning: This is the most similar to our existing setup, since MySQL internally stores the data in different tables. We decided against it because it seemed likely that it wouldn't be much faster than our current solution (although MySQL can internally do some optimizations to make sure you only look at tables that could possibly have the data you want). And it's still the same complexity we were looking to reduce (and it would furthermore be the only database setup in our system using partitioning).

Redis: Not really proposed as an alternative because the full dataset won't fit into memory, but it is something we're considering loading a subset of the data into, to answer queries that we make a lot and that MySQL isn't particularly good at (e.g. 'which of my friends have taken an action' is quick using Redis's built-in SET UNION function). The new MySQL table might be performant enough that it doesn't make sense to build a fast Redis version, so we're avoiding this as possible premature optimization, especially with a technology we're not as familiar with.

Dumping the old data

MySQL provides the 'mysqldump' utility to allow quick dumping to disk:

  mysqldump -T /var/lib/mysql/database_data database_name

This will produce a TSV file for each table in the database, and this is the format that 'LOAD DATA INFILE' will be able to quickly load later on.

Installing Percona 5.5

We'll be building the new system with the latest-and-greatest in Percona databases on CentOS 6.2:

  rpm -Uhv http://www.percona.com/downloads/percona-release/percona-release-0.0-1.x86_64.rpm
  yum install Percona-Server-shared-compat Percona-Server-client-55 Percona-Server-server-55 -y

[Open bug with the compat package: https://bugs.launchpad.net/percona-server/+bug/908620]

Specify a directory for the InnoDB data

This isn't exactly a performance tip, but I had to do some digging to get MySQL to store data on a different partition. The first step is to make sure your my.cnf contains a

  datadir = /path/to/data

directive. Make sure /path/to/data is owned by mysql:mysql (chown -R mysql.mysql /path/to/data) and run:

  mysql_install_db --user=mysql --datadir=/path/to/data

This will set up the directory structures that InnoDB uses to store data. This is also useful if you're aborting a failed data load and want to wipe the slate clean (if you don't specify a directory, /var/lib/mysql is used by default). Just run

  rm -rf *

inside the data directory and then rerun the mysql_install_db command.

[http://dev.mysql.com/doc/refman/5.5/en/mysql-install-db.html]

SQL Commands to speed up the LOAD DATA

You can tell MySQL to not enforce foreign key and uniqueness constraints:

  SET foreign_key_checks = 0;
  SET unique_checks = 0;

and drop the transaction isolation guarantee to READ UNCOMMITTED:

  SET SESSION tx_isolation='READ-UNCOMMITTED';

and turn off the Binlog with:

  SET sql_log_bin = 0

And when you're done, don't forget to turn them back on with:

  SET unique_checks = 1;
  SET foreign_key_checks = 1;
  SET SESSION tx_isolation='REPEATABLE-READ';
  SET sql_log_bin = 1;

It's worth noting that a lot of resources will tell you to use the "DISABLE KEYS" directive and have the indices all built once all the data has been loaded into the table. Unfortunately, InnoDB does not support this. I tried it, and while it took only a few hours to load 500m rows, the data is unusable without any indices. You could drop the indices completely and add them later, but with a table this big I didn't think it would help much.

Another red herring was turning off autocommit and committing after each 'LOAD DATA' statement. This is effectively the same thing as autocommitting, and manually committing led to 'LOAD DATA' slowdowns a quarter of the way in.

[http://dev.mysql.com/doc/refman/5.1/en/alter-table.html, search for "DISABLE KEYS"] [http://www.mysqlperformanceblog.com/2007/11/01/innodb-performance-optimization-basics/]

Performance adjustments made to my.cnf

  # http://dev.mysql.com/doc/refman/5.5/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit
  # This loosens the frequency with which the data is flushed to disk.
  # It's possible to lose a second or two of data this way in the event of a
  # system crash, but this is in a very controlled circumstance.
  innodb_flush_log_at_trx_commit=2
  # rule of thumb is 75%-80% of total system memory
  innodb_buffer_pool_size=16G
  # don't let the OS cache what InnoDB is caching anyway
  # http://www.mysqlperformanceblog.com/2007/11/01/innodb-performance-optimization-basics/
  innodb_flush_method=O_DIRECT
  # don't double write the data
  # http://dev.mysql.com/doc/refman/5.5/en/innodb-parameters.html#sysvar_innodb_doublewrite
  innodb_doublewrite=0

Use LOAD DATA INFILE

LOAD DATA INFILE is the optimized path toward bulk loading structured data into MySQL. Section 8.2.2.1, "Speed of INSERT Statements", predicts a ~20x speedup over a bulk INSERT (i.e. an INSERT with thousands of rows in a single statement). See also section 8.5.4, "Bulk Data Loading for InnoDB Tables", for a few more tips.

Not only is it faster, but in my experience with this migration, the INSERT method will slow down faster than it can load data and effectively never finish (when I last made an estimate, it was still slowing down).

The file named in INFILE must be in the directory where MySQL is storing that database's information. If MySQL is in /var/lib/mysql, then mydatabase would be in /var/lib/mysql/mydatabase. If you don't have access to this directory on the server, you can use LOAD DATA LOCAL INFILE. In my testing, putting the file in the proper place and using 'LOAD DATA INFILE' increased load performance by about 20%.

[http://dev.mysql.com/doc/refman/5.5/en/load-data.html]

Perform your data transformation directly in MySQL

Our old actioncredit system was unique on (MONTH(created_at), id), but the new system is going to generate new autoincrementing IDs for each record as it's loaded in chronological order. The problem is that my gigabytes of TSV data don't match up to the new schema. Some scripts I had that would use Ruby to transform the old row into the new row were laughably slow. I did some digging, and found out that you can tell MySQL to (quickly) throw away the data you don't want in the load statement itself, using parameter binding:

  LOAD DATA INFILE 'data.csv' INTO TABLE mytable
  FIELDS TERMINATED BY '\t' ENCLOSED BY '\"'
  (@throwaway, user_id, action, created_at);

This statement tells MySQL which fields are represented in data.csv. @throwaway is a binding parameter, and in this case we want to discard it, so we're not going to bind it to a column. If we wanted to insert a prefix, we could execute:

  LOAD DATA INFILE 'data.csv' INTO TABLE mytable
  FIELDS TERMINATED BY '\t' ENCLOSED BY '\"'
  (id, user_id, @action, created_at)
  SET action = CONCAT('prefix_', @action);

And every loaded row's 'action' column will begin with the string 'prefix_'.

Checking progress without disrupting the import

If you're loading large data files and want to check the progress, you definitely don't want to use 'SELECT COUNT(*) FROM table'. This query will degrade as the size of the table grows and slow down the LOAD process. Instead you can query:

  mysql> SELECT table_rows FROM information_schema.tables WHERE table_name = 'table';
  +------------+
  | table_rows |
  +------------+
  |   27273886 |
  +------------+
  1 row in set (0.23 sec)

If you want to watch/log the progress over time, you can craft a quick shell command to poll the number of rows:

  $ while :; do mysql -hlocalhost databasename -e "SELECT table_rows FROM information_schema.tables WHERE table_name = 'table' \G" | grep rows | cut -d':' -f2 | xargs echo "$(date +'%F %R')," | tee -a load.log && sleep 30; done
  2012-05-29 18:16, 32267244
  2012-05-29 18:16, 32328002
  2012-05-29 18:17, 32404189
  2012-05-29 18:17, 32473936
  2012-05-29 18:18, 32543698
  2012-05-29 18:18, 32616939
  2012-05-29 18:19, 32693198

The 'tee' will echo to STDOUT as well as to 'load.log', the '\G' formats the columns in the result set as rows, and the sleep gives it a pause between polls.
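To turn the polled counts into a load rate, the log can be post-processed. The following is a minimal sketch, assuming log lines in the 'timestamp, rows' format shown in the polling output and a fixed polling interval. `rows_per_second` is a hypothetical helper name, not part of the original tooling. Keep in mind that table_rows is only an estimate for InnoDB tables, so the resulting rate is approximate.

```shell
#!/usr/bin/env bash
# rows_per_second LOG_FILE INTERVAL_SECONDS
# Hypothetical helper: average rows loaded per second between the first and
# last sample, assuming one sample was logged every INTERVAL_SECONDS.
rows_per_second() {
  local log_file="$1" interval="$2"
  awk -F', *' -v interval="$interval" '
    NF >= 2 { if (n == 0) first = $2; last = $2; n++ }
    END {
      if (n < 2) { print 0; exit }   # not enough samples to compute a rate
      printf "%d\n", (last - first) / ((n - 1) * interval)
    }' "$log_file"
}

# Sample data copied from the polling output in the article.
sample_log="$(mktemp)"
cat > "$sample_log" <<'EOF'
2012-05-29 18:16, 32267244
2012-05-29 18:16, 32328002
2012-05-29 18:17, 32404189
2012-05-29 18:17, 32473936
2012-05-29 18:18, 32543698
2012-05-29 18:18, 32616939
2012-05-29 18:19, 32693198
EOF

rows_per_second "$sample_log" 30   # prints 2366
```

Watching this number fall over successive runs is a cheap way to spot the kind of slowdown described in the rest of this article before a load stalls completely.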

LOAD DATA chunking Script

I quickly discovered that throwing a 50m row TSV file at LOAD DATA was a good way to have performance degrade to the point of not finishing. I settled on using 'split' to chunk data into one million rows per file:

for month_table in action*.txt; do
  echo "$(date) splitting $month_table..."
  # Break the month's dump into one-million-row segments named curmonth_aa, curmonth_ab, ...
  split -l 1000000 "$month_table" curmonth_
  for segment in curmonth_*; do
    echo "On segment $segment"
    # The closing SQL marker must start at the beginning of the line (or be indented with tabs only).
    time mysql -hlocalhost action_credit_silo <<-SQL
      SET foreign_key_checks = 0;
      SET unique_checks = 0;
      SET SESSION tx_isolation='READ-UNCOMMITTED';
      SET sql_log_bin = 0;
      LOAD DATA INFILE '$segment' INTO TABLE actioncredits
      FIELDS TERMINATED BY '\t' ENCLOSED BY '\"'
      (@throwawayId, action, user_id, target_user_id, cause_id, item_type, item_id, activity_id, created_at, utm_campaign);
SQL
    rm "$segment"
  done
  # Mark the month as loaded so a rerun skips it.
  mv "$month_table" "$month_table.done"
done
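To see what the split step produces before pointing it at a 50m row file, it can be tried on a small input. This is a minimal sketch, assuming a standard `split` and `seq` are available. `chunk_file` is a hypothetical helper, and the 25-line file merely stands in for one month's TSV dump.

```shell
#!/usr/bin/env bash
# chunk_file INPUT LINES_PER_CHUNK PREFIX
# Hypothetical helper: split INPUT into segments of LINES_PER_CHUNK lines and
# print "<line count> <segment path>" for each segment, in load order.
chunk_file() {
  local input="$1" lines_per_chunk="$2" prefix="$3" segment
  split -l "$lines_per_chunk" "$input" "$prefix"
  # split names segments PREFIXaa, PREFIXab, ... so the glob expands in order.
  for segment in "$prefix"*; do
    echo "$(wc -l < "$segment" | tr -d ' ') $segment"
  done
}

# Demo on a throwaway 25-line file with 10 lines per segment.
workdir="$(mktemp -d)"
seq 1 25 > "$workdir/action_demo.txt"
chunk_file "$workdir/action_demo.txt" 10 "$workdir/curmonth_"
```

The demo yields three segments of 10, 10 and 5 lines. Because `split -l` cuts on line boundaries, each segment stays loadable on its own as long as no field contains an embedded newline.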

Wrap-up

Over the duration of this script, I saw chunk load time increase from 1m40s to around an hour per million inserts. This was however better than not finishing at all, which I wasn't able to achieve until making all the changes suggested in this post and using the aforementioned 'load.sh' script. Other tips:

Use as few indices as you can.
Loading the data in sequential order not only makes the loading faster, but the resulting table will be faster too.
If you can load any of the data from MySQL (instead of a flat file intermediary), it will be much faster. You can use the 'INSERT INTO ... SELECT' statement to copy data between tables quickly.
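The slowdown reported in this wrap-up is easier to appreciate as rows per second. Here is a rough sketch using only the two chunk times quoted there (1m40s and about an hour per million rows). `rows_per_sec` is a hypothetical helper, and integer division is used, so results are truncated.

```shell
#!/usr/bin/env bash
# rows_per_sec ROWS SECONDS
# Hypothetical helper: integer throughput, truncated toward zero.
rows_per_sec() {
  echo $(( $1 / $2 ))
}

chunk_rows=1000000
first_chunk_secs=$(( 1 * 60 + 40 ))   # 1m40s for the first chunks
late_chunk_secs=$(( 60 * 60 ))        # about an hour for the last chunks

rows_per_sec "$chunk_rows" "$first_chunk_secs"   # prints 10000
rows_per_sec "$chunk_rows" "$late_chunk_secs"    # prints 277
echo "slowdown factor: $(( late_chunk_secs / first_chunk_secs ))x"   # prints "slowdown factor: 36x"
```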
