More rapid full table scans in InnoDB
generally, most application queries use the index to find a few rows of data (primary key lookup or hundreds of rows of queries), but sometimes we need a full table query. A typical full table scan is a logical backup (mysqldump) and a online schema changes (note: The operation of the large table schema on the line, as well as a Facebook open source project) (SELECT ...). into outfile).
On Facebook We use mysqldump to back up the database. As you know MySQL provides two ways to backup, providing the commands and tools for physical and logical backups. Logical backups have some advantages over physical backups, such as:
- Logical backup backup data is much smaller. 3x-10x size differences are not uncommon.
- Easier to resolve a backup database. In a physical backup, when a serious problem occurs, such as a checksum failure. If we cannot recover the database, it is very difficult to know the INNODB internal data structure, or to repair the damage. We have more logical backups than physical backups.
The main disadvantage of logical backups is that full and full restores of the database are much slower than physical backup restores.
Slow, full logical backups tend to cause problems. It can take a long time if there are many large, fragmented tables in the database. In Facebook, we face mysqldump performance problems, leading to a complete logical backup of some (based on HDD and Flashcache) servers within a reasonable time. We know that InnoDB is not efficient because InnoDB does not actually read sequentially, and in most cases it is random. This is an old problem that has been known for years. Our database storage capacity has been growing, and the slow full table scanning problem has had a serious impact on us, so we decided to increase the speed at which InnoDB does sequential reading. Finally, our database critical team of engineers in the InnoDB implementation of the "Logical readahead" function. The application of "Logical ReadAhead", under the normal production workload, our full table scanning speed ratio of the previous increase of 9 ~ 10 times times. In the overload production, the entire table scanning speed reaches 15 ~ 20 times times the speed even faster.
The problem of a full table scan on large, fragmented data sheets
when you do a full table scan, InnoDB scans pages and rows in the primary key order. This applies to all InnoDB tables, including the fragmented tables. If the primary key page table is not fragmented (the page table that stores the primary key and rows), the full table scan is fairly fast because the reading order is close to the physical storage order. This is similar to reading the file from the operating system command (DD/CAT/ETC) as follows.
Copy Code code as follows:
DD if=/data/mysql/dbname/large_table.ibd of=/dev/null bs=16k iflag=direct
You may find that even on a commercial HDD server, you can achieve a higher than MB/s multiplier multiplied by "drive number" speed. More than 1gb/s is not uncommon.
Unfortunately, in many cases the primary key page table has fragments. For example, if you need to manage user_id and object_id mappings, the primary key will be (USER_ID,OBJECT_ID). Insertion sorting is inconsistent with user_id, so new inserts/updates often result in page splits. The new split page is assigned to a location away from the current page. This means that the page will be fragmented.
If the primary key page is fragmented, the full table scan will become extremely slow. Figure 1 illustrates this problem. After InnoDB reads the leaf page #3, it needs to read the page #5230 and read the page #4 after that. The page #5230 position is far from the page #3 and page #4, so the disk read operation order begins to become almost random rather than sequential. We all know that the random reading on HDD is much slower than continuous reading. An effective way to improve random read performance is to use SSD. However, SSD is much more expensive per gigabyte than HDD, so it is often impossible to use SSDs.
Figure 1. The full table scan is not actually continuously read
Does linear prefetching really make sense?
InnoDB supports the pre-read feature, known as "linear prefetching" (Linear read ahead). With linear prefetching, if n page can be accessed sequentially (n can be configured by the Innodb_read_ahead_threshold parameter, by default), InnoDB can read one extent at a time (64 consecutive page, If you do not compress each page to 1MB). But it doesn't make much sense to do so in practice. A extent (64 page) is very small. For a large, fragmented database table, the next page is not necessarily in the same extent. Figure 1 above is a good example. After reading the page#3, InnoDB needs to read the page#5230. Page#3 and page#5230 are not in the same extent, so the linear prefetching technique is not very useful here. This is a very common situation for large tables, so it also explains why the linear prefetching technique does not effectively improve the performance of full table scans.
Physical Pre-read
as described above, the main reason for the slow scanning of the whole table is that InnoDB is mainly randomly read. In order to speed up full table scans, you need to have the InnoDB read sequentially. The first way I thought of it was to create a UDF (user defined function) sequence to read the IBD file (the InnoDB data file). After the UDF is completed, the page of the IBD file should be kept in the InnoDB cache pool, so no random reads are necessary for the full table scan. The following is an example usage:
mysql> SELECT buf_warmup ("db1", "large_table"); /* loading into BUF pool
/mysql> select * from large_application_table;/* In-memory SELECT */
Buf_warmup () is a user-defined function used to read the entire IBD file of the database "DB1" table "Large_table". This function takes time to read the IBD file from the hard disk, but is much faster than random reads because it is sequentially read. In my test, it's almost 5 times times faster than normal linear prefetching.
This proves that sequential reading of IBD files can effectively improve throughput, but there are some disadvantages:
- This method does not work if the size of the table exceeds the size of the INNODB cache pool
- During the full table scan, reading the entire IBD file means not only that you need to read the primary key page, but also that you need to read the level two index page and some other unwanted page and save it in the cache pool, although only primary key page is actually needed. If you have a large number of two-level indexes, this method will not work effectively
- The application needs to make some modifications to invoke the UDF
This looks like a good enough solution, but our database design team has come up with a better solution called "Logical prefetching" (Logical read ahead), so we don't choose the UDF method.
Logical pre-read
the workflow for logical prefetching (LRA) is as follows:
- Read some branches of the primary key page
- Calculate the number of leaf page
- Reads in order of page number (most of which are sequential disk reads) Some (by the amount of configuration control) leaf page
- Read rows in order of primary key
The entire process is shown in Figure 2:
Fig 2:logical Read ahead
Logical prefetching solves the problem of physical prefetching. LRA enables InnoDB to read only the primary key page (no two-level index page is required), and the number of prefetch pages per read is controllable. In addition, LRA does not need to make any changes to the SQL syntax.
To make LRA work, we need to add two session variables. One is "Innodb_lra_size", which controls the prefetch leaf page size. The other is "Innodb_lra_sleep", which controls how long it sleeps between each pre read. We tested with 512MB~4096MB size and 50 milliseconds of sleep time, so far we haven't experienced any serious problems (such as crash/block/inconsistency, etc.). These session variables are set only when the entire table is needed. In our application, mysqldump and other auxiliary scripts enable logical prefetching.
Submit multiple Async I/O requests at a time
We note that another cause of performance problem is that InnoDB reads only one page per I/O, even if the prefetching technique is turned on. Reading only 16KB at a time is too small for sequential reads, and efficiency is much lower than large reading units.
In version 5.6, INNODB uses Linux local I/O by default. If multiple consecutive 16KB read requests are submitted at a time, Linux will incorporate these requests internally, and read operations can be performed more efficiently. Unfortunately, InnoDB will only submit one page of I/O requests at a time. I submitted a bug report#68659. As written in the bug report, in a contemporary HDD RAID 1+0 environment, if I submit 64 consecutive page read requests at a time, I can get more than 1000mb/s hard drive read speed If only one page read request is submitted at a time, we can only get 160mb/s hard drive read speed.
To make the LRA work better in our application environment, we fixed the problem. In our MySQL, InnoDB submits multiple page I/O requests before calling Io_submit ().
Benchmark Test
of all the tests, we used a database table (a paginated table) in the production environment.
1. Pure HDD Environment full table scan (basic benchmark, no other workload)
2. Online schema change under heavy workload
* Dump time only, not counting data loading time
Source
All of the enhancements we make can be obtained on the GitHub.
- -Logical pre-read implementation: diff
- -Submit multiple I/O requests at a time: diff
- -Enable logical prefetching:D in Mydqldump iff
Conclusions
InnoDB is not efficient for full table scanning, so we have made some changes to it. I have improved in two aspects, one is to implement the logical prefetching, one is to implement a multiple async read I/O request. We have 8-18 times the performance improvement for database tables in our production environment, which is useful for reducing backup time, pattern modification time, and so on. I want these features to be able to get official Oracle support in InnoDB, or at least the main MySQL branch.