A closer look at how InnoDB data lands on disk

This article is from WoquTech: http://www.woqutech.com/?p=1459
1. Overview

Many earlier articles have covered the various scenarios in which MySQL's InnoDB storage engine flushes data. This article works its way down from InnoDB to see which buffers the data passes through on its way from InnoDB's memory to the storage media it is ultimately written to.

Let's look at the relevant buffers:

Figure 1 All the buffers between InnoDB and the storage device

From the figure we can see that data travelling from InnoDB to disk has to pass through:

    1. The InnoDB buffer pool and the redo log buffer. These are InnoDB's own application-level buffers.
    2. The page cache/buffer cache (can be bypassed with O_DIRECT). This is a buffer at the VFS layer.
    3. The inode cache/directory cache. Also a VFS-layer buffer; it has to be flushed with O_SYNC or fsync().
    4. The storage controller's write-back buffer (can be bypassed by setting storage controller parameters).
    5. The disk on-board buffer (can be bypassed by setting disk controller parameters).

Here we use the term "buffer" for the staging of data writes and the term "cache" for the staging of data reads. As the names imply, because of the speed gap between the underlying storage device and memory, a buffer temporarily absorbs the burst of IO headed for the slow storage device, while a cache keeps data read from disk in memory so that subsequent accesses do not have to go back to the slow storage device.

For more on the distinction between buffer and cache, see Peng Lixun's article:

http://www.penglixun.com/tech/system/buffer_and_cache_diff.html

Let's go through these buffers in detail, from top to bottom.

2. InnoDB Layer

This layer of buffering lives in host memory. Its purpose is to let the application manage its own data, so that slow read and write operations on the underlying storage do not affect InnoDB's response time.

The InnoDB layer contains two buffers: the redo log buffer and the InnoDB buffer pool. The redo log buffer stages writes to the redo log, while the InnoDB buffer pool holds InnoDB data read from disk and also buffers InnoDB writes, i.e. dirty pages. If the host loses power or MySQL crashes, any InnoDB buffer pool pages that have not been flushed to disk in time can only be recovered by rolling forward from the last checkpoint using the redo log; and if the redo log buffer itself was not flushed to disk in time, the lost redo records mean that even a roll-forward cannot recover committed transactions, because they were never actually recorded on non-volatile disk media.

The parameter that controls when the redo log buffer is flushed is innodb_flush_log_at_trx_commit, while the parameter that controls how the redo log and the InnoDB buffer pool are flushed is innodb_flush_method. Both parameters have already been covered in detail in many articles; here we analyze them mainly from the buffering perspective.
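
As a quick check of the current configuration, the following sketch (our own addition, assuming the pymysql package and placeholder connection credentials) reads both variables from a running server:

import pymysql

# Placeholder credentials; adjust for your environment.
conn = pymysql.connect(host="127.0.0.1", user="root", password="secret")
try:
    with conn.cursor() as cur:
        cur.execute(
            "SHOW GLOBAL VARIABLES WHERE Variable_name IN "
            "('innodb_flush_log_at_trx_commit', 'innodb_flush_method')"
        )
        for name, value in cur.fetchall():
            print(name, "=", value)
finally:
    conn.close()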

2.1. Innodb_flush_log_at_trx_commit

innodb_flush_log_at_trx_commit, which controls the redo log buffer, currently supports three values: 0, 1, and 2.

Figure 2 Innodb_flush_log_at_trx_commit

To save some effort we borrow an existing figure here. Note also the change in the behavior of innodb_flush_log_at_trx_commit=2 in 5.6:

Before 5.6.6: data is flushed from the redo log buffer to disk every second.

From 5.6.6 on: data is flushed to disk every innodb_flush_log_at_timeout seconds.

We will not go into more detail here; see the earlier discussion of MySQL data loss.
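
To make the three values concrete, here is a minimal sketch in Python (our own illustration, not InnoDB code) of what happens to the redo log buffer at commit time and on the background timer under each setting:

# Simplified model of innodb_flush_log_at_trx_commit (illustration only).

def on_commit(setting):
    # What InnoDB does with the redo log buffer when a transaction commits.
    if setting == 1:
        return ["write() log buffer to OS", "fsync() log file to disk"]  # default, safest
    if setting == 2:
        return ["write() log buffer to OS"]   # fsync() is deferred to the timer below
    return []                                 # 0: nothing happens at commit time

def on_timer(setting):
    # Background flush, roughly once per second
    # (every innodb_flush_log_at_timeout seconds from MySQL 5.6.6 on).
    if setting == 0:
        return ["write() log buffer to OS", "fsync() log file to disk"]
    if setting == 2:
        return ["fsync() log file to disk"]
    return []                                 # 1: commits are already durable

for s in (0, 1, 2):
    print(s, "commit:", on_commit(s), "timer:", on_timer(s))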

2.2. Innodb_flush_method

innodb_flush_method, which controls how the InnoDB buffer pool (and log) is flushed, currently supports four different values:

- fdatasync
- O_DSYNC
- O_DIRECT
- O_DIRECT_NO_FSYNC

A few things to note here:

    1. innodb_flush_method specifies not only how the data files are flushed, but also how the log files are flushed.
    2. None of these values apply on Windows (has everyone stopped paying attention to Brother Gates?). The comments in the source note that Windows uses async_unbuffered and does not allow it to be changed, which is why it is not in the list.
    3. Only the first three values are allowed in 5.6.6 and earlier; O_DIRECT_NO_FSYNC was added in 5.6.7. It means the file is opened with O_DIRECT but fsync() is not called to synchronize the data. This works because, on newer Linux kernels with certain file systems, O_DIRECT alone is enough to guarantee data safety: no extra fsync() is needed to make sure the metadata also reaches non-volatile disk media. XFS, for example, cannot use this value. O_DIRECT bypasses the page cache, so why is fsync() still needed afterwards? We discuss that in the next section.
    4. Some people will say the reference manual has a small bug: before 5.6.6 the default is fdatasync, yet fdatasync is not listed among the values that can be specified.

System Variable Name: innodb_flush_method
Variable Scope: Global
Dynamic Variable: No

Permitted Values (<= 5.6.6)
  Type (Linux): String
  Default: fdatasync
  Valid Values: O_DSYNC, O_DIRECT

Table 1 innodb_flush_method permitted values

In fact this is intentional, because fdatasync() and fsync() are not the same thing, just as O_DSYNC and O_SYNC differ. fdatasync() and O_DSYNC synchronize only the data, while fsync() and O_SYNC synchronize both the data and the metadata. However, MySQL uses the parameter value fdatasync to mean that both the data files and the log files are flushed with fsync() (note: not fdatasync()). This is a historical accident, so 5.6 deliberately removed it from the list of valid values to avoid misunderstanding. Of course, if you still want fsync() to be used for synchronization, simply leave innodb_flush_method unset.

    1. Except for O_DIRECT_NO_FSYNC, InnoDB always uses fsync() to flush the data files.
    2. If you specify O_DIRECT or O_DIRECT_NO_FSYNC, the data files are opened with O_DIRECT (on Solaris, directio() is used instead; and if the InnoDB data files are placed on a separate device, mounting it with forcedirectio makes the whole file system use directio(). The reason this applies to InnoDB rather than all of MySQL is that MyISAM does not use directio()).

Enough talk; the table and figure below illustrate this more intuitively.

Reworking orczhou's flush relationship table:

innodb_flush_method     | Open log file | Flush log | Open data file | Flush data file
fdatasync               |               | fsync()   |                | fsync()
O_DSYNC                 | O_SYNC        |           |                | fsync()
O_DIRECT                |               | fsync()   | O_DIRECT       | fsync()
O_DIRECT_NO_FSYNC       |               | fsync()   | O_DIRECT       |
ALL_O_DIRECT (Percona)  | O_DIRECT      | fsync()   | O_DIRECT       | fsync()

Table 2 How each innodb_flush_method value opens and flushes the log files and data files

Figure 3 innodb_flush_method data file and log flushing
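
For reference in code, the same relationships can be written down as a small lookup structure (our own illustration; ALL_O_DIRECT exists only in Percona Server):

# Which open flag and flush call each innodb_flush_method value uses,
# for the log files and the data files (None means the plain default).
FLUSH_METHODS = {
    # value:              (open log,   flush log, open datafile, flush datafile)
    "fdatasync":          (None,       "fsync()", None,          "fsync()"),
    "O_DSYNC":            ("O_SYNC",   None,      None,          "fsync()"),
    "O_DIRECT":           (None,       "fsync()", "O_DIRECT",    "fsync()"),
    "O_DIRECT_NO_FSYNC":  (None,       "fsync()", "O_DIRECT",    None),
    "ALL_O_DIRECT":       ("O_DIRECT", "fsync()", "O_DIRECT",    "fsync()"),  # Percona Server only
}

for method, (open_log, flush_log, open_data, flush_data) in FLUSH_METHODS.items():
    print(method, "log:", open_log, flush_log, "| data:", open_data, flush_data)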

3. VFS Layer

The buffers at this layer also live in host memory. Their purpose is to buffer data at the operating-system level, so that slow block-device reads and writes do not affect IO response time.

3.1. A closer look at the O_DIRECT/O_SYNC flags

The earlier discussion of the redo log buffer and the InnoDB buffer pool touched on many data-flushing and data-safety issues; in this section we look specifically at what the O_DIRECT and O_SYNC flags mean.

When we open a file and write data to it, how do the VFS and the file system get that data down to the hardware? The following diagram shows the key data structures:

Figure 4 VFS Cache diagram

The figure is referenced from the Linux kernel's VFS Layer.

From the diagram we can see that the page cache, buffer cache, inode cache, and directory cache all live mainly at this layer. The page cache and buffer cache buffer file data and block-device data respectively, while the inode cache buffers inodes and the directory cache buffers directory structure data.

Depending on the file system and operating system, a write to a file generally consists of two parts: the write of the data itself, and the write of the file attributes (metadata, here including the directory entry, the inode, and so on).

With this in mind, the meaning of each flag is easier to explain:

                      | Page cache    | Buffer cache  | Inode cache     | Directory cache
O_DIRECT              | write bypass  | write bypass  | write, no flush | write, no flush
O_DSYNC / fdatasync() | write & flush | write & flush | write, no flush | write, no flush
O_SYNC / fsync()      | write & flush | write & flush | write & flush   | write & flush

Table 3 How each flag or call writes to and flushes the VFS caches

The difference between O_DSYNC and fdatasync() is whether the corresponding page cache and buffer cache entries are flushed on every IO submission (O_DSYNC), or whether the file's page cache and buffer cache are flushed only when fdatasync() is called after a batch of writes. O_SYNC and fsync() differ in the same way.

The main difference between the page cache and the buffer cache is that one holds actual file data and the other holds block-device data. If you use open() at the VFS layer on a file in a file system created with mkfs, you go through both the page cache and the buffer cache; if you use dd to operate directly on a Linux block device, you go through only the buffer cache.

The difference between O_DSYNC and O_SYNC is that O_DSYNC tells the kernel that a write to the file completes (write() returns success) only once the data has been written to the disk. O_SYNC is stricter: not only must the data have been written to disk, the corresponding file attributes (the inode, related directory changes, and so on) must also have been updated before the write is considered complete. Clearly O_SYNC does more work than O_DSYNC.

- The open() reference also lists O_ASYNC, which is mainly used for terminals, pseudoterminals, sockets, and pipes/FIFOs. It is signal-driven IO: when the device becomes readable or writable, a signal (SIGIO) is sent, and the application process catches it and performs the IO.

- Both O_SYNC and O_DIRECT are synchronous writes: the call returns only after the write has succeeded.
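
A minimal sketch (our own addition, assuming Linux and Python 3) of the two ways to obtain these guarantees: opening the file with O_DSYNC/O_SYNC so every write() is synchronized, versus writing normally and calling fdatasync()/fsync() afterwards:

import os

# Variant 1: the open flag synchronizes every write().
fd = os.open("datafile", os.O_CREAT | os.O_WRONLY | os.O_DSYNC)  # os.O_SYNC would also flush metadata
os.write(fd, b"payload")   # returns only once the data has reached the device
os.close(fd)

# Variant 2: ordinary buffered writes, synchronized explicitly afterwards.
fd = os.open("datafile", os.O_CREAT | os.O_WRONLY)
os.write(fd, b"payload")   # may land only in the page cache
os.fdatasync(fd)           # flush the data (like O_DSYNC); os.fsync(fd) also flushes metadata
os.close(fd)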

Looking back at the innodb_flush_method configuration with this in mind, it becomes clearer why O_DIRECT, which bypasses the page cache and buffer cache, still needs fsync(): fsync() is what flushes the directory cache and inode cache metadata to the storage device.

Thanks to kernel and file-system improvements, some file systems can guarantee that using O_DIRECT without an fsync() for the metadata does not lead to data-safety problems, which is why InnoDB provides the O_DIRECT_NO_FSYNC option.

Of course, O_DIRECT affects both reads and writes. For reads in particular, it guarantees that data is read from the storage device rather than from the cache, avoiding inconsistencies between the cache and the device (for example, if the underlying block device is updated through DRBD, on a non-distributed file system the cached contents will no longer match the data on the device). But since we are mainly discussing write buffering here, we will not go into that further.

3.2. Advantages and disadvantages of O_DIRECT

O_DIRECT is the most commonly recommended innodb_flush_method value, and the Percona Server branch even provides ALL_O_DIRECT, which opens the log files with O_DIRECT as well.

3.2.1. Advantages:

- Saves operating-system memory: O_DIRECT bypasses the page cache/buffer cache, so InnoDB's reads and writes consume no extra OS memory, leaving more memory for the InnoDB buffer pool.

- Saves CPU. Data moves between memory and storage devices mainly by polling, interrupts, or DMA; O_DIRECT hints to the operating system to use DMA as much as possible for storage-device operations, saving CPU.

3.2.2. Disadvantages

- Byte alignment. O_DIRECT requires that the memory buffer be suitably aligned when writing (the required alignment depends on the kernel and the file system), so writes need extra alignment work. You can find the alignment size in /sys/block/sda/queue/logical_block_size; it is typically 512 bytes (see the sketch after this list).

- No IO merging. O_DIRECT bypasses the page cache/buffer cache and writes directly to the storage device, so repeated writes to the same block cannot be combined in memory: the write merging normally done in the page cache/buffer cache does not apply.

- Lower sequential read/write efficiency. If a file is opened with O_DIRECT, reads and writes skip the cache and go straight to the storage device. With no cache to absorb them, sequential access made up of small IO requests becomes less efficient under O_DIRECT.
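
As a small illustration of the alignment requirement (our own sketch, assuming Linux; sda is just an example device name), the alignment size can be read from sysfs and used to round write sizes up:

# Example device path; adjust for your system.
with open("/sys/block/sda/queue/logical_block_size") as f:
    block_size = int(f.read())            # typically 512

def aligned_length(nbytes, align=None):
    # Round a write length up to a multiple of the alignment O_DIRECT requires.
    align = align or block_size
    return (nbytes + align - 1) // align * align

print(block_size, aligned_length(1000))   # e.g. 512 1024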

In short, setting innodb_flush_method to O_DIRECT is not a 100% win for every application and scenario.

4. Storage Controller Layer

The buffer at this layer lives in the storage controller's onboard cache. Its purpose is to buffer data at the controller level so that slow block-device reads and writes do not affect IO response time.

When data is flushed toward the storage layer, for example by fsync(), it first arrives at the storage controller. A common storage controller is a RAID card, and most current RAID cards carry 1 GB or more of cache. This buffer is usually volatile memory, and an onboard battery or capacitor is used to ensure that the data in this volatile buffer can still be written out to the underlying disk media after the machine loses power.

There are a few things to be aware of with storage controllers:

    1. Write Back/write through:

Most storage controllers offer two ways of using the buffer: write back and write through. In write-back mode, a write submitted by the operating system returns success as soon as it reaches the buffer; in write-through mode, a write request returns success only once the data has actually been written to the underlying disk media.

    2. Battery versus capacitor:

To ensure that data in the volatile buffer can still be flushed to the underlying disk media after a power loss, the storage controller relies on a battery or a capacitor. An ordinary battery suffers from capacity fade, so every so often the onboard battery must go through a controlled charge/discharge cycle to preserve its capacity. A controller configured for write back automatically switches to write through while the battery is being charged. This charge/discharge cycle (learn cycle) typically runs every 90 days; on LSI cards it can be inspected with MegaCli:

# MegaCli -AdpBbuCmd -GetBbuProperties -aALL

BBU Properties for Adapter: 0
Auto Learn Period: 90 days
Next Learn time: Tue Oct 14 05:38:43 2014
Learn Delay Interval: 0 Hours
Auto-Learn Mode: Enabled

If IO response times suddenly slow down every so often, this may be the cause. You can dump the controller event log with MegaCli -AdpEventLog -GetEvents -f mr_AdpEventLog.txt -aALL and look for the event description "Battery started charging" to confirm whether a charge/discharge cycle occurred.

Because of these battery issues, newer RAID cards use a capacitor instead to guarantee that data in the volatile buffer can be flushed to the underlying disk media in time, so there is no charge/discharge problem.

    3. Read/write ratio:

HP Smart Array controllers let you split the cache between reads and writes (the Accelerator Ratio):

# hpacucli ctrl all show config detail | grep 'Accelerator Ratio'

Accelerator Ratio: 25% Read / 75% Write

This lets you tune the ratio of read cache to write buffer to suit the actual application.

    4. Enabling direct IO:

For upper layers to use direct IO that bypasses the RAID card cache, the RAID card must be set to Direct IO mode:

/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp -Direct -Immediate -LAll -aAll

    5. LSI flash RAID:

We mentioned the volatile buffer above. What if we had a non-volatile buffer with a capacity of hundreds of gigabytes: wouldn't such a storage-controller buffer speed up the underlying device even more? As a veteran RAID card vendor, LSI does offer such storage controllers, which are worth considering for applications that use write back and rely heavily on storage-controller buffering.

    6. Write barriers:

Linux cannot see whether the RAID card cache is protected by a battery or capacitor, so to keep the journaling file system consistent it enables write barriers by default, which constantly flush the volatile buffer and can greatly reduce IO performance. If you are sure the battery or capacitor guarantees that the volatile buffer will reach the underlying disk device, you can mount the file system with -o nobarrier.

5. Disk Controller Layer

The buffer at this layer lives in the disk controller's onboard cache. The storage device's firmware reorders write operations according to its own rules before committing them to the media. For mechanical disks the goal is mainly to order writes so that a single head movement can complete as many writes as possible.

The DMA controller generally sits at this disk layer as well; accessing memory directly through the DMA controller saves CPU resources.

For mechanical hard drives, ordinary disk devices have no battery or capacitor, so there is no guarantee that the data in the disk cache will reach the media if the machine goes down. We therefore strongly recommend turning the disk cache off.

The disk cache can be turned off at the storage-controller level. For example, with MegaCli:

MegaCli -LDSetProp -DisDskCache -LAll -aAll

6. Summary

From InnoDB all the way down to the final media there are many buffers, but their purpose is the same: to work around the speed mismatch between memory and disk, or put simply, the fact that disks are too slow.

Moreover, the application itself understands best whether its data needs to be buffered or cached; the VFS, the storage controller, and the disk can only defer writes (merging duplicate IO and turning random writes into sequential ones) to mitigate the slow response of the underlying storage. That is why database-type applications manage their own buffers and then try to avoid buffering in the operating system and the lower layers.

In fact, with today's SSDs and PCIe flash cards, the speed gap between memory and disk has narrowed considerably. Whether all of these buffers are still necessary, and what can be improved in hardware and in software, is a real challenge for hardware and software engineers.

Reference:

http://www.codeproject.com/Articles/460057/HDD-FS-O_SYNC-Throughput-vs-Integrity

http://rdc.taobao.com/blog/dba/html/296_innodb_flush_method_performance.html

http://www.orczhou.com/index.php/2009/08/innodb_flush_method-file-io/

http://blog.csdn.net/yuyin86/article/details/8113305

http://www.mtop.cc/node/100

https://www.usenix.org/legacy/event/usenix01/full_papers/kroeger/kroeger_html/node8.html

http://www.lsi.com/downloads/Public/Direct%20Assets/LSI/Benchmark_Tips.pdf

http://www.lsi.com/products/flash-accelerators/pages/default.aspx

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/writebarrieronoff.html

http://en.wikipedia.org/wiki/Direct_memory_access

http://www.hgst.com/tech/techlib.nsf/techdocs/DFE76984029D3BE586256FAB0058B1A8/$file/dma-white_paper_final.pdf

http://en.wikipedia.org/wiki/Disk_buffer

7. Appendix

7.1. Python code for O_DIRECT writes

The wrong way:

import os

f = os.open('file', os.O_CREAT | os.O_TRUNC | os.O_DIRECT | os.O_RDWR)
# A regular Python byte string is not aligned to the device's logical block
# size, so writing it through an O_DIRECT file descriptor fails with EINVAL.
s = b' ' * 1024
os.write(f, s)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 22] Invalid argument

The right way:

import os
import mmap

f = os.open('file', os.O_CREAT | os.O_DIRECT | os.O_TRUNC | os.O_RDWR)
# mmap hands back page-aligned memory, which satisfies the buffer-alignment
# requirement of O_DIRECT writes.
m = mmap.mmap(-1, 1024 * 1024)
s = b' ' * 1024 * 1024
m.write(s)
os.write(f, m)
os.close(f)
