Details of How Data Lands on Disk in a Linux System

This article is excerpted from a longer original that uses the MySQL InnoDB system as an example to describe how data passes through the various levels of buffers and caches; other systems work on similar principles.

3. VFS Layer

The buffer at this layer lives in host memory. Its main purpose is to buffer data at the operating-system layer so that slow block-device reads and writes do not drag down IO response time.

3.1. A Closer Look at the O_DIRECT/O_SYNC Flags

The earlier discussion of the redo log buffer and the InnoDB buffer pool touched on many data-flushing and data-safety issues. In this section we look specifically at what the O_DIRECT and O_SYNC flags mean.

When we open a file and write data, how do the VFS and the file system get that data down to the hardware layer? The following diagram shows the key data structures:

Figure 4 VFS Cache diagram

The figure is adapted from the Linux kernel's VFS layer.

As the diagram shows, the page cache, buffer cache, inode cache, and directory cache all sit at this layer. The page cache and buffer cache mainly buffer file data and block-device data, while the inode cache buffers inodes and the directory cache buffers directory structure data.

Depending on the file system and operating system, writing to a file generally consists of two parts: writing the data itself, and writing the file attributes (metadata, which here includes directory entries, the inode, and so on).

With this in mind, it becomes easier to describe what each flag means:

                        Page cache      Buffer cache    Inode cache        Directory cache
O_DIRECT                Write bypass    Write bypass    Write & No Flush   Write & No Flush
O_DSYNC / fdatasync()   Write & Flush   Write & Flush   Write & No Flush   Write & No Flush
O_SYNC / fsync()        Write & Flush   Write & Flush   Write & Flush      Write & Flush

Table 3 VFS Cache Refresh Table

The difference between O_DSYNC and fdatasync() lies in when the corresponding page cache and buffer cache are flushed: with O_DSYNC they are flushed on every IO submission (that is, before each write returns success), whereas with fdatasync() the relevant page cache and buffer cache entries are flushed only at the moment fdatasync() is called, after some amount of data has been written. The difference between O_SYNC and fsync() is analogous.

The main difference between the page cache and the buffer cache is that one is for actual file data and the other is for block devices. If you open() a file that lives on a file system created with mkfs from the VFS layer, both the page cache and the buffer cache are used; if you use dd to operate directly on a Linux block device, only the buffer cache is used.

The difference between O_DSYNC and O_SYNC is that O_DSYNC tells the kernel that a write to a file completes (write returns success) only once the data has been written to the disk. O_SYNC is stricter than O_DSYNC: not only must the data have reached the disk, but the corresponding file attributes (such as the file inode and related directory changes) must also have been updated before the write is considered complete. Clearly, O_SYNC does more than O_DSYNC.
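
To make Table 3 concrete, here is a minimal C sketch (not taken from InnoDB; the file names and sizes are made up for illustration) that writes the same payload three ways: with O_DSYNC, with buffered writes plus fdatasync(), and with buffered writes plus fsync():

/* Minimal sketch: the same payload written three ways. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    const char buf[512] = "payload";

    /* 1. O_DSYNC: each write() returns only after the data (but not all
     *    metadata) has reached the storage device. */
    int fd1 = open("dsync.dat", O_WRONLY | O_CREAT | O_DSYNC, 0644);
    if (fd1 < 0 || write(fd1, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
        perror("O_DSYNC write");
    if (fd1 >= 0)
        close(fd1);

    /* 2. Buffered writes + fdatasync(): data sits in the page cache until
     *    we explicitly ask for it to be flushed. */
    int fd2 = open("fdatasync.dat", O_WRONLY | O_CREAT, 0644);
    if (fd2 >= 0) {
        for (int i = 0; i < 4; i++)
            write(fd2, buf, sizeof(buf));   /* cached, returns quickly */
        fdatasync(fd2);                     /* flush the data now */
        close(fd2);
    }

    /* 3. fsync(): like fdatasync(), but also flushes the file metadata,
     *    matching the O_SYNC/fsync() row of Table 3. */
    int fd3 = open("fsync.dat", O_WRONLY | O_CREAT, 0644);
    if (fd3 >= 0) {
        write(fd3, buf, sizeof(buf));
        fsync(fd3);                         /* flush data + metadata */
        close(fd3);
    }
    return 0;
}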

- open() also accepts an O_ASYNC flag (see the reference). It is mainly used for terminals, pseudoterminals, sockets, and pipes/FIFOs, and enables signal-driven IO: when the device becomes readable or writable, a signal (SIGIO) is delivered, and the application process catches it and performs the IO operation.

- O_SYNC and O_DIRECT are both synchronous; that is, the call returns only once the write has succeeded.

Looking back at the innodb_flush_log_at_trx_commit configuration, we can now understand it better: O_DIRECT (direct IO) bypasses the page cache/buffer cache, which is why an fsync() is still needed to flush the directory cache and inode cache metadata to the storage device.

Thanks to kernel and file-system improvements, some file systems can guarantee that using O_DIRECT without an fsync() to synchronize metadata will not lead to data-safety issues, so InnoDB also provides an O_DIRECT_NO_FSYNC mode.

Of course, O_DIRECT affects both reads and writes. For reads in particular, it guarantees that data is read from the storage device rather than from the cache, avoiding inconsistencies between cached data and the data on the storage device (for example, when you update the underlying block device through DRBD; for non-distributed file systems, the cache contents can then diverge from the data on the storage device). Since we are mainly concerned with write buffering here, we will not discuss that further.
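
As a rough illustration of the pattern described above, the following C sketch (not InnoDB's actual code; the file name, IO size, and 512-byte alignment are assumptions) opens a file with O_DIRECT, writes an aligned buffer, and then still calls fsync() so that the inode/directory metadata reaches the storage device:

/* Minimal sketch of O_DIRECT followed by fsync() for the metadata. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    const size_t align = 512;            /* assumed logical block size  */
    const size_t len   = 16 * 1024;      /* must be a multiple of align */
    void *buf;

    /* O_DIRECT requires the buffer address, file offset and IO size to be
     * aligned, so a plain malloc() buffer is not sufficient. */
    if (posix_memalign(&buf, align, len) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    memset(buf, 0xAB, len);

    int fd = open("direct-io.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open(O_DIRECT)");
        return 1;
    }

    /* The data itself bypasses the page cache/buffer cache ... */
    if (write(fd, buf, len) != (ssize_t)len)
        perror("write");

    /* ... but the inode/directory metadata does not, which is why an
     * explicit fsync() is still needed to push it to the storage device. */
    if (fsync(fd) != 0)
        perror("fsync");

    close(fd);
    free(buf);
    return 0;
}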

3.2. Advantages and Disadvantages of O_DIRECT

O_DIRECT is the recommended value for the innodb_flush_method parameter in most cases; the Percona Server branch even provides ALL_O_DIRECT, which opens the log files in O_DIRECT mode as well.

3.2.1. Advantages

- Saves operating-system memory: O_DIRECT bypasses the page cache/buffer cache, so InnoDB does not consume extra operating-system memory for reading and writing data, leaving more memory for the InnoDB buffer pool.

- Saves CPU: data moves between memory and the storage device mainly via polling, interrupts, or DMA. Using O_DIRECT hints to the operating system to use DMA as much as possible when talking to the storage device, which saves CPU.

3.2.2. Disadvantages

- Byte alignment required: O_DIRECT requires that the memory buffer be aligned when writing data (the exact alignment depends on the kernel and the file system), so writes need extra alignment work. You can find the alignment size in /sys/block/sda/queue/logical_block_size; it is typically 512 bytes (see the sketch after this list).

- No IO merging: O_DIRECT bypasses the page cache/buffer cache and writes directly to the storage device, so repeated writes to the same block cannot be combined in memory; the merge-write capability of the page cache/buffer cache does not apply.

- Lower sequential read/write efficiency: if a file is opened with O_DIRECT, reads and writes skip the cache and go straight to the storage device. Without the cache, sequential reads and writes made up of small IO requests through O_DIRECT are less efficient.
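
The following minimal C sketch (the device name sda and the example IO size/offset are assumptions) reads the required alignment from sysfs and checks that a buffer, length, and offset would satisfy O_DIRECT's alignment rules:

/* Minimal sketch: read the required alignment from sysfs and check it. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/types.h>

/* Return the logical block size the kernel reports for a device,
 * falling back to 512 bytes if the sysfs file cannot be read. */
static long logical_block_size(const char *dev)
{
    char path[128];
    long size = 512;                              /* conservative default */

    snprintf(path, sizeof(path),
             "/sys/block/%s/queue/logical_block_size", dev);
    FILE *f = fopen(path, "r");
    if (f != NULL) {
        if (fscanf(f, "%ld", &size) != 1)
            size = 512;
        fclose(f);
    }
    return size;
}

int main(void)
{
    long   align = logical_block_size("sda");     /* assumed device name */
    size_t len   = 8192;                          /* example IO size     */
    off_t  off   = 0;                             /* example file offset */
    void  *buf;

    if (posix_memalign(&buf, (size_t)align, len) != 0)
        return 1;

    /* O_DIRECT needs all three of these to be multiples of the block size. */
    int ok = ((uintptr_t)buf % (uintptr_t)align == 0) &&
             (len % (size_t)align == 0) &&
             (off % align == 0);

    printf("logical_block_size=%ld aligned=%s\n", align, ok ? "yes" : "no");
    free(buf);
    return 0;
}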

In general, setting innodb_flush_method to O_DIRECT is not a perfect fit for every application and scenario.

4. Storage Controller Layer

The buffer at this layer lives in the storage controller's onboard cache. Its main purpose is to buffer data at the storage-controller layer so that slow block-device reads and writes do not drag down IO response time.

When data is flushed down to the storage layer, for example by fsync(), it first reaches the storage-controller layer. A common storage controller is a RAID card, and most current RAID cards carry 1 GB or more of cache. This buffer is usually volatile storage; an onboard battery or capacitor guarantees that the data in this "volatile" cache can still be written to the underlying disk media after the machine loses power.

There are a few things we need to be aware of regarding storage controllers:

    1. Write back / write through:

For using the buffer, storage controllers generally offer two modes: write back and write through. In write-back mode, a write submitted by the operating system returns success as soon as the data lands in the controller's buffer; in write-through mode, a write submitted by the operating system returns success only after the data has actually been written to the underlying disk media.

    2. Battery/capacitor differences:

To guarantee that data in the "volatile" buffer can be flushed to the underlying disk media after a power failure, the storage controller carries a battery or capacitor. An ordinary battery suffers capacity decay, so every so often the onboard battery must go through a controlled charge/discharge cycle to preserve its capacity. A controller configured for write back automatically switches to write through while the battery is charging. This charge/discharge cycle (learn cycle) typically happens every 90 days; on LSI cards it can be inspected with MegaCli:

# MegaCli -AdpBbuCmd -GetBbuProperties -aALL
BBU Properties for Adapter: 0
Auto Learn Period: 90 days
Next Learn Time: Tue Oct 14 05:38:43 2014
Learn Delay Interval: 0 Hours
Auto-Learn Mode: Enabled

If you find that IO request response times suddenly slow down at regular intervals, this learn cycle may be the cause. You can dump the controller event log with MegaCli -AdpEventLog -GetEvents -f mr_AdpEventLog.txt -aALL and look for the event description "Battery started charging" to determine whether a charge/discharge cycle has occurred.

Because of these battery issues, newer RAID cards use a capacitor instead to guarantee that data in the "volatile" buffer can be flushed to the underlying disk media in time, so there is no charge/discharge problem.

    3. Read/write ratio:

The HP Smart Array lets you split the cache between reads and writes (the Accelerator Ratio):

hpacucli ctrl all show config detail | grep 'Accelerator Ratio'
Accelerator Ratio: 25% Read / 75% Write

This lets you set the ratio of read cache to write cache according to the needs of the actual application.

    4. Turn on direct IO:

To let upper-layer devices use direct IO and bypass the RAID card's cache, the RAID card needs to be switched to direct IO mode:

/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp -Direct -Immediate -Lall -aALL

    5. LSI Flash RAID:

Above we talked about the "volatile" buffer. What if we had a non-volatile buffer with a capacity of hundreds of gigabytes? Wouldn't such a storage-controller buffer accelerate the underlying devices even more? As a veteran RAID card vendor, LSI now offers exactly such a storage controller; applications that use write back and rely heavily on the storage-controller buffer may want to evaluate it.

    6. Write barriers:

Currently Linux cannot tell whether a RAID card's cache is protected by a battery or capacitor, so to preserve the consistency of journaling file systems it enables write barriers by default, which keep flushing the "volatile" buffer and can greatly reduce IO performance. If you are certain that the battery/capacitor can guarantee that the "volatile" buffer will be flushed to the underlying disk device, you can mount the disk with -o nobarrier.

5. Disk Controller Layer

The buffer at this layer lives in the disk's onboard cache. The storage-device firmware reorders write operations according to its own rules before committing them to the media, mainly so that, on a mechanical disk, a single head movement can complete as many write operations as possible.

In general, the DMA controller sits at the disk layer; transferring data directly through the DMA controller's memory access saves CPU resources.

For mechanical hard drives, ordinary disk devices carry no battery or capacitor, so there is no guarantee that all the data in the disk cache will reach the media in time when the machine loses power. We therefore strongly recommend turning the disk cache off.

The disk cache can be turned off at the storage-controller level. For example, the command to disable it with MegaCli is as follows:

MegaCli -LDSetProp -DisDskCache -Lall -aALL
