Linux sync I/O: sync, fsync and fdatasync




Traditional UNIX implementations have a buffer cache or page cache in the kernel, and most disk I/O goes through these buffers. When data is written to a file, the kernel usually copies it into one of the buffers; if the buffer is not yet full, it is not queued for output. Instead, it waits until the buffer fills up, or until the kernel needs to reuse the buffer to hold other disk block data, at which point the buffer is placed on the output queue, and the actual I/O is performed when it reaches the head of that queue. This kind of output is called delayed write (Bach [1986] discusses the buffer cache in detail in chapter 3).
Delayed writes reduce the number of disk reads and writes, but they also delay file content updates, so data written to a file may not reach the disk for some time. If the system crashes, this delay can cause the updated content to be lost. To keep the actual file system on disk consistent with the contents of the buffer cache, UNIX systems provide three functions: sync, fsync, and fdatasync.
The sync function simply queues all modified block buffers onto the write queue and then returns; it does not wait for the actual disk writes to finish.
A system daemon, commonly called update, calls sync periodically (typically every 30 seconds), ensuring that the kernel's block buffers are flushed regularly. The sync(1) command also calls the sync function.
The fsync function works only on the single file referred to by the file descriptor filedes, and it waits for the disk writes to finish before returning. fsync is used by applications such as databases, which need to make sure that modified blocks are written to disk immediately.
The fdatasync function is similar to fsync, but it affects only the data portion of the file. In addition to the data, fsync also synchronizes the file's updated attributes.
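For quick reference, a minimal summary of the three prototypes as declared in <unistd.h> on Linux/POSIX systems (the comments are paraphrases, not the man-page text):

#include <unistd.h>

void sync(void);        /* schedule all dirty buffers for writing; returns without waiting */
int  fsync(int fd);     /* flush data and metadata of fd; blocks until the writes complete */
int  fdatasync(int fd); /* flush data of fd; flush metadata only when required */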

In a database that provides transaction support, when a transaction commits, the transaction log (which contains the modifications and a commit record) must be fully written to the hard disk before the commit returns to the application layer.

A simple question: on a *nix operating system, how do you ensure that updated file content is successfully persisted to the hard disk?

1. write is not enough; you need fsync

In general, a write to a file on the hard disk (or another persistent storage device) only updates the in-memory page cache; dirty pages are not written to the hard disk immediately, but are scheduled by the operating system. Dedicated flusher kernel threads write dirty pages back to the hard disk (that is, into the device's I/O request queue) when certain conditions are met, such as after a certain time interval or when the proportion of dirty pages in memory exceeds a threshold.

Because write returns before the hard disk I/O completes, data may be lost if the OS crashes after the write call but before the dirty pages are written back. Although this time window is small, the "loosely asynchronous" semantics provided by write() are not enough for a database program that must guarantee transactional durability and consistency; such programs need the synchronized I/O primitives provided by the OS:

#include <unistd.h>

int fsync(int fd);

fsync ensures that all modified content of the file referred to by fd has been correctly written to the hard disk; the call blocks until the device reports that the I/O has completed.
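Below is a minimal sketch of the write-then-fsync pattern just described; the file name journal.log is hypothetical, and the data can be considered durable only after fsync() returns successfully:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *msg = "commit record\n";

    int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644); /* hypothetical file */
    if (fd < 0) { perror("open"); exit(1); }

    if (write(fd, msg, strlen(msg)) < 0) { perror("write"); exit(1); }  /* only dirties the page cache */

    if (fsync(fd) < 0) { perror("fsync"); exit(1); }  /* blocks until data and metadata reach the disk */

    close(fd);
    return 0;
}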

PS: If you use memory-mapped files for file I/O (using mmap to map the file's page cache directly into the process's address space and modifying the file by writing to memory), there is a similar system call to ensure that modifications are fully synchronized to the hard disk:

#include <sys/mman.h>

int msync(void *addr, size_t length, int flags);

msync requires the caller to specify the address range to synchronize. This fine-grained control looks more efficient than fsync (since the application usually knows where its dirty pages are), but in practice the (Linux) kernel has a very efficient data structure that can quickly find a file's dirty pages, which already allows fsync to synchronize only the modified contents of the file.
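A minimal sketch of the memory-mapped variant, assuming data.bin is a hypothetical file that is already at least one page (4096 bytes) long:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDWR);            /* hypothetical file, pre-sized to >= 4096 bytes */
    if (fd < 0) { perror("open"); exit(1); }

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); exit(1); }

    memcpy(p, "hello", 5);                        /* modify the file by writing memory */

    /* flush only the modified range; MS_SYNC blocks until the write-back completes */
    if (msync(p, 4096, MS_SYNC) < 0) { perror("msync"); exit(1); }

    munmap(p, 4096);
    close(fd);
    return 0;
}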

2. fsync performance issues: use fdatasync

In addition to synchronizing a file's modified content (dirty pages), fsync also synchronizes the file's metadata (its descriptive information, including size, access time st_atime, modification time st_mtime, and so on). Because a file's data and metadata usually live in different places on the hard disk, fsync requires at least two I/O write operations. fsync's man page says:

" Unfortunately fsync ()  will always Initialize-write Operations:one for the newly written data and another one in order to update the modification time stored in the inode. If The modification time is not a part of the transaction Concept fdatasync () /span> can is used to avoid unnecessary inode disk write operations. "

How expensive is this extra I/O operation? According to Wikipedia, the average seek time of current hard drives is about 3~15 ms, and the average rotational latency of a 7200 rpm drive is about 4 ms, so one I/O operation takes around 10 ms. What does this number mean? We will come back to it below.

POSIX also defines fdatasync, which relaxes the synchronization semantics to improve performance:

#include <unistd.h>

int fdatasync(int fd);

The fdatasync function is similar to fsync, but it synchronizes metadata only when necessary, which reduces the I/O writes to one. So what counts as "necessary"? According to the man page:

"Fdatasync does not flush modified metadata unless this metadata is needed in order to allow a subsequent data retrieval T o be corretly handled. "

For example, if the file's size (st_size) changes, it must be synchronized immediately; otherwise, if the OS crashes, the modified content cannot be read even though the file's data portion was synchronized, because the metadata was not. The last access time (atime) and modification time (mtime), on the other hand, do not need to be synchronized every time; as long as the application has no strict requirements on these two timestamps, skipping them is basically harmless.

PS: The open flags O_SYNC/O_DSYNC have semantics similar to fsync/fdatasync: every write blocks until the hard disk I/O completes. (In fact, Linux's handling of O_SYNC/O_DSYNC did not meet the POSIX requirements; both implemented the semantics of fdatasync.) Compared with fsync/fdatasync, this setting is less flexible and should be used sparingly.
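For completeness, a minimal sketch of the O_DSYNC variant (journal.log is again a hypothetical name); with this flag every write() behaves roughly as if it were followed by fdatasync():

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* every write() blocks until the data (and any required metadata) is on disk */
    int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND | O_DSYNC, 0644);
    if (fd < 0) { perror("open"); exit(1); }

    if (write(fd, "record\n", 7) < 0) { perror("write"); exit(1); }

    close(fd);
    return 0;
}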

3. Using fdatasync to optimize log synchronization

As mentioned at the beginning of the article, to satisfy transactional requirements, a database's log files often need synchronous I/O. Because a transaction commit must wait for the hard disk I/O to complete, commits are often time-consuming and become a performance bottleneck.

Under Berkeley DB, if auto_commit is turned on (so every independent write automatically has transactional semantics) and the default synchronization level is used (the log is fully synchronized to the hard disk before returning), writing a single record takes roughly 5~10 ms, about the same as one I/O operation (10 ms).

We already know that fsync is inefficient for this kind of synchronization. However, to use fdatasync and avoid the metadata update, you must ensure that the file's size does not change across the write. Log files are inherently append-only and always growing, so at first glance it seems hard to take advantage of fdatasync.

Let's see how Berkeley DB handles its log files:

1. Each log file has a fixed size of 10MB; the files are numbered starting from 1, with the name format "log.%010d".

2. When a log file is created, its last page is written first, extending the log file to its full 10MB size.

3. When appending records to the log file, fdatasync can be used because the file's size does not change, which greatly improves the efficiency of writing the log (see the sketch after this list).

4. When a log file is full, a new log file is created, incurring the metadata synchronization overhead only once.
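Below is a minimal sketch of this pre-allocation idea, not Berkeley DB's actual code; the file name and record content are hypothetical. The log file is extended to its full size up front, so later appends do not change st_size and fdatasync() can skip the inode update:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define LOG_SIZE (10 * 1024 * 1024)   /* fixed 10MB log file, as described above */

int main(void)
{
    int fd = open("log.0000000001", O_RDWR | O_CREAT, 0644);  /* hypothetical name */
    if (fd < 0) { perror("open"); exit(1); }

    /* pre-allocate: write one byte at the last position so the file size is fixed at 10MB */
    char zero = 0;
    if (pwrite(fd, &zero, 1, LOG_SIZE - 1) != 1) { perror("pwrite"); exit(1); }
    if (fsync(fd) < 0) { perror("fsync"); exit(1); }           /* metadata synced once, at creation */

    /* append records at a tracked offset; the file size no longer changes */
    off_t offset = 0;
    const char *rec = "commit record\n";
    if (pwrite(fd, rec, strlen(rec), offset) < 0) { perror("pwrite"); exit(1); }

    if (fdatasync(fd) < 0) { perror("fdatasync"); exit(1); }   /* one data write, no inode update */

    close(fd);
    return 0;
}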


