Linux system calls: synchronized I/O with sync, fsync, and fdatasync (reposted)

Source: Internet
Author: User
Tags: posix

Reposted from: http://blog.csdn.net/cywosp/article/details/8767327


    1. write is not enough; fsync is needed
    2. fsync performance problems and fdatasync
    3. Using fdatasync to optimize log synchronization

Traditional UNIX implementations keep a buffer cache or page cache in the kernel, and most disk I/O goes through it. When data is written to a file, the kernel usually copies it into one of its buffers first. If the buffer is not yet full, it is not placed on the output queue; the kernel waits until the buffer fills, or until it needs to reuse the buffer for other disk block data, and only then queues it for output. The actual I/O is performed when the buffer reaches the head of the queue. This style of output is called delayed write (Bach [1986], chapter 3, discusses the buffer cache in detail).
Delayed write reduces the number of disk reads and writes, but it also slows the update of file contents on disk: data written to a file may not reach the disk for some time. If the system fails, this delay can cause updates to be lost. To keep the file system on disk consistent with the contents of the buffer cache, UNIX systems provide three functions: sync, fsync, and fdatasync.
The sync function simply queues all modified block buffers for writing and then returns; it does not wait for the disk writes to finish.
A system daemon, usually called update, calls sync periodically (typically every 30 seconds), which guarantees that the kernel's block buffers are flushed regularly. The sync(1) command also calls the sync function.
The fsync function applies only to the single file identified by the file descriptor filedes, and it waits for the disk writes to complete before returning. fsync is intended for applications such as databases that must be sure modified blocks have been written to disk.
The fdatasync function is similar to fsync, but it affects only the data portion of the file. fsync, in addition to the data, also synchronizes the file's updated attributes.
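
The difference between fsync and fdatasync can be made concrete with a small C sketch. The helper names write_all and save_durably are hypothetical and chosen only for illustration; the sketch assumes a POSIX system and keeps error handling minimal (for example, it does not retry on EINTR).

```c
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all of buf to fd, retrying on short writes. Returns 0 or -1. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0)
            return -1;              /* caller can inspect errno */
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical helper: create or truncate `path`, write `len` bytes, and
 * report success only after the kernel says the data is on the device.
 * data_only selects fdatasync() (data plus only the metadata needed to
 * read it back) instead of fsync() (data plus all file attributes).
 */
int save_durably(const char *path, const char *buf, size_t len, bool data_only)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    int rc = write_all(fd, buf, len);   /* data is only in the cache so far */
    if (rc == 0)
        rc = data_only ? fdatasync(fd) : fsync(fd);

    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```

A caller would use it as save_durably("journal.dat", record, record_len, false) and treat any non-zero result as "not known to be on disk". By contrast, sync() takes no descriptor and may return before the writes have finished, so it cannot give this per-file guarantee.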

For a database that provides transactional support, when a transaction commits, the transaction log (which contains the modification operations and a commit record) must be fully written to the hard disk before the commit is considered complete and control returns to the application tier.

A simple question: on a *nix operating system, how do you ensure that updates to a file have been successfully persisted to the hard disk?

1. write is not enough; fsync is needed

In general, a write to a file on a hard disk (or another persistent storage device) only updates the in-memory page cache. Dirty pages are not written to the disk immediately; the operating system schedules that work itself, for example through dedicated flusher kernel threads that write dirty pages to the disk (that is, place them on the device's I/O request queue) when certain conditions are met, such as a time interval elapsing or the proportion of dirty pages in memory reaching a threshold. Because the write call does not wait for the disk I/O to complete before returning, data may be lost if the OS crashes after write returns but before the disk has been synchronized. Although this time window is small, the loosely asynchronous semantics of write() are not enough for a database program that must guarantee transactional durability and consistency. Such programs usually rely on the synchronized I/O primitives provided by the OS:
#include <unistd.h>
int fsync(int fd);
fsync ensures that all modified content of the file referred to by fd has been correctly synchronized to the hard disk; the call blocks until the device reports that the I/O is complete. PS: If you do file I/O through a memory-mapped file (using mmap to map the file's page cache directly into the process's address space and modifying the file by writing to memory), a similar system call ensures that the modifications are fully synchronized to the hard disk:
#include <sys/mman.h>
int msync(void *addr, size_t length, int flags);

msync requires an explicit address range to synchronize. Such fine-grained control might seem more efficient than fsync (because the application usually knows where its dirty pages are), but in fact the (Linux) kernel has very efficient data structures for quickly finding a file's dirty pages, so fsync also synchronizes only the modified parts of the file.
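
As a rough illustration of the memory-mapped path, the following C sketch overwrites the start of an existing file through a shared mapping and then calls msync with MS_SYNC. The function name update_via_mmap is hypothetical. The sketch assumes that len is greater than zero and that the file is already at least len bytes long, since writing through a mapping cannot extend a file.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: overwrite the first `len` bytes of an existing file
 * through a shared mapping, then block in msync(MS_SYNC) until the dirty
 * pages in that range have been written back.
 */
int update_via_mmap(const char *path, const char *data, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy(map, data, len);             /* dirties pages in the page cache */
    int rc = msync(map, len, MS_SYNC);  /* map is page-aligned: it came from mmap */

    if (munmap(map, len) < 0)
        rc = -1;
    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```

The address passed to msync must be page-aligned, which is why the sketch passes the pointer returned by mmap rather than an arbitrary address inside the mapping.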

2. fsync performance problems and fdatasync

In addition to synchronizing the modified contents of the file (dirty pages), fsync also synchronizes the file's descriptive information (metadata, including the size and the access and modification times st_atime and st_mtime). Because a file's data and metadata usually live in different places on the disk, fsync needs at least two I/O writes. The fsync man page says:

"Unfortunately, fsync() will always initiate two write operations: one for the newly written data and another one in order to update the modification time stored in the inode. If the modification time is not a part of the transaction concept fdatasync() can be used to avoid unnecessary inode disk write operations."

How expensive is the extra I/O operation? According to Wikipedia, current hard drives have an average seek time of roughly 3~15 ms, and a 7200 rpm drive has an average rotational latency of about 4 ms, so a single I/O operation takes around 10 ms. What this number means will come up again below.
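
The 4 ms figure is simply the time for half a revolution, and the 10 ms estimate is seek time plus that latency. A small C sketch of the arithmetic follows; the function names are hypothetical, and transfer time is ignored.

```c
/* Average rotational latency in milliseconds: the time for half a revolution. */
double avg_rotational_latency_ms(double rpm)
{
    double ms_per_revolution = 60.0 * 1000.0 / rpm;
    return ms_per_revolution / 2.0;
}

/* Rough cost of one random I/O: average seek plus average rotational latency. */
double io_cost_ms(double avg_seek_ms, double rpm)
{
    return avg_seek_ms + avg_rotational_latency_ms(rpm);
}
```

For a 7200 rpm drive, one revolution takes about 8.33 ms, so the average latency is about 4.17 ms. With an assumed seek time of 6 ms (an illustrative value inside the 3~15 ms range), io_cost_ms(6.0, 7200.0) comes to a little over 10 ms.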

POSIX also defines fdatasync, which relaxes the synchronization semantics to improve performance:

#include <unistd.h>
int fdatasync(int fd);
The fdatasync function is similar to fsync, but it synchronizes metadata only when necessary, which can reduce the number of I/O writes to one. So what counts as "necessary"? The man page explains:
"fdatasync does not flush modified metadata unless this metadata is needed in order to allow a subsequent data retrieval to be correctly handled."
For example, if the file size (st_size) changes, the metadata must be synchronized immediately. Otherwise, if the OS crashes, the modified content cannot be read back even though the file's data has been synchronized, because the metadata that describes it has not. The last access time (atime) and modification time (mtime), by contrast, do not need to be synchronized every time; as long as the application has no strict requirements on these two timestamps, skipping them is essentially harmless.

PS: The open flags O_SYNC and O_DSYNC have semantics similar to fsync and fdatasync: every write blocks until the disk I/O completes. (In fact, Linux kernels before 2.6.33 handled O_SYNC and O_DSYNC identically, which did not meet the POSIX requirement: both implemented fdatasync semantics.) Compared with fsync/fdatasync, this setting is less flexible and should rarely be used.

3. Using fdatasync to optimize log synchronization

As mentioned at the beginning of the article, a database's log file usually has to be written with synchronized I/O to satisfy transactional requirements. Because the commit has to wait for the disk I/O to complete, committing a transaction is often time-consuming and becomes a performance bottleneck.
With Berkeley DB, if AUTO_COMMIT is turned on (every independent write automatically gets transactional semantics) and the default synchronization level is used (a commit returns only after the log has been fully synchronized to the hard disk), writing one record takes roughly 5~10 ms, about the same as a single I/O operation (10 ms).
We already know that fsync is inefficient for this kind of synchronization. But to use fdatasync and avoid the metadata update, the file size must not change across the write. Log files are inherently append-only and always growing, so it seems hard to take advantage of fdatasync. Here is how Berkeley DB handles its log files:
1. Each log file has a fixed size of 10 MB. Files are numbered starting from 1, and the name format is "log.%010d".
2. Each time a log file is created, its last page is written, which extends the file to 10 MB.
3. When records are appended to the log file, the file size does not change, so using fdatasync greatly improves the efficiency of writing the log.
4. When a log file is full, a new one is created, which incurs the metadata synchronization cost only once.
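
The scheme in the list above can be sketched in C as follows. This is an illustration of the idea, not Berkeley DB's actual code: the names log_create and log_append, the LOG_SIZE and PAGE_SIZE_BYTES constants, and the 4096-byte page size are all assumptions made for the example.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define LOG_SIZE        (10L * 1024 * 1024) /* fixed log size, as in the text */
#define PAGE_SIZE_BYTES 4096                /* assumed page size for this sketch */

/*
 * Create a log file and extend it to LOG_SIZE by writing its last page.
 * A single fsync() pays the cost of the size change up front.
 * Returns an open descriptor, or -1 on error.
 */
int log_create(const char *path)
{
    static const char zeros[PAGE_SIZE_BYTES];
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (pwrite(fd, zeros, sizeof zeros, LOG_SIZE - PAGE_SIZE_BYTES)
            != (ssize_t)sizeof zeros
        || fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/*
 * Write one record at `offset` inside the preallocated file. The file size
 * does not change, so fdatasync() has no size update to flush.
 * Returns the offset for the next record, or -1 if the record does not fit
 * or an error occurs.
 */
off_t log_append(int fd, off_t offset, const void *rec, size_t len)
{
    if (offset < 0 || offset + (off_t)len > LOG_SIZE)
        return -1;              /* full: the caller switches to a new log file */
    if (pwrite(fd, rec, len, offset) != (ssize_t)len)
        return -1;
    if (fdatasync(fd) < 0)
        return -1;
    return offset + (off_t)len;
}
```

A caller would keep the returned offset and pass it to the next log_append call, creating the next numbered file when -1 signals that the current one is full. Note that writing only the last page typically leaves a sparse file, so on some file systems the first write to each block may still need allocation metadata to be flushed; reserving the blocks with posix_fallocate is one way to avoid that.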
