Linux Synchronous I/O: sync, fsync, and fdatasync

Last Update:2015-05-05 Source: Internet

Author: User


Transferred from: http://blog.csdn.net/sishuiliunian0710/article/details/37739385

I. Terminology
Dirty page: A Linux kernel concept. Because hard disks are far slower than memory, the system keeps frequently read and written data in memory to speed up access; this is called the cache. Linux caches data in units of pages. When a process modifies data in the cache, the kernel marks the page as dirty and, at an appropriate time, writes the dirty page to disk so that the cache and the disk stay consistent.

Memory mapping: A memory-mapped file is a mapping from a file to a region of memory. Win32 provides a function, CreateFileMapping, that lets an application map a file into a process. Memory-mapped files resemble virtual memory: a region of address space is reserved and physical storage is committed to it, except that the physical storage comes from a file that already exists on disk, and the file must be mapped before it can be operated on. When you process a file stored on disk through a memory mapping, you no longer need to issue explicit I/O operations on it, which makes memory-mapped files particularly valuable when handling large volumes of data.

Excerpt from Baidu Encyclopedia

Deferred write (delayed write): Traditional UNIX implementations have a buffer cache or page cache in the kernel, and most disk I/O goes through it. When data is written to a file, the kernel normally copies the data into one of its buffers. If the buffer is not yet full, it is not placed on the output queue; the kernel waits until the buffer fills up, or until it needs to reuse the buffer for other disk block data, and only then queues it for output. The actual I/O is performed when the buffer reaches the head of the queue. This style of output is called deferred write.

Excerpted from Advanced Programming in the UNIX Environment, third edition, p. 65

II. Main Text

Deferred write reduces the number of disk reads and writes, but it also slows the rate at which file contents are updated on disk, so data written to a file may not reach the disk for some time. If the system fails, this delay can cause file updates to be lost. To keep the actual file system on disk consistent with the contents of the buffer cache, UNIX systems provide three functions: sync, fsync, and fdatasync.

1. sync function

The sync function simply queues all modified block buffers for writing and then returns; it does not wait for the actual disk writes to complete.
A system daemon, traditionally called update, calls sync periodically (typically every 30 seconds), which guarantees that the kernel's block buffers are flushed regularly. The sync(1) command also calls the sync function.

2. fsync function

The fsync function applies only to the single file specified by the file descriptor filedes, and it returns only after the disk writes have completed.
fsync is intended for applications such as databases, which need to be sure that modified blocks have actually been written to disk.

3. fdatasync function

The fdatasync function is similar to fsync, but it affects only the data portion of a file. fsync, in addition to the data, also synchronously updates the file's attributes.
For a database that supports transactions, a commit must not return to the application layer until the transaction log (containing the modification operations and a commit record) has been completely written to disk.

4. fflush: The standard I/O functions (fread, fwrite, and so on) maintain their own buffers in user-space memory. fflush takes a file stream (a FILE * opened with a function such as fopen) as its argument, writes the contents of that stdio buffer to the kernel buffer, and returns. On its own it is therefore less safe than fsync: to get the data onto the disk you still have to call fsync afterwards. In other words, call fflush first and then fsync; calling fsync alone does not cover data still sitting in the stdio buffer. Use the function

int fileno(FILE *stream);

to convert a file stream (fp) to a file descriptor (fd) so that fsync can be called on it.

How can you make sure that data is correctly written to external persistent storage on Linux?

1. write is not enough; fsync is required
We tend to assume that once the write function returns, the data has been written to the file. That is true only at a high level. In general, a write to a file on a hard disk (or other persistent storage device) updates only the in-memory page cache. The dirty pages are not written to the disk immediately but are scheduled by the operating system: a flusher kernel thread synchronizes dirty pages to the disk (that is, puts them on the device's I/O request queue) when certain conditions are met (a time interval has elapsed, or the dirty pages in memory reach a certain proportion). Because write does not wait for the disk I/O to complete before returning, data may be lost if the operating system crashes after write returns but before the disk is synchronized. The window is small, but the "loosely asynchronous semantics" of write() are not enough for a database program that must guarantee transaction durability and consistency. Such programs usually rely on the synchronized I/O primitives provided by the operating system:

Function prototype:

#include <unistd.h>
int fsync(int fd);

fsync makes sure that all modified contents of the file referred to by fd have been synchronized to the disk. The call blocks until the device reports that the I/O has completed.
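The write-then-fsync pattern described above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the helper names write_durable and save_file are hypothetical, and error handling is reduced to returning -1.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: write all `len` bytes to `fd`, then block in fsync()
 * until the kernel reports that the data has reached the device.
 * Returns 0 on success, -1 on error (errno is set by the failing call). */
int write_durable(int fd, const void *buf, size_t len)
{
    assert(buf != NULL);
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted before writing: retry */
            return -1;
        }
        p += n;                     /* write() may be partial: keep going */
        len -= (size_t)n;
    }
    /* write() only updated the page cache; fsync() forces it out. */
    return fsync(fd);
}

/* Hypothetical helper: create or truncate `path` and store `buf` durably. */
int save_file(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    int rc = write_durable(fd, buf, len);
    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```

A caller would use it as save_file("data.txt", "hello\n", 6) and treat any nonzero return as "the data may not be on disk".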

PS: If you perform file I/O through a memory-mapped file (using mmap to map the file's page cache directly into the process's address space and modifying the file by writing to memory), a similar system call ensures that the modified contents are fully synchronized to the disk:

#include <sys/mman.h>
int msync(void *addr, size_t length, int flags);

msync takes the address range to synchronize. Such fine-grained control might seem more efficient than fsync (the application usually knows where its dirty pages are), but in practice the Linux kernel keeps a very efficient data structure for locating a file's dirty pages quickly, so fsync, too, synchronizes only the modified parts of the file.
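As an illustration of the mmap case, the sketch below overwrites the beginning of an existing file through a shared mapping and then calls msync with MS_SYNC. The helper name update_via_mmap is hypothetical, and the code assumes the file already contains at least len bytes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: overwrite the first `len` bytes of an existing file
 * through a shared mapping, then msync() the mapping.
 * Assumes the file already holds at least `len` bytes and len > 0.
 * Returns 0 on success, -1 on error. */
int update_via_mmap(const char *path, const void *data, size_t len)
{
    assert(data != NULL && len > 0);
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    /* MAP_SHARED so that stores go to the file's page cache pages. */
    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (addr == MAP_FAILED)
        return -1;

    memcpy(addr, data, len);        /* modify the file by writing memory */

    /* MS_SYNC blocks until the dirty pages in the range are written out. */
    int rc = msync(addr, len, MS_SYNC);
    if (munmap(addr, len) < 0)
        rc = -1;
    return rc;
}
```

Passing MS_ASYNC instead of MS_SYNC would only schedule the write-back, which is closer to the behavior of plain write than to fsync.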

2. The difference between fsync and fdatasync

Besides the modified contents (dirty pages) of a file, fsync also synchronizes the file's descriptive information (metadata such as size and access time). Because a file's data and metadata are usually stored in different places on the disk, fsync needs at least two I/O writes, that is, one more than the data alone requires. The cost of that extra write is significant: according to Wikipedia, the average seek time of current hard drives is roughly 3 to 15 ms, and the average rotational latency of a 7200 rpm drive is about 4 ms, so a single I/O operation takes around 10 ms. POSIX therefore also defines fdatasync, which relaxes the synchronization semantics to improve performance:

int fdatasync(int fd);

fdatasync is similar to fsync, but it synchronizes metadata only when necessary, which saves one I/O write.

"fdatasync does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to be correctly handled."

Excerpted from the man page

For example, a change in file size (st_size) must be synchronized immediately; otherwise, after an OS crash, the modified contents could not be read even though the data portion of the file had been synchronized, because the metadata had not. The last access time (atime) and modification time (mtime), on the other hand, need not be synchronized every time; as long as the application has no strict requirements on these two timestamps, this causes essentially no problems.

Addendum: The O_SYNC and O_DSYNC flags of the open function have meanings similar to fsync and fdatasync, respectively: every write blocks until the disk I/O has completed.

O_SYNC makes each write wait for the physical I/O to complete, including the I/O needed to update file attributes as a result of the write.
O_DSYNC makes each write wait for the physical I/O to complete, but does not wait for file attribute updates if they do not affect the ability to read back the data just written.
Note the difference:
The O_DSYNC and O_SYNC flags differ subtly. With O_DSYNC, file attributes are updated synchronously only when they must change to reflect a change in the file's data (for example, a new file size because the file now holds more data). With O_SYNC, data and attributes are always updated synchronously. When an existing portion of a file opened with O_DSYNC is overwritten, the file's time attributes are not updated synchronously. In contrast, when a file is opened with O_SYNC, every write updates the file's times before write returns, whether it overwrites existing bytes or appends to the file. These flags are less flexible than calling fsync or fdatasync explicitly and should be used sparingly.
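For comparison with explicit fdatasync calls, here is a minimal sketch of opening a file with O_DSYNC so that each write blocks until the data is on the device. The helper name open_dsync is hypothetical; replacing O_DSYNC with O_SYNC would request the fsync-like behavior described above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open `path` for writing so that every write() on the
 * returned descriptor waits for the data to reach the device, roughly as if
 * each write() were followed by fdatasync().
 * Returns the descriptor, or -1 on error. */
int open_dsync(const char *path)
{
    assert(path != NULL);
    return open(path, O_WRONLY | O_CREAT | O_DSYNC, 0644);
}
```

With this descriptor, a plain write(fd, "commit\n", 7) already includes the wait for disk I/O, so the cost is paid on every write rather than at points the application chooses.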

3. Using fdatasync to optimize log synchronization (from http://blog.csdn.net/cywosp/article/details/8767327)
To meet transactional requirements, a database's log file usually has to be written with synchronous I/O. Because each commit must wait for the disk I/O to complete, committing a transaction is often time-consuming and becomes a performance bottleneck. In Berkeley DB, with auto-commit turned on (every independent write automatically gets transactional semantics) and the default synchronization level (the call returns only after the log has been fully synchronized to disk), writing one record takes roughly 5 to 10 ms, essentially the same as one I/O operation (10 ms).
We already know that synchronizing with fsync is inefficient. To use fdatasync and avoid the metadata update, however, the file size must be the same before and after the write. Log files are inherently append-only and keep growing, so it seems hard to benefit from fdatasync.
This is how Berkeley DB handles its log files:
1. Each log file has a fixed size of 10 MB; numbering starts at 1, and the name format is "log.%010d".
2. Each time a log file is created, its last page is written, which extends the file to 10 MB.
3. When records are appended to the log file, the file size does not change, so fdatasync can greatly improve the efficiency of log writes.
4. When a log file is full, a new one is created, which costs only one metadata synchronization.
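The steps above can be sketched roughly as follows. This is an illustration of the preallocate-then-fdatasync idea, not Berkeley DB's actual code: log_create, log_append, and LOG_SIZE are hypothetical names, and for brevity the sketch writes only the final byte of the file rather than a full page.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define LOG_SIZE (10L * 1024 * 1024)    /* 10 MB, as in the steps above */

/* Hypothetical helper: create a log file and extend it to LOG_SIZE by
 * writing its last byte, then fsync() once so the new size is on disk.
 * Returns the descriptor, or -1 on error. */
int log_create(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0644);
    if (fd < 0)
        return -1;
    if (pwrite(fd, "", 1, LOG_SIZE - 1) != 1 || fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: store a record at offset `off` inside the
 * preallocated area. The file size does not change, so fdatasync() has no
 * size update to flush.
 * Returns 0 on success, -1 on error or if the record does not fit. */
int log_append(int fd, off_t off, const void *rec, size_t len)
{
    if (off < 0 || len > (size_t)LOG_SIZE || off > LOG_SIZE - (off_t)len)
        return -1;                  /* log full: caller starts a new file */
    if (pwrite(fd, rec, len, off) != (ssize_t)len)
        return -1;
    return fdatasync(fd);
}
```

Extending the file by writing only its end typically leaves the rest sparse, so the file system may still have to allocate blocks on later writes; posix_fallocate is one alternative for reserving the space up front.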
III. Summary
1. To request that all modified buffers be written out, use the sync function, but keep in mind when programming that it merely queues the writes and then returns.
2. To commit the contents of an open file to disk, call the fsync function. It returns only after the data has actually been written to disk, so it is the safest and most reliable method.

3. When working with an open file stream, first call fflush to push the changes into the kernel buffer, and then call fsync to actually synchronize them to disk.
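Point 3 can be sketched as a small helper that combines fflush, fileno, and fsync. The name stream_sync is hypothetical; the three library calls are the ones discussed in this article.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: push a stdio stream's contents all the way to disk.
 * Step 1: fflush() moves data from the user-space stdio buffer to the kernel.
 * Step 2: fsync() on the underlying descriptor waits for the device.
 * Returns 0 on success, -1 on error. */
int stream_sync(FILE *fp)
{
    assert(fp != NULL);
    if (fflush(fp) == EOF)
        return -1;
    int fd = fileno(fp);            /* FILE * to file descriptor */
    if (fd < 0)
        return -1;
    return fsync(fd);
}
```

Calling fsync(fileno(fp)) without the preceding fflush would leave any bytes still held in the stdio buffer unsynchronized, which is why the order matters.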

IV. Manual section on fsync and fdatasync

fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even after the system crashed or was rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed. It also flushes metadata information associated with the file (see stat(2)).
Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.
fdatasync() is similar to fsync(), but does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to be correctly handled. For example, changes to st_atime or st_mtime (respectively, time of last access and time of last modification; see stat(2)) do not require flushing because they are not necessary for a subsequent data read to be handled correctly. On the other hand, a change to the file size (st_size, as made by say ftruncate(2)), would require a metadata flush.
The aim of fdatasync() is to reduce disk activity for applications that do not require all metadata to be synchronized with the disk.
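The manual's remark about directory entries can be illustrated with the sketch below, which creates a file and then calls fsync on both the file and its containing directory. The helper name create_durable is hypothetical, and a short write is simply treated as an error to keep the example brief.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: create `name` inside directory `dir`, write `buf`,
 * and fsync both the file and the directory that holds its entry.
 * Returns 0 on success, -1 on error. */
int create_durable(const char *dir, const char *name,
                   const void *buf, size_t len)
{
    assert(dir != NULL && name != NULL);
    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;

    int rc = -1;
    int fd = openat(dfd, name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd >= 0) {
        if (write(fd, buf, len) == (ssize_t)len && fsync(fd) == 0)
            rc = 0;                 /* file data and metadata are flushed */
        if (close(fd) < 0)
            rc = -1;
    }
    if (rc == 0 && fsync(dfd) < 0)  /* now flush the directory entry too */
        rc = -1;
    close(dfd);
    return rc;
}
```

A call such as create_durable("/var/lib/myapp", "state.bin", data, size) (the path is only an example) returns 0 once both the file and its directory entry have been flushed.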


