Ensuring data reaches disk


Source: http://lwn.net/articles/457667/


In a perfect world, there would be no operating system crashes, power outages, or disk failures, and programmers wouldn't have to worry about coding for these corner cases. Unfortunately, such failures are more common than one would expect. The purpose of this document is to describe the path data takes from the application down to the storage, concentrating on places where data is buffered, and then to provide best practices for ensuring the data is committed to stable storage so it isn't lost along the way in the case of an adverse event. The main focus is on the C programming language, though the system calls mentioned should translate fairly easily to most other languages.

I/O buffering


In order to program for data integrity, it is crucial to have an understanding of the overall system architecture. Data can travel through several layers before it finally reaches stable storage, as seen below:


[Data Flow Diagram]
At the top is the running application, which has data that it needs to save to stable storage. That data starts out as one or more blocks of memory, or buffers, in the application itself. Those buffers can also be handed to a library, which may perform its own buffering. Regardless of whether data is buffered in application buffers or by a library, the data lives in the application's address space. The next layer the data goes through is the kernel, which keeps its own version of a write-back cache called the page cache. Dirty pages can live in the page cache for an indeterminate amount of time, depending on overall system load and I/O patterns. When dirty data is finally evicted from the kernel's page cache, it is written to a storage device (such as a hard disk). The storage device may further buffer the data in a volatile write-back cache. If power is lost while data is in this cache, the data will be lost. Finally, at the very bottom of the stack is the non-volatile storage. When the data hits this layer, it is considered to be "safe."


To further illustrate the layers of buffering, consider an application that listens on a network socket for connections and writes data received from each client to a file. Before closing the connection, the server ensures the received data is written to stable storage, and sends an acknowledgment of such to the client.


After accepting a connection from a client, the application needs to read data from the network socket into a buffer. The following function reads the specified amount of data from the network socket and writes it out to a file. The caller has already determined from the client how much data is expected, and has opened a file stream to write the data to. The (somewhat simplified) function below is expected to save the data read from the network socket to disk before returning.


 0 int
 1 sock_read(int sockfd, FILE *outfp, size_t nrbytes)
 2 {
 3         int ret;
 4         size_t written = 0;
 5         char *buf = malloc(MY_BUF_SIZE);
 6
 7         if (!buf)
 8                 return -1;
 9
10         while (written < nrbytes) {
11                 ret = read(sockfd, buf, MY_BUF_SIZE);
12                 if (ret <= 0) {
13                         if (errno == EINTR)
14                                 continue;
15                         return ret;
16                 }
17                 written += ret;
18                 ret = fwrite((void *)buf, ret, 1, outfp);
19                 if (ret != 1)
20                         return ferror(outfp);
21         }
22
23         ret = fflush(outfp);
24         if (ret != 0)
25                 return -1;
26
27         ret = fsync(fileno(outfp));
28         if (ret < 0)
29                 return -1;
30         return 0;
31 }
Line 5 is an example of an application buffer; the data read from the socket is put into this buffer. Now, since the amount of data transferred is already known, and given the nature of network communications (which can be bursty and/or slow), we've decided to use libc's stream functions (fwrite() and fflush(), represented by "Library Buffers" in the figure above) in order to further buffer the data. Lines 10-21 take care of reading the data from the socket and writing it to the file stream. At line 22, all data has been written to the file stream. On line 23, the file stream is flushed, causing the data to move into the "Kernel Buffers" layer. Then, on line 27, the data is saved to the "Stable Storage" layer shown.


I/O APIs


Now that we've hopefully solidified the relationship between APIs and the layering model, let's explore the intricacies of the interfaces in a little more detail. For the sake of this discussion, we'll break I/O down into three different categories: system I/O, stream I/O, and memory mapped (mmap) I/O.


System I/O can be defined as any operation that writes data into the storage layers accessible only to the kernel's address space via the kernel's system call interface. The following routines (not comprehensive; the focus is on write operations here) are part of the system (call) interface:


Operation   Function(s)
Open        open(), creat()
Write       write(), aio_write(), pwrite(), pwritev()
Sync        fsync(), sync()
Close       close()
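To make the system call path concrete, here is a minimal sketch (the function name and error-handling style are invented for this illustration, not taken from the article's code) that writes a buffer using only the system call interface and ensures it reaches stable storage before returning:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Write `len` bytes to `path` using only system calls, then fsync()
 * so the data passes the page cache and the device write-back cache. */
int write_file_syscall(const char *path, const void *data, size_t len)
{
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
                return -1;

        const char *p = data;
        size_t remaining = len;
        while (remaining > 0) {
                ssize_t ret = write(fd, p, remaining);
                if (ret < 0) {
                        if (errno == EINTR)     /* interrupted: retry */
                                continue;
                        close(fd);
                        return -1;
                }
                p += ret;               /* write() may be partial */
                remaining -= ret;
        }

        if (fsync(fd) < 0) {            /* page cache -> stable storage */
                close(fd);
                return -1;
        }
        return close(fd);               /* close() can also report errors */
}
```

Note that write() may return after transferring fewer bytes than requested, so the loop above resumes from where the previous call left off.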
Stream I/O is I/O initiated using the C library's stream interface. Writes using these functions may not result in system calls, meaning that the data still lives in buffers in the application's address space after making such a function call. The following library routines (not comprehensive) are part of the stream interface:


Operation   Function(s)
Open        fopen(), fdopen(), freopen()
Write       fwrite(), fputc(), fputs(), putc(), putchar(), puts()
Sync        fflush(), followed by fsync() or sync()
Close       fclose()
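A minimal sketch of the stream I/O path (the function name and file path are invented for this example) shows both synchronization steps: fflush() moves the data from the library buffer into the kernel's page cache, and fsync() moves it on to stable storage:

```c
#include <stdio.h>
#include <unistd.h>

/* Write a string via the stdio stream interface, then push it through
 * every buffering layer before returning. */
int write_file_stream(const char *path, const char *text)
{
        FILE *fp = fopen(path, "w");
        if (!fp)
                return -1;

        if (fputs(text, fp) == EOF) {
                fclose(fp);
                return -1;
        }
        if (fflush(fp) != 0) {          /* library buffer -> page cache */
                fclose(fp);
                return -1;
        }
        if (fsync(fileno(fp)) < 0) {    /* page cache -> stable storage */
                fclose(fp);
                return -1;
        }
        return fclose(fp);
}
```

Calling fsync() without a preceding fflush() would only synchronize whatever the library had already handed to the kernel, which is why both calls are needed.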
Memory mapped files are similar to the system I/O case above. Files are still opened and closed using the same interfaces, but access to the file data is performed by mapping that data into the process' address space, and then performing memory read and write operations as you would with any other application buffer.


Operation   Function(s)
Open        open(), creat()
Map         mmap()
Write       memcpy(), memmove(), read(), or any other routine that writes to application memory
Sync        msync()
Unmap       munmap()
Close       close()
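The mmap path can be sketched as follows (the function name is invented for this illustration; the file is extended with ftruncate() so the mapping is backed by file blocks):

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Update the first `len` bytes of a file through a memory mapping,
 * then use msync(MS_SYNC) to write the dirty pages back before
 * unmapping. */
int write_file_mmap(const char *path, const char *data, size_t len)
{
        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0)
                return -1;
        if (ftruncate(fd, len) < 0) {   /* mapping needs backing blocks */
                close(fd);
                return -1;
        }

        char *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) {
                close(fd);
                return -1;
        }

        memcpy(map, data, len);         /* ordinary stores dirty the pages */

        int ret = msync(map, len, MS_SYNC);  /* write pages back, waiting */
        munmap(map, len);
        close(fd);
        return ret;
}
```

MS_SYNC makes msync() wait until the write-back completes; MS_ASYNC would merely schedule it.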
There are two flags that can be specified when opening a file to change its caching behavior: O_SYNC (and related O_DSYNC), and O_DIRECT. I/O operations performed against files opened with O_DIRECT bypass the kernel's page cache, writing directly to the storage. Recall that the storage may itself store the data in a write-back cache, so fsync() is still required for files opened with O_DIRECT in order to save the data to stable storage. The O_DIRECT flag is only relevant for the system I/O API.
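One practical wrinkle of O_DIRECT, worth illustrating: the buffer, file offset, and transfer length typically must be aligned to the device's logical block size. The sketch below (function name and the 4096-byte alignment are assumptions for illustration; the exact alignment requirement is device and file system dependent, and some file systems reject O_DIRECT entirely) pads the data into an aligned buffer:

```c
#define _GNU_SOURCE             /* for O_DIRECT on Linux */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALIGNMENT 4096          /* a common safe alignment; not universal */

int write_file_direct(const char *path, const char *data, size_t len)
{
        void *buf;
        if (posix_memalign(&buf, ALIGNMENT, ALIGNMENT) != 0)
                return -1;
        memset(buf, 0, ALIGNMENT);
        memcpy(buf, data, len < ALIGNMENT ? len : ALIGNMENT);

        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
        if (fd < 0) {
                free(buf);
                return -1;
        }

        ssize_t ret = write(fd, buf, ALIGNMENT);  /* aligned length */
        free(buf);
        if (ret != ALIGNMENT) {
                close(fd);
                return -1;
        }
        if (fsync(fd) < 0) {    /* still needed: device cache is volatile */
                close(fd);
                return -1;
        }
        return close(fd);
}
```

Note the fsync() at the end: bypassing the page cache does not bypass the device's own write-back cache.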


Raw devices (/dev/raw/rawN) are a special case of O_DIRECT I/O. These devices can be opened without specifying O_DIRECT, but they still provide direct I/O semantics. As such, all of the same rules apply to raw devices that apply to files (or devices) opened with O_DIRECT.


Synchronous I/O is any I/O (system I/O with or without O_DIRECT, or stream I/O) performed to a file descriptor that was opened using the O_SYNC or O_DSYNC flags. These are the synchronous modes, as defined by POSIX:


O_SYNC: File data and all file metadata are written synchronously to disk.
O_DSYNC: Only file data, and the metadata needed to access the file data, are written synchronously to disk.
O_RSYNC: Not implemented.
The data and associated metadata for write calls to such file descriptors end up immediately on stable storage. Note the careful wording there: metadata that is not required for retrieving the data of the file may not be written immediately. That metadata may include the file's access time, creation time, and/or modification time.


It is also worth pointing out the subtleties of opening a file descriptor with O_SYNC or O_DSYNC, and then associating that file descriptor with a libc file stream. Remember that fwrite()s to the file pointer are buffered by the C library. It is not until an fflush() call is issued that the data is known to be written to disk. In essence, associating a file stream with a synchronous file descriptor means that an fsync() call is not needed on the file descriptor after the fflush(). The fflush() call, however, is still necessary.


When should you fsync?


There are some simple rules to follow to determine whether or not fsync() is necessary. First and foremost, you must answer the question: is it important that this data be saved now to stable storage? If it's scratch data, then you probably don't need to fsync(). If it's data that can be regenerated, it might not be that important to fsync() it. If, on the other hand, you're saving the result of a transaction, or updating a user's configuration file, you very likely want to get it right. In these cases, use fsync().


The more subtle usages deal with newly created files, or overwriting existing files. A newly created file may require an fsync() of not just the file itself, but also of the directory in which it was created (since this is where the file system looks to find the file). This behavior is actually file system (and mount option) dependent. You can either code specifically for each file system and mount option combination, or just perform fsync() calls on the directories to ensure that your code is portable.


Similarly, if you encounter a system failure (such as power loss, ENOSPC, or an I/O error) while overwriting a file, it can result in the loss of existing data. To avoid this problem, it is common practice (and advisable) to write the updated data to a temporary file, ensure that it is safe on stable storage, then rename the temporary file to the original file name (thus replacing the contents). This ensures an atomic update of the file, so that other readers get one copy of the data or the other. The following steps are required to perform this type of update:


1. Create a new temp file (on the same file system!)
2. Write data to the temp file
3. fsync() the temp file
4. Rename the temp file to the appropriate name
5. fsync() the containing directory
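A minimal sketch of these five steps follows (the function name, the ".tmp" suffix, and the fixed-size path buffers are simplifications invented for this example; production code would generate a unique temporary name):

```c
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <unistd.h>

/* Atomically replace the contents of `path`: write and fsync() a temp
 * file on the same file system, rename() it over the original, then
 * fsync() the containing directory so the rename itself is durable. */
int replace_file(const char *path, const char *data, size_t len)
{
        char tmp[4096];
        snprintf(tmp, sizeof(tmp), "%s.tmp", path);

        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
                return -1;
        if (write(fd, data, len) != (ssize_t)len || fsync(fd) < 0) {
                close(fd);
                unlink(tmp);
                return -1;
        }
        if (close(fd) < 0 || rename(tmp, path) < 0) {
                unlink(tmp);
                return -1;
        }

        /* dirname() may modify its argument, so work on a copy. */
        char dirbuf[4096];
        snprintf(dirbuf, sizeof(dirbuf), "%s", path);
        int dfd = open(dirname(dirbuf), O_RDONLY);
        if (dfd < 0)
                return -1;
        int ret = fsync(dfd);           /* make the rename durable */
        close(dfd);
        return ret;
}
```

If a crash occurs before the rename, the original file is untouched; if it occurs after, the new contents are in place. Readers never see a partially written file.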
Checking for Errors


When performing write I/O that is buffered by the library or the kernel, errors may not be reported at the time of the write() or fflush() call, since the data may only have been written to the page cache. Errors from writes are instead often reported during calls to fsync() or close(). Therefore, it is very important to check the return values of these calls.
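The checking pattern can be made concrete with a small sketch (the function name and the stderr reporting style are invented for illustration): every stage's return value is inspected, because a deferred write error may surface only at fsync() or close():

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write and synchronize, reporting the stage at which any failure
 * occurred.  Takes an already-open descriptor and closes it. */
int checked_save(int fd, const void *data, size_t len)
{
        if (write(fd, data, len) != (ssize_t)len) {
                fprintf(stderr, "write: %s\n", strerror(errno));
                close(fd);
                return -1;
        }
        if (fsync(fd) < 0) {    /* deferred write errors show up here... */
                fprintf(stderr, "fsync: %s\n", strerror(errno));
                close(fd);
                return -1;
        }
        if (close(fd) < 0) {    /* ...or here; never ignore close() */
                fprintf(stderr, "close: %s\n", strerror(errno));
                return -1;
        }
        return 0;
}
```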


Write-back Caches


This section provides some general information on disk caches, and on the control of such caches by the operating system. The options discussed in this section should not affect how a program is constructed; they are included for informational purposes only.


The write-back cache on a storage device can come in many different forms. There is the volatile write-back cache, which we've been assuming throughout this document. Such a cache is lost upon power failure. However, most storage devices can be configured to run in a cache-less mode, or in a write-through caching mode. Each of these modes will not return success for a write request until the request is on stable storage. External storage arrays often have a non-volatile, or battery-backed, write cache. This configuration will also persist data in the event of a power loss. From an application programmer's point of view, there is no visibility into these parameters, however. It is best to assume a volatile cache, and to program defensively. In cases where the data is already safe, the operating system will perform whatever optimizations it can to maintain the highest performance possible.


Some file systems provide mount options to control cache flushing behavior. For ext3, ext4, XFS and btrfs as of kernel version 2.6.35, the mount option is "-o barrier" to turn barriers (write-back cache flushes) on (the default), or "-o nobarrier" to turn barriers off. Previous versions of the kernel may require different options ("-o barrier=0,1"), depending on the file system. Again, the application writer should not need to take these options into account. When barriers are disabled for a file system, fsync calls will not result in the flushing of disk caches. It is expected that the administrator knows that the cache flushes are not required before specifying this mount option.


Appendix: Some examples


This section provides example code for common tasks that application programmers often need:


Synchronizing I/O to a file stream
Synchronizing I/O using file descriptors (system I/O); this also covers the O_DIRECT open flag case, so it will work whether or not that flag was specified
Replacing an existing file (overwrite)
sync-samples.h (needed by the above examples)
