On the atomicity of write in O_APPEND mode

Last Update:2014-06-08 Source: Internet

Author: User

Last week, just before the short Dragon Boat Festival holiday, I did what I usually do on the last working day: pick a small topic that can be settled in one day (systematically learning vim or emacs would not qualify). Fortunately a question came up: is the write system call atomic? The obvious answer is no. But a senior colleague said that a write with the append flag is atomic: many programs open their log files with O_APPEND and then write to them directly without any locking, and nothing goes wrong. How can this be confirmed? This article gives an answer.
I once agonized over whether the write system call in Linux is atomic. The obvious answer is no, but why not? That is harder to answer, and this article tries to explain it in a simple way. It also explains, just as simply, why a write in O_APPEND mode is atomic. I will only use experiments or thought experiments and will not walk through kernel code. As groundwork, though, here is a simplified description of three important structures:
1. inode structure
Represents a file as an entity on disk. Each file on disk corresponds to exactly one inode object.
2. file structure
Represents a file as seen by a process. A process that needs to operate on a file (that is, on an inode) opens it, and each independent open gets its own file object for that inode. The file object has a pos field that holds the current position in the file; every read or write starts there.
3. task structure
The entity that operates on file objects (a process or thread).
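The relationship between the three structures can be pictured with a minimal C model. This is only a sketch for the thought experiments in this article, not the kernel's definitions: the names model_inode, model_file, model_task and model_open are hypothetical, and the real struct inode, struct file and struct task_struct carry many more fields.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_FILES 8

/* One per file on disk, shared by everyone who opens the file. */
struct model_inode {
    char   data[4096];   /* file contents (fixed size for this sketch) */
    size_t size;         /* current length of the file in bytes */
};

/* One per open(); several file objects may point at the same inode. */
struct model_file {
    struct model_inode *inode;
    size_t pos;          /* current position: where the next read/write starts */
};

/* The actor that operates on file objects (a process or thread). */
struct model_task {
    struct model_file files[MODEL_MAX_FILES];   /* toy descriptor table */
    int nfiles;
};

/*
 * Hypothetical helper: give the task a fresh file object on the inode,
 * the way an independent open() would. Returns a descriptor or -1.
 */
static int model_open(struct model_task *task, struct model_inode *inode)
{
    assert(task != NULL && inode != NULL);
    if (task->nfiles >= MODEL_MAX_FILES)
        return -1;
    task->files[task->nfiles].inode = inode;
    task->files[task->nfiles].pos = 0;
    return task->nfiles++;
}
```

Two tasks that each call model_open on the same model_inode end up with separate pos values but share one size, which is the distinction the rest of the analysis relies on.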
For a write, the most basic question is where to start writing, that is, the current position in the file. The semantics of the write system call are: starting at the current position, write len bytes from the buffer buff. That is all. The actual writing is simple: a memory copy, cache management, and eventually I/O to the block device. So the key is how the position is determined. There are three ways:
1. Manual positioning by calling lseek;
2. Automatic positioning based on previous writes;
3. Automatic positioning based on the O_APPEND flag.
Manual positioning with lseek is simple: it sets the pos of the file object. Automatic positioning based on previous writes is the easiest to understand: if you write n bytes, the file's pos advances by n. A write reads the file's pos when it begins, starts writing there, and when it completes it updates the file's pos according to the amount actually written. The O_APPEND method does not depend on pos at all for its starting point: instead of using the file's pos, it uses the inode's size, so every write starts at the end of the file.
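The three positioning methods can be observed from user space with ordinary POSIX calls. The following is a small sketch rather than a definitive test: the function name positioning_demo and the scratch-file template under /tmp are assumptions made for illustration, and the offset reported after the O_APPEND write is what Linux returns.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Exercises the three positioning methods on a scratch file.
 * out[0..2] receive the offsets seen after each step; the return
 * value is the final file size, or -1 on error.
 */
static long positioning_demo(long out[3])
{
    char path[] = "/tmp/pos_demo_XXXXXX";
    long size = -1;
    int afd = -1;
    int fd = mkstemp(path);              /* read/write, no O_APPEND */

    assert(out != NULL);
    if (fd < 0)
        return -1;

    /* Method 2: pos advances by the number of bytes written. */
    if (write(fd, "0123456789", 10) != 10)
        goto done;
    out[0] = (long)lseek(fd, 0, SEEK_CUR);       /* 10 */

    /* Method 1: lseek sets pos; the next write starts there. */
    if (lseek(fd, 2, SEEK_SET) != 2 || write(fd, "AB", 2) != 2)
        goto done;
    out[1] = (long)lseek(fd, 0, SEEK_CUR);       /* 4 */

    /* Method 3: a second, O_APPEND open starts with pos 0, yet its
     * write lands at the end of the file (offset 10). */
    afd = open(path, O_WRONLY | O_APPEND);
    if (afd < 0 || write(afd, "xyz", 3) != 3)
        goto done;
    out[2] = (long)lseek(afd, 0, SEEK_CUR);      /* 13 */
    size = (long)lseek(afd, 0, SEEK_END);        /* 13 */

done:
    if (afd >= 0)
        close(afd);
    close(fd);
    unlink(path);
    return size;
}
```

The file ends up as "01AB456789xyz": the lseek-positioned write overwrote bytes in the middle, while the O_APPEND write ignored its own pos and extended the file.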
With the current position settled, the write itself can proceed. The question now is whether one write can be affected by another. To keep the analysis simple, I assume that each buffer is written by a single write call (if one buffer takes several write calls, the output in a multi-process environment will certainly interleave, no doubt about it), that is, the return value of write equals its count parameter. First I break a write operation into steps. Assume each write carries 100 bytes: thread A writes 100 a's and thread B writes 100 b's:
L1.get_pos
L2.write_buffer
L3.update_pos
Several scenarios are discussed below.
Scenario 1:
Thread A is in L2 when thread B enters L1. The two threads obtain the same pos, so when thread B follows thread A into L2, thread B overwrites the data that thread A has just written.
Scenario 1-1: Define three points in time within L2: the start of L2 (just before the first byte is written), some moment in the middle, and the end of L2 (just after the 100th byte is written, 100 being our assumed length). Call them t1, t2 and t3.
At time t2 thread A is scheduled off the CPU, perhaps because a real-time process preempted it or because its time slice ran out. Either way it stops running, having written some a's, say 40. Thread B then enters at t1 and runs through to t3, and the 100 bytes it writes are all b's. Once thread B has left L2, thread A is scheduled back onto the CPU and continues from the 41st byte, writing the remaining 60 bytes up to the end of L2. The file then holds 40 b's followed by 60 a's.
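The interleaving of Scenario 1-1 can be replayed deterministically in a single thread. The sketch below is a thought experiment in code, not kernel behaviour: sim_file, get_pos, write_part and scenario_1_1 are hypothetical names, and the schedule is hard-coded to the one described above, with each L3 placed after both writes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define REC_LEN 100

struct sim_file {
    char   data[4 * REC_LEN];   /* in-memory stand-in for the file */
    size_t pos;                 /* current position shared by A and B */
};

/* L1: read the current position. */
static size_t get_pos(const struct sim_file *f)
{
    return f->pos;
}

/* A slice of L2: write n copies of c, `done` bytes into the record at `start`. */
static void write_part(struct sim_file *f, size_t start, size_t done,
                       size_t n, char c)
{
    assert(start + done + n <= sizeof f->data);
    memset(f->data + start + done, c, n);
}

/* Replays Scenario 1-1 with an unprotected L2. */
static void scenario_1_1(struct sim_file *f)
{
    size_t pos_a = get_pos(f);               /* A: L1 */
    write_part(f, pos_a, 0, 40, 'a');        /* A: L2 from t1 to t2, then descheduled */

    size_t pos_b = get_pos(f);               /* B: L1, same pos as A */
    write_part(f, pos_b, 0, REC_LEN, 'b');   /* B: all of L2, t1 to t3 */

    write_part(f, pos_a, 40, 60, 'a');       /* A: resumes at byte 41 */

    f->pos = pos_b + REC_LEN;                /* B: L3 */
    f->pos = pos_a + REC_LEN;                /* A: L3 */
}
```

After scenario_1_1 runs on a zeroed sim_file, bytes 0 to 39 are 'b', bytes 40 to 99 are 'a', and pos is 100 even though 200 bytes were written in total.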
Analysis: The conclusion from the scenarios above is that when each buffer is written by a single write, the data never interleaves; it is only overwritten. How it is overwritten is indeterminate: it may be a complete overwrite, or the partial overwrite described in Scenario 1-1. In practice, however, partial overwrite does not occur, and when every thread writes the same number of bytes each time it never occurs at all. Why? The key design point is that the L2 step is not interrupted by another writer, that is, it is atomic. Whatever the mode, the data copy of a write is atomic: if you ask to write x bytes but for some reason only x-y bytes are written, the writing of those x-y bytes is still atomic. The so-called non-atomic write refers to the gap between positioning pos and writing the data; positioning pos and writing the data are each atomic on their own.
For the discussion that follows, I break the write operation down again:
L1.get_pos
L2-0.lock_inode
L2-1.write_buffer
L2-2.unlock_inode
L3.update_pos
So the problems attributed to non-atomic writes can only arise between L1 and L2, or between L2 and L3.
Scenario 2: Thread A enters L2 before thread B but yields the CPU between L2 and L3. Thread B then overwrites thread A's data, leaves L3 first, and sets pos according to the length it wrote. When thread A is scheduled back onto the CPU, it sets pos back to its own value.
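Scenario 2 can be replayed the same way, this time treating L2 as one indivisible step. This is again only a sketch under assumptions: toy_file, locked_write and scenario_2 are hypothetical names, and the record lengths are parameters so that the case where thread B writes more than thread A can also be tried.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_file {
    char   data[512];   /* in-memory stand-in for the file */
    size_t pos;         /* current position shared by A and B */
};

/* L2 as one uninterrupted step: lock_inode, write_buffer, unlock_inode. */
static void locked_write(struct toy_file *f, size_t pos, char c, size_t len)
{
    assert(pos + len <= sizeof f->data);
    memset(f->data + pos, c, len);
}

/* Replays Scenario 2 and returns the final pos. */
static size_t scenario_2(struct toy_file *f, size_t len_a, size_t len_b)
{
    size_t pos_a = f->pos;                /* A: L1 */
    locked_write(f, pos_a, 'a', len_a);   /* A: L2, then A yields the CPU */

    size_t pos_b = f->pos;                /* B: L1, pos not yet updated by A */
    locked_write(f, pos_b, 'b', len_b);   /* B: L2 overwrites A's data */
    f->pos = pos_b + len_b;               /* B: L3 */

    f->pos = pos_a + len_a;               /* A: L3, after being rescheduled */
    return f->pos;
}
```

With two 100-byte records, thread A's data is lost entirely and pos ends at 100. With len_a = 100 and len_b = 150, pos first becomes 150 and is then set back to 100, which points into the middle of thread B's record.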
On the last working day before the Dragon Boat Festival, a colleague was puzzling over a question: why do nginx and Apache write their logs directly, without taking a lock? If write is non-atomic, aren't they afraid of garbled logs? The logs really are not garbled, and there really is no lock, so why? By the analysis above, frequent writes should produce a mess. I am not familiar with the nginx code and did not read it closely, but it appears to open the file with the O_APPEND flag. What is so special about O_APPEND? To find out, I expand the write process further for O_APPEND mode:
L1.get_pos
L2-0.lock_inode
L2-1.change_pos_to_inode->size
L2-2.write_buffer
L2-3.update_inode->size
L2-4.unlock_inode
L3.update_pos
I can stop here. It should now be clear why writes to a file opened in O_APPEND mode are atomic: multiple threads or processes can write freely, and the data will neither interleave nor be overwritten. But again, if a buffer is not written by a single write and takes several calls, even writes to an O_APPEND file will interleave, because nothing protects the gap between two write calls.
The analysis shows that the actual writing is always done under the lock. But besides the actual writing, a write system call also includes positioning pos, and whether that positioning happens after or before the lock is taken decides whether the write call is atomic.
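The effect of positioning under the lock can be checked from user space by opening the same file twice and writing one record through each descriptor. The sketch below runs the two writes one after the other, so it illustrates only the positioning rule and is not a concurrency stress test; two_writers, write_record and the /tmp scratch-file template are hypothetical names chosen for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define REC_LEN 100

/* Write one REC_LEN-byte record filled with c using a single write(). */
static int write_record(int fd, char c)
{
    char buf[REC_LEN];

    memset(buf, c, sizeof buf);
    return write(fd, buf, sizeof buf) == (ssize_t)sizeof buf ? 0 : -1;
}

/*
 * Opens one scratch file twice with O_WRONLY | extra_flags, writes one
 * record through each descriptor, and returns the resulting file size
 * (or -1 on error). Each open() has its own pos, starting at 0.
 */
static long two_writers(int extra_flags)
{
    char path[] = "/tmp/append_demo_XXXXXX";
    long size = -1;
    int fd_b = -1;
    int fd_a = mkstemp(path);            /* only used to create the file */

    if (fd_a < 0)
        return -1;
    close(fd_a);

    fd_a = open(path, O_WRONLY | extra_flags);
    fd_b = open(path, O_WRONLY | extra_flags);
    if (fd_a >= 0 && fd_b >= 0 &&
        write_record(fd_a, 'a') == 0 && write_record(fd_b, 'b') == 0)
        size = (long)lseek(fd_a, 0, SEEK_END);

    if (fd_a >= 0)
        close(fd_a);
    if (fd_b >= 0)
        close(fd_b);
    unlink(path);
    return size;
}
```

two_writers(0) returns 100 because the second record starts at that descriptor's own pos of 0 and overwrites the first, while two_writers(O_APPEND) returns 200 because each write is positioned at the inode's size. A logger that relies on this should format each record into one buffer and emit it with one write call, since several calls per record can still interleave.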
Appendix: simulation code for Scenario 2
To be honest, reproducing the effect of Scenario 2 on a modern CPU is very hard: a few dozen lines of code may look like a lot to you, but the CPU executes them in the blink of an eye. So the scenario has to be simulated. In the generic_file_aio_write function in mm/filemap.c, add the following code after the mutex_unlock call (you can also use a jprobe to inject the delay):

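/* at the top of the file */
#include <linux/sched.h>

/* after mutex_unlock: hold back the process named "child" while its parent is named "parent" */
if (!strcmp(current->comm, "child")) {
    struct task_struct *pp = current->real_parent;
    while (pp && !strcmp(pp->comm, "parent")) {
        schedule_timeout(1);
    }
}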

This code simulates thread A being scheduled out. I knew that a problem would occur if thread A was scheduled out at this point and thread B caught up with it. Such a thing does happen; I just cannot know when. So I forced it to happen artificially.
As for how to design the matching test application: alas ... fork+exec.
Linus's fix
Linus's solution for atomic writes is remarkably elegant. Look at his style:
Define locking variants of the two helpers file_pos_read/file_pos_write; the gist is to give pos its own lock:

+static inline loff_t file_pos_read_lock(struct file *file)
 {
+    if (file->f_mode & FMODE_LSEEK)
+        mutex_lock(&file->f_pos_lock);
     return file->f_pos;
 }

+static inline void file_pos_write_unlock(struct file *file, loff_t pos)
 {
     file->f_pos = pos;
+    if (file->f_mode & FMODE_LSEEK)
+        mutex_unlock(&file->f_pos_lock);
 }

Modify the sys_write system call:

     file = fget_light(fd, &fput_needed);
     if (file) {
-        loff_t pos = file_pos_read(file);
+        loff_t pos = file_pos_read_lock(file);
         ret = vfs_write(file, buf, count, &pos);
-        file_pos_write(file, pos);
+        file_pos_write_unlock(file, pos);
         fput_light(file, fput_needed);
     }

This direct approach goes straight to the heart of the problem. In fact, most complexity is a byproduct of optimization!

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
