The difference between mmap and direct IO [repost]


Reposted from: http://www.cnblogs.com/zhaoyl/p/5901680.html

After reading this article, the distinction in the title should be self-evident. Originally reposted from http://blog.chinaunix.net/uid-27105712-id-3270102.html

In Linux development, technical staff care a great deal about several things that are closely tied to performance: processes, CPU, memory, network IO, and disk IO. This article tries to walk through the details of file IO in a detailed, comprehensive and plain-language way, and to explore how to improve IO performance from several angles. The goal is an easy-to-understand explanation, not a copy of kernel code.

Before going into details, we first need the big picture. Standing at ten thousand meters and looking down at file IO, the design is layered. Layering brings two benefits: the structure is clear, and the layers are decoupled. Take a look at the picture below.

1. Writing a file across the layers

The program's ultimate goal is to get the data onto disk, and the system provides compromise solutions, balancing generality and performance, to achieve that. Let's look at the most common way to write a file, a typical example, and also the longest IO path.

{
    char *buf = malloc(max_buf_size);
    strncpy(buf, src, max_buf_size);
    fwrite(buf, max_buf_size, 1, fp);
    fclose(fp);
}

The buf from malloc here is the application buffer, the buffer in the application layer. After fwrite is called, the data is copied from the application buffer into the CLIB buffer, the C library's standard IO buffer. When fwrite returns, the data is still in the CLIB buffer; if the process core dumps at this point, that data is lost, since it has not been written to the disk medium. When fclose is called, the buffered data is flushed out of the CLIB buffer. Besides fclose there is also an explicit flush operation, the fflush function, but fflush only copies the data from the CLIB buffer into the page cache; it does not flush it to disk. Flushing from the page cache to disk can be done by calling the fsync function.
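
As a rough sketch of crossing these layers explicitly (file name, buffer contents and sizes are made up for illustration; error handling is minimal):

#include <stdio.h>
#include <string.h>
#include <unistd.h>          /* fsync */

int main(void)
{
    FILE *fp = fopen("data.txt", "w");   /* hypothetical file name */
    if (fp == NULL)
        return 1;

    char buf[64];
    strncpy(buf, "hello, page cache", sizeof(buf));

    fwrite(buf, strlen(buf), 1, fp);  /* application buffer -> CLIB buffer */
    fflush(fp);                       /* CLIB buffer -> page cache */
    fsync(fileno(fp));                /* page cache -> disk medium */

    fclose(fp);
    return 0;
}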

From the example above you can see that an ordinary fwrite goes through quite a journey: the data is copied several times before it reaches its destination. Some people wonder whether this improves performance or actually hurts it. Let's set that question aside for now.

Some people say: I don't want the fwrite+fflush combination, I want to write directly into the page cache. That is the ordinary file IO path through the read/write functions. These functions basically correspond one-to-one to system calls such as sys_read/sys_write. Calling write copies data from the application layer into the kernel layer through the system call, that is, from the application buffer into the page cache.
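
A minimal sketch of that path, going through write() directly so that no CLIB buffer is involved (file name and contents are illustrative; error handling is minimal):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return 1;

    const char *buf = "copied straight into the page cache";
    write(fd, buf, strlen(buf));   /* application buffer -> page cache, one syscall */

    close(fd);
    return 0;
}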

Does a system call such as write trigger a user-mode/kernel-mode switch? Yes, and there is no way to avoid that cost. This is where mmap comes in: mmap maps the page cache into the user's address space, so the application writes the file as if it were writing ordinary application-layer memory, eliminating the per-operation system call overhead.
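
A minimal sketch of writing a file through mmap; the file name and the one-page size are assumptions for illustration, and the file has to be given a size (here with ftruncate) before the mapping is written:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define FILE_SIZE 4096   /* assumed size: one page */

int main(void)
{
    int fd = open("data.mmap", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return 1;
    ftruncate(fd, FILE_SIZE);        /* extend the file so the mapping is backed */

    char *p = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    strcpy(p, "written through the mapping");   /* plain memory write, no write() per operation */

    msync(p, FILE_SIZE, MS_SYNC);    /* optional: force write-back to disk */
    munmap(p, FILE_SIZE);
    close(fd);
    return 0;
}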

And what if you do not even want the page cache, and just want to send the data straight to the disk device? Open the file with the O_DIRECT flag, and writes go directly to the device.
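
A hedged sketch of the O_DIRECT case. On Linux the flag needs _GNU_SOURCE, and the buffer address, transfer size and file offset usually have to be aligned (the 4096-byte alignment and the file name here are assumptions):

#define _GNU_SOURCE              /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK 4096               /* assumed alignment and transfer size */

int main(void)
{
    int fd = open("data.direct", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0)
        return 1;

    void *buf;
    if (posix_memalign(&buf, BLOCK, BLOCK) != 0)   /* O_DIRECT wants an aligned buffer */
        return 1;
    memset(buf, 'A', BLOCK);

    write(fd, buf, BLOCK);       /* bypasses the page cache on its way to the device */

    free(buf);
    close(fd);
    return 0;
}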

If you push even further: is there a way to write sectors directly? That is the so-called raw device write, which bypasses the file system and writes sectors directly, the kind of operation performed by tools such as fdisk, dd and cpio.
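
A heavily hedged sketch of what such a raw write looks like in code; /dev/sdX is a placeholder, root privileges are required, and running this against a real disk overwrites its first sector:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define SECTOR 512

int main(void)
{
    int fd = open("/dev/sdX", O_WRONLY);   /* placeholder block device, no file system involved */
    if (fd < 0)
        return 1;

    char buf[SECTOR];
    memset(buf, 0, sizeof(buf));
    write(fd, buf, sizeof(buf));           /* writes the device's first sector directly */

    close(fd);
    return 0;
}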

2. IO Call Chain

The write operations listed above pass through the various cache layers; you can see that the system provides a very rich set of interfaces to cover different write requirements. Now let's walk through the call chain of file IO using Figure 1.

fwrite is the top-most interface the system provides and the most commonly used one. It maintains a buffer in the user process's space, caches and merges many small consecutive writes, and eventually calls the write function to write them out in one go (or splits a large chunk of data into multiple write calls).
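
The size and mode of that user-space buffer can be controlled with setvbuf; a small sketch (the 64 KB buffer size and the file name are arbitrary choices):

#include <stdio.h>

int main(void)
{
    static char iobuf[1 << 16];            /* 64 KB CLIB buffer, arbitrary size */
    FILE *fp = fopen("data.txt", "w");     /* hypothetical file name */
    if (fp == NULL)
        return 1;

    setvbuf(fp, iobuf, _IOFBF, sizeof(iobuf));   /* must be set before the first write */

    /* many tiny fwrite calls are merged in iobuf and handed to write() in large chunks */
    for (int i = 0; i < 100000; i++)
        fwrite("x", 1, 1, fp);

    fclose(fp);
    return 0;
}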

The write function copies the data from the application layer into the kernel layer through the system call interface, so write triggers a user-mode/kernel-mode switch. When the data reaches the page cache, the kernel does not immediately push it further down; instead it returns to user space. When the data is actually written to the hard disk is decided by the kernel's IO scheduling, so write is an asynchronous call. This differs from read: a read call first checks whether the data is in the page cache; if it is, the data is returned to the user, and if not, the request is passed down synchronously, read waits for the data, and only then returns to the user, so read is a synchronous process. Of course you can also turn write's asynchronous process into a synchronous one by opening the file with the O_SYNC flag.
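
A small sketch of the synchronous variant: with O_SYNC, each write() returns only after the data has been pushed through to the storage device (file name and contents are illustrative):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.sync", O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
    if (fd < 0)
        return 1;

    const char *msg = "this write returns only after reaching the device";
    write(fd, msg, strlen(msg));   /* blocks until the data is on stable storage */

    close(fd);
    return 0;
}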

After the data reaches the page cache, the kernel's pdflush threads continually scan for dirty pages and decide whether to write them back to disk, submitting the pages that need write-back to the IO queue, i.e. the IO scheduling queue. The IO scheduling queue's policy then decides when the write-back actually happens.

Speaking of the IO scheduling queue, we have to mention the disk's structure. The point here is the head and the elevator algorithm: move in one direction as far as possible before turning back, to avoid seeking back and forth. The disk itself rotates in one direction only; it does not keep switching between clockwise and counter-clockwise. A figure copied from the internet illustrates this and is not described in detail here.

The IO queue has two main tasks: merging adjacent sectors and sorting requests. Merging is easy to understand; sorting means ordering requests, as far as possible, along the disk's rotation direction and the head's direction of travel, because head seeks are slow and expensive.

The IO queue here is closely related to the iostat analysis tool we commonly use: iostat's rrqm/s and wrqm/s show the number of read and write requests merged per second, and avgqu-sz is the average queue length.

The kernel has several IO scheduling algorithms. When the disk is an SSD there is no track and no head, every access is effectively random, and piling these scheduling algorithms on top is superfluous. Fine, so there is a scheduler called noop that does (almost) nothing, apart from merging; it can be configured on systems with SSD drives.

After leaving the IO queue, the data reaches the driver layer (the kernel of course has more sub-layers, ignored here); the driver layer moves the data into the disk cache via DMA.

When the disk cache is written out to the disk medium is the disk controller's own business. If you want to sleep soundly and be sure the data is on the disk medium, call the fsync function; then you can be confident it has been written to disk.

3. Consistency and security

Having gone through the details of the call chain, let's talk about consistency and safety. Before the data reaches the disk medium it may sit in various physical-memory caches, so what happens if the process crashes, the kernel panics, or the power fails? Will the data be lost?

If the process crashes: when the data is still in the application buffer or the CLIB buffer, it is lost. Once the data has reached the page cache, even though it is not yet on the hard drive, the data is not lost even if the process core dumps.

If the kernel crashes, the data is lost as long as it has not reached the disk cache.

As for a power failure... well, then even the gods cannot save you; go ahead and cry.

Now for consistency: if two processes or threads write at the same time, will the writes get scrambled? Or if process A writes while process B reads, will B read dirty data?

The article is already getting long, so rather than citing lots of examples, here is the principle for reasoning about it: the buffer that fwrite operates on lives in the process's private space, so if two threads read and write it, it must be protected by a lock. If two processes are involved, each has its own address space; whether locking is needed depends on the application scenario.

For the write system call, if the amount written is smaller than PIPE_BUF (typically 4096), the operation is atomic, which guarantees that two processes writing "AAA" and "BBB" will not produce interleaved data such as "ABAABB". The O_APPEND flag guarantees that the file position is recomputed to the end of the file atomically on every write.
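
A small sketch of the append case described above (file name and record are illustrative): each process opens the same file with O_APPEND, so the kernel recomputes the end-of-file position and writes the record in one step, and short records are appended whole rather than interleaved:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* each cooperating process opens the shared log the same way */
    int fd = open("shared.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return 1;

    const char *rec = "AAA\n";     /* short record, well under PIPE_BUF */
    write(fd, rec, strlen(rec));   /* seek-to-end and write happen as one atomic step */

    close(fd);
    return 0;
}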

Once the data is in the kernel layer, the kernel takes its own locks and guarantees consistency.

4. Performance issues

Performance can be analysed at the system level and at the device level. The physical characteristics of the disk fundamentally determine performance, while the IO scheduling policy and system calls are also performance killers.

Disk seeks are quite slow: the average seek time is around 10 ms, which means only about 100-200 seeks per second.

Rotational speed is also key to performance. The fastest drives today spin at 15000 rpm, roughly 250 revolutions per second. Take the extreme case where the head never has to seek and all the data is stored contiguously on one cylinder; you can then work out how much data can be read per second. Of course that is a theoretical value: in practice the platter spins so fast that the head electronics cannot keep up, so it takes several revolutions to read a full track.
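
To put a rough number on that: at 15000 rpm the platter makes 250 revolutions per second, so if one track held, say, 1 MB of data (an assumption purely for illustration), a head that never had to seek could stream at most about 250 MB/s. Real drives fall short of this because of seeks, the head-electronics limit just mentioned, and the interface bus.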

In addition, the transfer rate of the device's interface bus is the upper limit on the achievable rate.

There are also drives recorded at constant density: the outer tracks have more sectors and a higher linear speed, so placing frequently accessed data on the outer tracks can also improve performance.

Using multiple disks concurrently is another way to improve performance.

Here are some rule-of-thumb numbers from the industry: a mechanical hard disk writes sequentially at roughly 30 MB/s and reads sequentially at roughly 50 MB/s, with good drives exceeding 100 MB/s; SSD reads can reach about 400 MB/s, while SSD write performance is close to that of a mechanical disk.

The most fundamental difference between O_DIRECT and a raw device is that O_DIRECT still goes through the file system. At the application layer its object of operation is a file handle, and inside the kernel, at the file layer, its operations are based on inodes and data blocks; these concepts all belong to the EXT2/3 file system, and what ends up on disk is ultimately an ext3 file.

A raw device write, on the other hand, has no notion of a file system: it operates on sector numbers, its object is the sector, and what gets written out is not necessarily an ext3 file (it only becomes one if you write it following ext3's rules).

Designing and optimising your own file module on top of O_DIRECT is generally done when you are unhappy with the system's cache and scheduling policy and want to implement, in the application layer, file reading and writing tailored to your own business characteristics. But what gets written is still an ext3 file, so if the disk is removed and mounted on any other Linux system, the data can still be viewed.

A system designed on top of a raw device is generally built out of dissatisfaction with ext3's many shortcomings, with its own file layout and indexing methods. An extreme example: treat the whole disk as a single file and do not index it at all; then there is no inode limit and no file size limit, and the file can be as large as the disk. If such a disk is removed and mounted on another Linux system, the data cannot be recognised.

Both paths go through the driver layer for reads and writes. Also, when the system is booting and still in real mode, the raw device can be read and written through the BIOS interface.

"The advantages and disadvantages of direct IO": https://www.ibm.com/developerworks/cn/linux/l-cn-directio/

"The relationship between AIO and direct IO, and DMA": http://blog.csdn.net/brucexu1978/article/details/7085924

"IO model matrix": http://www.ibm.com/developerworks/cn/linux/l-async/

"Why Nginx introduced multithreading: reducing the impact of blocking IO": http://www.aikaiyuan.com/10228.html

"How Nginx uses the AIO mechanism: output chain": http://www.aikaiyuan.com/8867.html

"Nginx's wrapper around the native Linux AIO mechanism": http://www.aikaiyuan.com/8869.html

"Nginx output chain analysis": https://my.oschina.net/astute/blog/316954

"Kernel switch counts for sendfile and read/write": http://www.cnblogs.com/zfyouxi/p/4196170.html
