This paper analyzes the implementation principles of the memory- and I/O-related system calls and library functions, and from those principles derives the issues that need attention and the points where the system can be optimized. The calls covered are readahead, pread/pwrite, read/write, mmap, readv/writev, sendfile, fsync/fdatasync/msync, shmget, and malloc.
The paper first introduces the basic principles of how applications use memory and how the I/O system uses memory; understanding these principles is a great help in understanding the implementation of the system calls and library functions above.
1 Memory Management Basics
Linux manages physical memory in pages, usually 4 KB in size. When Linux initializes, it allocates management data structures for all physical memory and brings every physical page under management.
Each application has its own address space, which is of course virtual; the application's page table translates a virtual address into the actual physical address being operated on. Although the system can perform this translation, not every block of an application's virtual memory corresponds to a block of physical memory. Linux allocates physical memory to applications on demand, and this lazy allocation is implemented with the page fault exception. For example, when an application dynamically allocates 10 MB of memory, the allocation is recorded only in the application's virtual memory area management structure, marking that address range as occupied; the kernel allocates no physical memory at that point. Only when the application actually reads or writes the region, and the hardware finds no physical memory backing the accessed address, does a page fault occur, prompting the kernel to allocate a physical page for the virtual page currently being accessed.
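To make this on-demand behavior concrete, here is a minimal sketch (assuming Linux with glibc; the 10 MB figure mirrors the example above) that reserves an anonymous region and counts the minor page faults that occur only when the pages are first touched, not when they are allocated:

    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/resource.h>

    static long minor_faults(void)
    {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_minflt;
    }

    int main(void)
    {
        size_t len = 10 * 1024 * 1024;   /* 10 MB */

        /* Reserve virtual address space; no physical pages yet. */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        long before = minor_faults();

        /* Touch every page: each first access faults, and the
         * kernel allocates a physical page on demand. */
        memset(p, 0, len);

        printf("minor faults while touching pages: %ld\n",
               minor_faults() - before);
        munmap(p, len);
        return 0;
    }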
A physical page that has been assigned to an application can, in some cases, be reclaimed for other purposes. For the dynamically allocated memory described above, for example, its contents may be swapped out to the swap partition and the physical page temporarily reclaimed by the system. When the application touches that memory page again, the system allocates a physical page, swaps the contents back in from the swap partition, and re-establishes the page table mapping. The reclaim process differs for the physical memory backing different types of virtual pages; we explain this in more detail when analyzing the specific system calls.
2 File system I/O principle
The I/O part of the operating system covers not only ordinary block devices but also character devices and network devices; this article discusses only ordinary block devices.
An application can operate on a file in two ways: ordinary read/write, and mmap. In neither case does the application manipulate the block device directly when reading or writing file contents (with a few special exceptions); instead, everything passes through the operating system's page cache. That is, regardless of how the application reads file data, it is the operating system that loads that data into the kernel, and the application then operates on the in-memory copy in one way or another. When writing, the application actually just writes the data to the in-memory pages corresponding to the file; the pages are written back to the block device at some later time, or when a write-back is forced.
The page cache deserves explanation. It can be understood as a buffer for all file data; in general, file operations must pass through this intermediate layer (the special cases are described below). The page cache is not merely a data-transfer layer: within it, the kernel manages the data effectively, and data that is temporarily unused may remain in the page cache as long as memory permits. All applications share one page cache, so when different applications access the same block of file data, or one application accesses the same data several times, the block device does not have to be accessed each time, which considerably accelerates I/O performance.
What differs between the various system calls that access file data is how the data is accessed on top of the page cache; the movement of data between the page cache and the block device is essentially the same in all cases. The difference shows up mainly between the read/write style and the mmap style, each with its own details, which we describe separately below.
3 ReadAhead
Having described the principle and role of the page cache, readahead is easy to understand. When the read system call reads part of a file and the data is not in the page cache, the corresponding data must be read from the block device. For a block device like a disk, seeking is the most time-consuming operation: reading a small chunk of data and reading a large contiguous chunk cost nearly the same, but if that large chunk is read in many separate requests, much of the time is spent seeking.
Readahead is based on this observation: when a piece of data must be read and the subsequent operations are likely to be sequential reads, some additional data can be read into the page cache at the same time, so that the next access to the contiguous data finds it already in the page cache and requires no I/O operation. This can greatly improve the efficiency of data access.
Linux readahead comes in an automatic mode and a user-forced mode. Automatic readahead means that during a read system call, if data must be transferred from the block device, the system automatically sets the readahead size according to its current state and starts the readahead process. The amount read ahead each time is adjusted dynamically; the principle of adjustment is to enlarge or shrink the readahead size according to how well previous readahead has hit. The default readahead size can be set, different block devices can have different defaults, and viewing and setting a block device's default readahead size can be done with the blockdev command.
This automatic readahead mechanism runs before each I/O operation, so the default readahead size has some effect on performance. For heavily random reads, the readahead value should be made smaller, though not as small as possible: in general, estimate the average amount of data the application reads per request and set the readahead value slightly larger than that average. For heavily sequential reads, the readahead value can be turned up (when used with RAID, the readahead setting should also take the stripe size and stripe count into account).
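Beyond the device-wide default set with blockdev, an application can hint its own access pattern per open file with the standard posix_fadvise call, which switches readahead behavior for that file. A minimal sketch:

    #include <fcntl.h>

    /* Hint the kernel about the expected access pattern on fd.
     * On Linux, POSIX_FADV_SEQUENTIAL enlarges the readahead
     * window for this file, while POSIX_FADV_RANDOM minimizes
     * readahead, which suits heavily random workloads. */
    int set_access_pattern(int fd, int sequential)
    {
        return posix_fadvise(fd, 0, 0,
                             sequential ? POSIX_FADV_SEQUENTIAL
                                        : POSIX_FADV_RANDOM);
    }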
One more thing to note in automatic readahead mode: if the file itself is heavily fragmented, then even sequential reads with a large readahead value are not very efficient, because if the data is not contiguous on disk, seeks are still unavoidable and readahead helps little.
Linux also provides a readahead system call for forcing readahead of a file, that is, loading its data from the block device into the page cache in advance, which speeds up later access to the file data. Users can decide for themselves whether forced readahead is worthwhile.
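A minimal usage sketch of the readahead call (it is Linux-specific and needs _GNU_SOURCE; the file path and the 1 MB length are illustrative):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/datafile", O_RDONLY);   /* hypothetical file */
        if (fd < 0)
            return 1;

        /* Ask the kernel to populate the page cache with the first
         * 1 MB of the file; no data is copied to user space, and
         * later reads of this range should hit the page cache. */
        if (readahead(fd, 0, 1024 * 1024) < 0)
            return 1;

        close(fd);
        return 0;
    }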
4 Read/write
read/write are the basic I/O read and write calls; apart from mmap, the other I/O system calls share the same basic principle and calling path as read/write.
The read path: the requested range is converted into the corresponding pages, and for each page to be read the following procedure runs. First page_cache_readahead is called (if readahead is enabled), which executes a readahead policy based on the current readahead state (the readahead state structure is adjusted dynamically according to hits and the read pattern, and the policy is adjusted along with it); the readahead step may or may not perform I/O. After readahead, the kernel checks whether the needed data is already in the page cache; if not, readahead missed, handle_ra_miss is called to adjust the readahead policy, and an I/O operation reads the page data into memory and inserts it into the page cache. Once the page data is in the page cache (or if it was there already), the page is marked with mark_page_accessed, and its data is then copied to the application's address space.
The write path: as in the read path, the data to be written is converted into the corresponding pages, the data is copied from the application's address space into those pages, and each page is marked dirty, with mark_page_accessed called on it. If no synchronous write was requested, the write call then returns. If the file was opened with O_SYNC, the system writes all the dirty pages involved in this write back to the block device before returning, and the process blocks while this happens. Dirty-page synchronization is explained in detail in the analysis of fsync/fdatasync/msync.
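A short sketch of a synchronous write (the file name and buffer size are illustrative):

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        memset(buf, 'x', sizeof(buf));

        /* With O_SYNC, each write returns only after the data (and
         * the metadata needed to retrieve it) reaches the device. */
        int fd = open("/tmp/synced.dat", O_WRONLY | O_CREAT | O_SYNC, 0644);
        if (fd < 0)
            return 1;

        /* Blocks until the dirty pages from this write are on disk. */
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
            return 1;

        close(fd);
        return 0;
    }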
Special case: if the application specifies O_DIRECT when opening the file, the operating system bypasses the page cache entirely when reading and writing it. On reads, data is transferred directly from the block device into the application-specified buffer; on writes, data goes directly from the application-specified buffer to the block device. Because the page cache layer is skipped, writes in this mode are always synchronous.
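A sketch of O_DIRECT usage, assuming 4096 bytes as the required alignment (the actual requirement depends on the device's logical block size; the path is illustrative):

    #define _GNU_SOURCE          /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf;
        size_t align = 4096;     /* assumed logical block size */

        /* O_DIRECT requires a suitably aligned user buffer. */
        if (posix_memalign(&buf, align, align) != 0)
            return 1;
        memset(buf, 0, align);

        int fd = open("/tmp/direct.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0)
            return 1;

        /* Data moves straight from buf to the block device,
         * bypassing the page cache. */
        if (write(fd, buf, align) != (ssize_t)align)
            return 1;

        close(fd);
        free(buf);
        return 0;
    }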
5 mmap
mmap is widely used: it can map files into the address space for reading and writing, it can implement shared memory, and malloc also uses mmap when allocating memory. In this section we first discuss using mmap to read and write files.
Each process's virtual memory is managed in regions: when virtual memory is allocated, it is divided into regions for different purposes. At allocation time these regions have no physical memory behind them; only the management structures are allocated and set up. When the process uses memory in a region that has no corresponding physical memory, a page fault occurs. In handling the fault, the system allocates physical memory according to the management structure of that virtual memory region, loads data into the physical memory if necessary (as with mmap), and establishes the mapping from virtual to physical memory; the process can then continue accessing the virtual memory as before.
The implementation of mmap rests on the same principle. When mmap maps a file (or part of one) into a process's address space, no file data is loaded; a region of the process's virtual address space is simply carved out and marked as mapping that portion of the file, and the mmap operation is complete.
When the process then tries to read or write the mapped region and no corresponding physical page exists, a page fault occurs and the fault handler is entered. The handler uses different policies depending on the type of memory region. For a virtual memory region mapping a file, the handler first finds the file's management data structure and computes the file offset of the required page, then loads the corresponding data from the file into the page cache; this is where the path differs from the read system call. During loading, if the region's management structure has the VM_RAND_READ flag set, the system loads only the required page; if VM_SEQ_READ is set, it runs the same readahead process as the read system call. Once the required page is in the page cache, the system adjusts the page table so that the physical page appears in the application's address space. mmap fault handling makes no distinction between reads and writes; the above process runs whether the fault was caused by a read or a write.
The VM_RAND_READ and VM_SEQ_READ flags of the virtual memory region management structure can be adjusted with the madvise system call.
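A sketch combining both steps: map a file, then declare sequential access with madvise (MADV_SEQUENTIAL and MADV_RANDOM correspond to the VM_SEQ_READ and VM_RAND_READ flags above; the path is illustrative):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/datafile", O_RDONLY);
        if (fd < 0)
            return 1;

        struct stat st;
        if (fstat(fd, &st) < 0 || st.st_size == 0)
            return 1;

        /* Map the whole file; no data is loaded yet. */
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED)
            return 1;

        /* Sets VM_SEQ_READ on the region, so the fault handler runs
         * readahead (MADV_RANDOM would set VM_RAND_READ instead). */
        madvise(p, st.st_size, MADV_SEQUENTIAL);

        /* Each first touch of a page faults it in via the page cache. */
        long sum = 0;
        for (off_t i = 0; i < st.st_size; i++)
            sum += p[i];
        printf("checksum: %ld\n", sum);

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }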
One issue to note when reading and writing files through mmap: when a physical page for the mapped region does not exist, the fault takes the system into kernel mode; but when the physical page does exist, the application operates on the memory directly in user mode and never enters the kernel. Recall that in the read/write path the kernel calls mark_page_accessed on every page involved; mark_page_accessed records the page's active state, and active pages are not easily reclaimed. Reading a file through mmap, by contrast, enters the kernel only on a fault, so the page's active state cannot be maintained and the page is more easily reclaimed by the system (mark_page_accessed is called only for newly allocated pages inside the fault handler). In addition, when writing the mapped region without entering the kernel, the written physical page cannot be marked dirty in the kernel's page state (only the dirty bit of the page table entry is set); we describe this problem in detail in the msync discussion below.
6 Pread/pwrite, readv/writev
In the kernel, these system calls differ little from read/write; only the parameters differ. read/write use the file's default offset, while pread/pwrite take the file offset as a parameter, which avoids locking around the shared read/write offset in multi-threaded operation. readv/writev can read file contents into multiple buffers, or write data from multiple buffers to a file, avoiding the overhead of multiple system calls.
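A sketch showing both styles (the file path and offsets are illustrative):

    #include <fcntl.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/io.dat", O_RDWR | O_CREAT, 0644);
        if (fd < 0)
            return 1;

        /* writev: gather data from two buffers in one system call. */
        char hdr[] = "header:";
        char body[] = "payload\n";
        struct iovec iov[2] = {
            { .iov_base = hdr,  .iov_len = strlen(hdr)  },
            { .iov_base = body, .iov_len = strlen(body) },
        };
        if (writev(fd, iov, 2) < 0)
            return 1;

        /* pread: read at an explicit offset without touching the
         * shared file offset, so concurrent threads need no lock. */
        char buf[8];
        if (pread(fd, buf, sizeof(buf), 7) < 0)   /* reads "payload" */
            return 1;

        close(fd);
        return 0;
    }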
7 Sendfile
sendfile copies data from one file, starting at a given offset, into another file (which may be a socket). This saves copies of the data in memory: implemented with read/write, the same transfer would require two extra data-copy operations. Otherwise the kernel implementation differs little from read/write.
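A sketch of copying one file into another with sendfile (on older kernels the destination had to be a socket; copying to a regular file as here requires a reasonably modern Linux, and the paths are illustrative):

    #include <fcntl.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int in_fd = open("/tmp/src.dat", O_RDONLY);
        int out_fd = open("/tmp/dst.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (in_fd < 0 || out_fd < 0)
            return 1;

        struct stat st;
        if (fstat(in_fd, &st) < 0)
            return 1;

        /* Transfer the whole file inside the kernel: the data goes
         * from in_fd's page cache to out_fd without a round trip
         * through a user-space buffer. */
        off_t off = 0;
        while (off < st.st_size) {
            ssize_t n = sendfile(out_fd, in_fd, &off, st.st_size - off);
            if (n <= 0)
                return 1;
        }

        close(in_fd);
        close(out_fd);
        return 0;
    }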
8 Fsync/fdatasync/msync
All three of these system calls synchronize in-memory dirty pages to the file on the block device, with some differences between them.
fsync writes a file's dirty pages in the page cache back to disk. A file's content in the page cache includes both file data and inode data: writing a file modifies not only the file data but also data in the inode (such as the file modification time), so there are really two parts to synchronize, and fsync writes both kinds of dirty page for the given file back to disk. Besides forcing synchronization with fsync, the system also synchronizes automatically and periodically, writing dirty pages back to disk.
fdatasync writes only the file's data dirty pages to disk and does not write back the inode-related dirty pages.
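A sketch contrasting the two (the file path is illustrative):

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/log.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0)
            return 1;

        const char rec[] = "record\n";
        if (write(fd, rec, strlen(rec)) < 0)
            return 1;

        /* fdatasync: flush the data pages only; cheaper, because
         * inode changes such as the modification time can stay dirty. */
        if (fdatasync(fd) < 0)
            return 1;

        /* fsync: flush both the data pages and the inode metadata. */
        if (fsync(fd) < 0)
            return 1;

        close(fd);
        return 0;
    }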
msync is different from fsync. When a file is mapped into memory with mmap, writes to the mapped addresses that cause no page fault never enter the kernel, so the kernel's page state cannot be set to dirty; only the dirty bit of the page table entry is set, automatically, by the CPU. If a page is not marked dirty in the kernel, other synchronization paths, such as fsync and the kernel's sync threads, cannot synchronize that data. msync's main job is to walk the page table of a memory region and mark as dirty, in the kernel's page state, every page whose page table entry has the dirty bit set. If msync is called with MS_SYNC, it then synchronizes the data as fsync does; if called with MS_ASYNC, the data is left to be synchronized by a kernel sync thread or a later call.
At munmap time, the system performs an msync-like operation on the mapped region, so data is not necessarily lost if msync is never called (a process's mapped areas are also automatically unmapped at exit); but when writing large amounts of data, not calling msync carries a risk of data loss.
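A sketch of writing through a mapping and forcing the data out with MS_SYNC (path and size illustrative):

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 4096;
        int fd = open("/tmp/mapped.dat", O_RDWR | O_CREAT, 0644);
        if (fd < 0 || ftruncate(fd, len) < 0)
            return 1;

        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return 1;

        /* This store happens in user mode; only the page-table
         * dirty bit records the change. */
        strcpy(p, "written through the mapping");

        /* MS_SYNC: propagate the page-table dirty bits to the
         * kernel's page state and write the dirty pages back
         * before returning. */
        if (msync(p, len, MS_SYNC) < 0)
            return 1;

        munmap(p, len);
        close(fd);
        return 0;
    }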
9 Shmget/shmat
In fact, both the POSIX and the System V shared memory interfaces are implemented with mmap, and the principle is the same: map a file (which can be a special file or an ordinary file) into the address spaces of different processes. From the mmap principle described above, a file has only one copy in the kernel's page cache, so when different processes operate on mappings of the same file region, they are in fact sharing memory.
On that principle, POSIX shared memory is easy to understand; System V shared memory looks less intuitive, but in fact it only differs in using a special filesystem, shmfs, to do the sharing. shmfs implements one special behavior. When an ordinary file is used for sharing and the system needs to reclaim physical pages, dirty pages are written back to disk before being reclaimed; but if msync was never called, the system cannot know a page is dirty and, when reclaiming it, discards its contents outright (believing them to be on disk already), which produces inconsistent data. The shmfs implementation is special: all of its pages are always treated as dirty, and its write-back function does not write the data to an ordinary file but instead allocates space in the swap partition (if there is one), writes the page contents there, and marks the page. shmfs thus avoids the risk that mmap carries in this situation, transparently to the user; it is designed for memory sharing.
shmget asks the system for a shared memory region of a given size; it only registers a uniquely identified shared memory region in the operating system, allocating a management structure but no physical memory at this point. It can be understood as creating a file in shmfs (or, if one already exists, opening it). shmat then indirectly uses mmap to map the shmfs file created (or opened) by shmget into the application's address space; from there it behaves like an mmap of an ordinary file, except that, thanks to shmfs, shared memory cleverly avoids the write-back drawback described above.
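A sketch of the System V interface (the key 0x1234 is an arbitrary value that cooperating processes would agree on):

    #include <stdio.h>
    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void)
    {
        /* Create (or open) a 4 KB System V shared memory segment;
         * conceptually, this creates or opens a file in shmfs. */
        int id = shmget(0x1234, 4096, IPC_CREAT | 0600);
        if (id < 0)
            return 1;

        /* Map the segment into this process's address space; other
         * processes attaching the same key see the same pages. */
        char *p = shmat(id, NULL, 0);
        if (p == (char *)-1)
            return 1;

        strcpy(p, "hello from shared memory");
        printf("%s\n", p);

        shmdt(p);
        /* Remove the segment once it is no longer needed. */
        shmctl(id, IPC_RMID, NULL);
        return 0;
    }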
10 malloc
malloc is just a library function, with different implementations on different platforms; glibc uses the ptmalloc implementation. malloc allocates memory from the heap, but the kernel has no notion of a heap; the heap is purely an application concept. When a process is created, a region of its virtual address space is set aside as the heap, with no physical memory corresponding to that region. Allocating memory with malloc really just hands the application a smaller piece of this virtual memory region, and only when the application accesses that piece does a page fault obtain physical memory. Likewise, free does not release physical memory; it returns the piece allocated from the heap back to the heap. All of this is implemented by glibc at the application level.
malloc relies on two system calls, brk and mmap. brk grows (or shrinks) the heap: the heap's start address is fixed when the process is created (heap space grows upward) and its initial size is 0. When malloc finds insufficient free space in the current heap, it calls brk to grow the heap, which in effect moves the end address of the heap's virtual memory region up (or down) to a given position. When a malloc request exceeds a threshold (for example, 128 KB), malloc no longer allocates from the heap but instead maps a fresh block of virtual addresses with mmap, and free releases that region with munmap. This strategy mainly simplifies heap management, avoiding managing the heap over a large address span, and rests on the assumption that large allocations are infrequent.
Note that if an allocation exceeds the threshold, both allocation and release go through system calls and efficiency drops, so an application that frequently allocates blocks larger than the threshold performs poorly. Such an application can raise the threshold so that allocation and release complete in user mode (within the heap). The malloc parameters can be tuned with mallopt: M_TRIM_THRESHOLD means that if the free space at the top of the heap exceeds this value, the heap should be shrunk at a suitable time; M_MMAP_THRESHOLD means that allocation requests larger than this value are served with the mmap system call.
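A sketch of tuning these thresholds with mallopt (the 1 MB values are arbitrary examples):

    #include <malloc.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Raise the mmap threshold so allocations up to 1 MB are
         * served from the heap instead of per-call mmap/munmap. */
        mallopt(M_MMAP_THRESHOLD, 1024 * 1024);

        /* Allow up to 1 MB of free space at the top of the heap
         * before trimming it back to the kernel. */
        mallopt(M_TRIM_THRESHOLD, 1024 * 1024);

        /* This 512 KB request now stays within the heap (brk), and
         * freeing it returns the space to the heap, not the kernel. */
        void *p = malloc(512 * 1024);
        free(p);
        return 0;
    }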