The story behind malloc

Source: Internet
Author: User
Tags: posix, prefetch, sendfile
This article analyzes the implementation principles of memory- and I/O-related system calls and library functions, and, based on those principles, discusses the issues to watch out for when using them and where optimization effort pays off. The system calls and functions covered include readahead, pread/pwrite, read/write, mmap, readv/writev, sendfile, fsync/fdatasync/msync, shmget, and malloc.

It begins with a brief introduction to how applications use memory and the basic principles behind the I/O subsystem's use of memory, which helps in understanding how these system calls and library functions are implemented.

1 Memory Management Basics

Linux manages physical memory in units of pages, typically 4 KB each. During initialization the kernel allocates management data structures covering all physical memory, so that every physical page is tracked.

Each application has its own address space. These addresses are of course virtual; the process's page table translates each virtual address into the actual physical address being operated on. Although the system can translate any mapped virtual address into a physical one, not every piece of an application's virtual memory has physical memory behind it. Linux allocates physical memory to applications on demand, using the page fault exception to implement this. For example, when an application dynamically allocates 10 MB of memory, the allocation is recorded only in the application's virtual memory area management structures, marking that address range as occupied; the kernel does not allocate physical memory for it at this point. Only when the application actually reads or writes that memory area does the missing physical backing trigger a page fault, which prompts the kernel to allocate a physical page for the virtual page currently being accessed.
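This on-demand behavior can be observed from user space. Below is a minimal, Linux-only sketch (the function names are ours, and it reads the resident set size from /proc/self/statm): allocating a large block barely changes the resident set, while touching every page makes it grow.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read this process's resident set size, in pages, from
 * /proc/self/statm (Linux-specific). */
long resident_pages(void)
{
    FILE *f = fopen("/proc/self/statm", "r");
    if (!f)
        return -1;
    long size = 0, resident = -1;
    if (fscanf(f, "%ld %ld", &size, &resident) != 2)
        resident = -1;
    fclose(f);
    return resident;
}

/* Returns 1 if touching a fresh 16 MiB allocation visibly grows the
 * resident set, i.e. physical pages are assigned on first touch, not
 * at malloc time. */
int demand_paging_demo(void)
{
    const long len = 16 << 20;
    volatile char *p = malloc(len);    /* only virtual space so far */
    if (!p)
        return 0;
    long before = resident_pages();
    for (long i = 0; i < len; i += 4096)
        p[i] = 1;                      /* first touch faults each page in */
    long grown = resident_pages() - before;
    free((void *)p);
    return grown > 1000;               /* ~4096 pages of 4 KB were touched */
}
```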

A physical page that has been assigned to an application can, under some circumstances, be reclaimed by the system for other purposes. For dynamically allocated memory such as the above, its contents may be swapped out to the swap partition and the physical page temporarily reclaimed. When the application touches that memory page again, the system allocates a physical page, reads the contents back from the swap partition into it, and re-establishes the page table mapping. The exact procedure for allocating physical memory differs between the different kinds of virtual memory pages; we explain the details when analyzing the specific system calls.

2 File system I/O principle

The I/O part of the operating system covers not only ordinary block devices but also character devices and network devices; this article discusses ordinary block devices only.

An application can operate on a file in two basic ways: ordinary read/write, and mmap. Either way, the application does not manipulate the block device directly when reading and writing file contents (with a few special exceptions), but goes through the operating system's page cache. That is, no matter how the application reads file data, the operating system first loads that part of the file into kernel memory, and the application then manipulates the in-memory file data in one way or another. Writing works the same way: the application really just writes data into the in-memory pages corresponding to the file, and those pages are written back to the block device at some later time, or when a write-back is forced.

The page cache deserves some explanation. It can be understood as a buffer for all file data: in general, file operations go through the page cache as a middle layer (the special cases are described below). The page cache is not merely a transfer layer; within it, the kernel manages the data actively. Data that is temporarily unused is kept in the page cache as long as memory conditions allow, and all applications share a single page cache. As a result, accesses to the same file data by different applications, or repeated accesses to the same data by one application, do not require going back to the block device, which speeds up I/O considerably.

Different system calls access file data in the page cache in different ways, but the process of moving data between the page cache and the block device is basically the same for all of them. The difference shows up mainly between the read/write approach and the mmap approach; the details of each are described below.

3 ReadAhead

With the principle and role of the page cache described, readahead is easy to understand. When the read system call is used to read part of a file and the data is not in the page cache, the corresponding data must be fetched from the block device. For a block device like a disk, seeking is the most time-consuming operation; reading a small piece of data and reading a large stretch of consecutive data take almost the same time. But if that large stretch is read in many small pieces, it costs many seeks, and the total time grows accordingly.

Readahead is based on this strategy: when a piece of data needs to be read and the subsequent operations look like sequential reads, some extra data can be read into the page cache ahead of time. The next accesses to the consecutive data then find it already in the page cache and require no I/O at all, which can improve data access efficiency dramatically.

Linux readahead comes in an automatic mode and a user-forced mode. Automatic readahead means that when the read system call needs to transfer data from a block device, the system sets the readahead size automatically according to the current state and starts the readahead process. The amount of data prefetched each time is adjusted dynamically, growing or shrinking based on how well the previous readahead hit. The default readahead size can be configured, different block devices can have different defaults, and a block device's default readahead size can be viewed and set with the blockdev command.

This automatic readahead mechanism runs before every I/O operation, so the default readahead size has a measurable effect on performance. Under heavy random-read workloads the readahead value should be lowered, but not made as small as possible: a good rule is to estimate the average amount of data the application reads per request and set the readahead value slightly above that average. Under heavy sequential-read workloads the readahead value can be larger (when using RAID, the readahead setting should also take the stripe size and stripe count into account).

One thing to watch out for with automatic readahead: if the file itself is heavily fragmented, then even sequential reads with a large readahead value will not be very efficient, because data that is not contiguous on disk still forces disk seeks, and readahead helps little.

Linux also provides the readahead system call for forced prefetching: it loads file data from the block device into the page cache, speeding up subsequent accesses to that data. Users can decide for themselves whether forced prefetching fits their needs.
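A minimal sketch of forced prefetching (the helper name is ours): readahead is Linux-specific and needs _GNU_SOURCE, while posix_fadvise with POSIX_FADV_SEQUENTIAL is the portable way to enlarge the automatic readahead window.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Prefetch the first `len` bytes of `path` into the page cache.
 * Returns 0 on success, -1 if the file cannot be opened. */
int prefetch_file(const char *path, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* readahead() queues the I/O and returns at once; the actual
     * reading happens in the background. */
    if (readahead(fd, 0, len) < 0) {
        /* Portable fallback: hint that reads will be sequential so the
         * kernel enlarges its automatic readahead window. */
        posix_fadvise(fd, 0, (off_t)len, POSIX_FADV_SEQUENTIAL);
    }
    close(fd);
    return 0;
}
```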

4 Read/write

read/write is the basic I/O read/write process; apart from mmap, the other I/O read and write system calls share the same underlying principles and call paths as read/write.

Read process: the requested data range is converted into the corresponding pages, and for each page the following procedure runs. First page_cache_readahead is called (if readahead is enabled), which adjusts dynamically according to the current readahead state and policy (the readahead state structure is updated according to hits and the detected read pattern, and the policy is adjusted as well); this readahead step may or may not issue I/O. After it completes, the page cache is checked for the required data. If it is absent, the readahead missed: handle_ra_miss is called to adjust the readahead policy, and an I/O operation reads the page data into memory and adds it to the page cache. Once the page data is in the page cache (or if it was there already), the page is marked with mark_page_accessed, and its data is then copied into the application's address space.

Write process: as in the read process, the data to be written is converted into the corresponding pages, the data is copied from the application's address space into those pages, and the pages are marked dirty; mark_page_accessed is called, and if no synchronous write was requested, the write call then returns. If the file was opened with O_SYNC, the system writes all the dirty pages involved in this write back to the block device before returning, and this write-back blocks. Synchronization of dirty pages is covered in detail in the analysis of fsync/fdatasync/msync.

Special case: if the application opens the file with O_DIRECT, the operating system bypasses the page cache entirely when reading and writing the file. Data is transferred directly from the block device into the application-supplied buffer on reads, and directly from the application-supplied buffer to the block device on writes. Because there is no page cache layer, writes done this way are always synchronous.
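A sketch of an O_DIRECT write (the helper name is ours). O_DIRECT typically requires the user buffer, the file offset, and the transfer length to be aligned to the device's logical block size (512 bytes is a common value, assumed here), and some filesystems, such as tmpfs, reject the flag outright, so the error paths matter.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write `len` bytes to `path`, bypassing the page cache.
 * Returns 0 on success, -1 on error. */
int direct_write(const char *path, const char *data, size_t len)
{
    const size_t align = 512;      /* assumed logical block size */
    if (len % align != 0)
        return -1;                 /* length must be aligned too */

    void *buf;
    if (posix_memalign(&buf, align, len) != 0)
        return -1;                 /* O_DIRECT needs an aligned buffer */
    memcpy(buf, data, len);

    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {                  /* e.g. tmpfs does not support O_DIRECT */
        free(buf);
        return -1;
    }
    ssize_t n = write(fd, buf, len);
    close(fd);
    free(buf);
    return n == (ssize_t)len ? 0 : -1;
}
```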

5 mmap

mmap is widely used: it can map files into the memory address space for reading and writing, it can implement shared memory, and malloc itself uses mmap for its allocations. This section first discusses how reading and writing files with mmap is implemented.

Each process manages its virtual memory by dividing it into regions: when virtual memory is allocated, different virtual memory areas are carved out for different purposes. These areas receive no physical memory when they are created; only the management structures are allocated and set up. When the process touches memory in such an area and there is no physical memory behind it, the system raises a page fault. In the page fault handler, the system allocates physical memory according to the virtual memory management structure for that address, loads data into the physical page if necessary (as in the mmap case), and establishes the virtual-to-physical mapping, after which the process can continue accessing the virtual memory it just touched.

The implementation of mmap rests on the same principle. When mmap maps a file (or part of a file) into a process's address space, it does not load the file's data; it merely carves out a region of the process's virtual address space and marks that region as mapping the corresponding range of the file. With that, the mmap operation is complete.

When the process tries to read or write the file-mapped area and no corresponding physical page exists, a page fault occurs and the page fault handler runs. The handler resolves the fault with different strategies depending on the type of memory region. For a virtual memory area created by mmap on a file, it first finds the management data structure of the file in question, determines the file offset of the needed page, and loads the corresponding data from the file into the page cache; this part is no different from the read system call path. During loading, if the area's management structure has the VM_RAND_READ flag set, the system loads only the required page; if VM_SEQ_READ is set, the system performs the same readahead process as the read system call would. Once the page the application needs is in the page cache, the system adjusts the page table to map the physical page into the application's address space. mmap makes no distinction between reads and writes when handling page faults: a fault triggered by either goes through the process above.

The VM_RAND_READ and VM_SEQ_READ flags of a virtual memory area's management structure can be adjusted with the madvise system call.

Issues to note when reading and writing files with mmap: the system only enters the kernel, via a page fault, when the physical page behind the accessed mapping does not exist. If the physical page exists, the application operates on the memory directly in user mode and never enters kernel mode. Note that in the read and write system calls, the pages touched are marked with the mark_page_accessed function; mark_page_accessed records a physical page's activity, and active pages are less likely to be reclaimed. Reading a file through mmap, however, enters the kernel only on page faults and otherwise cannot mark pages as active, so those pages are reclaimed more easily (the page fault handler calls mark_page_accessed only for newly allocated pages). In addition, when writing to such a memory area without entering kernel mode, the written physical page cannot be marked dirty in the kernel (only the dirty bit of the page table entry is set by the hardware); this problem is described in detail in the msync discussion below.

6 Pread/pwrite,readv/writev

These system calls are implemented in the kernel with little difference from read/write; only the parameters differ. read/write use the file's default offset, whereas pread/pwrite take the file offset as a parameter, which lets multithreaded code avoid locking around the shared read/write offset. readv/writev can read the contents of a file into multiple buffers, and write data from multiple buffers into a file, avoiding the overhead of multiple system calls.
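The two points above can be sketched in a few lines (the helper names are ours): pread reads at an explicit offset without moving the shared file position, and writev gathers several buffers into a single system call.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Read at an explicit offset; the fd's file position is untouched,
 * so several threads can read the same fd without locking. */
ssize_t read_at(int fd, void *buf, size_t len, off_t off)
{
    return pread(fd, buf, len, off);
}

/* Gather a header and a body into one write system call. */
ssize_t write_header_and_body(int fd, const char *hdr, const char *body)
{
    struct iovec iov[2] = {
        { .iov_base = (void *)hdr,  .iov_len = strlen(hdr)  },
        { .iov_base = (void *)body, .iov_len = strlen(body) },
    };
    return writev(fd, iov, 2);
}
```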

7 Sendfile

sendfile transfers the contents of a file, starting from a given position, into another file (possibly a socket). This saves copies of the data in memory: implementing the same transfer with read/write adds two extra data copies (kernel to user on the read, user back to kernel on the write). Its kernel implementation otherwise differs little from read/write.
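A sketch of a whole-file copy through sendfile (the function name is ours; copying file-to-file requires Linux 2.6.33 or later, while sending to a socket has worked for longer): the data moves from the page cache to the destination without ever passing through a user-space buffer.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Copy all of in_fd to out_fd entirely inside the kernel.
 * Returns the number of bytes copied, or -1 on error. */
ssize_t kernel_copy(int in_fd, int out_fd)
{
    struct stat st;
    if (fstat(in_fd, &st) < 0)
        return -1;
    off_t off = 0;           /* explicit offset into in_fd */
    ssize_t total = 0;
    while (off < st.st_size) {
        ssize_t n = sendfile(out_fd, in_fd, &off, st.st_size - off);
        if (n <= 0)
            return -1;       /* short transfers loop; errors bail out */
        total += n;
    }
    return total;
}
```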

8 Fsync/fdatasync/msync

These three system calls all synchronize dirty pages in memory to the files on the block device, with some differences among them.

fsync writes a file's dirty pages in the page cache back to disk. A file's content in the page cache includes not only the file data but also its inode data: writing a file modifies the file data and also data in the inode (such as the file modification time), so there are really two sets of data to synchronize. fsync writes both kinds of dirty pages related to the specified file to disk. Besides forcing synchronization with fsync, the system also synchronizes periodically on its own, writing dirty pages back to disk.

fdatasync writes back only the file-data dirty pages to disk, not the dirty pages related to the file's inode.
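The distinction matters for durable writes. A minimal sketch (the helper name is ours): fdatasync is enough when only the data (and the file size) changed, while fsync also flushes pure metadata updates such as the modification time.

```c
#include <stdlib.h>
#include <unistd.h>

/* Append a record and make it durable before returning.
 * Returns 0 on success, -1 on error. */
int durable_append(int fd, const void *rec, size_t len)
{
    if (write(fd, rec, len) != (ssize_t)len)
        return -1;
    /* fdatasync flushes the data pages; use fsync(fd) instead if the
     * inode metadata must also reach the disk. */
    return fdatasync(fd);
}
```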

msync is different from fsync. When a file is mapped into memory with mmap and data is written through the mapped address, the kernel is never entered unless a page fault occurs, so the kernel cannot mark the written page's state as dirty; the CPU, however, automatically sets the dirty bit in the page table entry. If the page is not marked dirty in the kernel, other synchronization mechanisms, such as fsync and the kernel's sync threads, cannot synchronize that data. msync's main job is to walk the page tables of a memory area and mark as dirty every page whose page table entry has the dirty bit set. If msync is given the MS_SYNC flag, it then synchronizes the data just as fsync would; if MS_ASYNC is specified, the data is left to be synchronized by a kernel sync thread or by some other call.

At munmap time the system performs an msync-like operation on the mapped area, so data is not necessarily lost if msync was never called (a process's mapped areas are automatically munmapped when it exits). However, writing large amounts of data without calling msync carries a risk of data loss.
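A sketch of updating a file through a shared mapping and forcing the change out (the helper name is ours; it assumes the file is already at least off + len bytes long, and maps from offset 0 to keep the alignment handling trivial): writing through the mapping only sets the hardware dirty bit, and msync(MS_SYNC) is what makes the change durable.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Overwrite len bytes at offset off in an existing file through a
 * MAP_SHARED mapping, then flush. Returns 0 on success, -1 on error. */
int patch_file(const char *path, off_t off, const char *data, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    char *p = mmap(NULL, off + len, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);                        /* the mapping stays valid */
    if (p == MAP_FAILED)
        return -1;
    memcpy(p + off, data, len);       /* sets only the PTE dirty bit */
    int rc = msync(p, off + len, MS_SYNC);  /* write dirty pages now */
    munmap(p, off + len);
    return rc;
}
```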

9 Shmget/shmat

In fact, both the POSIX and the System V shared memory interfaces are implemented using mmap, and the principle is the same: a file (either a special file or a normal one) is mapped into the address spaces of different processes. From the mmap principles described above, a file has only one copy in the kernel's page cache, so when different processes map and operate on the same region of the same file, they are effectively sharing memory.

Given this principle, POSIX shared memory is easy to understand. System V shared memory looks less intuitive, but it simply uses a special file system, shmfs, for the memory mapping. shmfs implements one special behavior. When normal files are used for sharing and the system needs to reclaim physical pages, dirty pages are written back to disk before the pages are reclaimed; but if msync was never called, the system does not know a page is dirty and discards its contents on reclaim (believing another copy exists on disk), leading to inconsistent data. shmfs is implemented quite differently: all of its pages are always treated as dirty, and its write-back function does not write data to an ordinary file but instead allocates space in the swap partition (if there is one), writes the physical page's contents there, and marks it. shmfs thus avoids the risk that mmap otherwise carries, transparently to the user; it is designed specifically for memory sharing.

shmget asks the system for a shared memory area of a given size. It only marks out a shared memory area in the operating system; no physical memory is allocated at that point, only the management structures. It can be understood as creating a file in shmfs (or, if one already exists, as opening it). shmat then indirectly uses mmap to map the shmfs file opened (or created) by shmget into the application's address space; other processes do the same, just as with mmap on a normal file, except that shared memory via shmfs subtly avoids mmap's shortcoming.
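A self-contained sketch of the System V interface (the function name is ours): the segment is created, attached twice in the same process, and a write through one attachment is visible through the other, because both map the same shmfs pages.

```c
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Create a segment, attach it twice, and check that both attachments
 * see the same physical pages. Returns 0 on success, -1 on error. */
int shm_roundtrip(void)
{
    int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (id < 0)
        return -1;

    char *a = shmat(id, NULL, 0);
    char *b = shmat(id, NULL, 0);       /* second mapping, same segment */
    if (a == (char *)-1 || b == (char *)-1)
        return -1;

    strcpy(a, "shared");                /* write through one mapping... */
    int ok = strcmp(b, "shared") == 0;  /* ...read through the other */

    shmdt(a);
    shmdt(b);
    shmctl(id, IPC_RMID, NULL);         /* destroy the segment */
    return ok ? 0 : -1;
}
```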

10 malloc

malloc is just a library function and is implemented differently on different platforms; glibc uses the ptmalloc implementation. malloc allocates memory from the heap, but the kernel has no concept of a heap; the heap is purely an application-level notion. When a process is created, a heap area is carved out of its virtual address space, with no physical memory behind it. Allocating memory with malloc really just claims a smaller region within that virtual memory area for the application; only when the application touches that small region does a page fault occur and physical memory get assigned. free does not release physical memory either; it returns the small region allocated on the heap back to the heap. Both are implemented by glibc at the application level.

malloc uses two system calls, brk and mmap. brk grows (or shrinks) the heap: when a process is created, the heap's start address is fixed (heap space grows upward) and its size is 0; when malloc finds the heap has too little free space to satisfy an allocation, it calls brk to grow the heap, which really just moves the end address of the virtual memory area the heap lives in. When an allocation is larger than a threshold (for example 128 KB), malloc no longer allocates from the heap but instead uses mmap to map a fresh virtual address area, and calls munmap to release it at free time. This strategy mainly keeps heap management convenient and avoids managing the heap at a large scale; it rests on the assumption that large allocations are infrequent.

Note that when the allocated blocks are large, allocating and releasing through system calls reduces efficiency, so an application that frequently allocates blocks larger than the threshold will perform poorly. Such an application can adjust the threshold so that its allocations and releases are handled in user mode (on the heap). malloc's parameters can be tuned with mallopt: M_TRIM_THRESHOLD means that when the heap's free space exceeds this value, the heap is shrunk at an appropriate time, and M_MMAP_THRESHOLD means that allocation requests larger than this value are served with the mmap system call.
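A short sketch of such tuning (the function name and the 1 MiB figure are illustrative; mallopt and these constants are glibc-specific, from <malloc.h>): raising M_MMAP_THRESHOLD keeps allocations up to that size on the heap, avoiding an mmap/munmap system call pair per allocate/free.

```c
#include <malloc.h>
#include <stdlib.h>

/* Serve allocations up to 1 MiB from the heap instead of mmap, and
 * only trim the heap once 2 MiB of free space has accumulated.
 * Returns 0 on success, -1 on error. */
int tune_for_large_buffers(void)
{
    /* mallopt returns nonzero on success, 0 on error */
    if (mallopt(M_MMAP_THRESHOLD, 1024 * 1024) == 0)
        return -1;
    if (mallopt(M_TRIM_THRESHOLD, 2 * 1024 * 1024) == 0)
        return -1;
    return 0;
}
```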


Source: http://hemming.blog.hexun.com/15094457_d.html
