An analysis of several memory and I/O system calls, the malloc implementation in glibc, and shared memory principles

This article analyzes the implementation principles of memory- and I/O-related system calls and library functions, and points out the pitfalls to watch for during use and the main opportunities for optimization. The system calls and functions covered are readahead, pread/pwrite, read/write, mmap, readv/writev, sendfile, fsync/fdatasync/msync, shmget, and malloc.

The article first briefly introduces how applications use memory and the basic principles of file system I/O, which helps in understanding the implementations of the system calls and library functions above.

 

1. Memory management basics

Linux manages physical memory in pages, typically 4 KB in size. During Linux initialization, management data structures are allocated for all of physical memory so that every physical page can be tracked.

Each application has an independent address space. This address space is of course virtual: virtual addresses are translated to actual physical addresses through the application's page table. Although the system can translate any virtual address to a physical one, not every virtual page of an application corresponds to a physical page. Linux allocates physical memory to applications on demand, which is implemented through page faults. For example, if an application dynamically allocates 10 MB of memory, the allocation only records in the application's virtual memory management structures that this address range is now occupied; the kernel allocates no physical memory at that point. Only when the application actually reads or writes the region does the system discover that no physical page backs the accessed address. This raises a page fault, which prompts the kernel to allocate a physical page for the virtual page currently being accessed.
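As a minimal illustration of demand paging, the following sketch (Linux-specific: it reads the resident set size from /proc/self/statm, whose second field is the resident page count) allocates a buffer and shows that physical pages appear only once the memory is touched:

/* Sketch: observe demand paging via the process's resident set size
 * (RSS) before and after touching malloc'd memory. Linux-specific. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static long rss_pages(void)
{
    long size = 0, resident = -1;
    FILE *f = fopen("/proc/self/statm", "r");
    if (f) {
        if (fscanf(f, "%ld %ld", &size, &resident) != 2)
            resident = -1;
        fclose(f);
    }
    return resident;
}

int main(void)
{
    size_t len = 10 * 1024 * 1024;   /* 10 MB */
    char *buf = malloc(len);

    printf("RSS after malloc: %ld pages\n", rss_pages());
    memset(buf, 0, len);             /* each page is faulted in on first write */
    printf("RSS after touch:  %ld pages\n", rss_pages());

    free(buf);
    return 0;
}

The first printed value stays small because malloc only reserves virtual address space; the second grows by roughly 2,560 pages (10 MB / 4 KB) once every page has been written.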

A physical page that has been allocated to an application may in some cases be reclaimed by the system for other purposes. For example, the contents of dynamically allocated memory may be moved out to a swap partition, and the physical page temporarily reclaimed. When the application touches that memory again, the system allocates a physical page, swaps the contents back in from the swap partition, and re-establishes the page table mapping. The allocation and reclamation processes differ for different types of virtual memory pages; the details are described below when the relevant system calls are analyzed.

 

2. File system I/O principles

The I/O subsystem of the operating system covers not only ordinary block devices but also character devices and network devices. This article discusses only ordinary block devices.

Applications can operate on files in two ways: ordinary read/write, and mmap. With either method, the application does not operate on the block device directly when reading or writing file contents (with a few special exceptions discussed later), but goes through the page cache at the operating system layer. That is, whichever way the application reads file data, the operating system first loads that data into kernel memory; the two methods differ only in how the application then operates on the in-memory data. Writing works the same way: the application writes data only to the in-memory pages belonging to the file, and the system writes those pages back to the block device at some later time.

The page cache deserves a fuller description. It can be understood as a buffer for all file data: in general, every file operation goes through the page cache middle layer (special cases are described below). The page cache is not merely a staging area, however. The kernel actively manages the data in it, and as long as memory permits, data that is no longer in use remains cached. All applications share one page cache, so different applications accessing the same file or the same data do not each need to go to the block device, which considerably accelerates I/O performance.

Different system calls access file data in different ways, while the data traffic between the page cache and the block device is much the same for all of them. The difference shows up mainly between the read/write and mmap modes, whose details are described below.

3. readahead

With the principles and function of the page cache described, readahead is easy to understand. When the read system call reads part of a file and the data is not in the page cache, it must be fetched from the block device. For a block device such as a disk, seeking is the most time-consuming operation, so reading a small piece of data costs almost as much as reading a large contiguous piece. If that large piece is instead read through many separate requests, the disk may have to seek many times, which takes far longer.

Readahead is built on this observation: when a piece of data is needed and the access pattern looks like a sequential read, the kernel reads more data than requested into the page cache. The next time the adjacent data is accessed, it is already in the page cache and no I/O operation is needed, which greatly improves the efficiency of data access.

In Linux, readahead operates in an automatic mode and a user-forced mode. Automatic readahead means that when a read system call has to transfer data from the block device, the system decides how much extra data to read based on the current state and starts the readahead. The readahead window is adjusted dynamically: it grows or shrinks depending on how often the readahead hits. The default readahead size can be configured, and different block devices can have different defaults; the blockdev command can view and set a block device's default readahead size.
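For illustration, here is a sketch of what blockdev --getra does under the hood, assuming the BLKRAGET ioctl from <linux/fs.h> (BLKRASET, not shown, sets the value and normally requires root; the device path is just an example):

/* Sketch: query a block device's default readahead window, the same
 * value `blockdev --getra` reports, in 512-byte sectors. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>        /* BLKRAGET / BLKRASET */

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/sda";  /* example device */
    long ra = 0;
    int fd = open(dev, O_RDONLY);

    if (fd < 0 || ioctl(fd, BLKRAGET, &ra) < 0) {
        perror("BLKRAGET");
        return 1;
    }
    printf("%s readahead: %ld sectors (%ld KB)\n", dev, ra, ra * 512 / 1024);
    close(fd);
    return 0;
}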

Because this automatic readahead mechanism runs before every I/O operation, the default readahead size has a noticeable impact on performance. For workloads dominated by random reads, the readahead value should be reduced, though not made as small as possible: a good approach is to estimate the average amount of data the application reads per request and set the readahead value slightly above that average. For workloads dominated by sequential reads, the readahead value can be larger (for RAID, the stripe size and the number of stripes should also be considered when choosing the value).

Note that in automatic readahead mode, if the file itself is stored in many small fragments, even sequential reads with a large readahead value will not be very efficient: when the data read in one request is not contiguous on disk, seeks are still unavoidable, so readahead helps little.

Linux also provides a readahead system call to force readahead on a file. It loads the file's data from the block device into the page cache ahead of time, which can speed up subsequent access to the file data; whether to use forced readahead depends on the application's needs.
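A minimal sketch of forced readahead, assuming a Linux system (readahead(2) is Linux-specific and needs _GNU_SOURCE); the file name data.bin is just a placeholder:

/* Sketch: populate the page cache for a whole file with readahead(2),
 * so that subsequent reads are served from memory. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void)
{
    const char *path = "data.bin";   /* placeholder file name */
    struct stat st;
    int fd = open(path, O_RDONLY);

    if (fd < 0 || fstat(fd, &st) < 0) {
        perror(path);
        return 1;
    }
    /* Ask the kernel to load the whole file into the page cache. */
    if (readahead(fd, 0, st.st_size) < 0)
        perror("readahead");

    /* ... subsequent read()/pread() calls should hit the page cache ... */
    close(fd);
    return 0;
}

On systems without readahead(2), posix_fadvise with POSIX_FADV_WILLNEED gives a portable hint with a similar effect.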

 

4. read/write

read/write is the basic I/O read/write path. Apart from mmap, the other I/O read/write system calls follow essentially the same principles and call paths as read/write.

Read process: the requested range is converted to the corresponding pages, and for each page to be read the following steps run. First, page_cache_readahead is called (if readahead is enabled), which executes the readahead policy according to the current readahead state (both the readahead state structure and the policy are adjusted dynamically based on hit conditions and the read pattern); the readahead step may or may not issue I/O. After readahead completes, the kernel checks whether the required page is already in the page cache. If not, the readahead missed: handle_ra_miss is called to adjust the readahead policy, and an I/O operation reads the page into memory and adds it to the page cache. Once the page is in the page cache (or if it was there all along), mark_page_accessed is called on the page, and its data is copied into the application's address space.

Write process: as in the read process, the range to be written is converted to the corresponding pages, the data is copied from the application's address space into those pages, the pages are marked dirty, and mark_page_accessed is called. If synchronous writing was not requested, the write call then returns. If O_SYNC was specified when the file was opened, the system writes all dirty pages touched by this write back to the block device before returning, and this process blocks. Dirty page synchronization is described in detail in the analysis of fsync/fdatasync/msync.

Special case: if an application opens a file with O_DIRECT, the operating system bypasses the page cache completely when reading and writing it. On a read, data is transferred from the block device directly into the buffer the application supplies; on a write, data goes directly from the application's buffer to the block device. Because no page cache layer is involved, writes in this mode are always synchronous.
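A sketch of O_DIRECT usage follows. The 4096-byte alignment is an assumption (the real requirement depends on the device's logical block size), and the file name is a placeholder:

/* Sketch: write with O_DIRECT, bypassing the page cache. O_DIRECT
 * requires the user buffer, offset, and length to be aligned. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t align = 4096;       /* assumed alignment */
    const size_t len = 16 * align;
    void *buf;
    int fd;

    if (posix_memalign(&buf, align, len) != 0)
        return 1;
    memset(buf, 'x', len);

    fd = open("direct.out", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open(O_DIRECT)");
        return 1;
    }
    /* Data moves straight from buf to the block device; the write is
     * effectively synchronous because no dirty page is left behind. */
    if (write(fd, buf, len) < 0)
        perror("write");

    close(fd);
    free(buf);
    return 0;
}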

 

5. mmap

mmap has a wide range of uses. It maps files into the memory address space so they can be read and written as memory, it is the basis of shared memory, and malloc also uses mmap for memory allocation. This section describes how mmap is used to read and write files.

Each process manages its virtual memory by region: when virtual memory is allocated, different regions are carved out for different purposes. At first these regions have no physical memory behind them; only the management structures are allocated and initialized. When the process touches memory in a region that has no backing physical page, a page fault occurs. In the page fault handler, the system allocates physical memory according to the virtual memory management structure for that region, loads data into it when necessary (as with mmap), establishes the mapping between the virtual and physical pages, and then lets the process continue accessing the virtual memory.

The implementation of mmap rests on the same principle. When mmap maps a file (or part of one) into a process's address space, no file data is loaded; the call merely carves out a region of the process's virtual address space and marks it as mapping the corresponding region of the file. With that, the mmap operation is complete.

When the process tries to read or write within the mapped region and the corresponding physical page does not exist, a page fault occurs and the page fault handler runs. The handler resolves the fault with a strategy that depends on the region's memory type. For a virtual memory region that maps a file via mmap, it first finds the file's management data structures and determines the file offset of the required page, then loads the corresponding data from the file into the page cache if necessary. Unlike in the read system call path, if the region's management structure has the VM_RAND_READ flag set, only the required page is loaded during this step; if VM_SEQ_READ is set, a readahead process identical to the read system call's is run. Once the page the application needs is in the page cache, the handler updates the page table to map the physical page into the application's address space. mmap makes no distinction between read faults and write faults; the above process runs for both.
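To make the above concrete, here is a minimal sketch of mapping a file and modifying it through memory; data.bin is a placeholder name, and the file is assumed to be non-empty:

/* Sketch: map a file and write it through memory. The first access to
 * each page triggers a page fault that pulls the data into the page
 * cache; later accesses run entirely in user mode. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "data.bin";   /* placeholder file name */
    struct stat st;
    char *p;
    int fd = open(path, O_RDWR);

    if (fd < 0 || fstat(fd, &st) < 0) {
        perror(path);
        return 1;
    }
    p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    p[0] = '#';   /* page fault on first touch, then a plain memory access */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}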

The VM_RAND_READ and VM_SEQ_READ flags of a virtual memory region's management structure can be adjusted with the madvise system call.
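A short sketch of that adjustment; addr and length are assumed to come from an earlier mmap call:

/* Sketch: tell the kernel how a mapping will be accessed.
 * MADV_SEQUENTIAL enables read()-style readahead on faults;
 * MADV_RANDOM disables readahead so only the faulting page is loaded. */
#include <stddef.h>
#include <sys/mman.h>

void advise(void *addr, size_t length, int sequential)
{
    madvise(addr, length, sequential ? MADV_SEQUENTIAL : MADV_RANDOM);
}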

There are caveats when reading and writing files through mmap. When the physical page behind a mapped address does not exist, a page fault takes the process into kernel mode; when the page does exist, the application operates on the memory directly in user mode and never enters the kernel. In the read/write system calls, the kernel calls mark_page_accessed on every page involved; this marks the page as active, and active pages are less likely to be reclaimed. By contrast, reading through an mmap mapping that causes no page fault never enters the kernel and so cannot mark the page active, which makes such pages easier for the system to reclaim (mark_page_accessed is called only for newly allocated pages during page fault handling). Likewise, without entering kernel mode, a write through the mapping cannot mark the page dirty in the kernel's structures (only the dirty bit of the page table entry is set by the CPU); this issue is detailed in the description of msync below.

 

6. pread/pwrite and readv/writev

The kernel implementations of these system calls differ little from read/write; only the parameters differ. read/write use the file's current offset, whereas pread/pwrite take the file offset as a parameter, which avoids having to lock the shared offset in multi-threaded code. readv/writev can read file contents into multiple buffers, or write data to a file from multiple buffers, avoiding the overhead of multiple system calls.
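A small sketch showing both calls; data.bin is a placeholder, and the offsets are arbitrary:

/* Sketch: pread carries its own offset, so threads need no lock around
 * lseek()+read(); writev gathers several buffers into one system call. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    char head[16], tail[16];
    int fd = open("data.bin", O_RDWR);   /* placeholder file name */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Two threads could issue these concurrently: each call carries its
     * own offset, so the shared file offset is never touched. */
    pread(fd, head, sizeof(head), 0);
    pread(fd, tail, sizeof(tail), 4096);

    /* Gather two separate buffers into one write. */
    struct iovec iov[2] = {
        { .iov_base = head, .iov_len = sizeof(head) },
        { .iov_base = tail, .iov_len = sizeof(tail) },
    };
    writev(fd, iov, 2);

    close(fd);
    return 0;
}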

 

7. sendfile

sendfile transfers the contents of one file, starting at a given position, to another file descriptor (which may be a socket). The operation reduces the number of data copies in memory: doing the same with read/write would add two copies, from the page cache into the application's buffer and back again. Inside the kernel, the implementation is otherwise similar to read/write.
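A sketch of a typical use, assuming out_fd is an already-connected socket obtained elsewhere; the loop accounts for sendfile transferring fewer bytes than requested:

/* Sketch: copy a file to a socket (or another file) with sendfile, so
 * the data never crosses into user space. */
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

int send_file(int out_fd, const char *path)
{
    struct stat st;
    off_t offset = 0;
    int in_fd = open(path, O_RDONLY);

    if (in_fd < 0 || fstat(in_fd, &st) < 0)
        return -1;

    while (offset < st.st_size) {
        ssize_t n = sendfile(out_fd, in_fd, &offset, st.st_size - offset);
        if (n <= 0) {                /* error or nothing sent */
            close(in_fd);
            return -1;
        }
    }
    close(in_fd);
    return 0;
}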

 

8. fsync/fdatasync/msync

All three of these system calls synchronize dirty pages in memory back to the file on the block device, but they differ in a few ways.

fsync writes the dirty pages of a file in the page cache back to disk. A file's content in the page cache includes both the file data and the inode data: writing a file modifies not only its data but also its inode (for example, the file modification time), so both parts must be synchronized. fsync writes both kinds of dirty pages for the specified file back to disk. Besides forcing synchronization with fsync, the system also synchronizes periodically, writing dirty pages back to disk on its own.

fdatasync writes only the file's data dirty pages back to disk, not the file's inode-related dirty pages.

msync is different from fsync. When a file is mapped into memory with mmap and data is written through the mapping without causing page faults, no kernel code runs, so the kernel's page structures cannot be marked dirty; the CPU only sets the dirty bit in the page table entry. If a page is never marked dirty in the kernel's structures, no other synchronization mechanism, neither fsync nor the kernel's synchronization threads, can write that data back. msync's main job is therefore to walk the page tables of the memory region and mark as dirty every page whose page table entry has the dirty bit set. If msync is called with MS_SYNC, it then synchronizes the data just like fsync; if MS_ASYNC is specified, the data is left to be synchronized by the kernel's synchronization threads or by other calls.
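The following sketch contrasts the two paths; data.bin is a placeholder and is assumed to be at least one page long:

/* Sketch: the two synchronization paths described above. fdatasync
 * flushes data written via write(); msync(MS_SYNC) walks the mapped
 * region's page tables, marks CPU-dirtied pages dirty, and writes
 * them back. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDWR);   /* placeholder, >= 4096 bytes */
    if (fd < 0)
        return 1;

    /* Path 1: write() marks the pages dirty in the kernel; fdatasync
     * flushes the data (but not inode metadata such as mtime). */
    write(fd, "hello", 5);
    fdatasync(fd);

    /* Path 2: stores through a mapping only set the page-table dirty
     * bit; msync is what turns that into a real write-back. */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p != MAP_FAILED) {
        memcpy(p, "world", 5);
        msync(p, 4096, MS_SYNC);
        munmap(p, 4096);
    }

    close(fd);
    return 0;
}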

During munmap, the system performs an msync-like operation on the mapped region, so data is not lost even if msync is never called (munmap is invoked automatically when the process exits). However, if large amounts of data are written without calling msync, data written since the last synchronization may be lost if the system crashes.

 

9. shmget/shmat

In fact, both the POSIX and System V shared memory interfaces are implemented with mmap, and the principle is the same: map one file (a special file or an ordinary one) into the address spaces of different processes. As the mmap discussion above showed, a file has only one copy in the kernel's page cache, so different processes operating on mappings of the same file region are in fact sharing memory.

Given that principle, the shared memory of the POSIX interface is easy to understand. The shared memory of the System V interface looks less intuitive but is really the same; it just performs the memory mapping through a special file system, shmfs. shmfs implements one special behavior. When an ordinary file is used for sharing and the system needs to reclaim a physical page, the dirty page is written back to disk before being reclaimed; but if msync was never called, the system does not know the page is dirty, and on reclaim it simply discards the page's contents (believing a copy still exists on disk), leading to data inconsistency. The implementation of shmfs is quite different: all of its pages are always treated as dirty, and its write-back does not write to an ordinary file; instead it allocates space in the swap partition (if there is one), writes the page contents there, and marks the page accordingly. shmfs thus avoids the risk that plain mmap usage carries, transparently to the user; it was designed for memory sharing.

shmget requests a shared memory region of a given size. It only registers the region with the operating system; no physical memory is allocated at that point, only the management structures. It can be understood as creating a file in shmfs (or, if one already exists, opening it). shmat then uses mmap internally to map the shmfs file that shmget opened (or created) into the application's address space. Everything else proceeds as with an ordinary mmap'd file, except that shared memory neatly sidesteps mmap's drawbacks by way of shmfs.
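A minimal System V sketch; the key value is arbitrary and would in practice be agreed on by the cooperating processes (often via ftok):

/* Sketch: shmget creates (or opens) a region backed by shmfs, and
 * shmat maps it into this process's address space. A second process
 * calling shmget with the same key sees the same pages. */
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    key_t key = 0x1234;              /* example key shared by both processes */
    int shmid = shmget(key, 4096, IPC_CREAT | 0600);
    if (shmid < 0) {
        perror("shmget");
        return 1;
    }

    char *p = shmat(shmid, NULL, 0); /* an mmap of the shmfs file, under the hood */
    if (p == (char *)-1) {
        perror("shmat");
        return 1;
    }

    strcpy(p, "shared");             /* visible to every attached process */
    shmdt(p);
    return 0;
}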

 

10. malloc

malloc is only a library function, with different implementations on different platforms; glibc uses the ptmalloc implementation. malloc allocates memory from the heap, but the kernel has no concept of a heap: the heap exists only at the application level. When a process is created, a region of its virtual address space is set aside as the heap, with no physical memory behind it. Allocating memory with malloc really just hands the application a smaller piece of this virtual region; physical memory is obtained only when the application touches that piece and page faults occur. Likewise, free does not release physical memory; it only returns the piece to the heap. These operations are implemented by glibc at the application level.

malloc relies on two system calls, brk and mmap. brk grows or shrinks the heap: when a process is created, the heap's start address is fixed (the heap grows upward) and its size is 0; when malloc finds that the heap's remaining space is insufficient for an allocation, it calls brk to enlarge the heap, which really just moves the end address of the heap's virtual memory region. When the requested size exceeds a threshold (for example, 128 KB), malloc does not allocate from the heap at all but uses mmap to map a fresh region of virtual address space, and free releases that region with munmap. This policy mainly simplifies heap management by keeping very large blocks out of the heap, on the assumption that large allocations are infrequent.

Note that if allocations are too large, every allocation and release goes through a system call and efficiency drops. An application that frequently allocates blocks above the mmap threshold will therefore run slowly; such an application can raise the threshold so that allocation and release complete in user space (within the heap). The malloc parameters can be adjusted with mallopt: M_TRIM_THRESHOLD means that when the free space at the top of the heap exceeds this value, the heap should be shrunk appropriately, and M_MMAP_THRESHOLD means that allocation requests larger than this value must go through the mmap system call.
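A sketch of adjusting both thresholds with mallopt; the values are illustrative, not recommendations:

/* Sketch: raise glibc's mmap threshold so allocations up to 1 MB are
 * served from the heap (user space only) instead of per-call
 * mmap/munmap. */
#include <malloc.h>
#include <stdlib.h>

int main(void)
{
    /* Requests larger than this go through mmap/munmap. */
    mallopt(M_MMAP_THRESHOLD, 1024 * 1024);
    /* Shrink the heap (via brk) only when more than this much is free
     * at its top. */
    mallopt(M_TRIM_THRESHOLD, 2 * 1024 * 1024);

    void *p = malloc(512 * 1024);    /* now heap-allocated, no mmap */
    free(p);                         /* returned to the heap, no munmap */
    return 0;
}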
