The previous article in this series analyzed the overall architecture of Lightning MDB (LMDB) and its main data structures. This article describes the principle of memory mapping (mmap) and how LMDB uses it.
1. The memory-mapping principle
A memory-mapped file is similar to virtual memory: an area of the process's address space is reserved and physical memory is committed to it. The difference is that the physical memory behind a memory-mapped file comes from a file that already exists on disk rather than from the system's page file, and the file must be mapped before it can be manipulated, as if the whole file had been loaded from disk into memory. Consequently, an application that processes a file on disk through a memory mapping performs no explicit I/O on it: it no longer needs to request and allocate a cache for the file, because all file caching is managed directly by the operating system. Eliminating the steps of loading file data into memory, writing data back from memory to the file, and freeing the memory blocks makes memory-mapped files especially effective when processing large volumes of data. In addition, real-world systems often need to share data between multiple processes. When the amount of data is small there are many flexible approaches, but when the shared data is huge, memory-mapped files are needed; in fact, a memory-mapped file is the most efficient way to share data between multiple processes on the same machine.
By common measurements, mmap is roughly two to four times as fast as ordinary file I/O. The main reason is that it avoids the explicit I/O calls, buffer allocation, data copying, and user-to-kernel-space transitions of the read/write path.
2. How Windows and Linux implement it
Windows implements memory-mapped files through the CreateFileMapping() family of functions. The exposed API is shown in the figure below.
(figure: Windows file-mapping API, image unavailable)
Memory mapping is a memory-management facility and the basic way to share large amounts of data between processes on Windows.
The basic usage is as follows:
First, create or open a file kernel object with CreateFile(); it identifies the file on disk that will be used as the memory-mapped file. CreateFile() only tells the operating system the path of the file to be mapped; it specifies neither the location in physical memory nor the length of the mapping. To specify how much physical storage a file-mapping object needs, a file-mapping kernel object is then created with CreateFileMapping(), which tells the system the file size and how the file will be accessed. After the file-mapping object is created, an address-space region must still be reserved for the file data, and the file data committed to that region as its physical storage. MapViewOfFile() maps all or part of the file-mapping object into the process address space, with the system managing the details. From this point on, using the memory-mapped file is essentially the same as working with file data loaded into memory in the ordinary way. When the work is done, a series of cleanup operations release the resources: this part is straightforward, with UnmapViewOfFile() removing the view of the file data from the process's address space and CloseHandle() closing the file-mapping object and file object created earlier.
Linux implements memory mapping through the mmap() family of functions. The basic flow is shown below.
(figure: mmap flow on Linux, image unavailable)
For comparison, ordinary file I/O works as follows:
(figure: ordinary file I/O flow, image unavailable)
Comparing the two flows shows that direct file I/O inevitably incurs multiple memory copies.
Given the principle above, memory mapping is a kernel-level memory-management mechanism; as long as physical memory is sufficient it causes no swapping or other additional disk I/O, so it is very efficient, which also gives it a place in the database domain. When the actual data file is smaller than the physical memory available to the process, a memory-mapped database system is much faster than a conventional one. When the data file is large and the application's page accesses are numerous and scattered, as in a full table scan, memory mapping triggers frequent page faults and then frequent swapping, turning one I/O into two and lowering efficiency. If the application's accesses are mostly index scans, this can be avoided, and efficiency remains good even when the data file is much larger than the available physical memory. At the same time, building a database system on the system's memory mapping greatly simplifies memory management, cache management, external-storage management, and so on, so it is a preferred implementation at a certain scale and for specific applications. These considerations are the main reason LMDB uses memory mapping.
3. How LMDB uses mmap
When LMDB creates an environment (the Env object), it first checks the file header to obtain the file size, then maps the file with the system mapping function during open. From then on it uses memory pointers directly, and the corresponding data is brought in by system-level page faults. Data within a page is obtained and used through mdb_cursor_get(); page fetches and key lookups are performed by mdb_page_get() and mdb_page_search().
To understand why, in the LMDB code, the mmap'ed address space and plain pointers are sufficient, one must first understand how LMDB organizes page data. The following uses a leaf page as the example; branch pages are similar.
A leaf page is laid out as follows:
+---------------------------------------------------------------------+
| page header:  pgno | pad | flags | overflows①                       |
+---------------------------------------------------------------------+
| nd_index1 | nd_index2 | nd_index3 | nd_index4 | ...                 |
+---------------------------------------------------------------------+
|                          (free space)                               |
+---------------------------------------------------------------------+
| node4 [ lo | hi② | flags | keysize | data (key) | data* (value) ]   |
| node3 [ lo | hi  | flags | keysize | data (key) | data* (value) ]   |
| node2 [ lo | hi  | flags | keysize | data (key) | data* (value) ]   |
| node1 [ lo | hi  | flags | keysize | data (key) | data* (value) ]   |
+---------------------------------------------------------------------+
① overflows is a union: for a normal page it holds the lower and upper bounds of the page's free space, while for an overflow page it holds the number of pages in the overflow chain. An overflow chain occupies consecutive pages and data points only at the first page, so the subsequent pages need no pgno of their own, and this causes no pgno errors elsewhere.
② The node's data size is assembled from lo and hi: lo supplies the low 16 bits and hi the high bits.
A node's key is variable-sized, its length given by keysize, and it is stored inline in data.
When a node's value is too large, the maximum node size being specified by the environment, its data field points to an overflow page instead.
The page header's size and layout are fixed, with flags determining how its contents are interpreted. Immediately after the header come the node indexes, the offsets of the real key-value nodes within the page, so the position of each node can be obtained by simple pointer arithmetic. Searching within a page is done by binary search. The index portion, nd_index, is kept sorted by key, i.e. key[index2] is greater than or equal to key[index1]; node insertion uses insertion sort, and the index area grows from the page header toward the middle of the page.
The node content portion is stored in insertion order, growing from the end of the page toward the middle, and remains unsorted. For example, if keys 1, 2, 3, 4 are inserted in the order 1, 4, 3, 2, the index part reads 1, 2, 3, 4 while the content part reads 1, 4, 3, 2.
Both the node contents and the indexes store the data itself (copied in with memcpy) rather than pointers, so the data remains usable whenever the file is mapped again with mmap after being written out. Another reason the memcpy is unavoidable is that the data is passed in from the application: storing it by reference without copying would cause a memory-access fault (segmentation fault) when it is accessed later.
Therefore the key question in LMDB is how pages are mapped into the process address space. LMDB fetches a page and returns a pointer to it with mdb_page_get(), whose main parameter is the page number. For a read-only transaction with the environment opened read-only, fetching a page is simple: page = map address + pgno * page size. The reason this works is, as the previous article mentioned, that LMDB's B+tree is derived from an append-only B+tree. As data is added, modified, or deleted, page numbers keep increasing, and when an old page (an old version of the data) is reused its page number stays the same, so page numbers preserve the order of the data file and fetching a page requires only this simple calculation. When the Env object is created, the database file is mapped into the process address space in its entirety, so the whole file occupies a contiguous address range; when a page obtained by the calculation above is first touched, the system triggers a page fault and reads the whole page in from the data file. This is the simplest and most efficient approach: if only part of the data were mapped, every access to an unmapped portion would first have to determine whether the page is mapped and map it if not.
One more note: for flushing dirty pages, LMDB takes a configurable approach, supporting both writes through the memory map and writes through the file; the default is to write through the file. In that mode the memory map is opened read-only, and writing goes through the file interface when needed. LMDB guarantees that only one write transaction is in progress at any time, which avoids data corruption under concurrency.
This article draws on several other bloggers' posts; thanks to them for their hard work.
"1" http://blog.csdn.net/hongchangfirst/article/details/11599369
"2" http://blog.csdn.net/hustfoxy/article/details/8710307
"3" http://blog.csdn.net/joejames/article/details/37958017
"4" Http://baike.baidu.com/link?url=8sD5zxtuTO2_wUwr5N4B6F-ZjnaedfnMjv3BOMQPatVfkO8E60Enq4_VayEwvdDuQOlLbyktGBe7S3Z9Zd5fjK
"5" Http://baike.baidu.com/link?url=8sD5zxtuTO2_wUwr5N4B6F-ZjnaedfnMjv3BOMQPatVfkO8E60Enq4_VayEwvdDuQOlLbyktGBe7S3Z9Zd5fjK