Lightning MDB Source Code Analysis (1)

Source: Internet
Author: User
Tags: mutex

Lightning MDB (LMDB) is a high-performance, memory-mapped key-value database. For a basic introduction and documentation, see the Symas official website. This article analyzes its source code structure in order to understand the key techniques behind its design.

This series analyzes LMDB from the following aspects, among others:

    1. System architecture (this article)
    2. Mmap mapping (Series 2)
    3. B+tree operations (Series 3)
    4. Transaction management (Series 4)
    5. MVCC control (Series 5)

LMDB was designed for the OpenLDAP project to address problems with its Berkeley DB (BDB) backend, such as multiple layers of caching, lock control, and space expansion.

It is simple to manage and simple to develop against. Management is simple because the database-level cache is removed from the design. Development is simple because the interface is modeled on the BDB interface, and the application does not need to worry about problems such as data corruption, space expansion, or deadlock.

The basic architecture of the system is:

[Figure: system architecture diagram]

Because the design uses mmap and copy-on-write (COW), the overall architecture is relatively simple: there are no separate components for cache management, log management, or external storage management.

Mmap file mapping is the foundation. By default LMDB maps the file read-only, which avoids the situation where an application bug corrupts the database. On top of the mapping, infrastructure such as the lock table, MVCC, and COW forms the basis of transaction control; with these, LMDB implements the A, C, and I properties of ACID, while D (durability) is implemented through mmap. Finally, the system exposes operations on the B+tree through cursors, which are used to insert, delete, update, and look up data.

Changes to the B+tree are handled much as in other implementations, but LMDB is built on an append-only B+tree. Later articles in this series explain LMDB's B+tree implementation in detail.
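To make the read-only mapping idea concrete, here is a minimal POSIX sketch. It is not LMDB code: `map_file_readonly` is a hypothetical helper, and the only point it illustrates is that a mapping created with `PROT_READ` cannot be modified through a stray pointer.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not an LMDB function): map a whole file read-only.
 * Returns the mapping and stores its length in *len_out, or returns NULL. */
const unsigned char *map_file_readonly(const char *path, size_t *len_out)
{
    assert(path != NULL && len_out != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    /* PROT_READ: a store through this pointer faults instead of
     * changing the file's contents. */
    void *addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (addr == MAP_FAILED)
        return NULL;

    *len_out = (size_t)st.st_size;
    return addr;
}
```

A caller reads the returned bytes directly and releases the mapping with `munmap` when done. Under this scheme writes have to go through a separate path, which is where copy-on-write comes in.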

The main system data structures are:

1. MDB_env

[Figure: MDB_env structure]

2. MDB_envinfo

[Figure: MDB_envinfo structure]

The first key structure is the Env object. An environment object represents one actual physical storage file. LMDB stores data directly in the B+tree: keys and values are both kept in the B+tree's pages, and particularly large values are handled through overflow pages, unlike storage schemes that separate indexes from data. LMDB also supports storing multiple B+trees in the same physical file, and one B+tree can be stored as a leaf node of another B+tree; this is the sub-database concept. It is useful for certain applications, such as managing and querying simple hierarchical data structures.

Important data members in the Env object are:

me_dirty_list: The dirty page list, that is, all pages that have been modified by the write transaction but not yet committed to the physical file.

me_free_pgs: The free page list, used to limit the file growth that MVCC would otherwise cause. A free page is one that has been superseded by a modified copy and is no longer used by any transaction; under MVCC it is an old version of the page. Databases that need access to historical data, for example to restore to an arbitrary point in time, must keep every old version. A system such as LMDB, which only needs to keep the latest consistent data, can reuse these pages, and reuse effectively keeps the physical file from growing without bound. me_free_pgs is the list of reusable pages produced by the current write transaction.

me_metas: The meta page array. LMDB uses two meta pages, so its size is 2. One of the main jobs of a meta page is to hold the B+tree root page pointer. Because LMDB uses COW internally, the root page pointer may change, so two pages are used alternately to record the latest root, similar to a double-buffer design. As a result, although LMDB supports multiple B+trees in one file, their number is limited by the meta page.

me_rmutex, me_wmutex: The lock table mutexes. LMDB supports multiple threads and multiple processes. Synchronization between processes is achieved through system-level mutual exclusion: the mutexes live in shared memory rather than in the memory of a single process. Before reading or writing pages, a transaction first consults the lock table to see whether other processes or threads are working on the corresponding resources, and some operations must queue according to the transaction rules. Lock usage is explained in detail later.

me_txn, me_txns: The transactions currently in use in the environment. An Env object belongs to one process, and a process may have several threads sharing the same env, each of which can open a transaction. The process-level Env object therefore maintains the transaction list to track how many threads and transactions are currently active.

me_flags: Flags that control much of the database's behavior. They must be set before each use of the Env, and the application should use them consistently; otherwise the database may fail in unpredictable ways.

me_dbxs: The database objects.

me_userctx: User context data, used mainly to assist with key comparison.

The members of the Envinfo object are straightforward and are not described in detail here.
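Returning to the free page list in the Env object, the reuse policy can be sketched as a toy allocator. This is an illustration under simplifying assumptions, not LMDB's allocator: `struct page_alloc`, `page_release`, and `page_acquire` are hypothetical names, and the sketch ignores the check that no reader still sees the old page version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FREE_MAX 64 /* arbitrary capacity for this sketch */

typedef uint64_t pgno_t;

/* Toy allocator state; the field names are illustrative, not LMDB's. */
struct page_alloc {
    pgno_t next_pgno;          /* first page number past the end of the file */
    pgno_t free_pgs[FREE_MAX]; /* old page versions that may be reused */
    size_t nfree;              /* number of entries in free_pgs */
};

/* Record a page whose old version is no longer referenced. */
void page_release(struct page_alloc *a, pgno_t pg)
{
    assert(a->nfree < FREE_MAX);
    a->free_pgs[a->nfree++] = pg;
}

/* Prefer a reusable page; grow the file only when none is available. */
pgno_t page_acquire(struct page_alloc *a)
{
    if (a->nfree > 0)
        return a->free_pgs[--a->nfree];
    return a->next_pgno++;
}
```

With this policy the file only grows when the free list is empty, which is the effect the description of me_free_pgs attributes to page reuse.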

3. MDB_meta

[Figure: MDB_meta structure]

Which meta page is updated is determined by the transaction ID. Read transactions never update a meta page; only write transactions may.

The two meta pages are used in rotation: the transaction with ID 1 modifies page 1, the transaction with ID 2 modifies page 0, and so on.

Its most important data members are:

mm_dbs: The B+tree roots stored in the meta page. There are two: index 0 is the free-page database, which tracks reusable pages, and index 1 is the main database currently in use.

mm_version: Version of the data file format, which must be set to MDB_DATA_VERSION.
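The alternation between the two meta pages can be sketched as follows. This is a simplified model, assuming consecutive transaction IDs; `struct meta` and the three functions are hypothetical names that keep only the transaction ID and root pointer.

```c
#include <assert.h>
#include <stdint.h>

/* Reduced stand-in for a meta page: just the fields this sketch needs. */
struct meta {
    uint64_t txnid; /* ID of the transaction that last wrote this meta page */
    uint64_t root;  /* page number of the B+tree root as of that commit */
};

/* A committing write transaction uses the slot chosen by its ID parity. */
unsigned meta_slot_for_txn(uint64_t txnid)
{
    return (unsigned)(txnid & 1);
}

/* A new transaction starts from the meta page with the larger ID. */
const struct meta *meta_pick_latest(const struct meta metas[2])
{
    return metas[0].txnid < metas[1].txnid ? &metas[1] : &metas[0];
}

/* Commit: publish the new root in the slot that is not the latest one. */
void meta_commit(struct meta metas[2], uint64_t txnid, uint64_t new_root)
{
    struct meta *m = &metas[meta_slot_for_txn(txnid)];
    assert(m != meta_pick_latest(metas)); /* never overwrite the live meta */
    m->root = new_root;
    m->txnid = txnid;
}
```

Because a commit only overwrites the older slot, the previous meta page stays intact until the new one is fully written, which is the double-buffer behavior described for me_metas.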

4. MDB_page

[Figure: MDB_page structure]

MDB_page describes the header shared by the different page types. Whether a page is the root of a tree, a branch page, or a leaf page, it is described by this header.

For overflow pages, only the first page carries a header. The consecutive pages that follow have none and are associated with the first page only through pointers.

Important members:

mp_flags: The type of the page.

mp_pb: Either the number of overflow pages or the available space on the current page.
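The dual role of mp_pb can be pictured with a simplified header. The layout below is a sketch modeled on the description above rather than a copy of the LMDB definition: the type name, the flag names and values, and the `lower`/`upper` bounds are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Page-type bits; the names and values here are illustrative. */
enum { PG_BRANCH = 0x01, PG_LEAF = 0x02, PG_OVERFLOW = 0x04 };

/* Simplified page header: one union serves two purposes. */
struct page_hdr {
    uint64_t pgno;  /* page number */
    uint16_t flags; /* page type bits */
    union {
        struct {
            uint16_t lower; /* end of the slot array growing from the header */
            uint16_t upper; /* start of node data growing from the page end */
        } bounds;
        uint32_t pages;     /* overflow page: number of contiguous pages */
    } pb;
};

/* Free bytes between the slot array and the node data on a tree page. */
unsigned page_free_space(const struct page_hdr *p)
{
    assert(!(p->flags & PG_OVERFLOW)); /* overflow pages use pb.pages */
    assert(p->pb.bounds.upper >= p->pb.bounds.lower);
    return (unsigned)(p->pb.bounds.upper - p->pb.bounds.lower);
}

/* Number of contiguous pages occupied by an overflow value. */
unsigned page_overflow_count(const struct page_hdr *p)
{
    assert(p->flags & PG_OVERFLOW);
    return p->pb.pages;
}
```

The page type stored in the flags decides which interpretation of the union is valid, so the two accessors check the flags before reading it.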

5. MDB_node

[Figure: MDB_node structure]

MDB_node represents a key/value pair; it describes a data item in a branch or leaf page.

Key members include:

mn_flags: Flags indicating whether the node holds duplicates, a sub-database, overflow data, and so on.

mn_hi, mn_lo: Data size or page number.

mn_data: Data pointer.
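One way to read the mn_hi and mn_lo pair is as the two 16-bit halves of a 32-bit data size. The sketch below shows that packing with a simplified node header; the struct name, the helper functions, and the `mn_ksize` field are assumptions for this example, and the page-number interpretation for branch nodes is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified node header; field names follow the members listed above. */
struct node_hdr {
    uint16_t mn_lo;    /* low 16 bits of the data size */
    uint16_t mn_hi;    /* high 16 bits of the data size */
    uint16_t mn_flags; /* duplicate / sub-database / overflow markers */
    uint16_t mn_ksize; /* key length (assumed field for this sketch) */
};

/* Split a 32-bit size into the two 16-bit fields. */
void node_set_data_size(struct node_hdr *n, uint32_t size)
{
    assert(n != NULL);
    n->mn_lo = (uint16_t)(size & 0xffff);
    n->mn_hi = (uint16_t)(size >> 16);
}

/* Recombine the two halves into the original size. */
uint32_t node_data_size(const struct node_hdr *n)
{
    assert(n != NULL);
    return (uint32_t)n->mn_lo | ((uint32_t)n->mn_hi << 16);
}
```

Splitting the size this way keeps every header field 16 bits wide, so the header needs no padding and stays 2-byte aligned inside a page.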

6. MDB_db

[Figure: MDB_db structure]

MDB_db describes a single B+tree: it holds some related information about the tree together with the page number of its root node.

7. MDB_txn

[Figure: MDB_txn structure]

MDB_txn describes a transaction. LMDB supports nested transactions and full ACID properties. However, only the serializable isolation level is supported, and only one write transaction at a time is allowed on the database belonging to a given env.

Transactions are used in much the same way as in other databases, and the begin and end of nested transactions must be properly paired.

Its key members include:

mt_child, mt_parent: The parent-child relationship between nested transactions.

mt_next_pgno: The next unallocated page number.

mt_cursors: In a write transaction, the cursors that are currently open on each database.

8. MDB_cursor

[Figure: MDB_cursor structure]

Cursor objects perform all database operations; both reads and writes are cursor-based. For a read or write, the page position is first located according to the search conditions to obtain a cursor, and the application then operates on the database through that cursor object.

Its key members are:

mc_next: Links the cursors opened on the same database within the same transaction; it points to the next cursor in the list.

mc_top: Index of the top page in the page stack.

mc_xcursor: Used for B+trees that allow duplicate keys.

mc_pg: The stack of pages the cursor has open, up to 32 entries. Its exact role remains to be explored.

mc_ki: The index within each of the open pages.
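One plausible reading of mc_pg and mc_ki, offered here as an assumption rather than a statement about the LMDB source, is that they record the path from the root down to the current position: one page and one slot index per tree level. The toy cursor below models that idea; `struct cursor`, its `snum` counter, and the push/pop helpers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_MAX 32 /* the depth limit mentioned above */

/* Toy cursor: pages are represented by page numbers, not pointers. */
struct cursor {
    uint64_t pg[STACK_MAX]; /* pages visited from the root downward */
    uint16_t ki[STACK_MAX]; /* slot index chosen within each of those pages */
    unsigned snum;          /* number of pages currently on the stack */
    unsigned top;           /* index of the top page (snum - 1 if non-empty) */
};

/* Descend one level: remember the page and the slot used in it. */
void cursor_push(struct cursor *c, uint64_t pgno, uint16_t slot)
{
    assert(c->snum < STACK_MAX);
    c->top = c->snum++;
    c->pg[c->top] = pgno;
    c->ki[c->top] = slot;
}

/* Go back up one level, for example to move on to a sibling page. */
void cursor_pop(struct cursor *c)
{
    assert(c->snum > 0);
    c->snum--;
    c->top = c->snum ? c->snum - 1 : 0;
}
```

If this reading is right, keeping the path in the cursor lets it step to a neighboring page by going up and down the stack, without pages needing parent or sibling pointers, which fits a copy-on-write tree where such pointers would be costly to maintain.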

This article briefly described the overall architecture of LMDB and its important data members. The next article in the series introduces how mmap works and how it is used in LMDB.
