LMDB Introduction: A B+tree Embedded Database Combined with MVCC

LMDB Introduction

LMDB is an embedded storage engine (embedded meaning it is linked as a library into the host program) developed by the OpenLDAP project. Its main features are:

    • Memory-mapped (mmap) file I/O
    • Key-value interface based on a B+tree
    • Transaction processing based on MVCC (Multi-Version Concurrency Control)
    • A BDB (Berkeley DB)-like API
Implementation of the underlying reads and writes

The basic idea of LMDB is to access storage through mmap, whether that storage is in memory or on a persistent device.

All read operations in LMDB access data directly through mmap: the file to be read is mapped read-only into the host process's address space, which avoids copies between the disk, the kernel address space, and the user address space, and simplifies the implementation of the index over a flat address space. Because the mapping is read-only, the host program cannot accidentally corrupt the storage structures with a stray write. I/O scheduling is left to the operating system's paging mechanism.

Write operations go through the write system call, mainly to take advantage of the operating system's file-system consistency guarantees and to avoid synchronization with the addresses being read.
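To make the read and write paths concrete, here is a minimal usage sketch of the LMDB C API (the database directory "./testdb" and the key/value pair are invented for the example, and error handling is abbreviated). The write transaction goes through LMDB's write path and becomes durable at commit; the value returned inside the read-only transaction points directly into the read-only memory map of the data file, with no extra copy.

    #include <lmdb.h>
    #include <stdio.h>

    int main(void) {
        MDB_env *env;
        MDB_txn *txn;
        MDB_dbi dbi;
        MDB_val key, val;

        mdb_env_create(&env);
        if (mdb_env_open(env, "./testdb", 0, 0664) != 0) {  /* directory must already exist */
            mdb_env_close(env);
            return 1;
        }

        /* Write transaction: insert one key/value pair; durable once committed. */
        mdb_txn_begin(env, NULL, 0, &txn);
        mdb_dbi_open(txn, NULL, 0, &dbi);
        key.mv_size = 3; key.mv_data = "foo";
        val.mv_size = 3; val.mv_data = "bar";
        mdb_put(txn, dbi, &key, &val, 0);
        mdb_txn_commit(txn);

        /* Read-only transaction: val.mv_data points into the mmap'd data file. */
        mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
        if (mdb_get(txn, dbi, &key, &val) == 0)
            printf("foo -> %.*s\n", (int)val.mv_size, (char *)val.mv_data);
        mdb_txn_abort(txn);

        mdb_env_close(env);
        return 0;
    }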

MVCC-based storage engine

As mentioned earlier, read operations in LMDB read the mmap'd memory directly. So if the content being read were modified, would that not produce inconsistent results? The answer is that in LMDB nothing is ever modified in place.

LMDB uses MVCC to handle concurrent read and write access. The key points are:

    • Each change corresponds to a version
    • The original version is not modified when a change is made
    • When a reader enters, it obtains a version and reads only that version's content

For a tree-shaped data structure, when a change occurs at one of its nodes, a new node is created and the change is written into that copy. Since the parent node also changes (it now points to the new node instead of the original one), the process repeats upward: every node on the path from the actually-changed node to the root must be recreated as a copy. When the change is complete, the new version is committed with a single atomic operation. Roughly, the process is as follows:

Each new version therefore produces a new root node. As described above, the store ends up preserving all historical versions, which of course include every version currently being read by a reader, so the change has no effect on readers and writes are not blocked by reads.
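A minimal sketch of this path-copying idea, using a plain binary search tree instead of LMDB's B+tree pages (the structure and names are invented for illustration): updating a key copies every node on the path from the root down to the changed node, the copies point to the new children, and all untouched subtrees are shared with the old version, whose root still describes a complete, unchanged snapshot.

    #include <stdlib.h>
    #include <string.h>

    typedef struct node {
        int key, value;
        struct node *left, *right;
    } node_t;

    static node_t *copy_node(const node_t *n) {
        node_t *c = malloc(sizeof *c);
        memcpy(c, n, sizeof *c);
        return c;
    }

    /* Returns the root of a new version; the old version reachable from `root`
       is left completely untouched. */
    static node_t *cow_put(const node_t *root, int key, int value) {
        if (root == NULL) {
            node_t *n = calloc(1, sizeof *n);
            n->key = key;
            n->value = value;
            return n;
        }
        node_t *c = copy_node(root);              /* copy the node on the path      */
        if (key < root->key)
            c->left = cow_put(root->left, key, value);
        else if (key > root->key)
            c->right = cow_put(root->right, key, value);
        else
            c->value = value;                     /* the node that actually changed */
        return c;                                 /* new parent points at new child */
    }

    int main(void) {
        node_t *v1 = cow_put(NULL, 10, 1);        /* version 1                      */
        node_t *v2 = cow_put(v1, 10, 2);          /* version 2; v1 is still intact  */
        return (v1->value == 1 && v2->value == 2) ? 0 : 1;
    }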

So far we have only discussed reads. The scheme above guarantees that each reader sees a consistent version (the one obtained on entry), but it does not promise that this is the newest version. Consider a transaction that changes one value based on another: when we want to commit the change, the version we obtained on entry may no longer be the latest, i.e. another commit may have happened between our entry and our commit. Committing the change in that situation can cause inconsistency; for example, a monotonically incrementing counter could "eat" several increments. To solve this, we only need to check at commit time whether we still hold the latest version, which can usually be done with a CAS (compare-and-swap) atomic operation; if the check fails, we re-enter the store and redo the whole transaction. In this way, reads are also not blocked by (possible) concurrent writes.
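A self-contained sketch of this "validate at commit, retry on conflict" idea, using the counter example from the text and C11 atomics (this illustrates the generic optimistic scheme described here, not LMDB's actual code; as noted below, LMDB instead serializes writers with a lock):

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static _Atomic long committed = 0;      /* the latest committed "version" */

    static void *writer(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            long seen, next;
            do {
                seen = atomic_load(&committed);   /* version we entered on    */
                next = seen + 1;                  /* change computed from it  */
                /* Commit succeeds only if nobody committed in the meantime;
                   otherwise re-enter and redo the whole "transaction".       */
            } while (!atomic_compare_exchange_weak(&committed, &seen, next));
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, writer, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        /* Without the commit-time check, concurrent increments could be "eaten". */
        printf("final counter: %ld (expected 400000)\n", atomic_load(&committed));
        return 0;
    }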

Given the above, do we really need to keep every historical version in the store? We keep old versions only because readers may still be reading them; new readers always read the latest version, so once a version has no readers (and no writer) on it, it no longer needs to exist. Old versions can be reclaimed based on this principle, and LMDB adds some refinements:

    • A basic fact is that new readers always read the latest version, so there is no need to keep the root nodes of every version: only the root node of the latest version is needed, plus one more root for committing new changes.
    • When a reader enters, it records a snapshot of the current root node (since the root is likely to change afterwards).
    • When a change is committed, it is already known which nodes this commit will sooner or later make useless: every node replaced by the change becomes garbage once the readers of the corresponding version have exited, and can then be reclaimed.
    • For this reason, at commit time the latest version can collect all the nodes that will need to be recycled, together with the version they belong to.
    • A table of reader slots is maintained; from it the smallest version still in use can be found, and nodes that became useless before that version can be recycled (a simplified sketch of this follows the list).
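The sketch below illustrates the reader-slot idea in simplified form (LMDB's real reader table lives in a lock file and uses transaction IDs; the layout and names here are invented for illustration): each reader publishes the version it entered on in its own slot, and the writer scans the slots to find the oldest version still in use before recycling anything older.

    #include <stddef.h>

    #define MAX_READERS 126              /* slot count chosen only for the example */

    typedef struct {
        unsigned long txnid;             /* version the reader entered on; 0 = slot free */
    } reader_slot_t;

    static reader_slot_t readers[MAX_READERS];

    /* Smallest version still referenced by any reader; `latest` is the newest
       committed version and is returned when no reader is active.  Nodes that
       became useless in versions older than the result are invisible to every
       reader and can safely be reused. */
    static unsigned long oldest_version_in_use(unsigned long latest) {
        unsigned long oldest = latest;
        for (size_t i = 0; i < MAX_READERS; i++) {
            unsigned long v = readers[i].txnid;
            if (v != 0 && v < oldest)
                oldest = v;
        }
        return oldest;
    }

    int main(void) {
        readers[0].txnid = 7;                       /* one reader entered on version 7 */
        return (int)oldest_version_in_use(10);      /* yields 7 */
    }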

As mentioned above, there are now only two root slots, and every change ultimately has to modify the root, so all writes are in effect serialized. This does not degrade performance, for the following reason: as discussed above, when two changes run concurrently, each enters on the same version, makes its modifications based on that version, and then tries to commit; one of the two must redo its work, because another commit occurred between its entry and its commit. This conclusion generalizes to any number of concurrent writers. In other words, changes are effectively serialized anyway, and since the different changes never block each other, the pure MVCC scheme actually consumes more compute resources (every failed commit is redone from scratch). For this reason, LMDB simply serializes all write operations with a lock.

These are the most important points of the LMDB implementation.



Original link: http://www.jianshu.com/p/yzFf8j

LevelDB and LMDB in Caffe

The data generated by Caffe comes in two formats: LMDB and LevelDB. Both are programming libraries for embedded key/value database management systems. While LMDB's memory consumption is about 1.1 times that of LevelDB, LMDB is 10% to 15% faster than LevelDB, and, more importantly, LMDB allows multiple training processes to read the same dataset at the same time.

As a result, LMDB has replaced LevelDB as the default dataset format generated by Caffe.
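For example, a training process would typically open the dataset read-only and scan it with a cursor; below is a minimal sketch using the LMDB C API (the path "./train_lmdb" is invented, and a real data layer would decode each value instead of merely counting records). Because each read transaction sees its own consistent snapshot, several such processes can scan the same dataset at the same time.

    #include <lmdb.h>
    #include <stdio.h>

    int main(void) {
        MDB_env *env;
        MDB_txn *txn;
        MDB_dbi dbi;
        MDB_cursor *cur;
        MDB_val key, val;
        size_t n = 0;

        mdb_env_create(&env);
        if (mdb_env_open(env, "./train_lmdb", MDB_RDONLY, 0664) != 0) {
            mdb_env_close(env);
            return 1;
        }

        mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
        mdb_dbi_open(txn, NULL, 0, &dbi);
        mdb_cursor_open(txn, dbi, &cur);

        while (mdb_cursor_get(cur, &key, &val, MDB_NEXT) == 0)
            n++;                              /* a trainer would decode val here */
        printf("records: %zu\n", n);

        mdb_cursor_close(cur);
        mdb_txn_abort(txn);
        mdb_env_close(env);
        return 0;
    }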

LevelDB has the following characteristics:
First, LevelDB is a persistent key-value store. Unlike Redis, which is an in-memory key-value system, LevelDB does not consume as much memory as Redis; instead it keeps most of its data on disk.
Second, when storing data, LevelDB keeps records sorted by key, i.e. adjacent keys are stored next to each other in the data files. The application can supply a custom key comparison function, and LevelDB will then order records according to that user-defined comparator.
Third, like most key-value systems, the LevelDB interface is simple, offering basic operations such as writing, reading, and deleting records. Atomic batch operations covering multiple writes are also supported.

In addition, LevelDB supports data snapshots, so read operations are not affected by writes and always see a consistent view of the data for the duration of the read.

LevelDB also supports data compression, which directly helps reduce storage space and improve I/O efficiency. LevelDB's performance is notable: the official site reports random write throughput of about 400,000 records per second and random read throughput of about 60,000 records per second. In general, LevelDB writes are much faster than reads, and sequential reads and writes are much faster than random ones.
