Research notes on RocksDB, the "upgraded" LevelDB storage engine

Source: Internet
Author: User
Tags prefetch ranges

Google's LevelDB is an excellent storage engine, but it has some shortcomings: for example, it does not support multi-threaded compaction, and its support for key range lookups is very basic, with no optimizations. Facebook's RocksDB is a more powerful engine built as a set of improvements on top of LevelDB, and it is used in much the same way. For a comparison of the two, see reference 1.

I looked into RocksDB because it implements a prefix Bloom filter, which can optimize range lookups and is very useful for my current project. What follows summarizes my research and my analysis of parts of the RocksDB source code.


1. Findings on Bloom filters in RocksDB

This part draws mainly on RocksDB's official blog and related discussions. The key points are:

(1) RocksDB supports building the Bloom filter on a sub-part of the key, which makes range queries possible.

(2) The key is split into a prefix and a suffix. A prefix_extractor is configured to specify the key prefix, and the prefix of each key is stored in the Bloom filter. When an iterator is given a prefix, these Bloom bits let it skip data that holds no key with that prefix.

(3) RocksDB implements two Bloom filters. One filters out blocks that cannot contain the key before the block is read (same as LevelDB). The other is generated dynamically in memory and filters keys when the memtable is queried (before any block read).

This information comes mainly from the following sources:

    • Official blog
    • Discussion of RocksDB's features on Hacker News
    • RocksDB Basics
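
To make point (2) concrete, here is a minimal, self-contained sketch of a Bloom filter built over key prefixes. It is not RocksDB's implementation: the names ToyPrefixBloom and ExtractPrefix, the 1024-bit size, and the four probes are all assumptions chosen for illustration, and the fixed-length extractor only stands in for whatever a configured prefix_extractor would do.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical fixed-length prefix extractor: the first `len` bytes of the key
// (or the whole key if it is shorter).
inline std::string ExtractPrefix(const std::string& key, std::size_t len) {
  return key.substr(0, len);
}

// Toy Bloom filter over key prefixes. Names and sizes are illustrative only.
class ToyPrefixBloom {
 public:
  explicit ToyPrefixBloom(std::size_t prefix_len) : prefix_len_(prefix_len) {
    assert(prefix_len > 0);
  }

  // Record the prefix of a key that is being stored.
  void AddKey(const std::string& key) {
    const std::string p = ExtractPrefix(key, prefix_len_);
    for (std::size_t i = 0; i < kProbes; ++i) bits_.set(Probe(p, i));
  }

  // false: no stored key has this prefix.
  // true:  some stored key may have it (false positives are possible).
  bool PrefixMayMatch(const std::string& key_or_prefix) const {
    const std::string p = ExtractPrefix(key_or_prefix, prefix_len_);
    for (std::size_t i = 0; i < kProbes; ++i) {
      if (!bits_.test(Probe(p, i))) return false;
    }
    return true;
  }

 private:
  static constexpr std::size_t kBits = 1024;
  static constexpr std::size_t kProbes = 4;

  // Double hashing: probe i lands on (h1 + i * h2) mod kBits.
  static std::size_t Probe(const std::string& p, std::size_t i) {
    const std::size_t h1 = std::hash<std::string>{}(p);
    const std::size_t h2 = std::hash<std::string>{}(p + '#') | 1;
    return (h1 + i * h2) % kBits;
  }

  std::size_t prefix_len_;
  std::bitset<kBits> bits_;
};
```

Because only the prefix is hashed, every key that shares a prefix with a stored key passes the filter, which is what allows a prefix scan to skip data that holds no key with that prefix.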


2. Optimizations of the Get interface in RocksDB (compared with LevelDB)

This section briefly summarizes some optimization techniques in RocksDB's implementation of Get. The overall flow is the same as in LevelDB: memtable -> immutable memtable -> SSTable. The details differ, however, mainly in the following ways:

(1) Get on the memtable / immutable memtable (memtable.cc, Get)

RocksDB adds a Bloom filter check to this step, as follows:

    if (prefix_bloom_ &&
        !prefix_bloom_->MayContain(prefix_extractor_->Transform(user_key))) {
      // iter is null if prefix bloom says the key does not exist
    } else {
      // query the memtable
    }

This Bloom filter is generated dynamically (it is not persisted), and it is a prefix Bloom filter: it filters by key prefix.


(2) Get on SSTables: level -> file -> block, searched layer by layer

a. On level 0, a prefetch step is added before the files are searched (it prefetches the Bloom filter data block for L0 files):

    // Prefetch table data to avoid cache miss if possible
    if (level == 0) {
      for (int i = 0; i < num_files; ++i) {
        auto* r = files_[0][i]->fd.table_reader;
        if (r) {
          r->Prepare(ikey);
        }
      }
    }

This relies on the prefix hashing technique (see reference 2).

b. Next, the candidate files on each level are located (in the same way as in LevelDB), and key range filtering and fractional cascading are applied to the files to speed up the search. This is done only when one of two conditions holds: not all files are on level 0, or level 0 holds more than 3 files. In other words, if everything sits on level 0 and there are 3 or fewer files, key range filtering is skipped. In that case each query already touches very few tables, so key range filtering is unlikely to be more efficient than simply querying the files directly.

Key range filtering is very simple: it checks whether the key falls within the file's [smallest_key, largest_key] range. Fractional cascading, put simply, reuses the comparison results from key range filtering on the upper level as a hint for key range filtering on the next level. This reduces the number of comparisons and locates the file on the next level faster (see reference 3). Once the file is located, the block lookup takes place, and the block lookup in RocksDB (block_based_table_reader.cc) uses the same Bloom filter mechanism as LevelDB.
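
The range check described above can be sketched as follows. This is a simplified illustration, not RocksDB code: FileRange and both helper functions are hypothetical names, keys are compared as plain byte strings, and the fractional cascading hint between levels is left out. The sketch assumes level 0 files may overlap, while files on deeper levels are sorted and non-overlapping.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical per-file metadata: only the key bounds matter here.
struct FileRange {
  std::string smallest_key;
  std::string largest_key;
};

// Key range filter: can `key` possibly live in this file?
inline bool KeyInRange(const FileRange& f, const std::string& key) {
  return f.smallest_key <= key && key <= f.largest_key;
}

// Level 0 files may overlap, so each one is tested individually.
inline std::vector<int> CandidatesInLevel0(const std::vector<FileRange>& files,
                                           const std::string& key) {
  std::vector<int> hits;
  for (int i = 0; i < static_cast<int>(files.size()); ++i) {
    if (KeyInRange(files[i], key)) hits.push_back(i);
  }
  return hits;
}

// Deeper levels are assumed sorted and non-overlapping, so a binary search on
// largest_key finds the single candidate. Returns -1 if no file can hold `key`.
inline int CandidateInSortedLevel(const std::vector<FileRange>& files,
                                  const std::string& key) {
  assert(std::is_sorted(files.begin(), files.end(),
                        [](const FileRange& a, const FileRange& b) {
                          return a.largest_key < b.largest_key;
                        }));
  auto it = std::lower_bound(
      files.begin(), files.end(), key,
      [](const FileRange& f, const std::string& k) { return f.largest_key < k; });
  if (it == files.end() || !KeyInRange(*it, key)) return -1;
  return static_cast<int>(it - files.begin());
}
```

Fractional cascading would go one step further and pass the position found on one level down as search bounds for the next, so that the binary search there covers a smaller slice of files.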

RocksDB differs from LevelDB in many other ways as well. For the memtable, it offers a linked-list implementation in addition to the skiplist. For SSTables, it offers a plain table format in addition to the block-based table. It also supports multi-threaded compaction, allows multiple instances in a single process, and adds a Merge interface on top of the basic Put/Get/Delete interfaces, among other things.


3. Implementation details of the prefix Bloom filter in RocksDB

After studying the RocksDB source code, I summarize its prefix Bloom filter implementation, as I understand it, as follows:

(1) RocksDB has two storage formats for persistent data: the BlockBasedTable format and the PlainTable format. The BlockBasedTable format derives from LevelDB's table format, and the overall layout is unchanged:

    <beginning_of_file>
    [data block 1]
    [data block 2]
    ...
    [data block N]
    [meta block 1: filter block]
    [meta block 2: stats block]
    ...
    [meta block K: future extended block]
    [metaindex block]
    [index block]
    [Footer]
    <end_of_file>

The implementation differs from LevelDB, however, notably in the filter block section. LevelDB's filter block can store a Bloom filter over all whole keys. RocksDB's filter block can store not only a Bloom filter over all whole keys but also a Bloom filter over all key prefixes. This is controlled by two parameters, whole_key_filtering_ and prefix_extractor_: whole_key_filtering_ controls whether the whole-key Bloom filter is stored, and prefix_extractor_ controls whether the prefix Bloom filter is stored. To store a prefix Bloom filter, the prefix length must be recorded in prefix_extractor_, so that the filter block builder can extract each key's prefix from that length and generate the prefix Bloom filter. A PrefixMayMatch() function is added to filter by prefix (LevelDB only has KeyMayMatch()).

Note: besides the filter block, the index block implementation also differs. RocksDB adds a prefix index block, which stores an index record for the prefix part of each key in the data blocks to make searching by prefix easier.

(2) Once the filter block has been built, a prefix scan can be performed as follows:

    auto iter = db->NewIterator(ReadOptions());
    for (iter->Seek(prefix);
         iter->Valid() && iter->key().starts_with(prefix);
         iter->Next()) {
      // do something
    }
    delete iter;  // the caller owns the iterator

Internally this goes through the Seek methods of the various iterator types that the iterator wraps. The iterator that uses the prefix Bloom filter is the SSTable's TwoLevelIterator (that is, what gets filtered out is disk I/O). The Seek method in two_level_iterator applies the prefix filter before any disk read, as follows (two_level_iterator.cc, Seek):

    if (state_->check_prefix_may_match &&
        !state_->PrefixMayMatch(target)) {
      SetSecondLevelIterator(nullptr);
      return;
    }

The PrefixMayMatch function works in the following steps (block_based_table_reader.cc, PrefixMayMatch):

A. Extract the prefix part of the key, based on the prefix_extractor information.

B. Construct an index iterator for the prefix and use the index information to find out whether the prefix can be present in this file. (No real block read happens at this point, so there is no disk I/O yet.)

C. If the prefix cannot be in the file, return false. If it can, check further whether the prefix of the full key that the iterator currently points to is the prefix being sought, because the index can only bound the range and cannot establish that the prefix is actually present. If it is, return true. Otherwise, fetch the Bloom filter from the filter block and apply the prefix Bloom filter's PrefixMayMatch. Only if the prefix is not filtered out does a real block lookup on disk begin.


The steps above describe, in brief, how the prefix scan is implemented. Here is a simple example (from db_test.cc):

Generate 11 SST files using the following sets of prefix ranges:

GROUP 0: [0,10] (level 1)

GROUP 1: [1,2], [2,3], [3,4], [4,5], [5,6] (level 0)

GROUP 2: [0,6], [0,7], [0,8], [0,9], [0,10] (level 0)

The key ranges corresponding to these 11 prefix ranges are:

GROUP 0: [00______:start, 10______:end]

GROUP 1: [01______:start, 02______:end], [02______:start, 03______:end],
         [03______:start, 04______:end], [04______:start, 05______:end],
         [05______:start, 06______:end]

GROUP 2: [00______:start, 06______:end], [00______:start, 07______:end],
         [00______:start, 08______:end], [00______:start, 09______:end],
         [00______:start, 10______:end]

The prefix length here is 8. To look up the prefix "03______:" across these 11 SST files, the previous API (as in LevelDB) needs 11 random I/Os. With RocksDB's new API and the prefix filter option enabled, only 2 random I/Os are needed, because only two files contain that prefix.
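
The count of two files can be checked with a few lines of code. The sketch below is only a worked version of the arithmetic, not part of db_test.cc: it assumes that each SST file stores exactly the two endpoint keys of its range (the ":start" key and the ":end" key), and the function names are hypothetical.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// The 11 prefix ranges listed above, as (first, last) prefix numbers.
inline std::vector<std::pair<int, int>> ExamplePrefixRanges() {
  return {
      {0, 10},                                  // GROUP 0 (level 1)
      {1, 2}, {2, 3}, {3, 4}, {4, 5}, {5, 6},   // GROUP 1 (level 0)
      {0, 6}, {0, 7}, {0, 8}, {0, 9}, {0, 10},  // GROUP 2 (level 0)
  };
}

// Number of files that actually store a key with the given prefix number,
// ASSUMING each file holds only its two endpoint keys
// ("NN______:start" and "NN______:end").
inline int CountFilesStoringPrefix(int prefix) {
  int n = 0;
  for (const auto& r : ExamplePrefixRanges()) {
    if (r.first == prefix || r.second == prefix) ++n;
  }
  return n;
}
```

Under that assumption, CountFilesStoringPrefix(3) is 2: only the files for [2,3] and [3,4] hold a key beginning with "03______". Those are the two files a prefix filter has to let through, while the other nine can be rejected without a read (barring Bloom filter false positives).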


4. About a get_range interface in RocksDB

Although RocksDB implements a prefix Bloom filter, it does not provide a get_range interface. When the official documentation says that the Bloom filter supports range queries, it means that RocksDB implements a prefix Bloom filter which users can exploit for range lookups; the interface itself has to be implemented by the user. RocksDB puts the meta blocks that LevelDB reserved in the SST file to concrete use, and the prefix information is stored in a meta block (block_based_table_builder.cc). We can therefore borrow the principle of the prefix Bloom filter to implement our own range Bloom filter.


5. Preliminary thoughts on implementing a range Bloom filter in LevelDB

First, the external get_range interface looks like this:

    int get_range(int area, const data_entry &pkey, const data_entry &start_key,
                  const data_entry &end_key, int offset, int limit,
                  vector<data_entry*> &values, short type = CMD_RANGE_ALL);

Here pkey is the prefix key, so the range Bloom filter can be implemented as a Bloom filter built over pkey.

The basic idea is as follows:

(1) Extract the appropriate prefix from each key in the data block.

(2) Build a Bloom filter over the prefix keys (in the same way as for whole keys) and add it to the filter block. It can be stored together with the whole-key Bloom filter or separately, and is located through the index block.

(3) In get_range, first obtain the prefix Bloom filter, then run the prefix filter on pkey to rule out files or blocks whose prefixes do not match. This gives us the range Bloom filter.
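
The three steps can be sketched end to end as below. This is a toy model under stated assumptions, not LevelDB or Tair code: ToyBlock, BuildBlock and GetRange are hypothetical names, every full key is assumed to be pkey followed by a sub-key, all prefixes are assumed to have the same length as pkey, and an exact set of prefixes stands in for the Bloom bits so that the example stays short and deterministic.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for one data block plus its prefix filter. A real
// filter block would hold Bloom bits; an exact set keeps the sketch short.
struct ToyBlock {
  std::vector<std::string> keys;             // sorted full keys
  std::unordered_set<std::string> prefixes;  // prefixes of those keys
};

// Steps (1) and (2): extract each key's prefix and add it to the block's filter.
inline ToyBlock BuildBlock(const std::vector<std::string>& sorted_keys,
                           std::size_t prefix_len) {
  ToyBlock b;
  b.keys = sorted_keys;
  for (const auto& k : sorted_keys) b.prefixes.insert(k.substr(0, prefix_len));
  return b;
}

// Step (3): skip blocks whose filter rejects pkey, then scan the remaining
// blocks for keys in [pkey + start, pkey + end]. `blocks_read` reports how
// many blocks were actually scanned.
inline std::vector<std::string> GetRange(const std::vector<ToyBlock>& blocks,
                                         const std::string& pkey,
                                         const std::string& start,
                                         const std::string& end,
                                         int* blocks_read) {
  assert(blocks_read != nullptr);
  *blocks_read = 0;
  const std::string lo = pkey + start;
  const std::string hi = pkey + end;
  std::vector<std::string> out;
  for (const auto& b : blocks) {
    if (b.prefixes.count(pkey) == 0) continue;  // filtered out: no block read
    ++*blocks_read;
    for (const auto& k : b.keys) {
      if (lo <= k && k <= hi) out.push_back(k);
    }
  }
  return out;
}
```

With a real Bloom filter in place of the set, the only behavioral difference is that a block may occasionally pass the filter without holding the prefix, which costs one unnecessary read but never drops a result.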


6. References

1. RocksDB introduction: an engine more powerful than LevelDB

2. Prefix hashing in RocksDB: speeding up queries for special workloads

3. Use fractional cascading to optimize file lookups on level

4. The Story of RocksDB
