Exploration of database compression technology

As a database, given a certain amount of system resources (CPU, memory, SSD, disk, and so on), we want to:

    • store more data: use compression; the world offers a wide variety of compression algorithms;
    • access data faster: use faster compression (write) / decompression (read) algorithms and larger caches.

Almost all compression algorithms depend heavily on context:

    • Data that is adjacent in location is generally more correlated and therefore carries more intrinsic redundancy;
    • The larger the context, the higher the achievable compression ratio (up to a limit).
Block compression technology in traditional databases

For compressing ordinary data blocks or files, traditional (streaming) compression algorithms work well, and over a long time we have grown accustomed to this mode of data compression. Compression algorithms based on this model keep emerging, and new ones are implemented all the time, including the most widely used gzip and bzip2, Google's Snappy, the newcomer ZSTD, and more (a minimal ZSTD usage sketch follows the list below).

    • gzip is supported on almost every platform and has become a de facto standard; its compression ratio, compression speed, and decompression speed are all fairly balanced;
    • bzip2 is based on the BWT (Burrows-Wheeler Transform); it essentially splits the input into blocks and compresses each block separately. Its advantage is a high compression ratio, but compression and decompression are relatively slow;
    • Snappy, from Google, compresses and decompresses very quickly but has a low compression ratio; it suits real-time compression scenarios that can tolerate a lower ratio;
    • LZ4 is a strong competitor to Snappy and is even faster, especially at decompression;
    • ZSTD is a newcomer: its compression ratio is much higher than LZ4 and Snappy at a slightly lower compression/decompression speed; compared with gzip, its compression ratio is comparable while compression and decompression are much faster.
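
To make the streaming-compression model concrete, here is a minimal sketch that compresses and decompresses one in-memory block with ZSTD's one-shot API (zstd.h, link with -lzstd). The buffer contents and the 32 KB block size are made up purely for illustration.

```cpp
// Minimal block compress/decompress round trip with the ZSTD simple API.
#include <zstd.h>

#include <cstdio>
#include <string>
#include <vector>

int main() {
    std::string block(32 * 1024, 'x');  // pretend this is one 32 KB data block
    size_t bound = ZSTD_compressBound(block.size());
    std::vector<char> compressed(bound);

    size_t csize = ZSTD_compress(compressed.data(), bound,
                                 block.data(), block.size(),
                                 /*compressionLevel=*/3);
    if (ZSTD_isError(csize)) { std::fprintf(stderr, "%s\n", ZSTD_getErrorName(csize)); return 1; }

    std::vector<char> restored(block.size());
    size_t dsize = ZSTD_decompress(restored.data(), restored.size(),
                                   compressed.data(), csize);
    if (ZSTD_isError(dsize)) { std::fprintf(stderr, "%s\n", ZSTD_getErrorName(dsize)); return 1; }

    std::printf("original=%zu compressed=%zu\n", block.size(), csize);
    return 0;
}
```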

For databases, in the early days of computing the disk-optimized B-tree was unshakable, and a disk-optimized B-tree uses large blocks/pages, which happens to give traditional compression algorithms a large context. So block/page-based compression was naturally adopted by all kinds of databases. In that primitive era, memory and disk were clearly separated in both performance and capacity, applications had relatively modest performance demands, and everyone lived in peace.

Now we have SSDs, PCIe SSDs, 3D XPoint, and so on, and memory keeps getting larger, so the disadvantages of block compression are increasingly prominent:

    • If the block is small, the compression ratio is not good enough; if the block is large, the performance is unbearable;
    • Worse, block compression only saves space on the larger, cheaper disk or SSD;
    • The more expensive, smaller memory is not saved at all; it is actually wasted (the double-caching problem).

Therefore, for many applications with demanding latency requirements, compression can only be turned off.

The principle of block compression

General-purpose compression techniques (Snappy, LZ4, gzip, bzip2, ZSTD, etc.) compress data by block/page; the block size is usually 4 KB~32 KB (TokuDB, known for its compression ratio, uses blocks of 2 MB~4 MB). These are logical blocks, not the physical blocks of memory paging or block devices.

When compression is enabled, the resulting decrease in access speed is due to:

    • On write, many records are packed together and compressed into one block. Increasing the block size gives the compression algorithm a larger context and therefore a higher compression ratio; conversely, reducing the block size lowers the compression ratio.
    • On read, even for a very small piece of data, the entire block must be decompressed before the needed data can be read. So the larger the block, the more records it contains, and the more unnecessary decompression is done to read a single record, hence the worse the performance; conversely, the smaller the block, the better the read performance.

Once compression is enabled, to mitigate these problems traditional databases usually need a large dedicated cache to hold decompressed data. This greatly improves access performance for hot data, but it also creates a double-caching problem: the operating system caches the compressed data, while the dedicated cache (for example, DBCache in RocksDB) holds the decompressed data. There is another serious issue: the dedicated cache is still just a cache, and on a cache miss an entire block must still be decompressed, which is a major source of slow queries (the other major source being misses in the operating system cache).

All of this leaves traditional databases with a long-standing tension between access speed and space consumption that cannot be fully resolved, only traded off.

Block compression in RocksDB

Take RocksDB as an example: BlockBasedTable in RocksDB is a block-compressed SSTable. With block compression, the index points only to blocks. The block size is configured in the table options, and one block contains multiple (key, value) records, say M of them, which shrinks the index to 1/M of its size:

    • The larger M is, the smaller the index;
    • The larger the block, the more context the compression algorithm (gzip, Snappy, etc.) gets, and the higher the compression ratio.

When a BlockBasedTable is created, key-value pairs are appended to a buffer. When the buffer reaches the configured block size (the actual buffer size generally does not equal the preset block size exactly), the buffer is compressed and written to the BlockBasedTable file, and the file offset together with the first key in the buffer is recorded (it will be used to build the index). If a single record is larger than the preset block size, that record occupies a block by itself (a single record is never split across blocks, no matter how large it is). After all key-value pairs have been written, an index is created from the recorded starting key and file offset of each block. So in a BlockBasedTable file the data comes first and the index comes last, and the end of the file holds the meta information (functionally equivalent to the usual file header, except that it sits at the end of the file and is therefore called a footer).
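
The flow just described can be summarized with a small sketch. This is not RocksDB's actual BlockBuilder code: the class name, the identity "codec", and the in-memory "file" are stand-ins used only to show where block boundaries, per-block first keys, and index entries come from.

```cpp
// Simplified, illustrative block-building flow; NOT RocksDB's implementation.
#include <cstdint>
#include <string>
#include <vector>

struct IndexEntry {
    std::string first_key;  // first key stored in the block
    uint64_t    offset;     // file offset where the compressed block starts
};

class SimpleBlockWriter {
public:
    explicit SimpleBlockWriter(size_t block_size) : block_size_(block_size) {}

    void Add(const std::string& key, const std::string& value) {
        if (buffer_.empty()) first_key_ = key;       // remember the block's first key
        buffer_ += key;
        buffer_ += value;
        if (buffer_.size() >= block_size_) Flush();  // an oversized record also becomes one block
    }

    void Finish() {
        if (!buffer_.empty()) Flush();
        // A real implementation would now append the index (first_key -> offset
        // for every block) and a footer that points at the index.
    }

private:
    void Flush() {
        std::string compressed = Compress(buffer_);   // e.g. gzip/Snappy/ZSTD
        index_.push_back({first_key_, file_offset_});
        file_ += compressed;                          // stands in for a file append
        file_offset_ += compressed.size();
        buffer_.clear();
    }

    // Identity stand-in for a real compression codec.
    static std::string Compress(const std::string& raw) { return raw; }

    size_t block_size_;
    std::string buffer_, first_key_, file_;
    uint64_t file_offset_ = 0;
    std::vector<IndexEntry> index_;
};

int main() {
    SimpleBlockWriter writer(/*block_size=*/16);
    writer.Add("key1", "value1");
    writer.Add("key2", "value2");  // crosses 16 bytes, so the first block is flushed
    writer.Finish();
    return 0;
}
```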

On a lookup, the index is used first to find the block that may contain the search key; then that block is looked up in the DB cache. If the block is found, the search continues inside it for the key; if not, the block is read from disk/SSD, decompressed, and placed into the DB cache. RocksDB has several DB cache implementations, including an LRU cache, a clock cache, a counting cache (used to track cache hit rates, among other things), and some other special-purpose caches.

Generally the operating system also has a file cache, so the same data may reside both in the DB cache (decompressed) and in the operating system cache (compressed). To avoid this memory waste, RocksDB offers a trade-off: enabling the direct I/O option bypasses the operating system cache so that only the DB cache is used. This saves some memory but degrades performance to some extent.
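
For reference, here is a minimal sketch of the RocksDB options involved in this trade-off: block size, the block (DB) cache, the per-block compression codec, and direct reads that bypass the OS page cache. The sizes and the database path are arbitrary, and the exact option set varies between RocksDB versions.

```cpp
// Minimal RocksDB setup showing block size, block cache, codec, direct reads.
#include <rocksdb/cache.h>
#include <rocksdb/db.h>
#include <rocksdb/table.h>

#include <cassert>

int main() {
    rocksdb::BlockBasedTableOptions table_options;
    table_options.block_size  = 16 * 1024;                        // logical block/page size
    table_options.block_cache = rocksdb::NewLRUCache(512 << 20);  // DB cache for decompressed blocks

    rocksdb::Options options;
    options.create_if_missing = true;
    options.compression = rocksdb::kSnappyCompression;  // per-block compression codec
    options.use_direct_reads = true;                     // bypass the OS cache on reads
    options.table_factory.reset(
        rocksdb::NewBlockBasedTableFactory(table_options));

    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/blockdb-example", &db);
    assert(s.ok());
    delete db;
    return 0;
}
```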

A traditional but non-mainstream compression: FM-Index

FM-Index stands for Full Text Matching Index. It belongs to the succinct data structure family: it has some compression capability and can perform search and access directly on the compressed data.

FM-Index is feature-rich and has a fairly long history; it is not a new technology and has been widely used in a few special scenarios, but for various reasons it has never become mainstream. In recent years FM-Index has started to attract attention again: first, complete implementations of succinct algorithms, including FM-Index, appeared on GitHub; then came Berkeley's Succinct project, which is built on FM-Index.

FM-Index is an offline algorithm (all data is compressed in one pass and cannot be modified afterwards). It is generally built on the BWT transform (which is itself based on the suffix array). A compressed FM-Index supports the following two most important operations:

    • data = extract(offset, length)
    • {offset} = search(string), which returns the positions/offsets of the (possibly multiple) matches of the string

FM-Index supports other operations as well; interested readers can investigate further.

However, in my opinion, FM-Index has several fatal drawbacks:

    • It is very complex to implement (this can be overcome by a few experts, so let's set it aside);
    • The compression ratio is low (far lower than streaming compression such as gzip);
    • Search and access (extract) are slow (on the fastest CPU of 2016, the i7-6700K, single-threaded throughput does not exceed 7 MB/s);
    • Compression is slow and memory-hungry (Berkeley Succinct's compression process uses more than 50 times the memory of the source data);
    • The data model is flat text, not the key-value model of a database.

The flat model can be converted into a key-value model in a simple way: pick a character "#" that never appears in keys or values (if no such character exists, escaping is needed), insert this character before and after each key, and place the value immediately after the key. Then search(#key#) returns the positions where #key# occurs, from which the value can easily be read.
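
Here is a minimal sketch of that mapping. Plain substring search stands in for FM-Index's search() and direct reads from the flat string stand in for extract(); a real FM-Index would perform both directly on the compressed text. The "#" separator is assumed never to occur in keys or values.

```cpp
// Flat-text-to-key-value mapping sketch; std::string::find is a stand-in
// for FM-Index search(), and plain substring reads stand in for extract().
#include <iostream>
#include <map>
#include <string>

std::string BuildFlatText(const std::map<std::string, std::string>& kv) {
    std::string flat;
    for (const auto& [key, value] : kv)
        flat += "#" + key + "#" + value;  // ...#key#value#key#value...
    return flat;
}

std::string Lookup(const std::string& flat, const std::string& key) {
    std::string pattern = "#" + key + "#";
    size_t pos = flat.find(pattern);          // stand-in for search(#key#)
    if (pos == std::string::npos) return "";
    size_t begin = pos + pattern.size();      // value starts right after #key#
    size_t end = flat.find('#', begin);       // value ends at the next key's '#'
    return flat.substr(begin, end == std::string::npos ? end : end - begin);
}

int main() {
    std::string flat = BuildFlatText({{"cat", "meow"}, {"dog", "woof"}});
    std::cout << Lookup(flat, "dog") << "\n";  // prints "woof"
    return 0;
}
```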

Berkeley's Succinct project builds a richer row-column model on top of FM-Index's flat text model. It took great effort and achieved some results, but it is still too far from practical.

Interested readers can study FM-Index carefully and verify the author's summary and judgment.

Terark's searchable compression (Searchable Compression)

Terark has proposed the concept of "searchable compression" (Searchable Compression), whose core idea is to perform search and access (extract) directly on compressed data, while the data model itself is the key-value model. According to its test reports, it is much faster than FM-Index (by about two orders of magnitude). More specifically:

    • The block compression technology of traditional databases is abandoned in favor of global compression;
    • Different global compression techniques are used for keys and values;
    • Keys use CO-Index, a global compression technique with search capability (corresponding to FM-Index's search);
    • Values use PA-Zip, a global compression technique with point-access capability (corresponding to FM-Index's extract).
Compression of keys: CO-Index

We need to index the keys in order to search for and access the required data efficiently.

With ordinary indexing technology, the index is much larger than the raw keys being indexed. Some indexes use prefix compression, which alleviates index bloat to some extent, but still cannot solve the problem of indexes consuming too much memory.

We propose the concept of CO-Index (Compressed Ordered Index) and realize it with a data structure called Nested Succinct Trie.

Compared with the data structures traditionally used for indexes, a Nested Succinct Trie occupies ten to several tens of times less space. While maintaining this compression ratio, it still supports a rich set of search operations (a hypothetical interface sketch follows the list):

    • Exact search;
    • Range search;
    • Sequential traversal;
    • Prefix search;
    • Regular-expression search (not by traversing entry by entry).
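
To make the contract concrete, here is a hypothetical interface sketch. The names are invented for illustration and are not Terark's actual API; the essential point is that every operation maps keys to internal IDs, which the value store uses later.

```cpp
// Hypothetical CO-Index-style interface sketch (invented names, not Terark's API).
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

class CompressedOrderedIndex {
public:
    // Exact search: returns the internal ID of the key, if present.
    virtual std::optional<uint64_t> Find(const std::string& key) const = 0;

    // Range search: IDs of all keys in [lo, hi), in key order.
    virtual std::vector<uint64_t> Range(const std::string& lo,
                                        const std::string& hi) const = 0;

    // Prefix search: IDs of all keys starting with the given prefix.
    virtual std::vector<uint64_t> PrefixSearch(const std::string& prefix) const = 0;

    // Regex search: matched by walking the trie, not by scanning every key.
    virtual std::vector<uint64_t> RegexSearch(const std::string& pattern) const = 0;

    virtual ~CompressedOrderedIndex() = default;
};
```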

CO-Index also has advantages over FM-Index (assuming all the data in the FM-Index are keys).


Table 1: FM-Index vs. CO-Index

The principle of CO-Index

In fact, we have implemented several kinds of CO-Index, of which the Nested Succinct Trie is the most widely applicable. Here is a brief introduction to its principle:

Introduction to succinct data structures

A succinct data structure is a technique for representing an object in space close to the information-theoretic lower bound, usually using a bitmap on which rank and select operations are supported.

While it can greatly reduce memory usage, it is more complex to implement and its performance is much lower (the time complexity carries a large constant factor). The open-source sdsl-lite library exists; we use our own rank-select implementation, whose performance is higher than the open-source implementations.
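
As an illustration of the core primitive, here is a minimal sketch of bitmap rank using one cumulative counter per 64-bit word. This is neither sdsl-lite nor Terark's implementation; real rank-select structures use multi-level directories and also implement select.

```cpp
// Minimal bitmap rank sketch: cumulative popcount per 64-bit word.
#include <bit>
#include <cstdint>
#include <iostream>
#include <vector>

class RankBitmap {
public:
    explicit RankBitmap(const std::vector<uint64_t>& words) : words_(words) {
        ranks_.resize(words_.size() + 1, 0);
        for (size_t i = 0; i < words_.size(); ++i)
            ranks_[i + 1] = ranks_[i] + std::popcount(words_[i]);  // 1s before word i+1
    }

    // Rank1(pos): number of 1 bits in positions [0, pos).
    uint64_t Rank1(uint64_t pos) const {
        uint64_t word = pos / 64, bit = pos % 64;
        uint64_t partial =
            bit ? std::popcount(words_[word] & ((1ULL << bit) - 1)) : 0;
        return ranks_[word] + partial;
    }

private:
    std::vector<uint64_t> words_;   // the raw bitmap, 64 bits per word
    std::vector<uint64_t> ranks_;   // cumulative popcount at each word boundary
};

int main() {
    RankBitmap bm({0b1011ULL});        // bits 0, 1, 3 are set
    std::cout << bm.Rank1(4) << "\n";  // prints 3
    return 0;
}
```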

Take a binary tree as an example

The traditional representation stores two pointers in each node: struct Node { Node *left, *right; };

Each node occupies the space of two pointers. If we optimize the traditional representation by encoding node pointers with the minimum number of bits, then with n nodes each pointer needs $\lceil \log_2 n \rceil$ bits, i.e. $2\lceil \log_2 n \rceil$ bits per node.

    • Comparing the optimized traditional version with the basic traditional version, assuming a total of $2^{16}$ nodes (including null nodes): the optimized version needs 2 bytes per pointer, while the basic version needs 4 or 8 bytes per pointer.
    • Now compare the traditional versions with succinct, assuming a total of 1 billion ($\approx 2^{30}$) nodes.
    • In the optimized traditional version each pointer occupies $\lceil \log_2 2^{30} \rceil = 30$ bits, so the total memory consumption is $\frac{2 \times 30}{8} \times 2^{30} \approx 7.5$ GB.
    • Using succinct, the total is $\frac{2.5}{8} \times 2^{30} \approx 312.5$ MB (2.5 bits per node, of which 0.5 bits is the space occupied by the rank-select index).

Succinct Tree

There are many ways to represent a succinct tree; here are two common ones:


Figure 1: Examples of succinct tree representations

Succinct Trie = Succinct Tree + Trie labels

A trie can be used as an index. The succinct trie in Figure 2 uses the LOUDS encoding and stores four keys: hat, is, it, and a.

Patricia trie plus nesting

Using succinct techniques alone, the compression ratio is still far from sufficient, so path compression and nesting are applied on top, which lifts the compression ratio to a new level.

The combination of the above techniques is our Nested Succinct Trie.

Compression of values: PA-Zip

We developed a compression technique called PA-Zip (Point Accessible Zip): every record is associated with an ID, and once the data is compressed, each record can be accessed by its ID. Here the ID is the "point", hence the name Point Accessible Zip.

PA-Zip compresses all the values in the entire database globally (the collection of all values in the key-value database), rather than block by block or page by page. It targets a database-specific requirement (the key-value model) and is a compression algorithm designed specifically to solve the compression problems of traditional databases:

The compression ratio is higher and there is no double-caching problem. As long as the compressed data is loaded into memory, no dedicated cache is needed and a single record can be read directly by its ID. If reading one record counts as one decompression, then:

    • Sequential decompression by ID: throughput is generally around 500 MB/s (single-threaded), up to about 7 GB/s, which suits offline analytical workloads; traditional database compression can also achieve this;
    • Random decompression by ID: throughput is generally around 300 MB/s (single-threaded), up to about 3 GB/s, which suits online serving workloads; this is where it beats traditional database compression. At 300 MB/s of random decompression, if each record averages 1 KB this corresponds to roughly QPS = 300,000; if each record averages 300 bytes, it corresponds to roughly QPS = 1,000,000;
    • Warm-up: in some special scenarios a database may need to be warmed up. Because the dedicated cache has been removed, warming up TerarkDB is relatively simple and efficient: it is enough to warm up the mmap memory (avoiding page faults), and the database is ready once loading finishes. The warm-up throughput is the sequential read performance of the SSD (newer SSDs exceed 3 GB/s).

Compared with FM-Index, PA-Zip addresses exactly what FM-Index's extract operation does, but with much better performance and compression ratio:


Table 2: FM-Index vs. PA-Zip

Combining keys with values

Keys are stored in the CO-Index under global compression, and values are stored in PA-Zip under global compression. Searching for a key yields an internal ID, and with that ID the corresponding value is accessed in PA-Zip. The whole process touches only the data that is actually needed and nothing else.
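
A hypothetical sketch of this lookup flow is shown below. The class names are invented stand-ins for CO-Index and PA-Zip rather than Terark's actual API; the point is the flow key -> internal ID -> value, with no block decompression and no dedicated block cache involved.

```cpp
// Hypothetical combined key/value lookup: CO-Index stand-in + PA-Zip stand-in.
#include <cstdint>
#include <optional>
#include <string>

struct KeyIndex {                // CO-Index stand-in (cf. the interface sketched earlier)
    virtual std::optional<uint64_t> Find(const std::string& key) const = 0;
    virtual ~KeyIndex() = default;
};

struct ValueStore {              // PA-Zip stand-in: point access by ID
    virtual std::string Extract(uint64_t id) const = 0;
    virtual ~ValueStore() = default;
};

// One read touches only the key index and the single value it needs.
std::optional<std::string> Get(const KeyIndex& keys, const ValueStore& values,
                               const std::string& key) {
    std::optional<uint64_t> id = keys.Find(key);  // search on compressed keys
    if (!id) return std::nullopt;                 // key not present
    return values.Extract(*id);                   // extract one value by its ID
}
```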

This eliminates the need for a dedicated cache (such as DBCache in RocksDB): only mmap is used, which fits the file system cache perfectly. The whole DB needs nothing beyond mmap and the file system cache, and together with the very high compression ratio this dramatically reduces memory usage and greatly simplifies system complexity. In the end the database performs far better, achieving a high compression ratio and high random-read performance at the same time.

At a more philosophical level, our storage engine looks as if it had been derived constructively: because CO-Index and PA-Zip cooperate closely and fit the key-value model perfectly, the functionality is "just enough", performance is pushed toward the hardware limit, and the compression ratio approaches the information-theoretic lower bound. Compared with other approaches:

    • Traditional block compression comes from general-purpose streaming compression. Streaming compression offers very limited functionality, only compress and decompress; too small a block yields no compression benefit, and redundancy across blocks cannot be compressed. Applying it to a database requires a lot of engineering effort, like strapping a car onto airplane wings and then expecting it to fly.
    • FM-Index is the opposite: its functionality is very rich, and it inevitably pays a price for that in compression ratio and performance. Yet in the key-value model we only need a very small subset of that rich functionality (which still has to be adapted and transformed); the rest is useless but its price is still paid, like building an airplane at great cost and then using it as a car, running it on the ground on its wheels.


Figure 2: A succinct trie expressed with LOUDS


Figure 3: Path compression and nesting

Appendix: compression ratio and performance comparison

Dataset: Amazon Movie Data

Amazon Movie Data (about 8 million reviews): the total size of the dataset is about 9 GB, with roughly 8 million records and an average record length of about 1 KB.

The benchmark code is open source; see the GitHub repository.

    • Compression ratio (see Figure 4)


Figure 4: Comparison of compression ratios
    • Random Read (see Figure 5)


Figure 5: Comparison of random read performance

This is the performance of each storage engine when memory is sufficient.

    • Latency curves (see Figure 6)


Figure 6: Comparison of latency curves

Dataset: English Wikipedia

All the text data of the English Wikipedia: 109 GB, compressed to 23 GB.

Dataset: TPC-H

On TPC-H's lineitem data, a comparison test between TerarkDB and the original RocksDB (BlockBasedTable):


Table 3: TerarkDB vs. the original RocksDB

API interface

TerarkDB = Terark SSTable + RocksDB

RocksDB originated as Facebook's fork of Google's LevelDB; its programming interface is compatible with LevelDB, with many improvements added.

What makes RocksDB useful to us is that its SSTable can be supplied as a plugin, so we implemented our own RocksDB SSTable, letting our technical advantages play out through RocksDB.
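
For reference, the plugin point is RocksDB's TableFactory: the table format is chosen by assigning Options::table_factory. The sketch below uses the built-in BlockBasedTable factory; a Terark SSTable would be plugged in through its own TableFactory at the same place (its constructor is not shown here).

```cpp
// Minimal sketch of RocksDB's SSTable plugin point (TableFactory).
#include <rocksdb/db.h>
#include <rocksdb/table.h>

#include <cassert>

int main() {
    rocksdb::Options options;
    options.create_if_missing = true;

    // Any TableFactory implementation can be substituted here.
    options.table_factory.reset(
        rocksdb::NewBlockBasedTableFactory(rocksdb::BlockBasedTableOptions()));

    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/sstable-plugin-example", &db);
    assert(s.ok());
    delete db;
    return 0;
}
```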

Although RocksDB provides a fairly complete key-value DB framework, it still falls short of fully accommodating our particular technology, so we had to make some changes to RocksDB itself. One day we may submit these changes to the official RocksDB.

GitHub link: TerarkDB (https://github.com/Terark/terarkdb). TerarkDB consists of two parts:

    • terark-zip-rocksdb (https://github.com/Terark/terark-zip-rocksdb), the Terark SSTable for RocksDB;
    • Terark's fork of RocksDB (https://github.com/Terark/rocksdb); this modified version of RocksDB must be used.

For better compatibility, TerarkDB makes no changes to RocksDB's API, and we even provide an easier way to use it: the program does not need to be recompiled; just replace librocksdb.so and set a few environment variables to try TerarkDB.

If you need finer control, you can use the C++ API to configure TerarkDB's options in detail.

At present it is free to try and can be used for performance evaluation, but not in production, because the trial version randomly deletes 0.1% of the data.

Terark Command Line Toolset

We provide a set of command-line tools that compress input data into different forms; the compressed files can be decompressed, or accessed at specific points, using the Terark API or other command-line tools in the toolset.
