Performance Report and Technical Analysis of the TerarkDB Database

Source: Internet
Author: User
Tags: benchmark, mixed, open, mmap, elastic search

Many people have seen the popular American drama "Silicon Valley," which imagines a future technology that can search compressed data directly, without decompressing it first. In reality, we are developing exactly this technology. Based on this core technology, we have released the storage engine product TerarkDB, which has very high technical barriers. Our goal is to surpass Facebook's RocksDB, Google's LevelDB, and MongoDB's WiredTiger, and to build the world's best-performing storage engine.

TERARKDB Introduction

TerarkDB is a storage engine with very high performance and compression ratios. It is similar to Facebook's RocksDB, but offers more functionality. Its features include:

    • High compression ratio, usually 2~5× that of Snappy
    • Retrieves data directly in real time, with no decompression needed
    • Low and stable query latency
    • A single table can contain multiple indexes; composite indexes and range search are supported
    • Native support for regular-expression retrieval
    • Can be embedded in-process or run in client-server mode
    • Data persistence
    • Schema support with rich data types
    • Both column storage and row storage, via column groups

TerarkDB has a wide range of applications on the Internet as well as in traditional industries. Because TerarkDB is optimized for read operations, it is best suited to read-heavy, write-light scenarios.

TerarkDB is quite flexible in how it can be used, and can serve as a standalone library to accommodate customer-specific scenarios. Download packages and Docker images are available for users' convenience. Linux, Windows, and Mac OS are currently supported.

As a storage engine, TerarkDB has its own native interface while also providing a LevelDB-compatible interface, so it can be adapted to any system or application that uses LevelDB, such as SSDB, which implements most of the Redis interface on top of LevelDB. In addition, the widely used RocksDB interface is a superset of the LevelDB interface, so most systems and applications that use RocksDB can also be adapted to TerarkDB easily.
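To make the compatibility claim concrete, here is a minimal sketch of the Get/Put/Delete call shape that LevelDB-style code relies on. The `LevelDBLikeDB` class is a hypothetical in-memory stand-in for illustration only, not TerarkDB's actual binding; the real engine persists and compresses the data behind the same contract.

```python
from typing import Optional

# Hypothetical in-memory stand-in for a LevelDB-compatible store.
# Only the call shape (byte keys/values, Get/Put/Delete) matters here.

class LevelDBLikeDB:
    """Minimal stand-in mimicking the LevelDB Get/Put/Delete contract."""

    def __init__(self) -> None:
        self._data: dict = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        # LevelDB's Get returns not-found rather than raising; None models that.
        return self._data.get(key)

    def delete(self, key: bytes) -> None:
        self._data.pop(key, None)


db = LevelDBLikeDB()
db.put(b"user:1", b"Joseph M. Kotow")
assert db.get(b"user:1") == b"Joseph M. Kotow"
db.delete(b"user:1")
assert db.get(b"user:1") is None
```

Code written against this narrow contract is what makes swapping the engine underneath (LevelDB, RocksDB, or TerarkDB's compatibility layer) practical.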

Terark officially provides a TerarkDB adaptation for MongoDB; adaptations for MySQL and other distributed database systems are under intensive development, and a stable release of the MongoTerark product is scheduled for the near future.

TERARKDB Performance Test Report

This section is reproduced from the Terark official website, where the original content can be viewed.

Directory
    • 1. Environment
      • 1.1. Server information
      • 1.2. Compare objects
      • 1.3. Test Data set
      • 1.4. Benchmark Source Code
      • 1.5. Compression Ratio
    • 2. Tests
      • 2.1. Random Read Test
      • 2.2. Random Write Test
      • 2.3. Read-Write Mixed Test
      • 2.4. Read Latency Test
1. Environment

1.1. Server Information

Indicator       Description
CPU             Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz (2 × 8 physical cores)
Memory          64 GB DDR4 RAM
SSD             Intel SSD 520 Series (480GB, 2.5in SATA 6Gb/s, 25nm, MLC)
Linux Kernel    3.10.0-327.10.1.el7.x86_64
1.2. Compared Databases

Product         Version    Company
RocksDB         v4.4       Facebook
WiredTiger      v2.8.0     MongoDB
HyperLevelDB    v1.2.2
LevelDB         v1.18      Google
1.3. Test Dataset

Amazon movie review data (about 8 million reviews), with an average length of approximately 1 KB per record

Raw data format:

product/productId: B00006HAXW
review/userId: A1RSDE90N6RSZF
review/profileName: Joseph M. Kotow
review/helpfulness: 9/9
review/score: 5.0
review/time: 1042502400
review/summary: Pittsburgh - Home of the OLDIES
review/text: I have all of the doo wop DVD's and this one is as good or better than the 1st ones. Remember once these performers are gone, we'll never get to see them again. Rhino did an excellent job and if you like or love doo wop and Rock n Roll you'll LOVE this DVD !!
Metadata (column names)
    • Because TerarkDB has a schema, metadata (column names) does not need to be stored in each record
    • To be fair, a separator is inserted between columns (fields) for the other databases, and column names are not stored
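A minimal sketch of how a record in this "section/name: value" format can be split into columns. The `parse_review` helper is illustrative only, not the benchmark's actual loader.

```python
# Split one review record into a column-name -> value dict.
# Each field in the sample format is a "section/name: value" line.

def parse_review(record: str) -> dict:
    fields = {}
    for line in record.strip().splitlines():
        # partition() splits on the first ": ", so colons inside the
        # value (e.g. in review/text) are preserved.
        name, _, value = line.partition(": ")
        fields[name] = value
    return fields

sample = """product/productId: B00006HAXW
review/userId: A1RSDE90N6RSZF
review/score: 5.0"""

parsed = parse_review(sample)
assert parsed["product/productId"] == "B00006HAXW"
assert parsed["review/score"] == "5.0"
```

With a schema-aware engine the field names on the left need not be stored per record; schemaless engines pay that metadata cost on every row unless, as in this benchmark, values are joined with a separator instead.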
Data set Size

The total size of the movies dataset is about 9 GB, comprising approximately 8 million records.

1.4. Benchmark Source Code

The benchmark source code is available in the GitHub repository.

1.5. Compression Ratio
    • TerarkDB compresses data with its self-developed compression algorithm
    • The other databases use block compression, with a block size of 4 KB and the compression algorithm set to Snappy
    • We use the random-write test case to compare the sizes of the written, compressed data

2. Tests

All read operations are random queries of a single record, and all write operations are random inserts or updates of a single record.

2.1. Random Read
    • All data is written to the file system in advance
    • Compression is enabled for all database writes; RocksDB/LevelDB/WiredTiger are configured to use Snappy
    • TerarkDB uses our own proprietary compression algorithm and needs no block compression; the other databases use the default 4 KB block size
2.1.1. Data is less than memory

In this case memory is large enough to hold all the data. TerarkDB does not need a dedicated cache, but the other databases do (mainly to cache decompressed data from block compression), so we set their dedicated caches to 3 GB.

In this test we also do not limit operating system memory usage (64 GB total); since the data is much smaller than memory, the operating system can cache all of it.

We can see that TerarkDB outperforms the other databases in this case:

    • TerarkDB uses a self-developed data compression algorithm that can extract a single record directly, without the block compression/decompression of traditional databases
    • TerarkDB uses self-developed succinct compressed data structures as indexes, which use less memory and search faster
2.1.2. Data slightly larger than memory

When the data cannot be fully loaded into memory, it must be stored on a physical disk (here we use SSDs as the storage medium).

    • The physical memory available to the operating system is limited to 8 GB
    • A dedicated 1 GB cache is configured for the other databases to hold hot data
    • All databases are warmed up (TerarkDB opens mmap with populate; the other databases use read-ahead)

In this case, the advantages of TERARKDB are more obvious:

    • All databases except TerarkDB must use block compression. Under random reads, even with a cache, the cache is finite and cannot hold all the data, which leads to frequent disk I/O and reduced read performance
    • TerarkDB's compression ratio is high, so the compressed data fits entirely in memory; since TerarkDB can also access the compressed data directly, its advantage becomes even more obvious
    • Because the other databases use dedicated caches, when the data read far exceeds the cache capacity, data churns in and out of the cache, adding extra resource overhead
2.1.3. Data is much larger than memory
    • Operating system memory is limited to 3 GB
    • A dedicated 256 MB cache is configured for the other databases
    • All databases are warmed up (TerarkDB opens mmap with populate; the other databases use read-ahead)

Because TerarkDB's results are much higher than the other databases', this chart uses logarithmic coordinates so that the order-of-magnitude difference is easier to see (note the vertical axis).

2.2. Random Write
    • Compression is enabled for all databases during writes, with the default 4 KB block size for block compression (TerarkDB needs no block compression)
    • All write buffers are set to 256 MB
    • Writes are performed concurrently with 1/3/6 threads
2.2.1. Data is less than memory

The environment for the random write test is similar to that of the random read test:

    • The storage medium is an in-memory file system (that is, the data is preloaded into the memory file system to speed up testing)
    • Operating system memory is not limited
    • A dedicated 3 GB cache is configured for the databases other than TerarkDB

2.2.2. Data slightly larger than memory

Similar to the environment of the random read test:

    • Total operating system memory is limited to 8 GB
    • The dedicated cache for the databases other than TerarkDB is set to 1 GB
    • The storage medium is an SSD
    • The write buffer is set to 256 MB

Test results on SSDs reflect the impact of disk I/O on performance more realistically:

    • TerarkDB writes indexes and data separately, which converts data writes into sequential writes to some degree
2.2.3. Data is much larger than memory
    • Operating system memory is limited to 3 GB
    • A dedicated 256 MB cache is configured for the other databases

2.3. Read-Write Mixed
    • TerarkDB mainly targets read-heavy scenarios with few writes
    • A total of 8 threads are used; each thread issues a random mix of reads and writes, with reads accounting for 95%/99% of operations
    • Compression is enabled for all writes, with a 4 KB block compression size
    • The other databases are first warmed up with random reads to populate their dedicated caches
2.3.1. The amount of data is less than memory
    • The storage medium is an in-memory file system (that is, the data is preloaded into the memory file system to speed up testing)
    • Operating system memory is not limited
    • The dedicated cache for the databases other than TerarkDB is set to 3 GB

2.3.2. Data slightly larger than memory
    • The storage medium is changed to an SSD
    • Operating system memory is limited to 8 GB
    • The dedicated cache for the other databases is set to 1 GB
    • 99%-read and 95%-read workloads are tested separately

2.3.3. Data is much larger than memory
    • Operating system memory is limited to 3 GB
    • A dedicated 256 MB cache is configured for the other databases
    • All databases are warmed up (TerarkDB opens mmap with populate; the other databases use read-ahead)

Again, due to the order-of-magnitude difference, we view the data on logarithmic coordinates:

2.4 Read Latency Test

The dataset in this test is still the 9 GB movie review data; only read-query latency is measured, and there are no write operations during the test.

Because TerarkDB's compression ratio is very high, all its data fits in 3 GB of system memory (the compressed data is actually only 2.1 GB, but the test program itself occupies about 750 MB), so in all three comparisons below TerarkDB is tested under 3 GB of memory. RocksDB and WiredTiger are tested under 8 GB, 4 GB, and 3 GB of memory respectively. All tests use 8 threads.

2.4.1. Data slightly larger than memory
    • 8 GB physical memory (3 GB for TerarkDB)
    • The other databases have a 512 MB dedicated cache

Database     Average   Median   99th Percentile
RocksDB      40.86     24       300
WiredTiger   58.82     41       450
TerarkDB     6.66      6        25

(Latencies in microseconds.)

    • The horizontal axis shows latency in microseconds, on an approximately logarithmic scale
      • Looking closely at the numbers on the horizontal axis reveals TerarkDB's much lower latency
    • The vertical axis shows the cumulative percentage of all queries falling within each latency interval
    • A point (X, Y%) means that Y% of all queries have a latency below X microseconds
    • The faster a curve reaches 100%, the better the query latency performance (lower latency)
    • In this case memory is sufficient for all databases, so the curves are smooth
    • TerarkDB's latency mean, median, standard deviation, and 99th percentile all show clear advantages; its latency is stable.
2.4.2. Data is much larger than memory
    • 3 GB physical memory
    • The other databases have a 256 MB dedicated cache

Database     Average    Median   99th Percentile
RocksDB      1338.88    1210     5000
WiredTiger   273.06     353      600
TerarkDB     6.67       6        25

(Latencies in microseconds.)
    • The other databases show a two-segment curve; the break between the segments corresponds to the boundary between reads that hit the cache and reads that miss it
    • TerarkDB's latency is much lower; its mean, median, standard deviation, and 99th percentile all show clear advantages, and its latency is stable
    • In this case, although total memory is only 3 GB, our compression ratio is high enough that the compressed data fits entirely in memory, so there are no cache misses
2.4.3. We also tested RocksDB and WiredTiger under 4 GB of memory:

Database     Average   Median   99th Percentile
RocksDB      964.21    970.36   2500
WiredTiger   204.85    56.25    600
TerarkDB     6.67      6        25

(Latencies in microseconds.)

    • We can see that with 4 GB of memory, RocksDB and WiredTiger serve a higher proportion of operations from the cache (the middle horizontal segment of the curve)
Technical analysis

TerarkDB uses very advanced and complex technology and has applied for four patents. Its core technology is fundamentally different from the techniques used in other database products, such as B+ trees, LSM trees, and block compression. The benefit is that both compression ratio and performance are greatly improved; it is not a simple time-space trade-off. This article briefly introduces a few technical points; for more detail, please see the documentation at terark.com.

Not "space-for-time" or "time-to-space" existing technology

Mainstream databases today also use compression, but it is mainly a trade-off between time and space: the compression method is general-purpose block/page compression (the block size is usually 4 KB~32 KB; TokuDB, known for its compression ratio, uses 2 MB~4 MB blocks).

When compression is enabled, access speed drops. This is because:

    • When writing, many records are packed together and compressed into blocks. Increasing the block size gives the compression algorithm a larger context and improves the compression ratio; conversely, reducing the block size lowers the compression ratio.

    • When reading, even retrieving a very short piece of data requires decompressing the entire block first. Thus, the larger the block, the more records it contains, the more unnecessary decompression is performed to read a single record, and the worse the performance; conversely, the smaller the block, the better the read performance.
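The two effects above can be demonstrated with a small sketch. It uses Python's `zlib` as a stand-in for Snappy (an assumption for the sake of a self-contained example): records packed into one block compress better than records compressed individually, but reading any single record forces decompressing the whole block.

```python
import zlib

# 100 repetitive ~130-byte records, packed into one block.
records = [("row%05d:some repetitive payload " % i).encode() * 4
           for i in range(100)]

offsets, block = [], b""
for r in records:
    offsets.append((len(block), len(r)))   # (offset, length) per record
    block += r

compressed_block = zlib.compress(block)

def read_record(row_id: int) -> bytes:
    # Read amplification: the entire block is decompressed for one record.
    whole = zlib.decompress(compressed_block)
    off, length = offsets[row_id]
    return whole[off:off + length]

assert read_record(7) == records[7]

# Larger compression context beats per-record compression on total size:
per_record_total = sum(len(zlib.compress(r)) for r in records)
assert len(compressed_block) < per_record_total
```

This is exactly the tension the article describes: block size tunes the trade-off between compression ratio and per-record read cost, and no block size eliminates it.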

Once compression is enabled, to mitigate these problems, traditional databases generally need a relatively large dedicated cache for decompressed data. This can greatly improve access performance for hot data, but it creates a double-caching space problem: compressed data sits in the operating system cache while decompressed data sits in the dedicated cache. There is also a more serious problem: the dedicated cache is, after all, a cache, and when it misses, the entire block must still be decompressed; this is one source of slow queries. Another source of slow queries is when the operating system cache misses.

The B-tree indexes of traditional databases also occupy considerable space, because the prefix compression typically used for B-trees yields a very low compression ratio.

All of this means that, for existing traditional databases, access speed and space usage are a tension that cannot be fully resolved, only compromised on.

Terark's technology is fundamentally different from the existing database

For data compression (which can be seen as compressing the value in key-value), TerarkDB mainly uses its self-developed, database-specific global compression technology. The compression ratio is higher, there is no concept of block compression, and there is no double-caching problem. This compression technology can read a single record directly by RowID/RecordID. If reading a single record is regarded as a decompression, then sequential decompression by RowID typically runs at about 500 MB/s per thread (up to about 7 GB/s), and random decompression by RowID at about 300 MB/s per thread (up to about 3 GB/s).
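TerarkDB's global compression algorithm is proprietary, so the sketch below illustrates only the access pattern it enables: fetching one record by RowID without decompressing a surrounding block. Here each record is compressed individually and located through an offset table (an assumption made for simplicity; the real engine achieves far better ratios by compressing globally while remaining record-addressable).

```python
import zlib

class RecordStore:
    """Record-addressable compressed storage: get(row_id) decompresses
    only the requested record, never a whole block."""

    def __init__(self, records):
        self._blob = b""
        self._index = []          # (offset, length) of each compressed record
        for r in records:
            c = zlib.compress(r)
            self._index.append((len(self._blob), len(c)))
            self._blob += c

    def get(self, row_id: int) -> bytes:
        off, length = self._index[row_id]
        return zlib.decompress(self._blob[off:off + length])

store = RecordStore([b"alpha", b"beta", b"gamma"])
assert store.get(2) == b"gamma"
```

The key property to notice is that `get` touches a bounded slice of the compressed blob, so per-record read cost does not grow with block size; that is the behavior the article attributes to TerarkDB's global compression.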

For index compression, Terark mainly uses succinct technology. The compression ratio is higher than existing techniques, and searches can be executed efficiently without decompression; the index also supports regular-expression search (without iterating over candidate keys to match the regex). This succinct-based index additionally supports reverse lookup: forward lookup obtains a RowID from a key, and reverse lookup obtains the key from a RowID, so keys do not need to be stored a second time (traditional B-tree indexes lack this ability). This provides the technical foundation for TerarkDB's support of multiple indexes on the same table.
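The forward/reverse lookup property can be sketched in miniature: when keys are stored exactly once in sorted order, key → RowID is a search and RowID → key is direct indexing, so no second copy of the key is needed. TerarkDB realizes this with a succinct trie; a plain sorted list is used here purely to illustrate the bidirectional mapping.

```python
import bisect

# Keys stored once, in sorted order; the position of a key IS its RowID.
keys = sorted([b"apple", b"banana", b"cherry"])

def key_to_rowid(key: bytes) -> int:
    """Forward lookup: key -> RowID via binary search."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return i
    raise KeyError(key)

def rowid_to_key(row_id: int) -> bytes:
    """Reverse lookup: RowID -> key by direct indexing, no extra copy."""
    return keys[row_id]

assert key_to_rowid(b"banana") == 1
assert rowid_to_key(1) == b"banana"
```

With several such indexes over the same table, each index only needs its own key → RowID structure; the row data itself is stored once and reached through the RowID.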

Succinct technology has existed for a long time, but performance issues have kept it from being widely used. Terark's succinct implementation is specifically optimized at the CPU instruction level, significantly improving its performance.

It is the use of these new technologies that greatly improves TerarkDB's compression ratio and access speed while keeping its functionality very rich.

TERARKDB Database Schema

A TerarkDB database contains multiple segments, which, according to their state, can be writing segments, writable frozen segments, or read-only segments. Data is first written to a writing segment, where it can be updated and retrieved directly. When a writing segment grows to a certain size, it becomes a writable frozen segment and a background thread begins compressing it. When background compression completes, a read-only segment is produced and the writable frozen segment is deleted. Physical deletion of data, segment merging, and similar work are also performed in background threads. Eventually most of the data resides in read-only segments, yielding very high compression ratios and access performance.
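The segment lifecycle described above (writing → writable frozen → read-only) can be sketched as a toy state machine. State names follow the article; the size threshold and the no-op "compression" step are placeholders for the real engine's background job, not TerarkDB's actual implementation.

```python
WRITING, FROZEN, READONLY = "writing", "writable frozen", "readonly"

class Segment:
    def __init__(self, size_limit: int):
        self.state = WRITING
        self.rows = []
        self.size_limit = size_limit

    def write(self, row: bytes) -> None:
        assert self.state == WRITING, "only writing segments accept inserts"
        self.rows.append(row)
        if len(self.rows) >= self.size_limit:
            # Segment is full: hand it off to the background compressor.
            self.state = FROZEN

    def background_compress(self) -> None:
        assert self.state == FROZEN
        # The real engine compresses the rows here, then atomically swaps
        # in the compressed read-only segment and deletes the frozen one.
        self.state = READONLY

seg = Segment(size_limit=2)
seg.write(b"r1")
assert seg.state == WRITING
seg.write(b"r2")
assert seg.state == FROZEN
seg.background_compress()
assert seg.state == READONLY
```

Pushing compression into this background transition is what lets the write path stay cheap while most data still ends up in the heavily compressed read-only form.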

Automata Technology and Succinct Technology

Alongside Terark, UC Berkeley's famous AMPLab (the lab where Spark was born) is also working on engineering succinct technology. Terark has its own advantages in algorithms, data structures, and engineering techniques.

Automata technology is used extensively in TerarkDB. An automaton is a state-transition graph used to represent data: by following the graph's edges and visiting nodes according to certain rules, the required data can be extracted. Storing such a graph with traditional techniques consumes a lot of memory, so Terark uses succinct technology to compress the state-transition graph. The essence of succinct technology is to represent data structures with bitmaps, greatly reducing memory usage while maintaining fast access performance. And because the representation is automaton-based, regular-expression retrieval can be supported natively.
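The bitmap operations underlying succinct structures are rank and select: rank1(i) counts the 1-bits before position i, and select1(k) finds the position of the k-th 1-bit. The naive versions below show only the semantics; real succinct implementations answer both in O(1) using small auxiliary tables, which is what makes a bitmap-encoded state-transition graph navigable without decompression.

```python
bits = [1, 0, 1, 1, 0, 0, 1]

def rank1(i: int) -> int:
    """Number of 1-bits in positions [0, i)."""
    return sum(bits[:i])

def select1(k: int) -> int:
    """Position of the k-th 1-bit (0-based)."""
    count = -1
    for pos, b in enumerate(bits):
        count += b
        if b and count == k:
            return pos
    raise ValueError("fewer than k+1 set bits")

assert rank1(4) == 3        # three 1-bits before position 4
assert select1(2) == 3      # the third 1-bit is at position 3
```

In a succinct tree or automaton encoding, rank/select answers such as "which child is this?" and "where does node n's edge list start?" replace the pointers a traditional representation would store, which is where the memory savings come from.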

Conclusion

You are welcome to download and use Terark's products. In the future, Terark plans to port the core engine to more distributed systems and more scenarios, such as Elasticsearch, Spark, mobile phones, and embedded devices. At this stage, Terark's plan is to find more R&D and business partners to bring the product to market as soon as possible. We are currently hiring; interested readers can contact us directly, or visit the official website for more information.
