RocksDB engine record format

Source: Internet
Author: User


RocksDB is a key-value (KV) engine that the Facebook team developed as an improvement on LevelDB. It stores data in an LSM-tree and is popular for its read/write performance and compression. RocksDB has also been integrated, as a pluggable storage engine, into the MySQL branch maintained by Facebook, so you can access RocksDB through SQL. This article analyzes the record format of the RocksDB engine and compares it with InnoDB.

As a KV engine, RocksDB writes data with put(key, value) and reads it back with get(key); to RocksDB itself, every record is a key-value pair. This article discusses how, when RocksDB is plugged into MySQL as a storage engine, that key-value structure stores a table's indexes and the data of each column.

Like InnoDB, the RocksDB engine uses index-organized tables. Both the table itself (the primary key index) and the secondary indexes are stored in the LSM-tree. A RocksDB record has three parts: key, value, and meta, as shown in the table below. A concrete record is then used to walk through the storage format.

RocksDB basic record storage format

Key_size | Key       | Value_size | Value        | Meta
         | PK/SecKey |            | Columns data | SequenceId, flag

create table row_format(
  id int not null,
  c1 int,
  c2 char(10) not null,
  c3 char(10),
  c4 varchar(10),
  c5 varchar(10) not null,
  c6 blob,
  c7 binary(10) not null,
  c8 varbinary(10)
) engine=rocksdb;

insert into row_format(id,c2,c4,c5,c7) values(1,'abc','abc','efg','111');

Key:

Index_id | Key (Rowid)
4 bytes  | 8 bytes

0x7fdfa4278ea0: 0x00 0x00 0x01 0x7b 0x00 0x00 0x00 0x00

0x7fdfa4278ea8: 0x00 0x00 0x00 0x05

Index_id: the index number, which is globally unique. It occupies 4 bytes.

Rowid: because the table has no primary key, the system generates a rowid of type bigint as the primary key. It occupies 8 bytes, whereas the rowid of the InnoDB engine occupies 6 bytes. Note that the rowid is stored big-endian (most significant byte first) so that keys can be compared with a plain byte comparison (memcmp).
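As a rough illustration of the key layout, the sketch below decodes the 12 bytes from the dump above. The function name decode_hidden_pk_key is made up for this example, and the layout (4-byte index_id followed by an 8-byte rowid, both big-endian) is simply what the dump and the description suggest.

```python
import struct

# The 12 key bytes from the dump above.
key = bytes([0x00, 0x00, 0x01, 0x7b,
             0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x05])

def decode_hidden_pk_key(raw):
    """Split a key into (index_id, rowid); both fields are read as big-endian."""
    index_id, rowid = struct.unpack(">IQ", raw)
    return index_id, rowid

index_id, rowid = decode_hidden_pk_key(key)
print(index_id, rowid)  # 379 5

# Big-endian encoding keeps byte order and numeric order aligned,
# so comparing the raw bytes sorts rowids correctly.
assert struct.pack(">Q", 5) < struct.pack(">Q", 256)
# Little-endian encoding would not: 5 would sort after 256.
assert not (struct.pack("<Q", 5) < struct.pack("<Q", 256))
```

The two assertions at the end show why big-endian storage matters for an engine that orders keys by raw bytes.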

Value section:

       | Null-flag | ID | C1   | C2           | C3   | C4          | C5          | C6   | C7           | C8
Length | 1B        | 4B | ---- | 30B          | ---- | 4B          | 4B          | ---- | 10B          | ----
Value  |           | 1  |      | abc + 0x20... |      | Len + value | Len + value |      | 111 + 0x00... |

0x7fdfa4251e50: 0x1b 0x01 0x00 0x00 0x00 0x61 0x62 0x63

0x7fdfa4251e58: 0x20 0x20 0x20 0x20 0x20 0x20 0x20 0x20

0x7fdfa4251e60: 0x20 0x20 0x20 0x20 0x20 0x20 0x20 0x20

0x7fdfa4251e68: 0x20 0x20 0x20 0x20 0x20 0x20 0x20 0x20

0x7fdfa4251e70: 0x20 0x20 0x20 0x03 0x61 0x62 0x63 0x03

0x7fdfa4251e78: 0x65 0x66 0x67 0x31 0x31 0x31 0x00 0x00

0x7fdfa4251e80: 0x00 0x00 0x00 0x00 0x00

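Note: the value is 53 bytes long. It starts with the 1-byte null flag (0x1b), followed by id (4 bytes), c2 (30 bytes, space-padded), c4 and c5 (a 1-byte length plus 3 bytes of data each), and c7 (10 bytes, zero-padded). The NULL columns c1, c3, c6, and c8 occupy no space.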
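The value dump can be decoded with a short sketch like the one below. Several things here are assumptions read off the dump rather than statements from the engine's source: that the null flag holds one bit per nullable column in column order starting from the lowest bit, that id is stored little-endian, and that varchar columns carry a 1-byte length. The COLUMNS table and decode_value are names invented for this example.

```python
# The 53 value bytes from the dump above.
value = bytes.fromhex(
    "1b" + "01000000"
    + "616263" + "20" * 27      # c2: 'abc' padded with spaces to 30 bytes
    + "03616263" + "03656667"   # c4 and c5: 1-byte length + data
    + "313131" + "00" * 7       # c7: '111' padded with 0x00 to 10 bytes
)

# (name, nullable, kind, fixed size in bytes); sizes follow the Length row above.
COLUMNS = [
    ("id", False, "int", 4),
    ("c1", True, "int", 4),
    ("c2", False, "char", 30),
    ("c3", True, "char", 30),
    ("c4", True, "var", None),
    ("c5", False, "var", None),
    ("c6", True, "blob", None),   # NULL in the sample row, so never decoded here
    ("c7", False, "binary", 10),
    ("c8", True, "var", None),
]

def decode_value(raw):
    null_flags = raw[0]
    pos, bit, row = 1, 0, {}
    for name, nullable, kind, size in COLUMNS:
        if nullable:
            # Assumption: one bit per nullable column, lowest bit first.
            is_null = bool((null_flags >> bit) & 1)
            bit += 1
            if is_null:
                row[name] = None      # NULL columns take no space in the value
                continue
        if kind == "blob":
            raise NotImplementedError("blob layout is not covered by this sketch")
        if kind == "var":
            size = raw[pos]           # length byte stored right before the data
            pos += 1
        field = raw[pos:pos + size]
        pos += size
        if kind == "int":
            row[name] = int.from_bytes(field, "little", signed=True)
        elif kind == "char":
            row[name] = field.decode("utf-8").rstrip(" ")
        elif kind == "var":
            row[name] = field.decode("utf-8")
        else:
            row[name] = field
    assert pos == len(raw)            # every byte of the value was consumed
    return row

print(decode_value(value))
```

Note that the decoder has to walk the columns in order: the position of c5 is only known after the length byte of c4 has been read.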

Meta:

The Meta part consists mainly of the SequenceID, which is generated when a transaction commits. RocksDB uses it to implement MVCC and to judge record visibility. Meta also contains a flag that marks the record type: put, delete, singleDelete, and so on. The SequenceID occupies 7 bytes and the flag occupies 1 byte.
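A minimal sketch of how a 7-byte sequence and a 1-byte flag can share 8 bytes is shown below. The packing (sequence shifted left by 8 bits, flag in the low byte, stored as a little-endian 64-bit integer) and the flag values are assumptions based on my understanding of RocksDB's internal key format; they are not taken from this article, and pack_meta and unpack_meta are illustrative names.

```python
import struct

# Assumed flag values (not given in the article).
TYPE_DELETION = 0x0
TYPE_VALUE = 0x1             # a normal put
TYPE_SINGLE_DELETION = 0x7

def pack_meta(sequence, flag):
    """Pack a 56-bit sequence number and an 8-bit flag into 8 bytes."""
    if not 0 <= sequence < (1 << 56):
        raise ValueError("sequence must fit in 7 bytes")
    if not 0 <= flag <= 0xff:
        raise ValueError("flag must fit in 1 byte")
    return struct.pack("<Q", (sequence << 8) | flag)

def unpack_meta(raw):
    """Return (sequence, flag) from the 8 meta bytes."""
    (packed,) = struct.unpack("<Q", raw)
    return packed >> 8, packed & 0xff

meta = pack_meta(5, TYPE_VALUE)
print(meta.hex())            # 0105000000000000
print(unpack_meta(meta))     # (5, 1)
```

With this packing, a reader at snapshot sequence S can treat any record whose sequence is greater than S as invisible, which is the visibility check the text describes.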

RocksDB index format

In RocksDB, all data is organized by indexes; as in InnoDB, tables are index-organized. Each index has a globally unique index_id. There are two types of index: the primary key index and secondary indexes. The record format described above is that of the primary key index, with key, value, and meta. A secondary index record also has key, value, and meta, but its value holds no column data, only optional checksum information.

Primary Key Index

Key          | Value                                           | Meta
Index_id, PK | NULL flag bit, Column data, Checksum (optional) | SeqId, flag

Secondary Index

Key                        | Value               | Meta
Index_id, SecondaryKey, PK | Checksum (optional) | SeqId, flag
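The two layouts can be sketched with a plain dictionary standing in for the KV store. This is a simplification under stated assumptions: the primary key and the secondary column are both unsigned integers encoded big-endian, the checksum and meta are left out, and SK_INDEX_ID, pk_key, sk_key, and lookup_by_secondary are hypothetical names for this example only.

```python
import struct

PK_INDEX_ID = 379   # index_id seen in the key dump earlier in this article
SK_INDEX_ID = 380   # hypothetical index_id for a secondary index

def pk_key(pk):
    """Primary index key: index_id + PK."""
    return struct.pack(">IQ", PK_INDEX_ID, pk)

def sk_key(sk_value, pk):
    """Secondary index key: index_id + secondary key + PK."""
    return struct.pack(">IIQ", SK_INDEX_ID, sk_value, pk)

def lookup_by_secondary(store, sk_value):
    """Find rows by secondary key: prefix scan, then fetch each row by PK."""
    prefix = struct.pack(">II", SK_INDEX_ID, sk_value)
    rows = []
    for key in sorted(store):                 # stands in for an ordered scan
        if key.startswith(prefix):
            (pk,) = struct.unpack(">Q", key[len(prefix):])
            rows.append(store[pk_key(pk)])
    return rows

store = {
    pk_key(5): b"columns of row 5",
    pk_key(6): b"columns of row 6",
    pk_key(7): b"columns of row 7",
    sk_key(42, 5): b"",   # secondary index values carry no column data
    sk_key(42, 6): b"",
    sk_key(99, 7): b"",
}

print(lookup_by_secondary(store, 42))
```

Because the PK is part of the secondary index key, the secondary index needs no column data in its value: the key alone says which primary record to fetch.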

Comparison with the InnoDB engine (innodb_file_format = Barracuda, row_format = compact)

InnoDB record format

Variable-length field length list | NULL flag bit | Record_header | Trxid | Roll_ptr | Column data

create table row_format(
  id int not null,
  c1 int,
  c2 char(10) not null,
  c3 char(10),
  c4 varchar(10),
  c5 varchar(10) not null,
  c6 blob,
  c7 binary(10) not null,
  c8 varbinary(10)
) engine=innodb;

insert into row_format(id,c2,c4,c5,c7) values(1,'1234','ab','efg','111');

Record Content:

2017c0b0 00 00 03 02 0a 1b 00 00 18 ff b5 00 00 00 00 28 |...............(|

2017c0c0 00 00 00 00 01 01 03 83 00 00 01 36 01 10 80 00 |...........6....|

2017c0d0 00 01 31 32 33 34 20 20 20 20 20 20 61 62 65 66 |..1234      abef|

2017c0e0 67 31 31 31 00 00 00 00 00 00 00 00 00 00       |g111..........|

Note:

1. 03 02 0a stores the length information. The lengths of all non-NULL variable-length columns are stored in reverse column order, here c5, c4, c2. InnoDB treats c2 char(10) as a variable-length field.

2. 1b stores the NULL information, which is consistent with how RocksDB handles NULL. 00 00 18 ff b5 is the record header.

3. 00 00 00 00 28 00, 00 00 00 01 01 03, and 83 00 00 01 36 01 10 are the rowid, trxid, and roll_ptr, which occupy 6 bytes, 6 bytes, and 7 bytes respectively.

4. The last part is the column data. NULL values occupy no storage space, as in RocksDB. The difference lies in the handling of the char type: InnoDB pads c2 char(10) to 10 bytes, stores it as 31 32 33 34 20 20 20 20 20 20, treats it like a varchar, and records its length. RocksDB pads it to 30 bytes (utf8 character set) and treats it as a fixed-length char with no length information.
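The notes above can be followed step by step with a small parser. It is only a sketch for this one sample row: the offsets are hard-coded for three variable-length columns, c2 is written with the full 10 bytes that its length byte (0x0a) announces, and the decoding of id (big-endian with the sign bit flipped) is my assumption about InnoDB's integer storage rather than something stated in this article. parse_record is a name invented for the example.

```python
import io

record = bytes.fromhex(
    "03020a"                  # lengths of c5, c4, c2 (reverse column order)
    "1b"                      # NULL flags
    "000018ffb5"              # 5-byte record header
    "000000002800"            # rowid, 6 bytes
    "000000010103"            # trxid, 6 bytes
    "83000001360110"          # roll_ptr, 7 bytes
    "80000001"                # id
    "31323334202020202020"    # c2: '1234' padded to 10 bytes
    "6162"                    # c4
    "656667"                  # c5
    "31313100000000000000"    # c7: '111' padded to 10 bytes
)

def parse_record(raw):
    # All lengths sit together at the front, so each column's size is known
    # before any column data is read.
    c5_len, c4_len, c2_len = raw[0], raw[1], raw[2]
    null_flags = raw[3]
    body = io.BytesIO(raw[9:])            # skip the 5-byte record header
    row = {
        "rowid": int.from_bytes(body.read(6), "big"),
        "trxid": int.from_bytes(body.read(6), "big"),
        "roll_ptr": body.read(7).hex(),
        # Assumption: signed INT stored big-endian with the sign bit flipped.
        "id": int.from_bytes(body.read(4), "big") - 0x80000000,
        "c2": body.read(c2_len).decode("utf-8").rstrip(" "),
        "c4": body.read(c4_len).decode("utf-8"),
        "c5": body.read(c5_len).decode("utf-8"),
        "c7": body.read(10),
    }
    assert body.read() == b""             # nothing left over
    return null_flags, row

null_flags, row = parse_record(record)
print(hex(null_flags), row)
```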

In general, the InnoDB record format contains a record_header (record header information) that occupies 5 bytes and includes the record number (heap_no), the number of columns, the location of the next record, and whether the record is deleted. RocksDB is simpler: it has only the overall value size, and the flag in the Meta indicates whether the record is a put or a delete.

InnoDB stores the lengths of the variable-length columns together, which makes it easy to locate any column. RocksDB places the length of each variable-length column immediately before that column, so to reach the last column you must walk the preceding columns one by one.

Because the two engines implement MVCC differently, the extra information they store also differs. MVCC in InnoDB depends on the rollback segment, so each record stores the trxid and roll_ptr fields, which occupy 6 bytes and 7 bytes respectively. The roll_ptr consists of (type, rsegid, pageNo, offset): type occupies 1 bit and indicates insert or update, rsegid (the rollback segment ID) occupies 7 bits, pageNo occupies 4 bytes, and the page offset occupies 2 bytes. MVCC in RocksDB depends on the SequenceID, which is used to determine record visibility and occupies 7 bytes.
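As a worked example of the roll_ptr layout just described, the sketch below splits the 7 bytes from the sample record (83 00 00 01 36 01 10) into their four fields. The function name decode_roll_ptr and the field names in the result are chosen for this illustration.

```python
ROLL_PTR = bytes.fromhex("83000001360110")   # from the InnoDB record dump

def decode_roll_ptr(raw):
    """Split a 7-byte roll_ptr into type bit, rsegid, pageNo, and offset."""
    value = int.from_bytes(raw, "big")       # 56 bits in total
    return {
        "is_insert": bool(value >> 55),          # 1 bit: insert or update
        "rseg_id": (value >> 48) & 0x7f,         # 7 bits: rollback segment ID
        "page_no": (value >> 16) & 0xffffffff,   # 4 bytes: undo page number
        "offset": value & 0xffff,                # 2 bytes: offset in the page
    }

print(decode_roll_ptr(ROLL_PTR))
# {'is_insert': True, 'rseg_id': 3, 'page_no': 310, 'offset': 272}
```

The type bit is set here, which fits a record that was just inserted and never updated.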

In detail, the RocksDB and InnoDB engines handle null, char, and varchar in similar ways, although InnoDB optimizes the char type by treating it as a varchar. The RocksDB engine also gives blob no special treatment.

You may wonder: RocksDB also has a block_size, so what happens if it is set to 16 KB and a blob exceeds 16 KB? In InnoDB, the table is organized by pages in a B-tree and every page has a fixed size, so a very large record has to spill into overflow pages that are chained by links. In RocksDB, block_size is only the unit of compression and is not a strict limit. File content is organized in blocks, but because a block may have been compressed, block sizes are not fixed and an offset is used to locate each block. When a large blob is encountered, the block simply becomes larger, and all the data is stored together without spanning blocks.
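The idea that block_size is a soft threshold rather than a hard page size can be illustrated with a toy block builder. This is a deliberately simplified sketch, not RocksDB's actual flush policy: it closes a block once the accumulated size reaches block_size, never splits a record, and ignores compression. build_blocks is a hypothetical helper.

```python
def build_blocks(records, block_size):
    """Group records into blocks and return a list of (offset, size, records)."""
    blocks, current, current_size, offset = [], [], 0, 0
    for rec in records:
        current.append(rec)              # a record always stays in one block
        current_size += len(rec)
        if current_size >= block_size:   # soft threshold: close the block
            blocks.append((offset, current_size, current))
            offset += current_size
            current, current_size = [], 0
    if current:
        blocks.append((offset, current_size, current))
    return blocks

# Scaled-down example: block_size is 16 bytes and one "blob" is 40 bytes.
records = [b"a" * 6, b"b" * 6, b"c" * 40, b"d" * 6]
for offset, size, recs in build_blocks(records, block_size=16):
    print(offset, size, len(recs))
# 0 52 3
# 52 6 1
```

The first block grows to 52 bytes, well past the 16-byte threshold, and the index only needs its offset and size to find it. No overflow chain is required.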

The index length limits also differ. For the InnoDB engine, a single column in an index cannot exceed 767 bytes; for the RocksDB engine, it cannot exceed 2048 bytes. For details, see the implementation of max_supported_key_part_length. The length of an entire index is limited to 3072 bytes for both RocksDB and InnoDB. This is actually a server-layer limit, because each engine's own limit is larger than the server's. For more information, see max_supported_key_length.
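To make the limits concrete, here is a small check that applies the numbers quoted above to a list of index column lengths. It is only a sketch: index_problems is a hypothetical helper, the lengths count character bytes only, and the real checks live in the server and engine code referenced above.

```python
# Limits quoted in the text above, in bytes.
MAX_KEY_PART_LENGTH = {"innodb": 767, "rocksdb": 2048}
MAX_KEY_LENGTH = 3072    # server-layer limit on the whole index

def index_problems(engine, part_lengths):
    """Return a list of limit violations for an index with the given columns."""
    problems = []
    limit = MAX_KEY_PART_LENGTH[engine]
    for i, length in enumerate(part_lengths):
        if length > limit:
            problems.append(f"column {i}: {length} > {limit} bytes")
    total = sum(part_lengths)
    if total > MAX_KEY_LENGTH:
        problems.append(f"index: {total} > {MAX_KEY_LENGTH} bytes")
    return problems

# A varchar(256) column in utf8 can need up to 256 * 3 = 768 bytes.
print(index_problems("innodb", [256 * 3]))     # ['column 0: 768 > 767 bytes']
print(index_problems("rocksdb", [256 * 3]))    # []
print(index_problems("rocksdb", [2048, 2048])) # ['index: 4096 > 3072 bytes']
```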


 
