MyRocks Record Format Analysis

Source: Internet
Author: User

Overview

RocksDB is a key-value (KV) storage engine, so MyRocks records are ultimately stored in RocksDB as KV pairs. A MySQL table typically consists of several indexes. In the InnoDB storage engine each index corresponds to a B-tree, whereas in the RocksDB storage engine each index corresponds to a contiguous key range in RocksDB.
Specifically, that range is all the data between index_id and index_id+1. If all the indexes of a table are in one column family, the table's index data is also largely contiguous on disk.
You can refer to the illustrations in the previous article.
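The key-range idea can be sketched in a few lines of Python. This is only an illustration, not MyRocks code: the function names are hypothetical, and it assumes the index ID is the 4-byte prefix of every key and is stored big-endian so that byte order matches numeric order.

```python
import struct


def index_key_range(index_id):
    """Return the [start, end) byte bounds that cover every key of one index.

    Assumption: keys begin with the index ID as 4 big-endian bytes.
    """
    start = struct.pack(">I", index_id)
    end = struct.pack(">I", index_id + 1)
    return start, end


def belongs_to_index(key, index_id):
    """True if the raw key falls inside the range owned by index_id."""
    start, end = index_key_range(index_id)
    return start <= key < end


if __name__ == "__main__":
    start, end = index_key_range(260)
    print(start.hex(), end.hex())
```

Scanning a whole index then amounts to iterating from `start` up to, but not including, `end`.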

MyRocks record format

MyRocks stores all of a table's indexes in RocksDB, organized by index.
The format of a MyRocks record differs depending on the type of index. The following table is used to illustrate the format for each index type.

    CREATE TABLE t1 (
      a INT,
      b VARCHAR ( -),
      c CHAR(5),
      d INT,
      pk INT AUTO_INCREMENT,
      PRIMARY KEY (pk) COMMENT 'cf_1',
      UNIQUE KEY idx2 (b) COMMENT 'cf_2'
    ) ENGINE=RocksDB;

    INSERT INTO t1 (pk, a, b, c) VALUES (1, 1, 'bbbbbbbbbb', 'C');

    • Primary key
      A primary key index record has the following KV structure:

      key:   index_id, M(pk)
      value: unpack_info, NULL-bitmap, b, c, d

      The key consists of the index ID and the primary key. index_id is the unique identifier of the index and occupies 4 bytes. M(pk) is the primary key after conversion; the converted data can be compared directly with memcmp.

      RocksDB sorts data by key. To make comparison easy, values of different types are converted so that the result can be compared directly with memcmp.
      The memcmp conversions are explained in the next section.

      The value stores unpack_info and the non-primary-key fields. The NULL-bitmap identifies which fields are NULL.
      unpack_info stores the information needed to convert M(pk) back to pk. It is empty if no additional conversion information is required. In this example pk is of type INT, which needs no additional information, so unpack_info is empty.

    • Secondary index idx2
      A secondary index record has the following KV structure:

      key:   index_id, NULL-byte, M(b), M(pk)
      value: unpack_info

      The key consists of the index_id, the secondary index key, and the primary key. The NULL-byte indicates whether b is NULL. The primary key cannot be NULL, so it needs no NULL-byte.
      The value contains only unpack_info, which holds the information needed to convert M(b) and M(pk) back. It is empty if no additional conversion information is required. In this example b is a VARCHAR, which does need additional information, so unpack_info is not empty.

      A unique index is stored in the same way as an ordinary secondary index.
      In a composite index, each nullable field is prefixed with its own NULL-byte indicating whether that field is NULL.
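The two key layouts above can be sketched as follows. This is a hedged illustration rather than the MyRocks implementation: the function names are hypothetical, M(b) and M(pk) are passed in as already-converted bytes, the index ID is assumed to be 4 big-endian bytes, and the NULL-byte is assumed to be 0 for NULL and 1 for non-NULL so that NULL sorts first.

```python
import struct


def pack_index_id(index_id):
    # Assumption: 4-byte big-endian index ID prefix.
    return struct.pack(">I", index_id)


def primary_key(index_id, m_pk):
    """key = index_id, M(pk)"""
    return pack_index_id(index_id) + m_pk


def secondary_key(index_id, m_b, m_pk):
    """key = index_id, NULL-byte, M(b), M(pk)

    m_b is the memcomparable form of column b, or None when b is NULL.
    Assumption: a NULL column contributes only its NULL-byte (0),
    a non-NULL column contributes 1 followed by its converted bytes.
    """
    if m_b is None:
        b_part = b"\x00"
    else:
        b_part = b"\x01" + m_b
    return pack_index_id(index_id) + b_part + m_pk


if __name__ == "__main__":
    print(primary_key(1, b"\xaa").hex())
    print(secondary_key(2, b"xy", b"\xaa").hex())
```

Because the primary key is appended to every secondary key, two rows with the same value of b still produce distinct, ordered keys.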

Memcomparable format

For ease of comparison, MyRocks converts key fields into a form that can be compared directly with memcmp. For this reason MyRocks generally recommends case-sensitive collations (latin1_bin, utf8_bin, binary).
This avoids the cost of conversion.

    • Integer types

Integer conversion is simple, but signed types need special handling: if they were stored directly, negative numbers would compare greater than positive numbers under memcmp.
Signed types are handled by flipping the sign bit, so that positive numbers compare greater than negative numbers.
The key code snippet is as follows:

    Field_long::make_sort_key:

    if (!table->s->db_low_byte_first) {
      if (unsigned_flag)
        to[0] = ptr[0];
      else
        to[0] = (char) (ptr[0] ^ 128);  /* reverse sign bit */
      to[1] = ptr[1];
      to[2] = ptr[2];
      to[3] = ptr[3];
    }
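The effect of flipping the sign bit can be checked with a small sketch. It is an illustration under stated assumptions, not the MySQL code path: `encode_int32` is a hypothetical helper that writes a signed 32-bit value big-endian and then inverts the top bit of the first byte.

```python
import struct


def encode_int32(value):
    """Encode a signed 32-bit integer so that bytewise order equals numeric order."""
    raw = bytearray(struct.pack(">i", value))  # big-endian two's complement
    raw[0] ^= 0x80                             # flip the sign bit
    return bytes(raw)


if __name__ == "__main__":
    values = [7, -1, 0, -2**31, 1, 2**31 - 1, -5]
    # Sorting by encoded bytes gives the same order as sorting numerically.
    print(sorted(values, key=encode_int32) == sorted(values))
    print(encode_int32(1).hex(), encode_int32(-1).hex())
```

Without the flip, -1 would be encoded as ff ff ff ff and sort after every positive number.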
    • Character types

CHAR values are padded with spaces to their full length.

VARCHAR values are more complicated, because the encoding tries to save space.
The comments in the source code give an example:

    const int VARCHAR_CMP_LESS_THAN_SPACES = 1;
    const int VARCHAR_CMP_EQUAL_TO_SPACES = 2;
    const int VARCHAR_CMP_GREATER_THAN_SPACES = 3;

    Example: if fpi->m_segment_size=5, and the collation is latin1_bin:

     'abcd\0'   => [ 'abcd' <VARCHAR_CMP_LESS> ][ '\0   ' <VARCHAR_CMP_EQUAL> ]
     'abcd'     => [ 'abcd' <VARCHAR_CMP_EQUAL> ]
     'abcd   '  => [ 'abcd' <VARCHAR_CMP_EQUAL> ]
     'abcdZZZZ' => [ 'abcd' <VARCHAR_CMP_GREATER> ][ 'ZZZZ' <VARCHAR_CMP_EQUAL> ]

The string is stored in segments of m_segment_size bytes. The first m_segment_size-1 bytes of each segment hold the content, and the last byte records how the rest of the string compares with spaces. VARCHAR_CMP_EQUAL also marks the end of the string.

In the example m_segment_size is 5; the value used in the actual implementation is 9.
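The segment scheme can be sketched in Python as below. This is a simplified model, not the MyRocks function: `encode_varchar` is a hypothetical name, it handles only a binary single-byte collation such as latin1_bin, and it assumes trailing spaces are trimmed before encoding (which is what makes 'abcd' and 'abcd   ' encode identically in the example).

```python
VARCHAR_CMP_LESS_THAN_SPACES = 1
VARCHAR_CMP_EQUAL_TO_SPACES = 2
VARCHAR_CMP_GREATER_THAN_SPACES = 3


def encode_varchar(data, segment_size=5):
    """Encode bytes into fixed-size segments, each ending with a marker byte."""
    chunk = segment_size - 1
    data = data.rstrip(b" ")  # assumption: trailing spaces are not significant
    out = bytearray()
    pos = 0
    while True:
        piece = data[pos:pos + chunk]
        pos += len(piece)
        out += piece.ljust(chunk, b" ")  # pad a short final segment with spaces
        rest = data[pos:]
        if not rest:
            # Nothing left: everything after this point equals spaces.
            out.append(VARCHAR_CMP_EQUAL_TO_SPACES)
            return bytes(out)
        # Compare the remaining suffix with a run of spaces of the same length.
        # It cannot be equal, because trailing spaces were trimmed above.
        if rest < b" " * len(rest):
            out.append(VARCHAR_CMP_LESS_THAN_SPACES)
        else:
            out.append(VARCHAR_CMP_GREATER_THAN_SPACES)


if __name__ == "__main__":
    for s in (b"abcd\0", b"abcd", b"abcd   ", b"abcdZZZZ"):
        print(s, encode_varchar(s))
```

With these marker values, 'abcd\0' sorts before 'abcd', which sorts before 'abcdZZZZ', matching a space-padded comparison of the original strings.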

unpack_info is more complex in this case. It differs from one string collation to another, because it has to store the mapping needed to reverse the collation transformation.
For details, see the function rdb_init_collation_mapping.

RocksDB internal record format

What we have seen so far is the KV structure of a record before it enters RocksDB. When the data is actually stored in RocksDB, the key is wrapped once more.
The key before it enters RocksDB is called the user key; inside RocksDB it is called the internal key.

InternalKey = | user key (string) | sequence number (7 bytes) | value type (1 byte) |

The sequence number orders the records: it increases in the order in which records enter RocksDB.
The sequence number is the key to how RocksDB implements transaction processing, which will be discussed in a later article.

The value type is the type of the record: put, merge, delete, and so on.
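A minimal sketch of wrapping a user key into an internal key follows. The helper names are hypothetical. It assumes the 8-byte trailer is packed the way RocksDB's key format code does it: `(sequence << 8) | type` written as a little-endian 64-bit integer, with type codes 0 for delete, 1 for put, and 2 for merge. Treat these details as assumptions to verify against the RocksDB source.

```python
import struct

# Assumed value type codes (delete, put, merge).
TYPE_DELETION = 0
TYPE_VALUE = 1
TYPE_MERGE = 2


def make_internal_key(user_key, sequence, value_type):
    """Append the 8-byte (sequence, type) trailer to a user key."""
    if not 0 <= sequence < (1 << 56):
        raise ValueError("sequence number must fit in 7 bytes")
    if not 0 <= value_type <= 0xFF:
        raise ValueError("value type must fit in 1 byte")
    return user_key + struct.pack("<Q", (sequence << 8) | value_type)


def split_internal_key(internal_key):
    """Return (user_key, sequence, value_type) from an internal key."""
    (packed,) = struct.unpack("<Q", internal_key[-8:])
    return internal_key[:-8], packed >> 8, packed & 0xFF


if __name__ == "__main__":
    ikey = make_internal_key(b"\x00\x00\x01\x04\x80\x00\x00\x01", 5, TYPE_VALUE)
    print(ikey.hex())
    print(split_internal_key(ikey))
```

Since the trailer has a fixed size, the user key can always be recovered by dropping the last 8 bytes.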

Example

An example makes this more concrete. Using the table described above, insert one record and look at its actual structure.

    INSERT INTO t1 (pk, a, b, c) VALUES (1, 1, 'bbbbbbbbbb', 'C  ');

Querying ROCKSDB_DDL shows that the index_id of the primary key is 260 and the index_id of the secondary index is 261.

    select * from information_schema.ROCKSDB_DDL where table_name="t1";

    TABLE_SCHEMA  TABLE_NAME  PARTITION_NAME  INDEX_NAME  COLUMN_FAMILY  INDEX_NUMBER  INDEX_TYPE  KV_FORMAT_VERSION  CF
    test          t1          NULL            PRIMARY     2              260           1           11                 cf_1
    test          t1          NULL            idx2        3              261           2           11                 cf_2

    • Primary key record

      • Key

      • Value

    • Secondary index record

      • Key

      • Value

The secondary index value holds the space information for field b and the collation mapping. This is fairly complex and is not expanded on here; interested readers can look at the function rdb_init_collation_mapping.
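For the primary key record in this example, the key bytes can be worked out from the rules described earlier. The sketch below is an illustration only: `primary_key_bytes` is a hypothetical helper, and it assumes a 4-byte big-endian index_id followed by the signed INT primary key stored big-endian with its sign bit flipped.

```python
import struct


def primary_key_bytes(index_id, pk):
    """Build index_id + M(pk) for a signed INT primary key."""
    prefix = struct.pack(">I", index_id)     # assumed 4-byte big-endian index ID
    body = bytearray(struct.pack(">i", pk))  # big-endian two's complement
    body[0] ^= 0x80                          # flip the sign bit for memcmp order
    return prefix + bytes(body)


if __name__ == "__main__":
    # index_id 260 and pk = 1, as in the example above
    print(primary_key_bytes(260, 1).hex())   # prints 0000010480000001
```

Here 00000104 is index_id 260 and 80000001 is the converted form of pk = 1.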
