Overview
RocksDB is a KV storage engine, so MyRocks records are ultimately stored in RocksDB as key-value pairs. A MySQL table typically consists of several indexes. In the InnoDB storage engine each index corresponds to a B-tree, whereas in the RocksDB storage engine each index corresponds to a contiguous range of keys in RocksDB.
Specifically, this range is all the data between index_id and index_id+1. If all the indexes of a table are in the same column family, the index data of the table is essentially physically contiguous.
For a picture of this layout, see the illustrations in the previous article.
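The range described above can be sketched in a few lines of Python. This is only an illustration: it assumes the 4-byte index ID is written big-endian at the front of every key, and `index_key_range` is a hypothetical helper name, not a MyRocks function.

```python
import struct

def index_key_range(index_id):
    """Return the [lower, upper) key bounds that cover one index."""
    lower = struct.pack(">I", index_id)      # assumed: 4-byte big-endian prefix
    upper = struct.pack(">I", index_id + 1)  # first possible key of the next index
    return lower, upper

lower, upper = index_key_range(260)
key = lower + b"\x80\x00\x00\x01"            # some record key inside index 260
# Python compares bytes lexicographically, the same way memcmp does.
print(lower.hex(), upper.hex(), lower <= key < upper)  # 00000104 00000105 True
```

Every key that starts with the prefix of index 260 sorts between the two bounds, which is why a scan over one index touches a single contiguous key range.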
MyRocks record format
MyRocks stores all of a table's data in RocksDB in the form of indexes.
The format of a MyRocks record differs depending on the type of index. The following table is used as an example to describe each index type:
CREATE TABLE t1 (
  a INT,
  b VARCHAR ( -),
  c CHAR(5),
  d INT,
  pk INT AUTO_INCREMENT,
  PRIMARY KEY (pk) COMMENT 'cf_1',
  UNIQUE KEY idx2 (b) COMMENT 'cf_2'
) ENGINE=ROCKSDB;

INSERT INTO t1 (pk, a, b, c) VALUES (1, 1, 'bbbbbbbbbb', 'C');
Primary key
A primary key index record has the following KV structure:
key: index_id, M(pk)
value: unpack_info, NULL-bitmap, a, b, c, d
The key consists of the index ID and the primary key. index_id is the unique identifier of the index and occupies 4 bytes. M(pk) denotes the primary key after conversion; the converted data can be compared directly with memcmp.
RocksDB sorts data by key. To make comparison easy, values of different types go through a conversion, after which they can be compared directly with memcmp.
The memcmp conversion is explained in the memcomparable format section.
The value stores unpack_info and the non-primary-key field data. The NULL-bitmap identifies which fields are NULL.
unpack_info stores the information needed to convert M(pk) back to pk. It is NULL if no additional conversion information is required. In this example pk is of type INT, so no additional information is needed and unpack_info is NULL.
Secondary index idx2
A secondary index record has the following KV structure:
key: index_id, NULL-byte, M(b), M(pk)
value: unpack_info
The key consists of index_id, the secondary index key, and the primary key. The NULL-byte indicates whether b is NULL. The primary key pk cannot be NULL, so it does not need a NULL-byte.
The value contains only unpack_info, which holds the information needed to reverse M(b) and M(pk). It is NULL if no additional conversion information is required. In this example b is a VARCHAR, which does require additional information, so unpack_info is NOT NULL.
A unique index is stored in the same way as an ordinary secondary index.
In a composite index, each field that can be NULL gets its own NULL-byte indicating whether that field is NULL.
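The key layout of a secondary or composite index can be sketched as follows. This is a simplified illustration, not the MyRocks implementation: `pack_index_key` is a hypothetical helper, the field encodings are passed in already converted, and the choice of 0 for NULL and 1 for NOT NULL (with the field body omitted when NULL) is an assumption about the NULL-byte values.

```python
import struct

def pack_index_key(index_id, nullable_parts, pk_bytes):
    """Assemble index_id + (NULL-byte + M(field))... + M(pk).

    nullable_parts: memcomparable encodings of the indexed fields,
    with None standing for SQL NULL.
    """
    out = bytearray(struct.pack(">I", index_id))  # assumed big-endian index ID
    for part in nullable_parts:
        if part is None:
            out += b"\x00"          # assumed: 0 marks NULL, no field body follows
        else:
            out += b"\x01" + part   # assumed: 1 marks NOT NULL, then M(field)
    out += pk_bytes                 # pk is NOT NULL, so it has no NULL-byte
    return bytes(out)

pk1 = b"\x80\x00\x00\x01"
null_key = pack_index_key(261, [None], pk1)
abc_key = pack_index_key(261, [b"abc"], pk1)
print(null_key.hex(), abc_key.hex(), null_key < abc_key)
```

Under these assumptions a NULL value sorts before every non-NULL value of the same field, because its NULL-byte is smaller.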
memcomparable format
For convenience, key fields are converted into a form that can be compared directly with memcmp. For this reason MyRocks generally recommends case-sensitive collations (latin1_bin, utf8_bin, binary).
These collations avoid the cost of conversion.
Integer conversion is simple, but signed types require special handling: if they were stored directly, negative numbers would compare as greater than positive numbers.
Signed types are handled by flipping the sign bit, so that positive numbers compare greater than negative numbers.
The key code snippet is as follows:
Field_long::make_sort_key:
  if (!table->s->db_low_byte_first) {
    if (unsigned_flag)
      to[0] = ptr[0];
    else
      to[0] = (char) (ptr[0] ^ 128);  /* Reverse sign bit */
    to[1] = ptr[1];
    to[2] = ptr[2];
    to[3] = ptr[3];
  }
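The effect of flipping the sign bit can be checked with a small Python sketch. It assumes a 4-byte signed integer stored most significant byte first; `int32_to_memcmp` and `memcmp_to_int32` are illustrative names, not MySQL functions.

```python
import struct

def int32_to_memcmp(value):
    """Encode a signed 32-bit integer so that byte order equals numeric order."""
    raw = bytearray(struct.pack(">i", value))  # big-endian two's complement
    raw[0] ^= 0x80                             # flip the sign bit
    return bytes(raw)

def memcmp_to_int32(key):
    """Reverse the conversion: flip the sign bit back and decode."""
    raw = bytearray(key)
    raw[0] ^= 0x80
    return struct.unpack(">i", bytes(raw))[0]

values = [5, -1, 0, -2147483648, 2147483647, -300]
# Sorting by the encoded bytes gives the same order as sorting the numbers.
print(sorted(values, key=int32_to_memcmp) == sorted(values))  # True
print(int32_to_memcmp(-1).hex(), int32_to_memcmp(1).hex())    # 7fffffff 80000001
```

Without the flip, -1 would be stored as ffffffff and sort after every positive number.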
CHAR values are simply padded with spaces.
VARCHAR values are handled in a much more complicated way in order to save space.
The comments in the source code serve as an example:
const int VARCHAR_CMP_LESS_THAN_SPACES = 1;
const int VARCHAR_CMP_EQUAL_TO_SPACES = 2;
const int VARCHAR_CMP_GREATER_THAN_SPACES = 3;

Example: if fpi->m_segment_size=5, and the collation is latin1_bin:

 'abcd\0'   => [ 'abcd' <VARCHAR_CMP_LESS> ][ '\0   ' <VARCHAR_CMP_EQUAL> ]
 'abcd'     => [ 'abcd' <VARCHAR_CMP_EQUAL> ]
 'abcd   '  => [ 'abcd' <VARCHAR_CMP_EQUAL> ]
 'abcdZZZZ' => [ 'abcd' <VARCHAR_CMP_GREATER> ][ 'ZZZZ' <VARCHAR_CMP_EQUAL> ]
The string is stored in segments of m_segment_size bytes. The first m_segment_size-1 bytes of each segment hold the content, and the last byte records how the rest of the string compares with spaces. VARCHAR_CMP_EQUAL also marks the end of the string.
In the example m_segment_size is 5; in the actual implementation the value is 9.
The unpack_info is more complex here. It differs depending on the collation of the string, because it has to store the mapping needed to reverse the collation transformation.
For details, see the function rdb_init_collation_mapping.
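The segment scheme can be reproduced with a short Python sketch. It is a simplification that assumes a single-byte binary collation such as latin1_bin and strips trailing spaces before encoding; `pack_varchar` and `compare_with_spaces` are hypothetical names and only approximate what the MyRocks source does.

```python
LESS, EQUAL, GREATER = 1, 2, 3   # the VARCHAR_CMP_* marker values
SPACE = 0x20

def compare_with_spaces(suffix):
    """Compare the rest of the string with an endless run of spaces."""
    for byte in suffix:
        if byte < SPACE:
            return LESS
        if byte > SPACE:
            return GREATER
    return EQUAL

def pack_varchar(value, segment_size=9):
    """Encode bytes into fixed-size segments, each ending in a marker byte."""
    data = value.rstrip(b" ")            # trailing spaces are not significant
    payload = segment_size - 1
    out = bytearray()
    pos = 0
    while True:
        chunk = data[pos:pos + payload]
        pos += len(chunk)
        out += chunk.ljust(payload, b" ")        # pad a short segment with spaces
        if len(chunk) < payload:
            marker = EQUAL                       # padded, so this is the last segment
        else:
            marker = compare_with_spaces(data[pos:])
        out.append(marker)
        if marker == EQUAL:                      # EQUAL also ends the string
            break
    return bytes(out)

for s in (b"abcd\x00", b"abcd", b"abcd   ", b"abcdZZZZ"):
    print(s, pack_varchar(s, segment_size=5))
```

With segment_size=5 the four inputs produce the segments listed in the source comment, and the encoded forms sort in the same order as the original strings under a space-padding comparison.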
RocksDB internal record format
What we have seen so far is the KV structure of a record before it enters RocksDB. When the data is actually stored in RocksDB, the key is wrapped further.
The key before it enters RocksDB is called the user key; inside RocksDB it is called the internal key.
InternalKey = | user key (string) | sequence number (7 bytes) | value type (1 byte) |
The sequence number is the sequence of the record; it increases in the order in which records enter RocksDB.
The sequence number is the key to how RocksDB implements transaction processing, which will be discussed in a later article.
The value type is the type of the record: put, merge, delete, and so on.
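The wrapping of a user key into an internal key can be sketched as below. The sketch assumes, as I understand RocksDB's format, that the sequence number and value type are packed into one 64-bit integer (sequence in the upper 56 bits, type in the low 8 bits) and appended little-endian. The numeric type values and the function names are assumptions made for illustration.

```python
import struct

# Assumed RocksDB value type codes; check them against the source before relying on them.
TYPE_DELETION, TYPE_VALUE, TYPE_MERGE = 0x0, 0x1, 0x2

def pack_internal_key(user_key, sequence, value_type):
    """Append the 8-byte (sequence, type) trailer to a user key."""
    if not 0 <= sequence < (1 << 56):
        raise ValueError("sequence number must fit in 7 bytes")
    trailer = (sequence << 8) | value_type
    return user_key + struct.pack("<Q", trailer)   # assumed little-endian trailer

def unpack_internal_key(internal_key):
    """Split an internal key back into (user_key, sequence, value_type)."""
    user_key = internal_key[:-8]
    (trailer,) = struct.unpack("<Q", internal_key[-8:])
    return user_key, trailer >> 8, trailer & 0xFF

ikey = pack_internal_key(b"\x00\x00\x01\x04\x80\x00\x00\x01", 42, TYPE_VALUE)
print(ikey.hex())
print(unpack_internal_key(ikey))
```

Under the little-endian assumption the type byte comes first in the stored trailer, even though the diagram lists the sequence number first.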
Example
An example makes this more concrete. Using the table defined above, insert a record and look at its actual structure.
INSERT INTO t1 (pk, a, b, c) VALUES (1, 1, 'bbbbbbbbbb', 'C');
Looking up the index IDs shows that the primary key has index_id 260 and the secondary index has index_id 261:
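select * from information_schema.ROCKSDB_DDL where table_name='t1';
+--------------+------------+----------------+------------+---------------+--------------+------------+-------------------+------+
| TABLE_SCHEMA | TABLE_NAME | PARTITION_NAME | INDEX_NAME | COLUMN_FAMILY | INDEX_NUMBER | INDEX_TYPE | KV_FORMAT_VERSION | CF   |
+--------------+------------+----------------+------------+---------------+--------------+------------+-------------------+------+
| test         | t1         | NULL           | PRIMARY    | 2             | 260          | 1          | 11                | cf_1 |
| test         | t1         | NULL           | idx2       | 3             | 261          | 2          | 11                | cf_2 |
+--------------+------------+----------------+------------+---------------+--------------+------------+-------------------+------+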
In this example, unpack_info holds the space information for field b and the collation transformation mapping. It is fairly complex and is not expanded on in detail here; interested readers can look at the function rdb_init_collation_mapping.
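For the inserted row, the user key of the primary key record can be worked out by hand. The sketch below combines the pieces described earlier and assumes a big-endian 4-byte index ID followed by the sign-flipped big-endian INT. `primary_key_bytes` is a hypothetical helper, and the result is a calculation under those assumptions, not a dump from a real database.

```python
import struct

def primary_key_bytes(index_id, pk):
    """Build index_id + M(pk) for a signed 32-bit primary key."""
    index_prefix = struct.pack(">I", index_id)  # assumed big-endian index ID
    pk_raw = bytearray(struct.pack(">i", pk))   # big-endian two's complement
    pk_raw[0] ^= 0x80                           # flip the sign bit
    return index_prefix + bytes(pk_raw)

# Row (pk=1) in the primary key index whose index_id is 260.
print(primary_key_bytes(260, 1).hex())  # 0000010480000001
```

The first four bytes (00000104) are index 260 and the last four (80000001) are M(pk) for pk=1.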
MyRocks record format analysis