MyRocks DDL Principles


During a DDL operation on a daily-build instance recently, the database crashed outright, which is a serious problem. After pulling the crash stack and error logs, I found that MyRocks has a serious bug when dropping a unique constraint, so I urgently filed a bug upstream. The problem is actually quite well hidden: a single DDL statement does not bring the database down; the crash only occurs after multiple operations on the same index under particular circumstances, so tracking it down took some time. The detailed troubleshooting and reproduction steps are not covered here; if you are interested, see the bug report: https://github.com/facebook/mysql-5.6/issues/602. Taking the troubleshooting as an opportunity, I sorted out the MyRocks DDL workflow. The following content covers three topics: first, the MyRocks data dictionary, because a DDL operation not only modifies the data itself but also has the important job of maintaining the data dictionary; second, the MyRocks DDL process, focusing on adding and dropping indexes; and finally, an analysis of the DDL exception-handling logic.

Data Dictionary
The data dictionary is where the engine stores its metadata, and it can be viewed from two perspectives. From the user's perspective, the data dictionary is exposed through the RocksDB-related tables in information_schema, including ROCKSDB_DDL and ROCKSDB_INDEX_FILE_MAP. From the perspective of the RocksDB internals, all metadata is stored as KV pairs in the system column family, and the information_schema tables are constructed from that metadata. In addition, when mysqld starts, an in-memory copy of the metadata is built for fast lookup. Below I list the RocksDB data dictionary types and the KV pairs of each type.
// Data dictionary types
enum DATA_DICT_TYPE {
  DDL_ENTRY_INDEX_START_NUMBER = 1,  // mapping between a table and its indexes
  INDEX_INFO = 2,                    // index attributes
  CF_DEFINITION = 3,                 // column family attributes
  BINLOG_INFO_INDEX_NUMBER = 4,      // binlog position information
  DDL_DROP_INDEX_ONGOING = 5,        // pending drop-index task
  INDEX_STATISTICS = 6,              // index statistics
  MAX_INDEX_ID = 7,                  // current maximum index_id
  DDL_CREATE_INDEX_ONGOING = 8,      // pending create-index task
  END_DICT_INDEX_ID = 255
};

1). DDL_ENTRY_INDEX_START_NUMBER
Mapping between tables and indexes
Key: Rdb_key_def::DDL_ENTRY_INDEX_START_NUMBER (0x1) + dbname.tablename
Value: version + {global_index_id} * n_indexes_of_the_table
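
As a concrete illustration, here is a minimal sketch of how such a key/value pair might be packed. The helper names, the struct, and the 2-byte version constant are my own illustrative choices; the real encoding lives in rdb_datadic.cc and uses fixed-width big-endian integers.

// Sketch: packing the table -> indexes dictionary entry (simplified).
#include <cstdint>
#include <string>
#include <vector>

static void write_u16_be(std::string *out, uint16_t v) {
  out->push_back(static_cast<char>(v >> 8));
  out->push_back(static_cast<char>(v));
}

static void write_u32_be(std::string *out, uint32_t v) {
  for (int shift = 24; shift >= 0; shift -= 8)
    out->push_back(static_cast<char>(v >> shift));
}

struct GlobalIndexId { uint32_t cf_id; uint32_t index_id; };

// Key: the dictionary type (0x1) followed by "dbname.tablename".
std::string make_ddl_entry_key(const std::string &db, const std::string &table) {
  std::string key;
  write_u32_be(&key, 1 /* DDL_ENTRY_INDEX_START_NUMBER */);
  key += db + "." + table;
  return key;
}

// Value: a version followed by one (cf_id, index_id) pair per index.
std::string make_ddl_entry_value(const std::vector<GlobalIndexId> &indexes) {
  std::string val;
  write_u16_be(&val, 1 /* version, illustrative */);
  for (const auto &idx : indexes) {
    write_u32_be(&val, idx.cf_id);
    write_u32_be(&val, idx.index_id);
  }
  return val;
}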

2). INDEX_INFO
Mapping from index id to index attributes
Key: Rdb_key_def::INDEX_INFO (0x2) + global_index_id
Value: version, index_type, key_value_format_version

index_type: primary key / secondary index / hidden primary key
key_value_format_version: the version of the record storage format

3). CF_DEFINITION
Column family attributes
Key: Rdb_key_def::CF_DEFINITION (0x3) + cf_id
Value: version, {is_reverse_cf, is_auto_cf}

is_reverse_cf: whether this is a reverse column family
is_auto_cf: whether the column family name was given as $per_index_cf, in which case the actual name is generated automatically from tablename.indexname

4). BINLOG_INFO_INDEX_NUMBER
Binlog position and GTID information, updated at binlog commit time.
Key: Rdb_key_def::BINLOG_INFO_INDEX_NUMBER (0x4)
Value: version, {binlog_name, binlog_pos, binlog_gtid}
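
The interesting property is that the binlog position rides in the same atomic RocksDB write as the transaction's data, so the two become durable together. The sketch below illustrates the idea; commit_with_binlog_info and the delimited value layout are my own assumptions, not the actual MyRocks code, which packs a version and fixed-width fields.

#include <cstdint>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

rocksdb::Status commit_with_binlog_info(rocksdb::DB *db,
                                        rocksdb::ColumnFamilyHandle *system_cf,
                                        rocksdb::WriteBatch *txn_batch,
                                        const std::string &binlog_name,
                                        uint64_t binlog_pos,
                                        const std::string &gtid) {
  // Key: the 4-byte big-endian dictionary type 0x4; there is a single entry.
  const std::string key("\x00\x00\x00\x04", 4);
  // Value layout simplified to a NUL-delimited string for illustration.
  const std::string value =
      binlog_name + '\0' + std::to_string(binlog_pos) + '\0' + gtid;
  // Riding in the transaction's own batch makes data + binlog point atomic.
  txn_batch->Put(system_cf, key, value);
  return db->Write(rocksdb::WriteOptions(), txn_batch);
}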

5). DDL_DROP_INDEX_ONGOING
Pending index-drop tasks
Key: Rdb_key_def::DDL_DROP_INDEX_ONGOING (0x5) + global_index_id
Value: version

6). INDEX_STATISTICS
Index statistics
Key: Rdb_key_def::INDEX_STATISTICS (0x6) + global_index_id
Value: version, {materialized PropertiesCollector::IndexStats}

7). MAX_INDEX_ID
The current maximum index_id. Each time an index is created, its index_id is obtained from and then updated through this entry.
Key: Rdb_key_def::MAX_INDEX_ID (0x7)
Value: version, current max index id

8). DDL_CREATE_INDEX_ONGOING
Pending index-creation tasks
Key: Rdb_key_def::DDL_CREATE_INDEX_ONGOING (0x8) + global_index_id
Value: version
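
To tie these formats together, here is a small sketch that scans the system column family and decodes the leading dictionary type of every key, assuming the 4-byte big-endian type prefix used in the sketches above.

#include <cstdint>
#include <cstdio>
#include <memory>
#include <rocksdb/db.h>

void dump_data_dictionary(rocksdb::DB *db,
                          rocksdb::ColumnFamilyHandle *system_cf) {
  std::unique_ptr<rocksdb::Iterator> it(
      db->NewIterator(rocksdb::ReadOptions(), system_cf));
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    rocksdb::Slice key = it->key();
    if (key.size() < 4) continue;  // every entry starts with a type number
    const auto *p = reinterpret_cast<const unsigned char *>(key.data());
    uint32_t type = (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
                    (uint32_t(p[2]) << 8) | uint32_t(p[3]);
    std::printf("dict type=%u key_len=%zu value_len=%zu\n",
                type, key.size(), it->value().size());
  }
}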

DDL Process
The RocksDB engine has no incremental row-log mechanism like InnoDB's, so MyRocks does not support online DDL; it only supports the inplace algorithm for some DDL operations. From the implementation of the check_if_supported_inplace_alter interface, we can see that DROP_UNIQUE_INDEX and ADD_INDEX operations can run inplace. The advantage of the inplace method is that the table does not need to be copied, which indirectly shortens the time the table is locked; all other operations can only be performed by rebuilding the table (the copy method, which is simpler). The following describes the DDL execution process in inplace mode. The overall entry function is mysql_inplace_alter_table, and the process consists of four stages.
1). Check whether the storage engine supports the inplace DDL operation
Interface: ha_rocksdb::check_if_supported_inplace_alter
For the supported operations MyRocks returns HA_ALTER_INPLACE_SHARED_LOCK_AFTER_PREPARE, which means writes are blocked while the DDL executes.
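
The decision itself is a simple flag check. The sketch below mirrors its shape with illustrative stand-in types and flag values; the real enum values and alter flags come from the MySQL handler API (Alter_inplace_info), and the actual logic in ha_rocksdb::check_if_supported_inplace_alter checks more cases.

#include <cstdint>

// Stand-ins for the handler API types; the values are illustrative.
enum class InplaceResult { NotSupported, SharedLockAfterPrepare };
constexpr uint64_t FLAG_ADD_INDEX = 1ull << 0;
constexpr uint64_t FLAG_DROP_UNIQUE_INDEX = 1ull << 1;

// Only index add / unique-index drop may run inplace; everything else
// falls back to the copy algorithm. Note that even the inplace path
// blocks writes (shared lock after prepare), i.e. this is not online DDL.
InplaceResult check_if_supported_inplace_alter(uint64_t alter_flags) {
  const uint64_t supported = FLAG_ADD_INDEX | FLAG_DROP_UNIQUE_INDEX;
  if (alter_flags & ~supported) return InplaceResult::NotSupported;
  return InplaceResult::SharedLockAfterPrepare;
}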
2). Preparation Phase
Interface: ha_rocksdb::prepare_inplace_alter_table
For the RocksDB engine, the inplace method is mainly used to add and drop indexes, so this phase mainly collects the information about the indexes to be added or dropped. The entry function for the data dictionary work is create_key_defs, which calls the create_key_def interface; each index corresponds to an Rdb_key_def object. A key operation here is generating a globally ordered index_id for each new index (ddl_manager.get_and_update_next_number).
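
A sketch of what get_and_update_next_number has to do with the MAX_INDEX_ID entry described earlier: read the persisted maximum, bump it, and persist the new value before handing the id out, so ids stay globally ordered and are never reused after a restart. The key constant, locking, and value layout here are simplified assumptions.

#include <cstdint>
#include <cstring>
#include <mutex>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

static std::mutex g_index_id_mutex;  // ids form one ordered sequence

bool get_and_update_next_number(rocksdb::DB *db,
                                rocksdb::ColumnFamilyHandle *system_cf,
                                uint32_t *out) {
  const std::string key("\x00\x00\x00\x07", 4);  // MAX_INDEX_ID (0x7)
  std::lock_guard<std::mutex> guard(g_index_id_mutex);

  std::string raw;
  uint32_t max_id = 0;
  rocksdb::Status s = db->Get(rocksdb::ReadOptions(), system_cf, key, &raw);
  if (s.ok() && raw.size() == sizeof(max_id))
    std::memcpy(&max_id, raw.data(), sizeof(max_id));
  else if (!s.IsNotFound())
    return false;  // real read error

  ++max_id;
  raw.assign(reinterpret_cast<const char *>(&max_id), sizeof(max_id));
  rocksdb::WriteBatch batch;
  batch.Put(system_cf, key, raw);  // persist before handing the id out
  if (!db->Write(rocksdb::WriteOptions(), &batch).ok()) return false;

  *out = max_id;
  return true;
}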

3). Execution Phase
Interface: ha_rocksdb::inplace_alter_table
This phase mainly builds the secondary indexes. The concrete implementation is in the inplace_populate_sk interface and consists of two parts: updating the data dictionary and building the index.
A. Update the data dictionary
Data dictionary maintenance is done through the start_ongoing_index_operation interface: a KV pair is constructed for each new index and written to the system column family, and all the KV pairs for the added indexes are committed as one transaction, marking a batch of pending index-creation tasks.

begin
put-KV: (DDL_CREATE_INDEX_ONGOING, cf_id, index_id) -> (DDL_CREATE_INDEX_ONGOING_VERSION)
commit
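
A sketch of this step, reusing the simplified big-endian encoding from the earlier sketches; start_ongoing_create is an illustrative stand-in for start_ongoing_index_operation, and the two-byte version value is an assumption.

#include <cstdint>
#include <string>
#include <vector>
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

static void append_u32_be(std::string *out, uint32_t v) {
  for (int shift = 24; shift >= 0; shift -= 8)
    out->push_back(static_cast<char>(v >> shift));
}

struct GlobalIndexId { uint32_t cf_id; uint32_t index_id; };

// One "creation in progress" marker per new index, committed atomically,
// so a crash mid-DDL leaves markers the cleanup code can find at startup.
rocksdb::Status start_ongoing_create(rocksdb::DB *db,
                                     rocksdb::ColumnFamilyHandle *system_cf,
                                     const std::vector<GlobalIndexId> &indexes) {
  rocksdb::WriteBatch batch;
  for (const auto &idx : indexes) {
    std::string key;
    append_u32_be(&key, 8 /* DDL_CREATE_INDEX_ONGOING */);
    append_u32_be(&key, idx.cf_id);
    append_u32_be(&key, idx.index_id);
    const std::string value("\x00\x01", 2);  // version only, illustrative
    batch.Put(system_cf, key, value);
  }
  return db->Write(rocksdb::WriteOptions(), &batch);
}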

B. Create an index
The next step is building the index itself. Traversing the PK index, we construct records in the new secondary-index format and write them out; the main interface is update_sk. In the RocksDB row-lock implementation, each key corresponds to its own lock object and lock objects are not reused, so the total memory consumed by locks grows with the size and number of keys. To keep system memory under control, rocksdb_commit_in_the_middle is generally enabled to avoid huge transactions, so this process also checks whether the transaction should be committed early; the main implementation is in do_bulk_commit.
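
The bulk-build pattern looks roughly like the sketch below: scan the primary key, emit one secondary-index record per row, and commit every bulk_load_size rows so no single transaction pins too much lock memory. make_sk_entry is a hypothetical stand-in for the schema-aware conversion that update_sk performs, and bulk_load_size plays the role of the commit-in-the-middle threshold checked by do_bulk_commit.

#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

// Hypothetical row -> secondary-index record conversion; the real logic
// depends on the table schema and lives in update_sk.
static std::pair<std::string, std::string> make_sk_entry(
    const rocksdb::Slice &pk_key, const rocksdb::Slice &pk_value) {
  // Illustrative only: index the whole value, point back at the PK.
  return {pk_value.ToString() + pk_key.ToString(), std::string()};
}

rocksdb::Status build_secondary_index(rocksdb::DB *db,
                                      rocksdb::ColumnFamilyHandle *pk_cf,
                                      rocksdb::ColumnFamilyHandle *sk_cf,
                                      size_t bulk_load_size = 1000) {
  std::unique_ptr<rocksdb::Iterator> it(
      db->NewIterator(rocksdb::ReadOptions(), pk_cf));
  rocksdb::WriteBatch batch;
  size_t pending = 0;
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    auto kv = make_sk_entry(it->key(), it->value());
    batch.Put(sk_cf, kv.first, kv.second);
    if (++pending >= bulk_load_size) {  // do_bulk_commit's job
      rocksdb::Status s = db->Write(rocksdb::WriteOptions(), &batch);
      if (!s.ok()) return s;
      batch.Clear();  // start a fresh "transaction"
      pending = 0;
    }
  }
  return db->Write(rocksdb::WriteOptions(), &batch);  // final partial batch
}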

4). Commit or Rollback Phase
Interface: commit_inplace_alter_table
A. Register the indexes to be dropped, through the start_ongoing_index_operation interface (drop variant).
B. Write the index dictionary information for new indexes.
C. Write the mapping between the table and its indexes
After the alter operation, some indexes have been added and some dropped, so the table's index mapping must be rebuilt; the main interface is Rdb_tbl_def::put_dict.
The dictionary operations involved in A, B, and C are committed as a single transaction.

begin
put-KV: (DDL_DROP_INDEX_ONGOING, cf_id, index_id) -> (DDL_DROP_INDEX_ONGOING_VERSION)
put-KV: (INDEX_INFO + cf_id + index_id) -> INDEX_INFO_VERSION_VERIFY_KV_FORMAT + index_type + kv_version
put-KV: (DDL_ENTRY_INDEX_START_NUMBER, dbname_tablename) -> version + {key_entry, key_entry, ...}, key_entry -> (cf_id, index_nr)
commit

D. Maintain the in-memory data dictionary object m_ddl_hash.
The main task is to remove the old table object from the hash table and insert the new one; the main implementation is in Rdb_ddl_manager::put.
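
Conceptually this is a guarded hash-map swap. The sketch below uses illustrative stand-in types rather than the real Rdb_tbl_def/Rdb_ddl_manager classes; shared_ptr keeps an old definition alive for readers that still hold it while new lookups see the new one.

#include <memory>
#include <shared_mutex>
#include <string>
#include <unordered_map>

struct TableDef {          // stand-in for Rdb_tbl_def
  std::string full_name;   // "dbname.tablename"
  // ... index definitions, etc.
};

class DdlManager {
 public:
  // Equivalent of Rdb_ddl_manager::put: drop the old object for this
  // table (if any) and install the new one, under a writer lock.
  void put(std::shared_ptr<TableDef> tbl) {
    std::unique_lock<std::shared_mutex> lock(mutex_);
    m_ddl_hash_[tbl->full_name] = std::move(tbl);
  }

  std::shared_ptr<TableDef> find(const std::string &name) const {
    std::shared_lock<std::shared_mutex> lock(mutex_);
    auto it = m_ddl_hash_.find(name);
    return it == m_ddl_hash_.end() ? nullptr : it->second;
  }

 private:
  mutable std::shared_mutex mutex_;
  std::unordered_map<std::string, std::shared_ptr<TableDef>> m_ddl_hash_;
};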

E. Clear the DDL_CREATE_INDEX_ONGOING mark.
Reaching this point means the new indexes were built successfully, so the DDL_CREATE_INDEX_ONGOING marks must be cleared. The main implementation is in finish_indexes_operation, which finally calls end_ongoing_index_operation to delete the previously written KV pairs (DDL_CREATE_INDEX_ONGOING, cf_id, index_id) -> (DDL_CREATE_INDEX_ONGOING_VERSION); the whole operation is committed as one transaction. At this point the DDL itself is complete, but we have not yet seen where a dropped index's data is actually purged. Dropping an index in RocksDB is an asynchronous process: the real deletion work is done by the background thread Rdb_drop_index_thread, so this step also signals rdb_drop_idx_thread to wake it up and tell it there is work to do.

Rdb_drop_index_thread Workflow
1). Get the list of indexes to be dropped, key = (DDL_DROP_INDEX_ONGOING).
2). Traverse the indexes to be dropped one by one and delete each one's records over the key range [index_id, index_id + 1).
3). Call CompactRange to trigger compaction.
4). Seek by index_id; if no key with that index_id remains, the index is considered fully cleared.
5). Finally, call finish_indexes_operation (DDL_DROP_INDEX_ONGOING) to clear the to-be-dropped mark and remove the index's dictionary information from the data dictionary (see delete_index_info), as sketched after the batch below.

begin
delete-key: (DDL_DROP_INDEX_ONGOING, cf_id, index_id)
delete-key: (INDEX_INFO + cf_id + index_id)
batch-commit
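
The following sketch condenses steps 1) through 5) for a single index. It uses DeleteRange and a single function for brevity where the real thread loops over a list of pending indexes and deletes keys one by one, and it reuses the simplified big-endian encoding from the earlier sketches.

#include <cstdint>
#include <memory>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

static std::string be32(uint32_t v) {
  std::string s;
  for (int shift = 24; shift >= 0; shift -= 8)
    s.push_back(static_cast<char>(v >> shift));
  return s;
}

rocksdb::Status drop_one_index(rocksdb::DB *db,
                               rocksdb::ColumnFamilyHandle *data_cf,
                               rocksdb::ColumnFamilyHandle *system_cf,
                               uint32_t cf_id, uint32_t index_id) {
  const std::string begin = be32(index_id), end = be32(index_id + 1);
  rocksdb::Slice begin_s(begin), end_s(end);

  // 2) Remove all records in the key range [index_id, index_id + 1).
  rocksdb::Status s =
      db->DeleteRange(rocksdb::WriteOptions(), data_cf, begin_s, end_s);
  if (!s.ok()) return s;

  // 3) Compact the range so the tombstones are physically reclaimed.
  s = db->CompactRange(rocksdb::CompactRangeOptions(), data_cf,
                       &begin_s, &end_s);
  if (!s.ok()) return s;

  // 4) Verify that no key with this index_id prefix survives.
  std::unique_ptr<rocksdb::Iterator> it(
      db->NewIterator(rocksdb::ReadOptions(), data_cf));
  it->Seek(begin_s);
  if (it->Valid() && it->key().starts_with(begin_s))
    return rocksdb::Status::Incomplete("index not fully dropped yet");

  // 5) Drop the pending marker and the index dictionary entry atomically.
  rocksdb::WriteBatch batch;
  batch.Delete(system_cf, be32(5 /* DDL_DROP_INDEX_ONGOING */) +
                              be32(cf_id) + be32(index_id));
  batch.Delete(system_cf, be32(2 /* INDEX_INFO */) +
                              be32(cf_id) + be32(index_id));
  return db->Write(rocksdb::WriteOptions(), &batch);
}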

DDL Exception Handling
From the implementation above, a DDL operation involves, besides the transaction that writes the index data, several separate transactions on the data dictionary, so the DDL as a whole is not atomic. For example, if the instance crashes right after the dictionary transaction of step A of the execution phase commits, the pending-task entries are left behind in the system column family; from the business perspective, however, this does no harm.

The mysql_inplace_alter_table function described above contains the main DDL execution flow. Before it runs, mysql_prepare_alter_table creates a temporary .frm file (with a name generally starting with #sql) that contains the schema definition of the target table; at the end of the DDL, mysql_rename_table renames it to the target table's .frm. If the instance crashes before the rename, the .frm file keeps the old schema while the RocksDB engine dictionary has already been updated. The visible symptom is that the indexes shown by show create table xxx disagree with information_schema.ROCKSDB_DDL.

The two cases above are problems on the inplace path. In copy mode, the table is rebuilt, so the temporary table's #sqlxxx information is written into the data dictionary; if the instance crashes after that step, temporary-table entries can be left in the data dictionary. When mysqld restarts, it checks the dictionary against the schema files, mainly through the validate_schemas interface: for each table name in the data dictionary it looks for the corresponding .frm file, ignoring temporary .frm files whose names start with #. As long as the data dictionary still contains dictionary information for a temporary table, mysqld fails to start and reports the following error.

[Warning] RocksDB: Schema mismatch - Table test.#sql-b54_1 is registered in RocksDB but does not have a .frm file
[ERROR] RocksDB: Problems validating data dictionary against .frm files, exiting
[ERROR] RocksDB: Failed to initialize DDL manager.

If you need the server to start anyway, you can temporarily ignore this error with the parameter rocksdb_validate_tables = 2; after all, the leftover data dictionary entries of a temporary table do not affect the use of the business tables. In my analysis, the root cause of DDL exceptions being handled poorly is that DDL is not an atomic operation: modifications at the server layer and the engine layer cannot be kept consistent in some cases, which leads to these problems.

Related Implementation files and interfaces
storage/rocksdb/rdb_datadic.cc // data dictionary code
storage/rocksdb/rdb_i_s.cc // information_schema code
myrocks::ha_rocksdb::inplace_populate_sk // build the secondary index
Rdb_dict_manager::get_max_index_id // obtain the maximum index_id
ha_rocksdb::check_if_supported_inplace_alter // check whether inplace is supported
myrocks::ha_rocksdb::create // table creation interface for the copy method
myrocks::ha_rocksdb::create_key_def // create a key definition object
myrocks::Rdb_ddl_manager::get_and_update_next_number // obtain the next index_id
Rdb_dict_manager::start_ongoing_index_operation // register an index create/drop task

 
