MySQL source code: Innobase dictionary management and indexing bitsCN.com
I recently reviewed the B-tree and record module of innobase and found that I had forgotten the dictionary management and index content I had read before. so I read it again and recorded it, to avoid forgetting it one day. However, if you have any questions, please correct them. Thank you.
In innobase, there are four basic system tables used to store user-defined table, column, index, and index column information. These tables are SYS_TABLES, SYS_COLUMNS, SYS_INDEXES, and SYS_FIELDS respectively. The columns of each table are as follows:
1) SYS_TABLES is used to store all innobase-based tables. each record corresponds to a defined table.
NAME: NAME of a table.
ID: ID of the table (8 bytes ).
N_COLS: number of columns in the table. the number of columns specified in the table is 4 bytes ).
TYPE: indicates the storage TYPE of the table, including the record format, compression, and other information (4 bytes ).
MIX_ID: I don't know how to use it. it seems useless.
MIX_LEN: It seems useless.
CLUSTER_NAME: I don't know what to use. it seems useless.
SPACE: ID of the tablespace in which the table is located (4 bytes ).
The primary key corresponding to this table is (NAME), and there is also a unique index on the ID number.
2) SYS_COLUMNS is used to store information about all columns of all tables defined in innobase. each column corresponds to a record in this table.
TABLE_ID: ID of the table to which the column belongs (8 bytes ).
POS: indicates the number of columns (4 bytes) in the table ).
NAME: NAME of the column.
MTYPE: indicates the primary data type (4 bytes) of this column ).
PRTYPE: indicates some precise data types of this column. it is a combination value, including the NULL flag, whether it is a signed number, whether it is a binary string, and whether the column is a real VARCHAR (two bytes for data storage) (4 bytes ).
LEN: indicates the data length of this column, but does not include the VARCHAR type, because this type stores the data length (4 bytes) in the record ).
PREC: indicates the data precision of this column, but it does not seem to be used currently (4 bytes ).
The primary key columns of this table (TABLE_ID, POS ).
3) SYS_INDEXES is used to store index information of all tables in innobase. each record corresponds to one index.
TABLE_ID: ID of the table to which the index belongs (8 bytes ).
ID: ID of the index (8 bytes ).
NAME: indicates the index NAME of the index.
N_FIELDS: number of indexes for this index (4 bytes ).
TYPE: indicates the index TYPE, including clustered index, unique index, DICT_UNIVERSAL, DICT_IBUF (insert buffer B tree) (4 bytes ).
SPACE: ID of the tablespace where the index data is located (4 bytes ).
PAGE_NO: the root page (4 bytes) of the B-tree corresponding to the index ).
The table's primary key column (TABLE_ID, ID ).
4) SYS_FIELDS is used to store the index columns defined in all indexes. each record corresponds to one index column.
INDEX_ID: the index of the column (8 bytes ).
POS: the number of index columns (4 bytes) in an index ).
COL_NAME: name of the index column.
The primary key columns of this table (INDEX_ID and POS ).
From the preceding dictionary table, we can see how innobase manages tables, indexes, columns, and keywords. how are these tables loaded? First, start with creating a database.
When innobase is started, if you want to initialize the database, you need to create information such as the B tree of dictionary management. Then, call the dict_create function in innobase_start_or_create_for_mysql to create the database, because the structure and number of system tables in innobase are fixed and won't be modified, therefore, when initializing the database, you only need to create the storage B tree for these tables (corresponding to the index mentioned above, one index and one B tree, at the same time, the root page numbers of these B trees are stored in a fixed location, without storing the information of these tables in the system table. In innobase, a special page (page 7 of file 0 in table 0) is used to manage Dictionary information, this page is used to store the five page numbers of the four tables mentioned above, and the next table ID value (the ID number of the global variable when creating a new table is allocated here, each allocated one, add 1 to this ID number), the next index ID value (the ID number is allocated here when the index is created, and 1 is added for each allocation), and the next tablespace ID value (same as above) and ROWID (this is the ROWID number in the table, which is described below. These operations are implemented through the dict_hdr_create function. Here, two SYS_TABLES indexes are created through btr_create, one SYS_COLUMNS index, one SYS_INDEXES index, and one SYS_FIELDS index.
After innobase creates the B-tree and some other initialization operations, it loads the four system tables in the resident memory and reads some other information through the dict_boot function. as mentioned above, the number and structure of SYSTEM tables are not modified. Therefore, you can directly build these tables using fixed hard coding, for example:
Create SYS_TABLES:
Table = dict_mem_table_create ("SYS_TABLES", DICT_HDR_SPACE, 8, 0 );
Dict_mem_table_add_col (table, heap, "NAME", DATA_BINARY, 0, 0 );
Dict_mem_table_add_col (table, heap, "ID", DATA_BINARY, 0, 0 );
Dict_mem_table_add_col (table, heap, "N_COLS", DATA_INT, 0, 4 );
......
Create an index for the SYS_TABLES table:
Index = dict_mem_index_create ("SYS_TABLES", "CLUST_IND ",
DICT_HDR_SPACE,
DICT_UNIQUE | DICT_CLUSTERED, 1 );
Dict_mem_index_add_field (index, "NAME", 0 );
The process is roughly described above.
After initialization, because it is resident memory, the four tables are attached to a global dictionary structure:
Struct dict_sys_struct {
Mutex_t mutex;
Row_id_t row_id;
Hash_table_t * table_hash;
Hash_table_t * table_id_hash;
UT_LIST_BASE_NODE_T (dict_table_t) table_LRU ;/*! <LRU list of tables */
Ulint size;
Dict_table_t * sys_tables ;/*! <SYS_TABLES table */
Dict_table_t * sys_columns ;/*! <SYS_COLUMNS table */
Dict_table_t * sys_indexes ;/*! <SYS_INDEXES table */
Dict_table_t * sys_fields ;/*! <SYS_FIELDS table */
};
Obviously, the bottom four of the struct are used to store the corresponding four dictionary tables. The first mutex is not required. let's talk about ROWID management. in innobase, not all records in a user table may have a ROWID. ROWID is allocated only when a table does not have a primary key defined, that is, when ROWID itself acts as an index column, this ROWID is not an ID space exclusive to a table, but a global one. all tables share this ID number. generally, it is like TABLEID and INDEXID, the value of the dictionary page is updated once each time a row is assigned, but this is not the case for ROWID, because the insert operation is more frequent than creating a table or index, every modification has no significant impact on efficiency, so innobase has also been optimized accordingly, that is, every time a ROWID is allocated, the system only adds 1 to the memory and does not modify the page, the value is written only once when it is a multiple of 256, so naturally, if the value is inserted 200 times, it is not written yet. if the system restarts, isn't it the ID? The number will be used again. this is of course not a problem, that is, another job in dict_boot is to align the ROWID value written last time to 256 and add 256, in this way, there will be no problem. in a big case, you may skip a lot of ID numbers, resulting in this value increasing too fast.
The following three linked lists and HASH tables in the structure above are used to store cache for all tables in innobase, including system tables and user-created tables. There are two hashes here, one is cached by name, and the other is cached by ID. The LRU linked list is used to manage the cache of table objects, involving the elimination operation.
After the dictionary object storage and management is introduced above, the following describes a loading process for a common user table. when a user accesses a table, as described in the previous article "dictionary cache for table objects", the system first looks for the SHARE object of the table from the table object cache pool, if it is found, it can be used directly in the instantiated and space linked list with an instantiated table object. if there is no available instantiated object, it needs to be re-opened (instantiate this table ), when instantiating the table, you need to find the dictionary information of the table, including the table itself, column information, and index information. this operation is implemented through the subject dict_table_get described below.
The implementation in this function is to find the information described above for the specified table. The process is as follows: first, look for the information from the dictionary cache to see if the table has been cached in the HASH table mentioned above, if it is found, use it directly. if it is not found, use dict_load_table to load from the four system tables mentioned above. First, find SYS_TABLES. The process is the same as that for common tables, it is also the first way to find the cache, and then load it from the system table if it cannot be found, but this seems nonsense, because the system table is loaded at system startup and cannot be found, after finding it, construct a query key value. because it is queried by name, and SYS_TABLES has a search index sorted by name, you can directly construct the query key value by name, then, query the corresponding record from tree B. If no record is found, an error is returned. after the record is found, the record is parsed based on the record format of the table, take out the basic information such as ID, N_COLS, TYPE, and SPACE, and create a memory object for the table based on the information. The table's own object is loaded here, then, you need to load information about all its columns.
The column loading operation is basically the same as the above principle. find the system table SYS_COLUMNS, because the clustered index of this table is the table ID and POS, so in the B tree, if the table IDs of multiple records are the same, the POS is sorted in ascending order. Therefore, you only need to construct the table ID when constructing the query key value, load the corresponding information of each record in sequence according to the retrieved column information. after loading all the records with the same table ID, all columns in this table are loaded accordingly. Note that in innobase, a table column consists of two parts: one part is the column specified during table creation, and the other part is the system column, contains three columns: ROWID, TRXID, and ROLLPTR. ROWID indicates the row number of the record. as mentioned above, TRXID indicates the transaction number that was last modified for this record, it is mainly used for multi-version (MVCC) management of transactions. ROLLPTR is also used for multi-version management. if one record is modified by one user and another user is invisible to this record, when querying this record, you need to find its original value. ROLLPTR specifies the position of its original value, which is actually the position of the rollback record written during modification. Therefore, for any table, the columns included include the user-defined columns and the three columns.
Next, load the index information of the table. The index is queried from SYS_INDEXES. The principle is the same as that of SYS_COLUMNS. the keywords of this table are table ID and Index ID, therefore, all index records with the same table ID are sorted by index ID. Therefore, each record corresponds to an index, basic information such as ID, name, N_FIELDS, TYPE, PAGENO, and SPACE needs to be loaded. for index loading, all the corresponding keyword information needs to be loaded, which is stored in the SYS_FIELDS system table, the keywords of this table are INDEXID and POS. Therefore, all the keyword columns of an index are sorted by POS. this is very important because if there are multiple sorting columns, different sorting results are different, so the POS values must be loaded from small to large (B-tree storage order). The index keyword information includes the Index ID number, POS, and column name, loading an index is completed after loading all the keyword columns with the specified index id. However, you must note that if an index is not a unique index, you must To load the ROWID column that has been loaded in the table to this index as the first keyword column of this index, if it is a unique index, you do not need to load the ROWID column, instead, it directly loads its own defined columns. After loading all the keyword columns (either the ROWID column or the custom column), you also need to load the other two system columns, including the TRXID and ROLLPTR columns. For clustering indexes, because this index stores all the columns in the table, all columns except the keywords need to be loaded later. These columns are loaded in the order of table creation, for secondary indexes, these columns do not need to be loaded. Load all indexes in a table according to the same principle.
The loading of this table is complete, so we can see the information and sequence of the columns loaded in an index. Summary:
Unique clustered index:
[Ordered keyword columns] [TRXID] [ROLLPTR] [non-keyword columns created in other tables]
Number of compare columns: number of ordered keyword columns
Non-unique index of clustering:
[ROWID] [TRXID] [ROLLPTR] [non-keyword columns created in other tables]
Compare the number of columns: only one ROWID column.
Secondary unique index or secondary non-unique index (clustered index is unique index ):
[Ordered index column] [sequence column of the remaining clustered index] [TRXID] [ROLLPTR]
Compare the number of columns: number of ordered keyword columns plus number of remaining clustered index ordered columns
Secondary unique index or secondary non-unique index (clustered index is not unique index ):
[Ordered index column] [ROWID] [TRXID] [ROLLPTR]
Compare the number of columns: number of ordered keyword columns plus ROWID
We can see that for the record format definition such as innobase, when comparing two records, you only need to compare the column information of the number of columns from the first 0th columns, it is easy to use records. The above secondary indexes with the unique index of the cluster index may be created on one or some columns of the cluster index, or on other columns, then, the comparison column of the index column is to put the specified index column at the beginning of the order when the index is created, and the root column is all the other columns in the clustered index, the order is still the original order. Because the secondary index is used to quickly find the corresponding records in the secondary index, you can also quickly find the original records in the clustered index, the comparison column defined in the secondary index is to quickly find the clustered index, because these columns contain all the index columns of the clustered index, the order is consistent with that of the clustered index. If the clustered index of the secondary index is a non-unique index, the index key used to quickly index the clustered index record in the secondary index is the ROWID itself.
The above describes the management and index management of all system tables in innobase. However, for system tables, users cannot query the contents of the table as we generally think, it can only be managed and maintained by itself in the system. for example, tables such as SYS_TABLES are not recognized by MYSQL. this problem is also related to the features of MYSQL because it is plug-in-type, it must be compatible with all storage engines. not all storage engines have these system tables. Therefore, you cannot directly query or access the innobase system tables through MYSQL.
Conclusion: the design of innobase dictionaries and index records is very convenient and advanced. It is worth learning and learning.
BitsCN.com