MySQL: Physical structure of files in the InnoDB file system


Viewed from above, files at the InnoDB layer, apart from the redo log, share a fairly uniform structure: they are divided into fixed-size blocks and generally use B-tree structures to manage data. Blocks are assigned different page types according to how they are used. By default the size of each block is UNIV_PAGE_SIZE, which is 16KB when nothing is configured; a different block size can be chosen for the instance when it is installed. For compressed tables, the block size can be specified when the table is created, but pages decompressed into memory still use the uniform page size.

Classified by physical file, there are log files, the main system tablespace files (ibdata), undo tablespace files, temporary tablespace files, and user tablespaces.

Log files are mainly used to record the redo log, which InnoDB writes in a circular fashion; the number of files and the size of each file can be set through parameters. By default the log is written in 512-byte blocks. Because modern file systems typically use a 4KB block size, InnoDB provides an option that pads redo log writes to 4KB to avoid read-modify-write, while Percona Server provides another option that changes the redo log block size directly to a specified value.

ibdata is the most important system tablespace file in InnoDB. It records InnoDB's core information, including transaction system information, metadata, the change buffer B-tree, and the doublewrite buffer that guards against data page corruption. We describe it in detail later.

A separate undo tablespace is optional. By default undo data is stored in ibdata, but the undo rollback segments can be assigned to separate files by configuring innodb_undo_tablespaces; at present, separate undo tablespaces can only be enabled when the instance is installed. Now that mainstream versions have reached 5.7, we recommend enabling separate undo tablespaces, if only to take advantage of a feature introduced in 5.7: online undo truncate.

MySQL 5.7 adds a temporary tablespace whose default disk file is named ibtmp1; all non-compressed temporary tables are stored in it. Given the nature of temporary tables, the file is recreated on restart. For cloud service providers, the ibtmp file allows better control over the disk space consumed by temporary files.

User tablespaces, as the name implies, are the tablespaces that users create themselves. They fall into two categories: tablespaces that consist of a single file, and the general tablespace introduced in 5.7, which, under certain constraints, allows multiple tables to be created in the same file. In addition, InnoDB defines some special-purpose ibd files, such as the table files related to full-text indexes. For spatial data types, a different index format, the R-tree, is built.

Key code functions are noted where relevant, and readers are advised to read the code alongside this article. The code references in this article are based on MySQL 5.7.11; function names or logic may differ in other versions, so please try to use this version of the code when reading.

File management pages

Each InnoDB data file belongs to a tablespace, and each tablespace is identified by a unique space ID. For example, ibdata1, ibdata2, ... all belong to the system tablespace and share the same space ID. The ibd file created when a user creates a table is treated as a separate tablespace containing only one file.

Each file is divided into pages of a fixed size; by default the page size of an uncompressed table is 16KB. Within a file, pages are further grouped and managed in extents of 64 pages (1MB in total). For other page sizes the extent size differs accordingly:

Although larger page sizes are supported, data compression is currently not supported for large pages, because that would involve changing the fixed size of the slots in a compressed page (which is actually not complicated). Unless stated otherwise, we use the default 16KB page size to illustrate the physical structure of files.

To manage the whole tablespace, a data file contains various management pages in addition to index pages. The figure shows the management pages in a user tablespace; they are described in the following sections.

InnoDB management pages

File linked list

First, let's start with a basic file structure, the file linked list. To manage data blocks such as pages and extents, many nodes are recorded in the file to maintain linked lists with particular roles, such as the inode page lists maintained in the file header page, and the free, full, and fragment extent lists.

In InnoDB, the list head is called FLST_BASE_NODE, with size FLST_BASE_NODE_SIZE (16 bytes). The base node maintains the head and tail pointers of the list. Each list node is called FLST_NODE, with size FLST_NODE_SIZE (12 bytes). The structures are described as follows:

FLST_BASE_NODE:

FLST_NODE:

As mentioned above, the file list uses 6 bytes as the node pointer, and the contents of the pointer include:

The linked list structure is the foundation for managing all pages in an InnoDB tablespace. It is enough to get a first impression here; the details follow further down.
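To make the sizes above concrete, here is a minimal Python sketch that decodes a list base node and a list node from raw bytes. It assumes, based on my reading of fut0lst.h and fil0fil.h, that a 6-byte node pointer is a 4-byte page number followed by a 2-byte in-page offset, that the base node is a 4-byte length followed by the first and last pointers, and that values are stored big-endian. The helper names are hypothetical and not part of InnoDB.

```python
import struct

FIL_ADDR_SIZE = 6                              # assumed: 4-byte page no + 2-byte offset
FLST_BASE_NODE_SIZE = 4 + 2 * FIL_ADDR_SIZE    # length + first + last = 16 bytes
FLST_NODE_SIZE = 2 * FIL_ADDR_SIZE             # prev + next = 12 bytes


def read_fil_addr(buf, pos):
    """Decode a 6-byte file address as (page_no, byte_offset)."""
    return struct.unpack_from(">IH", buf, pos)


def read_base_node(buf, pos=0):
    """Decode a 16-byte base node: list length, first and last node addresses."""
    (length,) = struct.unpack_from(">I", buf, pos)
    first = read_fil_addr(buf, pos + 4)
    last = read_fil_addr(buf, pos + 4 + FIL_ADDR_SIZE)
    return {"len": length, "first": first, "last": last}


def read_node(buf, pos=0):
    """Decode a 12-byte list node: previous and next node addresses."""
    prev = read_fil_addr(buf, pos)
    nxt = read_fil_addr(buf, pos + FIL_ADDR_SIZE)
    return {"prev": prev, "next": nxt}


# Hand-built example: a list with 2 nodes, both located on page 0.
example = struct.pack(">IIHIH", 2, 0, 158, 0, 198)
print(read_base_node(example))
```

Because each pointer carries both a page number and an offset, a list node can live anywhere inside any page of the tablespace, which is what lets extent descriptors and inode pages be chained together.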

InnoDB Table Space Page Management

For the code that manages file lists, see include/fut0lst.ic and fut/fut0lst.cc.

FSP_HDR PAGE

The first page of a data file has type FIL_PAGE_TYPE_FSP_HDR and is initialized when a new tablespace is created (fsp_header_init). It tracks the space management information for the following 256 extents (about 256MB of file), so a similar page, of type FIL_PAGE_TYPE_XDES, is created every 256MB. Apart from the file header information, an XDES page has the same data structure as the FSP_HDR page and can be called an extent descriptor page. Each extent takes 40 bytes, and one XDES page describes at most 256 extents.
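The arithmetic implied by this layout can be sketched as follows. This is an illustrative Python sketch for the default 16KB page size only; `locate_xdes` is a hypothetical helper, and the returned byte offset is relative to the start of the descriptor array rather than to the start of the page.

```python
PAGE_SIZE = 16 * 1024          # default page size in bytes
PAGES_PER_EXTENT = 64          # one extent = 64 pages = 1MB at 16KB pages
EXTENTS_PER_XDES_PAGE = 256    # one descriptor page tracks 256 extents
XDES_ENTRY_SIZE = 40           # bytes per extent descriptor
PAGES_PER_XDES_PAGE = PAGES_PER_EXTENT * EXTENTS_PER_XDES_PAGE  # 16384 pages


def locate_xdes(page_no):
    """Return (descriptor_page_no, entry_index, offset_within_entry_array)."""
    descriptor_page = page_no - page_no % PAGES_PER_XDES_PAGE
    entry_index = (page_no % PAGES_PER_XDES_PAGE) // PAGES_PER_EXTENT
    return descriptor_page, entry_index, entry_index * XDES_ENTRY_SIZE


# Page 20000 lies in the second 256MB range, so it is described by the
# descriptor page at page number 16384.
print(locate_xdes(20000))
```

Page 0 is the FSP_HDR page itself, and every later multiple of 16384 is an XDES page, which matches the "one descriptor page per 256MB" rule in the text.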

The head of the FSP_HDR page uses FSP_HEADER_SIZE bytes to record information about the file, including:

The flags in the file header (FSP_SPACE_FLAGS above) describe the following key information set when the table was created:

Apart from the information described above, the rest of the page has the same structure as an XDES page (FIL_PAGE_TYPE_XDES): a contiguous array in which each XDES page stores up to 256 XDES entries. Each entry takes 40 bytes and describes 64 pages (that is, one extent). The format is as follows:

XDES_STATE represents the four possible states of an extent:

With the XDES_STATE information, a single FLST_NODE is enough to maintain each extent's list membership, whether the extent is on one of the tablespace's global lists or on a list belonging to a B-tree segment.

IBUF BITMAP PAGE

The 2nd page has type FIL_PAGE_IBUF_BITMAP. It tracks the change buffer information for the subsequent pages, using 4 bits per page.

Because a bitmap page has limited space, an IBUF bitmap page is created right after the XDES page, once every 256 extents.
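A small Python sketch shows why one bitmap page per 256 extents is enough, and where the 4 bits for a given page would sit. It assumes 16KB pages and that the bitmap page immediately follows the descriptor page of the same range; `locate_ibuf_bits` is a hypothetical helper, and the byte index is relative to the start of the bitmap area, not the page.

```python
PAGE_SIZE = 16 * 1024
PAGES_PER_BITMAP_PAGE = 256 * 64   # 256 extents of 64 pages each
BITS_PER_PAGE = 4                  # change buffer bits kept per page


def locate_ibuf_bits(page_no):
    """Return (bitmap_page_no, byte_index, bit_shift) for a page's 4 bits."""
    range_start = page_no - page_no % PAGES_PER_BITMAP_PAGE
    bitmap_page = range_start + 1            # page right after the XDES page
    bit_pos = (page_no % PAGES_PER_BITMAP_PAGE) * BITS_PER_PAGE
    return bitmap_page, bit_pos // 8, bit_pos % 8


# 16384 pages * 4 bits = 8192 bytes, which fits in one 16KB page.
bitmap_bytes = PAGES_PER_BITMAP_PAGE * BITS_PER_PAGE // 8
print(bitmap_bytes, locate_ibuf_bits(3))
```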

We do not discuss the change buffer here; interested readers can refer to an earlier monthly report:
MySQL Engine Features · InnoDB Change Buffer Introduction (http://mysql.taobao.org/monthly/2015/07/01/)

INODE PAGE

The 3rd page of the data file has type FIL_PAGE_INODE and manages the segments in the data file. Each index uses 2 segments, which manage its leaf nodes and non-leaf nodes respectively. Each inode page can store FSP_SEG_INODES_PER_PAGE records (85 by default).
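The figure of 85 can be reproduced with a little arithmetic. The Python sketch below assumes an inode entry layout of segment id (8 bytes), used-page counter (4 bytes), three extent list base nodes (16 bytes each), a magic number (4 bytes), and a 32-slot fragment page array (4 bytes per slot), and it assumes the page keeps 38 bytes of header, a 12-byte list node, and 8 bytes of trailer. These field sizes come from my reading of fsp0fsp.h and should be treated as assumptions; the exact amount the real macro reserves at the end of the page may differ by a few bytes without changing the result.

```python
PAGE_SIZE = 16 * 1024
FIL_PAGE_DATA = 38            # common page header
FIL_PAGE_DATA_END = 8         # common page trailer
FLST_NODE_SIZE = 12           # links this inode page into an inode page list
FLST_BASE_NODE_SIZE = 16
FRAG_ARRAY_SLOTS = 32         # fragment page slots per segment
FRAG_SLOT_SIZE = 4            # each slot holds one page number

# Assumed entry layout: id + used counter + 3 list bases + magic + frag array
INODE_ENTRY_SIZE = (8 + 4 + 3 * FLST_BASE_NODE_SIZE + 4
                    + FRAG_ARRAY_SLOTS * FRAG_SLOT_SIZE)


def inodes_per_page(page_size=PAGE_SIZE):
    """Number of whole inode entries that fit in the usable part of a page."""
    usable = page_size - FIL_PAGE_DATA - FLST_NODE_SIZE - FIL_PAGE_DATA_END
    return usable // INODE_ENTRY_SIZE


print(INODE_ENTRY_SIZE, inodes_per_page())
```

Since each index consumes two entries, a single inode page under these assumptions covers roughly 42 indexes.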

The structure of each Inode entry is shown in the following table:

file maintenance

As shown above, InnoDB uses inode entries to manage the data pages occupied by each segment; each segment can be seen as a unit that maintains a set of file pages. The inode page holding the inode entries may become full, so the header page maintains lists of inode pages.

The first page of an ibd file also maintains three extent lists for the tablespace, FREE, FREE_FRAG, and FULL_FRAG, and each inode entry maintains three corresponding extent lists, FREE, NOT_FULL, and FULL. Extents move between these lists so that data file space is used efficiently.

Creating a new index actually builds a new B-tree (btr_create): first an inode entry is allocated for the non-leaf segment, the root page is created, and the location of that segment is recorded in the root page. Then an inode entry is allocated for the leaf segment and likewise recorded in the root page.

When an index is deleted, the space it occupied can be reused.

Create segment
First, each segment must reserve a certain amount of space from the ibd file in advance (fsp_reserve_free_extents), usually 2 extents. However, if the tablespace is newly created and the file is currently smaller than 1 extent, only 2 pages are reserved.

When file space runs short, the file must be extended (fsp_try_extend_data_file). Extension follows these rules: if the file is currently smaller than 1 extent, it is extended to a full extent; if the tablespace is smaller than 32MB, it is extended by one extent at a time; above 32MB, by 4 extents at a time (fsp_get_pages_to_extend_ibd).
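The extension rule can be written down as a short function. This Python sketch assumes 16KB pages, so that 32MB equals 32 extents; `pages_to_extend` is a hypothetical name that mirrors the rule described above rather than the actual body of fsp_get_pages_to_extend_ibd.

```python
PAGES_PER_EXTENT = 64                        # 1MB extent at 16KB pages
THRESHOLD_PAGES = 32 * PAGES_PER_EXTENT      # 32MB expressed in pages
EXTENTS_PER_LARGE_STEP = 4                   # step size above the threshold


def pages_to_extend(current_pages):
    """Return how many pages to add to a file that has current_pages pages."""
    if current_pages < PAGES_PER_EXTENT:
        # Smaller than one extent: fill up to exactly one full extent.
        return PAGES_PER_EXTENT - current_pages
    if current_pages < THRESHOLD_PAGES:
        # Small tablespace: grow one extent at a time.
        return PAGES_PER_EXTENT
    # Large tablespace: grow four extents at a time.
    return EXTENTS_PER_LARGE_STEP * PAGES_PER_EXTENT


for size in (6, 64, 2047, 2048):
    print(size, pages_to_extend(size))
```

The small first steps keep tiny tables from occupying a full megabyte more than they need, while the larger step reduces how often a big table has to extend its file.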

After the space is reserved, the file header page is read and latched (fsp_get_space_header), and an inode entry is then allocated (fsp_alloc_seg_inode). The first step is to find a suitable inode page.

Because an inode page has limited space, the file header stores two inode page lists to manage them: one links the inode pages that are already full, the other links those that are not yet full. If no inode page with free space is available, a new inode page is allocated and added to the FSP_SEG_INODES_FREE list (fsp_alloc_seg_inode_page). For a stand-alone tablespace, one inode page is usually enough.

Once the target inode page is obtained, a free slot is located in it (fsp_seg_inode_page_find_free); free means the slot does not belong to any segment, that is, its FSEG_ID is 0.

Once all the records in an inode page are in use, the page is moved from the FSP_SEG_INODES_FREE list to the FSP_SEG_INODES_FULL list.

After the inode entry is obtained, FSP_SEG_ID in the header page is incremented and the seg id of the current segment is written to the inode entry. A series of initialization steps follows.

Once the inode entry is set up, the page number of its inode page and its offset within that page are stored in another page (for a B-tree, in the root page; this takes 10 bytes, comprising space id, page no, and offset).

The B-tree root page is actually allocated when the non-leaf segment is created, and it is recorded in the first element of that segment's frag array.

Segment allocation entry function: fseg_create_general

Assigning data pages
As the B-tree grows, new pages must be allocated to its segments. As noted above, a segment is an independent page management unit, so space has to be taken from the global pool and brought under the segment's management.

Step 1: Space expansion

When an index insert may cause a page split, a pessimistic insert (btr_cur_pessimistic_insert) is performed. Before the actual split, the file is extended if necessary and an attempt is made to reserve (tree_height / 16 + 3) extents, which in most cases is 3 extents.

There is one exception: if the current file is no larger than one extent and fewer than half an extent of pages are requested, the reservation is made in pages rather than in extents. In that case either 2 free pages are guaranteed, or the specified number of pages is reserved.

Note that this step only guarantees that there is enough file space, so that the B-tree operation does not run out of space midway. If the ibd file is extended in this step (fsp_try_extend_data_file), the new data pages are not initialized and are not added to any linked list.

When judging whether there are enough free extents, the free space that the ibd file keeps in reserve is also taken into account; for an ordinary user tablespace this is 2 extents + file_size * 1%. The newly extended pages are not initialized at this point and are not added to any list; the page no recorded in FSP_FREE_LIMIT in the header page marks where the range of uninitialized pages begins.
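The two quantities involved in this check can be sketched in Python. The sketch assumes 16KB pages and rounds the 1% reserve down to whole extents; the function names are hypothetical, and the final comparison (free extents must exceed the reserve plus the request) is my assumption about how the check is applied, not a quotation of fsp_reserve_free_extents.

```python
PAGES_PER_EXTENT = 64


def extents_to_reserve_for_split(tree_height):
    """Extents a pessimistic insert tries to reserve: tree_height / 16 + 3."""
    return tree_height // 16 + 3


def extents_kept_in_reserve(file_size_pages):
    """Reserve for a normal user tablespace: 2 extents + 1% of the file."""
    n_extents = file_size_pages // PAGES_PER_EXTENT
    return 2 + n_extents // 100


def reservation_succeeds(free_extents, file_size_pages, tree_height):
    """Assumed rule: the request fits only above the kept-in-reserve level."""
    needed = extents_to_reserve_for_split(tree_height)
    return free_extents > extents_kept_in_reserve(file_size_pages) + needed


# A 100MB file (6400 pages) with a 3-level tree: reserve 3, request 3.
print(extents_kept_in_reserve(6400), extents_to_reserve_for_split(3))
print(reservation_succeeds(7, 6400, 3), reservation_succeeds(6, 6400, 3))
```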

Step 2: Assign page to segment

Next comes the index split stage (btr_page_split_and_insert). The call stack for allocating the new page is as follows:

Among the arguments passed in is a hint page no, which is usually the page before (direction = FSP_DOWN) or the page after (direction = FSP_UP) the page being split. The goal is to keep logically neighboring nodes physically adjacent as far as possible.

Step 1 has already ensured that the file physically has enough data pages, although they are not yet initialized. A page is allocated to the current segment as follows (fseg_alloc_free_page_low):

1. Calculate the number of pages used and the number of pages occupied by the current segment

· The number of pages used is the pages used on the FSEG_NOT_FULL list (stored in FSEG_NOT_FULL_N_USED of the inode entry) + the pages of the full extents on the FSEG_FULL list + the frag array pages in use;

· The number of pages occupied is the frag array pages in use + the pages of the extents on the three lists FSEG_FREE, FSEG_NOT_FULL, and FSEG_FULL.

2. Obtain the xdes entry corresponding to the hint page (xdes_get_descriptor_with_space_hdr)

3. When the following conditions are all met, the hint page can be used directly:

· The extent state is XDES_FSEG, meaning it belongs to a segment

· The extent containing the hint page has been assigned to the current segment (checked via XDES_ID in the xdes entry)

· The bit for the hint page is set to free, meaning the page is not yet occupied

· Return the hint page

4. When these conditions are met: 1) the xdes entry is currently free (XDES_FREE); 2) the pages the segment has used exceed 7/8 of the pages it occupies (FSEG_FILLFACTOR); 3) the segment has already used more than 32 frag pages, meaning the frag array in its inode is probably full:

· Allocate the extent containing the hint page from the tablespace (fsp_alloc_free_extent), removing it from the FSP_FREE list

· Set the extent's state to XDES_FSEG, write the seg id, and add it to the current segment's FSEG_FREE list

· Return the hint page

5. When these conditions are met: 1) direction != FSP_NO_DIR (for a B-tree split it is either FSP_UP or FSP_DOWN); 2) the unused part of the occupied space is less than 1/8, that is, the space used exceeds 7/8 of the space occupied; 3) the current segment has already used more than 32 frag pages:

· Try to get an extent from the segment (fseg_alloc_free_extent); if the segment's FSEG_FREE list is empty, an extent is allocated from the tablespace (fsp_alloc_free_extent) and added to the current segment's FSEG_FREE list

· When direction is FSP_DOWN, return the last page of the extent; when it is FSP_UP, return the first page of the extent

6. If the xdes entry belongs to the current segment and is not full, take a free page from it and return it

7. If the segment occupies more pages than it has used, meaning it still has free pages, first check whether the FSEG_NOT_FULL list has an extent that is not full; if not, check whether the FSEG_FREE list has a completely free extent. Take a free page from that extent and return it

8. If the number of pages used so far is less than 32, allocate a single page (fsp_alloc_free_page), add it to the frag array in the inode, and return the block

9. If none of the above applies, allocate an extent directly (fseg_alloc_free_extent) and return a page from it.

The above process looks complex, but can be summed up as:

1. For a new segment, the 32 frag array pages are always filled before whole extents are allocated; this makes use of fragment pages and keeps small tables from occupying too much space.

2. Try to get the hint page whenever possible;

3. If the segment has many unused pages, use the pages already on the segment as far as possible.
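The order in which these cases are tried can be sketched as a pure decision function. This Python sketch is illustrative only: `choose_allocation` and its string results are hypothetical, the inputs are simplified flags rather than real descriptors, and it assumes that cases 4 and 5 share the same fill test (the unused share of occupied pages has dropped below 1/FSEG_FILLFACTOR), which is how I read fseg_alloc_free_page_low in 5.7.

```python
FSEG_FILLFACTOR = 8     # extents are added once less than 1/8 is unused
FSEG_FRAG_LIMIT = 32    # fragment pages a segment uses before whole extents


def choose_allocation(hint_state, hint_in_segment, hint_page_free,
                      hint_extent_full, reserved, used, direction):
    """Return a label for the case that would serve the request."""
    nearly_full = (reserved - used) < reserved // FSEG_FILLFACTOR
    past_frag_limit = used >= FSEG_FRAG_LIMIT
    if hint_state == "XDES_FSEG" and hint_in_segment and hint_page_free:
        return "take hint page"
    if hint_state == "XDES_FREE" and nearly_full and past_frag_limit:
        return "allocate hint extent, take hint page"
    if direction != "FSP_NO_DIR" and nearly_full and past_frag_limit:
        return "take free extent, use first or last page"
    if hint_state == "XDES_FSEG" and hint_in_segment and not hint_extent_full:
        return "take another page in hint extent"
    if reserved > used:
        return "take unused page from segment"
    if used < FSEG_FRAG_LIMIT:
        return "allocate single fragment page"
    return "allocate new extent"


# A brand-new segment holding only its root page gets a fragment page.
print(choose_allocation("XDES_FREE_FRAG", False, False, False, 1, 1, "FSP_UP"))
```

Reading the function top to bottom gives the same priority as the summary: the hint page first, then whole extents once the segment is large and nearly full, then whatever the segment already owns, and fragment pages only while the segment is still small.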

The process above mentions two ways of allocating data pages to a segment from the tablespace: allocating a single data page, and allocating a whole extent.

A single data page is allocated from the tablespace by calling fsp_alloc_free_page:

1. If the extent containing the hint page is in the XDES_FREE_FRAG state, it can be used directly; otherwise, check the FSP_FREE_FRAG list in the header page for an available extent;

2. If no usable extent is found this way, allocate an extent directly and add it to the FSP_FREE_FRAG list;

3. In the extent obtained, find a page marked as free (XDES_FREE_BIT).

4. Allocate the page (fsp_alloc_from_free_frag)

· Set the page's XDES_FREE_BIT in the bitmap to false, meaning it is occupied;

· Increment the FSP_FRAG_N_USED field in the header page;

· If the extent is now full, remove it from FSP_FREE_FRAG and add it to the FSP_FULL_FRAG list. At the same time, FSP_FRAG_N_USED in the header page is decreased by one extent's worth of pages (FSP_FRAG_N_USED only counts the pages used in extents that are not yet full);

· Initialize the page contents (fsp_page_create).

Extents are allocated from the tablespace by function fsp_alloc_free_extent:

1. Usually, first check through the header page whether the FSP_FREE list has a free extent. If it does not, add new extents to the FSP_FREE list (fsp_fill_free_list), for example from the pages created by extending the file in step 1 above, starting at FSP_FREE_LIMIT:

· At most 4 extents are added at a time (FSP_FREE_ADD);

· If an XDES page is involved, the XDES page must also be initialized;

· If an extent contains a system management page such as an XDES page, the extent is added to the FSP_FREE_FRAG list rather than the FSP_FREE list;

· The first extent on the list is taken for the current request;

2. Remove the obtained extent from FSP_FREE and return its xdes entry (xdes_lst_get_descriptor).

Recycle page
Data pages are reclaimed in two ways: whole extents are reclaimed, or individual fragment pages are reclaimed. This happens when an index page is freed or an index is dropped.

When the data on a page has been removed, the page must be released from its segment (btr_page_free --> fseg_free_page --> fseg_free_page_low). The reclamation process is fairly simple:

1. If the page is in the segment's frag array, set the corresponding slot to FIL_NULL and return the page to the tablespace (fsp_free_page):

· Set the page's state in the xdes entry to free;

· If the extent containing the page is on the FSP_FULL_FRAG list, move it to FSP_FREE_FRAG;

· If all pages in the extent have been freed, release the extent (fsp_free_extent) and move it to the FSP_FREE list;

· Return from the function;

2. If the extent containing the page is currently on the segment's FSEG_FULL list, move it to the FSEG_NOT_FULL list;

3. Set the page's XDES_FREE_BIT in the xdes entry bitmap to true;

4. If all pages in the extent are now free, remove the extent from the FSEG_NOT_FULL list and add it to the tablespace's FSP_FREE list (not the segment's FSEG_FREE list).

Release segment
When an index or a table is dropped, its B-tree must be deleted (btr_free_if_exists): first all pages other than the root are freed (btr_free_but_not_root), and then the root page is freed (btr_free_root)

Because these operations must be recorded in the redo log, the leaf segment frees the data pages it occupies by calling fseg_free_step repeatedly, which avoids generating one very large redo log record:

1. First find the inode entry of the leaf segment (fseg_inode_try_get);

2. Then look for an extent on the FSEG_FULL, FSEG_NOT_FULL, or FSEG_FREE list of the inode entry. Note that the address stored in a list node is in fact the location of the xdes entry describing that extent, so the corresponding XDES page and in-page offset can be located quickly (xdes_lst_get_descriptor);

3. The extent can now be released safely (fseg_free_extent, see later);

4. After repeated calls to fseg_free_step have released all extents, the segment may still hold up to 32 fragment pages, which are released one by one (fseg_free_page_low);

5. Finally, once all pages occupied by the segment have been released, the inode entry itself is released:

· If the inode page holding the entry is currently full, it must be moved from FSP_SEG_INODES_FULL to FSP_SEG_INODES_FREE (updating the first page), because a slot is about to be freed;

· The seg id of the inode entry is reset to 0, meaning the entry is unused;

· If all inode entries on the inode page have been freed, the page is removed from FSP_SEG_INODES_FREE and freed.

Reclaiming the non-leaf segment is basically the same as reclaiming the leaf segment. Note, however, that the B-tree root page is stored in the first element of that segment's frag array, and this page is not freed for the time being (fseg_free_step_not_header).

The B-tree root page is released only after the steps above are complete; only then can the non-leaf segment be released completely.

Index page

The real user data in an ibd file is organized as a B-tree. When a table is created, a B-tree is built on its explicitly or implicitly defined primary key, and its leaf nodes record all column data of each row (plus the transaction ID column and the rollback segment pointer column). If a secondary index is created on the table, its leaf nodes store the index key values plus the clustered index key values. In this section we explore the physical page structure that makes up an index; uncompressed pages are assumed by default, and compressed pages are described in the next section.

Each B-tree uses two segments to manage its pages, one for leaf nodes and one for non-leaf nodes. Each segment has an entry in an inode page, and the information for both segments is recorded in the B-tree's root page.

When a table is opened, its metadata must be loaded from the data dictionary tables in ibdata. The SYS_INDEXES system table records the table, the index, and the page no of the index root page (DICT_FLD__SYS_INDEXES__PAGE_NO). Once the B-tree root page is found, the whole user data B-tree can be operated on.

The most basic page type for an index is FIL_PAGE_INDEX. An index page can be divided into the following parts.

Page Header
First, every data page, regardless of type, has a 38-byte header (FIL_PAGE_DATA, or PAGE_HEADER) containing the following information:
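As a rough illustration of a 38-byte header, the Python sketch below decodes one with the struct module. The field order used here (checksum, page number, previous page, next page, LSN, page type, flush LSN, space ID) and the big-endian encoding reflect my understanding of fil0fil.h in 5.7 and should be read as assumptions; the helper name, the sample values, and the use of 17855 as the index page type are likewise illustrative.

```python
import struct

# Assumed layout: 4+4+4+4+8+2+8+4 = 38 bytes, big-endian, no padding.
FIL_HEADER = struct.Struct(">IIIIQHQI")
FIL_NULL = 0xFFFFFFFF   # marks "no page" in the prev/next fields
FIELDS = ("checksum", "page_no", "prev", "next",
          "lsn", "type", "flush_lsn", "space_id")


def parse_fil_header(page):
    """Decode the first 38 bytes of a page into a dict of header fields."""
    return dict(zip(FIELDS, FIL_HEADER.unpack_from(page, 0)))


# Hand-built 16KB page: page 3 of space 7, no left sibling, right sibling 4.
page = FIL_HEADER.pack(0x12345678, 3, FIL_NULL, 4, 987654, 17855, 0, 7)
page = page.ljust(16 * 1024, b"\x00")
header = parse_fil_header(page)
print(header["page_no"], header["type"], header["space_id"])
```

The prev and next fields are what chain the pages of one B-tree level together, so a range scan can move between leaf pages without going back through the parent.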

Index Header
Following FIL_PAGE_DATA is the index header information, which is specific to index pages.

Segment Info
The next 20 bytes describe segment information. They are set only in the B-tree root page and are unused in other pages.

The 10-byte inode information includes:

With the above information, we can find the corresponding segment in the Inode page, and then we can manipulate the entire segment.
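A minimal Python sketch of decoding this 20-byte area follows. It assumes each 10-byte segment header is a 4-byte space ID, a 4-byte inode page number, and a 2-byte offset within that page, stored big-endian, and that the leaf segment header comes before the non-leaf one; both the ordering and the helper names are assumptions made for illustration.

```python
import struct

FSEG_HEADER = struct.Struct(">IIH")   # space id + inode page no + offset


def parse_segment_header(buf, pos=0):
    """Decode one 10-byte segment header."""
    space_id, page_no, offset = FSEG_HEADER.unpack_from(buf, pos)
    return {"space_id": space_id, "inode_page_no": page_no,
            "inode_offset": offset}


def parse_root_segments(buf, pos=0):
    """Decode the 20-byte area (assumed order: leaf, then non-leaf)."""
    return {
        "leaf": parse_segment_header(buf, pos),
        "non_leaf": parse_segment_header(buf, pos + FSEG_HEADER.size),
    }


# Both inode entries live on page 2 of space 5, at different offsets.
area = FSEG_HEADER.pack(5, 2, 242) + FSEG_HEADER.pack(5, 2, 50)
print(parse_root_segments(area))
```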

System record
Two system records follow, describing the minimum and maximum bounds of the records on the page. They are stored in one of two ways, depending on whether the page uses the old InnoDB file format (redundant) or the new one (compact).

The compact system records are stored in the following ways:

The main difference between the two formats comes from the different per-record descriptive information used by each row format. The values of the system records are already initialized when the page is created: for the old format (redundant) the corresponding code is infimum_supremum_redundant, and for the new format (compact) it is infimum_supremum_compact. The infimum record has a fixed heap no of 0, and the supremum record a fixed heap no of 1. The record before the smallest user record on the page is always infimum, and the record after the largest user record is always supremum.

For details, see the index page creation function page_create_low

User Records
The real user records follow the system records, with heap no starting from 2 (PAGE_HEAP_NO_USER_LOW). Note that heap no reflects only the physical storage order, not the order of key values.

Depending on the page, a user record is either a node pointer in a non-leaf node or, in a leaf node, a record containing the actual data. Different row formats store records differently. The redundant format used in earlier versions, for example, uses more bytes to describe a record than the current compact format; with compact, some of the column information describing a record can be obtained directly from the data dictionary. Because redundant is a format that is gradually being abandoned, this discussion assumes the compact format. The header comment in rem/rem0rec.cc describes the physical structure of a record.

Each record has a record header, described below (see include/rem0rec.ic)

The data that follows the record header differs by case:

· For a clustered index record, the data includes the transaction ID and the rollback segment pointer;

· For a secondary index record, the data contains the secondary index key value and the clustered index key value. If the secondary index key and the clustered index key share columns, only one copy of the shared columns is stored; for example, with PK (col1, col2) and secondary key (col2, col3), the secondary index record contains only (col2, col3, col1);

· For records in non-leaf pages, a clustered index record contains the minimum key value of its child node and the corresponding page no. A secondary index is different: in addition to the secondary index key value, the record also contains the clustered index key value, plus the page no, giving three parts.

Free space
This is the completely unused space between the last user record on the page and the page directory. If it is large enough, record space is normally allocated directly from here. When it is judged insufficient, the page is reorganized to merge fragmented space.

Page Directory
To speed up lookups within a page, one slot is allocated for every 4~8 user records (PAGE_DIR_SLOT_MIN_N_OWNED ~ PAGE_DIR_SLOT_MAX_N_OWNED) in record order. Each slot takes 2 bytes (PAGE_DIR_SLOT_SIZE) and stores the in-page offset of a record. The directory can be understood as a small sparse index built inside the page to support binary search.

Page directory slots are allocated in reverse order from the end of the page, starting at the eighth byte from the end. When a record is looked up, the page directory is first used to narrow down the range containing the record, and a linear scan is then performed within that range.

For the function that adds slots, see page_dir_add_slot

For the function that performs the binary search for a record within a page, see page_cur_search_with_match_bytes
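The idea of a sparse in-page index followed by a short linear scan can be shown with a simplified model. In this Python sketch the records are just a sorted list of keys, each slot stores the index of the last record in its group, and the group size is fixed at 4; real pages store byte offsets, handle the infimum and supremum records, and let group sizes vary between 4 and 8, so this is an illustration of the technique rather than of page_cur_search_with_match_bytes itself. Both function names are hypothetical.

```python
def build_directory(keys, group_size=4):
    """Return slot list: index of the last record in each group of records."""
    if not keys:
        return []
    slots = list(range(group_size - 1, len(keys), group_size))
    if not slots or slots[-1] != len(keys) - 1:
        slots.append(len(keys) - 1)     # last group may be smaller
    return slots


def search(keys, slots, target):
    """Binary search over slots, then linear scan inside one group."""
    if not slots:
        return None
    lo, hi = 0, len(slots) - 1
    while lo < hi:                      # find first group whose last key >= target
        mid = (lo + hi) // 2
        if keys[slots[mid]] < target:
            lo = mid + 1
        else:
            hi = mid
    start = slots[lo - 1] + 1 if lo > 0 else 0
    for i in range(start, slots[lo] + 1):
        if keys[i] == target:
            return i
    return None


keys = list(range(10, 110, 10))         # 10, 20, ..., 100
slots = build_directory(keys)
print(slots, search(keys, slots, 50), search(keys, slots, 55))
```

With at most 8 records per group, the linear part of the lookup stays short, while the directory costs only 2 bytes per group.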

FIL Trailer
8 bytes (FIL_PAGE_DATA_END, or FIL_PAGE_END_LSN_OLD_CHKSUM) are reserved at the end of each file page, of which 4 bytes store the page checksum. This value must match the checksum recorded in the page header; otherwise the page is considered corrupted (buf_page_is_corrupted).

Compressed index pages

InnoDB currently has two forms of compressed pages: transparent page compression and the traditional compression method. Each is described below.

Transparent Page Compression

This is a data compression method new in MySQL 5.7. It relies on the kernel's punch hole feature: before a 16KB data page is written to the file, everything except the page header is compressed, and the space left empty after compression is released with punch hole so that it takes up no space on disk (at the cost of a lot of disk fragmentation). This method achieves a better compression ratio than traditional compression, and its implementation logic is simpler.

A new page type, FIL_PAGE_COMPRESSED, is introduced for this compression method. Its storage format differs slightly, mainly in the 8 bytes starting at FIL_PAGE_FILE_FLUSH_LSN, which are used to record compression information:

The space a page actually occupies after hole punching must be an integer multiple of the disk's block size.
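The rounding rule can be sketched in a few lines of Python. The sketch assumes a 4KB file system block and a 16KB page, and it treats `compressed_len` as the total number of bytes to keep (page header plus compressed payload); the function names are hypothetical.

```python
def stored_size(compressed_len, fs_block_size=4096, page_size=16 * 1024):
    """Bytes a page still occupies on disk after hole punching."""
    blocks = -(-compressed_len // fs_block_size)    # ceiling division
    return min(blocks * fs_block_size, page_size)


def punched_bytes(compressed_len, fs_block_size=4096, page_size=16 * 1024):
    """Bytes released back to the file system for this page."""
    return page_size - stored_size(compressed_len, fs_block_size, page_size)


for n in (1, 4096, 5000, 16000):
    print(n, stored_size(n), punched_bytes(n))
```

A page that compresses to 5000 bytes still occupies two 4KB blocks, so the effective saving depends on the block size as much as on the compression ratio.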

We do not go into further detail here; see my earlier article: MySQL Community News · InnoDB Page Compression (http://mysql.taobao.org/monthly/2015/08/01/)

Traditional compressed storage format

When a table is created or altered with ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=1|2|4|8, the resulting ibd file is divided into blocks of the corresponding size. For example, with KEY_BLOCK_SIZE set to 4, the block size is 4KB.

The format of the compressed page can be described as shown in the following table:

In memory there are usually two copies of the data: the compressed page and the decompressed page. When data is modified, the decompressed page is usually modified first, and the DML operation is recorded in a special log format in the mlog (modification log) of the compressed page, in order to reduce the number of recompressions during modification. The main operations are:

· Insert: write the full record to the mlog

· Update:

  · Delete-insert update: mark the dense slot of the old record as deleted, then write the full new record

  · In-place update: write the newly updated record directly

· Delete: mark the corresponding dense slot as deleted

For page compression, see function page_zip_compress
For page decompression, see function page_zip_decompress

System data pages

Here we refer to all data pages that do not live in stand-alone tablespaces collectively as system data pages; they are mainly stored in ibdata, as shown in the figure:

InnoDB System Data page

The first three pages of ibdata are the same as in an ordinary user tablespace and are used to maintain and manage file pages. The other pages are introduced below.

FSP_IBUF_HEADER_PAGE_NO
The 4th page of ibdata is the change buffer header page, of type FIL_PAGE_TYPE_SYS, used mainly for space management of the ibuf B-tree.

FSP_IBUF_TREE_ROOT_PAGE_NO
This page stores the root page of the change buffer. The change buffer is currently stored in ibdata and is essentially a B-tree; its root page is at a fixed location, the 5th page of ibdata.

The ibuf header page and the root page together manage the ibuf data pages.

First, the ibuf B-tree maintains its own free page list. The list head is recorded in the root page at offset PAGE_BTR_IBUF_FREE_LIST, which in fact reuses the PAGE_BTR_SEG_LEAF field of an ordinary index root page. Pages on the free list have type FIL_PAGE_IBUF_FREE_LIST.

Each page on the ibuf free list reuses the PAGE_BTR_SEG_LEAF field to store its previous and next list pointers (PAGE_BTR_IBUF_FREE_LIST_NODE).

Because the segment field in the root page has been reused, an extra page, the 4th page of ibdata, is needed to manage the segment. It records the segment header of the ibuf B-tree, which points to the inode entry belonging to the ibuf B-tree.

For the construction of the ibuf B-tree, see function btr_create

FSP_TRX_SYS_PAGE_NO/FSP_FIRST_RSEG_PAGE_NO
The 6th page of ibdata records important InnoDB transaction system information, mainly including:

In version 5.7, a rollback segment can be located in ibdata, in a stand-alone undo tablespace, or in the ibtmp temporary tablespace. One possible layout is shown in the figure, which is taken from my earlier article (http://mysql.taobao.org/monthly/2015/04/01/).

InnoDB Undo rollback Segment Structure

Because the transaction system is initialized when the instance is installed, the header page of rollback segment 0 is always the 7th page of ibdata.

For transaction system creation, see function trx_sysf_create

InnoDB can create up to 128 rollback segments. Each rollback segment needs a separate page, of type FIL_PAGE_TYPE_SYS, to maintain its undo slots. It is described as follows:

For rollback segment header page creation, see function trx_rseg_header_create

The pages that actually store undo records have type FIL_PAGE_UNDO_LOG. The undo header structure is as follows:

Undo page structure and its relationship to the Rollback Segment header page See:

InnoDB Undo Page Internal structure

This article does not go into how a specific undo log is stored. For that, see my previous article: MySQL Engine Features · InnoDB Undo Log Roaming (http://mysql.taobao.org/monthly/2015/04/01/)

FSP_DICT_HDR_PAGE_NO
The 8th page of ibdata stores information about the data dictionary tables. Only after the data dictionary tables have been loaded can InnoDB use the table information stored in them to find a table's tablespace and the page no of its clustered index.

The structure of the dict_hdr page is shown in the following table:

For the creation of the dict_hdr page, see the function dict_hdr_create.
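The fixed page positions described above can be summarized in a short Python sketch. The page numbers are zero-based, so the "4th page" is page 3, and the names are the constants used in the headings. The 16KB page size is the default mentioned at the start of this article, and the helper page_offset is a hypothetical name used only for illustration.

```python
UNIV_PAGE_SIZE = 16 * 1024  # default page size; an instance may use another size

# Zero-based page numbers of the fixed system pages in ibdata.
SYSTEM_PAGES = {
    "FSP_IBUF_HEADER_PAGE_NO": 3,     # 4th page: change buffer header
    "FSP_IBUF_TREE_ROOT_PAGE_NO": 4,  # 5th page: change buffer btree root
    "FSP_TRX_SYS_PAGE_NO": 5,         # 6th page: transaction system
    "FSP_FIRST_RSEG_PAGE_NO": 6,      # 7th page: rollback segment 0 header
    "FSP_DICT_HDR_PAGE_NO": 7,        # 8th page: data dictionary header
}


def page_offset(page_no, page_size=UNIV_PAGE_SIZE):
    """Byte offset of a page inside the tablespace file (hypothetical helper)."""
    return page_no * page_size


if __name__ == "__main__":
    for name, page_no in SYSTEM_PAGES.items():
        print(f"{name}: page {page_no}, file offset {page_offset(page_no)}")
```

Reading 16KB at the printed offset of an ibdata file would return the raw bytes of the corresponding system page, provided the instance uses the default page size.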

Double Write buffer
InnoDB uses the double write buffer to guard against partial writes of data pages: before a data page is written to its data file, it is always written to the double write buffer first. During crash recovery, if a page in the data file is found to be corrupted, InnoDB tries to restore it from the dblwr.

The double write buffer is stored in ibdata, and its location can be obtained from the transaction system page (the 6th page of ibdata). It has 128 pages in total, divided into two blocks. Because the dblwr is initialized when the instance is installed, the two blocks have fixed positions in ibdata: pages 64~127 form the first block and pages 128~191 form the second block.
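The block layout above can be expressed as a small lookup. This is only a sketch of the page ranges stated in the text; the constant and function names are hypothetical and do not come from the InnoDB source.

```python
DBLWR_BLOCK1_START = 64   # first block: pages 64..127
DBLWR_BLOCK2_START = 128  # second block: pages 128..191
DBLWR_BLOCK_PAGES = 64    # 128 pages in total, split into two blocks


def dblwr_location(page_no):
    """Return (block, slot_in_block) for an ibdata page number that lies
    inside the double write buffer, or None if it lies outside it."""
    if DBLWR_BLOCK1_START <= page_no < DBLWR_BLOCK1_START + DBLWR_BLOCK_PAGES:
        return 1, page_no - DBLWR_BLOCK1_START
    if DBLWR_BLOCK2_START <= page_no < DBLWR_BLOCK2_START + DBLWR_BLOCK_PAGES:
        return 2, page_no - DBLWR_BLOCK2_START
    return None


if __name__ == "__main__":
    for page_no in (63, 64, 127, 128, 191, 192):
        print(page_no, dblwr_location(page_no))
```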

Of the 128 pages, the first 120 are used to write back dirty pages during batch flush, and the remaining 8 are used to write back dirty pages during single page flush.

External Storage Pages

For large fields, InnoDB stores the data in external pages when certain conditions are met. There are three types of external storage pages:

1. FIL_PAGE_TYPE_BLOB: a non-compressed external storage page, with the structure shown in the figure:

2. FIL_PAGE_TYPE_ZBLOB: a compressed external storage page; if there is more than one BLOB page, this type marks the first one.
3. FIL_PAGE_TYPE_ZBLOB2: if there is more than one compressed BLOB page, this type marks the subsequent pages of the BLOB chain.
The structure looks like this:

Within the record itself, only a 20-byte pointer is stored, which points to the external storage page. The pointer is described as follows:
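As a rough illustration of such a 20-byte pointer, the sketch below decodes it in Python. The field layout (4-byte space id, 4-byte page no, 4-byte offset, 8-byte length whose two highest bits are flags) is my understanding of the BTR_EXTERN_* definitions in the 5.7 source and should be checked against btr0types.h. The function and tuple names are hypothetical.

```python
import struct
from collections import namedtuple

BTR_EXTERN_FIELD_REF_SIZE = 20  # size of the in-record pointer

ExternRef = namedtuple(
    "ExternRef", "space_id page_no offset length not_owner inherited"
)


def parse_extern_ref(ref):
    """Decode a 20-byte external field reference (assumed layout, big-endian)."""
    if len(ref) != BTR_EXTERN_FIELD_REF_SIZE:
        raise ValueError("external field reference must be 20 bytes")
    space_id, page_no, offset, raw_len = struct.unpack(">IIIQ", ref)
    # Assumption: the two most significant bits of the length field are flags
    # (owner flag and inherited flag); the remaining bits hold the length.
    not_owner = bool((raw_len >> 63) & 1)
    inherited = bool((raw_len >> 62) & 1)
    length = raw_len & ((1 << 62) - 1)
    return ExternRef(space_id, page_no, offset, length, not_owner, inherited)


if __name__ == "__main__":
    sample = struct.pack(">IIIQ", 5, 100, 38, 70000)
    print(parse_extern_ref(sample))
```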

For writing to external pages, see the function btr_store_big_rec_extern_fields.

MySQL5.7 new data pages: encrypted page and R-tree page

MySQL 5.7 introduces new data pages to support tablespace encryption and to build R-tree indexes on spatial data types. This article does not discuss these data pages in depth and only describes them briefly below; two separate articles will cover them in detail.

Data encryption page
Starting from MySQL 5.7.11, InnoDB supports encrypting a single table. New page types were introduced to support this feature; three main page types were added:

· FIL_PAGE_ENCRYPTED: an encrypted ordinary data page

· FIL_PAGE_COMPRESSED_AND_ENCRYPTED: a data page that is compressed (transparent page compression) and encrypted (compressed first, then encrypted)

· FIL_PAGE_ENCRYPTED_RTREE: a GIS index R-tree data page that is encrypted

For an encrypted page, the data part is replaced with encrypted data; the rest of the page has the same structure as in most other tables.
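To tell these page types apart when inspecting a file, one can read the 2-byte page type from the FIL header of each page. The sketch below assumes that the type is stored big-endian at byte offset 24 (FIL_PAGE_TYPE) and uses the numeric values as I understand them from the 5.7 source; both should be verified against fil0fil.h for the version in use. The helper names are hypothetical.

```python
import struct

FIL_PAGE_TYPE = 24  # assumed byte offset of the 2-byte page type in the FIL header

# Assumed numeric values of the page types mentioned in this article.
PAGE_TYPES = {
    2: "FIL_PAGE_UNDO_LOG",
    4: "FIL_PAGE_IBUF_FREE_LIST",
    6: "FIL_PAGE_TYPE_SYS",
    10: "FIL_PAGE_TYPE_BLOB",
    11: "FIL_PAGE_TYPE_ZBLOB",
    12: "FIL_PAGE_TYPE_ZBLOB2",
    15: "FIL_PAGE_ENCRYPTED",
    16: "FIL_PAGE_COMPRESSED_AND_ENCRYPTED",
    17: "FIL_PAGE_ENCRYPTED_RTREE",
    17854: "FIL_PAGE_RTREE",
}

ENCRYPTED_TYPES = {15, 16, 17}


def page_type(page):
    """Return the raw page type stored in the FIL header of a page image."""
    (ptype,) = struct.unpack_from(">H", page, FIL_PAGE_TYPE)
    return ptype


def page_type_name(page):
    ptype = page_type(page)
    return PAGE_TYPES.get(ptype, f"unknown ({ptype})")


def is_encrypted(page):
    return page_type(page) in ENCRYPTED_TYPES


if __name__ == "__main__":
    demo = bytearray(16 * 1024)             # synthetic empty page image
    struct.pack_into(">H", demo, FIL_PAGE_TYPE, 15)
    print(page_type_name(demo), is_encrypted(demo))
```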

The encryption and decryption logic is similar to that of transparent compression: the data is encrypted before it is written to the file (os_file_encrypt_page --> Encryption::encrypt) and decrypted when it is read from the file (os_file_io_complete --> Encryption::decrypt).

The key information is stored in the first page of the ibd file (fsp_header_init --> fsp_header_fill_encryption_info). Executing the SQL statement ALTER INSTANCE ROTATE INNODB MASTER KEY updates the key information stored in each ibd file (fsp_header_rotate_encryption).

With a default installation, a new plugin keyring_file is installed and active by default. A new file that stores the key is generated in the installation directory, at $MYSQL_INSTALL_DIR/keyring/keyring. You can specify the location and file name of the key file with the parameter keyring_file_data. When you install multiple instances, you need to specify a different keyring file for each instance.

The syntax for enabling table encryption is simple: specify the option ENCRYPTION='Y' in CREATE TABLE or ALTER TABLE to turn encryption on, or ENCRYPTION='N' to turn it off.

For more on the InnoDB tablespace encryption feature, see the corresponding commit and the official documentation.

R-tree Index page
MySQL 5.7 introduces a new index type, R-tree, to describe the multidimensional data of spatial data types. The data page type of this kind of index is FIL_PAGE_RTREE.

For the design of R-tree, see the official WL#6968, WL#6609 and WL#6745.

Temporary tablespace ibtmp

MySQL 5.7 introduces a tablespace dedicated to temporary tables, named ibtmp1 by default. All non-compressed temporary tables are stored in this tablespace. After the system restarts, ibtmp1 is reinitialized to its default size of 12MB. You can change the initial size of ibtmp1, and whether it is allowed to autoextend, by setting the parameter innodb_temp_data_file_path. The default value is "ibtmp1:12M:autoextend".
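To make the format of this parameter value concrete, here is a minimal Python sketch that splits a value of the form name:size[:autoextend] into its parts. It deliberately ignores any further options the server may accept after autoextend, and the function name is hypothetical.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_temp_data_file_path(value):
    """Split 'name:size[:autoextend]' into (name, size_in_bytes, autoextend).

    Simplified sketch: only a single file and the K/M/G suffixes are handled.
    """
    parts = value.split(":")
    if len(parts) < 2:
        raise ValueError("expected at least 'name:size'")
    name, size = parts[0], parts[1]
    unit = size[-1].upper()
    if unit in UNITS:
        size_bytes = int(size[:-1]) * UNITS[unit]
    else:
        size_bytes = int(size)  # no suffix: plain number of bytes
    autoextend = len(parts) > 2 and parts[2].lower() == "autoextend"
    return name, size_bytes, autoextend


if __name__ == "__main__":
    print(parse_temp_data_file_path("ibtmp1:12M:autoextend"))
```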

In addition to user-defined non-compressed temporary tables, rollback segments 1 to 32, which are dedicated to temporary tables, are also stored in this file (rollback segment 0 is always stored in ibdata) (trx_sys_create_noredo_rsegs).

Log file ib_logfile

The format of the log file has been discussed a lot online, and I covered it specifically in earlier articles in this series. This section mainly introduces the changes that are new in MySQL 5.7.

The first is the change of the checksum algorithm. In the current MySQL 5.7, the parameter innodb_log_checksums turns the redo checksum on or off, but the only checksum algorithm currently supported is CRC32. Earlier versions supported only InnoDB's own, less efficient checksum algorithm.
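As an illustration of what verifying such a checksum involves, the sketch below checks one 512-byte log block. It rests on several assumptions that should be verified against the source: that the CRC32 in question is the CRC-32C (Castagnoli) variant, that it covers the first 508 bytes of the block, and that it is stored big-endian in the last 4 bytes. The bitwise CRC here is written for clarity, not speed, and the function names are hypothetical.

```python
import struct

LOG_BLOCK_SIZE = 512     # redo log block size stated earlier in this article
LOG_BLOCK_TRL_SIZE = 4   # assumed: trailer holding the 4-byte checksum


def crc32c(data):
    """Bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def log_block_checksum_ok(block):
    """Compare the stored trailer checksum with one computed over the body."""
    if len(block) != LOG_BLOCK_SIZE:
        raise ValueError("a redo log block must be 512 bytes")
    body_len = LOG_BLOCK_SIZE - LOG_BLOCK_TRL_SIZE
    (stored,) = struct.unpack_from(">I", block, body_len)
    return stored == crc32c(block[:body_len])


if __name__ == "__main__":
    body = (bytes(range(256)) * 2)[:508]           # synthetic block body
    block = body + struct.pack(">I", crc32c(body))
    print(log_block_checksum_ok(block))
```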

The second change is the introduction of version information for the redo log (WL#8845). It is stored at the head of each ib_logfile, starting from the beginning of the file header, as described below:
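A minimal sketch of reading this header from the first bytes of an ib_logfile follows. The layout used here (a 4-byte format version, 4 bytes of padding, an 8-byte start LSN, then a 32-byte NUL-padded creator string) is my recollection of the LOG_HEADER_* offsets in the 5.7 source and is an assumption to be checked against log0log.h. The function name is hypothetical.

```python
import struct

# Assumed layout: format (4 bytes), pad (4), start LSN (8), creator (32).
LOG_HEADER_STRUCT = struct.Struct(">I4xQ32s")


def parse_log_file_header(header):
    """Decode the assumed redo log file header fields from raw bytes."""
    fmt_version, start_lsn, creator = LOG_HEADER_STRUCT.unpack_from(header, 0)
    return {
        "format": fmt_version,
        "start_lsn": start_lsn,
        "creator": creator.split(b"\0", 1)[0].decode("ascii", "replace"),
    }


if __name__ == "__main__":
    # Synthetic header for demonstration; real data would come from
    # open("ib_logfile0", "rb").read(512).
    demo = LOG_HEADER_STRUCT.pack(1, 8704, b"MySQL 5.7.17") + bytes(512 - 48)
    print(parse_log_file_header(demo))
```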

The file header information is updated each time the log switches to the next ib_logfile (log_group_file_header_flush).

The new version is compatible with the old log format (recv_find_max_checkpoint_0). After upgrading to the new version, however, it is no longer possible to downgrade in place to the old version after an abnormal shutdown, unless you do a clean shutdown and remove the ib_logfile files.
