Clustered Indexes in Detail

Source: Internet
Author: User

I. The structure of the clustered index

A clustered index is not a separate index type, but a way of storing data. The details vary between implementations, but InnoDB's clustered index actually holds the B-tree index and the data rows in the same structure. When a table has a clustered index, its data rows are stored in the leaf pages of the index. The term "clustered" means that data rows with adjacent key values are stored compactly together. Because a data row cannot be stored in two places at once, a table can have only one clustered index.

Because the storage engine is responsible for implementing indexes, not all storage engines support clustered indexes. Here we focus primarily on InnoDB, but the principles discussed here are applicable to any storage engine that supports clustered indexes.

The following shows how the records in a clustered index are stored. Notice that the leaf pages contain all the data for each row, whereas the internal node pages contain only the indexed columns.


InnoDB clusters data by primary key, which means that the "indexed column" here is the primary key column. If no primary key is defined, InnoDB chooses a unique, non-nullable index instead. If there is no such index, InnoDB implicitly defines a hidden primary key and uses it as the clustered index. InnoDB clusters records together only within a page; pages that contain adjacent keys can be far apart from each other.
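
The selection rule above can be sketched in a few lines of Python. This is only an illustration of the rule, not InnoDB code: the `Index` class and `choose_clustering_key` function are hypothetical names, and the table is described by a plain list of indexes plus a set of nullable columns. `GEN_CLUST_INDEX` is the name InnoDB gives to the hidden clustered index it generates.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Index:
    """Hypothetical description of one index on a table."""
    name: str
    columns: List[str]
    unique: bool = False
    primary: bool = False


def choose_clustering_key(indexes: List[Index], nullable_columns: Set[str]) -> str:
    """Return the name of the index the table would be clustered on."""
    # 1. An explicit primary key always wins.
    for idx in indexes:
        if idx.primary:
            return idx.name
    # 2. Otherwise, the first unique index whose columns are all NOT NULL.
    for idx in indexes:
        if idx.unique and not any(c in nullable_columns for c in idx.columns):
            return idx.name
    # 3. Otherwise, a hidden index on an internally generated row ID.
    return "GEN_CLUST_INDEX"
```

For example, a table with a unique index on a nullable `nick` column and a unique index on a NOT NULL `email` column would be clustered on the `email` index, because the first candidate fails the NOT NULL requirement.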

Clustered primary keys can help performance, but they can also cause serious performance problems. You should therefore think carefully about clustering, especially when you change a table's storage engine from InnoDB to another engine or vice versa.

II. Advantages of the clustered index

    • You can keep related data together. For example, when implementing a mailbox, you can cluster the data by user ID, so that all of a user's messages can be retrieved by reading only a small number of data pages from disk. Without a clustered index, fetching each message may require its own disk I/O.
    • Data access is faster. A clustered index keeps the index and the data in the same B-tree, so retrieving rows from a clustered index is usually faster than a comparable lookup through a nonclustered index.
    • Queries that use covering index scans can use the primary key values stored in the leaf nodes directly.

III. Disadvantages of the clustered index

    • Clustering gives the largest improvement for I/O-bound workloads. If the data fits entirely in memory, the order in which it is accessed matters little, so clustering offers little benefit.
    • Insert speed depends heavily on insertion order. Inserting rows in primary key order is the fastest way to load data into an InnoDB table. If you do not load the data in primary key order, it is best to reorganize the table with OPTIMIZE TABLE after loading.
    • Updating the clustered index columns is expensive, because it forces InnoDB to move each updated row to a new location.
    • Tables built on a clustered index are subject to page splits when new rows are inserted, or when a primary key update forces a row to move. When a row's primary key value dictates that the row must go into a page that is already full, the storage engine splits that page in two to make room; this is a page split. Page splits can cause the table to use more disk space.
    • Clustered tables can be slower for full table scans, especially if rows are sparsely packed or stored non-sequentially because of page splits.
    • Secondary indexes (nonclustered indexes) may be larger than expected, because their leaf nodes contain the primary key columns of the rows they reference.
    • Secondary index access requires two index lookups instead of one.

Note: why does a secondary index require two index lookups? The answer lies in the nature of the "row pointers" that the secondary index stores. Keep in mind that a secondary index leaf node does not store a pointer to the row's physical location; it stores the row's primary key value. This means that to find a row through a secondary index, the storage engine first finds the leaf node in the secondary index to obtain the primary key value, and then uses that value to find the row in the clustered index. This is double work: two B-tree lookups instead of one. In InnoDB, the adaptive hash index can help reduce this penalty.
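
The double lookup described in the note can be illustrated with a toy model. The sketch below is an assumption-laden simplification: `ToyBTree` is a hypothetical stand-in that uses a sorted list and binary search instead of a real B-tree, and the `email` column and sample rows are invented for the example. It only shows that a secondary index maps a key to a primary key value, which is then looked up in the clustered index.

```python
import bisect


class ToyBTree:
    """Sorted key -> value map standing in for a B-tree; counts lookups."""

    def __init__(self, items):
        pairs = sorted(items, key=lambda kv: kv[0])
        self.keys = [k for k, _ in pairs]
        self.values = [v for _, v in pairs]
        self.lookups = 0

    def find(self, key):
        self.lookups += 1
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None


# Clustered index: primary key -> full row (the row lives in the leaf).
clustered = ToyBTree([
    (1, {"id": 1, "email": "ann@example.com", "name": "Ann"}),
    (2, {"id": 2, "email": "bob@example.com", "name": "Bob"}),
    (3, {"id": 3, "email": "cy@example.com", "name": "Cy"}),
])

# Secondary index: email -> primary key value, not a physical row address.
by_email = ToyBTree(
    [(row["email"], pk) for pk, row in zip(clustered.keys, clustered.values)]
)


def find_by_email(email):
    pk = by_email.find(email)   # lookup 1: secondary index yields the primary key
    if pk is None:
        return None
    return clustered.find(pk)   # lookup 2: clustered index yields the row
```

Calling `find_by_email("bob@example.com")` increments the lookup counter of both structures once, which is the "two lookups instead of one" cost. A query that needs only `email` and `id` could stop after the first lookup, which is why covering index scans avoid the second step.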

IV. Comparison of InnoDB and MyISAM data layout

Clustered and nonclustered data layouts differ, and so do the layouts of the corresponding primary key and secondary indexes.

1. MyISAM primary key index and secondary indexes

MyISAM's data layout is very simple: MyISAM stores rows on disk in the order in which they were inserted. In the figure, the row number is shown next to each row, counting up from 0. Because the rows are fixed-length, MyISAM can find any row by skipping the required number of bytes from the beginning of the table. This layout makes it easy to build an index. Moreover, in MyISAM there is no structural difference between the primary key index and other indexes: the primary key index is simply a unique, non-nullable index named PRIMARY. For example:

1, MyISAM data row layout


2, MyISAM primary key index layout


3, Layout of other MyISAM indexes
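
The fixed-length row lookup described above boils down to one multiplication: offset = row number × row length. The sketch below shows the idea with Python's `struct` module. The row format (a 4-byte integer plus an 8-byte padded name) and the helper names are assumptions made for illustration; this is not the real MyISAM data file format.

```python
import struct

# Hypothetical fixed-length row: 4-byte little-endian int + 8-byte padded name.
ROW = struct.Struct("<i8s")


def build_table(rows):
    """Append rows in insertion order, as in a heap-style data file."""
    return b"".join(ROW.pack(rid, name.encode("ascii")) for rid, name in rows)


def read_row(data, row_number):
    """Jump straight to a row: offset = row_number * row_length."""
    offset = row_number * ROW.size
    rid, raw = ROW.unpack_from(data, offset)
    return rid, raw.rstrip(b"\x00").decode("ascii")
```

With rows `(99, "ann")`, `(12, "bob")`, `(3000, "cy")` stored in that order, `read_row(data, 2)` skips 24 bytes and returns `(3000, "cy")` without scanning the earlier rows. An index over such a file only needs to store the row number next to each key.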

2. InnoDB primary key index and secondary indexes

Because InnoDB supports clustered indexes, it stores the same data in a very different way, as shown here:


Look closely and you will notice that the figure shows the entire table, not just the index. Because the clustered index is the table in InnoDB, there is no separate row storage as there is in MyISAM. Each leaf node of the clustered index contains the primary key value, the transaction ID and rollback pointer used for transactions and MVCC, and all remaining columns. If the primary key is defined on a column prefix, InnoDB also includes the full column along with the remaining columns.

Another difference from MyISAM is that InnoDB's secondary indexes are very different from its clustered index. The leaf nodes of an InnoDB secondary index contain not a "row pointer" but the primary key value, which serves as the pointer to the row. This strategy reduces the work needed to maintain secondary indexes when rows move or a data page is split. Using the primary key value as the pointer makes secondary indexes larger, but in exchange InnoDB can move a row without updating the pointers to it in the secondary indexes. The following shows InnoDB's secondary index:


3. Comparison of MyISAM and InnoDB


V. Inserting rows into an InnoDB table in primary key order

If you are using an InnoDB table and have no data that needs to be clustered, you can define a surrogate key as the primary key. Its values should be unrelated to the application's data, and the simplest way to achieve this is an AUTO_INCREMENT column. This ensures that rows are written in sequential order and gives better performance for joins that use the primary key.

It is best to avoid random clustering keys, especially for I/O-bound workloads. For example, from a performance perspective, using a UUID as the clustering key is a poor choice: it makes inserts into the clustered index completely random, which is the worst case, and leaves the data without any useful clustering. In testing, inserting rows into a table with a UUID primary key takes longer, and the resulting index is larger. This is partly because the primary key column is longer, and partly, no doubt, because of page splits and the resulting fragmentation.

When primary key values are sequential, InnoDB stores each record immediately after the previous one. When a page reaches its maximum fill factor (InnoDB's default maximum fill factor is 15/16 of the page size, which leaves some room for later modifications), the next record goes into a new page. Once data has been loaded in this order, the primary key pages are packed nearly full with sequential records, which is the desired result.

When data is inserted into a table whose clustered index is on a UUID, the new row's primary key value is not necessarily larger than the previous one, so InnoDB cannot simply append the new row to the end of the index. Instead it has to find the appropriate place for the row, usually somewhere in the middle of the existing data, and make room for it. This adds a lot of extra work and results in a suboptimal data layout. The drawbacks are summarized below:

    • The target page may already have been flushed to disk and evicted from the cache, or may never have been loaded into the cache, so InnoDB has to find it and read it from disk into memory before it can insert the row. This causes a lot of random I/O.
    • Because the writes arrive out of order, InnoDB has to split pages frequently to make room for new rows. A page split moves a lot of data, and an insert that triggers one has to modify at least three pages instead of one.
    • Because of the frequent page splits, pages become sparsely and irregularly filled, so the final data is fragmented.
    • After loading such random values into a clustered index, you should run OPTIMIZE TABLE to rebuild the table and fill the pages optimally.
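
The contrast between sequential and random inserts can be reproduced with a small simulation. The sketch below is a toy model under stated assumptions, not InnoDB's algorithm: pages are plain sorted lists with a fixed capacity of 16 keys, only the leaf level is modeled, a full page in the middle is split in half, and an insert past the end of the last page simply opens a new page. Shuffled integers stand in for UUIDs, and the names `load` and `fill_ratio` are hypothetical.

```python
import bisect
import random


def load(keys, capacity=16):
    """Insert keys into a toy leaf level; return (pages, split_count)."""
    pages = [[]]                 # each page is a sorted list of keys
    splits = 0
    for key in keys:
        # Find the page whose key range should hold this key.
        firsts = [p[0] for p in pages[1:]]
        i = bisect.bisect_right(firsts, key)
        page = pages[i]
        if len(page) < capacity:
            bisect.insort(page, key)
        elif i == len(pages) - 1 and key > page[-1]:
            # Appending past the end of the last page: open a new page.
            pages.append([key])
        else:
            # Full page elsewhere: split it in half, then insert the key.
            mid = len(page) // 2
            left, right = page[:mid], page[mid:]
            bisect.insort(left if key < right[0] else right, key)
            pages[i:i + 1] = [left, right]
            splits += 1
    return pages, splits


def fill_ratio(pages, capacity=16):
    """Fraction of the allocated slots that actually hold a key."""
    return sum(len(p) for p in pages) / (len(pages) * capacity)


sequential = list(range(1600))
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)   # fixed seed, stands in for UUID order

seq_pages, seq_splits = load(sequential)
rnd_pages, rnd_splits = load(shuffled)
```

In this model the sequential load performs no splits and leaves every page full, while the shuffled load splits pages repeatedly and ends up with more, partly empty pages. The model ignores the 15/16 fill factor and the parent and sibling pages that a real split also touches, so it understates the cost of random inserts.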

Note: sequential primary keys also have a drawback. Under highly concurrent workloads, inserting in primary key order in InnoDB can cause significant contention. The upper end of the primary key becomes a hotspot: because all inserts take place there, concurrent inserts may contend for gap locks. Another hotspot may be the AUTO_INCREMENT locking mechanism; if you run into this problem, consider redesigning the table or application, or changing the innodb_autoinc_lock_mode setting.
