Deep Understanding of the MySQL Clustered Index

Source: Internet
Author: User

A clustered index is not a separate index type but a way of storing data. When a table has a clustered index, its rows are actually stored in the index's leaf pages. The term "clustered" means that rows with adjacent key values are stored close to each other. Because a row cannot be stored in two places at once, a table can have only one clustered index.

The storage of clustered indexes is as follows:

Note that the leaf pages contain the full rows, while the node pages contain only the indexed columns. In this figure, the indexed column holds integer values.

By default, InnoDB clusters the data by the primary key. If the table defines no primary key, InnoDB uses a unique non-nullable index instead. If no such index exists, InnoDB implicitly defines a hidden primary key and clusters on it. InnoDB clusters records together only within a page; pages that hold adjacent key values may be far apart from each other.

Advantages of clustered index:

  1. Related data can be kept together. For example, when implementing a mailbox you can cluster by user ID, so that fetching all of a user's mail requires reading only a few pages from disk. Without clustering, each message may cost its own disk I/O.
  2. Data access is faster. A clustered index holds both the index and the data in a single B-Tree, so retrieving rows through the clustered index is usually faster than a comparable lookup in a non-clustered index.
  3. Queries that use covering index scans can use the primary key values stored in the leaf nodes directly.
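
The first advantage can be illustrated with a small simulation. This is only a sketch under stated assumptions: the page capacity (ROWS_PER_PAGE), the number of users and messages, and the helper pages_touched are all hypothetical, and a "page" here is just a run of consecutive positions in a Python list, not a real InnoDB page.

```python
import random

ROWS_PER_PAGE = 50  # hypothetical page capacity, chosen only for illustration


def pages_touched(rows, user_id):
    """Count the distinct pages that hold rows of user_id, given rows in storage order."""
    return len({pos // ROWS_PER_PAGE
                for pos, (uid, _mail_id) in enumerate(rows)
                if uid == user_id})


random.seed(0)
# 100 users with 20 messages each, arriving in random order.
arrival_order = [(uid, mail) for uid in range(100) for mail in range(20)]
random.shuffle(arrival_order)

unclustered_layout = arrival_order          # rows stored in arrival order
clustered_layout = sorted(arrival_order)    # rows stored in (user_id, mail_id) order

print("pages read, arrival order:", pages_touched(unclustered_layout, 42))
print("pages read, clustered:    ", pages_touched(clustered_layout, 42))
```

With the clustered layout, the 20 messages of one user sit in consecutive positions and therefore span at most two pages; in arrival order they are scattered over many more.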

Disadvantages of clustered index:

  1. Clustering gives the largest improvement for I/O-bound workloads. If all the data fits in memory, the order in which it is accessed matters much less, so clustering offers little benefit.
  2. Insert speed depends heavily on insertion order. Inserting rows in primary key order is the fastest way to load data into an InnoDB table. If the data was not loaded in primary key order, it is best to reorganize the table with OPTIMIZE TABLE after loading.
  3. Updating the clustered index columns is expensive, because InnoDB is forced to move each updated row to a new location.
  4. Tables built on a clustered index may suffer "page splits" when new rows are inserted or when a primary key update forces a row to move. When a row's key value dictates that it must go into a page that is already full, the storage engine splits that page into two to make room for the row. Page splits can cause a table to occupy more disk space.
  5. Clustered indexes can slow down full table scans, especially when rows are sparsely packed or stored non-contiguously because of page splits.
  6. Secondary indexes (non-clustered indexes) may be larger than expected, because the leaf node of the secondary index contains the primary key column of the referenced row.
  7. Secondary index access requires two index searches instead of one.

Why do secondary indexes require two index searches? Because a leaf node of a secondary index stores not a pointer to the row's physical location but the row's primary key value. To find a row through a secondary index, the storage engine first finds the leaf node in the secondary index to obtain the primary key value, and then uses that value to find the row in the clustered index. In other words, it performs two B-Tree searches instead of one.
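
The two-step lookup can be sketched in a few lines of Python. This is an illustrative model only: sorted lists searched with bisect stand in for the two B-Trees, the sample rows are invented, and find_by_col2 is a hypothetical helper rather than anything InnoDB exposes.

```python
import bisect

# Clustered index: full rows (col1, col2), kept sorted by the primary key col1.
clustered = [(1, 30), (2, 10), (3, 20), (4, 10), (5, 30)]

# Secondary index on col2: sorted (col2, col1) pairs. Each entry carries the
# primary key value, not a physical row location.
secondary = sorted((col2, col1) for col1, col2 in clustered)


def find_by_col2(value):
    """Return (rows, searches) for rows with col2 == value.

    searches counts one lookup in the secondary index plus one lookup in the
    clustered index for every matching entry.
    """
    rows = []
    # Search 1: locate the first entry with col2 >= value in the secondary index.
    i = bisect.bisect_left(secondary, (value,))
    searches = 1
    while i < len(secondary) and secondary[i][0] == value:
        pk = secondary[i][1]
        # Search 2: use the primary key value to locate the row in the clustered index.
        pos = bisect.bisect_left(clustered, (pk,))
        searches += 1
        rows.append(clustered[pos])
        i += 1
    return rows, searches


print(find_by_col2(20))
```

A query that matches a single row costs two searches here; a query on the primary key itself would need only the second one.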

Data Distribution comparison between InnoDB and MyISAM

The data distribution of clustered indexes and non-clustered indexes is different, and the data distribution of primary key indexes and secondary indexes is also different. Use the following table for testing:

CREATE TABLE layout_test (
    col1 int NOT NULL,
    col2 int NOT NULL,
    PRIMARY KEY (col1),
    KEY (col2)
);

Assume that the table's primary key values range from 1 to 10,000, that the rows were inserted in random order, and that the table was then optimized with OPTIMIZE TABLE. In other words, the data is laid out optimally on disk, but the rows are in random order. The values of col2 are assigned randomly between 1 and 100, so there are many duplicates.

Data distribution of MyISAM. MyISAM's data layout is very simple: rows are stored on disk in the order in which they were inserted.

As Figure 5-4 shows, a row number is displayed next to each row, increasing from 0. Because the rows are fixed-size, MyISAM can find any row by skipping the required number of bytes from the beginning of the table. (MyISAM does not always use the "row numbers" shown in Figure 5-4; it uses different strategies depending on whether the rows are fixed-size or variable-size.)
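
The "skip the required bytes" idea is simple offset arithmetic: offset = row number × row size. The sketch below shows it with Python's struct module. The row layout (two 4-byte integers), the sample values, and the read_row helper are assumptions made for illustration; this is not MyISAM's actual on-disk format.

```python
import struct

ROW_FORMAT = "<ii"                       # hypothetical fixed-size row: col1, col2 as 4-byte ints
ROW_SIZE = struct.calcsize(ROW_FORMAT)   # bytes per row

# Rows in insertion order, packed back to back like a fixed-size data file.
rows = [(99, 8), (12, 56), (3000, 62)]
data_file = b"".join(struct.pack(ROW_FORMAT, *row) for row in rows)


def read_row(row_number):
    """Fetch a row by skipping row_number * ROW_SIZE bytes from the start."""
    offset = row_number * ROW_SIZE
    return struct.unpack_from(ROW_FORMAT, data_file, offset)


print(ROW_SIZE, read_row(1))
```

This shortcut works only because every row has the same size, which is why variable-size rows need a different strategy.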

This layout makes it easy to build an index. The following figures hide the physical details of the pages and show only the "nodes" of the index. Each leaf node in the index contains a "row number". Figure 5-5 shows the table's primary key.

Figure 5-6 shows the index on col2:

As Figure 5-6 shows, the index on col2 is no different from any other index. In fact, in MyISAM the primary key index has the same structure as every other index: it is simply a unique, non-nullable index named PRIMARY.

Data distribution of InnoDB. Because InnoDB supports clustered indexes, it stores the same data in a very different way. Figure 5-7 shows how InnoDB stores the table.

As you can see, each leaf node in InnoDB's clustered index contains the primary key value, the transaction ID and rollback pointer (used for transactions and MVCC), and all the remaining columns (in this example, col2). If the primary key is defined on a column prefix, InnoDB stores the full column value along with the remaining columns.

Another difference from MyISAM is that InnoDB's secondary indexes are very different from its clustered index. A leaf node of an InnoDB secondary index does not store a "row pointer"; it stores the primary key value and uses that as the "pointer" to the row.

This strategy reduces the work needed to maintain secondary indexes when rows move or data pages split. Using the primary key value as the pointer makes the secondary index larger, but in exchange InnoDB does not need to update the "pointer" in the secondary index when it moves a row.
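
The trade-off can be made concrete with a toy model. In the sketch below, a slot number in a Python list stands in for a physical row location, and the two build_* helpers are hypothetical names for the two designs (entries that point at a location versus entries that hold the primary key). It only illustrates the idea and does not reproduce either engine's real structures.

```python
def build_pointer_index(slots):
    """Secondary index whose entries point at a physical slot: (col2, slot)."""
    return sorted((row[1], slot) for slot, row in enumerate(slots) if row is not None)


def build_pk_index(slots):
    """Secondary index whose entries hold the primary key: (col2, col1)."""
    return sorted((row[1], row[0]) for row in slots if row is not None)


# Rows are (col1, col2); the last slot is free.
slots = [(1, 30), (2, 10), (3, 20), None]
pointer_before = build_pointer_index(slots)
pk_before = build_pk_index(slots)

# Move the row with col1 == 2 from slot 1 to slot 3, as a page split might.
slots[3], slots[1] = slots[1], None

pointer_after = build_pointer_index(slots)
pk_after = build_pk_index(slots)

print("location-based entries changed:", pointer_before != pointer_after)
print("primary-key entries changed:   ", pk_before != pk_after)
```

Only the location-based entries have to be rewritten after the move; the primary-key-based entries stay valid because the row's key did not change.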

Figure 5-8 shows the table's col2 index. Each leaf node contains the indexed column (col2) followed by the primary key value (col1).

Figure 5-9 is an abstract diagram of how InnoDB and MyISAM store the table. It makes it easy to see how differently the two engines lay out data and indexes.

 

Insert rows in the order of primary keys in the InnoDB table

It is best to avoid clustering on random values (non-sequential values spread over a very wide range), especially for I/O-bound workloads; an auto-incrementing integer ID is a good alternative. For example, using a UUID as the clustered index key can be very bad: it makes inserts into the clustered index completely random, which is the worst case, and the data ends up with no useful clustering at all.

Next we benchmark two tables. The first, userinfo, uses an integer ID as its primary key:

CREATE TABLE `userinfo` (
    `id` int unsigned NOT NULL AUTO_INCREMENT,
    `name` varchar(64) NOT NULL DEFAULT '',
    `email` varchar(64) NOT NULL DEFAULT '',
    `password` varchar(64) NOT NULL DEFAULT '',
    `dob` date DEFAULT NULL,
    `address` varchar(255) NOT NULL DEFAULT '',
    `city` varchar(64) NOT NULL DEFAULT '',
    `state_id` tinyint unsigned NOT NULL DEFAULT '0',
    `country_id` smallint unsigned NOT NULL DEFAULT '0',
    PRIMARY KEY (`id`),
    UNIQUE KEY `email` (`email`),
    KEY `country_id` (`country_id`),
    KEY `state_id` (`state_id`),
    KEY `state_id_2` (`state_id`, `city`, `address`)
) ENGINE=InnoDB;

The second table is userinfo_uuid. It is identical to userinfo except that its primary key is a UUID.

CREATE TABLE `userinfo_uuid` (
    `uuid` varchar(36) NOT NULL,
    ...
);

With both test tables created, we first insert 1 million records into each table, and then insert 3 million records into each so that the index size exceeds the server's memory capacity. The result is as follows:

Note that inserting rows with a UUID primary key not only takes longer but also produces a larger index. This is partly because the primary key is longer, and partly because of page splits and the fragmentation they cause.

Figure 5-10 shows the index changes when data is inserted into the first table.

As Figure 5-10 shows, because the primary key values are sequential, InnoDB stores each record immediately after the previous one. When a page reaches its maximum fill factor (by default InnoDB fills pages to 15/16 of their size, leaving some room for later modifications), the next record is written to a new page. Once the data has been loaded this way, the primary key pages are packed nearly full with sequential records (the secondary index pages may look different).

Figure 5-11 shows the index changes when data is inserted into the UUID table.

As Figure 5-11 shows, because the primary key value of a new row is not necessarily larger than the previous one, InnoDB cannot simply place the new row at the end of the index. Instead, it has to find a suitable place for the row, usually somewhere in the middle of the existing data, and make room for it. This causes a lot of extra work and leaves the data laid out suboptimally. The drawbacks can be summarized as follows:

  1. The target page may already have been flushed to disk and evicted from the cache, or may never have been loaded into the cache, so InnoDB has to find it and read it from disk into memory before it can insert the row. This leads to a large amount of random I/O.
  2. Because the writes arrive out of order, InnoDB has to split pages frequently to make room for new rows. Page splits can move a lot of data, and a single insert has to modify at least three pages instead of one.
  3. Because of the frequent page splits, pages become sparse and irregularly filled, so the resulting data is fragmented.
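
The effect of insertion order on page splits can be reproduced with a small simulation. It is a deliberately simplified model, not InnoDB's algorithm: a page is a sorted Python list, PAGE_CAPACITY and FILL_LIMIT are hypothetical numbers chosen to mimic the 15/16 fill factor, a full page is split in half, and shuffled integers stand in for UUID-like random keys.

```python
import bisect
import random

PAGE_CAPACITY = 16   # hypothetical maximum number of rows per page
FILL_LIMIT = 15      # stop appending to a page at 15/16 full


def load(keys):
    """Insert keys one at a time and return (page_count, split_count)."""
    pages = [[]]     # every page is a sorted list of keys
    splits = 0
    for key in keys:
        last = pages[-1]
        if not last or key > last[-1]:
            # The key sorts after everything stored so far: append at the end,
            # starting a new page once the last one reaches the fill limit.
            if len(last) >= FILL_LIMIT:
                pages.append([key])
            else:
                last.append(key)
            continue
        # Otherwise the key belongs somewhere in the middle: find its page.
        first_keys = [page[0] for page in pages]
        idx = max(bisect.bisect_right(first_keys, key) - 1, 0)
        page = pages[idx]
        bisect.insort(page, key)
        if len(page) > PAGE_CAPACITY:
            # The page overflowed: split it into two half-full pages.
            mid = len(page) // 2
            pages[idx:idx + 1] = [page[:mid], page[mid:]]
            splits += 1
    return len(pages), splits


n = 5000
seq_pages, seq_splits = load(range(n))

random.seed(0)
shuffled = list(range(n))
random.shuffle(shuffled)
rand_pages, rand_splits = load(shuffled)

print("sequential keys:", seq_pages, "pages,", seq_splits, "splits")
print("random keys:    ", rand_pages, "pages,", rand_splits, "splits")
```

In this model, sequential keys never trigger a split and leave every page filled to the limit, while random keys split pages repeatedly and end up using more, partially filled pages for the same number of rows.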

After loading such random values into a clustered index, you may need to run OPTIMIZE TABLE to rebuild the table and fill its pages optimally.

The lesson from this case is that with InnoDB you should insert data in primary key order whenever possible, and use a monotonically increasing clustering key for new rows.
