MySQL indexes
A clustered index is not a separate index type but a way of storing data. When a table has a clustered index, its rows are actually stored in the index's leaf pages. The term "clustered" refers to the fact that rows with adjacent key values are stored close to each other. Because a row cannot be stored in two places at once, a table can have only one clustered index.
The storage of clustered indexes is as follows:
Note that the leaf pages contain all the data of each row, whereas the node pages contain only the indexed columns. In the figure, the indexed column holds integer values.
In InnoDB, the clustered index is the primary key by default. If the table defines no primary key, InnoDB uses a unique index whose columns are all NOT NULL instead. If there is no such index either, InnoDB implicitly defines a hidden primary key and clusters on it. InnoDB clusters records together only within a page; pages that hold adjacent key values may be far apart from each other.
Advantages of clustered indexes:
Disadvantages of clustered indexes:
Why do secondary index lookups require two index searches? Because a leaf node of a secondary index stores not a pointer to the row's physical location but the row's primary key value. To find a row through a secondary index, the storage engine first searches the secondary index for the leaf node and reads the primary key value stored there, and then uses that value to search the clustered index for the row. That is double work: two B-Tree searches instead of one.
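The double lookup can be illustrated with a toy model. The following Python sketch is not InnoDB code: it assumes sorted lists searched with `bisect` as stand-ins for the two B-Trees, and the names `clustered_rows`, `secondary_entries` and `find_by_col2` are hypothetical.

```python
from bisect import bisect_left

# Stand-in for the clustered index: sorted by primary key (col1),
# and each leaf entry holds the full row.
clustered_keys = [1, 2, 3, 4]
clustered_rows = [
    {"col1": 1, "col2": 30},
    {"col1": 2, "col2": 10},
    {"col1": 3, "col2": 20},
    {"col1": 4, "col2": 10},
]

# Stand-in for the secondary index on col2: sorted (col2, primary key) pairs.
secondary_entries = sorted((row["col2"], row["col1"]) for row in clustered_rows)


def find_by_col2(value):
    """Return (rows, searches): the matching rows and the number of index searches."""
    searches = 1  # first search: locate the value in the secondary index
    pos = bisect_left(secondary_entries, (value,))
    rows = []
    while pos < len(secondary_entries) and secondary_entries[pos][0] == value:
        primary_key = secondary_entries[pos][1]
        searches += 1  # second search: locate the row in the clustered index
        row_pos = bisect_left(clustered_keys, primary_key)
        rows.append(clustered_rows[row_pos])
        pos += 1
    return rows, searches
```

In this model, `find_by_col2(20)` performs one search in the secondary index and one in the clustered index. A lookup by primary key would need only the second of the two.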
Comparison of InnoDB and MyISAM data layout
Clustered and non-clustered indexes lay out data differently, and so do primary key indexes and secondary indexes. The following table serves as the example:
CREATE TABLE layout_test (
    col1 int NOT NULL,
    col2 int NOT NULL,
    PRIMARY KEY (col1),
    KEY (col2)
);
Assume the table is populated with primary key values from 1 to 10,000, inserted in random order and then optimized with OPTIMIZE TABLE. In other words, the data is arranged optimally on disk, but the rows may be in random order. The values of col2 are assigned randomly between 1 and 100, so there are many duplicates.
MyISAM's data layout. MyISAM's layout is very simple: it stores the rows on disk in the order in which they were inserted, as shown in Figure 5-4:
Figure 5-4 shows row numbers next to the rows, counting upward from 0. Because the rows are fixed-length, MyISAM can find any row by skipping the required number of bytes from the beginning of the table. (MyISAM does not always use the "row numbers" shown in Figure 5-4; it uses different strategies depending on whether the rows are fixed-length or variable-length.)
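As a rough illustration of the fixed-length case, the byte offset of a row is simply its row number multiplied by the row length. The sketch below uses Python's `struct` module with a made-up row format of two integers. It is not the actual MyISAM on-disk format, and `ROW_FORMAT`, `data_file` and `read_row` are hypothetical names.

```python
import struct

# Assumed format, for illustration only: two 4-byte little-endian ints (col1, col2).
ROW_FORMAT = "<ii"
ROW_SIZE = struct.calcsize(ROW_FORMAT)  # 8 bytes per row

# Sample rows in insertion order; the position in this list is the "row number".
rows = [(99, 8), (12, 56), (3000, 62)]
data_file = b"".join(struct.pack(ROW_FORMAT, *row) for row in rows)


def read_row(row_number):
    """Skip row_number * ROW_SIZE bytes from the start of the data, then decode one row."""
    offset = row_number * ROW_SIZE
    return struct.unpack_from(ROW_FORMAT, data_file, offset)
```

Here `read_row(2)` jumps straight to byte offset 16 without reading the rows before it, which is why a plain row number is enough to act as a row pointer when every row has the same length.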
This layout makes it easy to build an index. The following figures abstract away physical details such as pages and show only the "nodes" in the index. Each leaf node in the index contains a "row number". Figure 5-5 shows the table's primary key.
Figure 5-6 shows the index on col2:
As Figure 5-6 shows, the index on col2 is no different from any other index. In fact, in MyISAM there is no structural difference between the primary key index and other indexes: the primary key is simply a unique, non-nullable index named PRIMARY.
InnoDB's data layout. Because InnoDB supports clustered indexes, it stores the same data very differently. InnoDB stores the table as shown in Figure 5-7.
As you can see, each leaf node in InnoDB's clustered index contains the primary key value, the transaction ID and rollback pointer that InnoDB uses for transactions and MVCC, and all the remaining columns (in this example, col2). If the primary key is defined on a column prefix, InnoDB stores the full column value along with the remaining columns.
Another difference from MyISAM is that InnoDB's secondary indexes are very different from its clustered index. Instead of storing a "row pointer", the leaf nodes of an InnoDB secondary index store the primary key value, which serves as the "pointer" to the row.
This strategy reduces the work needed to maintain secondary indexes when rows move or a data page is split. Using the primary key value as the pointer makes the secondary index larger, but it means InnoDB can move a row without updating the "pointers" to it in the secondary indexes.
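The maintenance saving can be made concrete with a small model. The Python sketch below compares the two pointer strategies when a row is relocated. It is a simplification rather than either engine's real code, and the function `relocate` and the `pointer_kind` values are hypothetical.

```python
def relocate(row_locations, secondary_indexes, primary_key, new_location, pointer_kind):
    """Move one row and return how many secondary index entries had to be rewritten.

    pointer_kind is "location" (entries store the row's physical location) or
    "primary_key" (entries store the primary key value, as described for InnoDB).
    Each secondary index is a list of (indexed_value, pointer) pairs.
    """
    old_location = row_locations[primary_key]
    row_locations[primary_key] = new_location
    if pointer_kind == "primary_key":
        return 0  # the primary key did not change, so no entry needs rewriting
    rewritten = 0
    for index in secondary_indexes:
        for i, (value, pointer) in enumerate(index):
            if pointer == old_location:
                index[i] = (value, new_location)
                rewritten += 1
    return rewritten
```

With two secondary indexes that each hold one entry for the moved row, the "location" strategy rewrites two entries while the "primary_key" strategy rewrites none. The price is that every entry stores a full primary key value instead of a compact location.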
Figure 5-8 shows the table's col2 index. Each leaf node contains the indexed column (col2) followed by the primary key value (col1).
Figure 5-9 is an abstract diagram of how InnoDB and MyISAM each arrange the table, which makes it easy to see how differently the two engines store data and indexes.
Inserting rows in primary key order with InnoDB
It is best to avoid random clustered keys (nonsequential values spread over a very large range), especially for I/O-bound workloads; an auto-incrementing integer ID is a good choice instead. For example, using UUID values as the clustered key performs very poorly: it makes insertion into the clustered index completely random, which is the worst case, and the data ends up with no useful clustering.
Next we benchmark two cases. The first inserts into a userinfo table with an integer ID, defined as follows:
CREATE TABLE `userinfo` (
    `id` int unsigned NOT NULL AUTO_INCREMENT,
    `name` varchar(64) NOT NULL DEFAULT '',
    `email` varchar(64) NOT NULL DEFAULT '',
    `password` varchar(64) NOT NULL DEFAULT '',
    `dob` date DEFAULT NULL,
    `address` varchar(255) NOT NULL DEFAULT '',
    `city` varchar(64) NOT NULL DEFAULT '',
    `state_id` tinyint unsigned NOT NULL DEFAULT '0',
    `country_id` smallint unsigned NOT NULL DEFAULT '0',
    PRIMARY KEY (`id`),
    UNIQUE KEY `email` (`email`),
    KEY `country_id` (`country_id`),
    KEY `state_id` (`state_id`),
    KEY `state_id_2` (`state_id`, `city`, `address`)
) ENGINE = InnoDB;
The second case is the userinfo_uuid table. It is identical to the userinfo table above, except that its primary key is a UUID instead of an integer.
CREATE TABLE `userinfo_uuid` (
    `uuid` varchar(36) NOT NULL,
    ...
);
With the two test tables created, we first insert 1 million rows into each of them. We then insert 3 million rows into the same tables, so that the indexes grow larger than the server's memory. The results are as follows:
Note that inserting rows with a UUID primary key not only takes longer but also produces larger indexes. Part of this is due to the longer primary key, and part is due to page splits and the fragmentation they cause.
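The effect of the longer key on index size can be estimated with simple arithmetic, because every secondary index entry carries a copy of the primary key. The Python sketch below is a back-of-the-envelope estimate only. It assumes single-byte characters for the UUID text and ignores the varchar length byte, per-row headers and page overhead; `primary_key_copies_bytes` is a hypothetical helper.

```python
import uuid

INT_KEY_BYTES = 4                        # storage size of a MySQL INT UNSIGNED
UUID_KEY_BYTES = len(str(uuid.uuid4()))  # 36 characters, as stored in varchar(36)
SECONDARY_INDEXES = 4                    # email, country_id, state_id, state_id_2


def primary_key_copies_bytes(rows, key_bytes, secondary_indexes=SECONDARY_INDEXES):
    """Bytes spent on primary key values alone: one copy in the clustered
    index plus one copy in each secondary index entry."""
    return rows * key_bytes * (1 + secondary_indexes)


int_bytes = primary_key_copies_bytes(1_000_000, INT_KEY_BYTES)    # 20,000,000
uuid_bytes = primary_key_copies_bytes(1_000_000, UUID_KEY_BYTES)  # 180,000,000
```

Under these assumptions the UUID table spends nine times as many bytes on primary key values for the same million rows, before any fragmentation is taken into account.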
Figure 5-10 shows what happens to the index as rows are inserted into the first table.
As Figure 5-10 shows, because the primary key values are sequential, InnoDB stores each record immediately after the previous one. When a page reaches its maximum fill factor (InnoDB's default maximum fill factor is 15/16 of the page size, which leaves room for later modifications), the next record goes into a new page. Once the data has been loaded in this sequential fashion, the primary key pages are packed nearly full with in-order records (the secondary index pages are likely to look different).
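To put a number on the fill factor, the sketch below works out how many bytes of a page a sequential load uses before moving on to a new page. It assumes InnoDB's default 16 KB page size and, as a simplification, ignores page headers and per-row overhead. The 100-byte row size is an arbitrary example, and `rows_per_page` is a hypothetical helper.

```python
PAGE_SIZE = 16 * 1024  # InnoDB's default page size, in bytes
MAX_FILL = 15 / 16     # maximum fill factor quoted in the text


def rows_per_page(row_size, page_size=PAGE_SIZE, fill=MAX_FILL):
    """Rows that fit before a sequential insert starts a new page (overhead ignored)."""
    usable_bytes = int(page_size * fill)  # 15360 bytes for the default page size
    return usable_bytes // row_size


rows_of_100_bytes = rows_per_page(100)  # 153 rows
```

The remaining 1/16 of the page (1,024 bytes here) stays free, so rows that later grow can often be updated in place without splitting the page.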
Figure 5-11 shows what happens to the index as rows are inserted into the UUID table.
As Figure 5-11 shows, because the primary key value of a new row is not necessarily larger than that of the previous one, InnoDB cannot always place the new row at the end of the index. It has to find a suitable place for the row, usually somewhere in the middle of the existing data, and make room for it. This causes a lot of extra work and results in a suboptimal data layout. The drawbacks are summarized below:
After loading such random values into a clustered index, you may need to run OPTIMIZE TABLE to rebuild the table and fill the pages optimally.
The lesson from this case is that with InnoDB you should strive to insert data in primary key order, and use a clustering key whose values increase monotonically for each new row.
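The difference between the two insertion patterns can be reproduced with a small simulation. The Python sketch below models only the leaf level of an index as a list of sorted pages. The 16-row page capacity, the split-in-half policy and the names `load` and `average_fill` are assumptions made for illustration, not InnoDB's actual algorithm.

```python
import random
from bisect import bisect_right, insort

PAGE_CAPACITY = 16  # rows per page; deliberately tiny for the simulation


def load(keys, capacity=PAGE_CAPACITY):
    """Insert keys one by one into sorted leaf pages; return (pages, split_count)."""
    pages = [[]]
    splits = 0
    for key in keys:
        # Find the page whose key range covers the new key (linear work, fine for a toy).
        first_keys = [page[0] for page in pages[1:]]
        page_no = bisect_right(first_keys, key)
        page = pages[page_no]
        is_append = page_no == len(pages) - 1 and (not page or key > page[-1])
        if is_append and len(page) >= capacity * 15 // 16:
            pages.append([key])  # last page reached 15/16 fill: start a new page
            continue
        insort(page, key)
        if len(page) > capacity:
            # Overflow in the middle of the index: split the page in half.
            middle = len(page) // 2
            pages[page_no:page_no + 1] = [page[:middle], page[middle:]]
            splits += 1
    return pages, splits


def average_fill(pages, capacity=PAGE_CAPACITY):
    """Fraction of the available row slots that are occupied."""
    return sum(len(page) for page in pages) / (len(pages) * capacity)


sequential_keys = list(range(1, 2001))  # arrival order of auto-increment values
random_keys = sequential_keys[:]
random.Random(42).shuffle(random_keys)  # stand-in for UUID-like arrival order

seq_pages, seq_splits = load(sequential_keys)
rnd_pages, rnd_splits = load(random_keys)
```

In this model the sequential load never splits a page and leaves every full page with 15 of its 16 slots used. The shuffled load splits pages repeatedly and typically ends with noticeably more pages that are each less full, which mirrors the larger index reported for the UUID table.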
References:
High Performance MySQL, 3rd Edition