Indexing principle and table design in MySQL

Last Update:2015-12-29 Source: Internet

Author: User

Indexes are the foundation of using a database efficiently. When the data volume is very small, a full table scan may perform acceptably, but when the data volume is large and traffic is high, the data can only be accessed efficiently with the help of indexes. In general, how well the indexes are designed is the key to performance.

1. InnoDB data and index storage details

The data storage structure of MySQL with InnoDB as the storage engine is somewhat similar to a clustered index in SQL Server. Physically, everything is stored in pages, but logically we can divide the data into three regions: the data region, the index region, and the primary key region, which are associated with one another through the primary key value. By default, a page is 16 KB.

The index region of a table's tablespace holds many indexes. Each index is a B+tree in which the indexed value is the key of a B+tree node and the row's primary key is the value of the node.

In InnoDB, the table data file itself is an index structure organized as a B+tree, and the data field of each leaf node holds the complete row. The key of this index is the table's primary key, so the InnoDB table data file is itself the primary key index. This index is also called the clustered index. Because the InnoDB data file is clustered by the primary key, InnoDB requires every table to have a primary key (MyISAM does not). If none is specified explicitly, InnoDB uses the first unique index whose columns are all NOT NULL as the clustered index; if no such index exists, it generates a hidden 6-byte row ID field for the table and clusters on that.

Table data is stored as rows in 16 KB pages; each data page holds header information followed by row after row of data. The header information mainly records the primary key values of the rows in the page and their offsets, so that the location of a row can be found quickly from its primary key.
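To make the two kinds of index described in this section concrete, here is a minimal Python sketch. It is not how InnoDB is implemented: sorted lists stand in for B+trees, pages are not modeled, and the rows other than userId 10000094 and userName Fred are made up for illustration.

```python
import bisect

# Clustered index: primary key -> full row, kept in primary key order.
# (A sorted list stands in for the B+tree; real InnoDB pages are not modeled.)
clustered = [
    (10000093, {"userId": 10000093, "userName": "Alice"}),
    (10000094, {"userId": 10000094, "userName": "Fred"}),
    (10000095, {"userId": 10000095, "userName": "Bob"}),
]

# Secondary index on userName: indexed value -> primary key, in value order.
secondary_username = sorted((row["userName"], pk) for pk, row in clustered)


def find_by_primary_key(pk):
    """One search: the clustered index entry already holds the full row."""
    keys = [k for k, _ in clustered]
    i = bisect.bisect_left(keys, pk)
    if i < len(keys) and keys[i] == pk:
        return clustered[i][1]
    return None


def find_by_username(name):
    """Two searches: secondary index -> primary key -> clustered index."""
    i = bisect.bisect_left(secondary_username, (name,))
    if i < len(secondary_username) and secondary_username[i][0] == name:
        return find_by_primary_key(secondary_username[i][1])
    return None
```

The point of the sketch is only the shape of the data: the secondary index never stores the row, so a lookup through it always ends with a second lookup in the clustered index.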

2. Principles of index-optimized retrieval

Indexes are the soul of a database. Without indexes, a database is little more than a pile of text files and has little reason to exist. Indexes can improve retrieval efficiency by orders of magnitude. In a MySQL database using InnoDB as the storage engine, indexes are divided into clustered indexes (that is, the primary key) and normal indexes. The previous section explained how these two kinds of index are stored; this section explains how they work.

Both clustered and normal indexes may consist of multiple fields. Such an index is called a composite index, and section 2.3 analyzes its performance.

2.1 Clustered Index

From the previous section we know that all InnoDB data is stored in clustered index order, which makes lookups by primary key very efficient. If the selection criteria of a SQL statement include the clustered index, the database uses the clustered index first to do the retrieval.

Based on the primary key value in the condition, the database engine quickly finds the leaf node for that primary key in the B+tree, reads the data page containing that leaf node into memory, and returns the row to the user (the green line in the original figure). Let's run a SQL statement and analyze it using the database's execution log:

SELECT * from up_user where userId = 10000094;

......

# query_time:0.000399 lock_time:0.000101 rows_sent:1 rows_examined:1 rows_affected:0

# bytes_sent:803 tmp_tables:0 tmp_disk_tables:0 tmp_table_sizes:0

# INNODB_TRX_ID:1D4D

# Qc_hit: No Full_scan:no full_join:no tmp_table:no tmp_table_on_disk:no

# Filesort:no Filesort_on_disk:no merge_passes:0

# innodb_io_r_ops:0 innodb_io_r_bytes:0 innodb_io_r_wait:0.000000

# innodb_rec_lock_wait:0.000000 innodb_queue_wait:0.000000

# Innodb_pages_distinct:2

SET timestamp=1451104535;

SELECT * from up_user where userId = 10000094;

As we can see, the database accessed two pages and returned 803 bytes of data to the client.

Let's see what happens when the selection criteria include neither the primary key nor an index:

SELECT * FROM `up_user` WHERE bigportrait = '5f29e883bfa8903b';

# query_time:0.002869 lock_time:0.000094 rows_sent:1 rows_examined:1816 rows_affected:0

# bytes_sent:792 tmp_tables:0 tmp_disk_tables:0 tmp_table_sizes:0

# qc_hit:no full_scan:yes full_join:no tmp_table:no tmp_table_on_disk:no

# innodb_pages_distinct:25

You can see that with the primary key as the search condition, retrieval took about 0.3 ms and touched only two pages. Without the primary key as the search condition, retrieval took about 2.8 ms, touched 25 pages, and scanned the whole table to find the record. And this table has only about 1,800 rows; with a larger data volume the contrast would be even sharper.

Of the two pages touched by the primary key lookup, one is a primary key B+tree page and the other is the data page that holds the row with userId 10000094.

2.2 Normal Index

When a normal index is used as the search condition, two index lookups are needed: the normal index is searched first to obtain the primary key, and the primary key is then used to retrieve the record from the primary key index (the red line in the original figure).

Let's look at the database's performance in the following example:

SELECT * FROM up_user WHERE userName = 'Fred';

# query_time:0.000400 lock_time:0.000101 rows_sent:1 rows_examined:1 rows_affected:0

# bytes_sent:803 tmp_tables:0 tmp_disk_tables:0 tmp_table_sizes:0

# qc_hit:no Full_scan:no full_join:no tmp_table:no tmp_table_on_disk:no

# Innodb_pages_distinct:4

We can see that the database took 0.4 ms to retrieve this row and touched 4 pages, which is 0.1 ms and two pages more than the lookup by primary key. The two extra pages are B+tree pages of the normal index on userName.

2.3 Composite Index

Both clustered and normal indexes may consist of multiple fields. In an index made up of multiple fields, the key of each leaf node is the values of those fields concatenated in index order. The index is a single B+tree whose entries are sorted by the first field; entries with the same first-field value are sorted by the second field, and so on. For such an index, using the first field as the search condition effectively improves retrieval efficiency. A later field can only be used when every field before it is also in the search condition. Let's use an example to illustrate.

We build a composite index for testing on the up_user table, on the two fields (`nickname`, `regtime`), and test search performance below:

SELECT * FROM up_user WHERE nickname = 'Fredlong';
# query_time:0.000443 lock_time:0.000101 rows_sent:1 rows_examined:1 rows_affected:0
# bytes_sent:778 tmp_tables:0 tmp_disk_tables:0 tmp_table_sizes:0
# qc_hit:no full_scan:no full_join:no tmp_table:no tmp_table_on_disk:no
# innodb_pages_distinct:4

We can see that the index worked, with the same effect as a normal index: the query took about 0.44 ms and touched four pages.

SELECT * FROM up_user WHERE regtime = '2015-04-27 09:53:02';
# query_time:0.007076 lock_time:0.000286 rows_sent:1 rows_examined:1816 rows_affected:0
# bytes_sent:803 tmp_tables:0 tmp_disk_tables:0 tmp_table_sizes:0
# qc_hit:no full_scan:yes full_join:no tmp_table:no tmp_table_on_disk:no
# innodb_pages_distinct:26

As the execution log shows, although regtime is part of the composite index we just built, the query still does a full table scan. Because the nickname field, which comes before regtime in the composite index, does not appear in the selection criteria, the index is not used.
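The behaviour seen in these two queries can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a sorted list of tuples stands in for the composite index B+tree, and every row except the Fredlong one is made up.

```python
import bisect

# Composite index on (nickname, regtime): entries are (nickname, regtime, pk),
# sorted by nickname first and by regtime within the same nickname.
index = sorted([
    ("Fredlong", "2015-04-27 09:53:02", 3),
    ("Alice",    "2015-05-01 10:00:00", 1),
    ("Bob",      "2015-03-15 08:30:00", 2),
    ("Alice",    "2015-01-20 12:00:00", 4),
])


def search_by_nickname(nickname):
    """Leading field known: binary search lands on a contiguous run of entries."""
    lo = bisect.bisect_left(index, (nickname,))
    hits = []
    examined = 0
    for entry in index[lo:]:
        examined += 1
        if entry[0] != nickname:
            break
        hits.append(entry[2])
    return hits, examined


def search_by_regtime(regtime):
    """Leading field unknown: matches can be anywhere, so every entry is checked."""
    hits = [pk for _, rt, pk in index if rt == regtime]
    return hits, len(index)
```

Searching by nickname stops as soon as the run of matching entries ends, while searching by regtime alone has to examine every entry, which mirrors the full scan in the log above.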

So when should we use a composite index? I usually use one in two cases:

When a composite index is needed to enforce uniqueness across several fields.
When the first field of the index alone is not selective enough and a second field is needed to narrow the result further.

I have basically never built a composite index on more than three fields. If you find yourself needing one, your table design probably has problems that should be fixed at the table design level rather than by adding complex indexes. Anything complex is very likely to cause trouble.

3. Efficiency of Batch Selection

Our business often has requirements such as selecting all messages sent by a user, all replies to a post, or the userId of every user who registered yesterday. Such requests fetch a batch of rows from the database rather than a single row, and the database's processing logic for them is a little more complex. Let's analyze the various cases.

3.1 Retrieving data in bulk based on primary key

We have a table, pw_like, which stores all likes on feeds. It uses feedid and userId as a composite primary key, with feedid as the first field. The table has 19 pages of data in total. Below we select all likes for feedid 11593:

SELECT * FROM pw_like WHERE feedid = 11593;
# query_time:0.000478 lock_time:0.000084 rows_sent:58 rows_examined:58 rows_affected:0
# bytes_sent:865 tmp_tables:0 tmp_disk_tables:0 tmp_table_sizes:0
# qc_hit:no full_scan:no full_join:no tmp_table:no tmp_table_on_disk:no
# innodb_pages_distinct:2

It took about 0.48 ms to fetch 58 rows, and only 2 pages were touched, one of which is a primary key B+tree page. That means all 58 rows are stored in the same data page. This is the most efficient case for a database.

3.2 Retrieving data in bulk based on normal index

Using the same table, we also built an index on userId in addition to the primary key, because we sometimes need to query a user's likes. Let's look at how efficiently the database retrieves bulk data through the index alone, without the primary key.

SELECT * FROM pw_like WHERE userId = 80000402;
# query_time:0.002892 lock_time:0.000062 rows_sent:27 rows_examined:27 rows_affected:0
# bytes_sent:399 tmp_tables:0 tmp_disk_tables:0 tmp_table_sizes:0
# qc_hit:no full_scan:no full_join:no tmp_table:no tmp_table_on_disk:no
# innodb_pages_distinct:15

The result shows that although we fetched only 27 rows, we touched 15 pages and spent about 2.9 ms. We did not do a full table scan, but we still read most of the table's 19 pages. Because the data in pw_like is physically sorted by feedid, these 27 rows are spread across 13 data pages (the other two pages belong to the index and the primary key), so the database has to read all 13 pages. Even if a 16 KB page holds only one matching row (15 bytes), the whole page must be read to extract the target row.

Retrieving bulk data through a normal index is significantly less efficient than retrieving it by primary key, because the rows are scattered and have to be gathered from separate pages. Even so, an index that supplements the primary key is still very necessary. In the example above, when the number of users reaches 1 million, retrieving all of one user's likes would still cost only about 15 pages and about 2 ms.
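The difference between sections 3.1 and 3.2 comes down to how many distinct pages the matching rows sit on. The following Python sketch counts pages for both access patterns. The table contents and the capacity of 100 rows per page are assumptions chosen for illustration, not measurements from pw_like.

```python
ROWS_PER_PAGE = 100  # assumed page capacity, for illustration only

# Hypothetical data: 50 feeds with 40 likes each, primary key (feedid, userid).
rows = sorted((feedid, userid) for feedid in range(50) for userid in range(40))

# Rows are laid out in primary key order, so row i lives on page i // ROWS_PER_PAGE.
page_of = {row: i // ROWS_PER_PAGE for i, row in enumerate(rows)}


def pages_touched(matching_rows):
    """Number of distinct data pages that must be read to fetch these rows."""
    return len({page_of[row] for row in matching_rows})


likes_of_feed = [r for r in rows if r[0] == 5]  # adjacent in primary key order
likes_by_user = [r for r in rows if r[1] == 7]  # one row per feed, scattered

print(len(likes_of_feed), pages_touched(likes_of_feed))
print(len(likes_by_user), pages_touched(likes_by_user))
```

In this toy layout the 40 likes of one feed share a single page, while the 50 likes of one user are spread over all 20 data pages, which is the same pattern the two logs show.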

3.3 Retrieving data over a time range

Selecting a range of data is a problem we run into often. Retrieving a range from a table with a large amount of data can easily cause performance problems. The most common requirements are the following:

Select the users who registered during a period of time
In this case, time is certainly not the primary key of the user table, and retrieving directly by time without an index would be very inefficient. How do we solve this? My approach is to index the registration time on the user table. For each retrieval, first look up the userId at each end of the time range, and then use those two userIds as the bounds of a third query that returns the data we want (this relies on userId increasing with registration time). These three queries need to read only about 10 pages. The approach looks like a hassle, but it is the only solution once the table holds hundreds of millions of rows.
Select the logs for a period of time
The log table is one of our most common tables, and how to design it is a frequent topic. Log tables are hard to design well because you want to use time as the primary key so that retrieving logs over a period is convenient. However, using time as the primary key has a big drawback: when logs are inserted very quickly, duplicate primary keys cause conflicts.
In this case, I generally make the log generation time and an auto-increment ID the composite primary key of the log table, with the time as the first field. This avoids unique key conflicts when logs are inserted too fast, and retrieval by time is still convenient. Below is an example of retrieval from such a log table; you can see that performance is very good. In addition, it is better to create one log table per day, which makes management and retrieval more convenient.

SELECT * from log_test where LogTime > "2015-12-27 11:53:05" and LogTime < "2015-12-27 12:03:05";
# query_time:0.001158 lock_time:0.000084 rows_sent:599 rows_examined:599 rows_affected:0
# bytes_sent:4347 tmp_tables:0 tmp_disk_tables:0 tmp_table_sizes:0
# qc_hit:no Full_scan:no full_join:no tmp_table:no tmp_table_on_disk:no
# Innodb_pages_distinct:3
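The log table design described above can be sketched in Python as follows. This is a simplified model, not the actual log_test table: a sorted list stands in for the clustered index, and the sample timestamps are invented.

```python
import bisect

# Log rows keyed by (logtime, id): id is an auto-increment tiebreaker, so two
# rows written in the same second still get distinct primary keys.
log = []
next_id = 1
for logtime in ["2015-12-27 11:50:00", "2015-12-27 11:55:00",
                "2015-12-27 11:55:00", "2015-12-27 12:00:00",
                "2015-12-27 12:05:00"]:
    log.append((logtime, next_id))
    next_id += 1
log.sort()  # stands in for the clustered index order


def range_scan(start, end):
    """Rows with start < logtime < end, located by two binary searches."""
    lo = bisect.bisect_right(log, (start, float("inf")))
    hi = bisect.bisect_left(log, (end,))
    return log[lo:hi]
```

Because the time is the first field of the key, the rows of any time range are contiguous, so the scan reads only the slice between the two search positions.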

4. Efficiency of Sorting

When retrieving data in bulk, in most cases we need the results to be ordered. Sorting performance is something that needs attention every day. Below we analyze, case by case, how the database sorts.

ORDER BY on the primary key is what the database likes best. There is basically no additional performance loss, because the data is already stored in primary key order and can be returned directly.

All of the following sorting examples are experiments on the up_messagehistory table, which has 195 pages and 35,417 rows in total, with the primary key on the id field and indexes on both senduserid and destuserid.

First, let's do an unordered search:

SELECT * FROM `CU`.`up_messagehistory` WHERE senduserid = ...;
# query_time:0.016135 lock_time:0.000084 rows_sent:3572 rows_examined:3572 rows_affected:0
# bytes_sent:95600 tmp_tables:0 tmp_disk_tables:0 tmp_table_sizes:0
# qc_hit:no full_scan:no full_join:no tmp_table:no tmp_table_on_disk:no
# innodb_pages_distinct:125

Then we sort by id under the same condition:

SELECT * FROM `CU`.`up_messagehistory` WHERE senduserid = ... ORDER BY id;
# query_time:0.016259 lock_time:0.000086 rows_sent:3572 rows_examined:3572 rows_affected:0
# bytes_sent:95600 tmp_tables:0 tmp_disk_tables:0 tmp_table_sizes:0
# qc_hit:no full_scan:no full_join:no tmp_table:no tmp_table_on_disk:no
# innodb_pages_distinct:125

As the data above shows, performance is almost the same as without ORDER BY, with no additional loss. Next we sort by an indexed field:

SELECT * FROM `CU`.`up_messagehistory` WHERE senduserid = ... ORDER BY destuserid;
# query_time:0.018107 lock_time:0.000083 rows_sent:3572 rows_examined:7144 rows_affected:0
# bytes_sent:103123 tmp_tables:0 tmp_disk_tables:0 tmp_table_sizes:0
# qc_hit:no full_scan:no full_join:no tmp_table:no tmp_table_on_disk:no
# innodb_pages_distinct:125

Next we sort by an ordinary string field:

SELECT * FROM `CU`.`up_messagehistory` WHERE senduserid = ... ORDER BY content;
# query_time:0.023611 lock_time:0.000085 rows_sent:3572 rows_examined:7144 rows_affected:0
# bytes_sent:105214 tmp_tables:0 tmp_disk_tables:0 tmp_table_sizes:0
# qc_hit:no full_scan:no full_join:no tmp_table:no tmp_table_on_disk:no
# innodb_pages_distinct:125

Then we sort by an ordinary numeric field:

SELECT * FROM `CU`.`up_messagehistory` WHERE senduserid = ... ORDER BY senttime;
# query_time:0.018522 lock_time:0.000107 rows_sent:3572 rows_examined:7144 rows_affected:0
# bytes_sent:95709 tmp_tables:0 tmp_disk_tables:0 tmp_table_sizes:0
# qc_hit:no full_scan:no full_join:no tmp_table:no tmp_table_on_disk:no
# innodb_pages_distinct:

From the experimental results above, we can draw the following conclusions:

1. Sorting by the primary key incurs no performance loss;
2. An index does not help with sorting when the indexed field is not in the selection criteria;
3. Sorting on a numeric field is more efficient than sorting on a string field (about 18.5 ms versus 23.6 ms in this test).

Next we examine whether the database optimizes sorting when the ORDER BY field is an indexed field that also appears in the selection criteria. We use the up_user table for this.

SELECT * FROM up_user WHERE score > 10000;
# query_time:0.001470 lock_time:0.000130 rows_sent:122 rows_examined:122 rows_affected:0
# bytes_sent:9559 tmp_tables:0 tmp_disk_tables:0 tmp_table_sizes:0
# qc_hit:no full_scan:no full_join:no tmp_table:no tmp_table_on_disk:no
# innodb_pages_distinct:17

We then sort by the indexed field used in the selection criteria:

SELECT * FROM up_user WHERE score > 10000 ORDER BY score;
# query_time:0.001407 lock_time:0.000087 rows_sent:122 rows_examined:122 rows_affected:0
# bytes_sent:9559 tmp_tables:0 tmp_disk_tables:0 tmp_table_sizes:0
# qc_hit:no full_scan:no full_join:no tmp_table:no tmp_table_on_disk:no
# innodb_pages_distinct:17

Then we sort by a non-indexed numeric field that is not in the selection criteria:

SELECT * FROM up_user WHERE score > 10000 ORDER BY `securityquestion`;
# query_time:0.002017 lock_time:0.000104 rows_sent:122 rows_examined:244 rows_affected:0
# bytes_sent:9657 tmp_tables:0 tmp_disk_tables:0 tmp_table_sizes:0
# qc_hit:no full_scan:no full_join:no tmp_table:no tmp_table_on_disk:no
# innodb_pages_distinct:17

Judging by the execution times of the three queries above, sorting by the indexed field takes about the same time as not sorting at all, while sorting by the ordinary field takes longer and examines twice as many rows. So we can draw a fourth conclusion:

4. For sorting on an indexed field that appears in the selection criteria, the index optimizes the sort.
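Why the index helps in one case and not the other can be sketched in Python. The rows below are hypothetical, and a sorted list stands in for the index on score; the sketch only shows that a range scan over an index yields rows already in index order, while ordering by any other field needs an extra pass over the matched rows.

```python
import bisect

# Hypothetical rows: (userId, score, securityquestion).
rows = [(1, 12000, 3), (2, 9000, 1), (3, 15000, 2), (4, 11000, 9), (5, 500, 4)]
by_id = {row[0]: row for row in rows}

# Secondary index on score: (score, userId) kept in score order.
score_index = sorted((score, uid) for uid, score, _ in rows)


def select_order_by_score(min_score):
    """Range scan on the index: rows come out already ordered by score."""
    lo = bisect.bisect_right(score_index, (min_score, float("inf")))
    return [by_id[uid] for _, uid in score_index[lo:]]


def select_order_by_question(min_score):
    """Same range scan, followed by an extra sort pass on a non-indexed field."""
    matched = select_order_by_score(min_score)
    return sorted(matched, key=lambda row: row[2])
```

The extra pass in the second function corresponds to rows_examined doubling from 122 to 244 in the log above.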

5. Index maintenance

As we saw earlier, all primary keys and indexes are kept in sorted order, and keeping that order consumes resources. Each time new data is inserted or old data is changed, the order has to be adjusted, which costs performance. We analyze this below.

When an auto-increment field is the primary key, the database's maintenance cost for the primary key is very low:

Each new value is the maximum and is appended at the end, so no other data needs to move;
This kind of value is generally never modified.

Primary key maintenance costs are higher when a business field is used as the primary key. Every newly inserted row may force other data to be repositioned.

Because InnoDB stores the primary key together with the row data, moving a primary key also means moving the data, so the maintenance cost is comparatively high. For tables that are written frequently, using a business field as the primary key is not recommended.

Since the primary key is the value stored in the leaf nodes of every index and is also the basis for the physical ordering of the data, modifying a primary key value means modifying all related indexes and reordering the primary key B+tree, which is very costly. Avoid updating primary keys frequently.

UPDATE up_user SET userId = 100000945 WHERE userId = 10000094;
# query_time:0.010916 lock_time:0.000201 rows_sent:0 rows_examined:1 rows_affected:1
# bytes_sent:59 tmp_tables:0 tmp_disk_tables:0 tmp_table_sizes:0
# qc_hit:no full_scan:no full_join:no tmp_table:no tmp_table_on_disk:no
# innodb_pages_distinct:11

As the data above shows, the SQL statement modifies only one row but touches 11 pages.
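The reason one changed row touches so many pages can be sketched in Python. The dictionaries below are a hypothetical stand-in for the clustered index and two secondary indexes of a tiny table; the index names and the nickname value are assumptions, not the real definition of up_user.

```python
# Clustered index and two secondary indexes for a tiny hypothetical table.
clustered = {10000094: {"userName": "Fred", "nickname": "Fredlong"}}
idx_username = {"Fred": 10000094}      # indexed value -> primary key
idx_nickname = {"Fredlong": 10000094}


def update_primary_key(old_pk, new_pk):
    """Re-key the row and rewrite every secondary index entry pointing to it.
    Returns the number of index structures that had to change."""
    row = clustered.pop(old_pk)
    clustered[new_pk] = row
    touched = 1  # the clustered index itself
    for index, column in ((idx_username, "userName"), (idx_nickname, "nickname")):
        index[row[column]] = new_pk
        touched += 1
    return touched
```

Every secondary index stores the primary key as its value, so a primary key change fans out to all of them in addition to moving the row in the clustered index.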

Compared with the primary key, a normal index is much lighter: its leaf node value is just the primary key, so its maintenance cost is low. Even so, putting too many indexes on one table is not recommended. Maintaining one index is cheap, but maintaining eight is not necessarily cheap, so this needs to be balanced.

6. Index Design principles

An index is a double-edged sword: used well, it gets twice the result for half the effort; used badly, it gets half the result for twice the effort.

Unless there is a special reason, the primary key field should be a numeric type, which needs fewer computing resources to sort and less space to store. If the primary key value is large, every index on the table becomes less efficient, because the leaf node values of all indexes are primary keys.
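A back-of-the-envelope calculation shows the effect of primary key size on a normal index. The Python sketch below is only a rough upper bound: it ignores page headers, record headers and fill factor, and the 20-byte indexed value, the 8-byte integer key and the 36-byte string key are assumed sizes chosen for illustration.

```python
PAGE_SIZE = 16 * 1024  # InnoDB default page size in bytes


def entries_per_page(indexed_value_bytes, primary_key_bytes):
    """Rough upper bound on index entries per page (no headers, no fill factor)."""
    return PAGE_SIZE // (indexed_value_bytes + primary_key_bytes)


# A 20-byte indexed value paired with an 8-byte integer key
# versus the same value paired with a 36-byte string key.
with_bigint_pk = entries_per_page(20, 8)
with_string_pk = entries_per_page(20, 36)
print(with_bigint_pk, with_string_pk)
```

Under these assumptions the wider key roughly halves the number of entries that fit in a page, so the same index needs about twice as many pages.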

Not every index helps queries. MySQL optimizes a query based on the data in the table, and when an indexed column contains many duplicate values, the query may not use the index. For example, if a table has a sex field whose values are split roughly half male and half female, building an index on sex does nothing for query efficiency.
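One simple way to judge a column before indexing it is to compute its selectivity, the number of distinct values divided by the number of rows. The Python sketch below uses made-up columns for a hypothetical 1,000-row table; it illustrates the idea and is not how MySQL's optimizer gathers statistics.

```python
def selectivity(values):
    """Distinct values divided by total rows: 1.0 means unique, near 0 means
    each value matches a large share of the table."""
    return len(set(values)) / len(values)


sex_column = ["male", "female"] * 500  # hypothetical 1,000-row table
userid_column = list(range(1000))

print(selectivity(sex_column))
print(selectivity(userid_column))
```

A column like sex has a selectivity close to zero, so each value still matches about half the table and an index on it saves little work.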

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
