MySQL Index Design Overview

Last Update:2017-12-06 Source: Internet

Author: User



Designing an index in a relational database is not complicated, yet many developers assume that because an index can improve database performance, the knowledge involved must be very complex.

That idea is not correct. An index is not a mysterious thing: once we grasp a method and understand how indexes are implemented, we can design efficient indexes even without a DBA.

This article introduces some of the index design methods from the book Database Index Design and Optimization, so that readers can quickly design appropriate indexes in existing projects.

Disk IO

A database must ensure that all the data stored in it can be read and written at any time. Because all data in MySQL is ultimately stored on disk as files, and random access to data on disk is time-consuming, the database program and the operating system provide a buffer pool and memory to speed up data access.

In addition, we need to know that the database does not read data in units of rows. Whether it reads one row or several, the pages containing those rows are loaded first, and the corresponding records are then read from the pages.

In MySQL, the page size is generally 16KB, but it can also be 8KB, 32KB, or other values, depending on how the MySQL storage engine stores data. Whether an index or row record is already in the buffer pool greatly affects the cost of accessing it.

Random Read

The cost of a database waiting for a page to be read from disk into the buffer pool is huge. Whether we want a single record on the page or several, the read takes about 10ms.

10ms is a very large cost in computing. Suppose a script writes bytes sequentially to an SSD: within 10ms it can write about 3MB of content, but in the same 10ms the database program can load only one page into the buffer pool. From this we can see that the cost of a random read is huge.

The 10ms estimate assumes about 50 random reads per second: roughly 3ms of waiting time, about 6ms of actual disk busy time, and 1ms to transfer the page from the disk to the buffer pool. When estimating a query we do not need to know the exact random read time; knowing that it is roughly 10ms is enough.
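The 10ms estimate is just the sum of the three components quoted above. The following Python sketch reproduces that arithmetic; the constant and function names are illustrative and are not part of MySQL.

```python
# Components of one random page read, in milliseconds, as quoted in the text.
WAIT_MS = 3.0       # time spent waiting before the disk serves the request
DISK_BUSY_MS = 6.0  # time the disk is actually busy
TRANSFER_MS = 1.0   # time to move the page from the disk to the buffer pool


def random_read_ms() -> float:
    """Estimated cost of loading one page with a random read."""
    return WAIT_MS + DISK_BUSY_MS + TRANSFER_MS


def random_reads_cost_ms(page_count: int) -> float:
    """Estimated cost of loading page_count pages, one random read each."""
    return page_count * random_read_ms()


if __name__ == "__main__":
    print(random_read_ms())            # one page
    print(random_reads_cost_ms(1000))  # a query that touches 1000 pages
```

Under these figures, a query that needs 1000 random reads already spends about 10 seconds waiting for the disk.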

Memory Read

If the corresponding data page is not found in the buffer pool of the database, the page is then looked up in memory.

When the page exists in memory, the database program uses the in-memory page, which reduces the read time by an order of magnitude, from 10ms to 1ms. When performing a read, MySQL first looks in the database buffer pool; if the page is not there, it tries to load the page from memory; only if both steps fail does it perform a random IO to fetch the page from disk.
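The three-step lookup order can be sketched as a small cost model. This is a simplified illustration, not MySQL's implementation: the two caches are modeled as plain Python sets, and the cost of a buffer pool hit is assumed to be negligible because the text gives no figure for it.

```python
BUFFER_POOL_HIT_MS = 0.0  # assumption: not stated in the text, treated as negligible
MEMORY_READ_MS = 1.0      # figure quoted in the text
RANDOM_READ_MS = 10.0     # figure quoted in the text


def read_page(page_id, buffer_pool, memory_cache):
    """Return (source, cost_ms) for a page, following the lookup order above."""
    if page_id in buffer_pool:
        return "buffer pool", BUFFER_POOL_HIT_MS
    if page_id in memory_cache:
        return "memory", MEMORY_READ_MS
    # Both lookups failed, so a random IO against the disk is required.
    return "disk", RANDOM_READ_MS


if __name__ == "__main__":
    buffer_pool = {"page-1"}
    memory_cache = {"page-2"}
    for page in ("page-1", "page-2", "page-3"):
        print(page, read_page(page, buffer_pool, memory_cache))
```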

Sequential Read

Reading data from disk is not always expensive. When the database manager reads a large amount of data sequentially from disk, the read speed is very fast, around 40MB/s.

If the page size is 4KB, 10,000 pages can be read in one second, so the average time to read a page is 0.1ms. Compared with the 10ms of a random read this is two orders of magnitude faster, and it is even faster than reading the page from memory.
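The per-page figure follows directly from the throughput and the page size. The sketch below repeats the calculation in Python, assuming decimal units (1MB = 1000KB), which is what makes the numbers come out at 10,000 pages per second; the function names are illustrative.

```python
def pages_per_second(throughput_kb_per_s: float, page_kb: float) -> float:
    """How many pages a sequential scan can read in one second."""
    return throughput_kb_per_s / page_kb


def ms_per_page(throughput_kb_per_s: float, page_kb: float) -> float:
    """Average time to read one page sequentially, in milliseconds."""
    return 1000.0 / pages_per_second(throughput_kb_per_s, page_kb)


if __name__ == "__main__":
    throughput = 40 * 1000  # 40MB/s expressed in KB/s (decimal units assumed)
    page_size = 4           # 4KB pages, as in the text
    print(pages_per_second(throughput, page_size))
    print(ms_per_page(throughput, page_size))
```

With a 16KB page, the same throughput gives 2,500 pages per second, or 0.4ms per page, which is still far below the 10ms of a random read.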

Sequential reading of data pages has two very important advantages:

Reading multiple pages at once greatly reduces the total time consumed, and the throughput of the disk can reach 40MB/s;
The database manager prefetches pages that are about to be used, which reduces the wait and response time of query requests;

Summary

Most of the time of a database query is spent reading data from disk or memory. Because random IO is so expensive, reducing the number of random IOs in a query greatly reduces the time spent on the query and improves the throughput of the disk.

Query Process

In the previous section, the article described the impact of disk IO on MySQL queries from the point of view of page loading. In this section we describe how the characteristics of the data in the database affect the performance of a MySQL query during its execution.

Index Slices

An index slice is the fragment of an index scanned by a SQL query during execution; the index entries in this range are scanned sequentially. Depending on the columns an index contains, the book Database Index Design and Optimization divides indexes into wide indexes and narrow indexes.

The primary key column id is guaranteed to exist in every MySQL (InnoDB) index.

For the query SELECT id, username, age FROM users WHERE username = "draven", (id, username) is a narrow index because it does not contain the age column that appears in the query, while (id, username, age) is a wide index for the query because it contains all the columns the query needs.

A wide index avoids the secondary random IO, whereas with a narrow index, after the index has been read sequentially, the corresponding rows must still be looked up in the primary key index by primary key id.

For a narrow index, each row matched in the index eventually requires an additional random read to fetch the remaining data from the clustered index. If the result set is very large, this leads to too many random reads and hurts performance.
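The difference between the two kinds of index can be expressed as a count of extra random reads. The following Python sketch is a simplified model rather than anything MySQL exposes: an index is treated as wide when it contains every column the query references, and the 10ms cost per random read is the estimate used earlier in this article.

```python
RANDOM_READ_MS = 10.0  # rough cost of one random read, as estimated earlier


def is_wide_index(index_columns, query_columns) -> bool:
    """An index is wide for a query if it contains every column the query uses."""
    return set(query_columns) <= set(index_columns)


def extra_random_reads(matched_rows: int, index_columns, query_columns) -> int:
    """Clustered-index lookups needed after scanning the index slice."""
    if is_wide_index(index_columns, query_columns):
        return 0         # everything is answered from the index itself
    return matched_rows  # one lookup per matched index entry


if __name__ == "__main__":
    query_columns = ["id", "username", "age"]
    for index in (("id", "username"), ("id", "username", "age")):
        reads = extra_random_reads(1000, index, query_columns)
        print(index, reads, reads * RANDOM_READ_MS, "ms")
```

In this model, 1000 matched rows cost about 10 seconds of random reads with the narrow index and none with the wide one.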

Filter Factor

From the previous section on index slices, we can see that the performance of a SQL query depends not only on the query itself but also on the characteristics of the data in the table. If a narrow index is used, random access to the table is inevitable, so what we need to do is make the index slice as thin as possible.

The size of the index slice scanned by a SQL query is determined by the filter factor, which is the proportion of records that satisfy the query condition out of all records.

For the users table, sex = "male" does not give a good filter factor: it selects half of the rows in the table, so in general we had better not use the sex column as the first column of an index. name = "draven" gives a better filter factor, since it filters out 99.9% of the rows in the table. Of course, we can also combine these conditions with one on age, create a new index (name, age, sex), and use all three columns for filtering.

When all three filters are equality predicates, the order of the indexed columns does not matter, and it does not affect which index is chosen for the same SQL statement. In other words, the indexes (name, age, sex) and (age, sex, name) are equivalent for this condition and have exactly the same effect when the query is executed.

The filter factor of the combined condition can reach 6 in 100,000. If the table has 100,000 rows, only 6 random reads are needed after scanning the thin index slice. Computing the combined filter factor as a direct product has one important caveat: the columns should not be too strongly correlated. If the columns are correlated, the actual result will contain more rows than the product suggests. For example, city and zip code are very strongly correlated, so multiplying their two filter factors deviates greatly from the actual filter factor. In most cases, however, this is not a big problem.
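The product rule can be checked with a few lines of Python. In this sketch the filter factors for sex (one half) and name (99.9% filtered out) come from the text; the value 0.12 for age is an assumption chosen so that the product matches the 6 in 100,000 quoted above. The function names are illustrative.

```python
from math import prod


def combined_filter_factor(factors) -> float:
    """Product of per-column filter factors; valid only for uncorrelated columns."""
    return prod(factors)


def expected_rows(total_rows: int, factors) -> float:
    """Estimated number of index entries in the slice that must be scanned."""
    return total_rows * combined_filter_factor(factors)


if __name__ == "__main__":
    factors = {
        "sex": 0.5,     # from the text: half of the table
        "name": 0.001,  # from the text: filters out 99.9% of the rows
        "age": 0.12,    # assumption, picked to reproduce 6 in 100,000
    }
    print(combined_filter_factor(factors.values()))
    print(round(expected_rows(100_000, factors.values())))
```

For strongly correlated columns such as city and zip code, this product underestimates the real number of rows, so it should be read as a lower bound in that case.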

Different values of the same column can also have different filter factors, so the performance of a query on the same column can vary greatly depending on the value.

When evaluating whether an index is suitable, we need to consider the performance of the query in extreme cases, such as a filter factor of 0% or 50%. The worst input often means the worst performance: a SQL statement that performs well on average may be completely unusable under extreme input. This is something to watch for when designing an index.

In summary, the size of the index slice that needs to be scanned is critical to query performance. The number of index records scanned is the product of the total number of rows and the filter factor of the combined condition, and the size of the index slice ultimately determines how long it takes to read the data from the table.

Matching Columns and Filtering Columns

Assume the users table has three secondary indexes: name, age, and (name, sex, age). When the WHERE condition contains an equality predicate such as name = "draven" or age = 21, those columns are used as matching columns to select rows in the index tree. But consider the following query:

SELECT * FROM users WHERE name = "draven" AND sex = "male" AND age > 20;

Although the (name, sex, age) index contains all the columns from the query condition, only name and sex are matching columns. When MySQL executes the query, it uses name and sex as the matching columns, scans all the rows that satisfy them, and then applies age as a filtering column.

Filtering columns do not reduce the size of the index slice, but they do reduce the number of random reads from the table, so they play a very important role in an index.

Design of the Index

The author believes that the preceding content has provided a sufficient theoretical basis for index design. In general, reducing the number of random reads is the most important issue in designing an index. In this section, we introduce the method for designing the best index as summarized in the book Database Index Design and Optimization.

Three-Star Index

The three-star index is the best possible index for a query statement. If the index used by a query is a three-star index, the query needs only one random disk read and a sequential scan of a narrow index slice to obtain the whole result set, so its response time is several orders of magnitude shorter than with an ordinary index. According to the book's definition of the three-star index, the primary key index for WHERE id = 1 is a special three-star index: we only need to traverse the primary key index tree once and read one data record to finish the query.

To earn the three stars of a three-star index, we need to do the following:

For the first star, take all the columns in equality predicates and place them at the beginning of the index (in any order);
For the second star, add the ORDER BY columns to the index;
For the third star, add all the remaining columns of the query statement to the index;

The concept of the three-star index and the star ratings come from the Three-Star Index section in Chapter 4 of Database Index Design and Optimization.

If we design an index for a query according to these three conditions, we get the query's three-star index. The last of the three stars is often the easiest to obtain: an index with the third star is the wide index mentioned above, which avoids a lot of random IO. If we design the index for a SQL query in this order, we get a perfect index. The three stars are not as simple as they seem, however, and each star has its own meaning:

The first star is not just about adding the equality predicate columns to the index; its purpose is to reduce the size of the index slice and thus the number of rows that need to be scanned;
The second star avoids sorting, which reduces disk IO and memory usage;
The third star avoids a random IO to the clustered index for every matched index entry to read the remaining data;

In real scenarios the problem is often not so simple. Although we can always avoid a large number of random accesses with a wide index, in some complex queries we cannot obtain the first star and the second star at the same time.

SELECT id, name, age FROM users WHERE age BETWEEN 18 AND 21 AND city = "Beijing" ORDER BY name;

In the above query, we can always get the third star by adding columns to the index. But if we want the first star, we need to minimize the size of the index slice, so the index must be prefixed with (city, age). It is then impossible to get the second star: even if we append name after age, name comes after the range column age, so a sort is still required. The resulting index is (city, age, name, id).

If we need to avoid sorting in memory, we swap the positions of age and name and get the index (city, name, age, id). When a SQL query has both a range predicate and an ORDER BY, there is no way to get a three-star index; all we can do is choose between the two, sacrificing either the first star or the second star.

All in all, when designing an index for a single table, first take all the equality predicate columns in the query and put them at the front of the index in any order. At this point, if the query has both a range predicate and an ORDER BY, a trade-off is needed: to minimize the thickness of the scanned index slice, add the range column with the lowest filter factor to the index next; to avoid sorting, add all the ORDER BY columns instead. After that, add all the remaining columns of the query to the index. With this fixed method, a two-star or three-star index for a query can be obtained quickly.
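The routine described above can be sketched as a small Python function. This is an illustrative helper, not a MySQL feature: the function name design_index and its parameters are assumptions, and it only orders column names according to the rules in this section.

```python
def design_index(equality_cols, range_cols, order_by_cols, selected_cols,
                 avoid_sort=False):
    """Order index columns for a single-table query.

    equality_cols: columns used in equality predicates
    range_cols:    columns used in range predicates, most selective first
    order_by_cols: columns in the ORDER BY clause
    selected_cols: every column the query reads
    avoid_sort:    True keeps the second star, False keeps the slice thin
    """
    columns = list(equality_cols)  # equality columns always go first

    def add(cols):
        for col in cols:
            if col not in columns:
                columns.append(col)

    if avoid_sort:
        add(order_by_cols)  # index order matches ORDER BY, no sort needed
        add(range_cols)
    else:
        add(range_cols)     # range column right after equality columns
        add(order_by_cols)
    add(selected_cols)      # remaining columns make the index wide
    return tuple(columns)


if __name__ == "__main__":
    args = (["city"], ["age"], ["name"], ["id", "name", "age"])
    print(design_index(*args))                   # thin slice, sort required
    print(design_index(*args, avoid_sort=True))  # no sort, thicker slice
```

For the query on city, age, and name shown earlier, the two calls produce (city, age, name, id) and (city, name, age, id), the two candidate indexes discussed in this section.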

Summary

Designing an index on a single table is still quite easy: just follow a fixed routine to design an ideal three-star index. The book Database Index Design and Optimization is strongly recommended; it contains a lot of content related to index design and optimization. In later articles, the author will also analyze several of the estimation methods provided in the book, which help us design more efficient indexes by estimating costs.

If you have questions about the content of the article, you can leave a comment below the blog post.

References

Database Index Design and Optimization
File Space Management
Inside of Hard Drive - YouTube
Hard Disk Working - How does a hard disk work - Hard Drive - YouTube

Original link: MySQL Index design Overview · Faith-Oriented programming

Follow:draveness GitHub


This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
