Objective
When it comes to MySQL query optimization, you have probably collected a bag of tricks: avoid SELECT *, avoid nullable columns, create indexes sensibly, choose appropriate data types for your columns, and so on. But do you really understand these optimization techniques? Do you understand how they work behind the scenes? Do they actually improve performance in real-world scenarios? Probably not. That is why it matters to understand the rationale behind these optimization recommendations; hopefully this article will lead you to revisit them and apply them wisely in real business scenarios.
MySQL logical architecture
If you can build a mental picture of how MySQL's components work together, it helps you understand the MySQL server in depth. The figure shows MySQL's logical architecture.
MySQL's logical architecture is divided into three layers. The top layer is the client layer, whose functions are not unique to MySQL: connection handling, authentication, security, and so on are all processed at this level.
Most of MySQL's core services live in the middle layer, including query parsing, analysis, optimization, caching, and built-in functions (for time, math, encryption, and so on). All cross-storage-engine functionality is also implemented at this level: stored procedures, triggers, views, and more.
The bottom layer is the storage engine, which is responsible for storing and retrieving data in MySQL. Much like file systems under Linux, each storage engine has its own strengths and weaknesses. The middle service layer communicates with the storage engines through an API that masks the differences between them.
MySQL query process
We always want MySQL to deliver better query performance, and the best way to get there is to figure out how MySQL optimizes and executes queries. Once you understand this, you will find that much of query optimization is really just following principles that let the MySQL optimizer run as expected.
So what exactly does MySQL do when you send it a query request?
MySQL query process
Client/server communication protocol
The MySQL client/server communication protocol is "half-duplex": at any given moment, either the server is sending data to the client or the client is sending data to the server; the two cannot happen at the same time. Once one end begins sending a message, the other end must receive the entire message before it can respond, so we cannot (and need not) split a message into small pieces to send independently, and there is no way to do flow control.
The client sends a query to the server in a single packet, so the max_allowed_packet parameter needs attention when query statements are very long. Note that if a query is too large, the server will refuse to receive any more data and throw an exception.
Conversely, the server's response to the client often consists of a large amount of data spread across multiple packets. When the server responds, the client must receive the entire result in full; it cannot simply take the first few rows and ask the server to stop sending. This is why, in actual development, keeping queries simple and returning only the necessary data, thereby reducing the size and number of packets exchanged, is such a good habit; it is also one of the reasons for avoiding SELECT * and adding LIMIT to queries.
Query cache
If the query cache is enabled, then before parsing a query statement MySQL checks whether the query hits the cache. If the current query hits the cache exactly, the cached result is returned directly after the user's privileges are checked. In that case the query is never parsed, no execution plan is generated, and nothing is executed.
MySQL stores cache entries in a reference table (not a SQL table; think of it as a data structure similar to a HashMap), looked up through a hash value computed over the query text itself, the database currently being queried, the client protocol version, and other information that might affect the result. So two queries that differ in any character (for example, a space or a comment) will not hit the same cache entry.
If a query contains any user-defined functions, stored functions, user variables, temporary tables, or system tables in the mysql database, its result is not cached. For example, NOW() or CURRENT_DATE() return different results at different query times, and a query containing CURRENT_USER or CONNECTION_ID() returns different results for different users; caching such results would be pointless.
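To make the hash-keyed lookup concrete, here is a toy Python sketch of a query cache. This is an illustration of the idea only, not MySQL's actual implementation, and all names in it are made up.

```python
import hashlib

def cache_key(sql: str, database: str, protocol_version: int) -> str:
    # The key hashes the raw query text plus context that could change
    # the result, mirroring the factors listed above.
    raw = f"{sql}\x00{database}\x00{protocol_version}"
    return hashlib.sha1(raw.encode()).hexdigest()

cache = {}
cache[cache_key("SELECT * FROM t", "shop", 10)] = [("row1",)]

# Byte-identical query text hits the cache:
hit = cache.get(cache_key("SELECT * FROM t", "shop", 10))
# One extra space misses, even though the query is semantically the same:
miss = cache.get(cache_key("SELECT  * FROM t", "shop", 10))
```

This is also why formatting queries consistently (for example, through a uniform query builder) matters when the query cache is enabled.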
Since it is a cache, it must expire; so when does the query cache expire? The query cache system tracks every table involved in a query, and if any of those tables changes (in data or structure), all cached entries associated with it are invalidated. Because of this, on every write operation MySQL must invalidate all cache entries for the affected tables. If the query cache is very large or heavily fragmented, this operation can cost the system dearly, even stalling it for a while. And the cache's extra overhead is not limited to writes; reads pay as well:
Every query must be checked against the cache before it starts, even if the SQL statement never hits the cache
If a query's result is cacheable, the result is written into the cache after execution completes, incurring additional overhead
Given this, understand that the query cache does not improve system performance in every case. Caching and invalidation both carry overhead, and the system only gains when the resources saved by cache hits exceed the resources the cache itself consumes. Assessing whether enabling the cache yields a net gain is genuinely difficult and beyond the scope of this article. If the system does have performance problems, you can try enabling the query cache and making some adjustments to the database design, such as:
Replacing one large table with several smaller tables, while being careful not to over-design
Using batch inserts instead of looping over single-row inserts
Keeping the cache space reasonably small; a few dozen megabytes is generally appropriate
Controlling whether an individual statement is cached via SQL_CACHE and SQL_NO_CACHE
The final piece of advice: do not enable the query cache lightly, especially for write-intensive applications. If you really cannot resist, set query_cache_type to DEMAND, so that only queries marked with SQL_CACHE go through the cache; all other queries skip it, giving you precise control over which queries are cached.
Of course, the query cache system itself is very complex, and only a small part of it is discussed here. Other, deeper topics, such as how the cache uses memory, how memory fragmentation is controlled, and how transactions affect the query cache, are left for readers to explore on their own; this is a good stopping point.
Syntax parsing and preprocessing
MySQL parses SQL statements by keyword and generates a corresponding parse tree. In this process the parser validates and parses mainly by grammar rules, for example checking whether a wrong keyword is used or whether keywords appear in the correct order. Preprocessing then further checks whether the parse tree is legal according to MySQL's rules, for example verifying that the tables and columns being queried exist.
Query optimization
A syntax tree that survives the previous steps is considered legal and is converted by the optimizer into a query plan. In most cases a query can be executed in many different ways, all returning the same result; the optimizer's job is to find the best execution plan among them.
MySQL uses a cost-based optimizer: it tries to predict the cost of executing the query under each candidate plan and selects the cheapest. You can see the computed cost of the current query by inspecting the session's last_query_cost value.
mysql> SELECT * FROM t_message LIMIT 10;
... (result set omitted)
mysql> SHOW STATUS LIKE 'last_query_cost';
+-----------------+-------------+
| Variable_name   | Value       |
+-----------------+-------------+
| Last_query_cost | 6391.799000 |
+-----------------+-------------+
The result in the example means the optimizer estimates that about 6,391 random data-page reads are needed to complete the query. The estimate is derived from statistics on certain columns: the number of pages per table or index, the cardinality of each index, the lengths of indexes and data rows, index distribution, and so on.
There are many reasons MySQL may choose a wrong execution plan, such as inaccurate statistics, costs outside its control that it does not account for (user-defined functions, stored procedures), and the fact that MySQL's notion of "optimal" may differ from ours (we want the shortest execution time, but MySQL chooses the plan it estimates as cheapest, and cheapest does not necessarily mean fastest), and so on.
The MySQL query optimizer is a very complex component that uses many optimization strategies to generate an optimal execution plan:
Reordering joins (when a query joins multiple tables, they are not necessarily joined in the order written in the SQL; there are also tricks for forcing a particular join order)
Optimizing the MIN() and MAX() functions (to find the minimum of a column that has an index, the optimizer only needs to descend to the leftmost end of the B+Tree index, and conversely the rightmost end for the maximum)
Terminating a query early (for example, with LIMIT, the query stops as soon as enough matching rows have been found)
Optimized sorting (older versions of MySQL used a two-pass sort: read the row pointers and the sort fields into memory, sort them, then read the full data rows again according to the sort order; newer versions use a single-pass sort that reads all the needed row data at once and then sorts by the given columns, which is much more efficient for I/O-intensive workloads)
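The contrast between the two sort strategies can be sketched in Python. This is a simplified in-memory model (dicts standing in for rows on disk), not MySQL's actual filesort implementation.

```python
rows = [
    {"id": 3, "name": "carol", "bio": "..."},
    {"id": 1, "name": "alice", "bio": "..."},
    {"id": 2, "name": "bob",   "bio": "..."},
]

def two_pass_sort(rows, key):
    # Pass 1: sort only (sort key, row pointer) pairs -- small memory footprint.
    pairs = sorted((r[key], i) for i, r in enumerate(rows))
    # Pass 2: fetch each full row again by pointer -- a second round of reads.
    return [rows[i] for _, i in pairs]

def single_pass_sort(rows, key):
    # Read every full row once, then sort -- more memory, far fewer reads.
    return sorted(rows, key=lambda r: r[key])

assert two_pass_sort(rows, "name") == single_pass_sort(rows, "name")
```

Both produce the same ordering; the difference is that the two-pass version touches the row data twice, which is what makes it expensive for I/O-bound workloads.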
As MySQL develops, the strategies used by the optimizer keep evolving as well; these are just a few of the most common and easiest to understand. You can look up the others yourself.
Query execution engine
After the parsing and optimization phases are complete, MySQL generates the corresponding execution plan, and the query execution engine produces results by following the instructions in that plan. Much of the execution work is done by invoking interfaces implemented by the storage engine, known as the handler API. Each table in the query is represented by a handler instance. In fact, MySQL creates a handler instance for each table during the query optimization phase, and the optimizer uses these instances' interfaces to obtain information about the tables, including all column names, index statistics, and so on. The storage engine interface provides very rich functionality, yet has only a few dozen underlying interfaces, and these interfaces, like building blocks, accomplish most of the work of a query.
Returning the result to the client
The final stage of query execution is returning the result to the client. Even when no data is found, MySQL still returns information about the query, such as the number of rows affected and the execution time.
If the query cache is enabled and the query is cacheable, MySQL will also store the result in the cache at this point.
Returning the result set to the client is an incremental, step-by-step process: it is possible that as soon as MySQL produces the first result row, it begins returning the result set to the client piece by piece. This keeps the server from holding too many results and consuming too much memory, and lets the client receive its first results as early as possible. Note that each row in the result set is sent in a packet satisfying the communication protocol described above, and then transmitted over TCP; along the way, MySQL's packets may be buffered and sent in batches.
Stepping back to look at MySQL as a whole, the entire query execution process generally breaks down into the following steps:
The client sends a query request to the MySQL server
The server checks the query cache first; on a cache hit it immediately returns the result stored in the cache, otherwise it proceeds to the next stage
The server performs SQL parsing and preprocessing, and the optimizer then generates the corresponding execution plan
MySQL calls the storage engine's API to execute the query according to the execution plan
The result is returned to the client, and the query result is cached at the same time
Performance Tuning Recommendations
Having read this far, you might expect some concrete optimization advice. Yes, the following sections offer suggestions from three different angles. But first, one piece of advice before everything else: don't blindly accept any "absolute truth" you read about optimization, including what is discussed in this article; instead, verify your assumptions about execution plans and response times in your real business scenario.
1
Schema design and data type optimization
Choosing data types is fine as long as you follow two rules: smaller and simpler. Smaller data types are usually faster, consume less disk and memory, and need fewer CPU cycles to process. Simpler data types cost fewer CPU cycles to compute; for example, integers are cheaper to compare than character strings, so store IP addresses as integers and times as DATETIME, not as strings.
Here are a few points that are easy to get wrong:
In general, changing a nullable column to NOT NULL does not bring much of a performance boost, but if you plan to create an index on a column, you should make it NOT NULL.
Specifying a display width for an integer type, such as INT(11), is useless. INT always uses 32 bits (4 bytes) of storage, so its range is already fixed; INT(1) and INT(20) are identical for storage and computation.
UNSIGNED means negative values are disallowed, which roughly doubles the upper limit for positive numbers. For example, TINYINT ranges from -128 to 127, while unsigned TINYINT ranges from 0 to 255.
Generally speaking, there is little need for the DECIMAL data type. Even when you need to store financial data, you can still use BIGINT. For example, to be accurate to one ten-thousandth, you can multiply the amounts by 10,000 and store them as BIGINT. This avoids both the inaccuracy of floating-point arithmetic and the high cost of exact DECIMAL arithmetic.
TIMESTAMP uses 4 bytes of storage while DATETIME uses 8. As a result, TIMESTAMP can only represent the years 1970 through 2038, a much smaller range than DATETIME, and a TIMESTAMP value varies with the time zone.
In most cases there is no need to use ENUM types. One drawback is that the list of enumerated strings is fixed: adding or removing strings (enumeration options) requires ALTER TABLE (unless you only append elements at the end of the list, in which case the table does not need to be rebuilt).
Don't put too many columns in a schema. The storage engine API works by copying data between the server layer and the storage engine layer in a row-buffer format and then decoding the buffer into columns at the server layer; this conversion is very expensive. If a table has too many columns while queries actually use only a few of them, CPU usage can become excessive.
ALTER TABLE on a large table is very time-consuming: MySQL performs most table-structure changes by creating an empty table with the new structure, inserting all the data from the old table into the new one, and then deleting the old table. It takes even longer when memory is tight and the table is large with big indexes. There are, of course, tricks to work around this; look them up if you are interested.
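As a concrete sketch of the fixed-point BIGINT tip above: store amounts as integer counts of the smallest unit you care about. The four-decimal-place scale here is an illustrative assumption.

```python
SCALE = 10_000  # 4 decimal places: 1 unit of currency = 10,000 stored units

def to_fixed(amount: str) -> int:
    units, _, frac = amount.partition(".")
    frac = (frac + "0000")[:4]              # pad/truncate to 4 decimal places
    sign = -1 if units.startswith("-") else 1
    return sign * (abs(int(units)) * SCALE + int(frac))

def from_fixed(value: int) -> str:
    sign = "-" if value < 0 else ""
    value = abs(value)
    return f"{sign}{value // SCALE}.{value % SCALE:04d}"

# No floating-point drift: 0.1 + 0.2 is exactly 0.3 in fixed point,
# whereas 0.1 + 0.2 != 0.3 with binary floats.
total = to_fixed("0.1") + to_fixed("0.2")
assert from_fixed(total) == "0.3000"
```

The stored integers fit comfortably in a BIGINT column, and all arithmetic on them is exact.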
2
Create high-performance indexes
Indexes are an important way to improve MySQL query performance, but too many indexes can cause high disk usage and heavy memory consumption, hurting the application's overall performance. Try to think about indexes up front rather than remembering to add them after the fact: you may otherwise have to monitor a lot of SQL later just to locate the problem, and by then the time spent adding the index will be far greater than it would have been initially. As you can see, adding indexes well takes real skill.
The next sections present a series of strategies for creating high-performance indexes and explain how each strategy works behind the scenes. Before that, though, it helps to understand some index-related data structures and algorithms, so you can better follow what is going on.
3
Index-related data structures and algorithms
The index we usually speak of is the B-Tree index, the most common and effective index for finding data in today's relational databases, and one supported by most storage engines. The term B-Tree is used because MySQL uses that keyword in CREATE TABLE and other statements, but different storage engines may actually use different data structures underneath; InnoDB, for example, uses a B+Tree.
The B in B+Tree stands for balance, not binary. Note that a B+Tree index cannot find a specific row for a given key value; it finds only the page on which the row resides. The database then reads that page into memory, searches within it, and finally obtains the data it was looking for.
Before introducing the B+Tree, look first at the binary search tree, a classic data structure: every value in its left subtree is smaller than the root's value, and every value in its right subtree is larger, as in Figure ①. To find the record with value 5 in this tree, the rough flow is: start at the root, whose value is 6; since 6 is greater than 5, move to the left subtree and find 3; since 5 is greater than 3, move to 3's right subtree and find 5, for 3 lookups in total. In the same way, finding the record with value 8 also takes 3 lookups, so the average number of lookups in this binary search tree is (3 + 3 + 3 + 2 + 2 + 1) / 6 = 2.3. A sequential scan, by contrast, finds the record with value 2 in just 1 lookup but needs 6 lookups for the record with value 8, so its average is (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5. In most cases, then, the binary search tree's average lookup speed beats sequential search.
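The averages above can be verified with a small Python sketch that builds the same six keys into a balanced tree; the insertion order is an assumption chosen to reproduce the balanced shape in Figure ①.

```python
def bst_insert(node, key):
    if node is None:
        return {"key": key, "left": None, "right": None}
    side = "left" if key < node["key"] else "right"
    node[side] = bst_insert(node[side], key)
    return node

def lookups(node, key, depth=1):
    # Number of nodes examined before finding `key`.
    if key == node["key"]:
        return depth
    child = node["left"] if key < node["key"] else node["right"]
    return lookups(child, key, depth + 1)

keys = [6, 3, 7, 2, 5, 8]   # inserting in this order yields a balanced tree
root = None
for k in keys:
    root = bst_insert(root, k)

bst_avg = sum(lookups(root, k) for k in keys) / len(keys)   # (3+3+3+2+2+1)/6
seq_avg = sum(range(1, len(keys) + 1)) / len(keys)          # (1+2+...+6)/6
print(round(bst_avg, 2), seq_avg)   # 2.33 3.5
```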
Binary search tree and balanced binary tree
Because a binary search tree can be constructed arbitrarily, the same values can also produce the binary search tree in Figure ②, whose query efficiency is obviously no better than sequential search. For a binary search tree's query performance to be at its best, the tree must be balanced, that is, a balanced binary tree (AVL tree).
A balanced binary tree must first satisfy the definition of a binary search tree, and in addition the height difference between the two subtrees of any node must not exceed 1. Clearly Figure ② does not meet the definition of a balanced binary tree, while Figure ① is one. Balanced binary trees have relatively high lookup performance (the best being the optimal binary tree), but the better the query performance, the greater the maintenance cost. For the balanced binary tree in Figure ①, when a new node with value 9 needs to be inserted, the following changes are required.
Balanced binary tree rotation
The inserted tree can be restored to a balanced binary tree most simply with a single left rotation; in real application scenarios several rotations may be needed. At this point we can consider a question: the balanced binary tree's lookup efficiency is good, its implementation is very simple, and its maintenance cost is acceptable, so why doesn't the MySQL index use it directly?
As the data in a database grows, the index itself grows, and it becomes impossible to keep the whole index in memory, so indexes are often stored on disk in the form of index files. Index lookups then incur disk I/O, and disk access costs several orders of magnitude more than memory access. Can you imagine the depth of a binary tree with millions of nodes? If a binary tree of that depth lived on disk, reading each node would require a disk I/O, and the total lookup time would clearly be unacceptable. So how do we reduce the number of I/O accesses during a lookup?
An effective solution is to reduce the tree's depth by turning the binary tree into an m-way tree (a multiway search tree); the B+Tree is such a multiway search tree. To understand the B+Tree, you only need to grasp its two most important properties. First, all keys (which you can think of as data) are stored in the leaf nodes (leaf pages); non-leaf nodes (index pages) store no real data, and all record nodes are placed on the leaf nodes in key order. Second, all leaf nodes are connected by pointers. The figure shows a simplified B+Tree of height 2.
Simplified B+tree
How should you understand these two properties? MySQL sets each node's size to an integer multiple of a page (explained below), which means that within a given node size each node can hold more index entries, so the index's fan-out is larger and lookups are more precise. The benefit of connecting all leaf nodes with pointers is interval access: for example, to find records greater than 20 and less than 30, you only need to locate node 20, then follow the pointers to traverse to 25 and 30; without the link pointers, interval lookups would be impossible. This is also an important reason MySQL chose the B+Tree as its index storage structure.
To see why MySQL sets the node size to an integer multiple of a page, you need to understand how disks store data. Disk access is far slower than main memory; on top of the mechanical movement costs (especially for ordinary spinning disks), disk access speed is often only a few ten-thousandths of main memory's. To minimize disk I/O, disks are rarely read strictly on demand: every access triggers a read-ahead, so even if only one byte is needed, the disk starts at that position and sequentially reads a certain length of data into memory. The read-ahead length is generally an integer multiple of the page size.
A page is the logical block in which computers manage memory: hardware and the OS divide main memory and disk storage into contiguous blocks of equal size, each called a page (on many operating systems, the page size is typically 4KB). Main memory and disk exchange data in units of pages. When a program reads data that is not in main memory, a page fault is triggered; the system sends a read signal to the disk, which finds the starting position of the data, sequentially reads one or more pages back into memory, and returns them, after which the program continues running.
MySQL cleverly exploits the disk read-ahead principle by setting a node's size equal to one page, so each node can be fully loaded with a single I/O. To achieve this, each time a new node is created, a full page of space is requested directly; this guarantees a node is physically stored within a single page, and since allocation is page-aligned, reading a node takes exactly one I/O. If the height of the B+Tree is h, a lookup needs at most h - 1 I/Os (the root node is resident in memory), for a complexity of $O(h) = O(\log_{m}N)$. In practical scenarios m is usually large, often over 100, so the height of the tree is generally small, usually no more than 3.
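A quick back-of-the-envelope check of that height bound; the fanout of roughly 100 entries per page is an illustrative assumption, not an InnoDB constant.

```python
import math

def btree_height(n_rows: int, fanout: int) -> int:
    """Estimated height h = ceil(log_m(n)) for an m-way B+Tree."""
    return max(1, math.ceil(math.log(n_rows, fanout)))

# With ~100 entries per page, even 10 million rows fit in a tree of
# height 4, and the root typically stays resident in memory:
print(btree_height(10_000_000, 100))   # 4
```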
Finally, a brief look at how B+Tree nodes are operated on will give you a general sense of index maintenance. Although an index can greatly improve query efficiency, maintaining it still carries a considerable cost, which is why creating indexes sensibly is especially important.
Still taking the tree above as an example, assume each node can hold at most 4 entries. First, insert the first node, 28, as shown.
Leaf page and index page are not full
Then insert the next node, 70. A lookup through the index page determines that it should go into the leaf node between 50 and 70, but that leaf node is full, so a split operation is needed. The current leaf node's starting point is 50, so the leaf node is split around its median value, as shown.
Leaf Page Split
Finally, insert node 95. Now both the index page and the leaf page are full, so two splits are needed, as shown.
Leaf page and Index page split
After the splits, the final tree is formed.
Final tree
To stay balanced, the B+Tree may perform a large number of page-split operations for newly inserted values, and page splits require I/O. To minimize splits, the B+Tree also provides a rotation feature similar to the balanced binary tree's: when a leaf page is full but its left or right sibling is not, the B+Tree is in no hurry to split; instead it moves a record to a sibling of the current page. Normally, the left sibling is checked for rotation first. In the second example above, for instance, inserting 70 would not split the page but would instead trigger a rotation toward the left sibling.
Rotation operation
Rotation minimizes page splits, reducing the disk I/O incurred during index maintenance and improving maintenance efficiency. Note that deleting a node is similar to inserting one and may still require rotation and split operations; the details are not covered here.
High-performance indexing strategies
From the above, you should now have a general understanding of the B+Tree data structure; but how does an index in MySQL organize data storage? For a simple example, suppose you have the following table:
CREATE TABLE People (
    last_name  varchar(50) NOT NULL,  -- lengths illustrative; omitted in the original
    first_name varchar(50) NOT NULL,
    dob        date NOT NULL,
    gender     enum('m', 'f') NOT NULL,
    KEY (last_name, first_name, dob)
);
For each row of data in the table, the index contains the values of the last_name, first_name, and dob columns; the figure shows how the index organizes the data storage.
How indexes organize data storage (from High Performance MySQL)
As you can see, the index is sorted first by the first field, then by the second field when last names are equal, and then by the third field, the date of birth. This is exactly why indexes have the "leftmost prefix" principle.
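The leftmost-prefix behavior can be sketched in Python with a sorted list of tuples standing in for the index; the rows are made-up sample data.

```python
import bisect

rows = [
    ("Akroyd", "Christian", "1958-12-07"),
    ("Akroyd", "Debbie",    "1990-03-18"),
    ("Barrymore", "Julia",  "2000-05-16"),
    ("Basinger", "Viven",   "1976-12-08"),
]
index = sorted(rows)          # ordered by (last_name, first_name, dob)

def prefix_scan(index, prefix):
    # Binary search works because the prefix columns lead the sort order.
    hi_key = prefix + (chr(0x10FFFF),) * (3 - len(prefix))
    lo = bisect.bisect_left(index, prefix)
    hi = bisect.bisect_right(index, hi_key)
    return index[lo:hi]

# Leftmost prefix (last_name): an efficient range scan over the sorted index.
akroyds = prefix_scan(index, ("Akroyd",))

# first_name alone is NOT a leftmost prefix; the sort order is useless here,
# so every entry must be examined:
christians = [r for r in index if r[1] == "Christian"]
```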
1. When MySQL won't use the index: non-independent columns
An "independent column" means that an indexed column must not be part of an expression or an argument to a function. For example:
SELECT * FROM t WHERE id + 1 = 5
It is easy for a person to see that this is equivalent to id = 4, but MySQL cannot parse the expression automatically; wrapping the indexed column in a function has the same effect.
2. Prefix indexes
If a column is very long, you can usually index just its first few characters, which effectively saves index space and improves index efficiency.
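A common heuristic for picking the prefix length is to grow it until the prefix's selectivity approaches the full column's; sketched here over a made-up sample of city names.

```python
def selectivity(values, prefix_len=None):
    vals = values if prefix_len is None else [v[:prefix_len] for v in values]
    return len(set(vals)) / len(vals)

cities = ["Shanghai", "Shenzhen", "Shenyang", "Beijing", "Baoding", "Baotou"]

full = selectivity(cities)                              # 1.0: all distinct
by_len = {n: round(selectivity(cities, n), 2) for n in (1, 2, 3, 4)}
print(by_len)   # selectivity climbs toward the full column's as n grows
```

Once a prefix length reaches selectivity close to the full column's, indexing more characters buys little.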
3. Multi-column indexes and index order
In most cases, creating a separate index on each of several columns does not improve query performance. The reason is simple: MySQL does not know which single index would make the query most efficient, so older versions (before MySQL 5.0) would just pick one of the single-column indexes, while newer versions apply an index-merge strategy. For a simple example, in a film_actor table with separate indexes on the actor_id and film_id columns, consider the following query:
SELECT film_id, actor_id FROM film_actor WHERE actor_id = 1 OR film_id = 1
Older versions of MySQL would randomly choose one index, but newer versions perform the following optimization:
SELECT film_id, actor_id FROM film_actor WHERE actor_id = 1
UNION ALL
SELECT film_id, actor_id FROM film_actor WHERE film_id = 1 AND actor_id <> 1
When multiple indexes are intersected (several AND conditions), a single index containing all the related columns is generally better than multiple independent indexes.
When multiple indexes are unioned (several OR conditions), merging and sorting the result sets consumes a great deal of CPU and memory, especially when some of the indexes are not very selective and large amounts of data must be merged; in such cases a full table scan may actually be the better choice.
Therefore, when you spot an index merge (Using union appears in EXPLAIN's Extra column), check whether the query and table structure are already optimal. If there is nothing wrong with the query or the table itself, the merge may simply indicate that the indexes are built badly, and you should carefully reconsider whether they are appropriate; a multi-column index containing all the related columns may be a better fit.
We discussed earlier how indexes organize data storage. As you can see from the multi-column index layout, the order of columns in the index is critical to queries: clearly, the more selective field should be placed first in the index, so that most non-matching data is filtered out by the first field alone.
Index selectivity is the ratio of distinct index values to the total number of records in the table. The higher the selectivity, the more efficient the query, because a more selective index lets MySQL filter out more rows during lookup. The selectivity of a unique index is 1, which is the best possible index selectivity and also gives the best performance.
With the concept of index selectivity in hand, it is not hard to determine which field is more selective; just measure it. For example:
SELECT * FROM payment WHERE staff_id = 2 AND customer_id = 584
Should the index be (staff_id, customer_id), or should the order be reversed? Run the query below: whichever field's selectivity is closer to 1 should be placed first in the index.
SELECT COUNT(DISTINCT staff_id)/COUNT(*)    AS staff_id_selectivity,
       COUNT(DISTINCT customer_id)/COUNT(*) AS customer_id_selectivity,
       COUNT(*)
FROM payment
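The same COUNT(DISTINCT ...)/COUNT(*) measurement can be mirrored in plain Python over a toy sample of (staff_id, customer_id) pairs; the data is made up.

```python
payments = [(2, 584), (1, 584), (2, 17), (1, 33), (2, 584), (1, 99)]

def selectivity(rows, col):
    return len({r[col] for r in rows}) / len(rows)

staff_sel = selectivity(payments, 0)      # 2 distinct staff / 6 rows
customer_sel = selectivity(payments, 1)   # 4 distinct customers / 6 rows

# The more selective column should lead the multi-column index:
leading = "customer_id" if customer_sel > staff_sel else "staff_id"
print(leading)   # customer_id
```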
Using this principle causes no problems in most cases, but keep an eye out for special cases in your data. For example, to query the users who have traded within a given user group:
SELECT user_id FROM trade WHERE user_group_id = 1 AND trade_amount > 0
MySQL chose the index (user_group_id, trade_amount) for this query, which looks fine if you ignore special cases. The reality, though, is that most of the rows in this table were migrated from an old system, and because the old and new systems' data were incompatible, all migrated rows were assigned a default user group. In that situation, the number of rows scanned through the index is basically no different from a full table scan, and the index has hardly any effect.
Generally speaking, rules of thumb and inference are useful guides for development and design in most cases, but reality is often more complex, and special situations in real business scenarios may undermine your entire design.
4. Avoid multiple range conditions
In actual development we often use multiple range conditions, for example to find users who logged in within a certain period:
SELECT user.* FROM user WHERE login_time > '2017-04-01' AND age BETWEEN 18 AND 30;
The problem with this query is that it has two range conditions, on the login_time column and on the age column. MySQL can use the index on login_time or the index on age, but it cannot use both at the same time.
5. Covering indexes
If an index contains (or "covers") the values of all the fields a query needs, there is no need to go back to the table, and this is known as a covering index. Covering indexes are a very useful tool that can greatly improve performance, because a query that only needs to scan the index gains several benefits:
Index entries are much smaller than data rows, so reading only the index greatly reduces the amount of data accessed.
Indexes are stored in order of column values, so for I/O-intensive range queries this requires far less I/O than randomly reading each row of data from disk.
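The "no need to go back to the table" behavior can be observed directly. Here is a small sketch using SQLite's in-memory engine (index and column names are invented; MySQL's EXPLAIN reports the same situation as "Using index" in the Extra column):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (staff_id INTEGER, customer_id INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_staff_customer ON payment (staff_id, customer_id)")

# The query only touches columns stored in the index, so the planner
# answers it from the index alone, never reading the table rows.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT customer_id FROM payment WHERE staff_id = 2"
).fetchall()
print(plan[0][-1])  # mentions "USING COVERING INDEX idx_staff_customer"
```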
6. Use an index scan to sort
MySQL has two ways to produce an ordered result set: perform a sort operation on the result set, or return results in index order. If the value of the type column in the EXPLAIN output is index, an index scan is being used to do the sorting.
Scanning the index itself is fast, because it only requires moving from one index record to the adjacent one. However, if the index does not cover all the columns the query needs, then every scanned index record requires a lookup of the corresponding row. This read is basically random I/O, so reading data in index order is often slower than a sequential full table scan.
When designing indexes, it is best if a single index can satisfy both sorting and filtering.
The index can be used to sort the result only if the column order of the index is exactly the same as that of the ORDER BY clause, and all columns are sorted in the same direction. If the query joins multiple tables, the index can be used for sorting only if all the fields referenced by the ORDER BY clause belong to the first table. Like a lookup, ORDER BY must also satisfy the leftmost-prefix requirement (with one exception: when the leftmost column is fixed to a constant, as in the simple example below); in all other cases MySQL must perform a sort operation and cannot use index ordering.
Leftmost column constant, index: (date, staff_id, customer_id)
SELECT staff_id, customer_id FROM demo WHERE date = '2015-06-01' ORDER BY staff_id, customer_id;
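Under the leftmost-column-constant exception, rows already come out of the index in ORDER BY order, so no separate sort step is needed. Here is a sketch of the same example in SQLite, whose plan output makes the missing sort step visible (in MySQL you would instead check that EXPLAIN's Extra column does not say "Using filesort"):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (date TEXT, staff_id INTEGER, customer_id INTEGER)")
conn.execute("CREATE INDEX idx_demo ON demo (date, staff_id, customer_id)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT staff_id, customer_id FROM demo "
    "WHERE date = '2015-06-01' ORDER BY staff_id, customer_id"
).fetchall()
# With date pinned to a constant, the index already yields rows in
# (staff_id, customer_id) order: no 'USE TEMP B-TREE FOR ORDER BY' step appears.
print([row[-1] for row in plan])
```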
7. Redundant and duplicate indexes
Duplicate indexes are indexes of the same type created on the same columns in the same order; they should be removed as soon as they are found. A redundant index is slightly different: if there is an index (A, B), then an index (A) is redundant, because (A) is a prefix of (A, B). Redundant indexes often appear when a new index is added to a table, for example when someone creates a new index (A, B) instead of extending the existing index (A).
In most cases you should try to extend an existing index rather than create a new one. However, in rare cases performance considerations call for a redundant index, for example when extending an existing index would make it too large and hurt other queries that use it.
8. Delete long-unused indexes
Periodically deleting indexes that have not been used for a long time is a good practice.
We will stop the discussion of indexes here. Finally, remember that indexes are not always the right tool: an index is effective only when the speedup it gives queries outweighs the extra work it brings. For very small tables, a simple full table scan is more efficient. For medium to large tables, indexes are very effective. For very large tables, the cost of building and maintaining indexes grows, and other techniques, such as partitioned tables, may be more effective. Finally, running EXPLAIN and testing before drawing conclusions is a virtue.
Specific types of query optimizations
Optimize COUNT() queries
COUNT() is probably the most misunderstood function. It has two different jobs: counting the number of values in a column, and counting the number of rows. When counting column values, the value must be non-NULL; NULLs are not counted. If the expression in the parentheses can never be NULL, you are actually counting rows. The simplest case is COUNT(*): it does not, as many people imagine, expand into all the columns; it ignores the columns entirely and simply counts the rows.
The most common misconception lies here: a column is specified in the parentheses, but the expected result is the row count, in the mistaken belief that this performs better. It does not. If you want the number of rows, use COUNT(*) directly; the meaning is clear and the performance is better.
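The two meanings of COUNT() are easy to verify. A minimal sketch with an in-memory SQLite table containing a NULL (MySQL behaves the same way here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (None,), (3,)])

row_count, value_count = conn.execute("SELECT COUNT(*), COUNT(v) FROM t").fetchone()
print(row_count, value_count)  # 3 2 -- COUNT(v) silently skips the NULL row
```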
Sometimes a business scenario does not need a fully accurate count, and an approximate value can be used instead. The row estimate from EXPLAIN is a good approximation, and running EXPLAIN does not actually execute the query, so the cost is very low. In general, an exact COUNT() requires scanning a large number of rows, which makes it hard to optimize; at the MySQL level, about all you can do is use a covering index. If that does not solve the problem, it must be addressed at the architecture level, for example with a summary table or an external caching system such as Redis.
Optimizing Associated Queries
In big-data scenarios, linking tables through a redundant field performs better than using JOIN directly. If an associated query really is needed, pay particular attention to the following:
Make sure there are indexes on the columns in the ON and USING clauses, and consider the join order when creating them. If tables A and B are joined on column c, and the optimizer's join order is A, B, then there is no need to index the column on table A: unused indexes are just extra overhead. In general, unless there is another reason, you only need to create the index on the corresponding column of the second table in the join order (the specific reason is explained below).
Make sure that any expression in GROUP BY or ORDER BY only involves columns from one table, so that MySQL can use an index for optimization.
To understand the first tip for optimizing joins, you need to understand how MySQL executes them. MySQL's current join execution strategy is simple: it performs a nested-loop join for any association. That is, it loops over the rows of one table, then looks for matching rows in the next table, and so on, until matching rows have been found in all the joined tables. It then builds the columns required by the query from the matched rows of each table.
Too abstract? Let's illustrate with a concrete query:
SELECT a.xx, b.yy
FROM A INNER JOIN B USING (c)
WHERE a.xx IN (5, 6);
Assuming MySQL joins A and B in that order, the following pseudocode shows how MySQL completes the query:
outer_iterator = SELECT a.xx, a.c FROM A WHERE a.xx IN (5, 6);
outer_row = outer_iterator.next;
while (outer_row) {
    inner_iterator = SELECT b.yy FROM B WHERE b.c = outer_row.c;
    inner_row = inner_iterator.next;
    while (inner_row) {
        output(inner_row.yy, outer_row.xx);
        inner_row = inner_iterator.next;
    }
    outer_row = outer_iterator.next;
}
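The same control flow, written as runnable Python over two invented in-memory "tables" (lists of dicts), makes the nested loop explicit:

```python
# In-memory model of the nested-loop join for:
#   SELECT a.xx, b.yy FROM A INNER JOIN B USING (c) WHERE a.xx IN (5, 6)
A = [{"xx": 5, "c": 1}, {"xx": 6, "c": 2}, {"xx": 7, "c": 3}]
B = [{"yy": "p", "c": 1}, {"yy": "q", "c": 2}, {"yy": "r", "c": 2}]

result = []
for outer_row in (r for r in A if r["xx"] in (5, 6)):   # outer loop: filtered scan of A
    # inner loop: find matching rows in B -- this is the lookup an index on B.c speeds up
    for inner_row in (r for r in B if r["c"] == outer_row["c"]):
        result.append((outer_row["xx"], inner_row["yy"]))

print(result)  # [(5, 'p'), (6, 'q'), (6, 'r')]
```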
As you can see, the outermost query filters on the a.xx column, so an index on A.c would not be used by this join at all. Looking at the inner query, it is clear that an index on B.c would speed it up; that is why you only need to create an index on the corresponding column of the second table in the join order.
Optimize Limit Paging
When paging is needed, we usually use LIMIT plus an offset, along with an appropriate ORDER BY clause. If there is a matching index, efficiency is usually fine; otherwise, MySQL needs to do a lot of filesort work.
A common problem is queries with a very large offset, such as LIMIT 10000, 20: MySQL has to fetch 10,020 records and return only 20, discarding the first 10,000, which is very expensive.
One of the simplest ways to optimize such a query is to use a covering index scan wherever possible instead of querying all the columns, and then join back as needed to return the full columns. For large offsets this greatly improves efficiency. Consider the following query:
SELECT film_id, description FROM film ORDER BY title LIMIT 50, 5;
If the table is very large, the query is best changed to look like this:
SELECT film.film_id, film.description
FROM film INNER JOIN (
    SELECT film_id FROM film ORDER BY title LIMIT 50, 5
) AS tmp USING (film_id);
The deferred join here greatly improves query efficiency: it lets MySQL scan as few pages as possible, and only once the needed records are identified does it go back to the original table, via the join column, to fetch the required columns.
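As a quick sanity check that the deferred join returns the same rows, here is a sketch against an invented film table in SQLite (SQLite spells the offset as LIMIT n OFFSET m, which MySQL also accepts):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT, description TEXT)")
conn.executemany("INSERT INTO film VALUES (?, ?, ?)",
                 [(i, "title%03d" % i, "desc%d" % i) for i in range(1, 201)])

direct = conn.execute(
    "SELECT film_id, description FROM film ORDER BY title LIMIT 5 OFFSET 50"
).fetchall()
deferred = conn.execute(
    "SELECT film.film_id, film.description FROM film "
    "JOIN (SELECT film_id FROM film ORDER BY title LIMIT 5 OFFSET 50) AS tmp "
    "USING (film_id)"
).fetchall()
print(sorted(direct) == sorted(deferred))  # True -- same page either way
```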
Sometimes, if you can use a bookmark to record where the previous fetch ended, the next scan can start directly from the bookmarked position and the offset can be avoided entirely. For example, the query:
SELECT id FROM t LIMIT 10000, 10;
can be rewritten as:
SELECT id FROM t WHERE id > 10000 LIMIT 10;
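The difference between the two queries can be modeled in a few lines of Python, with a sorted list standing in for the indexed id column (the table size is invented):

```python
# Keyset ("bookmark") pagination vs. OFFSET pagination, modeled on a sorted
# list of ids standing in for an indexed primary-key column.
ids = list(range(1, 20001))

def page_by_offset(offset, limit):
    # LIMIT offset, limit: the server still has to walk past `offset` rows
    return ids[offset:offset + limit]

def page_by_bookmark(last_seen_id, limit):
    # WHERE id > last_seen_id LIMIT limit: with an index this is a direct seek
    return [i for i in ids if i > last_seen_id][:limit]

print(page_by_offset(10000, 10))    # rows 10001..10010
print(page_by_bookmark(10000, 10))  # same rows, without scanning the first 10000
```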
Other optimizations include using a precomputed summary table, or joining against a redundant table that contains only the primary key and the columns needed for sorting.
Optimize Union
MySQL's strategy for handling UNION is to create a temporary table, insert the results of each sub-query into it, and then query against it. Many optimization strategies therefore do not work well with UNION queries. It is often necessary to manually "push down" WHERE, LIMIT, ORDER BY, and similar clauses into each sub-query so that the optimizer can take advantage of them.
Unless you really need the server to deduplicate, always use UNION ALL. Without the ALL keyword, MySQL adds the DISTINCT option to the temporary table, which triggers a uniqueness check over the entire temporary table's data; this is very expensive. Of course, even with ALL, MySQL always puts the results in a temporary table, then reads them out and returns them to the client, although this is not always necessary; for example, it could sometimes return each sub-query's results to the client directly.
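The cost of the implicit DISTINCT is easy to see: UNION deduplicates through the temporary table, while UNION ALL does not. A small sketch with invented tables in SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (v INTEGER)")
conn.execute("CREATE TABLE t2 (v INTEGER)")
conn.executemany("INSERT INTO t1 VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO t2 VALUES (?)", [(2,), (3,)])

dedup = conn.execute("SELECT v FROM t1 UNION SELECT v FROM t2").fetchall()
keep_all = conn.execute("SELECT v FROM t1 UNION ALL SELECT v FROM t2").fetchall()
print(len(dedup), len(keep_all))  # 3 4 -- UNION paid for a uniqueness pass
```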
Conclusion
Understanding how queries are executed and where time is consumed, coupled with knowledge of the optimization process, can help you understand MySQL better and understand the rationale behind common optimization techniques. It is hoped that the principles and examples in this paper can help us to connect theory and practice better, and apply theoretical knowledge to practice more.
Nothing more to add, except two study questions to leave you with. You can think through the answers in your head; these topics come up constantly, but few people stop to ask why.
Many programmers advocate avoiding stored procedures as much as possible, arguing that they are hard to maintain and increase operational cost, and that business logic belongs in the application. Since the application can handle all of this, why do stored procedures exist?
JOIN itself is quite convenient; you can simply query directly, so why do you need views?
A 10,000-word summary: to learn the principles of MySQL optimization, this article is enough!