The MySQL Optimization Principles I Have to Tell You About


This article is reprinted from http://www.jianshu.com/p/d7665192aaaf

When it comes to MySQL query optimization, you have probably accumulated a collection of rules: do not use SELECT *, avoid NULL columns, create indexes sensibly, choose appropriate data types for columns, and so on. But do you really understand these rules? Do you understand how they work underneath? Do they actually improve performance in real scenarios? I suspect not. That is why it is important to understand the principles behind these suggestions. I hope this article helps you re-examine them and apply them sensibly in real business scenarios.

MySQL logical architecture

If you can build a mental picture of how MySQL's components work together, it helps you understand the MySQL server in depth. The figure shows the logical architecture of MySQL.

Figure: MySQL logical architecture (from High Performance MySQL)

The MySQL logical architecture is divided into three layers. The top layer is the client layer, which is not unique to MySQL: connection handling, authentication, security, and similar functions are handled here.

Most of MySQL's core services are in the middle layer, including query parsing, analysis, optimization, caching, and built-in functions (such as date and time, math, and encryption functions). All functionality that spans storage engines is also implemented in this layer: stored procedures, triggers, views, and so on.

The bottom layer is the storage engines, which are responsible for storing and retrieving data in MySQL. Like file systems in Linux, each storage engine has its own strengths and weaknesses. The middle service layer communicates with the storage engines through APIs, which hide the differences between storage engines.

MySQL query process

We always want MySQL to deliver higher query performance. The best way to get there is to understand how MySQL optimizes and executes queries. Once you understand this, you will find that much of query optimization is simply following a few principles so that the MySQL optimizer can run in the reasonable way you expect.

When you send a request to MySQL, what does MySQL actually do?

Figure: MySQL query process

Client/server communication protocol

The MySQL client/server communication protocol is half duplex: at any moment either the server is sending data to the client or the client is sending data to the server; the two cannot happen at the same time. Once one end starts sending a message, the other end must receive the complete message before it can respond, so we cannot, and do not need to, cut a message into small pieces and send them independently, and there is no flow control.

The client sends a query to the server in a single packet, so when query statements are long you need to set the max_allowed_packet parameter. If the query is still too large, the server refuses to receive more data and raises an error.

In contrast, the server's response is usually large and consists of multiple packets. When the server responds, the client must receive the entire result; it cannot simply take the first few rows and ask the server to stop sending. So in practice it is a good habit to keep queries simple, return only the data you need, and reduce the size and number of packets exchanged. This is one of the reasons to avoid SELECT * and to add LIMIT clauses.

Query cache

Before parsing a query, if the query cache is enabled, MySQL checks whether the statement hits the query cache. If it does, MySQL checks the user's privileges once and then returns the cached result directly. In that case the query is not parsed, no execution plan is generated, and nothing is executed.

MySQL keeps the cache in a reference table (not a table in the SQL sense; think of it as a data structure similar to a HashMap) indexed by a hash value. The hash is computed from everything that may affect the result: the query text itself, the current database, the client protocol version, and so on. As a result, two queries that differ in any character (such as a space or a comment) will not hit the same cache entry.

If the query contains user-defined functions, stored functions, user variables, temporary tables, or system tables in the mysql database, its result is not cached. For example, NOW() or CURRENT_DATE() return different results at different times, and a statement containing CURRENT_USER or CONNECTION_ID() returns different results for different users. Caching such results makes no sense.

A cache can also become invalid. When does that happen for the query cache? MySQL's query cache tracks every table involved in a query. If any of these tables changes (data or structure), all cached data associated with that table is invalidated. Because of this, MySQL must invalidate all cache entries for the affected table on every write. If the query cache is very large or heavily fragmented, this can cost a lot and may even make the system stall for a while. The query cache also adds overhead on reads, not just on writes:

Every query must be checked against the cache before it starts, even if it will never hit the cache.
If the result can be cached, it is stored in the cache after execution, which also costs extra.
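
The lookup and invalidation behavior described above can be illustrated with a small Python model. This is only a sketch under stated assumptions, not MySQL's implementation: the class name QueryCache, its methods, the use of SHA-256, and the sample query are all made up for illustration.

```python
import hashlib


class QueryCache:
    """Toy model of a query cache: results keyed by a hash, invalidated per table."""

    def __init__(self):
        self.results = {}        # cache key -> cached result
        self.keys_by_table = {}  # table name -> cache keys that read from it

    @staticmethod
    def make_key(sql, database, protocol_version):
        # The key covers the raw query text, so any character difference changes it.
        raw = "\x00".join([sql, database, str(protocol_version)])
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def get(self, sql, database, protocol_version):
        return self.results.get(self.make_key(sql, database, protocol_version))

    def put(self, sql, database, protocol_version, tables, result):
        key = self.make_key(sql, database, protocol_version)
        self.results[key] = result
        for table in tables:
            self.keys_by_table.setdefault(table, set()).add(key)

    def invalidate_table(self, table):
        # A write to a table drops every cached result that depends on it.
        for key in self.keys_by_table.pop(table, set()):
            self.results.pop(key, None)


cache = QueryCache()
cache.put("SELECT name FROM users WHERE id = 1", "shop", 10, ["users"], [("alice",)])
```

Looking up the same text hits the cache, while the same query with one extra space misses, and calling invalidate_table("users") removes the entry, which mirrors why every write has a cost when the cache is enabled.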

So keep in mind that the query cache does not always improve performance. Caching and invalidation both add overhead, and performance improves only when the resources saved by the cache exceed the resources it consumes. Assessing whether enabling the cache actually helps is hard and is beyond the scope of this article. If the system does have performance problems, you can try enabling the query cache and making some adjustments to the database design, such as:

Replace one large table with several small tables, but be careful not to over-design.
Use batch inserts instead of inserting rows one at a time in a loop.
Keep the cache size under control. In general, tens of megabytes is appropriate.
Use SQL_CACHE and SQL_NO_CACHE to control whether a particular query should be cached.

One last piece of advice: do not enable the query cache lightly, especially for write-intensive applications. If you really cannot resist, set query_cache_type to DEMAND. Then only queries that include SQL_CACHE go through the cache, so you are free to decide which queries are cached.

The query cache itself is a very complex system, and this discussion covers only a small part of it. Deeper topics, such as how the cache uses memory, how to control memory fragmentation, and how transactions affect the query cache, are left for you to read up on.

Syntax parsing and preprocessing

MySQL parses the SQL statement by keyword and generates a corresponding parse tree. During this step the parser validates the statement against the grammar rules, for example whether it uses wrong keywords or whether the keywords are in the correct order. The preprocessor then checks, according to MySQL's rules, whether the parse tree is legal, for example whether the tables and columns being queried exist.

Query optimization

The parse tree produced by the preceding steps is considered legal, and the optimizer converts it into a query plan. In most cases a query can be executed in many ways that all return the same result. The optimizer's job is to find the best execution plan among them.

MySQL uses a cost-based optimizer: it tries to predict the cost of a query under each candidate execution plan and chooses the cheapest. In MySQL you can read the session value last_query_cost to see the estimated cost of the most recent query. In the original example, the value indicated that the optimizer estimated roughly 6391 random data page reads to complete the query. This estimate is computed from statistics, including the number of pages in each table or index, index cardinality, the length of index and data rows, index distribution, and so on.

There are many reasons why MySQL may choose a wrong execution plan: the statistics may be inaccurate; the cost of operations outside MySQL's control (user-defined functions, stored procedures) is not considered; and MySQL's idea of optimal may differ from ours (we want the shortest execution time, whereas MySQL picks the plan it believes has the lowest cost, and low cost does not necessarily mean short execution time).

The MySQL query optimizer is a very complex component. It uses many optimization strategies to generate an optimal execution plan:

Reordering table joins (when several tables are joined, they are not necessarily joined in the order written in the SQL, although there are hints for forcing a join order)
Optimizing MIN() and MAX() (to find the minimum value of a column that has an index, MySQL only needs to read the leftmost end of the B+Tree index, and the maximum is found at the other end in the same way; the B+Tree is described later in this article)
Terminating queries early (for example, with LIMIT, the query stops as soon as enough rows have been found)
Optimizing sorts (older versions of MySQL use a two-pass sort: read the row pointers and the sort columns, sort them in memory, then read the rows in sorted order. Newer versions use a single-pass sort: read all the needed rows at once and sort them by the given columns, which is much more efficient for I/O-intensive workloads)
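
Two of the strategies in this list, the MIN()/MAX() shortcut and early termination, can be illustrated in a few lines of Python. This is a rough analogy rather than how the optimizer is implemented: a sorted list stands in for an index, and the helper matching_rows and all the values are made up.

```python
from itertools import islice

# A sorted list stands in for the ordered keys of an index.
index_keys = sorted([42, 7, 19, 88, 3, 61])

# MIN() and MAX() on an indexed column only need the two ends of the ordering.
min_value = index_keys[0]
max_value = index_keys[-1]


def matching_rows(rows, predicate, examined):
    """Yield matching rows lazily, recording each row that had to be examined."""
    for row in rows:
        examined.append(row)
        if predicate(row):
            yield row


rows = range(1, 1001)
examined = []
# LIMIT 3: stop pulling rows as soon as three matches have been produced.
first_three_even = list(islice(matching_rows(rows, lambda r: r % 2 == 0, examined), 3))
```

Only six of the thousand rows are examined before the scan stops, which is the point of terminating early.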

As MySQL evolves, the strategies used by the optimizer evolve too. Only a few common and easy-to-understand strategies are introduced here; look up the others yourself.

Query execution engine

After parsing and optimization, MySQL has an execution plan, and the query execution engine follows the plan's instructions step by step to produce the result. Most of the work during execution is done by calling interfaces implemented by the storage engine, known as the handler API. Each table in a query is represented by a handler instance. In fact, MySQL creates a handler instance for each table during the optimization stage, and the optimizer uses these instances to obtain information about the table, including all column names and index statistics. The storage engine interface is rich in functionality, but underneath there are only a few dozen interfaces, which combine like building blocks to carry out most of a query's operations.

Returning results to the client

The last stage of query execution is returning the result to the client. Even if no rows are found, MySQL still returns information about the query, such as the number of rows affected and the execution time.

If the query cache is enabled and the query is cacheable, MySQL also stores the result in the cache at this point.

Returning the result set is an incremental process. MySQL may start sending rows to the client as soon as it produces the first one. This way the server does not need to hold many rows and consume a lot of memory, and the client gets results as early as possible. Note that each row of the result set is sent as a packet that conforms to the client/server communication protocol described earlier and is then transmitted over TCP. During transmission, MySQL packets may be buffered and sent in batches.

To recap, the whole MySQL query execution process consists of the following steps:

The client sends a query request to the MySQL server.
The server first checks the query cache. If the cache is hit, the results stored in the cache are immediately returned. Otherwise, enter the next stage.
The server parses and preprocesses the SQL, and the optimizer then generates the corresponding execution plan.
MySQL calls the storage engine API to execute the query according to the execution plan.
The results are returned to the client, and the query results are cached.

Performance optimization suggestions

Having read this far, you are probably expecting some optimization techniques. Yes, suggestions follow from three different angles. But first, one more piece of advice: do not take any "absolute truth" about optimization on faith, including what this article says. Test your assumptions about execution plans and response times in your real business scenarios.

Schema design and data type optimization

When choosing data types, follow the principle of small and simple. Smaller data types are usually faster: they take less disk and memory, and processing them needs fewer CPU cycles. Simpler data types also need fewer CPU cycles to compute on. For example, integer operations are cheaper than character operations, so store IP addresses as integers and store times as DATETIME rather than as strings.

Here are some tips that are easy to get wrong:

In general, changing a nullable column to NOT NULL does not help performance much. However, if you plan to create an index on the column, you should define it as NOT NULL.
Specifying a width for an integer type, such as INT(11), has no effect on storage. INT always uses 32 bits (4 bytes), so its range is already fixed, and INT(1) and INT(20) are identical for storage and computation.
UNSIGNED means negative values are not allowed, which roughly doubles the upper limit of positive values. For example, TINYINT stores -128 to 127, while TINYINT UNSIGNED stores 0 to 255.
In general there is little need for the DECIMAL data type. Even for financial data you can use BIGINT. For example, to be accurate to one thousandth, you can multiply the values by a fixed factor such as 1 million and store the result as BIGINT.
TIMESTAMP uses 4 bytes of storage and DATETIME uses 8 bytes. Therefore TIMESTAMP can only represent 1970 to 2038, a much smaller range than DATETIME, and the value of a TIMESTAMP varies with the time zone.
In most cases there is no need for the ENUM type. One drawback is that the list of enumerated strings is fixed: adding or removing strings (enumeration options) requires ALTER TABLE (if you only append an element to the end of the list, the table does not need to be rebuilt).
Do not give a table too many columns. When the storage engine API works, data is copied between the server layer and the storage engine layer in a row buffer format, and the server layer then decodes the buffer into columns. This conversion is expensive. If a table has many columns but only a few are actually used, CPU usage may become excessive.
ALTER TABLE on a large table is very time-consuming. For most schema changes, MySQL creates an empty table with the new structure, reads all the data from the old table and inserts it into the new table, and then drops the old table. This takes especially long when memory is short, the table is large, and it has large indexes. There are some clever tricks to work around this; look them up if you are interested.
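
Several of the tips above come down to simple integer arithmetic, which the following Python sketch tries to make concrete. The helper names (int_range, ip_to_int, int_to_ip, to_scaled_int) are hypothetical and exist only for this illustration; on the MySQL side, the built-in functions INET_ATON() and INET_NTOA() perform the IPv4 conversion.

```python
import ipaddress
from decimal import Decimal


def int_range(bits, unsigned=False):
    """Return the (min, max) values representable by an integer of the given width."""
    if unsigned:
        return 0, 2 ** bits - 1
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def ip_to_int(address):
    """Convert a dotted IPv4 string to its 32-bit unsigned integer value."""
    return int(ipaddress.IPv4Address(address))


def int_to_ip(value):
    """Convert a 32-bit unsigned integer back to a dotted IPv4 string."""
    return str(ipaddress.IPv4Address(value))


def to_scaled_int(amount_text, scale=1_000_000):
    """Store a decimal amount as a whole number of 1/scale units (scale is an assumption)."""
    return int(Decimal(amount_text) * scale)
```

For instance, int_range(8) and int_range(8, unsigned=True) give the signed and unsigned TINYINT ranges, and any IPv4 address fits in the unsigned 32-bit range, which is why an unsigned INT column is enough to store it.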

Creating high-performance indexes

Indexes are an important way to improve MySQL query performance, but too many indexes can lead to high disk and memory usage and hurt the application's overall performance. Try to avoid adding indexes as an afterthought, because you may need to monitor a large number of SQL statements to locate the problem, and adding an index at that point takes far longer than adding it up front. Adding indexes well takes real skill.

The following sections present a series of strategies for creating high-performance indexes and the principles behind each. Before that, it helps to understand some index-related algorithms and data structures.

Index-related data structures and algorithms

When we talk about indexes we usually mean B-Tree indexes, which are currently the most common and effective indexes for looking up data in relational databases and are supported by most storage engines. The term B-Tree is used because MySQL uses this keyword in CREATE TABLE and other statements, but different storage engines may use different data structures underneath; InnoDB, for example, uses a B+Tree.

The B in B+Tree stands for balance. Note that a B+Tree index does not locate a specific row for a given key value. It only finds the page that contains the row; the database then reads the page into memory, searches within it, and finally obtains the data.

Before introducing the B+Tree, let's look at the binary search tree, a classic data structure in which every value in the left subtree is smaller than the root and every value in the right subtree is larger than the root, as in Figure ①.

To find the record with value 5 in this tree, first visit the root. Its value is 6, which is greater than 5, so go to the left subtree and find 3. Since 5 is greater than 3, go to the right subtree of 3 and find 5, for three lookups in total. Likewise, finding the record with value 8 also takes three lookups. The average number of lookups for this binary search tree is therefore (3 + 3 + 3 + 2 + 2 + 1) / 6 ≈ 2.3. With sequential search, finding the record with value 2 takes one lookup but finding the record with value 8 takes six, so the average is (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5. In most cases, then, the binary search tree is faster on average than sequential search.
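
The averages in this walk-through can be checked with a short Python sketch. The tree shape is an assumption reconstructed from the description (6 at the root, 3 and 7 below it, then 2, 5, and 8), since the figure itself is not available; the Node class and comparisons_to_find are illustrative names.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right


def comparisons_to_find(root, key):
    """Count the nodes visited while searching a binary search tree for key."""
    count, node = 0, root
    while node is not None:
        count += 1
        if key == node.key:
            return count
        node = node.left if key < node.key else node.right
    raise KeyError(key)


# Tree shape assumed from the walk-through: 6 at the root, 3 and 7 below it.
root = Node(6, Node(3, Node(2), Node(5)), Node(7, None, Node(8)))
keys = [2, 3, 5, 6, 7, 8]

tree_average = sum(comparisons_to_find(root, k) for k in keys) / len(keys)
# Sequential search over the sorted list visits position + 1 elements.
sequential_average = sum(i + 1 for i in range(len(keys))) / len(keys)
```

Here tree_average works out to 14 / 6 and sequential_average to 21 / 6, matching the hand calculation.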

Figure: binary search tree and balanced binary tree

A binary search tree can be built in many shapes. The same values can also form the binary search tree in Figure ②, whose query efficiency is obviously close to that of sequential search. To get the best query performance from a binary search tree, the tree must be balanced, that is, it must be a balanced binary tree (AVL tree).

A balanced binary tree must first satisfy the definition of a binary search tree. In addition, the heights of the two subtrees of any node may differ by at most 1. Figure ② clearly does not meet this definition, while Figure ① is a balanced binary tree. A balanced binary tree has relatively high query performance (the best is the optimal binary tree), but the better the query performance, the higher the maintenance cost. For example, when a new node with value 9 is inserted into the balanced binary tree in Figure ①, the tree has to be adjusted as follows.

A single left rotation turns the tree back into a balanced binary tree after the insertion. This is the simplest case; in real scenarios several rotations may be needed. At this point we can ask: a balanced binary tree has good search efficiency, is simple to implement, and has acceptable maintenance cost, so why do MySQL indexes not use a balanced binary tree directly?

As the amount of data in the database grows, the index itself grows too, and it cannot all be kept in memory, so indexes are usually stored on disk as index files. Index lookups then incur disk I/O, which is several orders of magnitude more expensive than memory access. Can you imagine the depth of a binary tree with millions of nodes? If such a deep binary tree is stored on disk, reading each node costs one disk I/O, and the total lookup time is clearly unacceptable. So how can the number of I/O accesses during a lookup be reduced?

An effective solution is to reduce the depth of the tree by turning the binary tree into an m-way tree (a multiway search tree), and the B+Tree is one such tree. To understand the B+Tree, you only need to understand its two most important features. First, all keys (which you can think of as the data) are stored in the leaf nodes (Leaf Pages); non-leaf nodes (Index Pages) do not store real data, and all record nodes are stored in key order on the same level of leaf nodes. Second, all leaf nodes are linked by pointers. The figure shows a simplified B+Tree of height 2.
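
To put rough numbers on the depth argument, the following Python sketch estimates how many levels a tree needs for a given fan-out. It is a back-of-the-envelope model that assumes a perfectly balanced, completely full tree; min_height is a name made up for this example.

```python
def min_height(key_count, fanout):
    """Smallest number of levels h such that fanout ** h >= key_count."""
    height, capacity = 0, 1
    while capacity < key_count:
        capacity *= fanout
        height += 1
    return height


binary_levels = min_height(1_000_000, 2)     # each node splits two ways
multiway_levels = min_height(1_000_000, 100)  # each node splits a hundred ways
```

With one million keys, the two-way estimate needs about 20 levels while a fan-out of 100 needs 3, so if each level costs one disk read, the multiway tree saves most of the I/O.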

Figure: simplified B+Tree

How should these two features be understood? MySQL sets the size of each node to an integer multiple of a page (the reason is explained below). With the node size fixed, keeping real data out of the non-leaf nodes lets each node hold more keys, so each node can index a larger range more precisely. The advantage of linking all leaf nodes with pointers is that range access becomes possible. For example, to find records greater than 20 and less than 30, you only need to locate node 20 and then follow the pointers to visit 25 and 30 in order. Without the link pointers, range lookups would not be possible. This is an important reason why MySQL uses the B+Tree as its index storage structure.

Why does MySQL set the node size to an integer multiple of a page? This requires understanding how disk storage works. Disk access is much slower than main memory access; with the cost of mechanical movement added (especially on ordinary mechanical hard disks), disk access speed is often around one thousandth of main memory speed. To minimize disk I/O, the disk does not read strictly on demand but reads ahead every time: even if only one byte is needed, the disk starts at that position and reads a certain length of data sequentially into memory. The read-ahead length is generally an integer multiple of a page.

A page is a logical block used in memory management. Hardware and operating systems usually divide main memory and disk storage into contiguous blocks of equal size, each called a page (in many operating systems the page size is typically 4 KB), and main memory and disk exchange data in units of pages. When the data a program needs is not in main memory, a page fault is triggered; the system sends a read request to the disk, the disk locates the start of the data and reads one or more pages sequentially into memory, the fault handler returns, and the program continues to run.

MySQL cleverly exploits disk read-ahead by making the size of a node equal to one page, so each node can be loaded completely with a single I/O. To achieve this, each new node is allocated directly as one page, which guarantees that a node is physically stored within one page. Since storage allocation is page-aligned, reading a node takes only one I/O. If the height of the B+Tree is $h$, a lookup needs at most $h - 1$ I/Os (the root node stays resident in memory), with complexity $O(h) = O(\log_M N)$, where $N$ is the number of keys and $M$ is the number of keys a node can hold. In practice $M$ is usually large, often more than 100, so the tree is generally short, usually no more than 3 levels.

Finally, a brief look at how B+Tree nodes are manipulated gives a general idea of index maintenance. Although indexes can greatly improve query efficiency, maintaining them is still costly, so creating indexes sensibly is especially important.

Taking the tree above as an example again, assume each node can store only 4 keys. First insert node 28, as shown in the figure.

Figure: leaf page and index page both not full

Next, insert node 70. The Index Page shows that it belongs in the leaf node between 50 and 70, but that leaf node is already full, so a split is needed. The leaf node currently starts at 50, so it is split at its middle value, as shown in the figure.

Figure: leaf page split

Finally, insert node 95. Now both the Index Page and the Leaf Page are full, so two splits are needed, as shown in the figure.

Figure: leaf page and index page split

After the splits, the tree ends up in the following shape.

Figure: the final tree

To stay balanced, a B+Tree has to perform many page splits for newly inserted values, and page splits require I/O. To reduce page splits as much as possible, the B+Tree also offers a rotation operation similar to that of a balanced binary tree. When a Leaf Page is full but its left or right sibling is not, the B+Tree does not rush to split; instead it moves records from the current page to a sibling. Normally the left sibling is checked first. For example, in the second insertion above, when 70 is inserted, a left rotation is performed instead of a page split.
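
The choice between rotating and splitting can be sketched in Python as follows. This is a simplified model of a single leaf page and its left sibling, not InnoDB's algorithm: the function insert_key is hypothetical, the page contents are made-up values (only the keys 28, 50, and 70 come from the text), and updating the separator keys in the Index Page is left out.

```python
def insert_key(left_sibling, page, key, capacity=4):
    """Insert key into a sorted leaf page, preferring a left rotation over a split.

    Returns the resulting leaf pages in key order. The parent Index Page would
    also need its separator keys updated, which this sketch does not model.
    """
    keys = sorted(page + [key])
    if len(keys) <= capacity:
        return [left_sibling, keys]
    if len(left_sibling) < capacity:
        # Left rotation: hand the smallest key to the left sibling instead of splitting.
        return [left_sibling + [keys[0]], keys[1:]]
    # Both pages are full: split the overflowing page at its middle key.
    middle = len(keys) // 2
    return [left_sibling, keys[:middle], keys[middle:]]


rotated = insert_key([25, 28, 30], [50, 55, 60, 65], 70)
split = insert_key([5, 10, 15, 20], [50, 55, 60, 65], 70)
```

When the left sibling has room, the insert leaves two pages; when it is full, the same insert produces a third page, which is the extra allocation and I/O that rotation tries to avoid.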

Left rotations keep page splits to a minimum, which reduces disk I/O during index maintenance and makes it more efficient. Deleting a node is similar to inserting one and still requires rotations and splits, so it is not covered here.

High-performance strategies

By now you should have a general understanding of the B+Tree data structure. But how does an index in MySQL organize the data it stores? As a simple example, take a table with a multi-column index on (last_name, first_name, dob). For each row in the table, the index contains the values of the last_name, first_name, and dob columns. The figure shows how the index organizes the data.
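
A sorted list of tuples is a rough stand-in for such a multi-column index and shows why only a leading prefix of the columns can be searched efficiently. The rows and both lookup functions below are invented for illustration; a real index is a B+Tree, not a Python list.

```python
import bisect
from datetime import date

# Index entries sorted by (last_name, first_name, dob); the rows are made up.
index = sorted([
    ("Allen", "Cuba", date(1960, 1, 1)),
    ("Allen", "Kim", date(1930, 7, 12)),
    ("Astaire", "Angelina", date(1980, 3, 4)),
    ("Barrymore", "Julia", date(2000, 5, 16)),
    ("Basinger", "Viven", date(1976, 12, 8)),
])


def lookup_by_last_name(last_name):
    """Use the sort order: binary-search to the first entry, then read forward."""
    start = bisect.bisect_left(index, (last_name,))
    matches = []
    for entry in index[start:]:
        if entry[0] != last_name:
            break
        matches.append(entry)
    return matches


def lookup_by_first_name(first_name):
    """first_name is not a leftmost prefix, so every entry must be checked."""
    return [entry for entry in index if entry[1] == first_name]
```

Searching by last_name jumps straight to the matching run of entries, whereas searching by first_name alone has to walk the whole list because entries with the same first name are scattered across it.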

Figure: how an index organizes data storage (from High Performance MySQL)

As the figure shows, the index is sorted first by the first column; when the names are the same, entries are sorted by the third column, the date of birth. This is the reason for the "leftmost prefix" rule of indexes.

1. When MySQL will not use an index: non-independent columns

An "independent column" means that the indexed column is not part of an expression and is not a function argument. For example, it is easy for us to see that a condition such as id + 1 = 5 is equivalent to id = 4, but MySQL cannot resolve the expression automatically, so it cannot use the index on id. The same applies to functions.

2. Prefix indexes

If a column is long, you can usually index only its leading characters, which saves index space and improves index efficiency.

3. Multi-column indexes and index order

In most cases, creating separate single-column indexes on several columns does not improve query performance. The reason is simple: MySQL does not know which index would make the query more efficient. Old versions of MySQL therefore picked one column's index more or less arbitrarily, while newer versions use an index merge strategy. For example, suppose a film-actor table has separate indexes on the actor_id and film_id columns, and a query has conditions on both columns. An old version of MySQL would pick one of the two indexes, whereas a newer version uses both indexes and merges the results. Keep the following in mind:

When several indexes are intersected (multiple AND conditions), a single index containing all the relevant columns is usually better than several independent indexes.
When several indexes are unioned (multiple OR conditions), merging and sorting the result sets consumes a lot of CPU and memory, especially when some of the indexes are not very selective and large amounts of data must be returned and merged. The query then becomes more expensive, and a full table scan may be the better choice.

Therefore, if EXPLAIN shows an index merge (Using union appears in the Extra column), check whether the query and the table structure are already optimal. If neither has a problem, the indexes themselves are poorly designed, and you should consider carefully whether they are appropriate. A multi-column index containing all the relevant columns may be a better fit.

We saw earlier how an index organizes its data. It follows that for a multi-column index the order of the columns is crucial to queries. Clearly the more selective columns should come first, so that the first column already filters out most of the rows that do not match.

Index selectivity is the ratio of distinct index values to the total number of rows in the table. The higher the selectivity, the more efficient the query, because a more selective index lets MySQL filter out more rows during the lookup. A unique index has a selectivity of 1, which is the best possible selectivity and gives the best performance.

Once the concept is clear, deciding which column is more selective is not hard: just measure it. For example, to decide whether an index should be created as (staff_id, customer_id) or with the order reversed, compute the selectivity of each column and put the column whose selectivity is closer to 1 first.

This rule works in most cases, but watch out for special cases in your data.
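
The selectivity calculation itself is just distinct values divided by total rows, as this Python sketch shows. The sample rows and the selectivity helper are made up for illustration; against a real table the same ratio would typically be computed with COUNT(DISTINCT column) / COUNT(*).

```python
def selectivity(values):
    """Ratio of distinct values to total rows; 1.0 means every value is unique."""
    values = list(values)
    if not values:
        raise ValueError("selectivity is undefined for an empty column")
    return len(set(values)) / len(values)


# Made-up sample rows: (staff_id, customer_id)
payments = [(1, 101), (1, 102), (2, 103), (2, 104), (1, 105), (2, 106)]

staff_selectivity = selectivity(row[0] for row in payments)
customer_selectivity = selectivity(row[1] for row in payments)
# Put the more selective column first in the multi-column index.
leading_column = "customer_id" if customer_selectivity > staff_selectivity else "staff_id"
```

With this sample data customer_id is far more selective than staff_id, so under the rule above it would lead the index. Real data can of course behave differently, which is why measuring on the actual table matters.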
As an example of such a special case, consider a query that looks up the users in a given user group who have made trades. MySQL chooses the index (user_group_id, trade_amount) for this query, which seems fine. In reality, however, most of the data in this table was migrated from an old system, and because the data in the old and new systems was incompatible, the migrated rows were all assigned a default user group. In this situation the number of rows scanned through the index is almost the same as in a full table scan, and the index is of no use.

Broadly speaking, rules of thumb and inference are useful most of the time and can guide development and design, but reality is often more complicated, and special cases in real business scenarios can wreck an entire design.

4. Avoid multiple range conditions

In real development we often use multiple range conditions, for example to find users who logged in within a certain period of time. Such a query has a problem: it has two range conditions, one on the login_time column and one on the age column. MySQL can use the index on login_time or the index on age, but not both at the same time.

5. Covering indexes

If an index contains, or covers, the values of all the columns a query needs, there is no need to go back to the table. Such an index is called a covering index. Covering indexes are a very useful tool that can greatly improve performance, because the query only needs to scan the index, which brings several benefits:

Index entries are much smaller than full data rows, so reading only the index greatly reduces the amount of data accessed.
Indexes are stored in column-value order, so an I/O-intensive range query needs far less I/O than reading each row from disk at random.
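A covering index can be illustrated with the sketch below. The table user_login, its columns, and the index name are hypothetical and chosen only for this example; the EXPLAIN output is described in hedged terms because it depends on the MySQL version and storage engine.

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE user_login (
    id         INT PRIMARY KEY,
    user_id    INT NOT NULL,
    login_time DATETIME NOT NULL,
    ip         VARCHAR(45) NOT NULL,
    KEY idx_user_login (user_id, login_time)
);

-- Every selected column is stored in idx_user_login, so the query can be
-- answered from the index alone; EXPLAIN is expected to report
-- "Using index" in the Extra column.
EXPLAIN
SELECT user_id, login_time
FROM user_login
WHERE user_id = 42;

-- ip is not part of the index, so MySQL has to go back to the table rows;
-- "Using index" should no longer appear.
EXPLAIN
SELECT user_id, login_time, ip
FROM user_login
WHERE user_id = 42;
```

Comparing the two plans on your own tables is a quick way to confirm whether a query is actually covered.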

6. Use index scans for sorting

MySQL can produce an ordered result set in two ways: it can sort the result set, or it can scan an index in order, in which case the rows come out already sorted. If the type column in the EXPLAIN output is index, MySQL is scanning an index to sort. Scanning the index itself is fast, because MySQL only moves from one index record to the adjacent one. However, if the index does not cover all the columns the query needs, MySQL has to go back to the table for the matching row after every index record it reads. These reads are mostly random I/O, so reading data in index order is usually slower than a sequential full table scan.

When designing an index, it is best if the same index serves both the sort and the row lookup. An index can be used for sorting only when its column order exactly matches the ORDER BY clause and all columns are sorted in the same direction. If the query joins several tables, the index can be used for sorting only when every column in the ORDER BY clause belongs to the first table. ORDER BY has the same restriction as lookup queries: it must form a leftmost prefix of the index (the exception is when the leading index columns are fixed to constants in the WHERE clause). In all other cases MySQL has to perform a sort and cannot use the index for ordering.

7. Redundant and duplicate indexes

Duplicate indexes are indexes of the same type created on the same columns in the same order. Avoid them as far as possible and drop them as soon as you find them. Redundant indexes are slightly different: if an index (A, B) exists, a separate index (A) is redundant, because it is a prefix of the first. Redundant indexes usually appear when someone adds a new index to a table, for example by creating a new index (A, B) instead of extending the existing index (A).
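A redundant index of the kind just described might come about, and be cleaned up, as in this sketch. The table t, its columns, and the index names are hypothetical; the final query assumes MySQL 5.7 or later, where the bundled sys schema provides the schema_redundant_indexes view.

```sql
-- Hypothetical table that starts out with a single-column index on a.
CREATE TABLE t (
    id INT PRIMARY KEY,
    a  INT NOT NULL,
    b  INT NOT NULL,
    KEY idx_a (a)
);

-- A new composite index is added later. idx_a is now redundant,
-- because (a) is a leftmost prefix of (a, b).
ALTER TABLE t ADD INDEX idx_a_b (a, b);

-- Assumption: the sys schema is installed (MySQL 5.7+).
-- This view lists indexes that are made redundant by another index.
SELECT * FROM sys.schema_redundant_indexes WHERE table_name = 't';

-- Dropping the prefix index leaves one index that serves both cases.
ALTER TABLE t DROP INDEX idx_a;
```

Before dropping an index in production, it is worth checking that no query relies on it through an index hint.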
In most cases you should extend an existing index rather than create a new one. There are rare cases, however, where performance considerations call for a redundant index, for example when extending an existing index would make it so large that other queries using it suffer.

8. Drop indexes that have not been used for a long time

Regularly dropping indexes that have gone unused for a long time is a good habit.

That is all I plan to say about indexes. One last point: an index is not always the best tool. It is worthwhile only when the gain in lookup speed outweighs the extra work it causes. For very small tables a simple full table scan is more efficient. For medium to large tables indexes are very effective. For very large tables the cost of creating and maintaining indexes keeps growing, and other techniques such as partitioned tables may be more effective. Finally, running EXPLAIN and then testing is a virtue.

Optimizing Specific Types of Queries

Optimizing COUNT() queries

COUNT() is probably the most misunderstood function. It does two different things: it counts the number of values in a column, and it counts the number of rows. When counting column values, only non-NULL values are counted. If the expression in the parentheses can never be NULL, COUNT() is in effect counting rows. The simplest case is COUNT(*): it is not expanded to all columns, as one might imagine, but ignores the columns altogether and counts rows directly.

The most common mistake is to put a column name in the parentheses while expecting a row count, often in the belief that it performs better. It does not. To count rows, use COUNT(*) directly: the intent is clearer and the performance is better.
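The difference between the two uses of COUNT() can be seen in a small sketch. The table t_user and its contents are made up for this example.

```sql
-- Hypothetical table with a nullable column.
CREATE TABLE t_user (
    id    INT PRIMARY KEY,
    email VARCHAR(100) NULL
);

INSERT INTO t_user (id, email) VALUES
    (1, 'a@example.com'),
    (2, NULL),
    (3, 'c@example.com');

-- COUNT(*) counts rows; COUNT(email) counts only non-NULL email values.
SELECT COUNT(*)     AS row_count,
       COUNT(email) AS non_null_emails
FROM t_user;
-- row_count = 3, non_null_emails = 2
```

The two results differ as soon as the column contains a NULL, which is why COUNT(column) should not be used as a stand-in for a row count.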
In some business scenarios an exact COUNT is not required and an approximation will do. The row estimate shown by EXPLAIN is a good approximation, and because EXPLAIN does not actually execute the query, it costs very little. In general, COUNT() has to scan a large number of rows to return an exact figure, so it is hard to optimize. Within MySQL itself, the only real option is a covering index. If that is not enough, the problem has to be solved at the architecture level, for example by adding a summary table or using an external cache such as Redis.

Optimizing JOIN queries

With large data volumes, relating tables through a redundant field usually performs better than using JOIN directly. If you do need joins, pay attention to the following:

Make sure the columns in the ON and USING clauses are indexed, and take the join order into account when creating indexes. When table A and table B are joined on column c and the optimizer joins them in the order A, B, there is no need for an index on that column in table A. Unused indexes only add overhead. In general, you only need an index on the join column of the second table in the join order, unless there is some other reason for it (the analysis below explains why).
Make sure that any expression in GROUP BY and ORDER BY refers to columns from only one table, so that MySQL can use an index for the operation.
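The first point can be sketched with two small tables. The names A, B, c and xx follow the text; the column yy, the id columns, and the index names are assumptions added for illustration. STRAIGHT_JOIN is used only to pin the join order to A, B so that the plan matches the scenario in the text.

```sql
-- Hypothetical tables: A is read first, B is probed once per row of A.
CREATE TABLE A (
    id INT PRIMARY KEY,
    c  INT NOT NULL,
    xx INT NOT NULL,
    KEY idx_xx (xx)      -- serves the WHERE filter on the first table
);

CREATE TABLE B (
    id INT PRIMARY KEY,
    c  INT NOT NULL,
    yy INT NOT NULL,
    KEY idx_c (c)        -- serves the per-row lookup in the second table
);

-- Conceptually, the nested-loop join runs like this:
--   for each row a in A where a.xx in (5, 6):
--       for each row b in B where b.c = a.c:   -- this lookup needs idx_c
--           emit (a.xx, b.yy)
EXPLAIN
SELECT A.xx, B.yy
FROM A STRAIGHT_JOIN B ON B.c = A.c
WHERE A.xx IN (5, 6);
```

In this join order an index on A.c would never be consulted, which is why only the join column of the second table needs one.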

To understand the first tip, you need to know how MySQL executes joins. The strategy described here is very simple: MySQL treats every join as a nested-loop join. It loops over the rows of one table, and for each row runs a nested loop that looks for matching rows in the next table, continuing until matching rows have been found in all tables. It then builds the columns the query asks for from the matching rows of each table. (MySQL 8.0.18 and later can also use hash joins for some queries.)

Too abstract? Take the example above and suppose MySQL joins in the order A, B, with the query filtering on a column A.xx. The outer loop finds its rows in A through the condition on A.xx, so even if there is an index on A.c, the join never uses it. The inner loop looks up rows in B by c for every outer row, so an index on B.c clearly speeds the query up. That is why an index is needed only on the join column of the second table in the join order.

Optimizing LIMIT pagination

When paging is required, we usually combine LIMIT with an offset and a suitable ORDER BY clause. If a matching index exists, this is usually efficient; otherwise MySQL has to do a lot of filesort work.

A common problem arises when the offset is very large. With LIMIT 10000, 20, MySQL has to produce 10020 rows and then return only the last 20; the first 10000 are thrown away, which is very expensive.

The simplest way to optimize such a query is to use a covering index scan wherever possible instead of reading all columns, and then join back to the table as needed to return the full rows. When the offset is large, this improves efficiency considerably.
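The covering-index approach to deep pagination can be sketched as follows. The film table, its columns, and the index name are assumptions made for this example, and film_id is added to the sort only as a tie-breaker so that both queries return the same page.

```sql
-- Hypothetical table; in InnoDB the secondary index idx_title also
-- stores the primary key, so it covers (title, film_id).
CREATE TABLE film (
    film_id     INT PRIMARY KEY,
    title       VARCHAR(128) NOT NULL,
    description TEXT,
    KEY idx_title (title)
);

-- Straightforward form: MySQL may have to produce 10020 full rows,
-- including the large description column, and discard the first 10000.
SELECT film_id, title, description
FROM film
ORDER BY title, film_id
LIMIT 10000, 20;

-- Deferred join: the subquery walks only the index to find the 20 keys,
-- and the outer query fetches full rows for just those keys.
SELECT f.film_id, f.title, f.description
FROM film AS f
INNER JOIN (
    SELECT film_id
    FROM film
    ORDER BY title, film_id
    LIMIT 10000, 20
) AS lim USING (film_id)
ORDER BY f.title, f.film_id;
```

The rewritten query still skips 10000 index entries, but skipping index entries is much cheaper than materializing 10000 full rows.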
Consider a query that pages with a large offset over a very large table. It is best rewritten as a deferred join: a subquery uses a covering index to find only the keys of the rows on the requested page, and the outer query joins back to the original table on those keys to fetch the required columns. This lets MySQL scan as few pages as possible and greatly improves efficiency.

Sometimes you can use a bookmark to record the position of the last row fetched. The next query can then start scanning directly from that position and avoid OFFSET altogether, for example by filtering on the key relative to the last value seen and applying only a LIMIT.

Other ways to optimize pagination include a precomputed summary table, or a join against a redundant table that contains only the primary key and the columns used for sorting.

Optimizing UNION

MySQL processes a UNION by creating a temporary table, inserting the result of each query into it, and then querying the temporary table. As a result, many optimizations cannot be applied to UNION queries. You often have to push the WHERE, LIMIT, ORDER BY and similar clauses down into each subquery by hand so that the optimizer can make full use of them.

Unless you really need the server to remove duplicates, always use UNION ALL. Without the ALL keyword, MySQL adds the DISTINCT option to the temporary table, which forces a uniqueness check over all the data in it, and that is very expensive. In the versions described here, MySQL still writes the results to a temporary table even with ALL and then reads them back to return them to the client, although in many cases this is unnecessary because the results of each subquery could be returned to the client directly. (From MySQL 5.7 onward, the temporary table is avoided for UNION ALL in many cases.)

Conclusion

Understanding how a query is executed and where the time goes, together with some knowledge of the optimization process, helps you understand MySQL better and see the principles behind the common optimization techniques.
I hope the principles and examples in this article help you connect theory with practice and apply more of that theory in your work. I have nothing more to add, so let me leave you with two questions to think over. Both come up often in conversation, but few people stop to ask why.

Many programmers put forward this view when giving talks: avoid stored procedures as far as possible, because they are hard to maintain and raise the cost of use, so business logic should live on the client. If the client can do all of this, why do stored procedures exist?
JOIN is convenient in itself and can simply be used directly in queries. Why do we still need views?

References

[1] Jiang Chengyao; MySQL Technology Insider: InnoDB Storage Engine; Mechanical Industry Press, 2013
[2] Baron Schwartz; translated by Ning Haiyuan and Zhou Zhenxing; High Performance MySQL (Third Edition); Electronic Industry Press, 2013
[3] Viewing the MySQL index structure from the B-/B+ tree

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
