Essentials of MySQL optimization for large data volumes


With the growth of the Internet, data volumes have grown exponentially, from GB to TB to PB, and operating on that data has become far more difficult: traditional relational databases cannot always meet the demand for fast queries and inserts. NoSQL arrived to ease this crisis. It gains performance by relaxing data safety guarantees, reducing support for transactions, and reducing support for complex queries. In some scenarios, however, NoSQL's compromises are not acceptable, for example when transactional integrity and data safety are hard requirements. NoSQL cannot satisfy those cases, so a relational database is still necessary.

While relational databases are inferior to NoSQL databases at massive scale, their performance will meet your needs if you operate them correctly. Different operations call for different optimization directions: data migration, querying, and inserting can each be considered separately. When optimizing, you also need to consider whether other related operations will be affected. For example, creating an index improves query performance, but it degrades insert performance, because each insert must also update the index; you have to decide whether that reduction is acceptable. Database optimization therefore means weighing several directions at once and looking for the best compromise.

One: Query optimization
1: Create an index.

The simplest and most common optimization is on the query side. Because reads account for the majority of CRUD operations, read performance essentially determines the performance of the application, and the most common way to improve it is to create indexes. In a test with 20 million records, each about 200 bytes across two varchar columns, querying a single record without an index took about a minute, while with an index the query time was negligible. However, adding an index to existing data takes a very long time: after inserting the 20 million records, creating the index took several tens of minutes.
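
As a minimal sketch of the setup described above (the exact schema is an assumption; only the ycsb_key column and index_ycsb_key index names appear later in this article):

-- A hypothetical table like the one in the test: two varchar columns,
-- roughly 200 bytes per record:
CREATE TABLE usertable (
    ycsb_key VARCHAR(100) NOT NULL,
    field0   VARCHAR(100)
) ENGINE = InnoDB;

-- Creating the index after the 20 million rows are loaded is the step
-- that can take tens of minutes:
CREATE INDEX index_ycsb_key ON usertable (ycsb_key);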

Creating indexes has drawbacks and appropriate occasions. Although an index can greatly speed up queries, the downsides are obvious. First, inserting data takes extra time to maintain the index, which reduces insert performance to some degree. Second, the data file becomes larger. When you create an index on a column, the length of each index entry is determined by the length you declared when creating the column. For example, if you declare a column as varchar(100), an index on that column has an entry length of 102 bytes, because an entry longer than 64 bytes adds an additional 2 bytes to record the index entry's length.

From the ycsb_key column (length 100) you can see that I created an index named index_ycsb_key, each entry 102 bytes long; imagine how large the index becomes when the data grows huge. Note also that the index length differs from the column type's storage length. varchar, for instance, is a variable-length character type (see the MySQL data type analysis), and its actual storage size is the size of the actual characters, but the index entry is the size you declared. If you declare 100 bytes when creating the column, the index length is those 100 bytes plus 2, regardless of how much is actually stored.
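
One way to observe this declared-length behavior, as a sketch (the lookup value is made up, and the 102 assumes a single-byte character set and a NOT NULL column):

-- key_len in the EXPLAIN output reflects the declared length:
EXPLAIN SELECT * FROM usertable WHERE ycsb_key = 'user100';
-- key: index_ycsb_key, key_len: 102 (100 declared bytes + 2 length bytes),
-- no matter how short the stored values actually are.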

Besides the time it takes to create and the ever-growing index file, whether to create an index also depends on the characteristics of the stored data. If a large proportion of the stored values are duplicates, creating an index does more harm than good (please review the MySQL index introduction first). When much of the data is repeated, the query improvement from the index is negligible, yet you still pay the insert-time cost of maintaining it.
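
A quick way to check this before indexing, as a sketch (table and column names as in the hypothetical schema above):

-- Selectivity near 1 means mostly unique values, so an index helps;
-- near 0 means mostly duplicates, so it helps little:
SELECT COUNT(DISTINCT ycsb_key) / COUNT(*) AS selectivity FROM usertable;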

2: Cache configuration.

MySQL has a variety of caches: some cache query statements, others cache query data. Clients cannot operate on these caches directly; they are maintained by the server and updated as you query and modify data. In the configuration file we can see MySQL's cache settings:
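
For example, the query cache settings can be listed from any client (the variable names below are from MySQL 5.x, where the query cache exists):

SHOW VARIABLES LIKE '%query_cache%';
-- Typical rows include have_query_cache, query_cache_limit,
-- query_cache_size, and query_cache_type.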

The main focus here is the query cache, which caches the data returned by queries. To use it, you must set query_cache_size to a non-zero value. When it is non-zero, the server caches the result of each query, and the next time the same query arrives, the server fetches the result directly from the cache instead of executing the query. The amount of data that can be cached depends on the size you set, so if you set it large enough that the data fits fully in memory, queries will be very fast.
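
A minimal my.cnf sketch, assuming MySQL 5.x (the query cache was removed in MySQL 8.0); the sizes are illustrative, not recommendations:

[mysqld]
query_cache_type  = 1      # 0 = off, 1 = cache all cacheable queries, 2 = on demand
query_cache_size  = 256M   # memory reserved for cached result sets
query_cache_limit = 1M     # results larger than this are not cached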

However, the query cache also has drawbacks. Whenever you perform any update operation (UPDATE / INSERT / DELETE) on a table, the server flushes the cached entries for that table to keep the cache consistent with the database, invalidating the cached data. So for a table that is updated very frequently, the query cache cannot improve performance and also adds overhead to other operations.

3: slow_query_log analysis.

In fact, the most important and most fundamental means of improving query performance is the slow query log (slow_query) setting.

When you set slow_query_log to ON, the server logs every query whose execution time exceeds the slow query threshold you set (long_query_time). While optimizing, you can analyze the slow query log and target the slow queries it reveals, optimizing them by creating appropriate indexes or by splitting tables. Why split tables? Because beyond a certain data volume, an unsplit table itself becomes the performance limit. Here's a brief introduction.
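
A sketch of turning the slow log on at runtime (the threshold and file path are illustrative; all three can also be set in my.cnf):

SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;    -- log queries slower than 1 second
SET GLOBAL slow_query_log_file = '/var/lib/mysql/slow.log';
-- Summarize the log offline, e.g. sorted by time:
--   mysqldumpslow -s t /var/lib/mysql/slow.log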

4: Splitting databases and tables

Splitting databases and tables is the killer move of query optimization. Once the data volume reaches a certain level, the measures above no longer have an obvious effect, and the data itself must be divided. There are two such measures: splitting databases and splitting tables. Table splitting in turn can be done two ways: vertical partitioning and horizontal partitioning. Here's a brief introduction to each approach.

MySQL stores data as files on disk. When a data file is too large, operating-system operations on it become cumbersome and time-consuming, and some operating systems do not support very large files at all, so at that point the table must be split. In addition, the common MySQL storage engine is InnoDB, whose underlying data structure is a B+ tree. When the data file grows too large, the B+ tree gains more levels and nodes; querying may then traverse several levels of nodes, each of which can mean an I/O operation to load a page into memory, which is inevitably time-consuming. There is also InnoDB's locking mechanism on the B+ tree: it locks individual nodes, and when changing the table structure the whole tree is locked; with a huge table file this can take effectively forever.

So we have to split tables and split databases.
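
A minimal sketch of horizontal table splitting (the orders table, the four-way split, and the MOD routing rule are all hypothetical):

-- Four shards with identical structure:
CREATE TABLE orders_0 (id BIGINT PRIMARY KEY, user_id BIGINT, amount DECIMAL(10, 2));
CREATE TABLE orders_1 (id BIGINT PRIMARY KEY, user_id BIGINT, amount DECIMAL(10, 2));
CREATE TABLE orders_2 (id BIGINT PRIMARY KEY, user_id BIGINT, amount DECIMAL(10, 2));
CREATE TABLE orders_3 (id BIGINT PRIMARY KEY, user_id BIGINT, amount DECIMAL(10, 2));

-- The application routes each row by user_id MOD 4; user_id = 42 lands
-- in orders_2 because 42 MOD 4 = 2:
INSERT INTO orders_2 (id, user_id, amount) VALUES (1001, 42, 9.99);
SELECT * FROM orders_2 WHERE user_id = 42;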

5: Sub-query optimization

Queries often use subqueries, and subqueries generally use the IN or EXISTS keyword. Once the data volume grows large enough, the execution times of IN and EXISTS diverge greatly. To avoid such situations, the best approach is to use a join query, because in most cases the server optimizes joins far better than it optimizes subqueries. From the relatively recent version 5.6 on, MySQL automatically optimizes IN subqueries into join queries, so the slow-subquery problem does not arise. Sometimes the DISTINCT keyword can be used to limit the subquery, but note that DISTINCT is often converted into a GROUP BY, which produces a temporary table, and copying data into that temporary table adds delay.
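
A sketch of the IN-to-join rewrite described above, reusing the a/b tables from note 13 further down; the derived table with DISTINCT preserves the semantics of IN when b contains duplicates:

-- Subquery form, slow on MySQL before 5.6 when b is large:
SELECT num FROM a WHERE num IN (SELECT num FROM b);

-- Equivalent join form, which the optimizer generally handles better:
SELECT a.num FROM a JOIN (SELECT DISTINCT num FROM b) AS bn ON a.num = bn.num;
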
Two: Data migration

When the data volume reaches a certain level, migrating the database becomes a very prudent and dangerous job. Ensuring data consistency during the move, handling all kinds of emergencies, and coping with data that changes while the migration is in progress are all very difficult problems.

2.1: Insert data

During data migration there will be large re-imports of data. You can load a file directly, and sometimes you may need to insert from code; at that point you need to optimize the INSERT statement. One option is the INSERT DELAYED statement: when you issue the insert request, the row is not inserted into the table immediately but placed in a buffer, and inserted later when the table is not busy.
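
Hedged sketches of the options above (the file path, table, and columns are illustrative; note that INSERT DELAYED applies only to some engines such as MyISAM and is no longer supported as of MySQL 5.7):

-- Fastest: load a dump file directly:
LOAD DATA INFILE '/tmp/usertable.csv'
INTO TABLE usertable
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

-- When inserting from code, batch many rows per statement rather than
-- issuing one INSERT per row:
INSERT INTO usertable (ycsb_key, field0) VALUES
    ('key1', 'v1'),
    ('key2', 'v2'),
    ('key3', 'v3');

-- The delayed variant described above:
INSERT DELAYED INTO usertable (ycsb_key, field0) VALUES ('key4', 'v4');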

Below are some general notes on SQL query optimization:

1. To optimize queries, avoid full table scans as much as possible; first consider building indexes on the columns involved in WHERE and ORDER BY.

2. Avoid testing a field for NULL in the WHERE clause, otherwise the engine will abandon the index and perform a full table scan, such as:

SELECT id FROM t WHERE num IS NULL;

You can set a default value of 0 on num, make sure the num column never holds NULL, and query like this instead:

SELECT id FROM t WHERE num = 0;

3. Avoid using the != or <> operators in the WHERE clause, or the engine will abandon the index and perform a full table scan.

4. Avoid using OR to join conditions in the WHERE clause, otherwise the engine will abandon the index and perform a full table scan, such as:

SELECT id FROM t WHERE num = 10 OR num = 20;

It can be queried like this instead:

SELECT id FROM t WHERE num = 10 UNION ALL SELECT id FROM t WHERE num = 20;

5. Use IN and NOT IN cautiously as well, or they will result in a full table scan, such as:

SELECT id FROM t WHERE num IN (1, 2, 3);

For a continuous range of numbers, use BETWEEN instead of IN:

SELECT id FROM t WHERE num BETWEEN 1 AND 3;

6. The following query will also cause a full table scan:

SELECT id FROM t WHERE name LIKE '%abc%';

To improve efficiency, consider full-text search instead.

7. Using a parameter in the WHERE clause also causes a full table scan. SQL resolves a local variable only at run time, but the optimizer cannot defer the choice of access plan to run time; it must choose at compile time, and at compile time the variable's value is unknown, so it cannot be used to select an index. The following statement performs a full table scan:

SELECT id FROM t WHERE num = @num;

You can force the query to use an index instead:

SELECT id FROM t WITH (INDEX(index_name)) WHERE num = @num;

8. Avoid expression operations on a field in the WHERE clause, which will cause the engine to abandon the index and perform a full table scan. For example:

SELECT id FROM t WHERE num / 2 = 100;

should read:

SELECT id FROM t WHERE num = 100 * 2;

9. Avoid applying functions to a field in the WHERE clause, which will cause the engine to abandon the index and perform a full table scan. For example:

SELECT id FROM t WHERE SUBSTRING(name, 1, 3) = 'abc';                 -- ids whose name starts with 'abc'
SELECT id FROM t WHERE DATEDIFF(day, createdate, '2005-11-30') = 0;   -- ids generated on '2005-11-30'

should read:

SELECT id FROM t WHERE name LIKE 'abc%';
SELECT id FROM t WHERE createdate >= '2005-11-30' AND createdate < '2005-12-1';

10. Do not perform functions, arithmetic, or other expression operations on the left side of the "=" in the WHERE clause, or the system may not be able to use the index correctly.

11. When using an indexed field as a condition, if the index is a composite index, the first field of the index must appear in the condition for the system to use the index; the fields in the condition should also match the index's field order as closely as possible.

12. Do not write meaningless queries, such as generating an empty table structure:

SELECT col1, col2 INTO #t FROM t WHERE 1 = 0;

This kind of code returns no result set but still consumes system resources. It should be changed to:

CREATE TABLE #t (...);

13. Replacing IN with EXISTS is often a good choice:

SELECT num FROM a WHERE num IN (SELECT num FROM b);

Replace it with the following statement:

SELECT num FROM a WHERE EXISTS (SELECT 1 FROM b WHERE b.num = a.num);

14. Not every index is effective for every query. SQL optimizes queries based on the data in the table; when an indexed column contains a large amount of duplicated data, the query may not use the index at all. For example, if a table has a sex field that is roughly half male and half female, an index on sex does nothing for query efficiency.

15. More indexes are not always better. An index speeds up the corresponding SELECT, but it slows down INSERT and UPDATE, since those operations may rebuild the index; so think carefully about how to build indexes, depending on the circumstances. A table should have no more than six indexes; if there are more, consider whether the rarely used ones are really necessary.

16. Avoid updating clustered-index columns wherever possible, because the order of the clustered-index columns is the physical storage order of the table's records; once the values change, the order of the entire table's records has to be adjusted, which consumes considerable resources. If the application needs to update clustered-index columns frequently, reconsider whether the index should be clustered.

17. Use numeric fields where possible. If a field holds only numeric information, try not to design it as a character type: that degrades query and join performance and increases storage overhead. The engine compares strings one character at a time when processing queries and joins, whereas a single comparison suffices for a numeric type.

18. Use varchar/nvarchar instead of char/nchar where possible: variable-length fields take less storage space, and searching within a smaller field is obviously more efficient.

19. Do not use SELECT * FROM t anywhere; use a specific field list instead of "*", and do not return any fields you do not need.

20. Try to use table variables instead of temporary tables. If a table variable holds a large amount of data, be aware that its indexes are very limited (only the primary key index).

21. Avoid frequently creating and deleting temporary tables, to reduce the consumption of system table resources.

22. Temporary tables are not unusable; used appropriately they can make certain routines more efficient, for example when you need to repeatedly reference a data set from a large table or a commonly used table. For one-off operations, however, an export table is better.

23. When creating a temporary table, if a large amount of data is inserted at once, use SELECT INTO instead of CREATE TABLE to avoid generating a lot of log and to increase speed; if the data volume is small, then to reduce pressure on the system tables, CREATE TABLE first and then INSERT.

24. If temporary tables are used, explicitly delete all of them at the end of the stored procedure: TRUNCATE TABLE first, then DROP TABLE. This avoids long locks on the system tables.

25. Avoid cursors where possible, because cursors are inefficient; if a cursor operates on more than 10,000 rows of data, consider rewriting it.

26. Before reaching for a cursor-based or temporary-table-based method, look for a set-based solution to the problem; set-based approaches are usually more efficient.

27. Like temporary tables, cursors are not unusable. Using a FAST_FORWARD cursor on a small data set is often better than other row-by-row processing methods, especially when several tables must be referenced to obtain the needed data. Routines that compute "totals" in the result set are usually faster than ones using cursors. If development time permits, try both the cursor-based and the set-based approach and see which works better.

28. SET NOCOUNT ON at the beginning of all stored procedures and triggers, and SET NOCOUNT OFF at the end, so that the server does not send a DONE_IN_PROC message to the client after every statement of the stored procedure or trigger.

29. Avoid large transactions where possible, to improve the system's concurrency.

30. Avoid returning large amounts of data to the client; if the data volume is too large, consider whether the requirement itself is reasonable.
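
Many of these notes come down to whether the engine can still use an index; a rewrite can be checked with EXPLAIN (a sketch, assuming a table t with an index on num):

-- The expression on num defeats the index: type shows ALL, key shows NULL:
EXPLAIN SELECT id FROM t WHERE num / 2 = 100;

-- After the rewrite from note 8, key shows the index on num:
EXPLAIN SELECT id FROM t WHERE num = 100 * 2;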
