MySQL and OLAP: SQL query best practices for multi-dimensional analysis

The quick-and-rough solution is ROLAP (relational OLAP): data is stored in MySQL after dimensional modeling, and a ROLAP engine (such as the open-source Mondrian) translates OLAP requests into SQL statements and submits them to the database. OLAP computation and analysis force MySQL to run more complex SQL queries, so performance optimization is essential. This article summarizes some practical principles.
OLAP features
Typical applications of OLAP include complex dynamic reports that require drilling (roll-up and drill-down), slicing, dicing, and pivoting. The following table summarizes the main differences between OLAP and OLTP systems. These OLAP characteristics determine the SQL query scenarios and the optimization approaches; the rest of this article covers indexes, aggregation, subqueries, table joins, and sorting.
|  | OLAP | OLTP |
| --- | --- | --- |
| Number of users | Relatively small number of analysts | High concurrency |
| Database design | Dimensional model: star and snowflake schemas | Normalized |
| Data volume | Large, tens of millions of rows | Small, generally no more than 1 million rows |
| SQL read/write pattern | Periodic bulk imports, generally no updates; complex queries that retrieve a large amount of data each time | Each transaction reads and writes a small amount of data |
Indexes
After weighing data fault tolerance and recovery against performance, InnoDB is chosen as the storage engine. InnoDB is characterized by its primary-key clustered index and B+Tree data structure, two features that benefit both data import and multi-dimensional combination slicing.
1) Data import speed
InnoDB tables are clustered on the primary key: all row data is stored in the leaf nodes of the primary-key index, in primary-key order. If data is not inserted in primary-key order, extra page splits, page lookups, and I/O operations are incurred, so InnoDB insert speed depends heavily on insertion order. The solution is relatively simple: use an AUTO_INCREMENT column as the primary key.
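As a sketch, a fact table following this advice might be declared as below. The table and column names echo the examples later in this article but are illustrative, not taken from the original system:

```sql
-- Illustrative DDL (names are hypothetical): the AUTO_INCREMENT surrogate key
-- makes every insert append to the right-most leaf page of the clustered
-- index, avoiding the page splits caused by out-of-order primary keys.
CREATE TABLE data_account_csm_daily (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    account_id INT NOT NULL,
    stdate     DATE NOT NULL,
    click_pay  DECIMAL(12,2) NOT NULL DEFAULT 0,
    PRIMARY KEY (id)
) ENGINE=InnoDB;
```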
2) Multi-dimensional slicing
Multi-dimensional combined queries, grouping, and summarization are very common, so composite indexes over multiple dimension fields are essential, and the choice and order of fields in a composite index matter greatly.
Which field should come first? The following principles generally apply:

a) MySQL matches only the leftmost prefix of an index, so choose the most frequently queried field as the first one. Special case: what if a few queries omit that field; is a second index needed? Suppose that in the Pangu system the operating unit appears in almost every query, so the composite index [operating unit, industry, product line, ...] is created. But some senior managers' queries contain no operating-unit condition; do we also need [industry, product line, ...]? It depends, and here is a tip: handle it at the application layer by appending an "operating unit IN (all operating units)" condition to any SQL statement that lacks one, so the existing index still matches.
b) The discrimination principle: the field with the highest selectivity, computed as COUNT(DISTINCT column)/COUNT(*), deserves the leading position.
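Selectivity can be estimated directly with that formula; a sketch against the weekly table used later in this article, assuming both candidate columns exist there:

```sql
-- Compare the discrimination of two candidate leading columns.
-- The column whose ratio is closer to 1 is the better discriminator.
SELECT COUNT(DISTINCT trade_id)        / COUNT(*) AS trade_id_selectivity,
       COUNT(DISTINCT product_line_id) / COUNT(*) AS product_line_selectivity
FROM data_acct_info_weekly;
```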
There is also a question people tend to overlook: which field goes last? The answer: put fields that may carry range conditions at the end. Here is a case:
... WHERE avg_csm_weekly > 100 AND trade_id = 19 ORDER BY balance
If the composite index is [avg_csm_weekly, trade_id, balance], the range condition on avg_csm_weekly prevents MySQL from using the remaining index columns. Following the rule above, [trade_id, balance, avg_csm_weekly] serves this query better: the equality condition and the ORDER BY use the index, and the range column comes last.

Aggregation
MySQL does not support hash aggregation, only stream aggregation: data is first sorted by the GROUP BY fields, then scanned sequentially and aggregated group by group. If you see Using temporary and Using filesort in the Extra column of EXPLAIN, the aggregation used a temporary table and a file sort, which may perform poorly. The ideal optimization goal is a covering index for the aggregation: no table rows need to be read at all, and the aggregation completes entirely on the index.
The following query statement uses the composite index [trade_id, product_line_id]:

SELECT trade_id, product_line_id, COUNT(*) FROM data_acct_info_weekly GROUP BY trade_id, product_line_id
Observe the query plan: the Extra column displays Using index, meaning this is a covering-index query.
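As a sketch of that check, assuming the composite index (trade_id, product_line_id) exists on data_acct_info_weekly:

```sql
-- If the composite index covers the query, the Extra column of the
-- EXPLAIN output reports "Using index" (no table rows are read).
EXPLAIN
SELECT trade_id, product_line_id, COUNT(*)
FROM data_acct_info_weekly
GROUP BY trade_id, product_line_id;
```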
In OLAP analysis, aggregation over time ranges is very common. The following uses an account's daily consumption table as an example to summarize several common time-aggregation query templates.
| account_id (account) | stdate (data date) | click_pay (click consumption) |
| --- | --- | --- |
| 1 | 2013-08-01 | 100 |
| 1 | 2013-08-02 | 150 |
| 2 | 2013-08-01 | 125 |
1) Cumulative aggregation
Returns each account's cumulative consumption and running average from its first day up to each date.
SELECT a.account_id, a.stdate, SUM(b.click_pay), AVG(b.click_pay)
FROM data_account_csm_daily a
INNER JOIN data_account_csm_daily b
ON a.account_id = b.account_id AND b.stdate <= a.stdate
GROUP BY a.account_id, a.stdate
ORDER BY a.account_id, a.stdate
2) Sliding accumulation
Returns each account's cumulative consumption and average within a fixed sliding window (here, the past 30 days).
SELECT a.account_id, a.stdate, SUM(b.click_pay), AVG(b.click_pay)
FROM data_account_csm_daily a
INNER JOIN data_account_csm_daily b
ON a.account_id = b.account_id
AND b.stdate <= a.stdate
AND b.stdate >= DATE_ADD(a.stdate, INTERVAL -30 DAY)
GROUP BY a.account_id, a.stdate
ORDER BY a.account_id, a.stdate
3) MTD accumulation
Returns the cumulative consumption and average value of the account since the beginning of the month.
SELECT a.account_id, a.stdate, SUM(b.click_pay), AVG(b.click_pay)
FROM data_account_csm_daily a
INNER JOIN data_account_csm_daily b
ON a.account_id = b.account_id
AND b.stdate <= a.stdate
AND b.stdate >= DATE_FORMAT(a.stdate, '%Y-%m-01')
GROUP BY a.account_id, a.stdate
ORDER BY a.account_id, a.stdate
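On MySQL 8.0 or later (a newer version than this article assumes), the same running totals can be expressed with window functions instead of a self-join. A sketch of the cumulative template:

```sql
-- Sketch: cumulative consumption per account via a window function.
-- Requires MySQL 8.0+; the self-join templates above work on older versions.
SELECT account_id,
       stdate,
       SUM(click_pay) OVER w AS cum_click_pay,
       AVG(click_pay) OVER w AS avg_click_pay
FROM data_account_csm_daily
WINDOW w AS (PARTITION BY account_id
             ORDER BY stdate
             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW);
```

The sliding variant follows by changing the frame, e.g. RANGE BETWEEN INTERVAL 30 DAY PRECEDING AND CURRENT ROW for the 30-day window.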
Next, ROLLUP and CUBE. Suppose you need aggregates at every roll-up level of N dimensions: without ROLLUP you would run a separate GROUP BY for each level and UNION the results, whereas WITH ROLLUP returns all of those groupings in a single query. The two statements below return the same results but have different execution plans: the former scans the table once, while the latter scans it four times.
Statement 1:
SELECT col1, col2, col3, SUM(col4) FROM table GROUP BY col1, col2, col3 WITH ROLLUP
Statement 2:
SELECT col1, col2, col3, SUM(col4) FROM table GROUP BY col1, col2, col3
UNION
SELECT col1, col2, NULL, SUM(col4) FROM table GROUP BY col1, col2
UNION
SELECT col1, NULL, NULL, SUM(col4) FROM table GROUP BY col1
UNION
SELECT NULL, NULL, NULL, SUM(col4) FROM table
Unlike ROLLUP, which summarizes along one hierarchy of levels, CUBE summarizes over all combinations of dimensions: N dimensions require 2^N grouping operations. The current version of MySQL does not support CUBE, but just as ROLLUP can be simulated by UNIONing multiple GROUP BY queries, CUBE can be simulated by UNIONing multiple ROLLUP queries. For two dimensions, for example, CUBE needs the groupings (col1, col2), (col1), (col2), and the grand total; GROUP BY col1, col2 WITH ROLLUP produces all of these except (col2), which one extra GROUP BY col2 query supplies via UNION.

Subquery vs JOIN
Complex analysis scenarios make subqueries unavoidable, and subqueries come with many performance traps and misconceptions.
1) Is the primary cause of poor subquery performance that MySQL materializes subqueries into temporary tables? Not exactly. Temporary tables themselves are not terrible: a complete SQL statement (FROM/JOIN/GROUP/WHERE/ORDER ...) may generate temporary tables even before index optimization is considered. The stricter statement is that queries against the temporary table generated for a subquery cannot use indexes, and that is what hurts performance.
2) What is the real cause of poor IN-subquery performance? Is it that the temporary table behind the IN holds too much data and MySQL is too weak to handle IN subqueries? Not necessarily: IN against a literal list, such as IN (a, b, c), performs fine. The real trap is that the MySQL optimizer often rewrites an independent IN subquery into a correlated EXISTS subquery. That is why, when you examine the plan for SELECT * FROM table1 WHERE table1.id IN (SELECT id FROM table2), the query against table2 shows up as DEPENDENT SUBQUERY: a consequence of MySQL's optimizer strategy and its history.
3) Are subqueries always slower than JOINs? Not necessarily. Because MySQL does not support semi-join (see note), subqueries outperform JOINs in some scenarios. For example, given a one-to-many relationship between table A and table B, if you only want the rows of A that have matching rows in B, the JOIN formulation needs DISTINCT or GROUP BY to de-duplicate, an overhead a correlated subquery avoids:

SELECT id FROM table1 WHERE EXISTS (SELECT table2.id FROM table2 WHERE table2.id = table1.id)
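For comparison, a sketch of the JOIN formulation of the same query, using the table names from the example above; the DISTINCT is exactly the overhead the correlated subquery avoids:

```sql
-- JOIN version: each table1 row can match many table2 rows,
-- so DISTINCT (or GROUP BY) is needed to de-duplicate the result.
SELECT DISTINCT table1.id
FROM table1
INNER JOIN table2 ON table2.id = table1.id;
```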
For JOINs, MySQL uses the nested-loop algorithm (see note). In a typical star dimension model, dimension tables are far smaller than the fact table, so joins are between one large and several small tables, and performance issues are rarely significant. Combined with the covering index mentioned above, here is a way to use JOIN to improve paging efficiency:
LIMIT with OFFSET is often used for paging. When the offset is large, for example LIMIT 100000, 50, MySQL must retrieve 100050 rows and discard the first 100000, and performance degrades severely. Common workarounds: a) add an auxiliary sort column and convert the LIMIT into a range search on that column; b) cache at the application layer; c) compromise on the requirement, since nobody pages to page 100000. If none of the above works, use covering index + JOIN:
SELECT * FROM table1
INNER JOIN (SELECT id FROM table1 ORDER BY indexed_col LIMIT 100000, 50) AS a
ON table1.id = a.id
This method is more efficient because the derived table a touches only the index (InnoDB stores primary-key values in the index leaf nodes); after obtaining the 50 needed row ids, the join back to table1 fetches the remaining columns.
Note: MariaDB, the well-known MySQL fork, supports semi-join and hash join.
Other

Pivot and unpivot operations focus on rotating rows into columns and vice versa; they can also format aggregated data for report presentation and are not covered further here.