MySQL and OLAP: a best practice exploration of analytic SQL queries

Source: Internet
Author: User

A multi-dimensional analysis, rough fast solution is to use ROLAP (relational OLAP). Data is modeled after dimensions and stored in the Mysql,rolap engine (such as open source Mondrian) is responsible for translating OLAP requests into SQL statements to the database. The OLAP computation analysis function causes the MySQL to need to carry on the more complex SQL query, the performance tuning is essential, this article summarizes some practical principles.

OLAP Features

Typical applications for OLAP include complex dynamic reports that require support for drillthrough (roll-up and drill-down), slicing, dicing, and rotating operations. The following table summarizes the main differences between OLAP and OLTP systems. The characteristics of OLAP determine SQL query scenario and optimization scheme, the following will be from the index, aggregation, sub-query, table connection and pivoting, and other aspects are described separately.

Olap

Oltp

User Volume

Relatively small number of analyst users

High concurrency

Database design

Dimension model: Star, snowflake model

Standardization

Data volume

Big, easy to tens

Small, generally not more than millions

SQL Read-write scenario

Regular import, no updates, complex queries retrieve large amounts of data every time

Read and write small amounts of data in transactions per unit


An index of commonplace

After weighing the data fault-tolerant recovery and performance, the storage engine chooses InnoDB. The attributes of the InnoDB index are the primary key clustered index and the B+tree data structure. With these two features, you can improve the performance of data import and multi-dimensional combined slices.

1) Data import speed

For the InnoDB table primary key index, the clustered index causes all data in the table to be stored on the primary key index leaf node in the primary key order. If you do not import data in the primary key order, it can result in additional paging, data lookup, and mobile IO operations, so that the insertion speed of the InnoDB table is heavily dependent on the insertion order. The workaround is simple: The primary key uses the Auto_increment column.


2) Multi-dimension slicing

Multi-dimensional combination queries, grouping, and summarization operations are common, and it is essential to add composite indexes on multiple dimension fields, and the field selection and order of composite indexes is particularly important.

Who's in the line? The following principles are generally followed:

A) MySQL only index the leftmost prefix match, you can choose the most frequently queried field ranked first. Special case: How do I handle a small number of query scenarios if the field does not exist? Do I need to build another index? Suppose in Pangu system, operating units will generally appear in all queries, so will be established [operating units, industry, product lines ...] Composite index, but some high-level managers ' query statements do not include operating units, then they need to be re-established [industry, product line ...] Composite index? The answer is to look at the situation and provide a tip: Apply layer processing to add "operating unit in (all operating units)" condition to query SQL that does not include operating unit conditions

b) Best performance optimization principle determines the first ranking of the fields with the highest index sensitivity (count (distinct column)/count (*) can be calculated)

There is a problem that people tend to overlook, who is the last? The answer is: Geofang the words that might exist for the range criteria to be retrieved last. Here's a case.

...... WHERE  avg_csm_weekly >100  and  trade_id=  ORDER by balance
Assuming that the composite index is set to [avg_cms_weekly,trade_id, balance], MySQL will not use the remaining indexes because of the scope conditions on the avg_csm_weekly.

Polymerization

MySQL does not support hash aggregation, only stream aggregation is supported. Stream aggregation is sorted first based on the fields of group by, and then streams the sorted data for grouping aggregations. If you see the using temporary and using Filesort in the extra column of explain, the aggregation uses temporary table and file sort operations, which can lead to poor performance. The best optimization goal is to have the aggregation operation use covering index, which is to complete the aggregate query on the index without querying the table data at all.

The following query statement uses a composite index [trade_id,product_line_id]

Select Trade_id,product_line_id,count (*) from data_acct_info_weekly group bytrade_id,product_line_id

observe the query plan, Extra Column Display Using Index that indicates that the operation is covering Index query.

In OLAP analysis, aggregation operations on a time range are very common. The following is an example of the account daily consumption chart, summarizing several common time aggregation query templates

ACCOUNT_ID (account)

Stdate (data date)

Click_pay (click to spend)

1

2013-08-01

100

1

2013-08-02

150

2

2013-08-01

125

1) Cumulative Aggregation

Returns the cumulative consumption and average of an account since it was added.

SELECT a.account_id,a.stdate, SUM (Click_pay), AVG (Click_pay) from data_account_csm_daily a INNER JOIN data_account_csm_ Daily BON a.account_id=b.account_id  Andb.stdate<=a.stdategroup by A.account_id,a.stdateorder by a.account_id, A.stdate

2) Slide Accumulation

Returns the cumulative consumption and average of the account fixed window time

SELECT a.account_id,a.stdate, SUM (Click_pay), AVG (Click_pay) from data_account_csm_daily a INNER JOIN data_account_csm_ Daily BON a.account_id=b.account_id  andb.stdate<=a.stdate and    b.stdate>=date_add (A.stdate,interval- GROUP by A.account_id,a.stdateorder by A.account_id,a.stdate

3) MTD Cumulative

Returns the total consumption and average of the account since the beginning of the month

SELECT a.account_id,a.stdate,sum (Click_pay), AVG (Click_pay) from data_account_csm_daily a INNER JOIN data_account_csm_ Daily BON a.account_id=b.account_id  andb.stdate<=a.stdate    and B.stdate>=date_format (A.stdate, "%Y-% M-01 ") GROUP by A.account_id,a.stdateorder by A.account_id,a.stdate
Then the next rollup and cube are discussed. Assuming that the user needs to aggregate n dimensions, the n group by is then union the result, and using rollup can query the results of the N group by operation at one time. The following two statement query results are consistent, the execution plan is different, the former only need to scan once, the latter need to scan the table four times.

Statement 1:

SELECT Col1,col2,col3,sum (COL4) from TableGroup bycol1,col2,col3with ROLLUP
Statement 2:

SELECT Col1,col2,col3,sum (COL4) from TableGroup bycol1,col2,col3unionselect col1,col2,null,sum (COL4) from TableGroup Bycol1,col2unionselect Col1,null,null, SUM (COL4) from TableGroup to Col1unionselect null,null,null,sum (COL4) from table
Unlike rollup, which summarizes dimensions at the same level, cube summarizes all the dimensions, and the n-dimensional cube requires 2 N-sub-grouping operations. The current version of MySQL does not support cube operations, but with multiple group operations The Union simulates rollup the same, or you can use multiple rollup to operate the Union simulation cube.

Sub-query vs JOIN

Complex requirements scenarios cause some subquery scenarios to be unavoidable. There are many performance traps and misunderstandings in the sub-query, which deserves attention.

1) The main reason for poor MySQL subquery performance is that the subquery produces a temporary table? Not entirely correct, temporary tables are not scary, a complete SQL statement, From/join/group/where/order, and so on, without regard to the optimization of the index, it is possible to generate temporary tables. Therefore, a more rigorous formulation is that querying on a temporary table produced by a subquery can result in poor performance due to an inability to take advantage of indexes.

2) What are the real reasons why in subqueries tend to perform poorly? is the in query of the temporary table data too large, MySQL is too weak to support only a very small number of in subqueries? Not necessarily, the performance of the display list in (a,b,c) query is not bad, in subqueries the real performance trap is that the MySQL optimizer tends to optimize in-standalone subqueries into exists dependent subqueries! So when you look at the query plan for SELECT * FROM table1 WHERE table1.id in (select ID from table2), you will find that the Table2 query is depedentsubquery because of the MySQL optimization strategy + Historical reasons.

3) is the performance of subqueries necessarily weaker than join? Not necessarily, because MySQL does not support semi join (note), in some need scenarios, using subquery performance is better than join. For example, A and b a one-to-many relationship, if you just want to query the B table in the corresponding records of a table records, if you use join, you need to use distinct or group operations to redo. Use associated subqueries to avoid this overhead. Select ID from table1 WHERE EXISTS (select Table2.id from Table2where table2.id=table1.id)

About Join,mysql using the nested loop algorithm (note). In the typical star dimension model, the dimension table data volume is much smaller than the fact table, the join operation is often the size table connection, the performance problem is not big, this is not much to say. Combining the covering Indexmentioned earlier, this paper introduces a trick to improve paging efficiency by using JOIN:

paging often needs to be usedLIMIT OFFSET,when the offsets are large, such asLIMIT 100000,50,MySQLneed to retrieve100050data, the performance is severely degraded. The common way to handle this isa) To increase the sort secondary column,LIMITConvert to a range lookup operation on a secondary columnb) application layer caching mechanismCno one will turn to the100000page. When the above is not the spirit, you can choosecovering Index+join.

SELECT * FROM table1 INNER JOIN (select id from table1 ORDER by Indexed_col limit 100000,50) as a on table1.id = a.ID
This approach is more efficient because temporary tablesaoperate on the index only (InnodbThe primary key value is stored on the leaf node of the index) to obtain the desired rowIDafter that, and then the complete table isJoingets the additional required columns.

Note: MySQL the famous branch Mariodb Support Semi Join and the Hash Join

Other pivoting&unpivoting focus on the rotation of the row and column, and can be used to format the aggregated data for the presentation of the report, no longer repeat

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.