SQL performance optimization in the Data Warehouse (MySQL chapter)


For the first two years of the data warehouse, all calculations (including data ETL and report computation) were done on a single high-spec machine running MySQL, with no OLAP engine. We used MySQL's own MyISAM engine and the columnar storage engine Infobright. This article summarizes some of the common performance issues and solutions that my team and I ran into during that time.

P.S. Unless stated otherwise, "MySQL" below refers to MySQL with MyISAM as the storage engine.

1. Avoid repeating calculations on existing data

Business requirements often involve week-long or month-long figures, such as the PV/UV of a particular page over the last week. The problem is that the obvious implementation grabs the entire week's log data and computes the result from scratch. There is repeated work in this: a single day's data ends up being processed 7 times, once on each of 7 different days.

The solution is very simple: split the calculation. For PV, compute the daily PV first; the weekly PV is then just the sum of 7 daily PVs. For UV, extract the corresponding visitor data from each day's log and keep the last seven days of visitor data in a dedicated table; the weekly UV is then computed directly from that table, with no need to drag a large amount of data out of the raw logs again.
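A minimal sketch of the idea, assuming hypothetical table and column names (access_log with page_id and visit_date, daily_page_pv for the stored daily results): compute each day's PV once, then derive the weekly figure from the stored daily rows.

    -- Daily job: aggregate one day's PV from the raw log (names and dates are illustrative)
    INSERT INTO daily_page_pv (stat_date, page_id, pv)
    SELECT '2012-06-01', page_id, COUNT(*)
    FROM access_log
    WHERE visit_date = '2012-06-01'
    GROUP BY page_id;

    -- Weekly PV: sum the 7 pre-computed daily rows instead of re-scanning the raw log
    SELECT page_id, SUM(pv) AS weekly_pv
    FROM daily_page_pv
    WHERE stat_date BETWEEN '2012-05-26' AND '2012-06-01'
    GROUP BY page_id;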

This is a very simple point that hardly requires any SQL knowledge, yet it is often overlooked during development: the code implements the business logic but ignores performance. Engineers in a small-scale data warehouse, if they lack this awareness or the corresponding work habits, fall into this trap easily, and the problem is only discovered once the warehouse's data volume has grown fairly large.

2. How to use the CASE keyword

The CASE keyword makes it convenient to classify data within a single SQL statement when doing aggregation. For example, suppose there is a score table named scores.
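For the examples that follow, assume the table has a layout along these lines (the column names are assumptions inferred from the description, not taken from the original post):

    -- Assumed structure: one row per student per subject
    CREATE TABLE scores (
        name    VARCHAR(32),   -- student name, e.g. '小张', '小明'
        subject VARCHAR(32),   -- subject, e.g. '语文' (Chinese)
        score   INT
    ) ENGINE = MyISAM;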

 

Now we need to compute Xiao Zhang's average score, Xiao Ming's average score, and Xiao Ming's Chinese (语文) score, all in one result row.

The SQL implementation is as follows:
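A sketch of that first style of implementation, with every restricting condition placed inside CASE (column names follow the assumed layout above):

    SELECT
        AVG(CASE WHEN name = '小张' THEN score END)                      AS zhang_avg,
        AVG(CASE WHEN name = '小明' THEN score END)                      AS ming_avg,
        MAX(CASE WHEN name = '小明' AND subject = '语文' THEN score END) AS ming_chinese
    FROM scores;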

Suppose the scores table now holds 12 million rows (4 million names × 3 subjects). How long does the above calculation take? In a simple test, the answer was 5.5 seconds.

Now add an index on the name column and change the SQL to the following form:
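A sketch of the revised form, under the same assumptions: the restriction on name moves into WHERE, where the new index can be used.

    ALTER TABLE scores ADD INDEX idx_name (name);

    SELECT
        AVG(CASE WHEN name = '小张' THEN score END)                      AS zhang_avg,
        AVG(CASE WHEN name = '小明' THEN score END)                      AS ming_avg,
        MAX(CASE WHEN name = '小明' AND subject = '语文' THEN score END) AS ming_chinese
    FROM scores
    WHERE name IN ('小张', '小明');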

In this case, the query completes in only 0.05 seconds.

So, now that the index exists, doesn't the earlier version also become faster? The answer is no: it takes exactly as long as before.

And without the index, how long does the latter form take? The test result is 3.3 seconds.

To recap the four combinations: with the conditions only inside CASE, the query takes about 5.5 seconds whether or not the index exists; with the conditions in WHERE, it takes 3.3 seconds without the index and 0.05 seconds with it.

The WHERE-based version is always faster than the CASE-only version; the difference lies in whether the data is filtered in WHERE first. Putting the conditions in WHERE has two advantages. First, WHERE can use an index, while conditions buried inside CASE are very unlikely to (how indexes get chosen is not covered in detail here; at least in this example, the CASE-only version did not use the index). Second, it filters the data early: even without an index, the CASE-only version effectively processes the full table's data three times (once per CASE expression), whereas the WHERE version scans the full table once, filters the rows, and leaves CASE with far less data to work on.

The reality is that development often focuses only on implementing the functional logic, and people get used to writing the restricting conditions inside CASE. So for requirements like the one in this example, the conditions that belong in WHERE never make it there. This is the most common kind of problem found in real code.

3. How to fetch data in pages

There is an important basic step in the data warehouse: data cleaning. For example, if the source data is stored as JSON, the MySQL warehouse must extract the required fields from the JSON and turn them into proper table columns. This step is cumbersome to do directly in SQL, so it can be handled with the JSON library of a mainstream programming language such as Java. Parsing requires reading the data, and reading it all in at once is not feasible, so it has to be read in batches (in effect, paging).

The initial implementation recorded an offset for each fetch and read batch after batch:
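A sketch of that offset-based pattern (the table name raw_log and the batch size are illustrative):

    -- Each batch moves the offset forward; MySQL still has to read and throw away
    -- every row before the offset, so later batches get slower and slower.
    SELECT * FROM raw_log LIMIT 0, 10000;
    SELECT * FROM raw_log LIMIT 10000, 10000;
    SELECT * FROM raw_log LIMIT 20000, 10000;
    -- ...
    SELECT * FROM raw_log LIMIT 9990000, 10000;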

With code like this, the first few SQL statements run quickly enough, but execution gets slower and slower toward the end, because each time MySQL has to read a large amount of data just to discard it, which is pure waste.

The efficient way is to page by the table's primary key. If the data is sorted by primary key, this can be done directly (it requires the primary key values to be contiguous; assume here that the key values form a known, continuous sequence from 10001 to 1000000):
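Under that assumption, each page becomes a fixed window of primary-key values (table name, key column, and window size are illustrative):

    -- Every batch reads exactly one window of keys via the primary key index;
    -- nothing is read and discarded, so every batch costs about the same.
    SELECT * FROM raw_log WHERE id >= 10001  AND id < 20001;
    SELECT * FROM raw_log WHERE id >= 20001  AND id < 30001;
    -- ...
    SELECT * FROM raw_log WHERE id >= 990001 AND id < 1000001;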

Even if the data is not sorted by primary key, you can still page by restricting the range of primary key values. In that case, gaps in the key sequence are not a big problem: each batch simply returns fewer rows than the ideal page size, and since the data is only being pulled out for processing, correctness is unaffected:
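The same key-range pattern still applies; a window simply returns fewer rows wherever the key values have gaps:

    -- Keep advancing the window until the maximum key value is passed;
    -- batches may be smaller than the window size, which is harmless here.
    SELECT * FROM raw_log WHERE id >= 10001 AND id < 20001;
    SELECT * FROM raw_log WHERE id >= 20001 AND id < 30001;
    -- ...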

Done this way, because the primary key is indexed, the speed of fetching a batch is not affected by where in the table the data happens to sit.

4. Index usage

Index usage is a very important topic in SQL optimization for relational databases, and it counts as common knowledge. In practice, though, engineers often feel the job is done once the index has been created and never check whether it is actually being used correctly, so here is a brief case study on indexes.

Another example. Suppose an e-commerce website accumulates one day's access log in a table item_visits. Each record means one item was visited once and carries some information about the visitor, such as the user's ID and nickname; the table has more than 12 million rows.

There is also a product table items with more than 800 products, containing each product's name and category.

We now want to count how many times each product category (item_type) was visited. The SQL is not hard to write:
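A sketch of that statement (the join and grouping columns follow the descriptions above; the exact names are assumptions):

    SELECT i.item_type, COUNT(*) AS visit_count
    FROM item_visits v
    JOIN items i ON v.item_id = i.item_id
    GROUP BY i.item_type;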

Since this is a join, an index is needed on the join key. At this point some engineers conveniently add the index on items.item_id. The query then takes 95 seconds to run. (P.S. In my test scenario the log table has more than 20 fields, so although its row count is about the same as the scores table in section 2, its total data volume is much larger.)

The word "conveniently" above already hints at the problem. So how long does the query take if the index is instead built on item_visits.item_id? 80 seconds.

Checking the execution plan with EXPLAIN reveals the following.

Notice that the log table is used as the driving table (that is, rows are scanned from the log table, with the items table probed inside the nested loop), so the item_id indexes of both tables are used: the index on items for the join lookup, and the index on the log table as a covering index (which is also why this run is faster than the previous one). It looks "cost-effective", but in fact, by giving up having the small items table drive the join, it is far slower than it could be.

Next, use STRAIGHT_JOIN to rewrite the SQL so that the small table drives the join:
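With STRAIGHT_JOIN the table written on the left is always read first, so listing items on the left forces the small table to drive (same assumed names as before):

    SELECT i.item_type, COUNT(*) AS visit_count
    FROM items i
    STRAIGHT_JOIN item_visits v ON v.item_id = i.item_id
    GROUP BY i.item_type;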

Then look at the execution plan again: this time the small table items is the driving table.

(MySQL's caching behavior here is also covered on page 35 of the slides.) With the small table driving the join, the problem does not arise.

Update: strictly speaking, this scenario carries a restriction, namely that the item_ids appearing in the big table are only a subset of all item_ids. If the item_ids in the big table covered nearly all item_ids evenly, the run time would be almost the same no matter which table's index is used for the join. This was overlooked when the experiment was first done and only noticed on a retry, so I add it here.

Summary: there are two lessons here. One, think about how an index should be added rather than dogmatically remembering that "a join needs an index"; when in doubt, look at the execution plan instead of trying to optimize SQL by memorizing a few tricks. Two, the execution plan MySQL chooses is not necessarily the best; if it turns out to be badly off, the engineer has to step in, and MySQL offers hints similar to Oracle's to get the desired behavior, like the STRAIGHT_JOIN in this example.

5. Too many joins

In MySQL, joining too many tables at once causes a significant drop in performance. Again, an example to illustrate.

First, create a table (named test) with only 60 records and 6 fields, where the first field is the primary key:
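One possible shape for such a table (column names and types are arbitrary):

    CREATE TABLE test (
        id INT PRIMARY KEY,
        c1 INT, c2 INT, c3 INT, c4 INT, c5 INT
    ) ENGINE = MyISAM;
    -- then load 60 rows into it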

Then run a query:
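For instance, joining the table to itself on the primary key:

    SELECT COUNT(*)
    FROM test t1
    JOIN test t2 ON t1.id = t2.id;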

This joins the test table to itself. The result is obviously 60, and it takes almost no time at all.

But how long would a query like this take (the test table joined 10 times over)?
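That is, the same pattern extended to ten copies of the table:

    SELECT COUNT(*)
    FROM test t1
    JOIN test t2  ON t1.id = t2.id
    JOIN test t3  ON t1.id = t3.id
    JOIN test t4  ON t1.id = t4.id
    JOIN test t5  ON t1.id = t5.id
    JOIN test t6  ON t1.id = t6.id
    JOIN test t7  ON t1.id = t7.id
    JOIN test t8  ON t1.id = t8.id
    JOIN test t9  ON t1.id = t9.id
    JOIN test t10 ON t1.id = t10.id;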

The answer: well over 5 minutes. In the actual test no result had come back after 5 minutes. For convenience the test joins one table with itself 10 times; the effect would be exactly the same with 10 different tables.

So what exactly is MySQL doing? Use SHOW PROCESSLIST to look at the running query:

It was sitting in the statistics state. According to the MySQL documentation, this state means the server is using statistics to build the execution plan, which by itself does not explain much. What is actually happening is that one step of building the execution plan is deciding the join order of the tables. By default, MySQL enumerates every possible join order, estimates the execution cost of each, and picks the best one. A join of N tables therefore has N! orderings to consider; a 10-table join is 10!, roughly 3.6 million, so it is no surprise that MySQL spends so long analyzing.

In real development there have been cases with more than 30 tables joined (on the order of 10^32 possible join orders). When that happens, the time spent in the statistics state often exceeds an hour, and that is with tables whose data volumes are tiny, so nearly all of the time goes into planning. As for how such queries arise: a summary report needs to compute too many fields, the data has to be calculated in many different places and then stitched together, more fields get added as the report is maintained while abandoned fields are never removed for one reason or another, so the field count keeps growing; computing all those fields requires ever more intermediate result tables to be joined in, until MySQL simply cannot bear the weight.

There are two ways to tackle this. From the development side, control the number of tables in any single join: if too many tables have to be joined, group them by business category and do a first round of joins with the table count kept within a reasonable range, then take the first-round results and do a second, global round of joins; the problem then disappears. From the operations side, set the optimizer_search_depth parameter, which limits how deep the optimizer traverses join orders, so a greedy search settles for a locally optimal order. In general, when this many tables are joined it is usually data of the same dimension being stitched into one wide table, and the join order hardly matters, so lowering this value appropriately should have no real impact on performance.
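A sketch of the operations-side tweak (the value 4 is purely illustrative; the default 62 means an exhaustive search, and 0 lets MySQL pick a value on its own):

    -- Limit how deep the optimizer searches join orders; with many same-dimension
    -- tables the join order barely matters, so a greedy search is good enough.
    SET SESSION optimizer_search_depth = 4;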

6. The columnar storage engine Infobright

Infobright is a MySQL-based storage engine with features such as column storage, column compression, and the Knowledge Grid, which make it well suited to data warehouse workloads. It is convenient to use, since there is no need to worry about things like indexes. After using it for a while, though, we also found a few issues that deserve attention.

The first issue is the same as with MyISAM: do not fetch data you do not need. Unneeded data includes unnecessary columns (common sense when using Infobright; it matters with row storage too, but the impact there is smaller, so it is not discussed separately) and unnecessary rows (row counts keep growing; with row storage a single row basically fits in one storage unit, but with column storage an entire column obviously cannot).

The second issue is that Infobright is weak at searching within long strings. A site's access log typically has a URL field identifying the exact address visited, which creates the need to look up specific URLs. For example, counting the visits to my blog in cnblogs' access log:
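Something like the following (table name, column name, and the URL fragment are all illustrative):

    SELECT COUNT(*) AS blog_visits
    FROM access_log
    WHERE url LIKE '%/my_blog/%';   -- substring match inside a long URL field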

For needs like this, retrieving a substring inside a long string, Infobright's execution time in my tests was 1.5 to 3 times that of MySQL.

As for why it is slow, a brief explanation: being a columnar database, Infobright uses the usual columnar techniques of column storage and compression (columnar databases can generally compress data to under 10% of its original size, and Infobright is no exception). In addition, to speed up retrieval it uses the Knowledge Grid, which can greatly reduce the amount of data that has to be read. The principles of the Knowledge Grid are beyond the scope of this article and can be seen here. When querying by URL, however, the Knowledge Grid's advantages cannot come into play, yet the overhead it introduces remains, and the cost of retrieving and decompressing long strings is still paid in full, even exceeding the cost of querying an ordinary numeric field.

From this principle a workaround follows that explains the problem, even if it is not very practical: if a table is frequently queried on a long-string field, sort the data by that field before importing it. Then, when querying on that field, the Knowledge Grid can rule out many more "data packs" (Infobright's unit of compressed data), whereas with unsorted data the matching rows are scattered across the packs and far more decompression work has to be done. Queries using this method tested at only about 0.5 times the execution time of MySQL.
