Talking about SQL statement optimization techniques


https://blogs.msdn.microsoft.com/apgcdsd/2011/01/10/sql-1/

I. Introduction

A problem facing many developers and database administrators is the performance of the database system. Sometimes performance is poor because the application load genuinely exceeds the server's processing power, but more often it is because the system contains a large number of SQL statements that need to be optimized. Using an example, this article shows how simplification techniques can be applied to optimize SQL statements, and it also discusses some performance-related issues that should be considered during the system design phase.

If you do not know which SQL statements in your database system should be optimized, see my other article, "Applying Profiler to optimize the SQL Server database system." This article does not discuss indexes; for that, see my article "Applying indexing technology to optimize SQL statements," since indexing is an important part of optimizing SQL statements.

II. Simplifying SQL statements

1. Reasons for simplification

The simpler the SQL statement, the better; the more complex the statement, the less predictable its execution performance. Consider SQL Server's query optimizer. It is a cost-based optimizer: it analyzes the possible execution plans and selects the one with the lowest estimated cost. For a simple statement, the optimizer quickly finds an efficient execution plan, such as a trivial plan or a quick plan, and executes the statement. This is the ideal case, because SQL Server spends very few resources building a simple plan, and for simple statements such a plan is almost always the best way to execute.

A complex statement, on the other hand, may have thousands of possible execution plans. In that case the query optimizer does not analyze every combination; it uses sophisticated algorithms to find a plan reasonably close to the theoretical optimum, because full optimization would be very resource-intensive. The more complex the statement, the more likely SQL Server is to stop searching and execute the statement with the best plan found so far; after all, if finding the optimal plan takes as long as executing the statement, it is better to just execute it. The execution plan SQL Server generates is therefore not necessarily the optimal one. Given this behavior of the query optimizer, keep SQL statements as simple as possible for stable execution performance, and try to simplify complex SQL statements.

2. Means of simplification

There are many ways to simplify. You must consider how to avoid complex queries as early as the system planning phase: a bad design can force you into multiple joins across countless tables just to get at the data, which greatly reduces performance. The common simplification rules are as follows:

1) Do not join more than five tables

2) Consider using temporary tables or table variables to hold intermediate results.

3) Use fewer subqueries

4) Do not nest views too deeply

The more tables you join, the longer compilation takes, the higher the join overhead, and the harder the performance is to control. It is best to split a large join into several smaller steps and execute them in sequence, performing first the joins that sharply reduce the result set. The benefit of splitting is not only that it cuts SQL Server's optimization time, but also that the SQL statements then execute in a manner and order you can predict. SQL Server's query optimizer is not perfect, and this simplification technique keeps it from producing a poor execution plan. If you need to join a great many tables to get your data, it probably indicates a design flaw.

Do not nest views deeply. I once saw a system that built a large number of views, each one a complex statement, and then built further views on top of them, nesting four or more levels. I am not against views, but the more levels you stack, the more the statement expands when the views are resolved and the harder it becomes to optimize. In general, view nesting should not exceed two levels.

Storing intermediate results in temporary tables or table variables likewise serves to simplify the SQL statement and to control the order and manner of execution. At times this technique has dramatic effects.
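As a minimal sketch of this splitting technique (the table and column names here are hypothetical), apply the most selective filter first, keep the small intermediate result in a temporary table, and then join it to the remaining tables:

-- Step 1: filter first, so the intermediate result stays small
-- (t_order, t_customer, and t_item are hypothetical tables).
select o.order_id, o.cust_id, o.item_id
into #small
from t_order o
where o.order_date >= '20050101' and o.status = 'OPEN'

-- Step 2: join the small intermediate result to the remaining tables.
select s.order_id, c.cust_name, i.item_desc
from #small s
inner join t_customer c on c.cust_id = s.cust_id
inner join t_item i on i.item_id = s.item_id

drop table #small

Each step is now a simple statement whose plan the optimizer can get right, and the steps always run in the order you wrote them.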

3. Examples of simplification

Start by analyzing the execution plan of the SQL statement; very valuable information can often be found there, and taking advantage of it lets you do more with less when optimizing. Let's look at an example.
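The IO statistics and the plan text quoted below are the kind of output SQL Server produces when the standard session options are switched on before running the query; a minimal sketch:

-- Report per-table logical/physical reads for every statement.
set statistics io on

-- Return the actual plan tree with per-operator row counts.
set statistics profile on

-- ... run the statement under investigation here ...

set statistics profile off
set statistics io off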

We have the following SQL statement:

select T1.PID 'PR number', RSDT 'Request date', per.ename 'requestor',
       ped.ename 'Deliver to', dest 'Destination', Pcat.cdesc 'Category',
       'claimable' = case Claim when '1' then 'Yes' else 'No' end,
       'NRE' = case NRE when '1' then 'Yes' else 'No' end,
       'Pmcal' = case pmcal when '1' then 'Yes' else 'No' end,
       'Cap reld' = case Caprel when '1' then 'Yes' else 'No' end,
       'Type' = pt.tdesc, 'section' = Psec.sectdesc,
       str(t1.prvalue, 15, 4) 'PR Value', D.vndid 'Vendor Code',
       t1.status, pes.ename 'PR is/with'
from Prmer T1
left outer join P_view per on per.ecode = T1.reqid and per.Cmpcode = T1.reqidcmpno
left outer join P_view ped on ped.ecode = T1.dlyid and ped.Cmpcode = T1.dlyidcmpno
left outer join P_view pes on pes.ecode = T1.status and pes.Cmpcode = T1.statuscmpno
left outer join PRCG pcat on pcat.catid = T1.catid and pcat.catidcmpno = T1.catidcmpno
left outer join Prte Pt on pt.typeid = t1.Type
left outer join PRst psec on psec.sectprcode = t1.burelated
left outer join PRDTL D on T1.PID = D.pid and T1.Cmpno = d.cmpno and D.itmno = '1'
where (T1.type = '1')
and (
    t1.reqid = '22101' and t1.reqidcmpno = 'P'
    or (
        T1.PID in (select distinct(PID) from Pra1 where ApPID = '22101' and appidcmpno = 'P')
        and T1.Cmpno in (select Cmpno from Pra1 where ApPID = '22101' and appidcmpno = 'P')
    )
)
and (
    T1.PID like '%/0%'
    or T1.PID like '%/1%'
    or T1.PID like '%/2%'
    or T1.PID like '%/3%'
    or T1.PID like '%/4%'
    or T1.PID like '%/5%'
    or T1.PID like '%/6%'
    or T1.PID like '%/7%'
    or T1.PID like '%/8%'
    or T1.PID like '%/9%'
)
order by T1.PID

The STATISTICS IO output is as follows:

Table 'Pra1'. Scan count 2, logical reads 13522, physical reads 5, read-ahead reads 13631.
Table 'Worktable'. Scan count 178595, logical reads 1114272, physical reads 0, read-ahead reads 0.
Table 'PRCM'. Scan count 1, logical reads 2986, physical reads 2, read-ahead reads 2999.
Table 'Pre2'. Scan count 3, logical reads 1659, physical reads , read-ahead reads 369.
Table 'Gb_mp'. Scan count 3, logical reads 5496, physical reads 0, read-ahead reads 1834.
Table 'gb_ml'. Scan count 3, logical reads 81, physical reads 0, read-ahead reads 27.
Table 'PRCG'. Scan count 1, logical reads 4, physical reads 2, read-ahead reads 2.
Table 'Prte'. Scan count 1, logical reads 2, physical reads 2, read-ahead reads 0.
Table 'PRst'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 1.
Table 'Prdtl'. Scan count 1, logical reads 9904, physical reads 3, read-ahead reads 9947.

The corresponding execution plan (partial) is as follows:

Rows     Executes
89668    1       |--Sort(ORDER BY:([t1].[Compno] ASC))
89668    1          |--Merge Join(Left Semi Join, MANY-TO-MANY MERGE:([t1].[Prid])=([...]),
                        RESIDUAL:([Type]='1' AND like([t1].[Prid], '%/1%', NULL) OR like([t1].[Prid], '%...
121820   1             |--Clustered Index Scan(OBJECT:([pr].[dbo].[PRCM].[Pk_p...
1131725  1             |--Clustered Index Scan(OBJECT:([pr].[dbo].[Pra1].[Pk_prapp...
89591    89591         |--Row Count Spool
1        1                |--Filter(WHERE:([pra1].[appridcompno]='P'))
1        1                   |--Bookmark Lookup(BOOKMARK:([Bmk1025]), OBJECT:([pr].[dbo]....
26       1                      |--Index Seek(OBJECT:([pr].[dbo].[Pra1].[Idx_prapprova...

Let's set the execution plan aside for a moment. From the statement text alone I found the following problems:

1) There are too many joins: seven of them, not counting any joins hidden inside the views. Try to reduce the number of joins.

2) Every join is an outer join, which is very bad. An outer join means that all rows of the left (or right) table must be examined; if the table is large and there is no corresponding WHERE condition, an outer join easily leads to a table scan or index scan. Use inner joins where possible to avoid scanning entire tables.

3) The use of uncorrelated subqueries should be reconsidered.

4) The LIKE predicates use a leading percent sign, and together with the chain of OR clauses this greatly increases the chance of a table scan.

Next, look at the STATISTICS IO output and use logical reads to judge which tables are the most expensive. One is Pra1, with 13522 logical reads, which is fairly high; the other is Worktable, with 1114272 logical reads. Tables with high logical reads should be optimized first. Then look at the execution plan. From the plan I identified the following issues:

1) The index on the PID column is not used for the LIKE predicates; the filter is applied only after all the data has been fetched, so the LIKE conditions do not reduce IO at all.

2) A Row Count Spool is used, and it is what generates the huge number of logical reads on Worktable. SQL Server uses the row count spool to store an intermediate result, which corresponds to a worktable in tempdb. If you look carefully, you will see that the spool is driven by a nested loop that scans a small table 89591 times. The key to optimization is to avoid so many loop scans.

3) A clustered index scan appears in two places. A clustered index scan is equivalent to a full table scan.

This suggests that the WHERE conditions are not selective enough, or that the indexes are not good enough.

4) The most critical point: the statement returns only 77 rows, yet logical reads run into the millions. As we said earlier, such a large gap between the result rows and the STATISTICS IO output means that optimization is lacking somewhere. Where exactly? In the LIKE predicates. It is the LIKE filter alone that reduces the returned rows so drastically, yet because of the leading percent sign the LIKE predicates cannot use an index.

From the above analysis we can derive the following optimization suggestions:

1) Use a temporary table to hold the filtered result of table T1: 77 rows in total. Remember, perform first the operation that most reduces logical reads (or the number of rows returned), so we fetch those 77 rows first. Careful analysis of the statement shows that every condition in the WHERE clause applies only to table T1, so we can query T1 with the WHERE clause above and store the result in the temporary table #t1:

select t1.… into #t1 from T1 where … (the same WHERE clause as above)

2) Then join #t1 to the other tables:

select … from #t1
left outer join …
left outer join …

Remember the benefits of splitting statements? The statements now execute in an order and manner we can predict.
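Putting steps 1 and 2 together, here is a sketch of the rewritten statement (abbreviated: only the first two joins and a few output columns are shown; the rest follows the original statement):

-- Step 1: reduce Prmer to the 77 qualifying rows first.
select T1.*
into #t1
from Prmer T1
where (T1.type = '1')
  and (T1.reqid = '22101' and T1.reqidcmpno = 'P'
       or (T1.PID in (select distinct(PID) from Pra1
                      where ApPID = '22101' and appidcmpno = 'P')
           and T1.Cmpno in (select Cmpno from Pra1
                            where ApPID = '22101' and appidcmpno = 'P')))
  and (T1.PID like '%/0%' or T1.PID like '%/1%' or T1.PID like '%/2%'
       or T1.PID like '%/3%' or T1.PID like '%/4%' or T1.PID like '%/5%'
       or T1.PID like '%/6%' or T1.PID like '%/7%' or T1.PID like '%/8%'
       or T1.PID like '%/9%')

-- Step 2: join the 77-row temporary table to the lookup tables.
select T1.PID 'PR number', per.ename 'requestor', ped.ename 'Deliver to' -- , ...
from #t1 T1
left outer join P_view per on per.ecode = T1.reqid and per.Cmpcode = T1.reqidcmpno
left outer join P_view ped on ped.ecode = T1.dlyid and ped.Cmpcode = T1.dlyidcmpno
-- ... the remaining joins as in the original statement ...
order by T1.PID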

3) Modify the application so that the leading percent sign can be removed.

4) From a system design standpoint, modify the statement to remove the outer joins.

5) Consider composite indexes or covering indexes to eliminate the clustered index scans, for example as sketched below.
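For example, a composite index that covers the two Pra1 subqueries might look like this (a sketch; the index name and column order are assumptions, to be validated against the actual workload):

-- Lets the subqueries (where ApPID = ... and appidcmpno = ...) be answered
-- by an index seek, returning PID and Cmpno without touching the base table.
create index idx_Pra1_appid on Pra1 (ApPID, appidcmpno, PID, Cmpno)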

Suggestions 1 and 2 alone eliminate the Worktable immediately and make the statement several times faster; the effect is very noticeable.

III. System design principles for optimizing SQL statements

Simplification alone is not enough; SQL statement optimization should be considered as early as the system design phase. The more reasonable the system design, the fewer places need optimization or rework later. If the system's logical flow is unreasonable, a task that should take one operation ends up implemented in several, which shows up in the database as excessive or overly complex SQL statements. Sound database design is therefore the first key to a high-performance system.

Many system developers care more about implementing features than about the overall performance of the database system. Quite a few developers know little about database performance optimization, and the SQL statements they write are often inefficient. Here are some principles that should be considered in the design phase and are closely tied to performance.

1) Limit the result set

Minimize the returned result set, both the number of rows and the number of columns. The larger the result, the higher the logical reads of the corresponding SQL statement, and the greater the performance impact on the server. A very bad design is to return all the data in a table:

Select * FROM TableName

Even on a small table this can cause concurrency problems; if the table has millions of rows, the consequences are disastrous. Not only can it generate extremely heavy disk IO, it is also likely to evict other cached data from the database buffer, so that data has to be read from disk again the next time it is needed. No index can optimize a statement without a WHERE clause, and while such a statement runs, many other small statements may appear to time out or slow down. As soon as a system has a few such large statements running at irregular intervals, you will almost certainly notice unstable system performance. A well-designed SQL statement must therefore have a WHERE clause or a TOP clause to limit the size of the result set. It should usually look like this:

select col1, col2, ... from table1 where colx = ... and coly = ...

A missing WHERE clause, or a WHERE clause whose returned row count cannot be predicted, is something developers often overlook. No problem shows up while the program is being tested, because the tables do not yet hold enough data to expose the performance issue. But once the program is deployed to the production environment, the tables keep growing and the problem becomes more and more prominent. A stable, well-built system must take data growth into account, predict how much data a SQL statement may return, and handle it accordingly. If you really cannot know how much data a statement will return, use TOP N to limit the result set, for example:

select top N col1, col2, ... from table1 where colx = ... and coly = ...

Here N should not be too big. I once saw a system that used N = 20000, which seems rather large. You have to ask: does my program really need to return that much data? Will the user of the program ever look at that much data?

If the statement really does return many rows, consider paging the result set. Paging is an effective means of limiting the result set: for example, use TOP N to return the first 100 rows, and issue the query for the next 100 rows only when the user clicks to the next page.
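A minimal sketch of such keyset-style paging (the Orders table, its columns, and the page size of 100 are all hypothetical):

-- Key value from the last row the user has seen; 0 returns the first page.
declare @LastOrderID int
set @LastOrderID = 0

-- Each click on "next page" reruns this with the updated @LastOrderID.
select top 100 OrderID, CustomerID, OrderDate
from Orders
where OrderID > @LastOrderID
order by OrderID

Because the WHERE clause and TOP limit the result to one page, each click costs one small indexed query instead of one huge one.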

2) Reasonable table design

The key question in table design is how to handle the table's historical data. Table data keeps growing, and you must plan for that growth, for example by estimating in advance how much the table changes per day, per week, or per month. A common practice is to schedule a job that exports table data elsewhere, keeping the database at a certain size so that performance stays consistent. Some systems design their tables by time period, for example one table per month: a January 2005 table, a February 2005 table, a March 2005 table, and so on. The advantage is that each month's table stays roughly the same size, so performance can be guaranteed; the downside is that management becomes more complex and the application must be designed to access the right month's table based on the date.

A very exciting piece of news is that SQL Server 2005 will support table partitioning. With table partitioning you can implement a sliding window over the data table: the window makes it easy to move historical data out and new data in, so the table stays roughly constant in size.
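A minimal sketch of the SQL Server 2005 syntax (the Sales table and the monthly boundary dates are hypothetical):

-- A table partitioned by month; RANGE RIGHT puts each boundary date
-- in the partition to its right.
create partition function pfMonthly (datetime)
as range right for values ('20050101', '20050201', '20050301')

create partition scheme psMonthly
as partition pfMonthly all to ([PRIMARY])

create table Sales
(
    SaleID   int      not null,
    SaleDate datetime not null,
    Amount   money    not null
) on psMonthly (SaleDate)

-- Sliding the window (outline): switch the oldest month out to an archive
-- table with an identical structure, then adjust the boundaries.
-- alter table Sales switch partition 1 to SalesArchive
-- alter partition function pfMonthly() merge range ('20050101')
-- alter partition scheme psMonthly next used [PRIMARY]
-- alter partition function pfMonthly() split range ('20050401')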

In addition, table design does not need to be fully normalized. A certain amount of column redundancy can make SQL statements more efficient, reducing the number of joins and increasing execution speed.
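A minimal sketch of such redundancy (hypothetical tables): instead of joining t_order to t_customer on every read, carry the customer name redundantly on the order row:

-- The redundant cust_name avoids a join in the hot read path; the
-- application (or a trigger) must keep it in sync if the customer is renamed.
create table t_order
(
    order_id  int         not null primary key,
    cust_id   int         not null,
    cust_name varchar(50) not null,  -- redundant copy of t_customer.cust_name
    amount    money       not null
)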

3) Separate OLAP and OLTP modules

OLAP-type and OLTP-type statements are very different. OLAP statements often have to scan entire tables for statistical analysis, and indexes are of little use to them; indexes can at best speed up aggregation operations such as SUM and GROUP BY. For this reason it is almost impossible to optimize an OLAP-type SQL statement. OLTP statements, by contrast, access only a small portion of a table's data, which can often be served from the memory cache. To keep OLAP and OLTP statements from interfering with each other, these two kinds of modules should run on different servers. Since OLAP statements almost exclusively read data and perform no updates or writes, a good practice is to configure a standby server and have the OLAP workload access only the standby.

Customers often ask me why the database system becomes slow at the end of the month, or at some fixed point during the month. What do you suppose the cause is? Month-end is when the system generates its OLAP reports, and a report means scanning and aggregating almost all the table data. The server's load is naturally heavy, so the system is of course slower than usual. I have heard of ERP systems where generating a single report takes several hours.

4) Use stored procedures

Consider using stored procedures to encapsulate complex SQL statements or business logic; this brings several benefits. First, the execution plan of a stored procedure can stay cached in memory for a long time, reducing recompilation time. Second, a stored procedure reduces the complex interaction between client and server. Third, if some change is needed after the program ships, you can modify the stored procedure directly without modifying the program, avoiding a reinstall and redeployment.
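A minimal sketch (the procedure name and parameters are assumptions, reusing the tables from the earlier example):

-- One cached plan, one round trip: the client sends a single call
-- instead of a multi-statement conversation.
create procedure usp_GetRequestsByRequestor
    @ReqID    varchar(10),
    @ReqCmpNo char(1)
as
begin
    set nocount on
    select top 100 PID, RSDT, dest, status
    from Prmer
    where reqid = @ReqID and reqidcmpno = @ReqCmpNo
    order by PID
end

-- Called from the client as, for example: exec usp_GetRequestsByRequestor '22101', 'P'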

IV. Concluding remarks

After reading this article, you should be familiar with the techniques for simplifying SQL statements and with the performance principles to consider during system design. Applying these techniques can improve the overall performance of a database system. Database system optimization is a big topic; this article merely lists some useful experience, and the rest is up to your own practice.
