How to make your SQL run faster

Source: Internet
Author: User
Tags: date, count, execution, sql, query, range, first row, advantage



A common pitfall when writing SQL is to focus only on whether the results are correct while ignoring the performance differences between alternative formulations. These differences are especially pronounced in large or complex databases, such as online transaction processing (OLTP) systems or decision support systems (DSS). In practice, I have found that badly performing SQL usually stems from three causes: improper index design, insufficient join conditions, and unoptimized WHERE clauses. After these were corrected, query speed improved significantly. I summarize each of the three aspects below.

To make the problem more concrete, the run time of every SQL statement in the examples was measured; any time under one second is shown as (< 1 second).

Test environment:
Host: HP LH II
Clock speed: 330 MHz
Memory: 128 MB
Operating system: SCO OpenServer 5.0.4
Database: Sybase 11.0.3

First, unreasonable index design
Example: The table record has 620,000 rows. The following SQL statements were run under different index configurations:
1. A nonclustered index on date

select count(*) from record
where date > '19991201' and date < '19991214' and amount > 2000
(25 seconds)

select date, sum(amount) from record group by date
(55 seconds)

select count(*) from record
where date > '19990901' and place in ('BJ', 'SH')
(27 seconds)

Analysis:
date has many duplicate values. Under a nonclustered index, the rows are scattered across the data pages in no particular physical order, so a range search must perform a full table scan to find every row in the range.

2. A clustered index on date

select count(*) from record
where date > '19991201' and date < '19991214' and amount > 2000
(14 seconds)

select date, sum(amount) from record group by date
(28 seconds)

select count(*) from record
where date > '19990901' and place in ('BJ', 'SH')
(14 seconds)

Analysis:
Under a clustered index, the rows are stored in physical order on the data pages and duplicate values are grouped together. A range search can therefore locate the starting point of the range and scan only the data pages within that range, avoiding a wide scan and speeding up the query.

3. A composite index on (place, date, amount)

select count(*) from record
where date > '19991201' and date < '19991214' and amount > 2000
(26 seconds)

select date, sum(amount) from record group by date
(27 seconds)

select count(*) from record
where date > '19990901' and place in ('BJ', 'SH')
(< 1 second)

Analysis:
This is an unreasonable composite index, because its leading column is place. The first and second queries do not reference place, so they cannot use the index. The third query does reference place, and every column it touches is contained in the composite index, so the index covers it (a covering index) and it runs fast.

4. A composite index on (date, place, amount)

select count(*) from record
where date > '19991201' and date < '19991214' and amount > 2000
(< 1 second)

select date, sum(amount) from record group by date
(11 seconds)

select count(*) from record
where date > '19990901' and place in ('BJ', 'SH')
(< 1 second)

Analysis:
This is a reasonable composite index. With date as the leading column, every query can take advantage of the index, and the first and third queries are fully covered by it, achieving optimal performance.

5. Summary:

Indexes are created as nonclustered by default, but the default is not always optimal. A sound index design must be based on analysis and prediction of the actual queries. In general:

①. For a column with many duplicate values that frequently appears in range queries (between, >, <, >=, <=) or in ORDER BY and GROUP BY clauses, consider building a clustered index;

②. When multiple columns are frequently accessed together and each of them contains duplicate values, consider building a composite index;

③. A composite index should cover the critical queries as far as possible, and its leading column must be the most frequently used column.
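The four index configurations above can be sketched in Sybase-style DDL. This is only an illustrative sketch, assuming the record table from the examples; the index names are hypothetical, and in the tests each configuration was built separately:

```sql
-- case 1: a nonclustered index on date alone (range queries still scan the table)
create nonclustered index idx_record_date on record (date)

-- case 2: a clustered index on date; rows are stored in date order,
-- so a range search scans only the pages inside the range
create clustered index ci_record_date on record (date)

-- case 3: the unreasonable composite index, led by place
create nonclustered index idx_record_pda on record (place, date, amount)

-- case 4: the reasonable composite index, led by date;
-- it also covers queries that reference only date, place, and amount
create nonclustered index idx_record_dpa on record (date, place, amount)
```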

Second, insufficient join conditions
Example: The table card has 7,896 rows with a nonclustered index on card_no; the table account has 191,122 rows with a nonclustered index on account_no. Consider the execution of these two SQL statements under different join conditions:

select sum(a.amount) from account a, card b
where a.card_no = b.card_no
(20 seconds)

Change the SQL to:

select sum(a.amount) from account a, card b
where a.card_no = b.card_no and a.account_no = b.account_no
(< 1 second)
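For readers on a dialect that supports ANSI join syntax (the comma-style join above matches the Sybase 11 era), the corrected query can equivalently be written as follows; this is a sketch against the same account and card tables:

```sql
select sum(a.amount)
from account a
join card b
    on  a.card_no = b.card_no
    and a.account_no = b.account_no
```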

Analysis:
Under the first join condition, the best plan is to use account as the outer table and card as the inner table, exploiting the index on card. The number of I/Os can be estimated as:

22,541 pages of the outer table account + (191,122 rows of the outer table account × 3 pages per lookup of the matching row via the index on card) = 595,907 I/Os

Under the second join condition, the best plan is to use card as the outer table and account as the inner table, exploiting the index on account. The number of I/Os can be estimated as:

1,944 pages of the outer table card + (7,896 rows of the outer table card × 4 pages per lookup of the matching row via the index on account) = 33,528 I/Os

As you can see, only when the join conditions are sufficient will the truly best plan be executed.

Summarize:

1. Before a multi-table operation is actually executed, the query optimizer lists several possible join plans based on the join conditions and picks the one with the lowest system cost. The join conditions should involve the tables that have indexes and the tables with many rows; the choice of inner and outer table can be estimated by the formula: number of matching rows in the outer table × number of page lookups per row in the inner table — the plan with the smallest product is best.

2. To view the chosen plan, use set showplan on to open the SHOWPLAN option; it shows the join order and which indexes are used. For more detailed information, the sa role is required to run dbcc traceon(3604, 310, 302).
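A minimal inspection session might look like this; it is a sketch, assuming a Sybase ASE isql session, where trace flag 3604 redirects dbcc output to the client and 302/310 print index-selection and join-order diagnostics:

```sql
set showplan on
go

-- optional: detailed optimizer diagnostics (requires the sa role)
dbcc traceon(3604, 302, 310)
go

select sum(a.amount)
from account a, card b
where a.card_no = b.card_no and a.account_no = b.account_no
go

set showplan off
go
```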

Third, unoptimized WHERE clauses
1. Example: The columns in the following SQL statements all have appropriate indexes, but execution is still very slow:

select * from record
where substring(card_no, 1, 4) = '5378'
(13 seconds)

select * from record
where amount / 30 < 1000
(11 seconds)

select * from record
where convert(char, date, 112) = '19991201'
(10 seconds)

Analysis:
Any operation on a column in the WHERE clause must be evaluated row by row at run time, which forces a table scan instead of using the index on that column. If the expression can instead be evaluated when the query is compiled, the optimizer can use the index and avoid the table scan. So rewrite the SQL as follows:

select * from record where card_no like '5378%'
(< 1 second)

select * from record where amount < 1000 * 30
(< 1 second)

select * from record where date = '1999/12/01'
(< 1 second)

You will find that the SQL is obviously much faster!

2. Example: The table stuff has 200,000 rows and a nonclustered index on id_no. Consider the following SQL:

select count(*) from stuff where id_no in ('0', '1')
(23 seconds)

Analysis:
The "in" in the WHERE clause is logically equivalent to "or", so the parser converts in ('0', '1') into id_no = '0' or id_no = '1' before executing. We would expect it to search for each or clause separately and add up the results, so that the index on id_no could be used, but in fact (according to showplan) it adopts an "OR strategy": the rows satisfying each or clause are first fetched into a worktable in tempdb, a unique index is built on it to remove duplicate rows, and the result is then computed from this temporary table. As a result, the actual plan does not use the index on id_no at all, and the completion time is also affected by the performance of tempdb.

Practice shows that the more rows the table has, the worse the worktable performs: when stuff had 620,000 rows, execution time reached a full 220 seconds! You might as well split the or clauses apart:

select count(*) from stuff where id_no = '0'
select count(*) from stuff where id_no = '1'

Get the two results and add them up. Because each statement uses the index, execution time is only 3 seconds; even with 620,000 rows it is only 4 seconds. Or, better still, write a simple stored procedure:
create proc Count_stuff as
begin
    declare @a int
    declare @b int
    declare @c int
    declare @d char(10)
    select @a = count(*) from stuff where id_no = '0'
    select @b = count(*) from stuff where id_no = '1'
    select @c = @a + @b
    select @d = convert(char(10), @c)
    print @d
end

The result is computed directly, and execution is just as fast as above!
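On dialects that support derived tables in the FROM clause (later Sybase and SQL Server releases; the version tested in this article may not), the two counts can also be combined into a single statement with union all. This is a sketch against the same stuff table, and each branch can still use the index on id_no:

```sql
select sum(cnt)
from (
    select count(*) as cnt from stuff where id_no = '0'
    union all
    select count(*) as cnt from stuff where id_no = '1'
) t
```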
Summarize:

Thus, "optimizing" the WHERE clause means making it take advantage of indexes; a WHERE clause is "not optimized" when it causes a table scan or extra overhead.

1. Any operation on a column results in a table scan, including database functions and calculation expressions; move the operation to the right-hand side of the comparison whenever possible.

2. in and or clauses often use worktables, which invalidates the index; if they do not produce large numbers of duplicate values, consider splitting them into separate statements, and each split statement should be able to use an index.

3. Be adept at using stored procedures; they make SQL more flexible and efficient.

The above examples show that the essence of SQL optimization is, on the premise of producing correct results, to use statements that the optimizer can recognize, make full use of indexes, reduce the I/O of table scans, and avoid table searches as far as possible. In fact, SQL performance tuning is a complex process; the above is only one manifestation at the application level. Deeper study also involves resource configuration at the database layer, flow control at the network layer, and the overall design of the operating-system layer.



