Poor-performing SQL statements often stem from unreasonable index design, incomplete join conditions, and where clauses that the optimizer cannot use. Once they are properly optimized, their running speed improves significantly. I will summarize these three aspects below.
To illustrate the problem more intuitively, the running time of every SQL statement in the examples was measured; statements that finish in under 1 second are shown as (<1 second).
Test environment:
HOST: HP LH II
Clock speed: 330 MHz
Memory: 128 MB
Operating system: OpenServer 5.0.4
Database: Sybase 11.0.3
I. Unreasonable index design
For example, the table record has 620000 rows. With different index designs, the same SQL statements run as follows:
1. A non-clustered index on date
select count(*) from record where date > '19991201' and date < '19991214' and amount > 2000 (25 seconds)
select date, sum(amount) from record group by date (55 seconds)
select count(*) from record where date > '19990901' and place in ('BJ', 'SH') (27 seconds)
Analysis:
date has a large number of duplicate values. Under a non-clustered index, the data is physically stored on the data pages at random, so a range search must perform a table scan to find all the rows within the range.
2. A clustered index on date
select count(*) from record where date > '19991201' and date < '19991214' and amount > 2000 (14 seconds)
select date, sum(amount) from record group by date (28 seconds)
select count(*) from record where date > '19990901' and place in ('BJ', 'SH') (14 seconds)
Analysis:
Under a clustered index, the data is physically stored on the data pages in key order, and duplicate values are stored together. For a range search, the engine can first locate the start and end points of the range and then scan only the data pages within it, which avoids a large-scale scan and speeds up the query.
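The reasoning above can be sketched in a few lines of Python. This is a hypothetical illustration, not Sybase internals: sorted storage stands in for a clustered index, so a range query can binary-search the two endpoints and touch only the rows in between, while randomly ordered storage forces an examination of every row.

```python
from bisect import bisect_left, bisect_right

unsorted_dates = [3, 9, 1, 7, 5, 9, 3, 7, 1, 5]   # non-clustered: physical order is random
sorted_dates = sorted(unsorted_dates)             # clustered: physically stored in key order

def range_scan_unsorted(rows, lo, hi):
    """Must examine every row to find the range: a full table scan."""
    touched = len(rows)
    result = [r for r in rows if lo < r < hi]
    return result, touched

def range_scan_sorted(rows, lo, hi):
    """Binary-search both endpoints, then read only the matching slice."""
    start = bisect_right(rows, lo)
    end = bisect_left(rows, hi)
    return rows[start:end], end - start   # rows touched == rows in the range
```

Both functions return the same rows for a given range; only the number of rows touched differs, which is exactly why the clustered-index timings above are roughly twice as fast.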
3. A composite index on (place, date, amount)
select count(*) from record where date > '19991201' and date < '19991214' and amount > 2000 (26 seconds)
select date, sum(amount) from record group by date (27 seconds)
select count(*) from record where date > '19990901' and place in ('BJ', 'SH') (<1 second)
Analysis:
This composite index is unreasonable because its leading column is place. The first and second SQL statements do not reference place, so they cannot use the index. The third SQL does reference place, and all of its referenced columns are contained in the composite index, so the index covers the query, which is why it is very fast.
4. A composite index on (date, place, amount)
select count(*) from record where date > '19991201' and date < '19991214' and amount > 2000 (<1 second)
select date, sum(amount) from record group by date (11 seconds)
select count(*) from record where date > '19990901' and place in ('BJ', 'SH') (<1 second)
Analysis:
This is a reasonable composite index: with date as the leading column, every SQL statement can use the index, and the first and third statements are also covered by it, so performance is optimal.
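Why the leading column matters can be sketched with sorted tuples (a hypothetical model of an index, not Sybase internals): composite index entries are sorted by their columns left to right, so only a predicate on the leading column maps to one contiguous run of entries that a single binary search can locate.

```python
from bisect import bisect_left

rows = [("bj", 1), ("sh", 2), ("bj", 3), ("sh", 1), ("bj", 2), ("sh", 3)]  # (place, date)

idx_place_date = sorted((p, d) for p, d in rows)   # leading column: place
idx_date_place = sorted((d, p) for p, d in rows)   # leading column: date

def seek_date_range(index, lo, hi):
    """Seek a date range on an index whose LEADING column is date."""
    start = bisect_left(index, (lo,))
    end = bisect_left(index, (hi,))
    return index[start:end]            # one contiguous slice of the index

def scan_for_date(index, d):
    """With place leading, entries for a given date are scattered through
    the index, so a date-only predicate must examine every entry."""
    return [i for i, entry in enumerate(index) if entry[1] == d]
```

With date leading, `seek_date_range(idx_date_place, 2, 3)` returns one contiguous slice; with place leading, `scan_for_date(idx_place_date, 2)` finds the matching entries at non-adjacent positions, so the whole index must be read.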
5. Conclusion:
The index created by default is a non-clustered index, but it is not always the best choice. A reasonable index design must be based on analysis and prediction of the actual queries. In general:
① For columns with many duplicate values that are frequently searched by range (between, >, <, >=, <=) or used in order by or group by, consider creating a clustered index;
② When multiple columns are frequently accessed together and each column contains duplicate values, consider creating a composite index;
③ A composite index should cover the key queries as far as possible, and its leading column must be the most frequently used column.
II. Incomplete join conditions
For example, the table card has 7896 rows with a non-clustered index on card_no, and the table account has 191122 rows with a non-clustered index on account_no. Compare the execution of two SQL statements with different join conditions:
select sum(a.amount) from account a, card b where a.card_no = b.card_no (20 seconds)
Rewritten SQL:
select sum(a.amount) from account a, card b where a.card_no = b.card_no and a.account_no = b.account_no (<1 second)
Analysis:
Under the first join condition, the best query plan uses account as the outer table and card as the inner table, exploiting the index on card. The number of I/Os can be estimated by:
22541 pages of the outer table account + (191122 rows of the outer table account × 3 pages searched in the inner table card per outer row) = 595907 I/Os
Under the second join condition, the best query plan uses card as the outer table and account as the inner table, exploiting the index on account. The number of I/Os can be estimated by:
1944 pages of the outer table card + (7896 rows of the outer table card × 4 pages searched in the inner table account per outer row) = 33528 I/Os
As you can see, only with a complete set of join conditions will the truly best plan be executed.
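The cost estimate quoted above can be written as a small function, assuming the quoted page and row counts: the I/O cost of an index nested-loop join is roughly the pages of the outer table plus (rows of the outer table × index pages probed in the inner table per outer row), and the optimizer picks the join order with the smaller total.

```python
def nested_loop_io(outer_pages: int, outer_rows: int, pages_per_probe: int) -> int:
    """Estimated I/Os for an index nested-loop join: read the whole outer
    table, then probe the inner table's index once per outer row."""
    return outer_pages + outer_rows * pages_per_probe

# First join condition: account is outer (22541 pages, 191122 rows),
# 3 card index pages probed per outer row.
cost_account_outer = nested_loop_io(22541, 191122, 3)

# Second join condition: card is outer (1944 pages, 7896 rows),
# 4 account index pages probed per outer row.
cost_card_outer = nested_loop_io(1944, 7896, 4)
```

The totals work out to 595907 versus 33528 I/Os, which matches the 20-second versus sub-second timings.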
Summary:
1. Before a multi-table query is executed, the query optimizer lists the possible join orders based on the join conditions and chooses the one with the least system overhead. The join conditions should take full account of which tables have indexes and which tables have many rows; the choice of inner and outer tables can be judged by the formula: rows of the outer table × pages searched per lookup in the inner table, the smallest product being the best plan.
2. To view the chosen plan, use set showplan on to switch on the showplan option, which shows the join order and the indexes used.
For more detail, execute dbcc(3604,310,302) with the sa role.
III. Where clauses that cannot be optimized
1. For example, the columns in the following SQL statements all have appropriate indexes, yet the statements run very slowly:
select * from record where substring(card_no, 1, 4) = '5378' (13 seconds)
select * from record where amount/30 < 1000 (11 seconds)
select * from record where convert(char(10), date, 112) = '19991201' (10 seconds)
Analysis:
Any expression applied to a column in the where clause has to be evaluated row by row at execution time, which forces a table scan and prevents the index on that column from being used. If the expression's result can instead be obtained at query-compile time, the SQL optimizer can use the index and avoid the table search. The statements are therefore rewritten as follows:
select * from record where card_no like '5378%' (<1 second)
select * from record where amount < 1000*30 (<1 second)
select * from record where date = '1999/12/01' (<1 second)
You will find that the SQL statements immediately run much faster.
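This effect is easy to reproduce with SQLite as a stand-in (the article measured Sybase, but the principle is the same in most engines): an expression wrapped around the indexed column forces a full table scan, while the rewritten predicate, with the bare column on the left, lets the optimizer seek the index.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE record (card_no TEXT, amount INTEGER, date TEXT)")
con.execute("CREATE INDEX idx_amount ON record(amount)")

def plan(sql: str) -> str:
    """Concatenate the EXPLAIN QUERY PLAN detail strings for a statement."""
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

slow = plan("SELECT * FROM record WHERE amount/30 < 1000")   # expression on the column
fast = plan("SELECT * FROM record WHERE amount < 1000*30")   # constant moved to the right
```

Printing the two plans shows a full scan of record for the first predicate versus a search using idx_amount for the second (the exact wording varies slightly across SQLite versions).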
2. For example, the table stuff has 200000 rows and there is a non-clustered index on id_no. See the following SQL statement:
select count(*) from stuff where id_no in ('0', '1') (23 seconds)
Analysis:
The in in the where clause is logically equivalent to or, so the parser converts in ('0', '1') into id_no = '0' or id_no = '1' before execution. We would expect it to evaluate each or clause separately and add up the results, so that the index on id_no could be used. In fact (as showplan shows), it adopts the "OR strategy": it first extracts the rows satisfying each or clause into a worktable in tempdb, builds a unique index on that worktable to remove duplicate rows, and finally computes the result from the temporary table. So the index on id_no is never actually used, and the completion time also depends on the performance of tempdb.
Practice shows that the more rows the table has, the worse the worktable performs: when stuff reaches 620000 rows, the execution time climbs to 220 seconds! It is better to split the or clauses:
select count(*) from stuff where id_no = '0'
select count(*) from stuff where id_no = '1'
Run the two statements and add the results. Because each statement uses the index, the execution time drops to only 3 seconds, and only 4 seconds even with 620000 rows. Or, better still, write a simple stored procedure:
create proc count_stuff
as
begin
    declare @a int
    declare @b int
    declare @c int
    declare @d char(10)
    select @a = count(*) from stuff where id_no = '0'
    select @b = count(*) from stuff where id_no = '1'
    select @c = @a + @b
    select @d = convert(char(10), @c)
    print @d
end
The procedure computes the result directly, and its execution time is as fast as the split queries above.
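The split rewrite can be sanity-checked on a tiny hypothetical stuff table (SQLite stand-in, made-up data): since a row matches at most one of id_no = '0' and id_no = '1', the two separate counts always sum to the count returned by the original in query.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stuff (id_no TEXT)")
# Hypothetical sample rows; '2' matches neither branch of the IN list.
con.executemany("INSERT INTO stuff VALUES (?)", [("0",), ("1",), ("1",), ("2",), ("0",)])

def count(sql: str) -> int:
    """Run a single-value count query and return the integer result."""
    return con.execute(sql).fetchone()[0]

count_in = count("SELECT count(*) FROM stuff WHERE id_no IN ('0','1')")
count_0  = count("SELECT count(*) FROM stuff WHERE id_no = '0'")
count_1  = count("SELECT count(*) FROM stuff WHERE id_no = '1'")
```

Here `count_0 + count_1 == count_in`, so the rewrite is safe whenever the in-list values are distinct equality matches on one column.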
Summary:
As these examples show, "optimized" means the where clause uses an index, while "not optimizable" means a table scan or extra overhead occurs.
1. Any operation on a column causes a table scan, including database functions and calculation expressions; when querying, move such operations to the right-hand side of the comparison whenever possible.
2. in and or clauses often resort to worktables, which defeats the index; if they do not involve a large number of duplicate values, consider splitting the clause, and the split clauses should use indexed columns.
3. Be good at using stored procedures, which make SQL more flexible and efficient.
Finally, some general rules of thumb for index design:
- Keep index keys narrow: the less data in the index, the more likely the optimizer is to choose it;
- The clustered index key is stored in every non-clustered index of the table, so the clustered index key should be as short as possible;
- Build highly selective non-clustered indexes;
- For frequently requested columns, a clustered index is not always appropriate; consider a non-clustered index instead, with values as unique as possible;
- Minimize hotspot data: if part of the table is read and written far more often than the rest, that is hotspot data, and you should find a way to spread it out;
- Monitor disk traffic: if utilization is too high, reconsider the indexed columns and distribute the data across multiple disks to reduce I/O;
- A table that carries indexes should have a clustered index; columns with a limited number of distinct values, columns queried for ranges of values, and columns whose queries return large result sets are good candidates for it;
- Analyze the where clauses of frequently used SQL statements to find the columns most often filtered on, and consider indexing those columns according to the common query types;
- If the primary key involves multiple columns, put the columns whose values vary the most first; if the columns vary to a similar degree, put the most frequently accessed columns first;
- Columns with many duplicate values that are frequently searched by range (between, >, <, >=, <=) or used in order by or group by are candidates for a clustered index;
- When a query accesses several columns at once and each contains duplicate values, consider a composite index that covers the key queries; its leading column must be the most frequently used column;
- Short index keys are efficient: each index page holds more index rows and the index has fewer levels, so more of the index fits in cache, which also reduces I/O;
- Too many indexes on a table hurt UPDATE, INSERT, and DELETE performance, because every index must be adjusted on each change; page splits are also logged, further increasing I/O;
- In general, do not create a clustered index on a frequently updated column: each update may move the whole row and seriously hurt performance;
- Tables with little data or that are rarely queried generally do not need indexes;
- Columns used with order by or group by are usually good clustered-index candidates: if the column in an order by has a clustered index, no temporary table is generated because the rows are already sorted, whereas a group by always generates a temporary table;
- When rows are inserted into a table in large volume, avoid placing the clustered index on a naturally growing column (such as an identity column): if you do, every inserted row goes to the last data page of the table, and INSERT performance drops sharply.
From the examples above, we can see that the essence of SQL optimization is to write statements the optimizer can recognize, to make full use of indexes, to reduce the number of I/Os spent scanning tables, and to avoid table searches as far as possible. SQL performance optimization is in fact a complex process; the above touches only the application layer. Deeper work also involves resource configuration at the database layer, traffic control at the network layer, and the overall design at the operating-system layer.