This article is from http://bbs3.chinaunix.net/viewthread.php?tid=83481 (author: honorary moderator genie).

1. Rational Use of Indexes
An index is an important data structure in a database. Its fundamental goal is to improve query efficiency. Most current database products adopt the ISAM index structure first proposed by IBM. Indexes should be used properly, following these principles:
● Create indexes on columns that are frequently joined but are not specified as foreign keys; for fields that are seldom joined, the optimizer generates indexes automatically.
● Index columns that are frequently sorted or grouped (that is, used in group by or order by operations).
● Create indexes on columns with many distinct values that are frequently used in conditional expressions, and do not index columns with few distinct values. For example, the "gender" column of an employee table has only two values, "male" and "female", so there is no need to index it; an index there would not improve query efficiency, but it would greatly reduce update speed.
● If there are multiple columns to be sorted, you can create a compound index on them.
● Use system tools. For example, the Informix database provides a tbcheck tool that can check suspicious indexes. On some database servers, frequent operations can leave an index invalid or inefficient; if an index-based query slows down, use tbcheck to check index integrity and repair it if necessary. In addition, after a table receives a large volume of updates, dropping and rebuilding its indexes can restore query speed.
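The principles above can be tried out concretely. The following is a minimal SQLite sketch (a stand-in for the Informix examples in the text; the table, column, and index names are invented): it indexes the selective emp_id column while leaving the two-valued gender column unindexed, then uses REINDEX and PRAGMA integrity_check as rough analogues of rebuilding an index and running tbcheck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (emp_id INTEGER, name TEXT, gender TEXT)")
cur.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                [(i, f"emp{i}", "M" if i % 2 else "F") for i in range(1000)])

# Index the high-selectivity column; the two-valued "gender" column stays unindexed.
cur.execute("CREATE INDEX idx_emp_id ON employee(emp_id)")

# An equality search on the indexed column uses the index.
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM employee WHERE emp_id = 500"
).fetchall()
print(plan)  # plan detail mentions idx_emp_id

# Periodic maintenance: rebuild the index and verify integrity
# (SQLite's rough analogues of drop-and-recreate and tbcheck).
cur.execute("REINDEX idx_emp_id")
print(cur.execute("PRAGMA integrity_check").fetchone())
```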
2. Avoid or simplify sorting
Repeated sorting of large tables should be simplified or avoided. When an index can deliver output in the desired order, the optimizer skips the sorting step. Sorting cannot be avoided when:
● The index does not include one or more of the columns to be sorted;
● The order of columns in the group by or order by clause differs from that of the index;
● The sorted columns come from different tables.
To avoid unnecessary sorting, indexes must be built correctly and database tables merged reasonably (although this may sometimes affect table normalization, it is worthwhile for the efficiency gain). If sorting is unavoidable, try to simplify it, for example by narrowing the range of sorted columns.
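The sort-avoidance effect of a matching index can be observed directly in a query plan. A small SQLite sketch (sample orders table invented here): without an index, ORDER BY needs an explicit sort step; with an index on the sort column, the rows come out pre-sorted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (order_num INTEGER, customer_num INTEGER)")
cur.executemany("INSERT INTO orders VALUES (?, ?)",
                [(i, i % 50) for i in range(2000)])

def plan(sql):
    # Concatenate the EXPLAIN QUERY PLAN detail strings.
    return " ".join(row[-1] for row in cur.execute("EXPLAIN QUERY PLAN " + sql))

# Without an index, the ORDER BY requires a separate sort (temp B-tree).
before = plan("SELECT * FROM orders ORDER BY customer_num")
print(before)

# With an index on the sort column, the index scan delivers sorted output.
cur.execute("CREATE INDEX idx_cust ON orders(customer_num)")
after = plan("SELECT * FROM orders ORDER BY customer_num")
print(after)
```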
3. Eliminate sequential access to large table rows
In nested queries, sequential access to a table can be fatal to query efficiency. For example, with a sequential-access strategy, a query nested three levels deep that scans 1,000 rows at each level scans one billion rows in total. The primary way to avoid this is to index the joined columns. For example, given a student table (student ID, name, age, ...) and a course-selection table (student ID, course number, grade, ...), joining the two tables calls for an index on the join field "student ID".
Unions can also be used to avoid sequential access. Although indexes may exist on all relevant columns, some forms of where clause force the optimizer into sequential access. The following query forces a sequential scan of the orders table:
SELECT * FROM orders WHERE (customer_num = 104 AND order_num > 1001) OR order_num = 1008
Although indexes exist on customer_num and order_num, the optimizer still scans the entire table sequentially for the statement above. Because the statement retrieves a union of separate sets of rows, it should be rewritten as:
SELECT * FROM orders WHERE customer_num = 104 AND order_num > 1001
UNION
SELECT * FROM orders WHERE order_num = 1008
In this way, you can use the index path to process queries.
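The OR-to-UNION rewrite can be checked for correctness on a toy table. A minimal SQLite sketch (the sample data is invented) confirming that both forms return the same row set, since UNION removes duplicates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (order_num INTEGER PRIMARY KEY, customer_num INTEGER)")
cur.executemany("INSERT INTO orders VALUES (?, ?)",
                [(n, 100 + n % 10) for n in range(1000, 1020)])
cur.execute("CREATE INDEX idx_cust ON orders(customer_num)")

# Original form with OR.
or_rows = cur.execute(
    "SELECT * FROM orders WHERE (customer_num = 104 AND order_num > 1001) "
    "OR order_num = 1008").fetchall()

# Rewritten form with UNION: each branch can use an index path.
union_rows = cur.execute(
    "SELECT * FROM orders WHERE customer_num = 104 AND order_num > 1001 "
    "UNION "
    "SELECT * FROM orders WHERE order_num = 1008").fetchall()

# UNION eliminates duplicates, so the two result sets are identical.
print(sorted(or_rows) == sorted(union_rows))
```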
4. Avoid related subqueries
When a column appears in both the main query and the where clause of a subquery, every change to that column's value in the main query is likely to force the subquery to be re-executed. The more levels of nesting, the lower the efficiency, so subqueries should be avoided where possible. If a subquery is unavoidable, filter out as many rows as possible inside it.
5. Avoid difficult regular expressions
The MATCHES and LIKE keywords support wildcard matching, technically called regular expressions, but this kind of matching is especially time-consuming. Example: SELECT * FROM customer WHERE zipcode LIKE "98___"
Even if an index exists on the zipcode field, this query is executed with a sequential scan. If the statement is changed to SELECT * FROM customer WHERE zipcode > "98000", the query uses the index, which obviously speeds it up.
In addition, avoid non-leading substrings. For example, the statement SELECT * FROM customer WHERE zipcode[2,3] > "80" uses a non-leading substring in the where clause, so it cannot use the index.
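Both effects can be demonstrated in SQLite (as a stand-in for the Informix syntax above; the zipcode data is invented): a range on the bare column becomes an index seek, while a substring expression over the column forces a full scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customer (zipcode TEXT, name TEXT)")
cur.executemany("INSERT INTO customer VALUES (?, ?)",
                [(f"{z:05d}", "x") for z in range(90000, 99999, 7)])
cur.execute("CREATE INDEX idx_zip ON customer(zipcode)")

def plan(sql):
    return " ".join(r[-1] for r in cur.execute("EXPLAIN QUERY PLAN " + sql))

# Range predicate on the bare column: index seek.
p1 = plan("SELECT * FROM customer WHERE zipcode > '98000'")
print(p1)

# Non-leading substring: the index cannot be used, full table scan.
p2 = plan("SELECT * FROM customer WHERE substr(zipcode, 2, 2) > '80'")
print(p2)
```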
6. Use temporary tables to accelerate queries
Sorting a subset of a table into a temporary table sometimes accelerates queries: it avoids repeated sorting operations and also simplifies the optimizer's work in other ways. For example:
SELECT cust.name, rcvbles.balance, ... other columns
FROM cust, rcvbles
WHERE cust.customer_id = rcvbles.customer_id
AND rcvbles.balance > 0
AND cust.postcode > "98000"
ORDER BY cust.name
If this query is to be executed more than once, all the customers with outstanding balances can be found in a temporary file, sorted by customer name:
SELECT cust.name, rcvbles.balance, ... other columns
FROM cust, rcvbles
WHERE cust.customer_id = rcvbles.customer_id
AND rcvbles.balance > 0
ORDER BY cust.name
INTO TEMP cust_with_balance
Then, query the temporary table in the following way:
SELECT * FROM cust_with_balance
WHERE postcode > "98000"
The temporary table has fewer rows than the main table, and its physical order is the required order, so disk I/O is reduced and the query workload drops sharply.
Note: changes made to the main table after a temporary table is created are not reflected in it. When the data in the main table is modified frequently, be careful not to lose updates.
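The flow above can be sketched in SQLite, where CREATE TEMP TABLE ... AS plays the role of Informix's INTO TEMP (the cust/rcvbles sample data here is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE cust (customer_id INTEGER, name TEXT, postcode TEXT)")
cur.execute("CREATE TABLE rcvbles (customer_id INTEGER, balance REAL)")
cur.executemany("INSERT INTO cust VALUES (?, ?, ?)",
                [(i, f"cust{i:03d}", str(97000 + i * 13)) for i in range(100)])
cur.executemany("INSERT INTO rcvbles VALUES (?, ?)",
                [(i, (i % 7) * 50.0) for i in range(100)])

# Materialize all customers with outstanding balances once, pre-sorted by name.
cur.execute("""
    CREATE TEMP TABLE cust_with_balance AS
    SELECT cust.name, rcvbles.balance, cust.postcode
    FROM cust JOIN rcvbles ON cust.customer_id = rcvbles.customer_id
    WHERE rcvbles.balance > 0
    ORDER BY cust.name
""")

# Later queries hit the small pre-filtered table instead of re-joining.
subset = cur.execute(
    "SELECT * FROM cust_with_balance WHERE postcode > '98000'").fetchall()
print(len(subset))
```

As the note above warns, rows inserted into cust or rcvbles after this point will not appear in cust_with_balance.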
7. Use sorting to replace non-sequential access
Non-sequential disk access is the slowest operation, showing up as back-and-forth movement of the disk access arm. SQL statements hide this, making it easy to write queries that access a large number of non-sequential pages.
In some cases, the database's sorting capability can be used to replace non-sequential access and so improve queries.
The following example from a manufacturing company illustrates query optimization. The company's database contains three tables, with the following schema:
1. part table
Part number (part_num)   Part description (part_desc)   Other columns
102,032                  Seagate 30G disk               ...
500,049                  Novell 10M network card        ...
...
2. vendor table
Vendor number (vendor_num)   Vendor name (vendor_name)   Other columns
910,257                      Seagate Corp                ...
523,045                      IBM Corp                    ...
...
3. parven table
Part number (part_num)   Vendor number (vendor_num)   Part quantity (part_amount)
102,032                  910,257                      3,450,000
234,423                  321,001                      4,000,000
...
The following query runs regularly on these tables to produce a report on the quantities of all parts:
SELECT part_desc, vendor_name, part_amount
FROM part, vendor, parven
WHERE part.part_num = parven.part_num
AND parven.vendor_num = vendor.vendor_num
ORDER BY part.part_num
Without indexes, the overhead of this query would be enormous, so indexes are created on the part number and the vendor number; this avoids repeated scans during nesting. The statistics on the tables and indexes are as follows:
Table     Row size   Row count   Rows/page   Data pages
part      150        10,000      25          400
vendor    150        1,000       25          40
parven    13         15,000      300         50
Index     Key size   Keys/page   Leaf pages
part      4          500         20
vendor    4          500         2
parven    8          250         60
This looks like a relatively simple three-table join, but its query overhead is very high. The system tables show clustered indexes on part_num and vendor_num, so those indexes are stored in physical order, while the parven table has no particular storage order. The sizes of these tables mean that the hit rate for non-sequential access from buffered pages is very low. The best plan for this statement is: read the 400 pages of part sequentially, then access parven non-sequentially 10,000 times at 2 pages each (one index page plus one data page), for 20,000 disk pages, and then access vendor non-sequentially 15,000 times, for another 30,000 disk pages. The total disk cost of this indexed plan is thus 50,400 page accesses.
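The 50,400-page figure follows directly from the statistics above, as this small arithmetic check shows:

```python
# Page arithmetic for the single-pass indexed plan described above.
part_pages = 400          # sequential read of the whole part table
part_rows = 10_000        # one non-sequential parven probe per part row
parven_rows = 15_000      # one non-sequential vendor probe per parven row
pages_per_probe = 2       # one index page + one data page per probe

total_pages = part_pages + part_rows * pages_per_probe + parven_rows * pages_per_probe
print(total_pages)  # 50400
```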
In fact, we can improve query efficiency by using temporary tables in three steps:
1. Read the parven table in vendor_num order:
SELECT part_num, vendor_num, part_amount
FROM parven
ORDER BY vendor_num
INTO TEMP pv_by_vn
This statement reads parven sequentially (50 pages), writes to a temporary table (50 pages), and sorts it. Assuming a sorting cost of 200 pages, the total is 300 pages.
2. Join the temporary table with the vendor table, outputting the result to another temporary table sorted by part_num:
SELECT pv_by_vn.*, vendor.vendor_name
FROM pv_by_vn, vendor
WHERE pv_by_vn.vendor_num = vendor.vendor_num
ORDER BY pv_by_vn.part_num
INTO TEMP pvvn_by_pn
DROP TABLE pv_by_vn
This query reads pv_by_vn (50 pages) and accesses the vendor table 15,000 times through its index; but because the rows arrive in vendor_num order, the vendor table is in effect read in index order (40 + 2 = 42 pages). The output table holds about 95 rows per page, 160 pages in total; writing and re-accessing these pages triggers 5 × 160 = 800 reads and writes. Altogether this step reads and writes 892 pages.
3. Join the output with the part table to get the final result:
SELECT pvvn_by_pn.*, part.part_desc
FROM pvvn_by_pn, part
WHERE pvvn_by_pn.part_num = part.part_num
DROP TABLE pvvn_by_pn
This query reads pvvn_by_pn sequentially (160 pages) and reads the part table 15,000 times through its index; thanks to the index, this step actually performs only 1,772 disk reads and writes. The three steps total roughly 2,964 page accesses, against 50,400 for the original plan, an improvement of more than an order of magnitude. I ran the same experiment on Informix Dynamic Server and found a substantial reduction in elapsed time as well (and the larger the data volume, the larger the ratio is likely to be).
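Summing the three steps and comparing against the single-pass plan makes the gain explicit:

```python
# Page arithmetic for the three-step temporary-table plan described above.
step1 = 50 + 50 + 200   # scan parven + write temp + assumed sort cost
step2 = 50 + 42 + 800   # scan temp + ordered vendor reads + output-table I/O
step3 = 1772            # final join with part, as measured
optimized = step1 + step2 + step3

unoptimized = 400 + 10_000 * 2 + 15_000 * 2   # the single-pass plan from above
print(optimized, unoptimized, unoptimized // optimized)  # 2964 50400 17
```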
Summary
Twenty percent of the code takes eighty percent of the time: this well-known law of programming holds for database applications too. Our optimization focus is the SQL execution efficiency of database applications. The key to query optimization is making the database server read less data from disk, and read pages sequentially rather than non-sequentially.
Database Query Optimization Technology
When using SQL, people tend to fall into a trap: they focus on whether the results are correct while ignoring the large performance differences between different ways of writing the same query, especially in large or complex database environments such as online transaction processing (OLTP) or decision support systems (DSS). In my experience, poor SQL statements usually stem from inappropriate index design, incomplete join conditions, and unoptimizable where clauses. After proper optimization, their execution speed improves markedly. I discuss each of these three aspects below.
---- To make the discussion concrete, the running time of every example SQL statement was measured; any statement taking less than 1 second is shown as (<1 second).
---- Test environment:
---- Host: HP LH II
---- Clock speed: 330 MHz
---- Memory: 128 MB
---- Operating system: OpenServer 5.0.4
---- Database: Sybase 11.0.3
1. Unreasonable index design
---- Take a table record with 620,000 rows, and consider the following SQL statements under different index schemes:
---- 1. A non-clustered index is created on date
select count(*) from record where date > '19991201'
and date < '19991214' and amount > 2000 (25 seconds)
select date, sum(amount) from record group by date
(55 seconds)
select count(*) from record where date > '19990901'
and place in ('BJ', 'SH') (27 seconds)
---- Analysis:
---- date has a large number of duplicate values. Under a non-clustered index, the data is physically stored randomly across the data pages, so a range search requires a table scan to find all rows in the range.
---- 2. A clustered index is created on date
select count(*) from record where date > '19991201'
and date < '19991214' and amount > 2000 (14 seconds)
select date, sum(amount) from record group by date
(28 seconds)
select count(*) from record where date > '19990901'
and place in ('BJ', 'SH') (14 seconds)
---- Analysis:
---- Under a clustered index, data is physically stored on the data pages in order, and duplicate values are stored together. A range search can therefore find the start and end points and scan only the data pages in that range, avoiding a large-scale scan and speeding up the query.
---- 3. A composite index on place, date, and amount
select count(*) from record where date > '19991201'
and date < '19991214' and amount > 2000 (26 seconds)
select date, sum(amount) from record group by date
(27 seconds)
select count(*) from record where date > '19990901'
and place in ('BJ', 'SH') (<1 second)
---- Analysis:
---- This is an unreasonable composite index. Its leading column is place, and the first and second SQL statements do not reference place, so they cannot use the index. The third statement does use place, and all the referenced columns are included in the composite index, forming index coverage, so it is very fast.
---- 4. A composite index on date, place, and amount
select count(*) from record where date > '19991201'
and date < '19991214' and amount > 2000 (<1 second)
select date, sum(amount) from record group by date
(11 seconds)
select count(*) from record where date > '19990901'
and place in ('BJ', 'SH') (<1 second)
---- Analysis:
---- This is a reasonable composite index. With date as the leading column, every SQL statement can use the index, and the first and third statements also achieve index coverage, so performance is optimal.
---- 5. Summary:
---- The index created by default is non-clustered, but that is not always best. A sound index design must be based on analysis and prediction of the actual queries. In general:
---- ① Columns with many duplicate values that undergo frequent range queries (between, >, <, >=, <=) or appear in order by / group by are candidates for a clustered index;
---- ② When multiple columns are frequently accessed together, and each contains duplicate values, consider a composite index;
---- ③ A composite index should try to cover the key queries, and its leading column must be the most frequently used column.
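The leading-column rule can be reproduced in SQLite (a sketch only; the record data here is invented, and SQLite stands in for Sybase): with place leading, a date-only range scans the whole index, while with date leading the same predicate becomes an index seek with coverage.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE record (date TEXT, place TEXT, amount INTEGER)")
cur.executemany("INSERT INTO record VALUES (?, ?, ?)",
                [(f"1999{m:02d}{d:02d}", p, a)
                 for m in range(1, 13) for d in range(1, 29)
                 for p, a in (("BJ", 500), ("SH", 3000))])

def plan(sql):
    return " ".join(r[-1] for r in cur.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT count(*) FROM record WHERE date > '19990901' AND date < '19991001'"

# Leading column place: the date-only range cannot seek, so the whole
# index (or table) is scanned end to end.
cur.execute("CREATE INDEX idx_pda ON record(place, date, amount)")
bad = plan(q)
cur.execute("DROP INDEX idx_pda")

# Leading column date: the same range becomes a covered index seek.
cur.execute("CREATE INDEX idx_dpa ON record(date, place, amount)")
good = plan(q)
print(bad)
print(good)
```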
2. Incomplete join conditions
---- Take a table card with 7,896 rows and a non-clustered index on card_no, and a table account with 191,122 rows and a non-clustered index on account_no. Consider how two SQL statements execute under different join conditions:
select sum(a.amount) from account a,
card b where a.card_no = b.card_no (20 seconds)
---- Rewrite the SQL as:
select sum(a.amount) from account a,
card b where a.card_no = b.card_no
and a.account_no = b.account_no (<1 second)
---- Analysis:
---- Under the first join condition, the best query plan makes account the outer table and card the inner table, using the index on card; the number of I/O operations can be estimated by the formula:
---- 22,541 pages of the outer table account + (191,122 rows of the outer table account × 3 pages looked up in the inner table card per outer row) = 595,907 I/Os
---- Under the second join condition, the best plan makes card the outer table and account the inner table, using the index on account; the I/O count is estimated as:
---- 1,944 pages of the outer table card + (7,896 rows of the outer table card × 4 pages looked up in the inner table account per outer row) = 33,528 I/Os
---- Clearly, only with complete join conditions will the true best plan be executed.
---- Conclusion:
---- 1. Before a multi-table operation is executed, the query optimizer lists several possible join plans based on the join conditions and picks the one with the lowest system overhead. The join conditions should take full account of the tables with indexes and the tables with many rows; the choice of inner and outer table can be determined by the formula: matching rows in the outer table × number of lookups in the inner table per outer row, with the smallest product being the best plan.
---- 2. To view the chosen execution plan, use set showplan on to display the join order and the indexes used. For deeper detail, execute dbcc(3604,310,302) with the sa role.
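Applying the formula from conclusion 1 to the statistics given above reproduces both I/O estimates:

```python
# Estimated I/O = pages of outer table + rows(outer) * pages probed in inner table.
def join_io(outer_pages: int, outer_rows: int, inner_pages_per_row: int) -> int:
    return outer_pages + outer_rows * inner_pages_per_row

incomplete = join_io(22541, 191122, 3)  # account outer, card inner
complete = join_io(1944, 7896, 4)       # card outer, account inner
print(incomplete, complete)  # 595907 33528
```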
3. Unoptimizable where clauses
---- 1. The columns in the where clauses of the following SQL statements all have suitable indexes, yet execution is very slow:
select * from record where
substring(card_no, 1, 4) = '5378' (13 seconds)
select * from record where
amount / 30 < 1000 (11 seconds)
select * from record where
convert(char(10), date, 112) = '19991201' (10 seconds)
---- Analysis:
---- Any operation on a column in the where clause is computed row by row at execution time, forcing a table scan and bypassing the index on that column. If these expressions could instead be evaluated at query-compile time, the SQL optimizer could use the indexes and avoid the table search. The statements are therefore rewritten as follows:
select * from record where card_no like
'5378%' (<1 second)
select * from record where amount
< 1000 * 30 (<1 second)
select * from record where date = '1999/12/01'
(<1 second)
---- You will find that the SQL runs obviously faster!
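The same contrast shows up in SQLite plans (a sketch with invented data; note that SQLite's LIKE-prefix optimization depends on collation settings, so the sargable form is written as an explicit range, which is equivalent to the prefix LIKE above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE record (card_no TEXT, amount INTEGER, date TEXT)")
cur.executemany("INSERT INTO record VALUES (?, ?, ?)",
                [(f"{5000 + i % 1000:04d}0000", i, "19991201") for i in range(5000)])
cur.execute("CREATE INDEX idx_card ON record(card_no)")

def plan(sql):
    return " ".join(r[-1] for r in cur.execute("EXPLAIN QUERY PLAN " + sql))

# Function applied to the column: evaluated row by row, so a full scan.
p_func = plan("SELECT * FROM record WHERE substr(card_no, 1, 4) = '5378'")

# Equivalent sargable range on the bare column: an index seek.
p_range = plan("SELECT * FROM record WHERE card_no >= '5378' AND card_no < '5379'")

print(p_func)
print(p_range)
```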
---- 2. Take a table stuff with 200,000 rows and a non-clustered index on id_no, and consider the following SQL:
select count(*) from stuff where id_no in ('0', '1')
(23 seconds)
---- Analysis:
---- 'in' in the where clause is logically equivalent to 'or', so the parser converts in ('0', '1') into id_no = '0' or id_no = '1' for execution. We would expect it to search each or branch separately and add up the results, so that the index on id_no could be used; but in fact (as showplan shows) it adopts the "OR strategy": it first fetches the rows satisfying each or clause into a worktable in the temporary database, then builds a unique index to remove duplicate rows, and finally computes the result from this temporary table. The index on id_no is therefore never used, and the completion time is also affected by the performance of the tempdb database.
---- Practice shows that the more rows the table has, the worse the worktable performs: with 620,000 rows in stuff, the execution time reaches 220 seconds! It is better to split the or clauses:
select count(*) from stuff where id_no = '0'
select count(*) from stuff where id_no = '1'
---- Two results are obtained, and adding them is worthwhile: because each statement uses the index, execution time is only 3 seconds, and with 620,000 rows it is only 4 seconds. Alternatively, write a simple stored procedure:
create proc count_stuff as
declare @a int
declare @b int
declare @c int
declare @d char(10)
begin
select @a = count(*) from stuff where id_no = '0'
select @b = count(*) from stuff where id_no = '1'
end
select @c = @a + @b
select @d = convert(char(10), @c)
print @d
---- The result is computed directly, and execution is just as fast as above!
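The equivalence behind this rewrite is easy to verify on a toy table. A minimal SQLite sketch (data invented) confirming that the IN query and the sum of the two single-value counts, as in the count_stuff procedure above, agree:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE stuff (id_no TEXT)")
cur.executemany("INSERT INTO stuff VALUES (?)", [(str(i % 3),) for i in range(3000)])
cur.execute("CREATE INDEX idx_id ON stuff(id_no)")

# One query with IN...
total_in = cur.execute(
    "SELECT count(*) FROM stuff WHERE id_no IN ('0', '1')").fetchone()[0]

# ...versus two single-value counts added together.
a = cur.execute("SELECT count(*) FROM stuff WHERE id_no = '0'").fetchone()[0]
b = cur.execute("SELECT count(*) FROM stuff WHERE id_no = '1'").fetchone()[0]
print(total_in, a + b)  # 2000 2000
```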
---- Conclusion:
---- As can be seen, the essence of an unoptimizable where clause is that it defeats the index, causing a table scan or extra overhead.
---- 1. Any operation on a column causes a table scan, including database functions and calculation expressions; when querying, move such operations to the right-hand side of the comparison wherever possible.
---- 2. in and or clauses often use worktables, invalidating the index; if they do not involve large numbers of duplicate values, consider splitting them, and the split clauses should be able to use the index.
---- 3. Be good at using stored procedures, which make SQL more flexible and efficient.
---- From the examples above, we can see that the essence of SQL optimization is to use statements the optimizer can recognize, make full use of indexes, reduce the number of I/O scans on tables, and avoid table searches as much as possible. In fact, SQL performance optimization is a complex process; the above is only one manifestation at the application level. Deeper study would also involve resource configuration at the database layer, traffic control at the network layer, and the overall design of the operating-system layer.