Database Query Optimization Technology
Database systems are the core of management information systems (MIS), and online transaction processing (OLTP) and online analytical processing (OLAP) built on databases are among the most important computer applications in banks, enterprises, and government departments. Application experience shows that queries account for the largest share of all database operations, and the SELECT statement, on which queries are based, is the most expensive statement in SQL. For example, once a bank's account table has accumulated millions or even tens of millions of records, a full table scan often takes ten minutes or even hours. A query strategy better than a full table scan can often cut the query time to a few minutes; hence the importance of query optimization technology.
In implementing application projects, the author has found that many programmers who use front-end database development tools (such as PowerBuilder or Delphi) pay attention only to a polished user interface and not to the efficiency of their query statements, so the resulting application systems are inefficient and waste resources badly. Designing efficient, well-reasoned query statements is therefore very important. Based on application examples and database theory, this paper introduces the application of query optimization technology in real systems.
Analyzing the problem
Many programmers believe that query optimization is the job of the DBMS (database management system) and has nothing to do with the SQL statements they write. This is wrong. A good query plan can often improve program performance by dozens of times. A query is the set of SQL statements submitted by the user; a query plan is the set of statements produced after optimization. The DBMS processes a query as follows: after lexical and syntactic checking, the statement is submitted to the DBMS's query optimizer; once the optimizer has completed algebraic optimization and access-path optimization, the precompiled module processes the statement and generates the query plan, which is then submitted to the system for execution at an appropriate time, and the results are finally returned to the user. Recent versions of commercial database products, such as Oracle and Sybase, use cost-based optimization, which estimates the cost of different query plans from information in the system dictionary tables and then chooses the better plan. Although today's database products do query optimization increasingly well, the user-submitted SQL statements are the basis for system optimization: it is hard to imagine a bad query plan becoming efficient after system optimization, so the quality of the statements the user writes is critical. We will not discuss optimization done by the system; the following focuses on ways to improve the user's query plans.
Solving the problem
The following uses the relational database system Informix as an example to describe how to improve user query plans.
1. Rational use of indexes
An index is an important data structure in a database; its basic purpose is to improve query efficiency. Most database products use the ISAM index structure first proposed by IBM. Indexes should be used judiciously, following these principles:
Create indexes on columns that are frequently used in joins but are not specified as foreign keys (for fields declared as foreign keys, the system generally generates indexes automatically).
Create indexes on columns that are frequently sorted or grouped (that is, used in GROUP BY or ORDER BY operations).
Create indexes on columns with many distinct values that are frequently used in conditional expressions; do not create indexes on columns with few distinct values. For example, the "sex" column of an employee table has only two values, "male" and "female", so there is no need to index it. Indexing such a column does not improve query efficiency, but it can significantly slow down updates.
If there are multiple columns to be sorted, you can create a composite (compound) index on those columns.
Use system tools. For example, the Informix database has a tbcheck tool that can check suspicious indexes. On some database servers, an index may become corrupted or lose read efficiency because of frequent manipulation; if a query that uses an index slows down, try using tbcheck to check the integrity of the index and repair it if necessary. In addition, after a database table receives a large volume of updates, dropping and rebuilding its indexes can increase query speed.
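As a minimal sketch of these principles (table, column, and index names are hypothetical, not from the original text):

-- principle 1: index a frequently joined column that is not a foreign key
CREATE INDEX idx_orders_customer ON orders (customer_num);
-- principles 2 and 4: one composite index covering the columns sorted together
CREATE INDEX idx_orders_cust_date ON orders (customer_num, order_date);
-- principle 3: a two-valued column such as paid_flag ('Y'/'N') is deliberately left unindexed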
2. Avoid or simplify sorting
Avoid or simplify repeated sorting of large tables. When the optimizer can use an index to produce output in the required order automatically, it skips the sorting step. The following factors prevent this:
The index does not include one or several columns to be sorted;
The order of the columns in the GROUP BY or ORDER BY clause differs from that of the index;
The sorted columns come from different tables.
To avoid unnecessary sorting, build indexes correctly and consolidate database tables sensibly (although this may sometimes affect table normalization, it is worthwhile relative to the efficiency gained). If sorting is unavoidable, try to simplify it, for example by narrowing the set of sorted columns.
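For example, a minimal sketch (names are hypothetical): when the index column order matches the ORDER BY clause, the optimizer can return rows in index order and skip the sort step entirely.

CREATE INDEX idx_cust_name_zip ON customer (name, zipcode);
-- the ORDER BY below matches the index column order, so no separate sort is needed
SELECT name, zipcode FROM customer ORDER BY name, zipcode;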
3. Eliminate sequential access to large table rows
In nested queries, sequential access to a table can be fatal to query efficiency. For example, with a sequential access strategy, a query nested three levels deep, with each level scanning 1,000 rows, scans one billion rows in total. The primary way to avoid this is to index the join columns. For example, given two tables, a student table (student ID, name, age, ...) and a course enrollment table (student ID, course number, grade): if the two tables are to be joined, an index should be created on the join field, "student ID".
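A hedged sketch of this join, with hypothetical English names standing in for the student and enrollment tables above:

CREATE INDEX idx_enroll_student ON enrollment (student_id);
-- with the index on the join column, the inner table is probed by key
-- instead of being scanned sequentially for every student row
SELECT s.name, e.course_num, e.grade
FROM student s, enrollment e
WHERE s.student_id = e.student_id;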
Unions can also be used to avoid sequential access. Although indexes may exist on all the columns being checked, some forms of WHERE clause force the optimizer to use sequential access. The following query forces a sequential scan of the orders table:
SELECT * FROM orders WHERE (customer_num = 104 AND order_num > 1001) OR order_num = 1008
Although indexes exist on customer_num and order_num, the optimizer uses a sequential access path to scan the entire table for the statement above. Because this statement retrieves a disjoint set of rows, it should be changed to the following:
SELECT * FROM orders WHERE customer_num = 104 AND order_num > 1001
UNION
SELECT * FROM orders WHERE order_num = 1008
This allows the query to be processed using the index path.
4. Avoid correlated subqueries
When a column appears both in the outer query and in the subquery of the WHERE clause, the subquery will probably have to be re-evaluated every time the column value in the outer query changes. The more deeply nested the query, the lower the efficiency, so subqueries should be avoided where possible. If a subquery is unavoidable, filter out as many rows as possible inside it.
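A minimal sketch with hypothetical customer and orders tables: the correlated form re-evaluates the subquery for every outer row, while the equivalent join form leaves the optimizer free to pick a better plan.

-- correlated subquery: re-evaluated once per customer row
SELECT c.customer_num, c.name
FROM customer c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_num = c.customer_num);

-- equivalent join: DISTINCT removes the duplicates introduced by the join
SELECT DISTINCT c.customer_num, c.name
FROM customer c, orders o
WHERE c.customer_num = o.customer_num;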
5. Avoid difficult regular expressions
The MATCHES and LIKE keywords support wildcard matching, technically called regular expressions, but this kind of matching is especially time-consuming. For example: SELECT * FROM customer WHERE zipcode LIKE "98___"
Even if an index has been created on the zipcode field, sequential scanning is used in this case. If the statement is changed to SELECT * FROM customer WHERE zipcode > "98000", the index is used when the query executes, which obviously speeds it up significantly.
Also avoid substrings that do not start at the beginning of the column. For example, SELECT * FROM customer WHERE zipcode[2,3] > "80" uses a non-initial substring in the WHERE clause, so the statement cannot use an index.
6. Use temporary tables to speed up queries
Sorting a subset of a table and creating a temporary table can sometimes speed up queries. It helps avoid multiple sorting operations and in other ways simplifies the optimizer's work. For example:
SELECT cust.name, rcvbles.balance, ...other columns
FROM cust, rcvbles
WHERE cust.customer_id = rcvbles.customer_id
AND rcvbles.balance > 0
AND cust.postcode > "98000"
ORDER BY cust.name
If this query is to be executed more than once, all the unpaid customers can instead be found in a temporary file, sorted by customer name:
SELECT cust.name, rcvbles.balance, ...other columns
FROM cust, rcvbles
WHERE cust.customer_id = rcvbles.customer_id
AND rcvbles.balance > 0
ORDER BY cust.name
INTO TEMP cust_with_balance
Then query the temporary table as follows:
SELECT * FROM cust_with_balance
WHERE postcode > "98000"
The temporary table has fewer rows than the main table, and its physical order is the required order, so disk I/O is reduced and the query workload can drop drastically.
Note: once created, the temporary table does not reflect changes to the main table. When data in the main table is modified frequently, be careful not to lose data.
7. Use sorting to replace non-sequential access
Non-sequential disk access is the slowest operation, manifested in the back-and-forth movement of the disk access arm. SQL hides this situation, making it easy to write queries that access a large number of non-sequential pages.
In some cases, using the database's sorting ability instead of non-sequential access can improve the query.
Example analysis
Here is a manufacturing-company example that illustrates how query optimization is done. The manufacturing company's database includes three tables, whose schema is as follows:
1. Part table
part number   part description          other columns
(part_num)    (part_desc)               (other columns)
102032        Seagate 30G disk          ...
500049        Novell 10M network card   ...
......
2. Vendor table
vendor number   vendor name      other columns
(vendor_num)    (vendor_name)    (other columns)
910257          Seagate Corp     ...
523045          IBM Corp         ...
......
3. Parven table
part number   vendor number   part amount
(part_num)    (vendor_num)    (part_amount)
102032        910257          3,450,000
234423        321001          4,000,000
......
The following query is run regularly against these tables to produce a report on the quantities of all parts:
SELECT part_desc, vendor_name, part_amount
FROM part, vendor, parven
WHERE part.part_num = parven.part_num
AND parven.vendor_num = vendor.vendor_num
ORDER BY part.part_num
Without indexes, the overhead of the query above would be enormous, so we create indexes on the part number and the vendor number. Creating the indexes avoids repeated scans in the nesting. Statistics about the tables and indexes are as follows:
(table)    (row size)   (row count)   (rows/page)   (data pages)
part       150          10,000        25            400
vendor     150          1,000         25            40
parven     13           15,000        300           50
(index)    (key size)   (keys/page)   (leaf pages)
part       4            500           20
vendor     4            500           2
parven     8            250           60
This looks like a relatively simple three-table join, but its query overhead is significant. The system tables show that there are clustered indexes on part_num and vendor_num, so those indexes are stored in physical order, while the parven table has no particular storage order. The sizes of these tables show that the hit rate for non-sequential access from buffered pages is very small. The optimized query plan for this statement is: first read the 400 pages of part sequentially, then access the parven table 10,000 times via its index, two pages each time (one index page, one data page), for 20,000 disk pages, and finally access the vendor table 15,000 times, for 30,000 disk pages. So even this well-indexed join costs 400 + 20,000 + 30,000 = 50,400 disk page accesses.
In fact, we can improve query efficiency by using temporary tables, in three steps:
1. Read the data from the parven table in vendor_num order:
SELECT part_num, vendor_num, part_amount
FROM parven
ORDER BY vendor_num
INTO TEMP pv_by_vn
This statement reads parven sequentially (50 pages), writes to a temporary table (50 pages), and sorts it. Assuming a sort cost of 200 pages, the total is 300 pages.
2. Join the temporary table with the vendor table, output the result to another temporary table, sorted by part_num:
SELECT pv_by_vn.*, vendor.vendor_name
FROM pv_by_vn, vendor
WHERE pv_by_vn.vendor_num = vendor.vendor_num
ORDER BY pv_by_vn.part_num
INTO TEMP pvvn_by_pn
DROP TABLE pv_by_vn
This query reads pv_by_vn (50 pages) and accesses the vendor table 15,000 times through the index; but because the reads are in vendor_num order, it actually reads the vendor table in index order (40 + 2 = 42 pages). The output table has about 95 rows per page, 160 pages in all; writing and accessing these pages incurs 5 * 160 = 800 reads and writes. In total, this step reads and writes 50 + 42 + 800 = 892 pages.
3. Join the output with the part table to get the final result:
SELECT pvvn_by_pn.*, part.part_desc
FROM pvvn_by_pn, part
WHERE pvvn_by_pn.part_num = part.part_num
DROP TABLE pvvn_by_pn
This query reads pvvn_by_pn sequentially (160 pages) and reads the part table 15,000 times through the index; because pvvn_by_pn is already in part_num order, those indexed reads touch part's pages in order (400 + 20 = 420 pages). In total, the three steps incur 300 + 892 + 580 = 1,772 disk reads and writes, an optimization ratio of about 30:1 versus the 50,400 above. The author ran the same experiment on Informix Dynamic Server and found a 5:1 optimization ratio in elapsed time (the ratio may grow as the data volume increases).
Summary
Twenty percent of the code uses 80 percent of the time; this well-known law applies to database applications as well. Our optimization should capture the key issues, and for database applications the key is the execution efficiency of SQL. The key points of query optimization are making the database server read less data from disk, and reading pages sequentially rather than non-sequentially.
How to make your SQL run faster
----People who use SQL often fall into a common trap: they focus on whether the results are correct and ignore the possible performance differences between different ways of writing the same statement. Those differences are especially evident in large or complex database environments, such as online transaction processing (OLTP) or decision support systems (DSS). In working practice, the author has found that bad SQL often comes from improper index design, insufficient join conditions, and unoptimizable WHERE clauses. After proper optimization of these, execution speed improves markedly. Below, each of these three aspects is summarized in turn.
----To illustrate more intuitively, any SQL in these examples whose measured runtime was no more than 1 second is shown as (< 1 second).
----Test environment:
----Host: HP LH II
----Clock speed: 330 MHz
----Memory: 128 MB
----Operating system: OpenServer 5.0.4
----Database: Sybase 11.0.3
First, unreasonable index design
----Example: the table record has 620,000 rows. Consider how the following SQL statements behave under different indexes:
----1. A non-clustered index was built on date
select count(*) from record
where date > '19991201' and date < '19991214' and amount > 2000
(25 seconds)
select date, sum(amount) from record group by date
(55 seconds)
select count(*) from record
where date > '19990901' and place in ('BJ', 'SH')
(27 seconds)
----Analysis:
----date has a large number of duplicate values. Under a non-clustered index, the data is physically stored randomly across the data pages, so a range lookup must perform a table scan to find all the rows within the range.
----2. A clustered index on date
select count(*) from record
where date > '19991201' and date < '19991214' and amount > 2000
(14 seconds)
select date, sum(amount) from record group by date
(28 seconds)
select count(*) from record
where date > '19990901' and place in ('BJ', 'SH')
(14 seconds)
----Analysis:
----Under the clustered index, the data is physically stored in order on the data pages and duplicate values are grouped together, so a range lookup can first find the starting point of the range and then scan only the data pages within it, avoiding a large-scale scan and improving query speed.
----3. Composite index on place, date, amount
select count(*) from record
where date > '19991201' and date < '19991214' and amount > 2000
(26 seconds)
select date, sum(amount) from record group by date
(27 seconds)
select count(*) from record
where date > '19990901' and place in ('BJ', 'SH')
(< 1 second)
----Analysis:
----This is an unreasonable composite index, because its leading column is place. The first and second SQL statements do not reference place, so they make no use of the index. The third SQL does use place, and all the columns it references are included in the composite index, forming index coverage, so it is very fast.
----4. Composite index on date, place, amount
select count(*) from record
where date > '19991201' and date < '19991214' and amount > 2000
(< 1 second)
select date, sum(amount) from record group by date
(11 seconds)
select count(*) from record
where date > '19990901' and place in ('BJ', 'SH')
(< 1 second)
----Analysis:
----This is a reasonable composite index. It takes date as the leading column, so every SQL statement can take advantage of the index, and index coverage is formed in the first and third SQL, so performance is optimal.
----5. Summary:
----The index created by default is a non-clustered index, but it is sometimes not optimal; reasonable index design must be based on analysis and prediction of the various queries. Generally speaking:
----①. For columns with a large number of duplicate values that are frequently queried by range (between, >, <, >=, <=) or used in group by or order by, consider a clustered index;
----②. For multiple columns that are frequently accessed at the same time, each containing duplicate values, consider a composite index;
----③. A composite index should try to make the critical queries index-covered, and its leading column must be the most frequently used column.
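----As a minimal sketch (index names are illustrative; the article tested each case separately), the indexes from cases 2 and 4 above could be created like this:

-- case 2: clustered index on the heavily range-queried date column
create clustered index idx_record_date on record (date)
-- case 4: composite index with date as the leading column
create index idx_record_date_place_amount on record (date, place, amount)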
Second, insufficient join conditions:
----Example: the table card has 7,896 rows with a non-clustered index on card_no, and the table account has 191,122 rows with a non-clustered index on account_no. Look at the execution of two SQL statements under different table join conditions:
select sum(a.amount) from account a, card b
where a.card_no = b.card_no
(20 seconds)
----Change SQL to:
select sum(a.amount) from account a, card b
where a.card_no = b.card_no and a.account_no = b.account_no
(< 1 second)
----Analysis:
----With the first join condition, the best query plan makes account the outer table and card the inner table, using the index on card. The number of I/Os can be estimated by the following formula:
----22,541 pages of the outer table account + (191,122 rows of the outer table account * 3 pages to find each corresponding row in the inner table card) = 595,907 I/Os
----With the second join condition, the best query plan makes card the outer table and account the inner table, using the index on account. The number of I/Os can be estimated by the following formula:
----1,944 pages of the outer table card + (7,896 rows of the outer table card * 4 pages to find each corresponding row in the inner table account) = 33,528 I/Os
----As can be seen, the truly best plan is executed only when a sufficient join condition is given.
----Summary:
----1. Before actual execution, the query optimizer lists several possible join plans based on the join conditions and finds the best plan with the least system cost. The join conditions must take into account the tables with indexes and the tables with many rows; the choice of inner and outer tables can be determined by the formula: the number of rows in the outer table multiplied by the number of pages read per lookup in the inner table, with the smallest product being the best plan.
----2. To view the chosen execution plan, use set showplan on to open the SHOWPLAN option; you can then see the join order and which indexes are used. For more detailed information, the sa role is required to execute dbcc traceon(3604, 310, 302).
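----For example, a minimal sketch using the join from the example above:

set showplan on
go
select sum(a.amount) from account a, card b
where a.card_no = b.card_no and a.account_no = b.account_no
go
set showplan off
go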
Third, unoptimizable WHERE clauses
----1. Example: the columns in the following SQL conditional statements have appropriate indexes, but the execution speed is very slow:
select * from record
where substring(card_no, 1, 4) = '5378'
(13 seconds)
select * from record
where amount/30 < 1000
(11 seconds)
select * from record
where convert(char, date, 112) = '19991201'
(10 seconds)
----Analysis:
----Any operation on a column in the WHERE clause is computed column by column at SQL runtime, so the server must do a table scan and cannot use the index on that column. If the result is available when the query is compiled, the SQL optimizer can optimize it and use the index to avoid the table scan. Therefore, rewrite the SQL as follows:
select * from record where card_no like '5378%'
(< 1 second)
select * from record where amount < 1000*30
(< 1 second)
select * from record where date = '1999/12/01'
(< 1 second)
----You will find that the SQL becomes noticeably faster.
----2. Example: the table stuff has 200,000 rows and a non-clustered index on id_no. Look at the following SQL:
select count(*) from stuff where id_no in ('0', '1')
(23 seconds)
----Analysis:
The "in" in the----where condition is logically equivalent to ' or ', so the parser converts in (' 0 ', ' 1 ')
Executes for id_no = ' 0 ' or id_no= ' 1 '. We expect it to be looked up separately according to each or clause, and then the result
Added so that you can take advantage of the index on the Id_no, but actually (according to Showplan), it takes the "or policy"
, the row that satisfies each or clause is first fetched into the worksheet in the temporary database, and the unique index is used to remove
Repeat the row, and finally compute the result from this temporary table. As a result, the actual process does not use the Id_no index and ends
Time is also affected by the performance of the tempdb database.
----Practice shows that the more rows the table has, the worse the worksheet performs: when stuff has 620,000 rows, the execution time reaches 220 seconds. It is better to separate the OR clauses:
select count(*) from stuff where id_no = '0'
select count(*) from stuff where id_no = '1'
----Get the two results and add them. Because each statement uses the index, the execution time is only 3 seconds; with 620,000 rows, it is only 4 seconds. Or, better still, write a simple stored procedure:
create proc count_stuff as
declare @a int
declare @b int
declare @c int
declare @d char(10)
begin
  select @a = count(*) from stuff where id_no = '0'
  select @b = count(*) from stuff where id_no = '1'
  select @c = @a + @b
  select @d = convert(char(10), @c)
  print @d
end
----The result is computed directly inside the procedure, and the execution time is as fast as above.
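----A hypothetical invocation of the procedure:

exec count_stuff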
----Summary:
----Clearly, a WHERE clause is "optimized" when it can take advantage of an index; one that causes a table scan or extra overhead is unoptimizable.
----1. Any operation on a column will cause a table scan; this includes database functions, calculation expressions, and so on. When querying, move operations to the right side of the equals sign whenever possible.
----2. in and or clauses often use worksheets, which invalidates the index; if they do not produce a large number of duplicate values, consider splitting the clause apart. The split clauses should contain indexed columns.
----3. Be adept at using stored procedures, which make SQL more flexible and efficient.
----From the examples above, we can see that the essence of SQL optimization is, on the premise of correct results, to use statements the optimizer can recognize, make full use of indexes, reduce the number of I/Os caused by table scans, and avoid table searches as much as possible. Actually, SQL performance optimization is a complex process; the points above are only one embodiment at the application level. Deeper research would also involve resource configuration at the database layer, traffic control at the network layer, and the overall design of the operating-system layer.
1 Logical database and table design
The logical design of the database, including the relationships between tables, is the core of optimizing relational database performance. A good logical database design lays a good foundation for optimizing the database and the application.
Normalized logical database design involves using multiple, interconnected narrow tables instead of one long table with many columns. Here are some of the benefits of using normalized tables:
A: Because the tables are narrow, sorting and index creation are faster
B: Because there are multiple tables, multiple clustered indexes are possible
C: Indexes are narrower and more compact
D: Each table can have fewer indexes, which speeds up INSERT, UPDATE, and DELETE, since these operations can significantly hurt system performance when there are many indexes
E: Fewer NULLs and fewer redundant values increase the compactness of the database
On the other hand, normalization increases the number of tables referenced when fetching data and the complexity of the joins between them. Too many tables and overly complex joins degrade server performance, so a balance must be struck between the two.
The main thing to watch when defining the primary keys and foreign keys of related tables is that keys used to join tables and enforce references must have the same data type, as in the sketch below.
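A minimal sketch (table and column names are hypothetical): the join key customer_id is declared with the same data type in both tables.

create table customer (
    customer_id int not null primary key,
    name        char(40) not null
)
create table orders (
    order_num   int not null primary key,
    customer_id int not null references customer (customer_id)
)
-- customer_id is int in both tables; joining mismatched types would force
-- conversions and keep the optimizer from using an index on the key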
2 Design of the index
A: Avoid table scans as much as possible
Check the WHERE clause of your query statements, because that is where the optimizer pays attention. Each column included in the WHERE clause is a possible candidate for an index. For optimal performance, consider the example below, which concerns a column column1 in a WHERE clause.
The following two conditions allow an index to improve query performance:
First: there is a single-column index on column1 in the table
Second: there is a multi-column (composite) index on the table in which column1 is the first indexed column
Avoid defining a multi-column index in which column1 is the second or a later column; such an index does not optimize server performance
The following example uses the pubs database.
SELECT au_id, au_lname, au_fname FROM authors
WHERE au_lname = 'White'
Indexes created on the following columns will be useful to the optimizer:
au_lname
(au_lname, au_fname)
Indexes created on the following columns will not work well for the optimizer:
au_address
(au_fname, au_lname)
Consider using narrow indexes on one or two columns; narrow indexes are more effective than multi-column and composite indexes. With a narrow index, there are more rows per page and fewer index levels (relative to multi-column and composite indexes), which boosts system performance.
For a multi-column index, SQL Server maintains density statistics (used for joins) on the combination of all the columns, and histogram statistics only on the first indexed column. According to the statistics, if the first column of a composite index is rarely selective, the optimizer will not use the index for many query requests.
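Because the optimizer relies on these density and histogram statistics, it helps to keep them fresh after heavy data changes. A minimal sketch (authors is the pubs sample table):

update statistics authors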
Useful indexes improve the performance of SELECT statements, including the row lookups performed inside INSERT, UPDATE, and DELETE operations. However, changing a table's contents affects its indexes: every INSERT, UPDATE, and DELETE statement pays a performance cost for index maintenance. Experience shows that you should not build a large number of indexes on a single table, nor overlapping indexes on shared columns (columns involved in referential constraints across tables).
Check the number of unique values in a column and compare it with the number of rows in the table. This is the selectivity of the data; the result will help you decide whether a column is a candidate for indexing and, if so, which kind of index to build. You can use the following query to return the number of distinct values in a column:
select count(distinct column_name) from table_name
Assuming the table has 10,000 rows, look at the count returned for column_name to decide whether an index should be used and which kind:
Unique values   Index
5,000           Nonclustered index
                Clustered index
3               No index
Choosing between clustered and non-clustered indexes
(1) A clustered index means that the physical order of the rows is the same as the index order, and the leaf level of the index contains the actual data pages. A table can have only one clustered index. Because UPDATE and DELETE statements involve a relatively large number of read operations, a clustered index can often speed up such operations. Any table that has at least one index should have a clustered index.
Consider using a clustered index in the following cases:
For example, when a column contains a limited (but not very small) number of distinct values: the state column of a customer table holds roughly 50 different state-name abbreviations, so it can use a clustered index.
Likewise, columns queried by range, such as columns operated on with between, >, >=, <, <=, can use a clustered index:
SELECT * FROM sales WHERE ord_date BETWEEN '5/1/93' AND '6/1/93'
Similarly, a column that returns a large result set in a query can use a clustered index:
SELECT * FROM phonebook WHERE last_name = 'Smith'
When a large number of rows are being inserted into a table, avoid creating a clustered index on a naturally growing column (for example, an identity column) in that table. If you build such a clustered index, insert performance drops sharply, because every inserted row must go to the end of the table, onto the table's last data page.
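A minimal sketch (hypothetical table, SQL Server style): cluster on a column that spreads inserts across pages, and give the identity column a nonclustered index instead.

create table trans_log (
    trans_id   int identity not null,
    trans_date datetime not null,
    card_no    char(16) not null
)
-- clustering on card_no scatters concurrent inserts across many data pages
create clustered index idx_translog_card on trans_log (card_no)
-- the monotonically growing identity column gets a nonclustered index
create index idx_translog_id on trans_log (trans_id)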