MySQL Database Query Optimization Technology

Source: Internet
Author: User
Tags: informix

Many programmers, when using front-end database development tools (such as PowerBuilder and Delphi) to build database applications, pay attention only to a polished user interface and neglect the efficiency of their query statements. As a result, the applications they develop run inefficiently and waste resources heavily. Designing efficient, reasonable query statements is therefore very important. Drawing on application examples and database theory, this article introduces how query optimization technology is applied in a real system.

Analyzing the problem

Many programmers think that query optimization is the job of the DBMS (database management system) and has little to do with the SQL statements they write. This is wrong. A good query plan can often improve program performance by a factor of dozens. A query is a set of SQL statements submitted by the user; a query plan is the set of statements the system generates after optimization. The DBMS processes a query as follows: after the lexical and syntax checks of the query statement are completed, the statement is submitted to the DBMS's query optimizer; once the optimizer has completed algebraic optimization and access-path optimization, the pre-compilation module processes the statement and generates the query plan; the plan is then submitted to the system for execution at the appropriate time, and finally the result is returned to the user.

Recent versions of actual database products (such as Oracle and Sybase) use cost-based optimization, which estimates the cost of different query plans from information in the system dictionary tables and then selects the better plan. Although database products keep getting better at query optimization, the SQL statements submitted by the user remain the basis on which the system optimizes: it is hard to imagine a bad query plan becoming efficient after system optimization. The quality of the statements users write is therefore crucial. We will not discuss the system's own query optimization further; the following describes ways to improve the user's query plan.
Solving the problem

The following uses Informix as an example to describe how to improve a user's query plan.

1. Use indexes reasonably

An index is an important data structure in a database; its fundamental purpose is to improve query efficiency. Most database products currently adopt the ISAM index structure first proposed by IBM. Indexes should be used properly. The usage principles are as follows:
● Create indexes on columns that are frequently joined but are not specified as foreign keys (fields that are joined infrequently can be left to the optimizer to index automatically).
● Index columns that are frequently sorted or grouped (that is, used in GROUP BY or ORDER BY operations).
● Create indexes on columns with many distinct values that are frequently used in conditional expressions; do not create indexes on columns with few distinct values. For example, the "gender" column of an employee table has only two values, "male" and "female", so there is no need to index it. An index there would not improve query efficiency but would greatly reduce update speed.
● If there are multiple columns to be sorted, you can create a compound index on them.
● Use system tools. For example, the Informix database has a tbcheck tool that can check suspicious indexes. On some database servers, frequent operation can leave an index invalid or degrade read efficiency. If an index-based query slows down, use tbcheck to check the index's integrity and repair it if necessary. In addition, after a table has had a large volume of updates, dropping and rebuilding its indexes can increase query speed.
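The effect of the first principle can be observed directly. The following is a minimal sketch using SQLite rather than Informix (the table and index names are illustrative): EXPLAIN QUERY PLAN shows the access path switching from a sequential scan to an index search once a selective column is indexed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER, dept TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(i, "dept%d" % (i % 50), "M" if i % 2 else "F") for i in range(1000)],
)

def access_path(sql):
    # The fourth column of EXPLAIN QUERY PLAN output describes the access path.
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]

# Without an index, the table is scanned sequentially.
before = access_path("SELECT * FROM employee WHERE dept = 'dept7'")

# Index the selective column (many distinct values) -- not 'gender',
# which has only two values and would not help.
conn.execute("CREATE INDEX idx_dept ON employee(dept)")
after = access_path("SELECT * FROM employee WHERE dept = 'dept7'")

print(before)  # a sequential SCAN of employee
print(after)   # a SEARCH using idx_dept
```

The same experiment in Informix would use SET EXPLAIN ON instead of EXPLAIN QUERY PLAN; the principle is identical.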

2. Avoid or simplify sorting

Repeated sorting of large tables should be simplified or avoided. When an index can deliver output in the required order automatically, the optimizer skips the sorting step. The sort cannot be eliminated when:
● the index does not contain one or more of the columns to be sorted;
● the order of columns in the GROUP BY or ORDER BY clause differs from that of the index;
● the sort columns come from different tables.
To avoid unnecessary sorting, add indexes correctly and merge database tables where reasonable (this may sometimes affect table normalization, but the efficiency gain is worth it). If sorting is unavoidable, try to simplify it, for example by narrowing the range of columns being sorted.

3. Eliminate sequential access to data in large tables

In nested queries, sequential access to a table can have a fatal impact on query efficiency. For example, with a sequential-access strategy, a query nested three levels deep that scans 1,000 rows at each level scans one billion rows in total. The primary way to avoid this is to index the joined columns. For example, given two tables, a student table (student ID, name, age, ...) and a course-selection table (student ID, course number, grade), if the two tables are to be joined, an index should be created on the join field, "student ID".

A UNION can also be used to avoid sequential access. Even when indexes exist on all the queried columns, some forms of WHERE clause force the optimizer into sequential access. The following query forces a sequential scan of the orders table:

SELECT * FROM orders WHERE (customer_num = 104 AND order_num > 1001) OR order_num = 1008

Although indexes exist on customer_num and order_num, the optimizer still scans the whole table sequentially for the statement above. Because the statement retrieves a union of separate row sets, it should be rewritten as:

SELECT * FROM orders WHERE customer_num = 104 AND order_num > 1001
UNION
SELECT * FROM orders WHERE order_num = 1008

In this form, each branch of the query can be processed via an index path.
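Whether the UNION form actually wins depends on the optimizer at hand, but the rewrite itself is always safe: UNION removes duplicates, so a row matching both branches appears once either way. A small SQLite sketch (illustrative data) confirming the two forms return identical results:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_num INTEGER, order_num INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(104, 1000), (104, 1002), (104, 1008),
                  (200, 1008), (200, 999)])
conn.execute("CREATE INDEX idx_cust ON orders(customer_num)")
conn.execute("CREATE INDEX idx_onum ON orders(order_num)")

# The OR form from the text.
or_form = conn.execute(
    "SELECT * FROM orders "
    "WHERE (customer_num = 104 AND order_num > 1001) OR order_num = 1008 "
    "ORDER BY customer_num, order_num").fetchall()

# The UNION rewrite: each branch can use its own index.
union_form = conn.execute(
    "SELECT * FROM orders WHERE customer_num = 104 AND order_num > 1001 "
    "UNION "
    "SELECT * FROM orders WHERE order_num = 1008 "
    "ORDER BY customer_num, order_num").fetchall()

print(or_form)    # [(104, 1002), (104, 1008), (200, 1008)]
print(union_form) # [(104, 1002), (104, 1008), (200, 1008)]
```

Note that (104, 1008) satisfies both branches yet appears only once in the UNION result.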

4. Avoid correlated subqueries

In a correlated subquery, a column's label appears in both the outer query and the WHERE clause of the subquery; whenever the column's value in the outer query changes, the subquery is likely to be re-evaluated. The more levels of nesting, the lower the efficiency. Subqueries should therefore be avoided where possible. If a subquery is unavoidable, filter out as many rows as possible inside it.
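A common way to avoid the per-row re-evaluation is to rewrite the correlated subquery as a join. The following sketch (SQLite; the student/score tables and the passing-grade condition are illustrative, echoing the student example earlier) shows the two equivalent forms:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (student_id INTEGER, name TEXT);
CREATE TABLE score (student_id INTEGER, course TEXT, grade INTEGER);
INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cid');
INSERT INTO score VALUES (1, 'db', 90), (2, 'db', 55), (1, 'os', 70);
""")

# Correlated form: the subquery's WHERE references the outer row (s),
# so it is logically evaluated once per student.
correlated = conn.execute("""
    SELECT name FROM student s
    WHERE EXISTS (SELECT 1 FROM score c
                  WHERE c.student_id = s.student_id AND c.grade >= 60)
    ORDER BY name""").fetchall()

# Join form: a single pass, with DISTINCT removing duplicate matches.
joined = conn.execute("""
    SELECT DISTINCT s.name FROM student s
    JOIN score c ON c.student_id = s.student_id
    WHERE c.grade >= 60
    ORDER BY s.name""").fetchall()

print(correlated)  # [('Ann',)]
print(joined)      # [('Ann',)]
```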

5. Avoid difficult regular expressions

The MATCHES and LIKE keywords support wildcard matching, technically called regular expressions. But this kind of matching is especially time-consuming. Example:

SELECT * FROM customer WHERE zipcode LIKE "98___"

Even with an index on the zipcode field, this case uses a sequential scan. If the statement is changed to SELECT * FROM customer WHERE zipcode > "98000", the query is executed using the index, which obviously increases the speed.
In addition, avoid non-starting substrings. For example, the clause WHERE zipcode[2,3] > "80" uses a non-starting substring, so the statement cannot use an index.
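The contrast between an index-friendly range predicate and an index-defeating pattern can be seen in a small sketch (SQLite; table contents are illustrative). A pattern that does not constrain the start of the string, like a leading wildcard, forces a sequential scan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, zipcode TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [("c%d" % i, "%05d" % (97000 + i)) for i in range(2000)])
conn.execute("CREATE INDEX idx_zip ON customer(zipcode)")

def detail(sql):
    # First step of the query plan.
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]

# Range comparison: the index on zipcode is usable.
p_range = detail("SELECT * FROM customer WHERE zipcode > '98000'")
# Leading wildcard: nothing anchors the index, so the table is scanned.
p_wild = detail("SELECT * FROM customer WHERE zipcode LIKE '%800'")

print(p_range)  # a SEARCH using idx_zip
print(p_wild)   # a sequential SCAN of customer
```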

6. Use temporary tables to accelerate queries

Sorting a subset of a table into a temporary table can sometimes speed up queries. It avoids repeated sorting operations and can otherwise simplify the optimizer's work. For example:

SELECT cust.name, rcvbles.balance, ... other columns
FROM cust, rcvbles
WHERE cust.customer_id = rcvbles.customer_id
  AND rcvbles.balance > 0
  AND cust.postcode > "98000"
ORDER BY cust.name

If this query is to be executed more than once, all the customers with unpaid balances can be found in a temporary file first, sorted by customer name:

SELECT cust.name, rcvbles.balance, ... other columns
FROM cust, rcvbles
WHERE cust.customer_id = rcvbles.customer_id
  AND rcvbles.balance > 0
ORDER BY cust.name
INTO TEMP cust_with_balance

Then query the temporary table as follows:

SELECT * FROM cust_with_balance
WHERE postcode > "98000"

The temporary table has fewer rows than the main table, and its physical order is the required order, which reduces disk I/O; the query workload can therefore be greatly reduced. Note: once a temporary table is created, modifications to the main table are not reflected in it. When the main table's data changes frequently, be careful not to work from stale data.
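The pattern above can be sketched end to end in SQLite (which uses CREATE TEMP TABLE ... AS SELECT rather than Informix's INTO TEMP; table contents are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cust (customer_id INTEGER, name TEXT, postcode TEXT);
CREATE TABLE rcvbles (customer_id INTEGER, balance INTEGER);
INSERT INTO cust VALUES (1,'Ann','98001'), (2,'Bob','97005'), (3,'Cid','98200');
INSERT INTO rcvbles VALUES (1, 120), (2, 80), (3, 0);
""")

# Step 1: one pass over the join, pre-sorted by name, into a temp table.
conn.execute("""
CREATE TEMP TABLE cust_with_balance AS
SELECT cust.name, rcvbles.balance, cust.postcode
FROM cust JOIN rcvbles ON cust.customer_id = rcvbles.customer_id
WHERE rcvbles.balance > 0
ORDER BY cust.name""")

# Step 2 (repeatable): cheap narrower queries against the small table.
rows = conn.execute(
    "SELECT name, balance FROM cust_with_balance WHERE postcode > '98000'"
).fetchall()
print(rows)  # [('Ann', 120)]
```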

7. Use sorting to replace non-sequential access

Non-sequential disk access is the slowest operation, manifested in the back-and-forth movement of the disk arm. SQL statements hide this, making it easy, when writing an application, to produce a query that requires access to a large number of non-sequential pages.
In some cases, the database sorting capability can be used to replace non-sequential access to improve queries.

Instance analysis

The following example from a manufacturing company illustrates how to optimize a query. The company's database contains three tables, with the following schemas:
1. part table

part number    description                other columns
(part_num)     (part_desc)                (other columns)
102,032        Seagate 30G disk           ......
500,049        Novell 10M network card    ......
......

2. vendor table

vendor number  vendor name    other columns
(vendor_num)   (vendor_name)  (other columns)
910,257        Seagate Corp   ......
523,045        IBM Corp       ......
......

3. parven (part quantity) table

part number    vendor number  part amount
(part_num)     (vendor_num)   (part_amount)
......

The following query runs regularly against these tables to produce a report on the quantities of all parts:

SELECT part_desc, vendor_name, part_amount
FROM part, vendor, parven
WHERE part.part_num = parven.part_num
  AND parven.vendor_num = vendor.vendor_num
ORDER BY part.part_num
If no indexes are created, the overhead of this query will be huge, so we create indexes on the part number and the vendor number. Index creation avoids repeated scanning within the nesting. Statistics on the tables and indexes are as follows:
Table     row size   row count   rows/page   data pages
part      150        10,000      25          400
vendor    150        1,000       25          40
parven    13         15,000      300         50

Index     key size   keys/page   leaf pages
part      4          500         20
vendor    4          500         2
parven    8          250         60
This looks like a relatively simple three-table join, but its query cost is very high. The system tables show clustered indexes on part_num and vendor_num, so those tables are stored in physical index order, while the parven table has no particular storage order. Given the sizes of these tables, the hit rate for non-sequential access from the buffer pages is very small. The best query plan for this statement is: read the 400 pages of part sequentially, then probe parven non-sequentially 10,000 times at 2 pages each (one index page plus one data page) for 20,000 disk pages, then probe vendor non-sequentially 15,000 times for another 30,000 disk pages. In total, this plan costs 50,400 disk page accesses.

In fact, the query can be made more efficient in three steps using temporary tables:

1. Read the parven table in vendor_num order:

SELECT part_num, vendor_num, part_amount
FROM parven
ORDER BY vendor_num
INTO TEMP pv_by_vn

This statement reads parven sequentially (50 pages), writes a temporary table (50 pages), and sorts it. Assuming a sort cost of 200 pages, the total is 300 pages.

2. Join the temporary table to the vendor table, sorting the output by part_num into another temporary table:

SELECT pv_by_vn.*, vendor.vendor_name
FROM pv_by_vn, vendor
WHERE pv_by_vn.vendor_num = vendor.vendor_num
ORDER BY pv_by_vn.part_num
INTO TEMP pvvn_by_pn
DROP TABLE pv_by_vn

This query reads pv_by_vn sequentially (50 pages) and accesses the vendor table 15,000 times through its index; but because the input arrives in vendor_num order, vendor is in effect read in index order (40 + 2 = 42 pages). The output table holds about 95 rows per page, 160 pages in all; writing and re-accessing these pages costs 5 × 160 = 800 page reads and writes. The step totals 892 pages of reads and writes.

3. Join the output with the part table to obtain the final result:

SELECT pvvn_by_pn.*, part.part_desc
FROM pvvn_by_pn, part
WHERE pvvn_by_pn.part_num = part.part_num
DROP TABLE pvvn_by_pn

This step reads pvvn_by_pn sequentially (160 pages) and reads the part table 15,000 times through its index; with the index in place, 1,772 disk reads and writes are actually performed. The three steps together cost 300 + 892 + 1,772 = 2,964 disk page accesses, versus 50,400 for the original plan: roughly a 17:1 improvement. I ran the same experiment on Informix Dynamic Server and observed a comparable ratio in elapsed time (the larger the data volume, the larger the ratio).
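The page arithmetic in this analysis can be checked directly. The sketch below re-derives the unoptimized total from the access counts given in the text and sums the quoted per-step totals of the three-step plan:

```python
# Unoptimized single-pass plan (page accesses, per the analysis above):
part_scan = 400                # sequential read of the part table
parven_probes = 10_000 * 2     # non-sequential probes: index page + data page
vendor_probes = 15_000 * 2     # non-sequential probes into vendor
unoptimized = part_scan + parven_probes + vendor_probes

# Three-step temporary-table plan, using the per-step totals from the text:
step1, step2, step3 = 300, 892, 1772
optimized = step1 + step2 + step3

print(unoptimized)                     # 50400
print(optimized)                       # 2964
print(round(unoptimized / optimized))  # 17
```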

Conclusion

Twenty percent of the code takes eighty percent of the time. This well-known law of programming also holds in database applications. Our optimization focuses on the execution efficiency of SQL in database applications. The key to query optimization is to make the database server read less data from disk, and to read pages sequentially rather than non-sequentially.
