Massive Data Query Optimization

Source: Internet
Author: User
Tags: informix

I went to Microsoft for an interview this afternoon and was asked about optimizing queries over massive data sets. Because the applications I have developed handle only small amounts of data, I have paid little attention to performance optimization and was not sure how to answer. I later searched online and found two articles on massive data query optimization, reproduced below.

Methods of Optimizing Database Query Plans

Database systems are the core of management information systems, and database-based online transaction processing (OLTP) and online analytical processing (OLAP) are among the most important computer applications in banks, enterprises, government departments, and other organizations. Looking at most deployed systems, query operations account for the largest proportion of all database operations, and the SELECT statement on which queries are based is the most costly of the SQL statements. For example, once the account tables of a bank accumulate millions or even tens of millions of records, a full table scan usually takes tens of minutes, or even hours. A query strategy better than a full table scan can often reduce the query time to a few minutes, which shows the importance of query optimization technology.
During the implementation of application projects, we found that many programmers, when developing database applications with front-end tools (such as PowerBuilder and Delphi), pay attention only to a polished user interface and neglect the efficiency of their query statements, resulting in application systems that run inefficiently and waste resources badly. Designing efficient, reasonable query statements is therefore very important. Based on application examples and database theory, this article introduces how query optimization techniques apply in real systems.
Analyzing the problem
Many programmers think that query optimization is the job of the DBMS (database management system) and has nothing to do with the SQL statements they write. This is wrong. A good query plan can often improve program performance by dozens of times. A query plan is the set of statements the DBMS generates, after optimization, from the SQL a user submits. The DBMS processes a query as follows: after lexical and syntax checks of the query statement are completed, the statement is submitted to the DBMS's query optimizer; after the optimizer completes algebraic optimization and access-path optimization, the precompilation module processes the statement and generates the query plan, which is then submitted to the system for execution at the appropriate time; finally, the execution result is returned to the user. High versions of actual database products (such as Oracle and Sybase) adopt cost-based optimization, which estimates the cost of different query plans from information in the system dictionary tables and then selects the cheaper plan. Although database products keep getting better at query optimization, the SQL statements submitted by users are the basis on which the system optimizes, and it is hard to imagine a bad query plan becoming efficient after system optimization, so the quality of the statements you write is crucial. The following describes how to improve query plans.
Solving the problem
The following uses Informix as an example to describe how to improve the user query plan.
1. Use indexes reasonably
An index is an important data structure in a database. Its fundamental purpose is to improve query efficiency. Most current database products adopt the ISAM index structure first proposed by IBM. Indexes should be used properly, following these principles:
● Create indexes on columns that are frequently joined but are not specified as foreign keys; for fields that are not frequently joined, the optimizer generates indexes automatically.
● Index columns that are frequently sorted or grouped (that is, used in group by or order by operations).
● Create indexes on columns with many distinct values that are frequently used in conditional expressions; do not create indexes on columns with few distinct values. For example, the "gender" column of an employee table has only two values, "male" and "female", so an index would not help: it would not improve query efficiency, but it would greatly reduce update speed.
● If there are multiple columns to be sorted, you can create a compound index on these columns (see the sketch after this list).
● Use system tools. For example, the Informix database has a tbcheck tool that can check suspicious indexes. On some database servers, frequent operations can invalidate an index or reduce its read efficiency. If an index-based query slows down, you can use the tbcheck tool to check the index's integrity and repair it if necessary. In addition, after a table has had a large amount of data updated, dropping and re-creating its indexes can increase query speed.
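As a minimal sketch of these principles (the employee table and its column names here are hypothetical, for illustration only):

-- Index a column that is frequently joined but is not a declared foreign key
CREATE INDEX idx_emp_dept ON employee (dept_num)
-- Compound index for columns that are often sorted or grouped together
CREATE INDEX idx_emp_dept_hire ON employee (dept_num, hire_date)
-- A two-valued column such as gender is not worth indexing, per the principle above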
2. Avoid or simplify sorting
Repeated sorting of large tables should be simplified or avoided. The optimizer skips the sorting step when an index can deliver the output in the appropriate order automatically. The following factors prevent this:
● The index does not contain one or more columns to be sorted;
● The order of columns in the group by or order by clause is different from that of the index;
● Sort columns from different tables.
To avoid unnecessary sorting, we need to build indexes correctly and merge database tables reasonably (although this may sometimes affect table normalization, it is worthwhile for the efficiency gained). If sorting is unavoidable, try to simplify it, for example by narrowing the range of columns being sorted.
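For instance, a minimal sketch (hypothetical table t with columns a and b): if a compound index matches the order by clause, the optimizer can return rows in index order and skip the sort entirely:

CREATE INDEX idx_t_a_b ON t (a, b)
SELECT a, b FROM t WHERE a > 100 ORDER BY a, b
-- rows can come back in index order; no sort step is needed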
3. Eliminate sequential access to large table rows
In nested queries, sequential access to a table can have a fatal impact on query efficiency. For example, with a sequential-access strategy, a query nested three levels deep that scans 1,000 rows at each level scans one billion rows in total. The primary way to avoid this is to index the joined columns. For example, given two tables, a student table (student ID, name, age, ...) and a course-selection table (student ID, course number, score), joining the two tables requires an index on the join field "student ID".
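A minimal sketch of this advice, using hypothetical English names for the two tables and their columns:

CREATE INDEX idx_student_id ON student (student_id)
CREATE INDEX idx_cs_student_id ON course_selection (student_id)
SELECT student.name, course_selection.score
FROM student, course_selection
WHERE student.student_id = course_selection.student_id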
A union can also be used to avoid sequential access. Although indexes exist on all the checked columns, some forms of where clause force the optimizer to use sequential access. The following query forces a sequential scan of the orders table:
SELECT * FROM orders WHERE (customer_num = 104 AND order_num > 1001) OR order_num = 1008
Although indexes exist on customer_num and order_num, the optimizer still scans the entire table by the sequential access path in the statement above. Because the statement retrieves a union of separate sets of rows, it should be changed to:
SELECT * FROM orders WHERE customer_num = 104 AND order_num > 1001
UNION
SELECT * FROM orders WHERE order_num = 1008
In this way, you can use the index path to process queries.
4. Avoid correlated subqueries
In a correlated subquery, a column label appears both in the main query and in the where clause of the subquery; whenever the column value in the main query changes, the subquery very likely must run again. The more levels of nesting, the lower the efficiency. Therefore, avoid subqueries as much as possible. If a subquery is unavoidable, filter out as many rows as possible inside the subquery.
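As a hedged sketch in generic SQL (hypothetical customer and orders tables), a correlated subquery that re-executes for every outer row can often be rewritten as a join with grouping, which is evaluated in a single pass:

-- Correlated: the subquery runs once per customer row
SELECT customer_num FROM customer c
WHERE (SELECT SUM(total) FROM orders o WHERE o.customer_num = c.customer_num) > 10000
-- Join with grouping: evaluated once
SELECT c.customer_num
FROM customer c, orders o
WHERE c.customer_num = o.customer_num
GROUP BY c.customer_num
HAVING SUM(o.total) > 10000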
5. Avoid difficult regular expressions
The MATCHES and LIKE keywords support wildcard matching, technically called regular expressions, but this kind of matching is especially time-consuming. Example: SELECT * FROM customer WHERE zipcode LIKE "98___"
Even if an index is created on the zipcode field, a sequential scan is used in this case. If you change the statement to SELECT * FROM customer WHERE zipcode > "98000", the query uses the index, which obviously speeds it up.
In addition, avoid substrings that do not start at the beginning of the field. For example, the clause SELECT * FROM customer WHERE zipcode[2,3] > "80" uses a non-initial substring, so the statement cannot use an index.
6. Use temporary tables to accelerate queries
Sorting a subset of a table into a temporary table sometimes accelerates queries. It helps avoid multiple sorting operations and in other ways simplifies the optimizer's work. For example:
SELECT cust.name, rcvbles.balance, ...... (other columns)
FROM cust, rcvbles
WHERE cust.customer_id = rcvbles.customer_id
AND rcvbles.balance > 0
AND cust.postcode > "98000"
ORDER BY cust.name
If this query is to be executed more than once, you can find all the customers with unpaid balances in a temporary file, sorted by customer name:
SELECT cust.name, rcvbles.balance, ...... (other columns)
FROM cust, rcvbles
WHERE cust.customer_id = rcvbles.customer_id
AND rcvbles.balance > 0
ORDER BY cust.name
INTO TEMP cust_with_balance
Then, query the temporary table in the following way:
SELECT * FROM cust_with_balance
WHERE postcode > "98000"
The temporary table has fewer rows than the main table, and its physical order is the required order, which reduces disk I/O; the query workload can therefore be greatly reduced.
Note: a temporary table does not reflect modifications made to the main table after it is created. When data in the main table is frequently modified, be careful not to lose data.
7. Use sorting to replace non-sequential access
Non-sequential disk access is the slowest operation, manifested in the back-and-forth movement of the disk access arm. SQL statements hide this situation, making it easy, when writing an application, to write queries that access a large number of non-sequential pages.
In some cases, the database's sorting capability can be used to replace non-sequential access and so improve the query.


Instance analysis

The following example from a manufacturing company illustrates how to optimize a query. The manufacturing company's database contains three tables, with the following schemas:
1. part table
Part number    Part description          Other columns
(part_num)     (part_desc)               (other columns)
102,032        Seageat 30G disk          ......
500,049        Novel 10M network card    ......
......
2. vendor table
Vendor number    Vendor name     Other columns
(vendor_num)     (vendor_name)   (other columns)
910,257          Seageat Corp    ......
523,045          IBM Corp        ......
......
3. parven table
Part number    Vendor number    Part quantity
(part_num)     (vendor_num)     (part_amount)
102,032        910,257          3,450,000
234,423        321,001          4,000,000
......
The following query runs regularly on these tables to generate a report on the quantities of all parts:
SELECT part_desc, vendor_name, part_amount
FROM part, vendor, parven
WHERE part.part_num = parven.part_num
AND parven.vendor_num = vendor.vendor_num
ORDER BY part.part_num
Without indexes, the overhead of the query above would be huge, so we create indexes on the part number and the vendor number. Creating the indexes avoids repeated scanning in the nesting. Statistics on the tables and indexes are as follows:
Table     Row size    Row count    Rows/page    Data pages
part      150         10,000       25           400
vendor    150         1,000        25           40
parven    13          15,000       300          50
Index     Key size    Keys/page    Leaf pages
part      4           500          20
vendor    4           500          2
parven    8           250          60
This looks like a relatively simple three-table join, but its query overhead is very high. The system tables show that there are clustered indexes on part_num and vendor_num, so those tables are stored in index order; the parven table has no particular storage order. Given the sizes of these tables, the hit rate for non-sequential access to buffered pages is very small. The optimal query plan for this statement is: read the 400 pages of part sequentially, then access parven non-sequentially 10,000 times, 2 pages each time (one index page plus one data page), for 20,000 disk pages, plus 15,000 non-sequential accesses to the vendor table at 2 pages each, another 30,000 disk pages. The total disk-access cost of this indexed plan is therefore 400 + 20,000 + 30,000 = 50,400 pages.
In fact, we can improve the query efficiency by using a temporary table in three steps:
1. Read data from the parven table in vendor_num order:

SELECT part_num, vendor_num, part_amount
FROM parven
ORDER BY vendor_num
INTO TEMP pv_by_vn
This statement reads parven sequentially (50 pages), writes a temporary table (50 pages), and sorts it. Assuming a sorting overhead of 200 pages, this step totals 300 pages.
2. Join the temporary table to the vendor table, output the result to a temporary table, and sort by part_num:
SELECT pv_by_vn.*, vendor.vendor_name
FROM pv_by_vn, vendor
WHERE pv_by_vn.vendor_num = vendor.vendor_num
ORDER BY pv_by_vn.part_num
INTO TEMP pvvn_by_pn
DROP TABLE pv_by_vn
This query reads pv_by_vn (50 pages) and accesses the vendor table 15,000 times through the index; but because the rows arrive in vendor_num order, the vendor table is in effect read in index order (40 + 2 = 42 pages). The output table holds about 95 rows per page, 160 pages in total. Writing and accessing these pages incurs 5 × 160 = 800 reads and writes, so this step reads and writes 50 + 42 + 800 = 892 pages in all.
3. Join the output with the part table to get the final result:
SELECT pvvn_by_pn.*, part.part_desc
FROM pvvn_by_pn, part
WHERE pvvn_by_pn.part_num = part.part_num
DROP TABLE pvvn_by_pn
This query reads pvvn_by_pn sequentially (160 pages) and reads the part table 15,000 times through its index; thanks to the index, this step actually performs only 1,772 disk reads and writes. By the counts above, the optimized plan totals 300 + 892 + 1,772 = 2,964 page accesses against 50,400 for the original, roughly a 17:1 improvement. I ran the same experiment on Informix Dynamic Server and found a comparable improvement in elapsed time (and the ratio grows as the data volume increases).

Conclusion

That 20% of the code consumes 80% of the time is a well-known law in programming, and it holds in database applications as well. For database applications, our optimization focuses on SQL execution efficiency. The key to query optimization is to make the database server read less data from disk, and to read pages sequentially rather than non-sequentially.

30 tips for optimizing queries over millions of rows

1. To optimize a query, avoid full table scans as much as possible; first consider creating indexes on the columns involved in where and order by.

2. Try to avoid testing fields for null in the where clause; otherwise the engine will abandon the index and perform a full table scan, for example:
Select id from t where num is null
You can set a default value of 0 on num, ensure the num column never contains null, and then query like this:
Select id from t where num = 0

3. Try to avoid using the != or <> operator in the where clause; otherwise the engine will abandon the index and perform a full table scan.
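Where the value range is known, one hedged rewrite (hypothetical table t) replaces the inequality with two range conditions that can each use an index, echoing the union pattern of the next tip:
Select id from t where num <> 10
can be rewritten as:
Select id from t where num > 10
Union all
Select id from t where num < 10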

4. Try to avoid using or to connect conditions in the where clause; otherwise the engine will abandon the index and perform a full table scan, for example:
Select id from t where num = 10 or num = 20
You can query it as follows:
Select id from t where num = 10
Union all
Select id from t where num = 20

5. Use in and not in with caution. Otherwise, a full table scan may occur, for example:
Select id from t where num in (1, 2, 3)
For continuous values, you can use between instead of in:
Select id from t where num between 1 and 3

6. The following query will also cause a full table scan:
Select id from t where name like '% abc %'
To improve efficiency, you can consider full-text search.

7. Using a parameter in the where clause also causes a full table scan, because SQL resolves local variables only at run time, while the optimizer cannot defer the choice of access plan to run time; it must choose at compile time. But if the access plan is built at compile time, the variable's value is still unknown and cannot be used as an input for index selection. The following statement performs a full table scan:
Select id from t where num = @num
You can force the query to use an index instead:
Select id from t with (index(index_name)) where num = @num

8. Avoid performing expression operations on fields in the where clause as much as possible; this causes the engine to abandon the index and perform a full table scan. For example:
Select id from t where num / 2 = 100
Should be changed to:
Select id from t where num = 100 * 2

9. Avoid performing function operations on fields in the where clause as much as possible; this causes the engine to abandon the index and perform a full table scan. For example:
Select id from t where substring(name, 1, 3) = 'abc' -- ids whose name starts with abc
Select id from t where datediff(day, createdate, '2017-11-30') = 0 -- ids created on '2017-11-30'
Should be changed to:
Select id from t where name like 'abc%'
Select id from t where createdate >= '2017-11-30' and createdate < '2017-12-01'

10. Do not perform functions, arithmetic operations, or other expression operations on the left side of "=" in the where clause; otherwise the system may not be able to use the index correctly.

11. When using an indexed field as a condition, if the index is a composite index, the first field in the index must appear in the condition for the system to use the index; otherwise the index is not used. The field order should also match the index order as much as possible, as sketched below.
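A brief sketch of the leading-column rule (hypothetical composite index on t):
Create index idx_t_c1_c2 on t (col1, col2)
Select id from t where col1 = 5 and col2 = 9 -- constrains col1, can use the index
Select id from t where col2 = 9 -- does not constrain col1, the index is not used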

12. Do not write meaningless queries. For example, to generate an empty table structure:
Select col1, col2 into #T from t where 1 = 0
This kind of code returns no result set but still consumes system resources. It should be changed to:
Create table #T (...)

13. In many cases, replacing in with exists is a good choice. Replace this statement:
Select num from a where num in (select num from b)
with the following:
Select num from a where exists (select 1 from b where num = a.num)

14. Not all indexes are effective for all queries. SQL optimizes queries based on the data in the table; when an indexed column contains a large amount of duplicate data, the query may not use the index at all. For example, if a table's sex field is almost half "male" and half "female", indexing sex does nothing for query efficiency.

15. More indexes are not always better. An index can improve the efficiency of selects, but it also reduces the efficiency of inserts and updates, because the insert or update may need to rebuild the index; so how to build indexes requires careful consideration, depending on the actual situation. It is best to keep the number of indexes on a table under six; if there are more, consider whether indexes on rarely used columns are really necessary.

16. Avoid updating clustered-index data columns as much as possible, because the order of the clustered-index columns is the physical storage order of the table's records; once a column value changes, the order of the whole table's records must be adjusted, which consumes considerable resources. If the application needs to update clustered-index columns frequently, reconsider whether the index should be clustered.

17. Use numeric fields wherever possible. If a field that contains only numeric information is not designed as a numeric type, query and join performance suffers and storage overhead increases, because the engine compares strings character by character during queries and joins, whereas a numeric type needs only a single comparison.

18. Use varchar/nvarchar instead of char/nchar wherever possible. First, variable-length fields take less storage, which saves space; second, for queries, searching within a smaller field is obviously more efficient.

19. Do not use select * from t anywhere; replace "*" with a specific list of fields, and do not return any field you do not use.

20. Try to use table variables instead of temporary tables. If a table variable would hold a large amount of data, note that its indexes are very limited (only a primary key index is available); see the sketch below.
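A minimal T-SQL sketch of the substitution (table and column names hypothetical):
Declare @t table (id int primary key, num int)
Insert into @t select id, num from t where num > 0
Select id from @t where num = 10 -- only the primary key on id can serve as an index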

21. Avoid frequent creation and deletion of temporary tables to reduce the consumption of system table resources.

22. Temporary tables are not unusable; using them appropriately can make some routines more effective, for example when you need to repeatedly reference a large table or a data set from a commonly used table. However, for one-off events, it is better to use an export table.

23. When creating a temporary table, if a large amount of data is inserted at once, you can use select into instead of create table to avoid generating a large amount of log and to increase speed; if the data volume is small, then to ease the load on system table resources, create table first and then insert. Both cases are sketched below.
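A sketch of both cases (hypothetical #T):
-- Large volume: select into creates and loads #T in one step, with less logging
Select id, num into #T from t where num > 0
-- Small volume: create first, then insert, to ease system table resources
Create table #T (id int, num int)
Insert into #T select id, num from t where num > 0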

24. If temporary tables are used, be sure to explicitly delete all of them at the end of the stored procedure: truncate table first and then drop table, to avoid locking the system tables for a long time.
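The cleanup order at the end of the procedure, per this tip:
Truncate table #T
Drop table #T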

25. Avoid using cursors whenever possible, because cursors are inefficient. If a cursor operates on more than 10,000 rows of data, consider rewriting it.

26. Before resorting to a cursor-based or temporary-table method, look first for a set-based solution to the problem; set-based methods are usually more effective.

27. Like temporary tables, cursors are not unusable. Using a FAST_FORWARD cursor on a small data set is often better than other row-by-row processing methods, especially when several tables must be referenced to obtain the required data. Routines that include totals in the result set are usually faster than using a cursor. If development time permits, try both the cursor-based and the set-based method and see which works better.

28. Set nocount on at the beginning of all stored procedures and triggers, and set nocount off at the end. There is no need to send a DONE_IN_PROC message to the client after each statement of a stored procedure or trigger.
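A sketch of the pattern (the procedure name and query are hypothetical):
Create procedure usp_report
As
Set nocount on
Select id from t where num = 10
Set nocount off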

29. Avoid large transaction operations as much as possible to improve system concurrency.

30. Avoid returning large amounts of data to the client as much as possible; if the data volume is too large, consider whether the requirement itself is reasonable.

