SQL Massive Data Optimization)

Last Update:2018-12-05 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Note:
1. Try to avoid null value determination on the field in the where clause. Otherwise, the engine will discard the index and perform full table scanning, for example:
Select id from t where num is null
You can set the default value 0 on num to make sure that the num column in the table does not have a null value, and then query it like this:
Select id from t where num = 0
2. Try to avoid using it in the where clause! = Or <> operator. Otherwise, the engine will discard the index for full table scanning. The optimizer cannot use indexes to determine the number of rows to be hit. Therefore, you need to search all rows in the table.
3. Try to avoid using or in the where clause to connect to the condition. Otherwise, the engine will discard the index and perform full table scanning, for example:
Select id from t where num = 10 or num = 20
You can query it as follows:
Select id from t where num = 10
Union all
Select id from t where num = 20
4. in and not in should also be used with caution, because IN will make the system unable to use the index, but can only directly search the data in the table. For example:
Select id from t where num in (1, 2, 3)
For continuous values, you can use between instead of in:
Select id from t where num between 1 and 3
5. Try to avoid using non-start letters to search for indexed character data. This also makes the engine unable to use the index.
See the following example:
SELECT * FROM T1 where name like '% L %'
SELECT * FROM T1 where substing (NAME, 2, 1) = 'l'
SELECT * FROM T1 where name like 'l %'
Even if the NAME field has an index, the first two queries still cannot use the index to accelerate the operation. The engine has to perform operations on all data in the table one by one to complete the task. The third query can use indexes to speed up operations.
6. If necessary, force the query optimizer to use an index. If a parameter is used in the where clause, full table scan may occur. Because SQL parses local variables only at runtime, the optimizer cannot postpone the selection of the access plan to runtime; it must be selected at compilation. However, if an access plan is created during compilation, the value of the variable is still unknown and thus cannot be used as an input for index selection. The following statement performs a full table scan:
Select id from t whereNum = @ num
You can change it to force query to use the index:
Select id from t with (index name) whereNum = @ num
7. Avoid performing expression operations on fields in the where clause as much as possible, which will cause the engine to discard the use of indexes for full table scanning. For example:
SELECT * FROM T1 WHERE F1/2 = 100
Should be changed:
SELECT * FROM T1 WHERE F1 = 100*2
SELECT * from record where substring (CARD_NO, 5378) = '20140901'
Should be changed:
SELECT * from record where CARD_NO LIKE '201312'
SELECT member_number, first_name, last_name FROM members
Where datediff (yy, datofbirth, GETDATE ()> 21
Should be changed:
SELECT member_number, first_name, last_name FROM members
WHERE dateofbirth <DATEADD (yy,-21, GETDATE ())
That is, any operation on the column will cause the table to scan, including database functions, calculation expressions, etc. During the query, try to move the operation to the right of the equal sign.
8. Avoid performing function operations on fields in the where clause as much as possible. This will cause the engine to stop using the index for full table scanning. For example:
Select id from t where substring (name, 1, 3) = 'abc' -- id whose name starts with abc
Select id from t where datediff (day, createdate, '2017-11-30 ') = 0 -- '2017-11-30' generated id
Should be changed:
Select id from t where name like 'abc %'
Select id from t where createdate> = '2014-11-30 'and createdate <'2014-12-1'
9. Do not perform functions, arithmetic operations, or other expression operations on the left side of "=" in the where clause. Otherwise, the system may not be able to correctly use the index.
10. when using an index field as a condition, if the index is a composite index, you must use the first field in the index as the condition to ensure that the system uses the index, otherwise, the index will not be used, and the field order should be consistent with the index order as much as possible.
11. Using exists is often a good choice:
Select num from a where num in (select num from B)
Replace the following statement:
Select num from a where exists (select 1 from B where num = a. num)
Select sum (T1.C1) FROM T1 WHERE (
(Select count (*) FROM T2 WHERE T2.C2 = T1.C2> 0)
Select sum (T1.C1) FROM T1WHERE EXISTS (
SELECT * FROM T2 WHERE T2.C2 = T1.C2)
The two produce the same results, but the latter is obviously more efficient than the former. Because the latter will not generate a large number of locked table scans or index scans.
If you want to check whether a record exists in the table, do not use count (*) as inefficient and waste server resources. It can be replaced by EXISTS. For example:
IF (select count (*) FROM table_name WHERE column_name = 'xxx ')
Can be written:
If exists (SELECT * FROM table_name WHERE column_name = 'xxx ')
You often need to write a T_ SQL statement to compare a parent result set and a child result set, so as to find the records that exist in the parent result set but not in the Child result set, such:
SELECT a. hdr_key FROM hdr_tbl a ---- tbl a indicates that tbl is replaced by alias.
Where not exists (SELECT * FROM dtl_tbl B WHERE a. hdr_key = B. hdr_key)
SELECT a. hdr_key FROM hdr_tbl
Left join dtl_tbl B ON a. hdr_key = B. hdr_key WHERE B. hdr_key IS NULL
SELECT hdr_key FROM hdr_tbl
WHERE hdr_key not in (SELECT hdr_key FROM dtl_tbl)
The three writing methods can get the same correct results, but the efficiency is reduced in turn.
12. Try to use table variables instead of temporary tables. If the table variable contains a large amount of data, note that the index is very limited (only the primary key index ).
13. Avoid frequent creation and deletion of temporary tables to reduce the consumption of system table resources.
14. Temporary tables are not unavailable. Using them appropriately can make some routines more effective. For example, when you need to reference large tables or a data set in common tables repeatedly. However, it is best to use the export table for one-time events.
15. when creating a temporary table, if a large amount of data is inserted at one time, you can use select into instead of create table to avoid creating a large number of logs to increase the speed. If the data volume is small, to ease system table resources, create table first and then insert.
16. if a temporary table is used, you must explicitly delete all temporary tables at the end of the stored procedure. First truncate the table and then drop the table, so that the system table can be locked for a long time.
17. set nocount on at the beginning of all stored procedures and triggers, and set nocount off at the end. You do not need to send the DONE_IN_PROC message to the client after executing each statement of the stored procedure and trigger.
18. Avoid large transaction operations as much as possible to improve the system concurrency capability.
19. Avoid returning a large amount of data to the client as much as possible. If the data volume is too large, consider whether the corresponding requirements are reasonable.
20. Avoid using incompatible data types. For example, float and int, char and varchar, binary, and varbinary are incompatible. Data Type incompatibility may make the optimizer unable to perform some optimization operations that can be performed originally. For example:
SELECT name FROM employee WHERE salary> 60000
In this statement, if the salary field is of the money type, it is difficult for the optimizer to optimize it because 60000 is an integer. We should convert the integer type into a coin type during programming, instead of waiting for the conversion at runtime.
21. Make full use of the connection conditions. In some cases, there may be more than one connection condition between two tables. In this case, write the complete connection conditions in the WHERE clause, which may greatly improve the query speed.
Example:
Select sum (A. AMOUNT) from account a, card B where a. CARD_NO = B. CARD_NO
Select sum (A. AMOUNT) from account a, card B where a. CARD_NO = B. CARD_NO and a. ACCOUNT_NO = B. ACCOUNT_NO
The second statement is much faster than the first statement.
22. Use view to accelerate query
Sort a subset of a table and create a view, which sometimes accelerates query. It helps avoid multiple sorting operations and simplifies the optimizer's work in other aspects. For example:
SELECT cust. name, rcvbles. balance ,...... Other columns
FROM cust, rcvbles
WHERE cust. customer_id = rcvlbes. customer_id
AND rcvblls. balance> 0
AND cust. postcode> 98000"
Order by cust. name
If this query is executed multiple times but more than once, you can find all the unpaid customers in a view and sort them by customer name:
Create view dbo. V_CUST_RCVLBES
AS
SELECT cust. name, rcvbles. balance ,...... Other columns
FROM cust, rcvbles
WHERE cust. customer_id = rcvlbes. customer_id
AND rcvblls. balance> 0
Order by cust. name
Then, query in the view in the following way:
SELECT * FROM V_CUST_RCVLBES
WHERE postcode> 98000"
The number of rows in the view is smaller than that in the primary table, and the physical order is the required order, which reduces disk I/O, so the query workload can be greatly reduced.
23. If you can use DISTINCT, you do not need group.
SELECT OrderID FROM Details WHERE UnitPrice> 10 group by OrderID
You can change it:
Select distinct OrderID FROM Details WHERE UnitPrice> 10
24. Do not use UNION if union all is used.
Union all does not execute the select distinct function, which reduces unnecessary resources.
35. Try not to use the select into statement.
The select inot statement will lock the table and prevent other users from accessing the table.
The above mentioned are some basic notes for improving the query speed. However, in more cases, we often need to experiment with different statements repeatedly to obtain the best solution. The best way is to test the SQL statement that implements the same function, which has the least execution time. However, if the data volume in the database is small, it cannot be compared. In this case, you can view the execution plan, that is, you can obtain multiple SQL statements that implement the same function to the query analyzer, press CTRL + L to view the indexes used for query, and the number of table scans (the two have the greatest impact on performance ), in general, check the cost percentage.

I read an article on itput today to discuss the optimization of a statement:
Original post address:Http://www.itpub.net/viewthread.php? Tid = 1015964 & extra = & page = 1
I. Statements for Problem Optimization:
I would like to explain how to optimize the following statements:
Create table aa_001
(Ip VARCHAR2 (28 ),
Name VARCHAR2 (10 ),
Password VARCHAR2 (30 ))
Select * from aa_001 where ip in (1, 2, 3) order by name desc;
-- Currently, there are about more than 10 million records in the table, and the number of values in is uncertain.
The above are the statements and conditions to be optimized.
Many people follow the post: Some say there is no way to optimize, some say that IN should be EXISTS, some say that index composite indexes (ip, name) on ip address, and so on.
2. Ask the question. can the problem be optimized? How can it be optimized? We will discuss this issue today.
3. Problem 1 is analyzed, with more than 10 million data records.
2. The number of values in is unknown.
3.1 Data Distribution Analysis the author did not mention the data distribution in the ip column. Currently, the data distribution in the ip column may be as follows:
1. ip column (data is unique, or the probability of data duplication is very small)
2. ip column)
3. ip column (data distribution is relatively uniform and a large amount of data is duplicated, mainly the same data (may only have tens of thousands of different ip data levels)
Solution:
1. For the first data distribution, you only need to create an index in the ip column. In this case, no matter how many rows there are in the table, the number of in statements is very fast.
2. corresponding to the data distribution in the second table, it is ineffective to create an index in the ip column. Data distribution is uneven, which may be fast or slow.
3. corresponding to the third type of data distribution, it is certainly slow to create an index in the ip column.
Note: Here, order by name desc is sorted after data is retrieved. Instead of sorting data before Data Extraction
For two or three cases, because a large amount of data may need to be retrieved, the optimizer uses table scan instead of index search, which is slow, in this case, the table scan efficiency is superior to index search, especially in the case of high concurrency.
How to deal with the problem. Is to change in to exists. In fact, the efficiency of the optimizer in SQL server 2005 and oracle is the same when there is little data behind in. At this time, the general index efficiency is very low. At this time, if you create a clustered index on the ip column, the efficiency will be relatively high. We will perform a test in SQL server 2005.
Table: about 2 million data records in [dbo]. [[zping.com. Contains Userid, id, Ruleid, and other columns. Query similar statements based on the above conditions:
Select * from [dbo]. [[zping.com] where
Userid in ('402881410ca47925010cb329c7670ffb', '402881ba0d5dc94e010d5dced05a0008'
, '4028814111a735e90111a77fa8e30384 ') order by Ruleid desc
Let's first look at the distribution of userid data and execute the following statement:
Select userid, count (*) from [dbo]. [[zping.com] group by userid order by 2
Now let's look at the data distribution: there are a total of 379 pieces of data, both from 1 to 0.15 million, and the data distribution skew is serious. Is a part of it.

At this time, if a non-clustered index is created on the ip address, the efficiency is very low, and the index scan is forced, the efficiency is also very low, it will find that the number of I/O times is higher than the table scan. In this case, only clustered indexes can be created on the ip address. Now let's look at the results.
At this time, we found that the search uses a clustered search scan (clustered index seek.
Check the returned results of the query:
(Row 3 is affected)
Table '[zping.com]'. Scan count 8, logical reads 5877, physical reads 0, pre-reads 0, lob logic reads 0, lob physical reads 0, and lob pre-reads 0.
Table 'worktable '. Scan count 0, logical read 0, physical read 0, pre-read 0, lob logical read 0, lob physical read 0, lob pre-read 0.
If the return value is 0.15 million rows, less than 6 thousand I/O operations are performed. High efficiency, because these 0.15 million rows need to be sorted, and 51% of the query cost is sorted. Of course, you can create a (userid, Ruleid) composite clustered index to improve performance, but the DML maintenance cost is high. It is not recommended.
From the test example above, we can see that the optimization solution is as follows:
Data Distribution is 1: Create an ip index.
Data Distribution is 2, 3: Create a clustered index in the ip column.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

SQL Massive Data Optimization)

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

SQL Massive Data Optimization)

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support