Subquery Optimization
A good rule of thumb is to try to replace subqueries with joins. The optimizer can sometimes "flatten" subqueries automatically, replacing them with regular or outer joins, but it cannot always do so. Explicit joins give the optimizer more options for choosing the table access order and finding the best plan. When you tune a specific query, it is well worth finding out whether removing a subquery makes a difference.
Example
The following queries return the name of each store in the pubs database together with the total number of books it has sold. If a store has no sales, its name is still listed and the Books Sold column shows 0. The two queries return the same result set, but the first uses a subquery and the second uses a join. Compare the query plans generated by Microsoft SQL Server.
SUBQUERY SOLUTION
----------------------
SELECT st.stor_name AS 'Store',
       ISNULL((SELECT SUM(bs.qty)
               FROM big_sales AS bs
               WHERE bs.stor_id = st.stor_id), 0)
       AS 'Books Sold'
FROM stores AS st
WHERE st.stor_id IN
      (SELECT DISTINCT stor_id
       FROM big_sales)
JOIN SOLUTION
----------------------
SELECT st.stor_name AS 'Store',
       SUM(bs.qty) AS 'Books Sold'
FROM stores AS st
JOIN big_sales AS bs
  ON bs.stor_id = st.stor_id
WHERE st.stor_id IN
      (SELECT DISTINCT stor_id
       FROM big_sales)
GROUP BY st.stor_name
SUBQUERY SOLUTION
----------------------
SQL Server parse and compile time:
   CPU time = 28 ms, elapsed time = 28 ms.
SQL Server Execution Times:
   CPU time = 145 ms, elapsed time = 145 ms.
Table 'big_sales'. Scan count 14, logical reads 1884,
physical reads 0, read-ahead reads 0.
Table 'stores'. Scan count 12, logical reads 24,
physical reads 0, read-ahead reads 0.

JOIN SOLUTION
----------------------
SQL Server parse and compile time:
   CPU time = 50 ms, elapsed time = 54 ms.
SQL Server Execution Times:
   CPU time = 109 ms, elapsed time = 109 ms.
Table 'big_sales'. Scan count 14, logical reads 966,
physical reads 0, read-ahead reads 0.
Table 'stores'. Scan count 12, logical reads 24,
physical reads 0, read-ahead reads 0.
Even without digging deeper, we can see that the join is faster in both CPU time and total elapsed time, and it needs only about half the logical reads of the subquery solution. Both queries return the same result set, although the sort order differs: the join query has an implicit order because of its GROUP BY clause.
Store                                      Books Sold
-----------------------------------------------------
Barnum's                                       154125
Bookbeat                                       518080
Doc-U-Mat: Quality Laundry and Books           581130
Eric the Read Books                             76931
Fricative Bookshop                             259060
News & Brews                                   161090

(6 row(s) affected)

Store                                      Books Sold
-----------------------------------------------------
Eric the Read Books                             76931
Barnum's                                       154125
News & Brews                                   161090
Doc-U-Mat: Quality Laundry and Books           581130
Fricative Bookshop                             259060
Bookbeat                                       518080

(6 row(s) affected)
Here is the query plan for the subquery solution:
|--Compute Scalar (DEFINE:([Expr1006]=isnull([Expr1004], 0)))
     |--Nested Loops (Left Outer Join, OUTER REFERENCES:([st].[stor_id]))
          |--Nested Loops (Inner Join, OUTER REFERENCES:([big_sales].[stor_id]))
          |    |--Stream Aggregate (GROUP BY:([big_sales].[stor_id]))
          |    |    |--Clustered Index Scan (OBJECT:([pubs].[dbo].[big_sales].
          |    |         [UPKCL_big_sales]), ORDERED FORWARD)
          |    |--Clustered Index Seek (OBJECT:([pubs].[dbo].[stores].[UPK_storeid] AS [st]),
          |         SEEK:([st].[stor_id]=[big_sales].[stor_id]) ORDERED FORWARD)
          |--Stream Aggregate (DEFINE:([Expr1004]=SUM([bs].[qty])))
               |--Clustered Index Seek (OBJECT:([pubs].[dbo].[big_sales].
                    [UPKCL_big_sales] AS [bs]),
                    SEEK:([bs].[stor_id]=[st].[stor_id]) ORDERED FORWARD)
The join solution, on the other hand, produces the following plan:
|--Stream Aggregate (GROUP BY:([st].[stor_name])
     DEFINE:([Expr1004]=SUM([partialagg1005])))
     |--Sort (ORDER BY:([st].[stor_name] ASC))
          |--Nested Loops (Left Semi Join, OUTER REFERENCES:([st].[stor_id]))
               |--Nested Loops (Inner Join, OUTER REFERENCES:([bs].[stor_id]))
               |    |--Stream Aggregate (GROUP BY:([bs].[stor_id])
               |    |     DEFINE:([partialagg1005]=SUM([bs].[qty])))
               |    |    |--Clustered Index Scan (OBJECT:([pubs].[dbo].[big_sales].
               |    |         [UPKCL_big_sales] AS [bs]), ORDERED FORWARD)
               |    |--Clustered Index Seek (OBJECT:([pubs].[dbo].[stores].
               |         [UPK_storeid] AS [st]),
               |         SEEK:([st].[stor_id]=[bs].[stor_id]) ORDERED FORWARD)
               |--Clustered Index Seek (OBJECT:([pubs].[dbo].[big_sales].
                    [UPKCL_big_sales]),
                    SEEK:([big_sales].[stor_id]=[st].[stor_id]) ORDERED FORWARD)
The join is the more efficient solution: it does not require the extra Stream Aggregate step that the subquery version needs in order to sum the big_sales.qty values for each store.
UNION vs UNION ALL
Whenever possible, use UNION ALL instead of UNION. The difference is that UNION eliminates duplicate rows and sorts the result: it must build a temporary worktable, collect all rows into it, and sort them before producing output. (The query plan for a SELECT DISTINCT query shows the same thing: a stream aggregate that can consume more than 30% of the resources needed to process the query.) If you genuinely need duplicates removed, use UNION. But if you know, or can safely assume, that the result set contains no duplicate rows, use UNION ALL. It simply selects from one table or join, then selects from the next and appends those rows to the bottom of the first result set. UNION ALL requires no worktable and no sort (unless one is required by other clauses). In most cases it is the more efficient choice. A further danger of UNION is that it can flood tempdb with huge worktables; this may happen if a UNION query returns a large result set.
Example
The following queries return the IDs of all stores in the sales table of the pubs database, together with the IDs of all stores in the big_sales table, to which more than 70,000 rows have been added. The two solutions differ only in the use of UNION versus UNION ALL, yet that one keyword changes the plan substantially. The first solution requires stream aggregation and a sort before the result set is returned to the client. The second query is more efficient, especially for large tables. In this example, both queries return the same result set, although in a different order. Your results may differ slightly.
UNION SOLUTION
-----------------------
SELECT stor_id FROM big_sales
UNION
SELECT stor_id FROM sales

UNION ALL SOLUTION
-----------------------
SELECT stor_id FROM big_sales
UNION ALL
SELECT stor_id FROM sales
UNION SOLUTION
-----------------------
|--Merge Join (Union)
     |--Stream Aggregate (GROUP BY:([big_sales].[stor_id]))
     |    |--Clustered Index Scan (OBJECT:([pubs].[dbo].[big_sales].
     |         [UPKCL_big_sales]), ORDERED FORWARD)
     |--Stream Aggregate (GROUP BY:([sales].[stor_id]))
          |--Clustered Index Scan (OBJECT:([pubs].[dbo].[sales].
               [UPKCL_sales]), ORDERED FORWARD)

UNION ALL SOLUTION
-----------------------
|--Concatenation
     |--Index Scan (OBJECT:([pubs].[dbo].[big_sales].[ndx_sales_ttlID]))
     |--Index Scan (OBJECT:([pubs].[dbo].[sales].[titleidind]))
UNION SOLUTION
-----------------------
Table 'sales'. Scan count 1, logical reads 2,
physical reads 0, read-ahead reads 0.
Table 'big_sales'. Scan count 1, logical reads 463,
physical reads 0, read-ahead reads 0.

UNION ALL SOLUTION
-----------------------
Table 'sales'. Scan count 1, logical reads 1,
physical reads 0, read-ahead reads 0.
Table 'big_sales'. Scan count 1, logical reads 224,
physical reads 0, read-ahead reads 0.
Although the result sets in this example are interchangeable, you can see that the UNION ALL statement consumes about half the resources of the UNION statement. Use UNION ALL whenever you can be sure the result set contains no duplicates.
Functions and Expressions That Suppress Indexes
When you apply a built-in function or expression to an indexed column in a search condition, the optimizer cannot use that column's index. Rewrite such conditions whenever possible so that the indexed column stands alone on one side of the comparison.
Example
Help SQL Server by removing any expression wrapped around an indexed column. The following query selects a row from the jobs table by the unique key value of its unique clustered index. As long as the column is buried in an expression, that index cannot be used. But once you rewrite the condition '(job_id - 2) = 0' as 'job_id = 2', the optimizer performs a seek on the clustered index.
QUERY WITH SUPPRESSED INDEX
-----------------------
SELECT *
FROM jobs
WHERE (job_id - 2) = 0

|--Clustered Index Scan (OBJECT:([pubs].[dbo].[jobs].[PK__jobs__117F9D94]),
     WHERE:(Convert([jobs].[job_id])-2=0))

OPTIMIZED QUERY USING INDEX
-----------------------
SELECT *
FROM jobs
WHERE job_id = 2

|--Clustered Index Seek (OBJECT:([pubs].[dbo].[jobs].[PK__jobs__117F9D94]),
     SEEK:([jobs].[job_id]=Convert([@1])) ORDERED FORWARD)

Note that the seek in the second query is much better than the scan in the first.
The following examples show several kinds of queries whose conditions suppress the use of a column index, each paired with a rewrite that lets the index be used and achieves better performance.
QUERY WITH SUPPRESSED INDEX
---------------------------------------
DECLARE @job_id VARCHAR(5)
SELECT @job_id = '2'
SELECT *
FROM jobs
WHERE CONVERT(VARCHAR(5), job_id) = @job_id

OPTIMIZED QUERY USING INDEX
--------------------------------------
DECLARE @job_id VARCHAR(5)
SELECT @job_id = '2'
SELECT *
FROM jobs
WHERE job_id = CONVERT(SMALLINT, @job_id)

QUERY WITH SUPPRESSED INDEX
---------------------------------------
SELECT *
FROM authors
WHERE au_fname + ' ' + au_lname
      = 'Johnson White'

OPTIMIZED QUERY USING INDEX
--------------------------------------
SELECT *
FROM authors
WHERE au_fname = 'Johnson'
  AND au_lname = 'White'

QUERY WITH SUPPRESSED INDEX
---------------------------------------
SELECT *
FROM authors
WHERE SUBSTRING(au_lname, 1, 2) = 'Wh'

OPTIMIZED QUERY USING INDEX
--------------------------------------
SELECT *
FROM authors
WHERE au_lname LIKE 'Wh%'

QUERY WITH SUPPRESSED INDEX
---------------------------------------
CREATE INDEX employee_hire_date
ON employee (hire_date)
GO
-- Get all employees hired
-- in the 1st quarter of 1990:
SELECT *
FROM employee
WHERE DATEPART(year, hire_date) = 1990
  AND DATEPART(quarter, hire_date) = 1

OPTIMIZED QUERY USING INDEX
--------------------------------------
CREATE INDEX employee_hire_date
ON employee (hire_date)
GO
-- Get all employees hired
-- in the 1st quarter of 1990:
SELECT *
FROM employee
WHERE hire_date >= '1/1/1990'
  AND hire_date < '4/1/1990'

QUERY WITH SUPPRESSED INDEX
---------------------------------------
-- Suppose that hire_date may
-- contain times other than 12 AM.
-- Who was hired on 2/21/1990?
SELECT *
FROM employee
WHERE CONVERT(CHAR(10),
      hire_date, 101) = '02/21/1990'

OPTIMIZED QUERY USING INDEX
--------------------------------------
-- Suppose that hire_date may
-- contain times other than 12 AM.
-- Who was hired on 2/21/1990?
SELECT *
FROM employee
WHERE hire_date >= '2/21/1990'
  AND hire_date < '2/22/1990'
SET NOCOUNT ON
That SET NOCOUNT ON can speed up T-SQL code surprises many SQL Server developers and database administrators. You may have noticed that a successful query returns a system message reporting the number of rows affected. In many cases you do not need this information. The SET NOCOUNT ON command suppresses these messages for your session until you issue SET NOCOUNT OFF.
This option does more than tidy the output. It reduces the amount of information sent from server to client, which cuts network traffic and improves the overall response time of your transactions. The time to transmit a single message is negligible, but consider a script that executes queries in a loop and sends thousands of such useless messages to the user.
For example, consider a T-SQL batch that inserts 9,999 rows into the big_sales table:
-- Assumes the existence of a table called big_sales, a copy of pubs..sales
SET NOCOUNT ON
DECLARE @separator  VARCHAR(25),
        @message    VARCHAR(25),
        @counter    INT,
        @ord_nbr    VARCHAR(20),
        @order_date DATETIME,
        @store_nbr  INT,
        @qty_sold   INT,
        @terms      VARCHAR(12),
        @title      CHAR(6),
        @starttime  DATETIME
SET @starttime = GETDATE()
SELECT @counter = 0,
       @separator = REPLICATE('-', 25)
WHILE @counter < 9999
BEGIN
    SET @counter = @counter + 1
    SET @ord_nbr = 'Y' + CAST(@counter AS VARCHAR(5))
    SET @order_date = DATEADD(hour, (@counter * 8), 'Jan 01 1999')
    SET @store_nbr =
        CASE WHEN @counter < 999 THEN '000000'
             WHEN @counter BETWEEN 1000 AND 2999 THEN '123'
             WHEN @counter BETWEEN 3000 AND 3999 THEN '123'
             WHEN @counter BETWEEN 4000 AND 6999 THEN '123'
             WHEN @counter BETWEEN 7000 AND 7999 THEN '123'
             WHEN @counter BETWEEN 8000 AND 9999 THEN '123'
             ELSE '000000'
        END
    SET @qty_sold =
        CASE WHEN @counter BETWEEN 0 AND 2999 THEN 11
             WHEN @counter BETWEEN 3000 AND 5999 THEN 23
             ELSE 37
        END
    SET @terms =
        CASE WHEN @counter BETWEEN 0 AND 2999 THEN 'Net 30'
             WHEN @counter BETWEEN 3000 AND 5999 THEN 'Net 60'
             ELSE 'On invoice'
        END
    -- SET @title = (SELECT title_id FROM big_sales WHERE qty =
    --     (SELECT MAX(qty) FROM big_sales))
    SET @title =
        CASE WHEN @counter < 999 THEN 'mc2222'
             WHEN @counter BETWEEN 1000 AND 1999 THEN 'mc2222'
             WHEN @counter BETWEEN 2000 AND 3999 THEN 'mc3026'
             WHEN @counter BETWEEN 4000 AND 5999 THEN 'ps2106'
             WHEN @counter BETWEEN 6000 AND 6999 THEN 'ps7777'
             WHEN @counter BETWEEN 7000 AND 7999 THEN 'tc3218'
             ELSE 'ps1372'
        END
    -- PRINT @separator
    -- SELECT @message = STR(@counter, 10)
    --     -- + STR(SQRT(CONVERT(FLOAT, @counter)), 10, 4)
    -- PRINT @message
    BEGIN TRAN
        INSERT INTO [pubs].[dbo].[big_sales] ([stor_id], [ord_num],
            [ord_date], [qty], [payterms], [title_id])
        VALUES (@store_nbr, CAST(@ord_nbr AS CHAR(5)), @order_date,
            @qty_sold, @terms, @title)
    COMMIT TRAN
END
SET @message = CAST(DATEDIFF(ms, @starttime, GETDATE()) AS VARCHAR(20))
PRINT @message
/*
TRUNCATE TABLE big_sales
INSERT INTO big_sales
SELECT * FROM sales
SELECT title_id, SUM(qty)
FROM big_sales
GROUP BY title_id
ORDER BY SUM(qty)
SELECT * FROM big_sales
*/
With SET NOCOUNT OFF, this batch took 5,176 milliseconds of elapsed time; with SET NOCOUNT ON, it took 1,620 milliseconds. If you do not need the row-count output, add SET NOCOUNT ON at the beginning of every stored procedure and script.
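The same advice applies inside stored procedures. A minimal sketch of the recommended pattern (the procedure name is hypothetical, not part of pubs):

```sql
-- Hypothetical procedure illustrating SET NOCOUNT ON as the first statement,
-- so the client never receives "rows affected" messages from this procedure.
CREATE PROCEDURE usp_get_store_sales
    @stor_id CHAR(4)
AS
SET NOCOUNT ON          -- suppress row-count chatter for this procedure
SELECT ord_num, ord_date, qty
FROM big_sales
WHERE stor_id = @stor_id
GO
```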
TOP and SET ROWCOUNT
The TOP clause of the SELECT statement limits the number of rows returned by a single query, while SET ROWCOUNT limits the number of rows affected by all subsequent queries. Both can make many programming tasks far more efficient.
SET ROWCOUNT sets the maximum number of rows that can be affected by SELECT, INSERT, UPDATE, or DELETE statements. The setting takes effect immediately and applies only to the current session. To remove the limit, execute SET ROWCOUNT 0.
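For example, SET ROWCOUNT can cap how many rows each statement in a purge loop touches, keeping individual transactions and their log usage small. A sketch, assuming we want to purge pre-1999 orders from big_sales (the cutoff date and batch size are illustrative assumptions):

```sql
-- Delete at most 500 qualifying rows per statement, looping until done.
SET ROWCOUNT 500

DELETE FROM big_sales
WHERE ord_date < 'Jan 01 1999'

WHILE @@ROWCOUNT > 0          -- repeat while the last DELETE removed rows
BEGIN
    DELETE FROM big_sales
    WHERE ord_date < 'Jan 01 1999'
END

SET ROWCOUNT 0                -- restore the default (no limit)
```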
For some practical tasks, TOP or SET ROWCOUNT is a more efficient programming tool than standard SQL. A few examples demonstrate this.
TOP n
One of the most common queries in almost any database is the request for the first n items in a list. In the pubs database, let us find the five best-selling titles. Compare the pure ANSI SQL, SET ROWCOUNT, and TOP solutions.
Pure ANSI SQL:

SELECT title, ytd_sales
FROM titles AS a
WHERE (SELECT COUNT(*)
       FROM titles AS b
       WHERE b.ytd_sales > a.ytd_sales) < 5
ORDER BY ytd_sales DESC
This pure ANSI SQL solution executes an inefficient correlated subquery; in this case there is no supporting index on ytd_sales. Moreover, it does not filter out NULL values of ytd_sales, and it handles ties among titles ambiguously.
Using SET ROWCOUNT:

SET ROWCOUNT 5
SELECT title, ytd_sales
FROM titles
ORDER BY ytd_sales DESC
SET ROWCOUNT 0

Using TOP n:

SELECT TOP 5 title, ytd_sales
FROM titles
ORDER BY ytd_sales DESC
The second solution uses SET ROWCOUNT to stop the SELECT after five rows. The third uses TOP 5 to stop after the first five rows. In both cases, the ORDER BY clause forces a sort of the entire table before results are returned, and the two query plans are virtually identical. The advantage of TOP over SET ROWCOUNT is that SET ROWCOUNT must fully populate the worktable required by the ORDER BY clause, while TOP does not.
On a large table, we can create an index on ytd_sales to avoid the sort; the query then uses the index to locate the first five rows and stops. Compare this with the first solution, which scans the entire table and executes a correlated subquery for every row. On a small table the performance difference is tiny, but on a large table the first solution could run for hours while the last two finish in seconds.
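A sketch of that indexing step (the index name is an assumption, not part of pubs):

```sql
-- With a descending index on ytd_sales, the TOP query can read the first
-- five index entries and stop -- no full table scan or sort is needed.
CREATE INDEX ndx_titles_ytd_sales
ON titles (ytd_sales DESC)

SELECT TOP 5 title, ytd_sales
FROM titles
ORDER BY ytd_sales DESC
```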
When tuning a query, consider whether only the first few rows are actually needed. If so, using the TOP clause can save a great deal of time.