In SQL Server, subqueries can be divided into correlated subqueries and unrelated subqueries, but not-in clauses are common for unrelated subqueries, but the not-in clause leads to the following two types of problems:
- Inaccurate results
- Poor query performance
Let's take a look at why we try not to use the NOT in clause.
The problem of inaccurate results
In SQL Server, a null value is not a value, but rather a specific meaning, the meaning of which is "Unknow", which can be understood as undefined or unknown, so that any two-dollar operator result that is compared to a null value must be NULL, including the null value itself. In SQL Server, however, the meaning of the null value is converted to type bool with the result of false. Let's look at a simple example, shown in 1.
Figure 1. Null value compared to any value results are null
SQL Server provides a comparison of the "is" operator with a null value to measure whether a value is null.
So what's the problem with not in, 2?
Figure 2. Not in produces inaccurate values
In Figure 2, condition 3 does not belong to any of the list in the back of not, and the query does not return any values, unlike the expected result, the reason is that the non in clause is handling null values in SQL Server, the not as shown in Figure 2 The IN clause can actually be converted to a query as shown in 3.
Figure 3: Equivalent conversions are possible for the NOT in clause
In Figure 3, you can see that not in can be converted to a condition for each value of the unequal pair, and with logic and connection, and the previous mentioned null value compared with any other value, the result is always null, in the Where condition is false, so 3<> Null causes no rows to be returned, resulting in a not-in clause that produces unexpected results.
Therefore, the NOT in clause if it comes from a table or a list is long, and even if a null value exists in a large number of values, it can cause the final result to return no data.
Solution?
The workaround is to use NOT exists as an alternative instead of not. The EXISTS operator does not return NULL, and only returns TRUE or false based on each row in the subquery, and when a null value is encountered, only false is returned, not a null value that causes the entire subquery expression to be null. For the query shown in Figure 2, we can rewrite it as a subquery, as shown in 4.
Figure 4. Not exists can return results correctly
Low query performance due to not in
As we can see earlier, the main problem with not in is because of the handling of null values, so why is the processing of null values causing performance problems? Let's take a look at the example in Figure 5. In Figure 5, we used the Adventurework sample database and, for demonstration purposes, changed the definition of the ProductID of the SalesOrderDetail table from NOT NULL to NULL, at which point we made a simple not-in query. As shown in 5.
Figure 5. Execution plan for not in
In Figure 5, we see a row Count spool operator, which is used to confirm that there are null values in the ProductID column (the procedure is to compare total rows and non-null rows, not to wait for null values, although we know that there are no null values in the column. However, because the column definition allows null, SQL Server must make additional acknowledgments, which takes up nearly half of the query cost. So we compare the not exists,6.
Figure 6. Not in Vs not Exists
As can be seen from Figure 6, the execution cost of not in is almost 3 times times that of not exists, simply because SQL Server needs to confirm that NULL is present in the Allow null column. According to the equivalent form in Figure 3, not in, we can completely convert the not into the equivalent not exist form, 7 is shown.
Figure 7. Not in conversion to not Exists
Let's compare the cost of Figure 7 and its equivalent not in query, as shown in 8.
Figure 8: Completely equivalent in cost
So we can see that not in requires extra steps to process the null value, which is only defined in the ProductID column in the SalesOrderDetail table to allow NULL, If we also define SalesOrderHeader's SalesOrderID column to allow NULL, we find that SQL Server also requires an additional cost to confirm that there are null values on the column. As shown in 9.
Figure 9. SQL Server solves the problem of nullable columns by joining the left Anti Semi join operator
The not-in corresponding equivalent not-exist form is now shown in code Listing 1.
SELECT *
from Sales.SalesOrderHeader A
WHERE not EXISTS SELECT *
from Sales.SalesOrderDetail b
WHERE A.salesorderid = B.productid)
and not EXISTS SELECT *
from Sales.SalesOrderDetail b
WHERE is NULL
) )
and not EXISTS SELECT 1
from SELECT *
from Sales.SalesOrderHeader
as C
WHERE is NULL )
Code Listing 1. When NULL is allowed for a JOIN column two column definition, the not-in equivalent of not EXISTS form
At this point we are simply comparing the IO cases of not and not exists, as shown in 10.
Figure 10. Not in eats very high IO
Summary
This paper expounds the implementation principle of not in and the data inconsistency and performance problems, so as to avoid using not in while writing query, and converting to the not exists equivalent form provided by this paper will reduce a lot of trouble.
Why not using a subquery in SQL Server is not recommended