Recently a friend and I explored whether statements about where 1=1 and this form would affect performance. The final conclusion is not affected.
Although the conclusion is correct, the understanding of the problem is far from the root of solving the problem. In fact, in the writing process of T-SQL statements often make a very narrow conclusion, and then dogmatic in the Bible, for the T-SQL field, on the internet can often see the so-called optimization code, randomly searched the Internet some excerpts as follows:
- Do not have more than 5 tables connected (join)
- Consider using temporary tables or table variables to hold intermediate results
- Use fewer sub-queries
- View nesting do not too deep, general view nesting not more than 2 advisable.
- Index The fields that appear in the WHERE clause
- Avoid using functions or calculations on indexed columns, in the WHERE clause, if the index is part of a function, the optimizer will no longer use the index for full table scanning
- A condition is added to the Insert and update dimension tables to filter records that already exist in the dimension table
- Use an explicit declaration to specify an index if the query does not go through the index when in or or is used
- exists is far more efficient than in.
..........
What's the problem?
While the above-mentioned guidance seems to be fine and not entirely correct, there are actually two major issues:
out of context : Many truths can only take effect within a context, and it is meaningless to break out of context. For example, usually someone told you that you have a little kidney deficiency, I think your first reaction must be a way to defend the dignity of the man, but if you go to the hospital to check the doctor said so, you may be a face of pious ask how to fill the:-), that the above excerpt of the statement example: 1) less with subqueries, if the SQL The XPath of the server operation XML is filtered by the node properties, that translates into subqueries will be faster 2) if you use in or OR and so on to find that the query does not walk the index, the use of explicit declaration to specify the index, the Query Analyzer does not walk the index must have its cause,
does not explain the essential reason : The Buddha language has the cloud "every phase, all is false, if see the phase non-phase, namely see the Tathagata". Take a look at the following story:
said that there was a two government officials together to see the disease, a call Kan, a call ish, two people's symptoms are the same, are headache, body fever, may be cold bar. But Hua Tuo said: "Kan should be used to treat, ish should use Khan method to cure (seek the present, delay when sweating)." "People think it's strange, you must think it's strange, why the same disease, the same symptoms, there will be different treatment?" Hua Tuo explained, he said: "Kan is outside the real, the extension is the real, so use a different way." "Sure enough, the next day, they were both ill."
In fact, it can be seen that the exact same symptoms, can be completely different reasons, on the contrary, the same reason, can also form a completely different "phase". If only to see the "phase" and take the stress treatment measures, often results will be unsatisfactory.
Think like Query Optimizer
In every field there are rules in their domain, and most simply, if you do not conform to the C # specification for programming, such as using keywords incorrectly, then the compilation will be an error. Of course, there will be some hidden rules in each area, and some people will say is the so-called "unspoken rules", such rules are often not materializing, such as you do not meet the best practice to write a program, compile will not error, but the resulting performance or security problem is that you need to follow best practices this "unspoken rules" to avoid.
In the SQL Server field, the T-SQL statement to the query result return needs to go through a full cycle, 1:
Figure 1. T-SQL life cycle
Therefore, in the relational database domain, the SQL statement is written as an abstract logic, rather than as a direct implementation of the programming language. For example, to access a row of data, if it is a programming language implementation, you need to specify how to connect the data, open the data, take the data in a certain way, and finally close the connection, and in SQL Server, T is just the definition of how to get the required data, without regard to implementation details.
Figure 1 goes through several steps from T-SQL to specific return data, and there are a number of rules for each step. Therefore, the performance problems caused by where 1=1 and in this article need to be considered in accordance with the rules of the query parser, which is also think like query optimizer.
In SQL Server, T-SQL needs to be compiled into execution plan to execute, in the compilation process, Query optimizer need to consider a lot of metadata, such as indexes on the table, data distribution, estimated number of rows, some parameter configuration, hardware environment, in which the most important is the estimated number of rows , SQL Server needs to estimate the number of rows to estimate the cost.
Where 1=1 and writing is not slowing down?
The query parser filters out the 1=1 directly in the algebraic tree optimization phase. This function is called "Constant folding" in the query optimizer.
We assume that the query parser does not directly filter where 1=1 in the algebraic tree optimization phase.
For example, the statement select * from table where a=1 and b=2 this statement, SQL Server estimates the number of rows will be:
Column A selection rate *b column selection rate * Total number of rows sampled in table
Therefore, when the where 1=1 and a=1, the result becomes
1*a Column Selection Rate * The total number of rows sampled in the table =a column number of rows sampled in the table
Therefore, the query parser estimates the same number of rows, regardless of whether it has 1=1 and, and thus has the same execution plan, so it does not affect performance.
When we understand how the query parser is going to estimate the number of rows for a and B, then we can figure out what happens to A and b that might cause the execution plan to be inaccurate. From a formula, SQL Server thinks that column A and column B are unrelated, and if A and B are very large, the estimated number of rows will be very inaccurate.
Here we give an example, if there are 1 million rows of data in the table, where A=1 has 10,000 data, where B=1 has 10,000 of the data, the selectivity of A and B is 1/100=0.01, and the estimated number of rows A and b in the where is changed to 0.01*0.01= 0.0001*100 = 100 rows, assuming that the data filtered by where a=1 and B=1 are the same 10,000 rows of data, the estimated number of rows is 100 and the actual number of rows is 10,000, which can cause performance problems due to inaccurate execution plans. Of course, this is a small number of cases, but often after the performance has a certain impact, so SQL Server 2014 new row number estimation using an exponential backoff algorithm, in this case will be estimated as 1000 rows, resulting in the possibility of performance problems will become smaller, 2014 index Backoff algorithm is not the focus of this article, So it's not much to say.
Why SQL statements where 1=1 and does not affect performance in SQL Server