The source of this article: http://www.cnblogs.com/wy123/p/6008477.html
For statistics on the number of rows of data estimates, previously written on the non-related columns (separate or separate index columns) of the algorithm, refer to here.
Today, let's write the statistical information on the calculation methods and potential problems of the composite index at the time of estimation.
This paper comes from a real business problem, a SQL in the use of a conforming index to do the query, it is found that there will always be a large estimated error,
and change the column order of the composite index, the error of the estimated row number will change, the reason for it?
Build a test environment first:
CREATE TABLE teststatistics ( COL1 int IDENTITY (1,1) , COL2 int , COL3 DATETIME , COL4 VARCHAR () ) Goinsert into Teststatistics VALUES (RAND ( )*, CAST (GETDATE ()-rand () *1000000
Problem Recurrence
First look at a very interesting question,
On the same table,
Let's build this one. Index: CREATE index idx_col2_col3 on Teststatistics(col2,col3)
Executes a query that is estimated to be 4127.86 rows
Then drop the index above and continue creating an index: Create index idx_col3_col2 on Teststatistics(col3,col2)
Note that the order of COL2 and COL3 is inconsistent
Continue execution of the above query (the query condition is unchanged, the data is unchanged, only the index column order has changed), this time is estimated to be 2414.91 row
Query conditions, as well as data, why does changing the order of compound index columns affect the estimate of the data rows of the execution plan?
First, consider the pre-estimate method at the first index:
This query is estimated to be 4127.86 rows, such as
It is estimated that the statistical information can not be separated from, first of all, to see the idx_col2_col3 Index of statistics,
We know that for composite indexes, only the statistics of the leading columns in the statistics, that is, the Idx_col3_col2 index only COL2 This column of statistics, as follows
For col2=2 statistics, the statistic is 100336 lines, we remember this number
Another feature of statistics is that statistics are automatically created on query columns (non-indexed columns), as follows
During query execution, a statistic named: _wa_sys_00000003_24e8431a is created automatically.
This statistic is the statistic for the COL3 column, and you can find the number of rows that are greater than or equal to 2012-10-20.
In SQL Server 2012, the estimate of the data rows is the product of the selectivity of each field,
If PN represents the density of the different fields, then the estimated number of rows is calculated by: Estimated number of rows =p0*p1*p2*p3......*rowcount
This algorithm can be used to calculate the estimated results under current data: 4217.86, consistent with the execution plan estimate, perfect!
When the Idx_col2_col3 rebuild is removed and the index of the COL3+COL2 is established, the following is estimated
The same query criteria as above, estimated to be 2414.91 rows
Based on the above analysis steps, first analyze the statistics on the index columns, as follows the estimated number of rows that are greater than or equal to 2016-10-20
Similarly, this query automatically establishes statistics on the COL2 column (the IDX_COL2_COL3 index is deleted), observing that this statistic estimates the col2=2 as 83711.36 rows
We also use the above formula to calculate the estimated number of rows: 2414.9035 lines, which also perfectly match and execute the projected results.
At this point, it should be clear that the beginning of the problem, is why the composite Index column order is inconsistent, when the query causes the estimate is inconsistent.
The most fundamental reasons are:
Conforms to the statistics of only the leading columns on the index, the query engine automatically creates statistics for non-leading columns as needed.
However, it is critical that, if you are careful, you will find that the number of sampled rows for statistics that are automatically created by the query engine is not 100% sampled.
It is precisely because the non-preamble sampling has a certain error, resulting in the pre-estimation method, when the estimated number of rows =p0*p1*p2*p3......*rowcount, the density value is not the same
That is, when creating idx_col2_col3, the COL2 density of the statistics is p1_1,col3 density is p2_1,
When creating idx_col3_col2, the COL2 density of the statistics is p1_2,col3 density is p2_2, because p1_1<>p1_2,p2_1<>p2_2
Therefore, the calculated result is p1_1*p2_1<>p2_1*p2_2, the principle is very simple, hope crossing can understand.
For this calculation, for two different indexes, if P1_1=p2_1 and p2_1=p2_2, then the product is the same, the estimated number of rows is the same, then is not it?
For two indexes in different order, first look at the index of COL2,COL3 order
After the query once (statistical information is established), perform a fully sampled (with FULLSCAN) statistics update
Look again at the estimated number of rows, which is estimated to be: 2894.49
Deletes the index of the Col2,col3 order, establishes the col3,col2 as the sequential index
After the query once (statistical information is established), perform a fully sampled (with FULLSCAN) statistics update
Re-look at the estimated number of rows, which is estimated to be: Same as 2894.49, which matches the algorithm above
Summarize:
The text simply demonstrates the algorithms and principles used to estimate the execution plan's use of statistics, as well as the interference factors that may be affected when calculating the estimated number of rows.
This requires that when we build the index, it's not just that I'm finished building a composite index, but also the effect of the order of the indexed columns on the execution plan estimates.
More importantly, be aware of the extent to which statistical information generated automatically by the query engine affects the estimate.
The statistics on the index, all are rogue. Aside from statistical information, the percentage of sampling is also a rogue.
My technical ability is also very food, write the wrong place also ask you crossing point, thank you.
SQL Server Execution plan uses statistics to estimate the data rows (why the composite index column order affects the estimate of the data rows for the execution plan)