The source of this article: http://www.cnblogs.com/wy123/p/6008477.html
For statistics on the number of rows of data estimates, previously written on the non-related columns (separate or separate index columns) of the algorithm, refer to here. Today, let's write the statistical information on the calculation methods and potential problems of the composite index at the time of estimation. The prototype of this paper comes from a real business problem, a SQL in the use of a conforming index to do the query, it is found that there will always be large estimates of error, and change the column order of the composite index, the estimated number of rows of error will change, that is, Create index idx_index1 on Ta Blename (col1,col2) and create index idx_index2 on TableName (col2,col1) query with exactly the same query criteria, the execution plan for two indexes is different for its estimated number of rows.
Build a test environment first:
CREATE TABLE teststatistics ( COL1 int IDENTITY (1,1) , COL2 int , COL3 DATETIME , COL4 VARCHAR (goinsert) toteststatistics VALUES (RAND () *10,cast (GETDATE ()-rand () *300 1000000
Problem Recurrence
First look at a very interesting question, on the same table, first build an index: CREATE index idx_col2_col3 on Teststatistics(COL2,COL3) executes a query, estimated to be 4127.86 rows Then drop the index above and continue creating an index: Create index idx_col3_col2 on Teststatistics(col3,col2) Note that the order of COL2 and COL3 continues to execute the above query (the query condition remains the same , the data does not change, only the index column order has changed), this time is estimated to be 2414.91 rows
Query conditions, as well as data, why does changing the order of compound index columns affect the estimate of the data rows of the execution plan?
First, consider the pre-estimate method at the first index:
This query is estimated to be 4127.86 rows, such as
Estimate, it is inseparable from statistical information, first of all, idx_col2_col3 This index of statistics, we know that for composite index, statistics only the leading column of statistics, that is, idx_col3_col2 this index only COL2 this column of statistics, as follows For col2=2 statistics, the statistic is 100336 lines, we remember this number
Another feature of statistics is the automatic creation of statistics on query columns (non-indexed columns), which is automatically created in the following query execution: _wa_sys_00000003_24e8431a Statistics This statistic is the statistic of the COL3 column, You can find the number of rows of statistics that are greater than or equal to 2012-10-20
In SQL Server 2012, the estimate of the data rows is the product of the selectivity of each field, and if the PN represents the density of the different fields, the estimated number of rows is calculated by: Estimated number of rows =p0*p1*p2*p3......*rowcount can use this algorithm, Calculate the current data, the estimated results: 4217.86, with the implementation plan estimates are consistent, very perfect!
When the Idx_col2_col3 rebuild is removed and the index of the COL3+COL2 is established, the following is estimated
The same query criteria as above, estimated to be 2414.91 rows
Based on the above analysis steps, first analyze the statistics on the index columns, as follows the estimated number of rows that are greater than or equal to 2016-10-20
Similarly, this query automatically establishes statistics on the COL2 column (the IDX_COL2_COL3 index is deleted), observing that this statistic estimates the col2=2 as 83711.36 rows
We also use the above formula to calculate the estimated number of rows: 2414.9035 lines, which also perfectly match and execute the projected results.
At this point, it should be clear that the beginning of the problem, is why the composite Index column order is inconsistent, when the query causes the estimate is inconsistent. The most fundamental reason is that the query engine automatically creates statistics for non-leading columns according to the statistics of only the leading columns on the index, but it is critical that, if you are careful, you will find that the number of sampled rows for statistics that are automatically created by the query engine is not 100% sampled, which is critical It is precisely because the non-preamble sampling has a certain error, resulting in the pre-estimation method, when the estimated number of rows =p0*p1*p2*p3......*rowcount, the density value is not the same, that is, when creating idx_col2_col3, the statistical COL2 density is p1_ 1,col3 density is p2_1, when creating idx_col3_col2, the COL2 density of the statistic is p1_2,col3, because p2_2 therefore, the calculated result is p1_1<>p1_2,p2_1<>p2_2 _1*p2_1<>p2_1*p2_2, the principle is very simple, hope crossing can understand.
According to this calculation, for two different order statistics, if P1_1=p2_1 and p2_1=p2_2, then the product is the same, the estimated number of rows is the same, then is not it?
For the two indexes in different order, first look at the index of the col2,col3 order after the query (established statistics), perform a fully sampled (with FULLSCAN) statistics update to see the estimated number of rows, this time is estimated to be: 2894.49
Delete the index of the col2,col3 order, establish an index of col3,col2 order after the query once (the statistics are established), perform a fully sampled (with FULLSCAN) statistics update to see the estimated number of rows again. This time the estimate is: Same as 2894.49, is consistent with the above algorithm
Summarize:
The text simply demonstrates the algorithms and principles that the execution plan uses to estimate the statistical information, and the interference factors that may be affected when calculating the estimated number of rows, which requires that when we build the index, it's not just that I'm done with building a composite index, but also that the order of the indexed columns affects the execution plan estimates, and more importantly, Notice how the statistics generated by the query engine affect the estimate.
Aside from statistical information on the index, all are bullying. Aside from statistical information sampling percentage, it is also bullying to talk about statistical information.
Another problem: When maintaining statistics, can you update only the statistics of the indexed columns, ignoring the statistics of non-indexed columns?
SQL Server Execution plan uses statistics to estimate the data rows (why the composite index column order affects the estimate of the data rows for the execution plan)