Statistics in SQL Server: estimates for predicates not covered by the histogram, and changes in the estimation strategy (SQL Server 2012 --> 2014 --> 2016)
The source of this article: http://www.cnblogs.com/wy123/p/6770258.html
I have written a few articles about statistics before, but never felt the topic was fully covered. I recently stepped into a statistics-related pit; the problem is not very common, but it is quite interesting.
Compared with SQL Server 2012, the newer versions (2014 and 2016) contain some notable changes in this area, and this article is a brief analysis of them.
In SQL Server 2012 (and earlier versions), when data in a table has changed but the statistics have not yet been updated, SQL Server always estimates 1 row for a filter predicate that is not covered by the histogram.
This estimation method changed in SQL Server 2014 and SQL Server 2016: each version behaves differently when estimating a filter predicate that is not covered by the histogram.
This article simply tests how this scenario behaves in SQL Server 2012, SQL Server 2014, and SQL Server 2016, and the potential impact the problem may have.
The test environments involved are SQL Server 2012, SQL Server 2014, and SQL Server 2016.
Test environment preparation
First, use the following script to build a test table and write the test data; the distribution of the test data is explained below.
CREATE TABLE A
(
    Identifierid INT IDENTITY(1,1),
    Id1 INT,
    Id2 INT,
    Othercol CHAR(100)  -- the column width is garbled in the source text; 100 is assumed
)
GO

BEGIN TRAN
DECLARE @i INT = 1
WHILE @i <= 1000000
BEGIN
    INSERT INTO A VALUES ((@i / 50000) + 1, @i, NEWID())
    SET @i = @i + 1
    IF (@i % 500000) = 0
    BEGIN
        IF @@TRANCOUNT > 0
        BEGIN
            COMMIT
            BEGIN TRAN
        END
    END
END
IF @@TRANCOUNT > 0
BEGIN
    COMMIT
END
GO
The inserted test data is distributed as follows: Id1 runs from 1 to 20 (plus a single row with Id1 = 21 produced by the integer division), and each Id1 value corresponds to 50,000 distinct Id2 values.
Estimates for predicates that are covered by the statistics histogram
Test: query on any Id1 value that appears in the histogram, creating statistics on the relevant column before querying (here by creating an index on it), and the estimated number of rows turns out to be exactly accurate.
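The original article shows this step as screenshots; the following is a minimal sketch of what it does (the index name idx_1 comes from the article, and the value 5 is an arbitrary in-histogram Id1 value):

CREATE INDEX idx_1 ON A (Id1)
GO
-- Creating the index builds statistics on Id1 from a full scan of the table.
-- For any Id1 value present in the histogram, estimated rows = actual rows.
SELECT COUNT(1) FROM A WHERE Id1 = 5
GO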
Viewing the statistics on idx_1, the absolute accuracy of the estimate above is explained by the statistics' 100% sampling rate and by the EQ_ROWS of each RANGE_HI_KEY; the Id1 values in the histogram run from 1 to 21.
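The histogram can be inspected with the standard DBCC command:

DBCC SHOW_STATISTICS ('A', 'idx_1')
GO
-- The header shows the sample rate (100%), and the histogram shows one
-- RANGE_HI_KEY per Id1 value with EQ_ROWS matching the real row counts.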
Estimates for predicates not covered by the statistics histogram
Continue by inserting a batch of rows with an Id1 value that differs from all of the above, here Id1 = 50. Inserting 50,000 rows at this point is not enough to trigger a statistics update, so after this write completes, the statistics are not updated.
Because the statistics are not updated, the idx_1 histogram has no entry for Id1 = 50; in other words, Id1 = 50 does not exist in the statistics histogram.
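A sketch of that write, following the same pattern as the setup script (the exact script is not shown in the extracted text, and the choice of Id2 = @i, repeating existing values, is an assumption that keeps the later join test returning the same row count for both predicates):

-- 50,000 changes is below the classic update threshold of
-- 500 + 20% * 1,000,000 = 200,500, so the statistics stay stale (before 2016).
DECLARE @i INT = 1
WHILE @i <= 50000
BEGIN
    INSERT INTO A (Id1, Id2, Othercol) VALUES (50, @i, NEWID())
    SET @i = @i + 1
END
GO
-- The test query; compare estimated vs. actual rows in the execution plan:
SELECT COUNT(1) FROM A WHERE Id1 = 50
GO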
The estimated result in SQL Server 2012: 1 row, against an actual 50,000 rows.
Repeat the same test code in SQL Server 2014 and SQL Server 2016; the steps are not repeated here.
The test in SQL Server 2014 goes as follows: the estimate is 1024.7 rows, the actual is 50,000.
How is this value calculated? I have not found any documentation for it. (One numeric observation: the square root of the table's row count at that point, √1,050,000, is approximately 1024.7, so the new cardinality estimator may be using the square root of the table cardinality here; that is only a guess.)
The test in SQL Server 2016 goes as follows: the estimate is 49,880.8 rows against an actual 50,000, essentially matching the real value.
Compared with the estimates of SQL Server 2012 and 2014, this estimate is impressively accurate.
Why is the estimate in SQL Server 2016 so accurate?
Because in SQL Server 2016, when a query filters on a predicate value that does not exist in the histogram, the relevant statistics are automatically updated first, and then the query is executed.
Compared with SQL Server 2012 and 2014, this feature is new and very practical.
SQL Server 2014's estimation strategy, although an improvement over 2012, does not solve the underlying problem, so one still has to intervene and update the statistics manually.
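The manual intervention mentioned here is typically something like the following standard command (FULLSCAN mirrors the 100% sample seen earlier):

UPDATE STATISTICS A idx_1 WITH FULLSCAN
GO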
In SQL Server 2016, even when the number of changed rows in the table has not reached the condition that triggers a statistics update (the traditional threshold, 500 + 20% * row count),
the statistics are still updated, driven by the query. This can be seen from the statistics on the index: the histogram has gained an entry for 50.
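One way to observe this on SQL Server 2016, reusing the same DBCC command as before:

SELECT COUNT(1) FROM A WHERE Id1 = 50
GO
DBCC SHOW_STATISTICS ('A', 'idx_1') WITH HISTOGRAM
GO
-- A RANGE_HI_KEY of 50 now appears, even though only 50,000 of the 200,500
-- changes required by the classic threshold had accumulated.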
Here is what is meant by the thresholds that trigger a statistics update (strictly speaking, these rules are valid only for versions before SQL Server 2016, not for SQL Server 2016):
1. The table goes from having no data to having one or more rows.
2. For a table with fewer than 500 rows, when the cumulative number of changes to the leading column of the statistics exceeds 500.
3. For a table with more than 500 rows, when the cumulative number of changes to the leading column of the statistics exceeds 500 + (20% x total rows in the table).
These rules hold for versions before SQL Server 2016 and no longer hold from SQL Server 2016 onward, which I think is worth noting.
The statistics update policy in SQL Server 2016 corresponds to trace flag 2371, which had to be turned on explicitly in earlier versions; see http://www.cnblogs.com/wy123/p/5748933.html
That is, the decision to update statistics is made dynamically and no longer sticks to the "cumulative changes greater than 500 + (20% x total rows)" restriction.
In addition, there appear to be other triggering mechanisms, such as the query-driven update seen here.
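For reference, on versions before SQL Server 2016 (SQL Server 2008 R2 SP1 and later), the dynamic threshold has to be switched on explicitly:

DBCC TRACEON (2371, -1)  -- -1 applies the trace flag globally
GO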
Problems this causes
Why did Microsoft change the statistics update strategy in SQL Server 2016, and why am I digging into this issue?
Because, of course, of a painful pit stepped into in actual business workloads.
The problem is obvious: in a scenario like the test above, in SQL Server 2012 (and earlier versions), this estimation strategy has a serious flaw.
As in the example:
because there are no statistics for the current filter predicate (that is, the histogram has no entry for the predicate value), the estimate is 1 row where the actual count is 50,000 rows.
This estimation strategy is wildly off, and in some cases it causes serious performance problems that are easy to overlook.
Here is a quick and simple explanation of what problems it causes, and why.
The following examples show how this can lead to performance problems and how serious the impact can be. The tests below are performed under SQL Server 2012.
To demonstrate the problem, first create another test table B and write the test data.
CREATE TABLE B
(
    Identifierid INT IDENTITY(1,1),
    Id2 INT,
    Othercol CHAR(100)  -- the column width is garbled in the source text; 100 is assumed
)
GO

BEGIN TRAN
DECLARE @i INT = 1
WHILE @i <= 1000000
BEGIN
    INSERT INTO B VALUES (@i, NEWID())
    SET @i = @i + 1
    IF (@i % 100000) = 0
    BEGIN
        IF @@TRANCOUNT > 0
        BEGIN
            COMMIT
            BEGIN TRAN
        END
    END
END
IF @@TRANCOUNT > 0
BEGIN
    COMMIT
END
GO
CREATE INDEX idx_2 ON B (Id2)
GO
Use this second table in a test that magnifies the damage done by the wrong row estimate.
Execute the following two SQL statements, querying for A.Id1 = 5 and A.Id1 = 50 respectively:
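The original shows the statements as images; the join condition A.Id2 = B.Id2 below is an assumption consistent with the table definitions:

SET STATISTICS IO ON
GO
SELECT * FROM A INNER JOIN B ON A.Id2 = B.Id2 WHERE A.Id1 = 5
GO
SELECT * FROM A INNER JOIN B ON A.Id2 = B.Id2 WHERE A.Id1 = 50
GO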
Given the data distribution, the two queries return exactly the same total number of rows (rows affected).
Yet although A.Id1 = 5 and A.Id1 = 50 cover exactly the same amount of data with the same distribution, the latter's logical IO far exceeds the former's.
This is because the histogram has no statistics for A.Id1 = 50, so A.Id1 = 50 is incorrectly estimated as 1 row.
The specific reason is very clear; readers who understand execution plans will see it immediately.
Because the number of rows filtered by the current predicate is underestimated, an index seek is used to fetch the data from table A.
It turns out that, in this situation, that is far less efficient than a full table scan (see the logical IO).
In addition, when A drives the join to B, because the estimate is one row, a nested loops join is used to probe table B.
In fact, nested loops is not the best choice in this situation; it can fairly be called very bad.
This, too, comes down to the histogram having no statistics for the filter predicate, so the first phase of the estimate is incorrectly pegged at 1 row.
What makes this problem more painful is that when you inspect the session or the cached execution plan, the plan looks fine on the surface: indexes are being used.
For example, the second SQL statement's execution plan seems unproblematic, which makes it easy to overlook the real cause of the problem
and turn attention elsewhere, making the issue harder to diagnose. The trouble is precisely that indexes are used in places where they should not be used.
A wrong choice in the first step of plan selection causes every subsequent step to go wrong (one step wrong, every step wrong). In reality, where the SQL is more complex and the data volume larger, the impact is correspondingly greater.
If a few more tables were joined in the example above, a chain of nested-loops-driven joins would appear, making both the SQL execution time and the logical IO very high.
Attached is a test under SQL Server 2016, where it is visible that, by default, the execution plan makes the right choice.
Finally:
1. This article is not about indexing; there is not much to say here about indexes.
2. Although the scenario in this article is not very common and slightly special, it was actually encountered, and it shows that Microsoft is gradually improving the statistics update policy of the SQL Server optimizer.
3. On estimation in this scenario, there are many interesting questions across versions that have not been raised here; I will discuss them when there is a chance.
4. Similar problems occur only with relatively large data volumes. With 100,000 or a few hundred thousand rows, a small amount of data for a database, the performance impact of such problems does not fully show.
5. If anyone verifies the tests in this article, please note one detail: the estimation of the filter predicate differs between the following two cases, and the two cases are also estimated differently in 2012 versus 2014 (2016); see the sketch after this list.
  1. The table has no data for the predicate value and the statistics have not been updated, for example estimating Id1 = 50 when there are 0 such rows.
  2. The table does have data for the predicate value and the statistics have likewise not been updated, for example estimating Id1 = 50 when there are 50,000 such rows.
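A minimal sketch contrasting the two cases with the tables from this article; only the ordering of the steps differs:

-- Case 1: Id1 = 50 is absent from the histogram AND the table holds 0 such rows.
-- Run the query BEFORE the 50,000-row insert of Id1 = 50:
SELECT COUNT(1) FROM A WHERE Id1 = 50
GO
-- Case 2: Id1 = 50 is absent from the histogram but 50,000 such rows exist.
-- Run the same query AFTER the insert, before any statistics update:
SELECT COUNT(1) FROM A WHERE Id1 = 50
GO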
Summary:
SQL Server's row estimates have a critical impact on execution plan generation, and those estimates rely on statistics, so the updating and accuracy of statistics are especially important. In view of this, SQL Server has changed statistics generation and update strategies considerably from version to version. This article, from one small angle, verified the estimation and update characteristics of statistics in each version of SQL Server, the potential impact of this class of problem, and some of the improvements in SQL Server 2016.