Premise
This article discusses only SQL Server queries,
and only non-composite statistics, where each statistic describes the data distribution of a single field.
The topic is how SQL Server estimates the number of rows from statistics when a query combines conditions on multiple fields:
the algorithm behind the estimate, and the differences between SQL Server 2012 and SQL Server 2014.
For now, composite statistics are not involved, nor are statistics update strategies and related optimization topics, nor the calculation methods of other SQL Server versions.
What is a statistic?
In short, it is a description of the data distribution of certain fields, so that when SQL Server runs a query with conditions on those fields, it roughly knows the expected size of the result.
It is a database object that guides the generation of a reasonable execution plan.
Classification of statistics
Statistics are automatically created on indexes; SQL Server also automatically creates statistics on some non-indexed columns based on specific queries, and of course you can create statistics manually as well.
First, to get an intuitive sense of what statistics look like, note the naming:
statistics whose names begin with _WA_Sys_ are created by the system as needed;
statistics with the same name as an index are the statistics created on that index;
manually created statistics can be named freely, as long as SQL Server's naming rules are met.
The following is the statistics information for an index.
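Which statistics exist on a table can also be listed with a query against the catalog views; a small sketch (sys.stats and sys.stats_columns are standard catalog views, and the table name here is the test table created below):

SELECT s.name AS stats_name, s.auto_created, s.user_created, c.name AS column_name
FROM sys.stats s
    JOIN sys.stats_columns sc ON s.object_id = sc.object_id AND s.stats_id = sc.stats_id
    JOIN sys.columns c ON sc.object_id = c.object_id AND sc.column_id = c.column_id
WHERE s.object_id = OBJECT_ID('TestStatistics')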
The role of statistics
The query engine produces a reasonable execution plan based on the data provided by the statistics.
So how does the query engine use statistics to make its estimates?
And what changes did SQL Server 2014, mentioned below, make compared with earlier versions?
This article gives a simple analysis of these two points to explain how SQL Server estimates based on statistics. The main text starts below.
Test environment Setup
As usual, set up a demo environment: create a table and write 1,000,000 rows of data for the tests that follow.
CREATE TABLE TestStatistics (Id INT IDENTITY(1,1), Status1 INT, Status2 INT, Status3 INT)

INSERT INTO TestStatistics VALUES (RAND()*1000, RAND()*250, RAND()*50)
GO 1000000
There are four fields in the table: the first is a self-incrementing identity column, and the main ones are the three fields Status1, Status2, and Status3.
The values of these three fields are random numbers multiplied by constant coefficients,
so the value ranges of the three fields are:
Status1: 0-999 (1000 distinct values)
Status2: 0-249 (250 distinct values)
Status3: 0-49 (50 distinct values)
This will be useful later.
First Test in SQL Server 2012
Make a query first: SELECT * FROM TestStatistics WHERE Status1=885 AND Status2=88 AND Status3=8
After this query completes, three statistics are automatically created on the table;
they describe the data distribution of the Status1, Status2, and Status3 fields, respectively.
First, let's look at the details of _WA_Sys_00000002_0EA330E9, which is the statistic on the Status1 column.
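These details can be viewed with DBCC SHOW_STATISTICS, for example (the exact _WA_Sys_ name will differ on your own instance, since it contains generated identifiers):

DBCC SHOW_STATISTICS ('TestStatistics', '_WA_Sys_00000002_0EA330E9')
-- Returns three result sets: the header, the density vector (All density), and the histogram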
Note the All density value. Selectivity measures how unique (or how duplicated) a field's values are within a table;
it is calculated as 1 / (the number of distinct values of the field in the table).
As mentioned above, the Status1 column ranges from 0 to 999, for a total of 1000 possible values,
so its selectivity is 1/1000 = 0.001, matching the All density = 0.001 shown here.
By the same calculation, the selectivities of the remaining two fields are 1/250 = 0.004 and 1/50 = 0.02, respectively, and their All density values are as follows.
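These density values can be double-checked directly against the data; a quick sketch of the 1 / (distinct count) calculation:

SELECT 1.0 / COUNT(DISTINCT Status1) AS density_status1,  -- expected 0.001
       1.0 / COUNT(DISTINCT Status2) AS density_status2,  -- expected 0.004
       1.0 / COUNT(DISTINCT Status3) AS density_status3   -- expected 0.02
FROM TestStatistics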
Execution plan estimates for data rows
Having covered the basics of statistics, we can observe the rule the execution plan follows when estimating the number of target rows.
Let's look at a query like the following. Note that the query conditions are passed as variables rather than literal values; I will explain why later.
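The query is of this form, with the conditions passed in as variables (the variable names here are my own choice):

DECLARE @p1 INT = 885, @p2 INT = 88
SELECT * FROM TestStatistics WHERE Status1 = @p1 AND Status2 = @p2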
Observe the estimated number of rows in the execution plan: as you can see, the estimate is 4 rows.
So how was this estimate of 4 rows calculated?
It uses the selectivity values discussed above.
The selectivity of Status1 is 0.001, and the selectivity of Status2 is 0.004.
In SQL Server 2012, the row estimate is the product of the selectivity of each field.
If Pn denotes the selectivity of each field, the estimated number of rows is calculated as: estimated rows = P0 * P1 * P2 * P3 * ... * RowCount
Therefore, the execution plan displays: estimated rows = 0.001 * 0.004 * total rows (1,000,000) = 4
Here are two possible questions to address.
First, the example above queries on two fields; why not demonstrate with three?
To be clear, no matter how many fields are queried, the estimated number of rows follows the calculation above.
However, if the result of the formula is very small, less than 1, SQL Server displays the estimate as 1 row.
Using the calculation above with three fields in the query:
estimated rows = 0.001 * 0.004 * 0.02 * total rows (1,000,000) = 0.08 < 1, so the estimate is 1 row.
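Both numbers fall straight out of the formula; a quick check in T-SQL:

SELECT 0.001 * 0.004 * 1000000 AS est_two_conditions            -- = 4
SELECT 0.001 * 0.004 * 0.02 * 1000000 AS est_three_conditions   -- = 0.08, displayed as 1 row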
Second, why query with variables rather than directly with values?
Those familiar with SQL Server will know that when you query with variables, SQL Server compiles the plan without knowing the specific parameter values.
When the specific parameter value is unknown, the field's selectivity is used as a general (average) value,
that is, the overall selectivity of the field recorded in the statistics, namely All density = 0.001.
This tentatively assumes the data distribution is uniform, i.e., that the number of rows per value does not differ much.
In reality, though, the distribution of each value does differ,
especially when the distribution is skewed; of course, that is another very big topic, not discussed here.
If the query is made directly with explicit values,
such as SELECT * FROM TestStatistics WHERE Status1=885 AND Status2=88,
SQL Server reads the specific values from each field's statistics: the number of rows with Status1=885 and the number of rows with Status2=88,
and uses the formula above to make the estimate.
Let's continue with specific values to illustrate.
You can run the query with WHERE Status1=885 AND Status2=88 and look at the estimated result directly.
First, look at the number of rows for Status1=885 in the statistics: 1079 rows.
Then look at the number of rows for Status2=88 in the statistics: 3996 rows.
Using the formula above, the estimated number of rows is 4.31168.
Is that the number of rows the execution plan estimates when the query uses literal values? Exactly; it matches the result calculated above perfectly.
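The arithmetic, using the two row counts read from the statistics (1079 and 3996) as selectivities over the 1,000,000 total rows:

SELECT (1079.0 / 1000000) * (3996.0 / 1000000) * 1000000 AS estimated_rows  -- = 4.311684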
Third, the above applies when there is no index; what if indexes are created?
The statistics for each column in the query condition are uncorrelated with one another;
if you create separate single-column indexes on the individual columns, their statistics are still uncorrelated when queried.
For example, after the indexes were created, the execution plan changed
from a table scan into two index seeks followed by a hash join and then a loop join to fetch the data; but regardless of what the plan becomes,
the row estimate is the same as with the full table scan above, 4.31168; the estimated number of rows does not change just because indexes were created and the execution plan changed.
This is because even if you create indexes on single columns and the execution plan changes, the statistics remain uncorrelated; that is, each statistic only describes the distribution of a single column.
Next, test in SQL Server 2014
With the same data as above, I import the test table from the SQL Server 2012 instance into a table on a SQL Server 2014 instance through a linked server,
so the table structure and data are now exactly the same.
First, run a similar test: query using the same two variable conditions and see what SQL Server 2014's estimation algorithm does.
Remember that the same query written in SQL Server 2012 above estimated just 4 rows; how has it become 63.2456 rows?
The formula for the estimated number of rows has changed, and that, of course, is the focus of this article.
So how is the estimate computed in SQL Server 2014? The formula is: estimated rows = P0 * P1^(1/2) * P2^(1/4) * P3^(1/8) * ... * RowCount
For the query in question, the estimated number of rows is therefore: estimated rows = 0.001 * 0.004^(1/2) * 1000000 = ?
Rather than doing the square root by hand, I simply let SQL Server calculate it; SQL Server provides a square-root function (SQRT), which is very handy.
Calculate the result:
Yes, it is 63.24555; keeping four decimal places gives 63.2456, exactly matching the execution plan's estimate above.
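The calculation, done with SQRT as described:

SELECT 0.001 * SQRT(0.004) * 1000000 AS estimated_rows  -- = 63.2455...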
Supplementary Test 1:
Similarly, when querying with three conditions, the estimate also conforms to the formula above.
Calculating the estimated number of rows with the formula, using the overall selectivity of each field, again matches the execution plan.
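As a sketch, the same formula for the three-condition case would be computed like this (assuming the selectivities are applied in the order 0.001, 0.004, 0.02; the CAST to FLOAT only keeps POWER from truncating the decimal):

SELECT 0.001 * SQRT(0.004) * POWER(CAST(0.02 AS FLOAT), 0.25) * 1000000 AS est_three_conditions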
Supplementary Test 2:
If you change the query conditions to specific values, then just as in SQL Server 2012, SQL Server 2014 also calculates from the specific row counts in the statistics.
Run this query: SELECT * FROM TestStatistics2014 WHERE Status1=858 AND Status2=88
Let me explain why the Status1 value was changed to 858:
even though the table structure and data are exactly the same, the statistics histogram is limited to 200 steps, so the statistics in the two databases are not identical and cannot be precise for every single value.
Since we are demonstrating the algorithm here, picking a value that is an exact RANGE_HI_KEY makes the problem easier to explain.
First, look at the data distribution for Status1=858.
Then look at the data distribution for Status2=88.
The estimate calculated using the method above: 63.27713.
The execution plan's estimate: 63.27713, exactly the same.
Supplementary Test 3: create separate indexes on the query columns
As in SQL Server 2012, the execution plan changes, but the estimated number of rows is not changed by the new plan.
Although the execution plan changed, the row estimate did not; it still follows: estimated rows = P0 * P1^(1/2) * P2^(1/4) * P3^(1/8) * ... * RowCount
As can be seen, the execution plan's row estimates (for values that do not exceed the range of the statistics) follow a clear rule.
The rule is:
In SQL Server 2012, estimated rows = P0 * P1 * P2 * P3 * ... * RowCount (Pn is the selectivity of each query field).
In SQL Server 2014, estimated rows = P0 * P1^(1/2) * P2^(1/4) * P3^(1/8) * ... * RowCount (Pn is the selectivity of each query field).
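As a general sketch of the SQL Server 2014 rule (my own helper, not anything SQL Server exposes), given the per-field selectivities ordered from most to least selective, as in the examples above:

DECLARE @rowcount FLOAT = 1000000
DECLARE @selectivities TABLE (ordinal INT, p FLOAT)
INSERT INTO @selectivities VALUES (0, 0.001), (1, 0.004), (2, 0.02)

-- Product of p(i)^(1/2^i), computed as EXP(SUM(LOG(p)/2^i)), then scaled by the row count
SELECT EXP(SUM(LOG(p) / POWER(CAST(2 AS FLOAT), ordinal))) * @rowcount AS estimated_rows
FROM @selectivities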
Of course, whether the statistics are out of date or the sampling rate is insufficient is another matter; that relates to statistics update strategy, a very large and very practical problem that is not discussed in depth here.
That is why, as noted at the start, statistics that are not ideal are a separate situation; the tests above were done with very complete statistics.
Why did Microsoft change the row estimation algorithm in SQL Server 2014 in this way, for uncorrelated conditions that do not exceed the statistics range?
Because each Pn value is less than 1,
changing the calculation from P0 * P1 * P2 * P3 * ... * RowCount to P0 * P1^(1/2) * P2^(1/4) * P3^(1/8) * ... * RowCount obviously increases the estimated number of rows.
Another case not covered in this article: the new estimation method also increases the estimate when the queried value exceeds the range of the statistics.
On the whole, the algorithm now tends to "estimate on the high side"; that is the nature of the change.
Why make this change?
If you do SQL optimization often, you will find that many problems come from underestimating the expected number of rows (for a variety of reasons not discussed here),
which leads to insufficient resources being allocated to execute the SQL and slows down execution.
A very typical problem is an estimate smaller than the actual number of rows, resulting in an insufficient memory grant and an unreasonable execution plan at run time.
Personally, I think estimating higher can improve SQL execution efficiency by acquiring more system resources (within a certain range),
and under normal circumstances the gap from the actual value will not be so outrageous as to waste resources.
Of course, there are exceptions.
Note the premise here: uncorrelated statistics. Whether there is no index at all, or indexes are created on single columns, the corresponding statistics are uncorrelated statistics.
If you create a composite index (some people call it a compound index), the execution plan's row estimate does not conform to the above algorithm; I am not clear on the specific algorithm in that case.
In that case, too, the estimation methods in SQL Server 2012 and SQL Server 2014 differ; that is something to study another time.
Additional notes on the test results:
The test process must ensure the statistics are complete and the sampling percentage is sufficient; ideally they are based on 100% sampling.
I omitted some details along the way, such as running UPDATE STATISTICS TestStatistics WITH FULLSCAN before the tests to guarantee 100% sampling.
Since the results are matched down to several decimal places, this ideal condition is required; the purpose is to exclude other factors from influencing the test results.
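For completeness, that refresh is simply run against the test table on each instance before taking measurements:

UPDATE STATISTICS TestStatistics WITH FULLSCAN        -- on the SQL Server 2012 instance
UPDATE STATISTICS TestStatistics2014 WITH FULLSCAN    -- on the SQL Server 2014 instance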
Summary:
This article uses simple examples to show how SQL Server calculates row estimates from statistics, and the differences between SQL Server 2012 and SQL Server 2014.
Statistics are central to the choice of execution plan, not only in SQL Server but in other relational databases as well; they are a very important database object.
It can be said that in SQL optimization, statistics and the execution plan associated with them are very important factors, and understanding statistics plays a very important role in performance tuning.
When it comes to statistics on a composite index, the execution plan's row estimates again differ between SQL Server 2012 and SQL Server 2014; that problem is even more interesting, and I will write about it when I have time.
References: Fanr_zh's article http://www.cnblogs.com/Amaranthus/p/3678647.html
and http://msdn.microsoft.com/en-us/library/dn673537.aspx