This article discusses how SQL Server calculates the estimated number of rows (cardinality estimation).


Regarding cardinality estimation in SQL Server 2014, the official white paper Optimizing Your Query Plans with the SQL Server 2014 Cardinality Estimator covers the topic in great detail, but it is only available in English and probably few people have read it carefully. So how does SQL Server 2014 calculate the estimated number of rows, and what rules does it follow? Let's look at some examples. The test cases below are for reference only; if there are gaps or superficial conclusions, please point them out!

The environment for the following experiments is SQL Server 2014 SP2 Standard Edition (64-bit), version 12.0.5000.0. If a different version is used for a particular test, it will be noted. First, create a test table and insert some test data to support the tests that follow, as shown below.

IF EXISTS(SELECT 1 FROM sys.objects WHERE type='U' AND name='TEST_ESTIMATED_ROW')
BEGIN
    DROP TABLE TEST_ESTIMATED_ROW;
END

IF NOT EXISTS(SELECT 1 FROM sys.objects WHERE type='U' AND name='TEST_ESTIMATED_ROW')
BEGIN
    CREATE TABLE TEST_ESTIMATED_ROW
    (
        ID   INT,
        NAME VARCHAR(24)
    )
END
GO

DECLARE @Index INT = 1;
WHILE @Index <= 100
BEGIN
    INSERT INTO TEST_ESTIMATED_ROW VALUES(10, 'id is 10');
    SET @Index += 1;
END
GO

DECLARE @Index INT = 1;
WHILE @Index <= 200
BEGIN
    INSERT INTO TEST_ESTIMATED_ROW VALUES(20, 'id is 20');
    SET @Index += 1;
END
GO

DECLARE @Index INT = 1;
WHILE @Index <= 300
BEGIN
    INSERT INTO TEST_ESTIMATED_ROW VALUES(30, 'id is 30');
    SET @Index += 1;
END
GO

DECLARE @Index INT = 1;
WHILE @Index <= 400
BEGIN
    INSERT INTO TEST_ESTIMATED_ROW VALUES(40, 'id is 40');
    SET @Index += 1;
END
GO

DECLARE @Index INT = 1;
WHILE @Index <= 500
BEGIN
    INSERT INTO TEST_ESTIMATED_ROW VALUES(50, 'id is 50');
    SET @Index += 1;
END
GO

CREATE INDEX IX_TEST_ESTIMATED_ROW_N1 ON TEST_ESTIMATED_ROW(ID);
GO

Let's take a look at the statistics and histogram for this table. Given the insert script above, the statistics should show 1500 rows sampled, an All density of 0.2 (there are five distinct ID values), and a histogram with five steps whose RANGE_HI_KEY values are 10 through 50 and whose EQ_ROWS values are 100, 200, 300, 400, and 500. These are the numbers the calculations below rely on.

DBCC SHOW_STATISTICS ('dbo.TEST_ESTIMATED_ROW','IX_TEST_ESTIMATED_ROW_N1');
GO

SQL Server has two types of predicates: filter predicates and join predicates. Let's first look at the estimated number of rows for a filter predicate. To keep the tests fair and free from interference by previously cached plans or buffered data, you can run the following DBCC commands before each test, for example:

DBCC FREEPROCCACHE;    -- remove all cached execution plans from the plan cache
GO
DBCC DROPCLEANBUFFERS; -- remove all clean buffers from the buffer pool
GO
SELECT * FROM dbo.TEST_ESTIMATED_ROW WHERE ID = 10;
GO

(Note: enable the Include Actual Execution Plan option before executing.) As shown above, the Estimated Number of Rows is 100, which matches the actual number of rows. The same holds if you use the other values 20, 30, 40, and 50: the estimated number of rows equals the actual number of rows (the results in SQL Server 2012 are the same). What if we use a value that does not exist? What is the estimated number of rows then?

SELECT * FROM dbo.TEST_ESTIMATED_ROW WHERE ID = 4;
GO

As shown above, the Estimated Number of Rows is 1, and it stays 1 for any other non-existent value. This matches the behavior of the legacy cardinality estimator: for a value outside the range covered by the histogram, the estimate is 1 row. Likewise, for a value that falls inside the histogram range but does not actually exist, the estimate is also 1 row. What if we use a variable instead of a literal? What is the Estimated Number of Rows then?

DECLARE @SID INT = 11; -- any value
SELECT * FROM dbo.TEST_ESTIMATED_ROW WHERE ID = @SID;
GO

As shown above, the Estimated Number of Rows in the actual execution plan is 300. How is this value calculated? It follows a simple formula; if you are skeptical, you can construct a variety of cases to test and verify it.

[Rows Sampled] * [All density] = 1500 * 0.2 = 300, that is, the total number of rows sampled in the statistics multiplied by the All density value (the density of each column prefix in the statistics object).
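Expressed in the same SELECT style used for the other calculations in this article (1500 and 0.2 are the Rows Sampled and All density values from the DBCC SHOW_STATISTICS output above):

SELECT 1500 * 0.2;  -- [Rows Sampled] * [All density] = 300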

If you add OPTION (RECOMPILE), the Estimated Number of Rows changes to 1.

DECLARE @SID INT = 11;
SELECT * FROM dbo.TEST_ESTIMATED_ROW WHERE ID = @SID OPTION (RECOMPILE);
GO

If you assign @SID the value 20 and add OPTION (RECOMPILE), the Estimated Number of Rows becomes the EQ_ROWS value for that histogram step.

DECLARE @SID INT = 20;
SELECT * FROM dbo.TEST_ESTIMATED_ROW WHERE ID = @SID OPTION (RECOMPILE);
GO

Next, modify the SQL statement so that the query condition uses a greater-than comparison instead of an equality, as shown below:

DECLARE @SID INT = 11;
SELECT * FROM dbo.TEST_ESTIMATED_ROW WHERE ID > @SID;
GO

As shown above, the Estimated Number of Rows is 450. How is this value calculated?

Calculation formula: [Rows Sampled] * 0.3 (30%)

1500 * 0.3 = 450

Someone will ask: how do you know it is [Rows Sampled] * 0.3, rather than just reverse-engineering from one result? Admittedly this is partly inference (there is plenty of material online confirming that 0.3 is used as a constant in this formula), but many tests confirm it really is 30%. For example, if you assign @SID the value 41, the Estimated Number of Rows is still 450. If you suspect a cached plan is responsible, clear the plan cache first; the result is the same. In my tests, no matter what value you assign to @SID, the Estimated Number of Rows stays 450.

DBCC FREEPROCCACHE;
GO
DBCC DROPCLEANBUFFERS;
GO
DECLARE @SID INT = 41;
SELECT * FROM dbo.TEST_ESTIMATED_ROW WHERE ID > @SID;
GO

 

If OPTION (RECOMPILE) is added to the statement and @SID is assigned a value that appears in RANGE_HI_KEY, how is the Estimated Number of Rows calculated?

DECLARE @SID INT = 20;
SELECT * FROM dbo.TEST_ESTIMATED_ROW WHERE ID > @SID OPTION (RECOMPILE);
GO


The resulting estimate of 1200 is calculated like this: the RANGE_HI_KEY steps greater than 20 are 30, 40, and 50, and their corresponding EQ_ROWS values sum to 300 + 400 + 500 = 1200. If you don't believe it, test it yourself: assign @SID the value 30 and the Estimated Number of Rows becomes 900.
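If you would rather not add the histogram steps by hand, here is a small sketch that sums EQ_ROWS for the steps above 20. The #Histogram temp table name and column list are my own; the columns follow the standard order of the WITH HISTOGRAM output.

CREATE TABLE #Histogram
(
    RANGE_HI_KEY        INT,
    RANGE_ROWS          FLOAT,
    EQ_ROWS             FLOAT,
    DISTINCT_RANGE_ROWS FLOAT,
    AVG_RANGE_ROWS      FLOAT
);

-- Capture the histogram rows into the temp table
INSERT INTO #Histogram
EXEC ('DBCC SHOW_STATISTICS (''dbo.TEST_ESTIMATED_ROW'', ''IX_TEST_ESTIMATED_ROW_N1'') WITH HISTOGRAM');

SELECT SUM(EQ_ROWS) AS EstimatedRows   -- 300 + 400 + 500 = 1200
FROM #Histogram
WHERE RANGE_HI_KEY > 20;

DROP TABLE #Histogram;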

Now let's modify the query again. For example, what happens to the Estimated Number of Rows for a range query?

DBCC FREEPROCCACHE;
GO
DBCC DROPCLEANBUFFERS;
GO
DECLARE @Min_Value INT = 20;
DECLARE @Max_Value INT = 50;
SELECT * FROM dbo.TEST_ESTIMATED_ROW WHERE ID > @Min_Value AND ID < @Max_Value;
GO

As shown above, where does the Estimated Number of Rows of 246.475 come from? It is calculated as follows:

Selectivity of the most selective predicate * SQRT(selectivity of the second most selective predicate) * total number of rows

SELECT 0.3 * SQRT(0.3) * 1500;  -- 246.475150877325 (0.3 is the constant used in the calculation rule)

If you execute the same statement on SQL Server 2012, or use trace flag 9481 to disable the new cardinality estimator so that the optimizer uses the legacy one, you will find that the Estimated Number of Rows is 135, as follows:

DBCC FREEPROCCACHE;
GO
DBCC DROPCLEANBUFFERS;
GO
DECLARE @Min_Value INT = 20;
DECLARE @Max_Value INT = 50;
SELECT * FROM dbo.TEST_ESTIMATED_ROW WHERE ID > @Min_Value AND ID < @Max_Value
OPTION (QUERYTRACEON 9481);
GO


The formula here is:

(Estimated number of rows for the first predicate * Estimated number of rows for the second predicate) / total number of rows

(0.3 * 1500) * (0.3 * 1500) / 1500 = 0.09 * 1500 = 135
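In the same SELECT style used above (0.3 is the same guessed constant), the legacy-estimator calculation can be checked directly:

SELECT (0.3 * 1500) * (0.3 * 1500) / 1500;  -- = 135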

Now insert 50 more records into the TEST_ESTIMATED_ROW table, a volume that does not trigger a statistics update. The value ID = 55 is greater than the maximum RANGE_HI_KEY in the histogram (50); in other words, the newly inserted data is not represented in the histogram. This is known as the ascending key problem, and it shows up when you query the new data before the statistics are updated.

DECLARE @Index INT = 1;
WHILE @Index <= 50
BEGIN
    INSERT INTO TEST_ESTIMATED_ROW VALUES(55, 'id is 50');
    SET @Index += 1;
END
GO
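As a side check (a sketch, not part of the original test), the modification counter exposed by sys.dm_db_stats_properties (available from SQL Server 2008 R2 SP2 / 2012 SP1 onward, so present on this 2014 SP2 instance) can be used to confirm that the inserted rows have not yet triggered an automatic statistics update:

SELECT sp.last_updated,
       sp.[rows],
       sp.rows_sampled,
       sp.modification_counter   -- should show the 50 rows inserted above as pending modifications
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('dbo.TEST_ESTIMATED_ROW')
  AND s.name = 'IX_TEST_ESTIMATED_ROW_N1';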

Now let's look at the Estimated Number of Rows for the following statement:

DBCC FREEPROCCACHE;
GO
DBCC DROPCLEANBUFFERS;
GO
SELECT * FROM dbo.TEST_ESTIMATED_ROW WHERE ID = 55;
GO


So how is the Estimated Number of Rows of 39.37 calculated? According to the white paper Optimizing Your Query Plans with the SQL Server 2014 Cardinality Estimator, the formula in this case is [All density] * [Rows Sampled]. However, my tests do not reproduce that result here. Let's first test the example from the white paper itself (note that the instance is SQL Server 2014 and the compatibility level of AdventureWorks2012 is 120) and check whether the example in the document holds.

SELECT [SalesOrderID], [OrderDate]
FROM Sales.[SalesOrderHeader]
WHERE [OrderDate] = '2005-07-01 00:00:00.000';

SELECT [s].[object_id],
       [s].[name],
       [s].[auto_created]
FROM sys.[stats] AS s
INNER JOIN sys.[stats_columns] AS [sc]
    ON [s].[stats_id] = [sc].[stats_id]
   AND [s].[object_id] = [sc].[object_id]
WHERE [s].[object_id] = OBJECT_ID('Sales.SalesOrderHeader')
  AND COL_NAME([s].[object_id], [sc].[column_id]) = 'OrderDate';

The statistics object on OrderDate is _WA_Sys_00000003_4B7734FF.

 

DBCC SHOW_STATISTICS ('Sales.SalesOrderHeader', _WA_Sys_00000003_4B7734FF);

When the statistics were last updated, the maximum RANGE_HI_KEY value captured in the histogram was the most recent OrderDate at that time. We then insert 50 records, a volume that does not trigger a statistics update.

INSERT INTO Sales.[SalesOrderHeader]
    ( [RevisionNumber], [OrderDate],
      [DueDate], [ShipDate], [Status],
      [OnlineOrderFlag],
      [PurchaseOrderNumber],
      [AccountNumber], [CustomerID],
      [SalesPersonID], [TerritoryID],
      [BillToAddressID], [ShipToAddressID],
      [ShipMethodID], [CreditCardID],
      [CreditCardApprovalCode],
      [CurrencyRateID], [SubTotal],
      [TaxAmt], [Freight], [Comment] )
VALUES
    ( 3, '2014-02-02 00:00:00.000', '5/1/2014', '4/1/2014', 5, 0, 'SO43659', 'PO522145787',
      29825, 279, 5, 985, 985, 5, 21, 'Vi84182', NULL, 250.00,
      25.00, 10.00, '' );
GO 50 -- INSERT 50 rows, representing very recent data, with a current OrderDate value

Next, turn on trace flag 9481, and you will find that the estimated number of rows for the query below is 1. This is because the optimizer falls back to the legacy cardinality estimator.

SELECT [SalesOrderID], [OrderDate]
FROM Sales.[SalesOrderHeader]
WHERE [OrderDate] = '2014-02-02 00:00:00.000'
OPTION (QUERYTRACEON 9481); -- CardinalityEstimationModelVersion 70

When the trace flag is removed and the database uses the new cardinality estimator, the estimated number of rows becomes 27.9938.

DBCC FREEPROCCACHE;
GO
DBCC DROPCLEANBUFFERS;
GO
SELECT [SalesOrderID], [OrderDate]
FROM Sales.[SalesOrderHeader]
WHERE [OrderDate] = '2014-02-02 00:00:00.000';

31465 * 0.0008896797 ~= 27.9937717605 ~= 27.9938 (after rounding); that is, [Rows Sampled] * [All density] from the OrderDate statistics.
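In SELECT form, where 31465 is the Rows Sampled and 0.0008896797 the All density shown in the statistics above:

SELECT 31465 * 0.0008896797;  -- ~= 27.9938 ([Rows Sampled] * [All density])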

So the example in the white paper checks out, but I still do not know how the estimate in my earlier example (39.37) is calculated. I have some guesses, but none of them hold up against other examples. I do not know whether the white paper is imprecise, whether the cardinality estimation (CE) algorithm has since been adjusted, or whether it is simply more complicated than described. I have been stuck on this question for two days without figuring it out! While testing, I also noticed a new issue: after data is added to the table, do the earlier tests still return the same estimates? The answer is no. See below:

SELECT * FROM dbo.TEST_ESTIMATED_ROW WHERE ID = 10;
GO

How did the estimate change from 100 to 103.333? My guess is the following calculation (shown below).

SELECT 1550 * (100.0/1500); -- ~= 103.332300

In other words, the ascending key problem also affects the estimates for existing values. The examples so far cover the Estimated Number of Rows for simple statements; real-world SQL is much more complex. For instance, how does the cardinality estimator handle multiple filter predicates? Since my earlier table is too simple to demonstrate this well, we will use the example from Optimizing Your Query Plans with the SQL Server 2014 Cardinality Estimator:

USE [AdventureWorks2012];
GO
SELECT [AddressID],
       [AddressLine1],
       [AddressLine2]
FROM Person.[Address]
WHERE [StateProvinceID] = 9
  AND [City] = N'Burbank'
  AND [PostalCode] = N'91502'
OPTION (QUERYTRACEON 9481); -- CardinalityEstimationModelVersion 70
GO

As shown below, the statistics corresponding to the filter predicates [StateProvinceID], [City], and [PostalCode] are IX_Address_StateProvinceID, _WA_Sys_00000004_164452B1, and _WA_Sys_00000006_164452B1 respectively.

SELECT [s].[object_id],
       [s].[name],
       [s].[auto_created],
       COL_NAME([s].[object_id], [sc].[column_id]) AS [col_name]
FROM sys.[stats] AS s
INNER JOIN sys.[stats_columns] AS [sc]
    ON [s].[stats_id] = [sc].[stats_id]
   AND [s].[object_id] = [sc].[object_id]
WHERE [s].[object_id] = OBJECT_ID('Person.Address');


DBCC SHOW_STATISTICS ('Person.Address', _WA_Sys_00000004_164452B1); -- City


SELECT 196.0/19614; -- ~= 0.0099928 (City predicate selectivity)

DBCC SHOW_STATISTICS ('Person.Address', IX_Address_StateProvinceID); -- StateProvinceID


SELECT 4564.0/19614; -- ~= 0.2326909 (StateProvinceID predicate selectivity)

DBCC SHOW_STATISTICS ('Person.Address', _WA_Sys_00000006_164452B1); -- PostalCode


SELECT 194.0/19614; -- ~= 0.0098908 (PostalCode predicate selectivity)

From SQL Server 7.0 through SQL Server 2012, when two or more predicates in the query condition are combined with AND, the product of the individual predicate selectivities Si is used as the overall selectivity of the estimate:

(S1 * S2 * S3 * ... * Sn) * (Rows Sampled)

SELECT 0.0098908 *   -- PostalCode predicate selectivity
       0.0099928 *   -- City predicate selectivity
       0.2326909 *   -- StateProvinceID predicate selectivity
       19614;        -- Table cardinality

The result is 0.451091024458953138624, which is less than one row, so the query optimizer uses the minimum estimate of 1 row. Next, let's see how the new cardinality estimator in SQL Server 2014 calculates the estimate.

SELECT [AddressID],
       [AddressLine1],
       [AddressLine2]
FROM Person.[Address]
WHERE [StateProvinceID] = 9
  AND [City] = N'Burbank'
  AND [PostalCode] = N'91502';
GO

So how does the new cardinality estimator in SQL Server 2014 arrive at an Estimated Number of Rows of 13.4692? It uses an exponential back-off formula for the combined selectivity: with the predicates ordered from most to least selective (p0 <= p1 <= p2 <= p3), the combined selectivity is p0 * SQRT(p1) * SQRT(SQRT(p2)) * SQRT(SQRT(SQRT(p3))), as in the following calculation.


SELECT 0.0098908 *              -- PostalCode predicate selectivity
       SQRT(0.0099928) *        -- City predicate selectivity
       SQRT(SQRT(0.2326909)) *  -- StateProvinceID predicate selectivity
       19614;                   -- Table cardinality

The result is 13.4690212669225, which is close to but not exactly the 13.4692 shown in the plan. Comparing with the SQL below shows that the difference comes simply from the number of decimal digits kept for each selectivity and from rounding; I do not know exactly how many digits the optimizer keeps internally.
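A sketch of that comparison, using the rounded selectivities from above versus selectivities computed at full float precision from the same histogram counts (194, 196, and 4564 out of 19614 rows):

-- Rounded selectivities, as used above
SELECT 0.0098908 * SQRT(0.0099928) * SQRT(SQRT(0.2326909)) * 19614;  -- ~= 13.4690

-- Selectivities computed from the histogram counts at full float precision
SELECT CAST(194 AS FLOAT)/19614
     * SQRT(CAST(196 AS FLOAT)/19614)
     * SQRT(SQRT(CAST(4564 AS FLOAT)/19614))
     * 19614;                                                        -- ~= 13.4692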

So how is OR selectivity calculated? Let's first look at how the legacy cardinality estimator handles it, as in the following example:

USE [AdventureWorks2012];
GO
SELECT [AddressID],
       [AddressLine1],
       [AddressLine2]
FROM Person.[Address]
WHERE ([StateProvinceID] = 9 OR [City] = N'Burbank')
  AND [PostalCode] = N'91502'
OPTION (QUERYTRACEON 9481); -- CardinalityEstimationModelVersion 70


0.0098908 -- PostalCode predicate selectivity
0.0099928 -- City predicate selectivity
0.2326909 -- StateProvinceID predicate selectivity

The formula for an OR of two predicates is (S1 + S2) - (S1 * S2). Plugging in the City and StateProvinceID selectivities, the calculated value is:

(0.0099928 + 0.2326909) - (0.0099928 * 0.2326909) ~= 0.24035846637448 

This OR result is then combined with the PostalCode predicate using the pre-SQL Server 2014 AND rule, where the selectivity is S1 * S2:

0.0098908 * ((0.0099928 + 0.2326909) - (0.0099928 * 0.2326909)) ~= 0.002377337519216706784

The final calculation result is as follows:

0.002377337519216706784 * 19614 ~= 46.629098101916486861376 ~= 46.6296 (the small discrepancy is caused by decimal precision and rounding)
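The same legacy-estimator OR calculation expressed as a single SELECT, in the style used throughout this article:

SELECT 0.0098908 * ((0.0099928 + 0.2326909) - (0.0099928 * 0.2326909)) * 19614;  -- ~= 46.629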

Let's take a look at the formula for calculating OR Selectivity in SQL Server 2014.

USE [AdventureWorks2012];
GO
SELECT [AddressID],
       [AddressLine1],
       [AddressLine2]
FROM Person.[Address]
WHERE ([StateProvinceID] = 9 OR [City] = N'Burbank')
  AND [PostalCode] = N'91502';

How is the Estimated Number of Rows calculated here? The computation below follows Paul White's blog.

0.0098908 -- PostalCode predicate selectivity
0.0099928 -- City predicate selectivity
0.2326909 -- StateProvinceID predicate selectivity

A OR B = NOT ((NOT A) AND (NOT B)); that is, A OR B is logically equivalent to NOT ((NOT A) AND (NOT B)).

Using that equivalence, the Estimated Number of Rows works out to 94.3525, slightly different from the 94.3515 shown in the plan (again because of floating point precision and rounding):

SELECT 1 - (1 - 0.2326909) * SQRT(1 - 0.0099928);                           -- ~= 0.236534308898679
SELECT 0.009891 * SQRT(1 - (1 - 0.2326909) * SQRT(1 - 0.0099928)) * 19614;  -- ~= 94.3525070823501 ~= 94.3515

These are some preliminary explorations of how SQL Server's cardinality estimation (CE) calculates the estimated number of rows. Some of the questions I ran into remain unresolved, which is a bit of a pity, but digging into these rules through testing has been very interesting.

