Parameter Sniffing (Part 4): Handling Nonuniform Data Distribution


We learned earlier that parameter sniffing can be either good or bad. Parameter sniffing becomes a bad thing when the data distribution of a column is uneven. For example, consider the "Status" column in an Orders table with roughly 10 million rows. The column has 7 distinct values, which are distributed as follows:

Status              Number of Rows
Open                314
Pending Approval    561
Approved            28,990
Paid                17,610
Shipped             817,197
Closed              7,922,834
Cancelled           1,032,886

If the plan is compiled while sniffing the value "Open", the optimizer is likely to choose an execution plan with an index seek and a key lookup. That plan is placed in the cache for reuse. When another user then executes the same query for status "Closed", the cached plan is reused, which is likely a disaster, because close to 8 million key lookup operations are now performed.
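
To make the scenario concrete, here is a minimal sketch of how such a procedure and its two executions might look. The procedure name Sales.usp_OrdersByStatus, the column names, and the index are assumptions for illustration, not objects from the original example.

-- Hypothetical sketch: a procedure whose plan is sniffed from the first @Status value.
-- Assumes a Sales.Orders table with a nonclustered index on Status.
CREATE PROCEDURE Sales.usp_OrdersByStatus (@Status AS NVARCHAR(20))
AS
SELECT Id, CustomerId, OrderDate, Status
FROM Sales.Orders
WHERE Status = @Status;
GO

-- First execution compiles a seek + key lookup plan for the 314 "Open" rows ...
EXECUTE Sales.usp_OrdersByStatus @Status = N'Open';
-- ... and the cached plan is then reused for the ~7.9 million "Closed" rows.
EXECUTE Sales.usp_OrdersByStatus @Status = N'Closed';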

Another case where parameter sniffing hurts is when parameters are used in non-equality (range) predicates. Consider the following query:

SELECT Id, CustomerId, TransactionDateTime, StatusId
FROM Billing.Transactions
WHERE TransactionDateTime BETWEEN @FromDateTime AND @ToDateTime
ORDER BY TransactionDateTime ASC;

If the query is compiled while sniffing the values '2014-07-01' and '2014-08-01', the optimizer estimates the number of rows from the statistics, say around 20,000 rows. The plan is created based on that estimate and placed in the cache. Subsequent executions can use completely different parameters. For example, a user executes the query with the range '2012-01-01' to '2014-01-01', which returns about 61,000 rows, but the plan built for the much smaller estimate is reused and is probably not a good plan for this execution.
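
The same effect can be reproduced with a parameterized ad-hoc query. The sketch below uses sys.sp_executesql, whose cached plan is likewise compiled from the first parameter values that are sniffed; the DATETIME2 parameter types are an assumption.

-- Parameterized queries submitted through sp_executesql also get one cached plan,
-- compiled for the first parameter values that are sniffed.
DECLARE @stmt NVARCHAR(MAX) = N'
    SELECT Id, CustomerId, TransactionDateTime, StatusId
    FROM Billing.Transactions
    WHERE TransactionDateTime BETWEEN @FromDateTime AND @ToDateTime
    ORDER BY TransactionDateTime ASC;';

-- Compiled for a one-month range (~20,000 estimated rows) ...
EXECUTE sys.sp_executesql @stmt,
    N'@FromDateTime DATETIME2, @ToDateTime DATETIME2',
    @FromDateTime = '2014-07-01', @ToDateTime = '2014-08-01';

-- ... and the same plan is reused for a two-year range (~61,000 rows per the example).
EXECUTE sys.sp_executesql @stmt,
    N'@FromDateTime DATETIME2, @ToDateTime DATETIME2',
    @FromDateTime = '2012-01-01', @ToDateTime = '2014-01-01';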

So, what can we do when parameter sniffing works against us?

I'll demonstrate several techniques using the following stored procedure:

CREATE PROCEDURE Marketing.usp_CustomersByCountry (@Country AS NCHAR(2))
AS
SELECT Id, Name, LastPurchaseDate
FROM Marketing.Customers
WHERE Country = @Country;
GO

Here is the distribution of the "Country" column:

Country    Number of Rows
BE         70
CL         55
CN         29,956
DK         74
EG         64
IL         72
MT         83
PT         75
TR         63
UK         28,888
US         40,101
VE         78

As you can see, there are 12 distinct values: three of them account for the vast majority of rows, while the rest have only a handful each. This is an extreme example of uneven distribution that you may not often see in a production environment, but it makes the point very clearly ...

Before discussing a workable solution, let's look at the problem ...

Suppose the stored procedure is first executed with the parameter value "IL". The compiled plan contains an index seek on "Country". For this particular execution the plan is excellent: the optimizer estimates 72 rows, which is exactly right.

The next time the stored procedure executes, the parameter is "US". There are 40,101 such rows, and the best plan in this case is a clustered index scan, which avoids a huge number of key lookups. But a plan is already in the cache and is reused. Unfortunately, that plan contains an index seek and key lookups instead of a clustered index scan, and it performs very poorly here. If we look at the properties of the index seek operator, the estimated number of rows is 72 while the actual number is over 40,000; this estimation error is what produces the bad plan. And if we look at the "Parameter List" property of the SELECT operator, we can see why: the compiled value is "IL", while the runtime value is "US".
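
Here is a minimal sketch to reproduce what is described above, using the procedure defined earlier. DBCC FREEPROCCACHE is only there to guarantee a cold compile; avoid running it casually on a shared server.

DBCC FREEPROCCACHE;  -- clears the plan cache to force a fresh compilation (use with care)
GO
EXECUTE Marketing.usp_CustomersByCountry @Country = N'IL';  -- compiles an index seek + key lookup plan (72 rows)
GO
EXECUTE Marketing.usp_CustomersByCountry @Country = N'US';  -- reuses it: estimated 72 rows, actual 40,101
GO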

So now that we've found the problem, let's take a look at the possible solutions ...
Solution #1 – sys.sp_recompile

The system stored procedure sys.sp_recompile is a simple way to remove a specific execution plan from the cache; you can pass it a stored procedure name, or a table or view name, in which case all plans referencing that object are invalidated. Either way, the next time the stored procedure executes it is recompiled and a new execution plan is created.
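
For example, either of the following would do (a small sketch; sys.sp_recompile accepts a stored procedure, trigger, table, or view name):

-- Mark the stored procedure so it is recompiled on its next execution.
EXECUTE sys.sp_recompile N'Marketing.usp_CustomersByCountry';
GO
-- Or mark a table, so every plan referencing it is recompiled on next use.
EXECUTE sys.sp_recompile N'Marketing.Customers';
GO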

Remember that our main problem is the skewed distribution of values. Recompiling the stored procedure just creates a new plan based on whatever parameter values happen to be sniffed next, so most of the time this does not solve the problem: the new plan is again good only for those values and is still a bad plan for values from the other end of the distribution. Recompilation is worth considering only when the vast majority of executions filter on a single value, for example when one status value is used in 99% of executions, so that a plan compiled for that value is good most of the time.

Solution #2 – WITH RECOMPILE

If you don't like gambling, WITH RECOMPILE may be for you. Instead of relying on the parameter values that happened to be passed to one specific execution, this option tells the optimizer to compile a fresh plan on every execution of the stored procedure.

ALTER PROCEDURE Marketing.usp_CustomersByCountry (@Country AS NCHAR(2))
WITH RECOMPILE
AS
SELECT Id, Name, LastPurchaseDate
FROM Marketing.Customers
WHERE Country = @Country;
GO

Parameter sniffing is still used on every execution, which means every execution gets the best plan the optimizer can produce for its specific parameter values. And because a new plan is created each time, SQL Server does not put the plan into the cache at all.

This is a good solution in the sense that every execution of the stored procedure gets an optimal plan, eliminating the random gamble. The disadvantage is that every execution has to go through the expensive, CPU-intensive optimization process. This approach is not appropriate if the system is already under high CPU load and the stored procedure is executed frequently. On the other hand, if CPU usage is relatively low and the stored procedure executes only occasionally, this may be the best solution for you.

Solution #3 – OPTION (RECOMPILE)

OPTION (RECOMPILE) is similar to the previous solution, but there are two important differences. First, it is a query-level hint, so it applies only to the problematic statement instead of the entire stored procedure.

ALTER PROCEDURE Marketing.usp_CustomersByCountry (@Country AS NCHAR(2))
AS
SELECT Id, Name, LastPurchaseDate
FROM Marketing.Customers
WHERE Country = @Country
OPTION (RECOMPILE);
GO

If the stored procedure contains many statements, recompiling only the one that needs it saves a significant amount of resources.

Second, WITH RECOMPILE takes effect at compile time, while OPTION (RECOMPILE) takes effect at run time. When execution reaches the statement, execution pauses, the query is recompiled, and a new plan is generated; the rest of the procedure continues to use the cached plan. The benefit of run-time compilation is that the optimizer already knows all the run-time values, so it doesn't even need parameter sniffing: it knows the values of parameters, local variables, and environment settings, and compiles the query using that information. In most cases, a plan compiled at run time is better than one compiled at compile time.

Therefore, you should generally prefer OPTION (RECOMPILE) over WITH RECOMPILE, because it uses fewer resources to produce a better plan. But be aware that this approach is still CPU-intensive.

Solution #4 – OPTIMIZE FOR

Another query hint, OPTIMIZE FOR, also addresses parameter sniffing. It instructs the optimizer to compile the query as if a specific set of parameter values had been supplied, rather than the values actually sniffed; in effect it overrides parameter sniffing. Note that this hint only has an effect when the query happens to be recompiled; the hint itself does not trigger a recompilation.

ALTER PROCEDURE Marketing.usp_CustomersByCountry (@Country AS NCHAR(2))
AS
SELECT Id, Name, LastPurchaseDate
FROM Marketing.Customers
WHERE Country = @Country
OPTION (OPTIMIZE FOR (@Country = N'US'));
GO

Remember "Sales. Orders table? The execution of 99% uses "Pending Approval" as the parameter. Instead of using sys.sp_recompile (RECOMPILE), in summary, if you want this parameter to be used for the next execution, the use of optimize for will be a better choice for this situation, It also instructs the optimizer to use this parameter whenever the actual parameter is executed at the next execution (as in the previous example, US).

Parameter sniffing can also be suppressed entirely with OPTIMIZE FOR UNKNOWN. This option instructs the optimizer to treat the parameter value as unknown, which effectively disables parameter sniffing for that parameter. If the stored procedure has more than one parameter, you can apply the hint to each parameter separately.

ALTER PROCEDURE Marketing.usp_CustomersByCountry (@Country AS NCHAR(2))
AS
SELECT Id, Name, LastPurchaseDate
FROM Marketing.Customers
WHERE Country = @Country
OPTION (OPTIMIZE FOR (@Country UNKNOWN));
GO

Solution #5 – The best solution

By now you may have noticed that the two goals we are trying to achieve conflict with each other. One is to create an optimal plan for every execution; the other is to minimize wasted resources by minimizing compilations. The WITH RECOMPILE method achieves the first goal, but it pays with a recompilation on every execution. The sys.sp_recompile method, on the other hand, compiles the stored procedure only once, but does not produce the best plan for every execution.

The best solution balances these two conflicting goals. The idea is to separate the parameter values into groups and generate a separate optimized plan for each group. Each plan is compiled only once, and from that point on every execution gets a good plan, because the plan it reuses was generated from parameter values belonging to the same group. A sensible grouping therefore yields one suitable plan per group.

Does that sound like magic? Let's take a look at how this trick is implemented ...

First we need to divide the values into groups. This is the key part, and there are many ways to do it. Here I will split the countries into two groups, common and uncommon: if a country's rows account for more than 1% of the rows in the table, I define it as a common country. Conveniently, SQL Server has effectively done this work already when it built the statistics on the "Country" column, because common values typically show up as steps in the statistics histogram.

So we load the histogram into a "CommonCountries" table and then delete the uncommon countries ...

CREATE TABLE Marketing.CommonCountries
(
    Range_Hi_Key        NCHAR(2) NOT NULL,
    Range_Rows          INT      NOT NULL,
    Eq_Rows             INT      NOT NULL,
    Distinct_Range_Rows INT      NOT NULL,
    Avg_Range_Rows      FLOAT    NOT NULL,
    CONSTRAINT PK_CommonCountries_C_RangeHiKey PRIMARY KEY CLUSTERED (Range_Hi_Key ASC)
);
GO

INSERT INTO Marketing.CommonCountries (Range_Hi_Key, Range_Rows, Eq_Rows, Distinct_Range_Rows, Avg_Range_Rows)
EXECUTE ('DBCC SHOW_STATISTICS (N''Marketing.Customers'', IX_Customers_NC_NU_Country) WITH HISTOGRAM');
GO

DECLARE @RowCount AS INT;
SELECT @RowCount = COUNT(*) FROM Marketing.Customers;
DELETE FROM Marketing.CommonCountries WHERE Eq_Rows < @RowCount * 0.01;
GO

Querying the table then returns the following:

Range_Hi_Key    Range_Rows    Eq_Rows    Distinct_Range_Rows    Avg_Range_Rows
CN              0             29,956     0                      1
UK              0             28,888     0                      1
US              0             40,101     0                      1

It's very clear: these three are the common countries. Of course this is a relatively simple example; a real environment may be more complex, and sometimes you will need to devise an algorithm of your own to separate common from uncommon values. Here I can simply rely on the statistics histogram; you could also use some monitoring mechanism to track actual usage, or build your own statistics collection. In most cases, though, you will need some algorithm to split the values into groups.
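
As a side note, on newer versions (SQL Server 2016 SP1 CU2 and later, if memory serves) the same histogram can be read with the sys.dm_db_stats_histogram DMF instead of DBCC SHOW_STATISTICS; a sketch, assuming the statistics name from the script above:

-- Sketch: read the Country statistics histogram through a DMF (SQL Server 2016 SP1 CU2+).
SELECT  h.range_high_key, h.range_rows, h.equal_rows, h.distinct_range_rows, h.average_range_rows
FROM    sys.stats AS s
CROSS APPLY sys.dm_db_stats_histogram(s.object_id, s.stats_id) AS h
WHERE   s.object_id = OBJECT_ID(N'Marketing.Customers')
    AND s.name = N'IX_Customers_NC_NU_Country';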

We can then use this grouping to generate separate optimized plans. The approach requires creating separate stored procedures that are identical except for their names.

In this example, I create two stored procedures, Marketing.usp_CustomersByCountry_Common and Marketing.usp_CustomersByCountry_Uncommon, as follows:

CREATE PROCEDURE Marketing.usp_CustomersByCountry_Common (@Country AS NCHAR(2))
AS
SELECT Id, Name, LastPurchaseDate
FROM Marketing.Customers
WHERE Country = @Country;
GO

CREATE PROCEDURE Marketing.usp_CustomersByCountry_Uncommon (@Country AS NCHAR(2))
AS
SELECT Id, Name, LastPurchaseDate
FROM Marketing.Customers
WHERE Country = @Country;
GO

Next we modify the original stored procedure so that it becomes a router. Its job is to inspect the parameter value and, based on the group the value belongs to, decide which of the two stored procedures to execute.

ALTER PROCEDURE Marketing.usp_CustomersByCountry (@Country AS NCHAR(2))
AS
IF EXISTS (SELECT NULL FROM Marketing.CommonCountries WHERE Range_Hi_Key = @Country)
BEGIN
    EXECUTE Marketing.usp_CustomersByCountry_Common @Country = @Country;
END
ELSE
BEGIN
    EXECUTE Marketing.usp_CustomersByCountry_Uncommon @Country = @Country;
END;
GO

This is a beautiful solution:

The first time a common country is passed as a parameter, the routing procedure calls the "common" procedure, and its plan is compiled and cached. Thanks to parameter sniffing, that plan is reused for every subsequent execution of the "common" procedure, and it is a good plan for all of them. The same happens, with its own plan, for the "uncommon" procedure and the uncommon countries ...

So we get a good plan for every parameter value, and each plan is compiled only once. Typically there are only 2 or 3 groups of values, so there are at most 2 or 3 compilations. This is the essence of the magic.

Disadvantages:

Of course this solution is not free; you have to consider its maintenance cost. Once the data distribution changes, the grouping has to be revisited and adjusted. In the example above, the table of common countries has to be rebuilt from time to time.
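
A simple way to handle that maintenance, under the same assumptions as the script above, is to rebuild the table periodically, for example from a scheduled job:

-- Rebuild Marketing.CommonCountries from the current statistics histogram
-- (e.g. run nightly from a SQL Server Agent job).
TRUNCATE TABLE Marketing.CommonCountries;

INSERT INTO Marketing.CommonCountries (Range_Hi_Key, Range_Rows, Eq_Rows, Distinct_Range_Rows, Avg_Range_Rows)
EXECUTE ('DBCC SHOW_STATISTICS (N''Marketing.Customers'', IX_Customers_NC_NU_Country) WITH HISTOGRAM');

DECLARE @RowCount AS INT = (SELECT COUNT(*) FROM Marketing.Customers);
DELETE FROM Marketing.CommonCountries WHERE Eq_Rows < @RowCount * 0.01;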

Summary:

Parameter sniffing can be a good thing or a bad thing. Since SQL Server uses it by default, we should take advantage of it whenever it helps. Our aim is to recognize how parameter sniffing behaves in different scenarios and then apply the methods described in this article to fix the cases where it hurts.

In future posts, I will pick specific production problems to show various parameter sniffing issues and how to handle the problems they cause.
