SQL Server Optimization --- Statistics Maintenance Strategy



First, a concept: what are statistics?
Simply put, statistics describe the distribution of data in certain columns, so that SQL Server has a rough idea of how many rows a query is expected to touch and can use that estimate to generate a reasonable execution plan.

By default, statistics are updated when:
1. The table goes from 0 rows to 1 or more rows
2. A table with fewer than 500 rows accumulates 500 or more row modifications
3. A table with more than 500 rows accumulates more than 500 + 20% of the table's row count in modifications

Apart from the defaults, other operations also refresh existing statistics (including but not limited to the following three; I may not remember them all):
1. Rebuilding or reorganizing indexes (REBUILD/REORGANIZE)
2. Explicitly running UPDATE STATISTICS
3. Running sp_updatestats at the database level
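
For reference, one way to see when each statistic on a table was last updated, how many rows were sampled, and how many rows have been modified since. This is a minimal sketch, assuming SQL Server 2008 R2 SP2 / 2012 SP1 or later (where sys.dm_db_stats_properties is available); the table name is a placeholder:

SELECT  s.name                  AS stats_name,
        sp.last_updated,
        sp.rows,
        sp.rows_sampled,
        sp.modification_counter
FROM    sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE   s.object_id = OBJECT_ID('dbo.YourTable');   -- hypothetical table name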

Now for the problem:

For large tables, the update policy is that the amount of changed data must exceed 500 + 20% of the table's row count.
For example, for a table with 10 million rows, more than 500 + 10,000,000 × 20% = 2,000,500 rows must change before a statistics update is triggered.
In most cases this is not acceptable. Why? Because under this rule the threshold for triggering a statistics update is so large that some statistics go unrefreshed for a long time.
Unreasonable execution plans caused by stale statistics are not uncommon in real workloads, so it becomes necessary to update statistics manually.

At the same time, relying solely on SQL Server to update statistics is not always reliable, because the number of rows sampled for the statistic also matters.
SQL Server's default sample size (the sampling used when no percentage is specified, or when SQL Server updates statistics automatically)
appears to be capped at around 1,000,000 rows (not certain, just an observation), and for a table with tens of millions of rows that sampling ratio is very low.
For example, on a table with more than 300 million rows, UPDATE STATISTICS without a specified sampling percentage sampled only about 840,000 rows.
Based on my observation, for small tables, say no more than 5 million rows, the default sampling is fine; for larger tables, say more than 5 million rows (of course this 5-million-row figure is also a reference value, not an absolute),
the default sampling is simply unable to describe the data distribution accurately.
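
To check how heavily a given statistic was actually sampled, the STAT_HEADER output of DBCC SHOW_STATISTICS shows Rows and Rows Sampled (table and statistic names below are placeholders):

DBCC SHOW_STATISTICS ('dbo.YourBigTable', 'IX_YourIndex') WITH STAT_HEADER;
-- Compare the Rows and Rows Sampled columns: a very low sampled-to-total ratio
-- is exactly the situation described above.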

As a result, manual intervention to update statistics becomes necessary. So is there a fixed recipe for updating index statistics? The answer is no.

Let's first look at the ways statistics can be updated.

1. Rebuild/reorganize indexes
Of course, rebuilding or reorganizing an index only updates the index's statistics as a side effect; its main purpose is to defragment the index.
For large tables the cost is considerable, and there is no one-size-fits-all maintenance strategy.
For smaller databases or smaller tables, say tables with a few hundred thousand rows, a daily index rebuild is feasible,
but that experience probably does not transfer well to a larger database (just as a celebrity's recipe for success cannot be copied; everyone lives in a different environment and you cannot generalize).
Rebuilding or reorganizing indexes is quite expensive in resources and time, and in some situations you will simply not get the chance to do it.
For example, rebuilding a single composite index on one table took 5 minutes,
and it is normal for a complex business table to have three to five such indexes.
At that rate, if you did this for every table in the whole database, or for the ten-plus databases on an instance with hundreds of tables each, how long would it take? The cost is easy to imagine.
You probably could not even finish within the maintenance window; unless the database is small (what counts as small? Personally I'd put it at roughly 100 GB), it is not feasible.

Therefore, it can be considered that updating index statistics by rebuilding or reorganizing indexes costs too much; it is basically unrealistic.
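
For completeness, these are the two statements being discussed (index and table names are placeholders):

ALTER INDEX IX_YourIndex ON dbo.YourTable REBUILD;    -- rebuilds the index; refreshes its statistics with the equivalent of a full scan
ALTER INDEX ALL ON dbo.YourTable REORGANIZE;          -- lighter-weight: compacts the leaf level in place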

  


2. UPDATE STATISTICS

This is what I want to focus on. I won't go into the detailed syntax here;
in short, there are several options:
the default way, a full-table-scan (FULLSCAN) update, and a specified sampling percentage, as follows:

-- Update all statistics on the table with the default sampling
UPDATE STATISTICS TableName
-- Update the specified statistic, sampling with a full table scan
UPDATE STATISTICS TableName (index_or_statistics_name) WITH FULLSCAN
-- Update the specified statistics, using a specified sampling percentage
UPDATE STATISTICS TableName (index_or_statistics_name1, index_or_statistics_name2) WITH SAMPLE N PERCENT   -- N = sampling percentage

Compared to rebuilding or reorganizing an index, UPDATE STATISTICS also obtains the data distribution by scanning data pages (index pages), but it does not move data (index) pages around.
That is where UPDATE STATISTICS (even at 100% sampling, i.e. WITH FULLSCAN) is cheaper than an index rebuild.
The key is the third form: a user-specified sampling percentage. If the sampling percentage is 100, it is the same as FULLSCAN.
If it is not 100, say 80, 60, 50, or 30, how do you choose? The higher the sampling percentage, the more accurate the statistics, but the higher the cost; the lower the sampling, the cheaper the update, but the greater the chance of error. You have to find a balance.
So how much should you sample, so that updating the statistics stays acceptably efficient while the statistics still describe the data distribution reasonably accurately?
This is a question that still needs careful choice. Why? Reference: http://www.cnblogs.com/wy123/p/5875237.html
If the sampling percentage is too low, the accuracy of the statistics suffers.
If it is too aggressive, for example a FULLSCAN,
then for reference, one table took 50 minutes to update (admittedly a large table with several index statistics and non-index statistics). If there are dozens of similar tables, the efficiency is easy to imagine.
In short, there is no fixed recipe. With a small database it hardly matters how you do it; with a large database plus a limited maintenance window, you have to weigh the quality of the statistics against maintenance efficiency.
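
To measure the trade-off on your own hardware, a simple sketch is to time a few sampling levels against the same statistic (the table and statistic names are placeholders; run this in a test or maintenance window):

SET STATISTICS TIME ON;
UPDATE STATISTICS dbo.YourBigTable (IX_YourIndex) WITH SAMPLE 30 PERCENT;
UPDATE STATISTICS dbo.YourBigTable (IX_YourIndex) WITH SAMPLE 60 PERCENT;
UPDATE STATISTICS dbo.YourBigTable (IX_YourIndex) WITH FULLSCAN;
SET STATISTICS TIME OFF;
-- Compare elapsed times, then check Rows Sampled with DBCC SHOW_STATISTICS as shown earlier.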

  

3. Database-level sp_updatestats

Usage:
EXEC sp_updatestats
or
EXEC sp_updatestats @resample = 'resample'

The 'resample' argument tells sp_updatestats to run UPDATE STATISTICS with the RESAMPLE option, i.e. to reuse each statistic's most recent sampling rate.

RESAMPLE can be useful in special cases where a query plan based on default sampling is not optimal.
In most cases you do not need to specify it,
because by default the query optimizer samples as required and determines a statistically significant sample size in order to create high-quality query plans.

If 'resample' is not specified, sp_updatestats updates the statistics using the default sampling.
The default is NO.

Running EXEC sp_updatestats directly updates statistics with the default sampling density.
What exactly that default is, MSDN only says that the query optimizer samples "as required"; I assume the sampling algorithm is not too crude.
At present I do not know the exact algorithm or sampling method; if any reader knows, please do not hesitate to enlighten me, thank you.

4. Trace flag 2371

After trace flag 2371 is enabled, the threshold for statistics updates changes dynamically with the size of the table,
breaking the fixed rule that a table with more than 500 rows triggers an update only after more than 500 + 20% of its rows have changed.
Reference: https://blogs.msdn.microsoft.com/saponsqlserver/2011/09/07/changes-to-automatic-update-statistics-in-sql-server-traceflag-2371/

From the chart in that post you can see how the new formula works. For small tables the threshold is still around 20%,
and the dynamic rule only kicks in once the table has more than 25,000 rows.
As the number of rows in the table increases, the percentage of changes needed to trigger a statistics update keeps dropping.
For example, for a table with 100,000 rows, the threshold for triggering a statistics update drops to about 10%,
and for a 1,000,000-row table it drops to about 3.2%.

  

For tables with 10,000,000 or 50,000,000 rows, the threshold for triggering a statistics update falls to around 1% or 0.5%,
and for a 100,000,000-row table, a change of only about 0.31% is enough to trigger an update.
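
The commonly documented approximation for the trace flag 2371 threshold on larger tables is SQRT(1000 * rows). That exact formula is my addition, not something quoted in this article, but it reproduces the percentages above:

SELECT  n                              AS table_rows,
        SQRT(1000.0 * n)               AS approx_threshold_rows,
        100.0 * SQRT(1000.0 * n) / n   AS approx_threshold_percent
FROM (VALUES (100000), (1000000), (10000000), (50000000), (100000000)) AS t(n);
-- Returns roughly 10%, 3.2%, 1%, 0.45% and 0.32% respectively.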

Personally, though, I don't think this approach is necessarily sufficient on its own: although trace flag 2371 lowers the threshold for triggering statistics updates, the sampling percentage is still a problem.
I used to have the misconception of judging statistics only by their last update time (related to the fact that the databases and tables I had dealt with were too small).
For a statistic, being updated recently does not mean it is accurate; you also have to look at the number of sampled rows as a percentage of the total row count.

How do I effectively maintain index statistics?

As said above, to obtain relatively accurate statistics you have to pay attention to the sampling percentage when updating them.
For small tables, triggering updates at the default change thresholds, or updating with 100% sampling, is no problem.
For large tables, you have to consider updating statistics manually before the default trigger threshold is reached, but 100% sampling on a large table is usually unrealistic (for performance reasons).
The higher the sampling percentage, the more accurate the statistics, but also the higher the cost, so a balance has to be found. So how should statistics on big tables be updated?
If you decide to intervene in the generation of statistics yourself, you need to consider two factors: first, after how much data change should you update? Second, when updating, what sampling rate should you use?
We know that the number of modifications to a table is recorded in the rowmodctr column of the sys.sysindexes system view;
after the statistics on the table are updated, this column is reset to 0 and then starts accumulating the table's modifications again.

  

This information is very useful and provides an important basis for manually updating statistics.
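
A minimal sketch of reading that counter (the table name is a placeholder; on newer versions sys.dm_db_stats_properties exposes a similar modification_counter column):

SELECT  OBJECT_NAME(id) AS table_name,
        name            AS index_or_stats_name,
        rowcnt          AS table_rows,
        rowmodctr       AS rows_modified_since_last_stats_update
FROM    sys.sysindexes
WHERE   id = OBJECT_ID('dbo.YourBigTable')   -- hypothetical table name
ORDER BY rowmodctr DESC;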
For example, for a table with 10 million rows, you might decide to update statistics manually once more than 200,000 rows have changed (tuned to the business situation);
for a 50-million-row table, once more than 600,000 rows have changed (again, tuned to the business situation).
At the same time, choose the sampling rate per table: a relatively high percentage on relatively small tables, a relatively low percentage on relatively large tables.
For example, sample 60% when updating statistics on the 10-million-row table and 30% on the 50-million-row table.
In this way you decide, per table, how much change warrants an update and which sampling percentage to use, achieving a reasonable balance; a sketch of this policy follows below.
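
Putting the two decisions together, a sketch of the policy described above might look like this. The table name, the 200,000-row threshold, and the 60 percent sample are illustrative values for the hypothetical 10-million-row table, not recommendations:

IF EXISTS (SELECT 1
           FROM   sys.sysindexes
           WHERE  id = OBJECT_ID('dbo.YourBigTable')   -- hypothetical table name
           AND    indid IN (0, 1)                      -- heap (0) or clustered index (1) row carries the table-level counter
           AND    rowmodctr > 200000)                  -- illustrative change threshold
BEGIN
    UPDATE STATISTICS dbo.YourBigTable WITH SAMPLE 60 PERCENT;   -- illustrative sampling rate
END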
Of course, to stress the point one last time: every number given here is relative, not absolute, and only a reference.
You have to experiment in your own hardware and software environment and within your maintenance window; there is no rigid standard.

Summary: the accuracy of statistics has a vital impact on the generation of execution plans. This article gives a rough analysis of the rules for statistics updates and the issues to note when updating statistics.
When intervening manually in statistics updates, you need to make a reasonable trade-off between efficiency and accuracy according to the specific situation (the table's data churn, the server's hardware and software environment, the maintenance window, and so on).

