Objective
The previous few articles in this series have repeatedly hinted at the importance of SQL Server statistics, so this article is devoted entirely to that topic.
It is a long one, so get comfortable before you start.
Without further ado, let's get to it.
Technical preparation
The database version is SQL Server 2008 R2. The analysis uses Microsoft's classic sample database Northwind, and part of the content also uses another Microsoft sample database, AdventureWorks.
Anyone who has worked with SQL Server should find both databases familiar.
Conceptual understanding
Books Online explains statistics in SQL Server as follows:
Query optimization statistics are objects that contain statistical information about the distribution of values in one or more columns of a table or indexed view. The query optimizer uses these statistics to estimate the cardinality, or number of rows, in the query result. With these cardinality estimates, the query optimizer can create a high-quality query plan. For example, it can use them to choose an index seek operator instead of the more resource-intensive index scan operator, and in doing so improve query performance.
In plain terms, the role of statistics is this: through them SQL Server understands how the data in each table is distributed. Knowing what the data "looks like" means that, for every query, it can estimate from that distribution roughly where the data it needs lives and how much of it there is.
For example, I recall an earlier article in this series that ran what was essentially the same query twice and got two completely different query plans. The queries looked like this (the table name is my assumption; the original queried a table with a nonclustered index on FirstName, such as Person.Contact in AdventureWorks):
SELECT * FROM Person.Contact WHERE FirstName LIKE 'K%'
SELECT * FROM Person.Contact WHERE FirstName LIKE 'Y%'
Two identical query statements, differing only in the search condition (one looks for customers whose names start with K, the other with Y), produced completely different query plans.
The reason lies in the statistics.
We know there is a nonclustered index on the FirstName column of this table, created precisely to speed up queries like the ones above.
But 1,255 rows in the table have a FirstName starting with K. A nonclustered index seek would therefore cost roughly 1,255 I/O operations, and that is not the worst of it. Because the columns we select are not all contained in the FirstName index, each matching row also needs a bookmark lookup, and bookmark lookups generate a large amount of random I/O. Remember: random I/O. This lookup mechanism was described in the first article of the series.
So for the K query, a scan that reads the data sequentially is actually cheaper than using the nonclustered index.
Only 37 rows start with Y, however. Fetching 37 rows through the nonclustered index, plus their bookmark lookups, is clearly the cheaper option: with so little data, the random I/O generated is small.
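As a quick sanity check, the row counts behind the two predicates can be confirmed directly. This is a minimal sketch; the table name Person.Contact is my assumption based on the AdventureWorks sample database mentioned earlier:

```sql
-- Count how many rows each predicate would match.
SELECT LEFT(FirstName, 1) AS Initial,
       COUNT(*)           AS MatchingRows
FROM Person.Contact
WHERE FirstName LIKE 'K%' OR FirstName LIKE 'Y%'
GROUP BY LEFT(FirstName, 1);
```

The optimizer's estimates for the two predicates come from exactly these kinds of counts, precomputed and stored in the statistics rather than gathered at query time.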
So here's the question:
How does SQL Server know in advance that many values in the FirstName column start with K but few start with Y?
The answer is the statistics on FirstName: they record not only what the values in the column look like but also how those values are distributed.
Think of a library: each bookshelf is a table and each book is a row. The index is like the library's catalog, organized by category, while the statistics record how many books of each kind there are and which shelves hold them. To borrow a book you first consult the catalog (the index), then let the statistics guide you to the right shelf.
I hope this analogy makes the role of statistics clear.
One more digression: many developers do not deeply understand what indexes and statistics actually do. After reading a pile of tuning articles they retain a single sentence: to tune, just create an index.
I do not deny that creating indexes is a tuning technique, but it is often applied blindly. Sometimes, more by luck than judgment, performance really does improve after the index is created; other times the new index is useless and only slows the table's other operations (especially inserts), and sometimes it even leads to deadlocks.
Moreover, indexes often do not work as beautifully as you might hope; a follow-up article will analyze the reasons an index fails to be used.
So when you hit a problem, look through the symptoms to the essence; only then can you solve it with real confidence.
Parsing statistics
Let's look in detail at what a statistics object contains. The Customers table has a nonclustered index on its ContactName column, so that column carries a statistics object. We can list the statistics on the table with the following script:
EXEC sp_helpstats 'Customers', 'ALL'
Then view the details of that statistic with the following command:
DBCC SHOW_STATISTICS ('Customers', ContactName)
The output of every statistics object consists of the three sections shown above.
Let's analyze them in turn and see how SQL Server uses each section to understand the distribution of the column's data.
A. The overall attributes of the statistics
This section contains the following columns:
- Name: the name of the statistics object.
- Updated: when the statistics were last updated. This timestamp is very important: it tells us whether the statistics are current, and whether a late update might mean they no longer describe the table's actual data distribution accurately.
- Rows: the total number of rows in the table.
- Rows Sampled: the number of rows sampled to build the statistics. When the table is large, the statistics are built from a sample; when it is small, a full scan is used, which yields more accurate statistics. In the example above, 91 rows were sampled.
- Steps: the number of steps, that is, the number of groups into which SQL Server divides the data rows. SQL Server chooses this value itself: more steps describe the data in finer detail but cost more to build and store, so SQL Server balances the two (the histogram holds at most 200 steps).
- Density: the density of the first key column prefix.
- Average Key Length: the average length, in bytes, of all the key columns.
- String Index: whether the object contains string summary statistics, which exist to support LIKE searches.
- Filter Expression: the filter predicate. This is new in SQL Server 2008: statistics can be restricted by an expression for more fine-grained analysis.
- Unfiltered Rows: the number of rows before the filter expression is applied; also a 2008 addition.
From this first section alone, the statistics already tell us when the column was last analyzed, how much data it holds, how long the values are, whether string statistics exist, and so on.
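Since filter expressions were just mentioned, here is a minimal sketch of creating a filtered statistic. The statistic name and predicate are invented for illustration; Northwind's Customers table does have a Country column:

```sql
-- Filtered statistics (SQL Server 2008+): the histogram is built
-- only from the rows that match the WHERE predicate.
CREATE STATISTICS Stats_Customers_Germany
ON dbo.Customers (ContactName)
WHERE Country = 'Germany';
```

A filtered statistic like this can give the optimizer much sharper estimates for queries that always target the same slice of the table.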
B. The density vector of the statistics
- All density: the density of each index column prefix. This is a very important value: SQL Server uses it to judge how selective, and therefore how effective, the index is.
It is calculated as density = 1 / (number of distinct values in the column), so the density always falls between 0 and 1.
The smaller the value, the more selective the index column and the more efficient the index. Ideally every value is distinct, that is, unique, which gives the smallest possible density.
For example, the column above has 91 rows. If no two customers share a name, the density is 1/91 = 0.010989. A gender column, by contrast, has only two values, male and female, so its density is 0.5. Given the choice, SQL Server will clearly prefer the index on the ContactName (customer name) column.
Put simply: a highly selective index has a small density value, meaning few duplicates, so one filter on it eliminates most of the rows. A column full of duplicates, such as gender, is poorly selective: one filter can eliminate only half the records.
- Average Length: the average length, in bytes, of the index key.
- Columns: the names of the index columns. Because this is a nonclustered index there are two rows: one for the ContactName key column alone, and one for ContactName combined with the clustered index key, CustomerID. To understand why, see the article on index basics.
With this information, SQL Server knows which access path will fetch the required data fastest and most efficiently.
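The density value is easy to verify by hand. A minimal sketch, assuming Northwind's Customers table: compute 1 / (number of distinct values) and compare it with the All density figure that DBCC SHOW_STATISTICS reports:

```sql
-- Density = 1 / number of distinct values in the key column.
SELECT COUNT(*)                          AS TotalRows,
       COUNT(DISTINCT ContactName)       AS DistinctValues,
       1.0 / COUNT(DISTINCT ContactName) AS Density
FROM dbo.Customers;
```

If every ContactName is unique, DistinctValues equals TotalRows and the computed density matches the 1/91 figure discussed above.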
C. The histogram of the statistics
Finally, the third section: the column's histogram. Through it SQL Server gets a direct picture of how the column's values are distributed. Its columns are:
- RANGE_HI_KEY: the upper bound of each step in the histogram. This is easy to understand: when the table is large, rows are grouped into steps, and this value is the highest value in the step. In the example above the statistics use 90 steps for only 91 rows; that is, to describe the column as precisely as possible, SQL Server put a single value into almost every step and two values into just one step.
- RANGE_ROWS: the number of rows that fall inside the step, excluding the upper bound itself. As we said, 91 rows are divided into 90 steps, so exactly one step holds two values; this column is where you can find it.
- EQ_ROWS: the number of rows exactly equal to the step's upper bound. Because no values repeat here, it is 1.
- DISTINCT_RANGE_ROWS: the number of distinct values inside the step, again excluding the upper bound.
- AVG_RANGE_ROWS: the average number of rows per distinct value within the step, that is, RANGE_ROWS / DISTINCT_RANGE_ROWS.
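When only one of the three sections is of interest, DBCC SHOW_STATISTICS accepts a WITH clause that limits the output. For example, to see just the histogram of the statistic examined above:

```sql
-- WITH STAT_HEADER, DENSITY_VECTOR or HISTOGRAM restricts the
-- output to the corresponding section.
DBCC SHOW_STATISTICS ('Customers', ContactName) WITH HISTOGRAM;
```

This is handy in scripts that only need to inspect step boundaries and row counts.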
With this last section in hand, SQL Server has a complete picture of how the data in that column is distributed, and can fetch whatever is asked of it with ease.
So whenever we write a T-SQL statement, the optimizer estimates from the statistics how many rows the query will touch and picks the most appropriate execution plan.
I also trust that, after this three-part analysis, the "K" versus "Y" question from the start of the article answers itself, so I will not spell it out again.
Of course, when the data volume is very large, the maintained statistics can drift slightly from reality; that is when we have to step in and refresh them ourselves.
Create statistical information
By now the power of statistics should be evident, and their importance to the database is self-evident. SQL Server therefore creates statistics automatically and updates them in a timely way. These behaviors can be turned off, but I strongly advise against it, for a simple reason: don't go looking for trouble.
Both features are on by default, which means SQL Server maintains the accuracy of the statistics by itself.
In day-to-day maintenance we do not need to change either setting. There are extreme cases: updating statistics has its own cost, and on systems under very heavy concurrency the automatic update is sometimes turned off. Such cases are rare, though, so the defaults are almost always the right choice.
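Whether the two automatic options are currently on can be checked from the catalog. A quick sketch against sys.databases (the database name Northwind is assumed from this article's setup):

```sql
-- 1 means the option is enabled for the database.
SELECT name,
       is_auto_create_stats_on,
       is_auto_update_stats_on
FROM sys.databases
WHERE name = 'Northwind';
```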
SQL Server automatically creates statistics in the following situations:
1. When an index is created, SQL Server automatically creates statistics on its key columns.
2. When SQL Server wants to use statistics on a column and finds none, it creates them automatically.
3. And, of course, we can also create them manually.
For example, to watch auto-creation happen, copy the Customers table into a new table (which starts out with no indexes and no statistics) and list its statistics:
SELECT * INTO CustomersStats FROM Customers
GO
EXEC sp_helpstats 'CustomersStats'
GO
Now run a query with a filter on the new table, then look at its statistics again:
SELECT * FROM CustomersStats WHERE ContactName = 'Hanna Moos'
GO
EXEC sp_helpstats 'CustomersStats'
GO
Of course, we can also write the creation script ourselves as our situation requires, like this:
USE [Northwind]
GO
CREATE STATISTICS [Coustomersone] ON [dbo].[CustomersStats] ([CompanyName])
GO
SQL Server also provides a graphical window in Management Studio for the same operation.
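Everything that GUI shows can also be pulled from the catalog views sys.stats and sys.stats_columns. A sketch, using the CustomersStats table from the example above:

```sql
-- List every statistic on the table, the column it covers, and
-- whether it was auto-created or user-created.
SELECT s.name AS StatisticName,
       c.name AS ColumnName,
       s.auto_created,
       s.user_created
FROM sys.stats AS s
JOIN sys.stats_columns AS sc
  ON sc.object_id = s.object_id AND sc.stats_id = s.stats_id
JOIN sys.columns AS c
  ON c.object_id = sc.object_id AND c.column_id = sc.column_id
WHERE s.object_id = OBJECT_ID('dbo.CustomersStats');
```

Auto-created statistics show up here with names like _WA_Sys_..., which makes them easy to tell apart from the ones you created by script.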
SQL Server automatically updates statistics in the following situations:
1. If the statistics are defined on an ordinary table, any of the following changes triggers an update:
- The table goes from empty to containing one or more rows.
- For a table with fewer than 500 rows, the leading column of the statistics accumulates more than 500 modifications.
- For a table with more than 500 rows, the leading column of the statistics accumulates more than 500 + 20% of the table's row count in modifications. So for large tables, SQL Server recalculates the statistics only after roughly a fifth of the data has changed.
2. Temporary tables can have statistics too, which is one reason temp tables appear in so many optimizations. Their maintenance policy is basically the same as for ordinary tables. Note, however, that table variables cannot have statistics.
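When automatic maintenance lags behind on a very large table, statistics can be refreshed by hand. A minimal sketch; FULLSCAN trades build time for accuracy, and a sampling rate can be used instead:

```sql
-- Rebuild all statistics on the table from a full scan of its data.
UPDATE STATISTICS dbo.Customers WITH FULLSCAN;

-- Or refresh a single statistic using a sample:
-- UPDATE STATISTICS dbo.Customers ContactName WITH SAMPLE 50 PERCENT;
```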
This article is a little rough, but it is already long enough, so let's stop here; I will add more material about statistics later.
Tuning is far too broad a topic for one article, so we will continue it in future installments; follow along if you are interested.
Reference documents
- The book "Microsoft SQL Server Enterprise Platform Management Practices"
- The book series "Inside Microsoft SQL Server 2005"
If you have questions, leave a comment or send a private message. I look forward to studying SQL Server in depth with those of you who are interested; let's learn and improve together.
Links to the earlier articles follow. Together they cover most of the everyday query operations we write; it is probably time to put together a proper table of contents...
SQL Server Tuning Series Basics
SQL Server Tuning Series Basics (Summary of common operators)
SQL Server Tuning Series Basics (Union operator summary)
SQL Server Tuning Series basics (Parallel operations Summary)
SQL Server Tuning Series basics (Parallel operations Summary chapter II)
SQL Server Tuning Series Basics (Index operations Summary)
SQL Server Tuning Series Basics (subquery operations Summary)
-----------------The advanced articles follow-------------------
SQL Server Tuning Series advanced (how the query optimizer runs)
SQL Server Tuning Series Advanced (query statements run several indicator value monitoring)
If you have read this far and feel you gained something, please don't be stingy with your "Recommend" click.