SQL Server In-Depth: Profiling Statistics

Source: Internet
Author: User
Tags: microsoft sql server

Reposted from: http://www.cnblogs.com/zhijianliutang/p/4190669.html

Conceptual understanding

Statistics in SQL Server are explained in Books Online as follows:

Query optimization statistics are objects that contain statistical information about the distribution of values in one or more columns of a table or indexed view. The query optimizer uses these statistics to estimate the cardinality, or number of rows, in the query result. These cardinality estimates enable the query optimizer to create a high-quality query plan. For example, the query optimizer can use cardinality estimates to choose an index seek operator instead of the more resource-intensive index scan operator, and thereby improve query performance.

In plain terms, the role of statistics is this: through them SQL Server understands how the data in each table is distributed. It knows what the data "looks like", so for every query it can use that distribution to roughly locate where the data it needs lives.

For example, I remember an earlier article in which the same query produced two completely different query plans. The queries were essentially these:

SELECT * FROM Person.Contact WHERE FirstName LIKE 'K%'
SELECT * FROM Person.Contact WHERE FirstName LIKE 'Y%'

The statements are identical except for the predicate: one searches for customers whose names start with K, the other for customers whose names start with Y. Yet they produced completely different query plans.

The reason lies in the statistics.

We know there is a nonclustered index on the FirstName column of this table, created precisely to speed up queries like the ones above.

But in this table, 1,255 rows of FirstName start with K. A nonclustered index seek would therefore require roughly 1,255 lookups, and that is not even the worst part. Because the columns we want are not all contained in the FirstName index, each row also needs an additional bookmark lookup, and those bookmark lookups generate a large amount of random IO. Remember: this is random IO. This lookup pattern was described in the first article of this series.

So for the K query, an index scan is cheaper than using the nonclustered index, because it reads the data sequentially.

Only 37 rows start with Y. Retrieving 37 rows through the nonclustered index, plus a handful of bookmark lookups, is clearly the cheaper option: with so little data, the amount of random IO generated is small.
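If you want to see this cost difference for yourself, a simple sketch is to turn on IO statistics and run both queries; the logical reads and the chosen operators will differ. This assumes the AdventureWorks sample database and a nonclustered index on Person.Contact(FirstName), as set up in the earlier article; adjust the names to your environment.

SET STATISTICS IO ON
GO
SELECT * FROM Person.Contact WHERE FirstName LIKE 'K%'  -- many matching rows: the optimizer tends to scan
SELECT * FROM Person.Contact WHERE FirstName LIKE 'Y%'  -- few matching rows: the optimizer tends to seek plus bookmark lookup
GO
SET STATISTICS IO OFF
GO

Comparing the logical reads reported for the two statements makes the scan-versus-seek trade-off concrete.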

So here's the question:

How does SQL Server know that far more customers in the FirstName column start with K than with Y?

The answer is the statistics on FirstName: they describe not only what the values in the column "look like", but also how those values are distributed.

It is like a library: each bookshelf is a table and each book is a row of data. The index is like the library catalog, organized by category, while the statistics are like a record of how many books of each kind there are and which shelves they sit on. When you want to borrow a book, you first consult the catalog (the index), then let the statistics guide you to the right location, and only then fetch the book.

I hope that with this explanation, readers now understand the role of statistics.

Let me add one more point. Many people do not have a deep understanding of what indexes and statistics actually do; after reading a pile of tuning articles, they come away with a single takeaway: to tune, just create an index.

I do not deny that creating indexes is a legitimate tuning technique, but in many cases the mechanics behind an index are not understood, and it is only by coincidence that performance improves after one is created. Sometimes the new index is useless and only slows down the table's other operations (especially inserts), and in worse cases it even contributes to deadlocks.

Moreover, indexes often do not work as magically as you might hope; in a follow-up article we will analyze the reasons why indexes fail to be used.

So when you run into a problem, look past the symptoms and understand the underlying mechanism; only then can you solve it with real confidence.

Parsing statistics

Now let's take a detailed look at what a statistics object contains. The Customers table (from the Northwind sample database) used here has a nonclustered index on its ContactName column, so a statistics object exists on that column. We can list the statistics on the table with the following script:

EXEC sp_helpstats 'Customers', 'ALL'

Then view the details of that statistics object with the following command:

DBCC SHOW_STATISTICS ('Customers', ContactName)

The output of each statistics object consists of three parts: the statistics header, the density vector, and the histogram.
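Each of the three parts can also be requested on its own with the WITH option of DBCC SHOW_STATISTICS, which is handy when you only care about one of them. A small sketch, assuming the same Customers table and a statistics object named ContactName as above:

DBCC SHOW_STATISTICS ('Customers', ContactName) WITH STAT_HEADER     -- part A: the overall attributes
DBCC SHOW_STATISTICS ('Customers', ContactName) WITH DENSITY_VECTOR  -- part B: the density vector
DBCC SHOW_STATISTICS ('Customers', ContactName) WITH HISTOGRAM       -- part C: the histogram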

Let's look at each part in turn and see how SQL Server uses it to understand the distribution of the column's data.

A. The overall attributes of the statistics (the statistics header)

This section contains the following columns:

    • Name: The name of the statistics object.
    • Updated: The date and time the statistics were last updated. This is very important: it tells us when the statistics were last refreshed, whether they are current, and whether a stale update could be the reason the statistics no longer describe the actual data distribution accurately.
    • Rows: The total number of rows in the table.
    • Rows Sampled: The number of rows sampled to build the statistics. When the table is large, the statistics are built from a sample; when it is small, a full scan is used, which gives more accurate statistics. In the example above the sample is 91 rows.
    • Steps: The number of steps, that is, the number of groups SQL Server divides the rows into when building the histogram. SQL Server chooses this value itself: more steps describe the data more precisely but cost more to build and store, so SQL Server balances the two.
    • Density: The density value of the column prefix.
    • Average Key Length: The average length, in bytes, of all the key columns.
    • String Index: Indicates whether the statistics include a string summary, which is used to improve estimates for LIKE predicates.
    • Filter Expression: The filter expression. Filtered statistics are a feature added in SQL Server 2008; they allow statistics to be built over a filtered subset of rows for more fine-grained analysis.
    • Unfiltered Rows: The number of rows before the filter expression is applied; also new in SQL Server 2008.

From this first section, then, the statistics already record when the column was last analyzed, how many rows were examined, the average key length, whether string summaries exist, and so on.

B. The index columns covered by the statistics (the density vector)

All density: The density value for each combination of index columns. This is a very important number: SQL Server uses it to judge how selective, and therefore how useful, the index is.

It is calculated as: density = 1 / (number of distinct values in the column combination), so the density always falls between 0 and 1.

The smaller the value, the more selective the column, and the more effective an index on it is. In the ideal case every value is distinct, that is, unique, and the density is at its minimum.

For example, the column in our example has 91 rows. If no two customers share the same name, the density is 1/91 = 0.010989. A gender column, by contrast, has only two values, male and female, so its density is 0.5. Faced with that choice, SQL Server will obviously prefer the index on ContactName (the customer name).

Put simply: a highly selective index has a small density value, which means little duplication, so a filter on it narrows down to the result quickly. A column full of repeated values, such as gender, is far less selective: one filter can eliminate only half of the rows.

Average Length: The average length, in bytes, of the index key columns.

Columns: The names of the columns in each key combination. Because this is a nonclustered index, there are two rows: one for the ContactName index key alone, and one for ContactName combined with the clustered index key CustomerID (a nonclustered index always carries the clustering key). Understanding this requires some index basics.

With this information, SQL Server knows which access path will fetch the required data faster and more efficiently.
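To make the density formula above concrete, here is a minimal sketch that computes the density by hand so it can be compared with the All density column of DBCC SHOW_STATISTICS. It assumes the Northwind Customers table used above; since that table has no gender column, the low-selectivity Country column stands in for the "few distinct values" case:

SELECT 1.0 / COUNT(DISTINCT ContactName) AS ContactNameDensity,  -- about 0.010989 when all 91 names are distinct
       1.0 / COUNT(DISTINCT Country)     AS CountryDensity       -- much larger, because countries repeat
FROM Customers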

C. The histogram of the statistics

Now for the third part, the histogram. Through it SQL Server gets a direct picture of how the column's data is distributed. Its columns are:

    • RANGE_HI_KEY: The upper bound value of each histogram step. This is easy to understand: when the data is grouped into steps, this is the highest value in the step. In our example the statistics are divided into 90 steps for only 91 rows; in other words, to describe the column as precisely as possible, SQL Server gave almost every step a single value, and only one step holds two values.
    • RANGE_ROWS: The number of rows that fall inside the step, excluding the upper bound value. We said there are 91 rows split into 90 steps, so exactly one step contains an extra row, and this column is where we find it.
    • EQ_ROWS: The number of rows exactly equal to the upper bound value. Because the column has no duplicate values here, this is 1 for every step.
    • DISTINCT_RANGE_ROWS: The number of distinct values inside each step, excluding the upper bound value.
    • AVG_RANGE_ROWS: The average number of rows per distinct value within the step.

Once this last part is in place, SQL Server has a complete picture of how the data in that column is distributed: it can easily estimate how much data a given predicate will touch, and the histogram values are kept sorted.

So whenever we submit a T-SQL statement, SQL Server uses the statistics to estimate the amount of data the query will fetch and then chooses the most suitable execution plan.

After walking through these three parts, I believe the "K" versus "Y" question from the beginning of the article answers itself, so I won't spell it out again.
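For readers who do want to check the answer, one way is to look at the FirstName histogram directly: adding up RANGE_ROWS and EQ_ROWS for the steps whose RANGE_HI_KEY falls in the 'K' range versus the 'Y' range reproduces the 1,255-versus-37 estimate. A sketch, assuming AdventureWorks and an index on Person.Contact(FirstName) named IX_Contact_FirstName (the name is illustrative; substitute whatever your index is actually called):

DBCC SHOW_STATISTICS ('Person.Contact', IX_Contact_FirstName) WITH HISTOGRAM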

Of course, when the data volume is very large, the maintained statistics inevitably drift a little, and that is when we need to step in and refresh them ourselves in good time.

Creating statistics

From the introduction above we have already seen how powerful statistics are; their importance to the database goes without saying. That is why SQL Server creates statistics automatically and keeps them updated. These options can be turned off, but I strongly advise against it, for a simple reason: don't go looking for trouble.

These two features, AUTO_CREATE_STATISTICS and AUTO_UPDATE_STATISTICS, are turned on by default, which means SQL Server maintains the accuracy of its statistics by itself.

In day-to-day maintenance we do not need to change either of them. There are extreme cases: because updating statistics also has a cost, a system under very heavy concurrency may need automatic updates turned off. But that situation is rare, so in general keep the defaults.
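For reference, both options can be inspected per database and, should it ever be necessary, changed explicitly. A sketch against the Northwind database used in this article:

SELECT name, is_auto_create_stats_on, is_auto_update_stats_on
FROM sys.databases
WHERE name = 'Northwind'

ALTER DATABASE Northwind SET AUTO_CREATE_STATISTICS ON
ALTER DATABASE Northwind SET AUTO_UPDATE_STATISTICS ON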

SQL Server automatically creates statistics in the following situations:

1. When an index is created, SQL Server automatically creates statistics on the index key columns.

2. When SQL Server wants to use statistics on a column and finds none, it creates them automatically.

3. Of course, we can also create statistics manually.

First, an example of automatic creation:

SELECT * INTO Customersstats FROM Customers
GO
EXEC sp_helpstats 'Customersstats'
GO

The new table has no statistics yet. Now run a query with a filter condition, and then view the statistics again:

SELECT * FROM Customersstats WHERE ContactName = 'Hanna Moos'
GO
EXEC sp_helpstats 'Customersstats'
GO

This time a statistics object appears on the ContactName column: SQL Server created it automatically in order to evaluate the filter (auto-created statistics have names starting with _WA_Sys_). Of course, we can also write a script to create statistics ourselves, for example:

USE [Northwind]
GO
CREATE STATISTICS [Coustomersone] ON [dbo].[Customersstats] ([CompanyName])
GO
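Besides sp_helpstats, the catalog views sys.stats and sys.stats_columns list every statistics object and its columns, including the auto-created ones (auto_created is 1 and the name starts with _WA_Sys_). A sketch against the Customersstats table created above:

SELECT s.name AS stats_name, c.name AS column_name, s.auto_created
FROM sys.stats AS s
JOIN sys.stats_columns AS sc
    ON s.object_id = sc.object_id AND s.stats_id = sc.stats_id
JOIN sys.columns AS c
    ON sc.object_id = c.object_id AND sc.column_id = c.column_id
WHERE s.object_id = OBJECT_ID('dbo.Customersstats')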

SQL Server Management Studio also provides a graphical window for this, which makes the operation straightforward.

SQL Server automatically updates statistics in the following situations:

1. If the statistics are defined on an ordinary table, an automatic update is triggered when any of the following changes occurs:

    • The table goes from empty to containing one or more rows.
    • For a table with 500 rows or fewer, when the leading column of the statistics has accumulated more than 500 modifications.
    • For a table with more than 500 rows, when the leading column of the statistics has accumulated more than 500 + 20% of the table's total row count in modifications. So for a large table, SQL Server recomputes the statistics only after roughly 1/5 of its data has changed.

2. Temporary tables can also have statistics, which is one of the reasons temporary tables are often used in optimization; their statistics are maintained under basically the same policy as ordinary tables. Table variables, however, cannot have statistics created on them.
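Whether a statistics object is approaching these thresholds can be watched directly on SQL Server 2008 R2 SP2 / 2012 SP1 and later through sys.dm_db_stats_properties, whose modification_counter column tracks the accumulated changes to the leading column. A sketch for the statistics on the Customers table:

SELECT s.name, sp.last_updated, sp.rows, sp.rows_sampled, sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('dbo.Customers')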

Of course, we can also update statistics manually, with a script like the following:

UPDATE STATISTICS Customers WITH FULLSCAN
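A few common variants, for reference: FULLSCAN reads every row and is the most accurate but the most expensive; SAMPLE trades accuracy for speed; and sp_updatestats refreshes every statistics object in the current database that has outstanding modifications. The ContactName statistics name below is the one used earlier in this article.

UPDATE STATISTICS Customers (ContactName) WITH FULLSCAN  -- refresh a single statistics object
UPDATE STATISTICS Customers WITH SAMPLE 50 PERCENT       -- sampled refresh of all statistics on the table
EXEC sp_updatestats                                      -- refresh the whole database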

This article is a bit rough, and it has already run long, so let's stop here. I will add more material on statistics in a later post.

Tuning is too broad a topic to cover in one article; we will get to it in future posts, so stay tuned if you are interested.

References

    • The book "Microsoft SQL Server Enterprise Platform Management Practices"
    • The "SQL Server 2005 Technical Insider" ("Inside Microsoft SQL Server 2005") series
