SQL Server in-depth analysis of statistical information learning summary tutorial

Source: Internet
Author: User
Tags microsoft sql server

Preface

After the analysis of the previous several articles, I have gotten a glimpse of the importance of SQL Server statistics, so this article is about to sacrifice this artifact.

The content of this article will be very long. Sit on the bench and enjoy melon seeds and snacks...

No nonsense.

Technical preparation

The database version is SQL Server2008R2, which uses Microsoft's previous case Library (Northwind) for analysis. Some of the content will also be applied to another Microsoft Case Library AdventureWorks.

I believe that SQL Server users are familiar with these two databases.

Concept understanding

The statistical information in SQL Server is explained in books online.

Statistics of query optimization are some objects that contain statistics related to the distribution of values in one or more columns in the table or index View. The query optimizer uses these statistics to estimate the base or number of rows in the query results. With these base estimates, the query optimizer can create high-quality query plans. For example, the query optimizer can use base estimation to select an index search operator instead of an index scan operator that consumes more resources to improve query performance.

In fact, the common feature of statistical information is that SQL Server understands the distribution of data content items in each table in the database through statistical information, and knows what the data in the database looks like ", in this way, each query statement can locate the content location of the data to be searched based on the data distribution in the table.

For example, I remember that I wrote the same query statement in a previous article but produced a completely different query plan. Here I will review the basic information as follows:

SELECT * FROM Person. Contact

WHERE FirstName LIKE 'K %'

SELECT * FROM Person. Contact

WHERE FirstName LIKE 'Y %'

The same query statement only has different query conditions. A customer who starts with K and a customer who starts with Y have different query plans.

In fact, the reason is that the statistical information is at work.

We know that there is a non-clustered index in the FirstName field of this table to improve the performance of the preceding query statements.

However, in this table, there are 1255 rows of data in the FirstName field starting with K, that is, if the non-clustered index search method is used, 1225 IO operations are required, this may not be the worst, but it is still behind, because the data fields we obtain are not all in the FirstName field, and we need to search for additional bookmarks, this bookmarked query produces a large number of random IO operations. Remember: here is random IO. The search method here is described in our first article.

Therefore, compared to the consumption caused by the use of non-clustered indexes, all index scans are more cost-effective, because they scan in sequence to obtain the desired data.

There are only 37 rows starting with Y, and 37 rows of data are completely obtained through non-clustered indexes. It is obviously a very cost-effective method to add a part of the bookmarked search. Because it has a small amount of data, it produces a relatively small amount of random IO.

So here's the question:

How does SQL Server know that there are more customers starting with K in the FirstName field in this table, but less customers starting with Y ?.

The statistical information is at work. It not only knows the content of each row of data in the FirstName field, but also knows the distribution of each row of data.

In fact, this is like in the library, each bookshelf is a table, and each book is a row of data, indexing is like a library book list, such as by class, the statistical information is like the number of books of each type and the location where the books are stored. Therefore, when you borrow a book, you need to use the index to view it and then use the statistical information to guide the location so that you can obtain the book.

I hope this explanation will help you understand the role of the statistics.

Let's talk about it more here. Some children's shoes did not have a deep understanding of index and statistical information. After reading many tuning articles, they only learned one sentence: tuning well, create an index.

I do not deny the role of optimizing indexes, but I do not know much about indexing techniques. What's even more, in most cases, the performance is really improved after the index is created by mistake, but sometimes the index is useless, it will only affect the performance of other table operations (especially Insert), and even worse it will lead to deadlocks.

In addition, the role of index items is not as good as you think in many cases. We will analyze the causes of index failure in subsequent articles.

Therefore, if you encounter problems, you must understand the essence of the problem through the appearance so that you can solve the problem with a real purpose and confidence.

Parse statistics

Let's analyze the content items in the statistical information in detail. We know that in the preceding statement, there is a non-clustered index item in the ContactName column of the Customers table, so there is a statistical information in this column, you can run the following script to view the statistics list of the table.

Sp_helpstats MERS

Run the following command to view the details of the statistics:

DBCC SHOW_STATISTICS (Mers MERS, ContactName)

Each statistical information contains the preceding three parts.

We will analyze these three parts in sequence. How does SQL Server know the content distribution of the column data.

A. Overall attribute items of statistics

This part contains the following columns:

Name: the Name of the statistics.

Updated: The last update time of the statistical information. The time information is very important. Based on this information, we can know when the statistical information is Updated. Is it the latest, is there a problem such as inaccurate distribution of the current statistical data due to untimely updates of statistical information.

Rows: describes the total number of Rows in the current table.

Rows Sampled: sampling data of statistical information. When there is a large amount of data, the statistical information is obtained by sampling. If the data volume is compared, the precise statistical value is obtained by scanning all data. For example, in the example above, the sample data is 91 rows.

Steps: step size. That is, the SQL Server statistics are based on the number of groups in the data rows. The step size value is determined by the SQL Server itself. Because the smaller the step size, the more detailed the data is described, but the more consumed, the SQL Server balances the value by itself.

Density: the size of the column value prefix.

Average Key length: Average length of all columns.

String Index: indicates whether the statistical value is a String. Here, the string is evaluated to support the LIKE keyword search.

Filter Expression: a Filter Expression. It is a new feature of SQL Server and later versions. It supports adding Filter expressions for more fine-grained statistical analysis.

Unfiltered Rows: a row that is not filtered by an expression. It is also a new feature.

After the preceding data, the statistical information has analyzed the latest update time, data volume, data length, data type, and other information values of the column.

B. Covered index items of statistical information

All density: indicates the density of the index column. This is a very important value. SQL Server determines the degree of effectiveness of the index based on this score.

The formula for calculating this score is: density = 1/the number of non-repeated rows in the table. Therefore, the value range of this density value is 0-1.

The smaller the value, the more selective the index items of the column, that is, the more effective the index. Ideally, all are non-repeated values, that is, they are all unique values, so that they have the minimum number.

For example, in the preceding example, there are 91 rows in this column. If the customer does not have a duplicate name, the density value is 1/91 = 0.010989. This column is a gender column, there are only two values: male and female. The density value of this column is 0.5, therefore, when selecting an index, SQL Server will obviously select the ContactName column.

To put it simply, the index is highly selective and its density value is small, so it has fewer duplicate values, making it easier to find the result value during filtering. On the contrary, multiple duplicate values have poor selectivity, such as gender. Only half of records can be filtered out at a time.

Average Length: the Average Length of the index.

Columns: name of the index column. Here, because we are not clustered indexes, there will be two rows, one row is the ContactName index column, one row is the ContactName index column and the clustered index column value CustomerID combination column. I hope to understand the basic indexing knowledge here.

With the above information, SQL Server will know that the data acquisition method of this part is faster and more effective.

C. Histogram information of statistical information

Next, we will analyze the third part, the histogram information of this column. Through this SQL Server, we can intuitively "control" the data distribution of this column.

RANGE_HI_KEY: maximum value of each set of data in the histogram. It is easy to understand that, if the data volume is large, the value is the maximum value of the current group after grouping. In the preceding example, the statistical information is divided into 90 groups and there are only 91 rows in total. That is to say, to accurately describe the value of this column, SQL Server takes only one value for each group, only one group has two values.

RANGE_ROWS: number of rows in the intervals of no data in the histogram (excluding the maximum value ). Here we talk about a total of 91 rows, which are divided into 90 groups, so there will be two values in one group, we find it:

EQ_ROWS: indicates the number of rows equal to the maximum value above. Because we do not include the same, so here the value is 1

DISTINCT_RANGE_ROWS: number of non-repeated values in each data range of the histogram. Except the upper limit.

AVG_RANGE_ROWS: the average number of rows in each histogram.

As described in the last section, SQL Server has full control over the data content distribution of this field in the table. You can easily obtain the data based on it.

So every time we write a T-SQL statement, it can evaluate the amount of data to be obtained based on statistics and find the most appropriate execution plan to execute.

I also believe that after the analysis in the above three parts, the question about 'K' and 'Y' we mentioned at the beginning of the article will find the answer. I will not explain it here.

Of course, if the data volume is too large, maintenance of statistics may also cause minor errors. At this time, we need to make up for it in time.

Create statistics

Through the above introduction, we have seen the powerful role of statistical information, so the importance of it for the database is self-evident. Therefore, SQL Server will automatically create statistical information, update the statistics in due time. Of course, we can disable it, but I Do not recommend this. The reason is very simple: No Do No Die...

These two functions are enabled by default, that is, SQL Server maintains the accuracy of statistics by itself.

In daily maintenance, we do not have to change these two items. Of course, there are also extreme situations, because we know that updating statistics is also a consumption, in a very large concurrent system, you need to turn off the automatic update function. In this case, the default value is basically enough.

In the following cases, SQL Server automatically creates statistics:

1. When an index is created, SQL Server automatically creates statistics on the index column.

2. When SQL Server wants to use the statistics of certain columns and finds no, it will automatically create statistics.

3. Of course, you can also create one manually.

For example, an example of automatic creation

Select * into mermersstats from MERS

Sp_helpstats CustomersStats

To add a query statement, and then view the statistics

Select * from CustomersStats

Where ContactName = 'Hanna Moos'

Go

Sp_helpstats CustomersStats

Go

Of course, you can also manually create one based on your own situation. The creation script is as follows:

USE [Northwind]

GO

Create statistics [CoustomersOne] ON [dbo]. [mermersstats] ([CompanyName])

GO

SQL Server also provides graphic operation windows for GUI operations.

In the following cases, SQL Server automatically updates statistics:

1. If the statistical information is defined on a common table, the statistical information will be triggered when any of the following changes occur.

The number of data records in a table is greater than or equal to 1.

For tables with a data volume less than 500 rows, when the data in the first field of the statistics changes more than 500.

For tables with a data volume greater than 500 rows, when the data in the first field of the statistics changes more than 500 (20% * Total table data volume. Therefore, SQL Server recalculates statistics for large tables only when more than 1/5 of the data changes.

2. Statistical information can also be found in the temporary table. This is also one of the reasons for using temporary table optimization in many cases. Its maintenance policy is basically the same as that of a common table, but table variables cannot create statistics.

Of course, we can also manually update the statistics. The update script is as follows:

Update statistics Customers WITH FULLSCAN

The article is a bit rough... but the length is a little longer... come here... I will add a part of the statistical information later.

The optimization content is too broad. We will introduce it in the future. If you are interested, please pay attention to it in advance.

References

See the Microsoft SQL Server Enterprise platform management practices

Reference books: SQL. Server.2005. technical insider series

If you have any questions, you can leave a message or send a private message. We are always waiting for you to join SQL SERVER for further research. Learn together and make progress together.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.