How does SQL Server locate the target data through indexes and statistics (Article 3)


I haven't had much energy to write articles recently, since I've been working overtime every day. But to finish this series, I forced myself to keep going.

Before reading this article, please read the first and second articles I have previously written.

Article 1: How does SQL Server locate the target data through indexes and statistics (Article 1)

Article 2: How does SQL Server locate the target data through indexes and statistics (Article 2)

 

1. What statistics are and what they are for

To execute a statement as quickly as possible, an index alone is not enough. SQL Server has many ways of executing the same statement. Some methods suit small data volumes and some suit large ones, and even for the same method the cost varies greatly with the amount of data. An index only helps SQL Server find the matching records; SQL Server also needs to know how much data each operation will have to process, so that it can estimate the complexity and choose the execution plan with the lowest cost. To put it bluntly, in order to find the fastest way to complete a command, SQL Server has to know what the data looks like.

 

SQL Server is not a human: it cannot simply glance at the data and get a rough impression of it. So how do we let SQL Server know how the data is distributed? A common technique in database management systems is statistics, and that is what SQL Server uses to understand the distribution of the data.

 

Next, let's look at the statistics on the salesorderid field of the two example tables from the previous two articles, so that we can get an intuitive feel for the concept.

dbo.salesorderheader_test stores the summary information of each order, with exactly one record per order, so salesorderid contains no duplicates. The table should have 31474 records. salesorderid is an int field, so its length is 4 bytes.

Run:

DBCC SHOW_STATISTICS (table name, index or statistics name)

DBCC SHOW_STATISTICS ([salesorderheader_test], salesorderheader_test_cl)

The statistical information is divided into three parts.
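As a side note, each of the three parts can also be viewed on its own by adding one of the standard WITH options of DBCC SHOW_STATISTICS. A minimal sketch using the example index:

-- show only the statistics header, the density vector, or the histogram
DBCC SHOW_STATISTICS ([salesorderheader_test], salesorderheader_test_cl) WITH STAT_HEADER;
DBCC SHOW_STATISTICS ([salesorderheader_test], salesorderheader_test_cl) WITH DENSITY_VECTOR;
DBCC SHOW_STATISTICS ([salesorderheader_test], salesorderheader_test_cl) WITH HISTOGRAM;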

1. Statistics Header

The columns and their meanings:

Name: the name of the statistics; here it is the index name.

Updated: the date and time of the last UPDATE STATISTICS; here it is Dec 18 2012. This time is very important, because it tells you when the statistics were last refreshed and therefore whether, after the amount of data has changed, the statistics might no longer reflect the current data distribution.

Rows: the number of rows in the table. Here it is 31465, which does not fully reflect the current size of the table (because the statistics have not been updated in time).

Rows Sampled: the number of rows sampled for the statistics, also 31465, which means that the last time SQL Server updated the statistics it scanned the salesorderid value of every record in the table. Statistics built this way are usually accurate.

Steps: the number of groups (steps) the data is divided into in the third part of the statistics, the histogram. Here there are 3.

Density: the density of the first key column prefix, excluding the histogram boundary values (the EQ_ROWS values).

Average key length: the average length of all key columns. The salesorderheader_test_cl index has only one column, of type int, so the length is 4 (in bytes). If the index had several columns of different data types, for example an extra colc char(10) column, the average length would be (10 + 4) / 2 = 7.

String Index: if this is "yes", the statistics contain a string summary, which helps estimate the size of result sets for LIKE conditions. It applies only to the char, varchar, nchar, nvarchar, varchar(max), nvarchar(max), text and ntext data types. Here the column is int, so the value is "no".

 

2. Selectivity of the data fields (the density vector)

The columns and their meanings:

All density: the selectivity of the index column(s). "Selectivity" describes how much duplicate data there is in a data set, or conversely how much of it is unique. If a field rarely contains repeated values, its selectivity is high. An ID card number, for example, cannot repeat: even if you searched the identity records of all of China, one ID card number would return at most one record. A filter condition on such a field can filter out a large amount of data, and the returned result set is small. Gender, on the other hand, has only two values (a person is either male or female), so the field is highly repetitive and its selectivity is low: a filter condition on it can filter out at most half of the records. By calculating the selectivity, SQL Server can predict roughly how many records a filter condition will return. Density is defined as: density = 1 / (number of distinct index key values). If the value is less than 0.1, the index is generally considered to have high selectivity; if it is greater than 0.1, its selectivity is not high. Here [salesorderheader_test] has 31474 records with no duplicates, and 1 / 31474 = 3.177E-5, so this field has very good selectivity.

Average Length: the average length of the index key columns; here it is still 4.

Columns: the name(s) of the index column(s); here the field name is salesorderid.
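If you want to see where a figure like 3.177E-5 comes from, here is a minimal sketch that recomputes the density by hand. It is only an illustration of the formula density = 1 / (number of distinct key values), not how SQL Server itself calculates or stores it:

-- density of salesorderid = 1 / number of distinct values
SELECT 1.0 / COUNT(DISTINCT salesorderid) AS all_density
FROM dbo.salesorderheader_test;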

 

From this part of the information we can infer the length of the field the statistics describe and how many unique values it has. But this is still not enough for SQL Server to predict the size of a result set.

For example, if I query the order with salesorderid = 60000, I still cannot tell how many records will be returned. For that, the third part is needed.

 

3. Histogram

The columns and their meanings:

RANGE_HI_KEY: the upper bound (maximum value) of each group (step) of data in the histogram. The smallest order number in the table is 43659, and SQL Server chose it as the upper bound of the first step. The three groups of data are therefore: up to 43659, 43660 to 75131, and 75132 to 75132.

RANGE_ROWS: the number of rows falling inside each step of the histogram, excluding the upper bound. The first group contains only the single value 43659 and the third group contains only the single value 75132, both of which are upper bounds; all of the other data falls in the second group, which has 31471 rows in its interval.

EQ_ROWS: the number of rows in the table whose value is exactly equal to the upper bound of each step. Here it is 1 for every step.

DISTINCT_RANGE_ROWS: the number of distinct values inside each step, excluding the upper bound. Because this field has no duplicate values, it is equal to RANGE_ROWS.

AVG_RANGE_ROWS: the average number of rows per distinct value inside each step, excluding the upper bound. The formula is RANGE_ROWS / DISTINCT_RANGE_ROWS (for DISTINCT_RANGE_ROWS > 0). Here DISTINCT_RANGE_ROWS equals RANGE_ROWS, so AVG_RANGE_ROWS is 1.

 

With a histogram like this, the distribution of the data in the table is described very well. In the salesorderid field the minimum value is 43659 and the maximum value is 75132; there are 31473 values in this range and there are no duplicates, so we can work out that the table contains every int value from 43659 to 75132. SQL Server does not need to store a large number of steps: these three steps are enough to fully describe the data distribution.
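To see how the optimizer turns the histogram into an estimate, here is a minimal sketch, assuming the example table from the earlier articles. SET STATISTICS PROFILE returns the plan rows with an EstimateRows column; for this predicate the estimate should be about 1 row, because 60000 falls inside the second step where AVG_RANGE_ROWS is 1:

SET STATISTICS PROFILE ON;
-- look at the EstimateRows column in the extra result set
SELECT * FROM dbo.salesorderheader_test WHERE salesorderid = 60000;
SET STATISTICS PROFILE OFF;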

 

Here we need to explain two points:

(1) If statistics are built on a group of fields, for example when a composite index is created on two or more fields, SQL Server maintains the selectivity (density) information for all of the fields, but it maintains a histogram only for the first field. And because the row count of the first field is the row count of the whole table, SQL Server counts a record even when that field is NULL in it.

(2) When the table is relatively large, SQL Server samples only a portion of the table's data (Rows Sampled) when it updates the statistics, to reduce the cost. In that case the numbers in the statistics are estimated from the sample and may differ slightly from the actual values.
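If you want to control the sampling yourself when refreshing statistics, both of the following are standard UPDATE STATISTICS options; a minimal sketch using the example table and index:

-- scan every row: most accurate, most expensive
UPDATE STATISTICS dbo.salesorderheader_test salesorderheader_test_cl WITH FULLSCAN;

-- sample only part of the table: cheaper, slightly less accurate
UPDATE STATISTICS dbo.salesorderheader_test salesorderheader_test_cl WITH SAMPLE 50 PERCENT;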

 

The more detailed the statistics are, the more accurate they are, but the extra cost of maintaining them is also higher. The performance gained from more accurate statistics may not offset the increased maintenance cost. This design is not a sign of limited capability; it is a balance that suits most situations.

 

----------------------------------------- Maintenance and update of Statistical Information ---------------------------------

When SQL Server needs to estimate the cost of an operation, it tries to find the corresponding statistics to support the estimate. A DBA cannot predict every operation SQL Server will perform, and so cannot predict which statistics SQL Server may need. If we had to rely on manual work to create and maintain statistics, it would be a very complex project. Fortunately, that is not how SQL Server works: in most cases SQL Server creates and updates statistics very well on its own. You don't feel it happening, and the DBA carries no extra burden.

This is mainly because two settings in the database properties are on by default:

Auto Create Statistics (automatically create statistics)

Auto Update Statistics (automatically update statistics)

They let SQL Server automatically create statistics when they are needed, and automatically update them when they are found to be out of date.
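A minimal sketch for checking and enabling the two options with T-SQL (the database name AdventureWorks below is only a placeholder):

-- check the current settings for every database
SELECT name, is_auto_create_stats_on, is_auto_update_stats_on
FROM sys.databases;

-- turn the options on for one database
ALTER DATABASE [AdventureWorks] SET AUTO_CREATE_STATISTICS ON;
ALTER DATABASE [AdventureWorks] SET AUTO_UPDATE_STATISTICS ON;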

 

Under what circumstances will SQL Server create statistics?

There are three main cases:

(1) When an index is created, SQL Server automatically creates statistics on the index columns. So from one point of view an index plays a double role: it helps SQL Server find data quickly, and the statistics built on it also tell SQL Server how the data is distributed.

Supplement: the statistics on an index are also updated when the index is rebuilt. This is one reason why a query that has become slow sometimes becomes fast again after the index is rebuilt.
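A minimal sketch of such a rebuild, using the index from the earlier articles (REBUILD refreshes the statistics on that index with a full scan; REORGANIZE does not):

-- rebuilding the index also refreshes its statistics
ALTER INDEX salesorderheader_test_cl ON dbo.salesorderheader_test REBUILD;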

 

(2) The DBA can also manually create whatever statistics he considers necessary, using statements such as CREATE STATISTICS. If Auto Create Statistics is enabled, manual creation is rarely necessary.
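A minimal sketch of creating statistics by hand; the statistics name st_orderdate and the OrderDate column are only assumed here for illustration:

-- build single-column statistics with a full scan of the table
CREATE STATISTICS st_orderdate
ON dbo.salesorderheader_test (OrderDate)   -- assumed column name
WITH FULLSCAN;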

 

(3) When SQL Server wants to use statistics on certain columns and finds that none exist, the Auto Create Statistics option lets it create the statistics automatically. For example, when a statement filters on one (or several) fields, or uses them to join another table, SQL Server must estimate how many records this table will return. That estimate needs statistics to support it; if there are none, SQL Server creates them automatically.
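A minimal sketch for listing the statistics that SQL Server has created automatically on the example table (auto-created statistics usually have names starting with _WA_Sys_):

-- statistics with auto_created = 1 were built by the optimizer on demand
SELECT s.name AS stats_name, c.name AS column_name
FROM sys.stats AS s
JOIN sys.stats_columns AS sc
  ON s.object_id = sc.object_id AND s.stats_id = sc.stats_id
JOIN sys.columns AS c
  ON sc.object_id = c.object_id AND sc.column_id = c.column_id
WHERE s.object_id = OBJECT_ID('dbo.salesorderheader_test')
  AND s.auto_created = 1;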

 

As long as the database's Auto Create Statistics option is turned on, you generally do not need to worry about SQL Server lacking the statistics it needs to choose an execution plan. SQL Server manages all of this by itself.

 

Updating statistics

SQL Server must not only create the appropriate statistics, but also update them in time so that they reflect the changes in the table's data. Inserts, deletes and updates can all make statistics out of date. However, updating statistics is itself a resource-consuming task, especially on large tables. If SQL Server updated the statistics after every small modification, it could end up so busy maintaining statistics that it had no time left for anything else. SQL Server therefore has to balance the accuracy of the statistics against reasonable resource consumption.

In SQL 2005 / SQL 2008, the conditions that trigger an automatic statistics update are:

(1) If the statistics are defined on an ordinary table, they are considered out of date when any of the following changes occurs, and the next time they are needed an update will be triggered. (When you detach a database, you can also manually choose whether to update the statistics first.)

1. The table goes from having no data to having one or more rows.

2. For a table with fewer than 500 rows, the cumulative number of changes to the first field of the statistics exceeds 500.

3. For a table with more than 500 rows, the cumulative number of changes to the first field of the statistics exceeds 500 + 20% of the total number of rows in the table. So for a large table, SQL Server only recalculates the statistics after more than about 1/5 of the data has changed.

 

(2) Temporary tables can have statistics, and the maintenance policy is basically the same as for ordinary tables. Statistics cannot, however, be created on table variables.

 

This maintenance policy ensures that the cost stays relatively small while the statistics remain basically correct.
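A minimal sketch for checking when the statistics on the example table were last updated and roughly how many modifications have accumulated since (sys.dm_db_stats_properties requires SQL Server 2008 R2 SP2 / 2012 SP1 or later; on older versions only STATS_DATE is available):

SELECT s.name,
       STATS_DATE(s.object_id, s.stats_id) AS last_updated,
       sp.rows,
       sp.rows_sampled,
       sp.modification_counter          -- changes since the last update
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('dbo.salesorderheader_test');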

 

Difference between SQL 2000 and SQL 2005 in updating statistics:

In SQL Server 2000, if SQL Server found while compiling a statement that some statistics on a table were out of date, it paused the compilation, updated the statistics first, and then used the new information to build the execution plan. This certainly helps produce a more accurate plan, but the drawback is that the statement cannot run until the statistics update finishes, and that process takes time. In many cases the statement's efficiency is not that sensitive to the statistics, and the old statistics would have produced just as good a plan; in those cases the wait is simply wasted.

 

Therefore, from SQL Server 2005 onwards the database has an "Auto Update Statistics Asynchronously" option. When SQL Server finds that some statistics are out of date, it uses the old statistics to continue compiling and running the current query, but starts a background task to update those statistics. In this way, the next time the statistics are used they are already an updated version. The drawback is that the accuracy of the current query's execution plan cannot be guaranteed. Everything has its pros and cons, and the DBA can choose according to the actual situation.
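A minimal sketch of turning asynchronous updates on (the database name is a placeholder; the option only has an effect when AUTO_UPDATE_STATISTICS is also ON):

ALTER DATABASE [AdventureWorks] SET AUTO_UPDATE_STATISTICS_ASYNC ON;

-- confirm the setting
SELECT name, is_auto_update_stats_async_on FROM sys.databases;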

 

 

Efficiency of select count (*) and select count (1)

In fact, the statements below do not take their results from the statistics, so whether or not the statistics have been updated makes no difference to their efficiency.

 
select count(*)

select count(1)

select count(*) where a_field = xxx

 

This has taken a long time to write, but there is no way around it: most of the content echoes what came before and after, and without that preparation the later parts might be hard to follow. That's enough for now, off to lunch O(∩_∩)O

 

 
