The tree is here. The mountain is here. The earth is here. The years are here. I am here. What better world could you ask for? --Zhang Xiaofeng, "I Am Here"
Why write about this topic?
As my work experience has accumulated, I increasingly feel that the root cause of a great many relational database performance problems is statistics. The emphasis is on root cause: much of the time, things like "index failure" are merely symptoms. Of course, I do not want to overgeneralize; there are plenty of other causes, such as configuration problems and design problems, and even the power supply can affect performance.
The reason for this conclusion is that, in typical development and deployment today, most enterprise systems already run on high-end disk arrays or even enterprise-grade SSDs, so I/O problems have become rare, and the theoretical development of relational databases has not kept pace with hardware. As a result, system-level performance problems caused by the hardware itself are now quite uncommon.
Database design problems are of course also important, but for a system that has been in production for several years, patching the design up is about all you can do.
As for coding problems, they are best examined together with indexes: many times it is not an index problem at all, but the way the query is written that prevents the optimizer from using the index (see the sketch after the two links below). If it really is an index problem (missing indexes, unused indexes, too many indexes, and so on), you can read my other two index-related articles:
1. SQL Server Index Maintenance (1)--How to get Index usage http://blog.csdn.net/dba_huangzj/article/details/49735941
2. SQL Server Index Maintenance (1)--common indexing problems in the system http://blog.csdn.net/dba_huangzj/article/details/49734963
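As a taste of the "it is not the index, it is the query" point above, here is a minimal sketch, assuming a hypothetical dbo.Orders table with an index on OrderDate. Wrapping the indexed column in a function makes the predicate non-sargable, so the index cannot be seeked; rewriting it as a bare range restores the seek:

```sql
-- Assumed: dbo.Orders with an index on OrderDate (hypothetical names).

-- Non-sargable: the function call on the column forces a scan.
SELECT * FROM dbo.Orders WHERE YEAR(OrderDate) = 2014;

-- Sargable rewrite: the same index can now be used for a seek.
SELECT * FROM dbo.Orders
WHERE OrderDate >= '20140101' AND OrderDate < '20150101';
```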
The points mentioned above are relatively easy to find and control, but the statistics discussed in this series are more hidden and harder to detect, because they rarely show themselves directly; instead they indirectly cause other problems. That is why I think the topic deserves a series of its own.
Remember this above all: statistics exist to assist one of the core components of a database management system, the query optimizer, in generating an appropriate execution plan so that requests are answered as efficiently as possible.
What are statistics?
Before studying performance problems, it is essential to study statistics, because without them a relational database is flying blind. Without statistics, the optimizer has to fall back on unreasonable assumptions when generating execution plans, which can seriously hurt performance.
So what information does the database need? First, a word about data storage. A relational database uses blocks or pages as its smallest storage unit, usually 8 KB or 4 KB in size. In other words, if you store only 1 KB of data, 7 KB or 3 KB of space is wasted. So how does the database know how the data is stored, so that it has a "reliable" basis for later maintenance and use? The answer, of course, is statistics.
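As a minimal sketch of what "rows and pages" means in practice, the following query reads the standard sys.dm_db_partition_stats DMV to show how many rows and pages a table occupies; dbo.Person is a hypothetical table name:

```sql
-- How many rows and pages does this table actually use?
SELECT  OBJECT_NAME(ps.object_id)  AS table_name,
        ps.index_id,
        ps.row_count,
        ps.in_row_data_page_count  AS data_pages,
        ps.used_page_count         AS used_pages
FROM    sys.dm_db_partition_stats AS ps
WHERE   ps.object_id = OBJECT_ID('dbo.Person');  -- hypothetical table
```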
Back to statistics. When the database is asked to collect statistics, it usually collects the following:
- How many rows/pages does the table have?
- For each column in the table: the number of unique values, length information (longest, shortest, average), and the range of data values (maximum, minimum, average)
- Information about the indexes on the table
These statistics help the optimizer estimate the CPU, I/O, memory, and other resources a statement is likely to consume.
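In SQL Server, the header of a statistics object exposes exactly this kind of bookkeeping. A sketch, assuming a hypothetical dbo.Person table with a statistics object named IX_Person_LastName (the DBCC command itself is standard):

```sql
-- Shows rows, rows sampled, last update time, density, average key length...
DBCC SHOW_STATISTICS ('dbo.Person', IX_Person_LastName) WITH STAT_HEADER;
```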
It also matters that statistics are available across the columns you combine. For example, suppose table A is joined on two columns, FirstName and LastName, and the database knows that FirstName has 1,000 distinct values while LastName has 100,000. The database will then evaluate LastName before FirstName in the join: because LastName has a large number of distinct values, matching only the first two or three characters of that column is often enough to rule a row in or out.
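In SQL Server, this kind of cross-column knowledge can come from a multi-column statistics object. A sketch, again assuming the hypothetical dbo.Person table:

```sql
-- Create a multi-column statistics object so the optimizer can see the
-- combined density of (LastName) and (LastName, FirstName).
CREATE STATISTICS st_Person_LastFirst
    ON dbo.Person (LastName, FirstName);

-- Inspect the density vector it produced.
DBCC SHOW_STATISTICS ('dbo.Person', st_Person_LastFirst) WITH DENSITY_VECTOR;
```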
But these are only the basic statistics. You can also have the database collect other information, such as histograms, which describe the distribution of the data in a column (see the sketch after this list), for example:
- The most frequently occurring values.
- Quantiles, and so on.
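A sketch of viewing such a histogram in SQL Server, reusing the hypothetical statistics object created above:

```sql
-- One row per histogram "step": RANGE_HI_KEY, RANGE_ROWS, EQ_ROWS,
-- DISTINCT_RANGE_ROWS, AVG_RANGE_ROWS.
DBCC SHOW_STATISTICS ('dbo.Person', st_Person_LastFirst) WITH HISTOGRAM;
```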
Accurate statistics help the optimizer generate a better execution plan, especially when there is an equality predicate (such as WHERE age = 18) or a range predicate (WHERE age > 18 AND age < 40), because thanks to the statistics the database already knows the distribution of that part of the data (the professional term for this is selectivity).
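A sketch of watching those estimates at work, assuming a hypothetical dbo.Person table with an age column; compare the "Estimated Number of Rows" in the plan that SET STATISTICS XML returns alongside the results:

```sql
SET STATISTICS XML ON;

SELECT COUNT(*) FROM dbo.Person WHERE age = 18;               -- equality: from EQ_ROWS
SELECT COUNT(*) FROM dbo.Person WHERE age > 18 AND age < 40;  -- range: from RANGE_ROWS

SET STATISTICS XML OFF;
```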
Given their importance, statistics must be kept up to date; there is nothing worse than a database that believes a table of 1 million rows holds only 500. The main drawback of statistics is the resources needed to compute them accurately, which is why many database management systems do not keep statistics fully up to date in real time by default. Even today, computing exact statistics over millions of rows is expensive, so you may compute only the basic statistics, or compute them over a sample of the data.
For example, when you are dealing with hundreds of millions of rows, sampling only 10% of the data for statistics can pay off. Still, whenever possible you should ensure that statistics are accurate, because inaccurate statistics can turn a 30-second query into an 8-hour one.
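In SQL Server terms, this trade-off is exactly the choice between a full scan and a sampled scan when updating statistics. A sketch, with the hypothetical object names used above:

```sql
-- Most accurate, most expensive: read every row.
UPDATE STATISTICS dbo.Person st_Person_LastFirst WITH FULLSCAN;

-- Much cheaper on very large tables, at some cost in accuracy.
UPDATE STATISTICS dbo.Person st_Person_LastFirst WITH SAMPLE 10 PERCENT;
```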
Introduction to SQL Server statistics and environment preparation
Because I have always worked with SQL Server, the demonstrations unavoidably use SQL Server as the example, but I will try to stay close to the relational-database level rather than any one product. The next article is devoted to SQL Server statistics, so here is just a brief overview:
For more than 90% of database users, once functional requirements are met, the remaining focus is undoubtedly performance. So how do you get high performance? The root is getting the database to generate an appropriate and efficient execution plan (how execution plans are generated is not covered in this series). As mentioned earlier, statistics exist to assist the optimizer in generating execution plans: accurate, useful statistics give the optimizer as much as possible of the information it needs to generate an efficient plan (a complex process with many factors influencing plan generation), while inaccurate statistics can turn a one-second query into a slow query that runs for hours. So statistics, like indexes, are a double-edged sword. We can boil this passage down to:
good statistics, good queries; bad statistics, bad queries.
Back to SQL Server. When we need to get data out of a database, performance is the key concern (system exceptions, failures, and logic errors are not considered here). If SQL Server can obtain accurate and valid information about the data, it can choose a more efficient way to execute the query (the execution plan). The key source of that information is statistics. If the statistics are outdated or incorrect, performance can become unacceptable.
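A sketch of how you might check for outdated statistics, assuming the hypothetical dbo.Person table; sys.dm_db_stats_properties is a standard DMF, though it requires SQL Server 2008 R2 SP2 / 2012 SP1 or later:

```sql
SELECT  s.name                   AS stats_name,
        sp.last_updated,
        sp.rows,
        sp.rows_sampled,
        sp.modification_counter  -- column changes since the last update
FROM    sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE   s.object_id = OBJECT_ID('dbo.Person');
```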
The query you submit to SQL Server is first handed to the SQL Server query optimizer; if there is no cached execution plan for it, the optimizer generates one. During this process, the optimizer chooses the appropriate physical operators to implement the logical operators (such as joins) at the physical level. (Note: SQL, the Structured Query Language, lets the user tell the database what data or operation is needed, not how to get it; once handed to the DBMS, the SQL must be translated into operations the computer can execute, which at a high level appear as a series of physical operators.) A single logical operator may be implementable by several physical operators: a WHERE xxx = xxx in the logical plan may be implemented physically as an index seek or a table scan, and the various joins in a SQL statement may be implemented as hash, merge, or nested loop joins.
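A sketch of the same logical join forced through different physical operators, using hypothetical dbo.Person and dbo.Orders tables. Normally you let the optimizer choose; the hint here is only to make the operator choice visible:

```sql
SELECT p.LastName, o.OrderDate
FROM dbo.Person AS p
JOIN dbo.Orders AS o ON o.PersonId = p.PersonId
OPTION (HASH JOIN);  -- try also OPTION (MERGE JOIN) or OPTION (LOOP JOIN)
```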
In the real world, many factors influence the optimizer's choice of physical operators, the most important being the cardinality estimate: the estimated number of rows that will remain after a filter operation has been applied. An inaccurate cardinality estimate can generate an execution plan several orders of magnitude slower than the plan based on an accurate one (I have genuinely seen the difference between hours and seconds, purely because of statistics). These cardinality estimates also affect things such as parallel execution, join order, and even memory grants.
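A sketch of putting estimated and actual row counts side by side: SET STATISTICS PROFILE returns one row per plan operator with both Rows (actual) and EstimateRows, and large gaps between them usually point at the statistics. dbo.Person is hypothetical:

```sql
SET STATISTICS PROFILE ON;
SELECT * FROM dbo.Person WHERE age BETWEEN 18 AND 40;
SET STATISTICS PROFILE OFF;
```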
For the demonstrations that follow, this series mainly uses SQL Server 2014 with a custom database and test data. It does not matter if you do not have SQL Server 2014; that only affects the demonstrations of the new SQL Server 2014 cardinality estimator. Most of the content in this series applies across the SQL Server family (versions before 2000 excepted, though I suggest treating 2008 as the starting point) or is generic.
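A sketch of switching between the two estimators on SQL Server 2014: the new cardinality estimator is used at database compatibility level 120, and documented trace flag 9481 falls back to the legacy estimator for a single query. MyStatsDemo and dbo.Person are hypothetical names:

```sql
ALTER DATABASE MyStatsDemo SET COMPATIBILITY_LEVEL = 120;  -- new 2014 estimator

-- Per-query fallback to the pre-2014 estimator:
SELECT COUNT(*) FROM dbo.Person WHERE age = 18
OPTION (QUERYTRACEON 9481);
```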
This article is just a brief introduction; many of the things mentioned here will be covered in the following articles, and links will be added as they are posted:
- All nonsense SQL Server statistics (1)--Introduction to statistics
- All nonsense SQL Server statistics (2)--Basics of statistics