Using distribution statistics in the DB2 optimizer

Last Update:2017-02-27 Source: Internet

Author: User

Tags db2

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

This article supporting source code

Brief introduction

In order to execute a query or DML statement (INSERT, UPDATE, DELETE), DB2 must create an access plan (access plans). The access plan defines the order in which the tables are accessed, which indexes are used, and what connection (join) methods are used to correlate the data. A good access plan is critical for fast execution of SQL statements. The DB2 optimizer can create access plans. This is a cost-based optimizer, which means that it makes decisions based on statistics related to tables and indexes. DB2, when generating statistics, not only provides basic statistics, but also allows the creation of so-called distribution statistics. Not only do database administrators understand distribution statistics, but application developers also understand distribution statistics. Application developers must be cautious, because in some cases distributing statistics is important for the DB2 optimizer. The use of master variables or parameter markers (java.sql.PreparedStatement in Java) can be a hindrance to the optimizer's inability to maximize the use of distribution statistics. This article explains what distribution statistics are, where distribution statistics are particularly important, and what application developers should consider in order for the DB2 optimizer to create an effective access plan.

Basic statistical information and distribution statistics

Before we study distribution statistics, we'll look at basic statistics and collect statistics on these tables as soon as we execute RUNSTATS.

Table Related statistics:

Number of pages currently in use

Number of pages containing line of records

Number of rows overflowed

Number of rows in the table (cardinality)

For MDC tables, there is also the number of blocks (block) that contain data

Related statistics for each column in the table:

Cardinality of column

Average length of column

The second largest value in the column

Second small value in the column

Number of NULL values in a column

Typically, when you perform RUNSTATS, you not only collect statistics about tables, but you can also collect relevant statistics about the corresponding indexes. To learn about the statistics collected for indexing, see DB2 administration guide:performance-statistical Information This is collected.

Looking at the basic statistics of a table, you can see that the DB2 optimizer knows how many rows a table consists of (the cardinality of the table) and how many different values a column contains (the column's cardinality). However, some information is not available for basic statistics. For example, basic statistics cannot tell the optimizer how often some values in a column appear. Suppose the table table_x has about 1,000,000 rows and executes such a query on the table:

SELECT * from table_x WHERE column_y = ' value_z '

Is it not important for the DB2 optimizer to know how many rows in the table_x meet the conditions column_y = ' value_z '? In other words: know if this query will return 1 lines, 100 lines, 1000 rows, or 10000 rows what's wrong with that?

In fact, with basic statistics, the DB2 optimizer can only estimate how often ' value_z ' occurs in column_y. In this case, the optimizer thinks that all values are evenly distributed in column_y, which means that it thinks all values have the same frequency of occurrence. If that happens, it's not a big deal. However, if some values appear more frequently than others (for example, if ' value_z ' occurs 900,000 times, which takes up 90% of all rows), then the optimizer cannot take this into account, thus generating an access plan that is not optimal. and distribution statistics can fill this gap. Distribution statistics can provide information on the frequency and distribution of data, and distribution statistics are an important complement to basic statistics if many duplicate values are stored in the database and the data is not evenly distributed in the table.

Types of distribution statistics--frequency (frequency) statistics and statistics for Decimal (quantile) information

There are two different types of distribution statistics--frequency statistics and statistical information of statistics. Let's look at the two different types of distribution statistics with a sample table.

The example table "Cars" represents a car manufacturer that has a corresponding row in the table for each car that is produced. Each car can be identified by its ID, so "ID" is the primary key (PK) of the table "Cars". In addition, there is a "State" column in the table indicating which step the car is currently in the manufacturing process. The manufacturing process for a car starts at step 1th, then 2nd, 3rd, ..., 49th, 50th, 51st, 、...、 98th, 99th, until 100th--the 100th step means the car is finished. The rows for the completed car remain in the table, and subsequent processes (such as complaint management, quality assurance, etc.) still need to be used. Car manufacturers produce 10 different models ("TYPE") cars. To simplify the problem, in this example table, various car models are named A, B, C, D 、...、 J. In addition to the primary key index (on the ID column), there is also an index ("I_state") on the State column and an index ("I_type") on the TYPE column. In fact, a "CARS" table contains far more columns than "ID", "state", and "TYPE". For simplicity, no other columns appear in the sample table.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More