This article describes how to use distribution statistics in the DB2 optimizer? We all know that the DB2 optimizer is a cost-based optimizer. It makes a decision based on the statistical information of the table and the actual operation index. DB2 not only provides basic statistics, but also allows the creation of so-called distribution statistics.

This article explains the situations where distribution statistics and distribution statistics are particularly important and what application developers should consider in order for the DB2 optimizer to create an effective access plan.

Introduction

To execute queries or DML statements INSERT, UPDATE, and DELETE), DB2 must create an access plan ). The access plan defines the order in which tables are accessed, the indexes used, and the join method used to associate data. A good access plan is crucial for quick execution of SQL statements. The DB2 optimizer can create an access plan. This is a cost-based optimizer, which means that it makes decisions based on the statistical information of tables and indexes.

DB2 not only provides basic statistics, but also allows the creation of so-called distribution statistics. Not only does the database administrator need to understand the distribution statistics, but also application developers need to understand the distribution statistics. Application developers must be careful because distribution statistics are important to the DB2 optimizer in some cases.

The use of primary variables or parameter tags in Java is java. SQL. PreparedStatement) may cause obstacles, so that the optimizer cannot take advantage of distribution statistics to the maximum extent. This article explains the situations where distribution statistics and distribution statistics are particularly important and what application developers should consider in order for the DB2 optimizer to create an effective access plan.

Basic statistics and distribution statistics

Before studying the distribution statistics, let's take a look at the basic statistics. You only need to execute RUNSTATS to collect statistics about these tables.

Table statistics:

Currently used page count

Number of pages containing record rows

Number of overflow rows

Base number of rows in the table)

For MDC tables, there are also blocks containing data.

Related Statistics of each column in the table:

Column Base

Average Column Length

The second largest value in the column.

The second small value in the column.

Number of NULL values in the column

Generally, when running RUNSTATS, you can collect not only statistics about the table, but also related statistics about the corresponding index. For more information about the statistics collected for indexing, see DB2 Administration Guide: Performance-Statistical information that is collected.

Observe the basic statistics of a table. You can see that the DB2 optimizer knows the base number of rows in a table ), and the base number of different value columns in a column ). However, there is still some information that cannot be provided by basic statistics. For example, the basic statistical information cannot tell the optimizer how often certain values appear in a column. Assume that the table TABLE_X contains about 1,000,000 rows, and execute the following query on the table:

SELECT * FROM TABLE_X WHERE COLUMN_Y = 'value _ Z'

Does the DB2 optimizer know how many rows in TABLE_X meet the condition COLUMN_Y = 'value _ Z? In other words: Do you know that this query will return 1, 100, 1000, or 10000 rows?

In fact, the DB2 optimizer can only estimate the frequency of 'value _ Z' in COLUMN_Y based on basic statistics. In this case, the optimizer considers that all values are evenly distributed in COLUMN_Y, which means that all values have the same occurrence frequency. If this happens, this would not be a problem. However, if some values appear more frequently than other values, for example, if 'value _ Z' appears 900,000 times, that is, 90% of all rows ).

The DB2 optimizer cannot take this into account, so the generated access plan is not optimal. Distribution Statistics can fill this gap. The distribution statistics can provide information about the frequency and distribution of data. If many duplicate values are stored in the database and the data is not evenly distributed in the table, distribution statistics are an important supplement to basic statistics.

Type of Distribution Statistics-frequency) statistical information and quantile) Statistical information

There are two different types of Distribution Statistics: Frequency Statistics and quantile statistics. Let's use an example table to study the distribution statistics of these two different types.

The example table "CARS" indicates a car manufacturer. Each car produced has a corresponding row in the table. Each vehicle can be identified by its ID, so "ID" is the primary key of the table "CARS ). In addition, there is a "STATE" column in the table, indicating the current step of the automobile in the manufacturing process. The manufacturing process of an automobile starts from step 1, followed by Step 2 and Step 2 ,..., step 1, step 2, step 2 ,... step 1 and step 2.

Until Step 2 -- step 2 means the car is finished. The rows of finished vehicles are still in the table, and will still be used in subsequent processes, such as complaint management and quality assurance. Automobile manufacturers produce 10 types of vehicles with different types. To simplify the problem, in this example table, various vehicle models are named A, B, C, D,..., and J. Except for the primary key index in the "ID" column.

The "STATE" column also has an index "I _STATE"), and the "TYPE" column also has an index "I _TYPE "). In fact, a "CARS" table contains more than "ID", "STATE", and "TYPE" columns ". For simplicity, no other columns appear in the example table.

Frequency Statistics

Assume that the table CARS now has about 1,000,000 records. The frequency of different models in the table is as follows:

Table 1. Frequency Statistics of TYPE columns in CARS

- TYPE COUNT(TYPE)
- A 506135
- B 301985
- C 104105
- D 52492
- E 19584
- F 10123
- G 4876
- H 4589
- I 4403
- J 3727

Models A are most favored by buyers, so about 50% of the cars produced are of this type. Model B and Model C are second only to model A, accounting for 30% and 10% of all vehicles respectively. All other models account for only 10% of the total.

The above table shows the frequency statistics of the "TYPE" column. With the basic statistics, the DB2 optimizer can only understand the base number of the table containing 1,000,000 rows) and 10 different value models), that is, A to J. If there is no distribution statistics, the optimizer considers that each value appears at the same frequency, about 100,000 times.

Once the distribution statistics of the "TYPE" column are generated, the optimizer can understand the true frequency of each model. Therefore, the DB2 optimizer is clear about the frequency of different existing models.

The optimizer uses frequency statistics to calculate the filter factor used to check the equality or inequality of predicates. For example:

- SELECT * FROM CARS WHERE TYPE = 'H'

Quantile statistics

Different from the frequency statistics, quantile statistics are irrelevant to the frequency of occurrence of different values, and are related to the number of rows in a table that are less than or greater than a value or that have rows between two values. Quantile statistics provide information about whether values in a column are aggregated. To obtain such information, DB2 assumes that the values in the column are arranged in ascending order and the values are determined based on the regular row interval.

Let's take a look at the "STATE" column in the table CARS, which is arranged in ascending order. The corresponding value of "STATE" can be determined based on the regular row interval.

Table 2. quantile statistics of the STATE column in the CARS table

- COUNT(row) STATE ASC
- 5479 1
- 54948 10
- 109990 21
- 159885 31
- 215050 42
- 265251 52
- 320167 63
- 370057 73
- 424872 84
- 475087 94
- 504298 100
- ... 100
- 1012019 100

Because the number of finished cars is still not deleted from the table, the number of cars in the status of 100 = finished) is more than the total number of all cars in other States. Finished vehicles account for 50% of all records in the table.

Note: In actual situations, the number of finished cars may even exceed 99% ). In the following example, we can see this situation.

The table above shows quantile statistics for the "STATE" column. With this information about the number of rows that are smaller than or greater than the specified value, the optimizer can calculate the values that are used to test the rows that are smaller than or equal to) and greater than or equal) or a filter factor between the predicates. For example:

- SELECT * FROM CARS WHERE STATE < 100
- SELECT * FROM CARS WHERE STATE BETWEEN 50 AND 70

The filter factor calculated based on the existing quantile statistics is not very accurate, but even if only 20 values are collected, the error is still less than 5%.

For details, visit:

Http://www.ibm.com/developerworks/cn/data/library/techarticles/dm-0606fechner/