Using distribution statistics in the DB2 optimizer

Source: Internet
Author: User
Tags db2

This article supporting source code

Brief introduction

In order to execute a query or DML statement (INSERT, UPDATE, DELETE), DB2 must create an access plan (access plans). The access plan defines the order in which the tables are accessed, which indexes are used, and what connection (join) methods are used to correlate the data. A good access plan is critical for fast execution of SQL statements. The DB2 optimizer can create access plans. This is a cost-based optimizer, which means that it makes decisions based on statistics related to tables and indexes. DB2, when generating statistics, not only provides basic statistics, but also allows the creation of so-called distribution statistics. Not only do database administrators understand distribution statistics, but application developers also understand distribution statistics. Application developers must be cautious, because in some cases distributing statistics is important for the DB2 optimizer. The use of master variables or parameter markers (java.sql.PreparedStatement in Java) can be a hindrance to the optimizer's inability to maximize the use of distribution statistics. This article explains what distribution statistics are, where distribution statistics are particularly important, and what application developers should consider in order for the DB2 optimizer to create an effective access plan.

Basic statistical information and distribution statistics

Before we study distribution statistics, we'll look at basic statistics and collect statistics on these tables as soon as we execute RUNSTATS.

Table Related statistics:

Number of pages currently in use

Number of pages containing line of records

Number of rows overflowed

Number of rows in the table (cardinality)

For MDC tables, there is also the number of blocks (block) that contain data

Related statistics for each column in the table:

Cardinality of column

Average length of column

The second largest value in the column

Second small value in the column

Number of NULL values in a column

Typically, when you perform RUNSTATS, you not only collect statistics about tables, but you can also collect relevant statistics about the corresponding indexes. To learn about the statistics collected for indexing, see DB2 administration guide:performance-statistical Information This is collected.

Looking at the basic statistics of a table, you can see that the DB2 optimizer knows how many rows a table consists of (the cardinality of the table) and how many different values a column contains (the column's cardinality). However, some information is not available for basic statistics. For example, basic statistics cannot tell the optimizer how often some values in a column appear. Suppose the table table_x has about 1,000,000 rows and executes such a query on the table:

SELECT * from table_x WHERE column_y = ' value_z '

Is it not important for the DB2 optimizer to know how many rows in the table_x meet the conditions column_y = ' value_z '? In other words: know if this query will return 1 lines, 100 lines, 1000 rows, or 10000 rows what's wrong with that?

In fact, with basic statistics, the DB2 optimizer can only estimate how often ' value_z ' occurs in column_y. In this case, the optimizer thinks that all values are evenly distributed in column_y, which means that it thinks all values have the same frequency of occurrence. If that happens, it's not a big deal. However, if some values appear more frequently than others (for example, if ' value_z ' occurs 900,000 times, which takes up 90% of all rows), then the optimizer cannot take this into account, thus generating an access plan that is not optimal. and distribution statistics can fill this gap. Distribution statistics can provide information on the frequency and distribution of data, and distribution statistics are an important complement to basic statistics if many duplicate values are stored in the database and the data is not evenly distributed in the table.

Types of distribution statistics--frequency (frequency) statistics and statistics for Decimal (quantile) information

There are two different types of distribution statistics--frequency statistics and statistical information of statistics. Let's look at the two different types of distribution statistics with a sample table.

The example table "Cars" represents a car manufacturer that has a corresponding row in the table for each car that is produced. Each car can be identified by its ID, so "ID" is the primary key (PK) of the table "Cars". In addition, there is a "State" column in the table indicating which step the car is currently in the manufacturing process. The manufacturing process for a car starts at step 1th, then 2nd, 3rd, ..., 49th, 50th, 51st, 、...、 98th, 99th, until 100th--the 100th step means the car is finished. The rows for the completed car remain in the table, and subsequent processes (such as complaint management, quality assurance, etc.) still need to be used. Car manufacturers produce 10 different models ("TYPE") cars. To simplify the problem, in this example table, various car models are named A, B, C, D 、...、 J. In addition to the primary key index (on the ID column), there is also an index ("I_state") on the State column and an index ("I_type") on the TYPE column. In fact, a "CARS" table contains far more columns than "ID", "state", and "TYPE". For simplicity, no other columns appear in the sample table.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.