[Posting] DB2 partition features


Data Warehouse scenario overview
Level: Intermediate

Paul McInerney (paulmci@ca.ibm.com), user-centric design expert, DB2 development, IBM

September 18, 2006

This article introduces the following table design features of IBM DB2 for Linux, UNIX, and Windows: table partitioning (TP), multidimensional clustering (MDC), the database partitioning feature (DPF), and materialized query tables (MQTs). You will also see how these features work together in a data warehouse setting and the benefits they bring to query performance, roll-in, and roll-out.
Introduction

In a data warehouse, the size of fact tables or history tables poses a challenge for designers and administrators. These tables usually contain hundreds of millions of rows, sometimes even hundreds of billions. For tables of this scale, the main concerns are:

Query performance
Inserting large amounts of new data into these tables
Deleting large amounts of outdated data every month or quarter
Over time, DB2 has continued to add and enhance features to address these needs. An important enhancement in DB2 9 for Linux, UNIX, and Windows is the table partitioning feature. This raises the following questions:

What are these features?
What is the role of each feature in solving the above problems?
What features should I use?
How can we combine these features to enhance the effect?
These are the questions this article addresses. After reading this article, you will be able to:

Understand the unique role of each feature in solving related problems.
Understand how to effectively combine these features.
With this background, readers can go on to study the details of the features that interest them.

Features

Three complementary create table options

The create table statement now provides three methods to organize data in a database table.

Table 1. DB2 features
Clause in the create table statement -- DB2 feature
DISTRIBUTE BY HASH -- DPF (database partitioning feature)
ORGANIZE BY DIMENSIONS -- MDC (multidimensional clustering)
PARTITION BY RANGE -- TP (table partitioning)

You can use these clauses in combination to achieve the desired effect. Table 2 summarizes the terminology associated with these features. Other features used in this article are introduced later.
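To make the three clauses concrete, here is a minimal sketch of a create table statement that combines all of them. The table, columns, and date ranges are invented for illustration and are not from the article:

```sql
-- Hypothetical fact table combining all three clauses (DB2 9 for
-- Linux, UNIX, and Windows). All names and ranges are illustrative.
CREATE TABLE sales_fact (
    sale_date    DATE          NOT NULL,
    cust_id      INTEGER       NOT NULL,
    region       CHAR(4)       NOT NULL,
    product_type CHAR(8)       NOT NULL,
    amount       DECIMAL(11,2)
)
DISTRIBUTE BY HASH (cust_id)           -- DPF: spread rows across database partitions
PARTITION BY RANGE (sale_date)         -- TP: one data partition per month
    (STARTING FROM ('2006-01-01') ENDING ('2006-12-31') EVERY (1 MONTH))
ORGANIZE BY DIMENSIONS (region, product_type);  -- MDC: cluster rows into blocks
```

Each clause can also be used alone; combining them is a design decision, not a requirement.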

Table 2. DB2 feature terminology
Feature -- column(s) used for partitioning the data -- other terms
DPF -- distribution key (called the partition key in earlier versions) -- database partition
MDC -- dimension -- cell (composed of blocks), dimension block index
TP -- table partitioning key -- data partition

Brief Comparison

Each feature provides a unique way of grouping table data and plays a unique role in addressing the requirements of fact tables or history tables.

DPF is the oldest of the features. It divides a database into multiple database partitions, each with its own set of computing resources, including CPU and storage. In a DPF environment, each row of a table is distributed to a partition according to the distribution key specified in the create table statement. When a query is processed, the request is also divided into parts so that each database partition processes the rows it is responsible for. In effect, DPF is a scalability feature: processing capacity can be increased by adding database partitions, so high query performance can be maintained as tables grow. This capability is often called linear scalability and relies on DB2's shared-nothing architecture.

DPF is more than a table design choice; it is also a way of configuring the entire database system. There are now recommended practices for configuring such a system for optimal performance, reliability, and growth. Customers can purchase the recommended hardware, software, and configuration in a solution called the BCU (balanced configuration unit).

MDC was introduced in DB2 version 8. It physically clusters rows that have similar values across multiple dimensions on disk. This clustering provides efficient I/O for common analytical queries. For example, all rows with product = car, region = east, and saleMonthYear = Jan09 can be stored in the same storage location, known as a block. When dimensions are defined in the create table statement, storage is reserved for each combination of values. In effect, MDC is a feature for maximizing query performance, especially for the queries common in data warehouses, including queries that select rows based on a combination of values in several columns. For example: date between 'Jan-01-2004' and 'Feb-01-2005' and country is not 'United States' and product = 'cell phones'.
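The example predicates above could appear in a query such as the following sketch; a hypothetical sales table with matching columns is assumed:

```sql
-- If sale_date, country, and product are MDC dimensions of the
-- (assumed) sales table, DB2 can resolve these predicates largely
-- through block indexes, reading whole blocks instead of scattered rows.
SELECT *
FROM   sales
WHERE  sale_date BETWEEN '2004-01-01' AND '2005-02-01'
  AND  country <> 'United States'
  AND  product = 'cell phones';
```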

TP was introduced in DB2 9 and, like MDC, stores rows with similar values together. However, TP has the following characteristics that MDC does not:

TP partitions a table into multiple data partitions along a single dimension. A common design is to create one data partition for each month of data. (MDC, by contrast, supports multiple dimensions.)
With TP, you manually define each data partition, including the range of values it will contain. (MDC automatically defines a cell for each unique combination of MDC dimension values, and creates blocks to store the data of each cell.)
Each TP data partition is a separate database object (whereas other tables are each a single database object). Therefore, TP supports attaching and detaching data partitions on a TP table. A detached partition becomes a regular table. In addition, you can place each data partition in its own tablespace if desired.
In effect, TP's unique advantage over the other features is in adding or removing large amounts of table data, that is, roll-in and roll-out. For readers familiar with using UNION ALL views (UAVs) to partition history tables by date, TP can be seen as a similar but superior solution.
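The per-partition tablespace point above can be sketched in DDL as follows; the table, partition, and tablespace names are invented for illustration:

```sql
-- Hypothetical DDL (DB2 9): each data partition placed in its own
-- tablespace. All names and ranges are illustrative.
CREATE TABLE transactions_hist (
    txn_id   BIGINT        NOT NULL,
    txn_date DATE          NOT NULL,
    amount   DECIMAL(11,2)
)
PARTITION BY RANGE (txn_date)
   (PARTITION p_jan2006 STARTING FROM ('2006-01-01') ENDING ('2006-01-31') IN ts_jan2006,
    PARTITION p_feb2006 STARTING FROM ('2006-02-01') ENDING ('2006-02-28') IN ts_feb2006);
```

Placing partitions in separate tablespaces keeps each month's data physically distinct, which pairs naturally with the attach and detach operations discussed later.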

Table 3 summarizes the comparison between these features:

Table 3. Brief comparison of DB2 features
Feature -- how it organizes data -- benefit
DPF -- distributes rows evenly across multiple database partitions -- scalability: computing resources (that is, database partitions) can be added as the database grows
MDC -- stores rows with similar values on multiple dimensions in the same physical location, called a block -- query performance: the organization of the data allows faster retrieval, particularly for queries with range predicates on multiple dimensions
TP -- stores all rows falling within a specified range of one dimension in the same data partition -- roll-in and roll-out: large amounts of data can be added and removed by attaching and detaching entire data partitions

Complementary features

This section elaborates on the earlier point that these "three friends" are both independent and complementary.

When designing a table, each feature can be considered independently. For example:

Whether MDC and TP are used does not affect the distribution key of DPF.
Whether a column is used as the MDC dimension does not affect whether it should be used as the table partition key, and vice versa. Each decision can be made independently.
The way each feature works, for example with respect to indexing, does not change as new partitioning features are introduced. For example, when MDC was introduced, it did not change how indexing works with DPF. Similarly, the introduction of TP did not change the indexing behavior of DPF or MDC. Keeping this in mind helps avoid confusion when learning these features.

For example, suppose you encounter the statement "TP has global indexes" while learning about TP. You should not infer that DPF's index-handling behavior has changed. In this statement, the term "global" simply means that an index spans all of a table's TP data partitions.

Generally, one feature cannot be used to remedy shortcomings or problems related to another feature in a database design. Consider the following examples:

TP cannot solve problems related to DPF, whether DPF data skew or slow administrative operations under DPF. Whether or not TP is also used, the existing DPF remedies still apply.
TP cannot be used to correct bad MDC designs.
In short, if there are issues related to DPF, MDC, or TP, you should apply a remedy appropriate to that feature.

Table Design

Fact tables (or history tables) in a data warehouse are a natural fit for each of these features, as shown in Table 4 below.

Table 4. Fact table characteristics that suit the DB2 partitioning features
Feature -- table characteristics suited to the feature -- fact table characteristics
DPF -- large tables that cannot be handled by a single set of CPUs and I/O channels -- fact tables are the largest tables in the database; they often contain hundreds of millions of rows, sometimes even hundreds of billions
MDC -- queries whose result sets return rows with similar values on multiple dimensions -- fact tables (and data warehouses generally) are designed to support this type of query
TP -- tables to which large amounts of data are added periodically and from which large amounts of expired data are deleted -- fact tables typically have new data added daily, and old data is usually deleted monthly or quarterly

Empirical rules for table Design

This section gives an intuitive sense of the nature of these design decisions and provides some rules of thumb. The references listed later provide more comprehensive design guidance.

For DPF, when choosing a distribution key, first look for columns that will distribute rows evenly across the database partitions. Otherwise data skew results, meaning that one or more database partitions carries a disproportionate share of the table's rows, creating a performance bottleneck. Columns with many distinct values are good candidates. Another consideration is choosing columns that maximize join performance.

Another DPF design decision is the number of database partitions. This is not really a table design decision but a system design decision, determined by the expected raw data size of the whole database and the capabilities of the server hardware. Many systems need no more than 20 database partitions, though the largest systems may need more. Given the growth trend of data warehouses, the number of database partitions can be expected to increase.

For MDC, the key decision is which columns to use as MDC dimensions. The design challenge is to find the set of dimensions and granularity that maximizes clustering while minimizing storage requirements. Good candidates are columns with some or all of the following characteristics:

Used in range, equality, or IN-list predicates
Used for roll-in, roll-out, or other large-scale deletion of rows
Referenced in group by or order by clauses
Foreign key columns
Columns in the join clauses of a fact table in a star schema
Coarse granularity, that is, columns with few distinct values
A typical design uses a column representing a date as one MDC dimension, plus zero to three other columns, such as region and product_type, as additional dimensions.

For TP, the design decisions are which column(s) to use as the table partitioning key and how many partitions to define. Generally, the table partitioning key is a time-based column, and each partition matches the amount of data rolled out at one time. For example, a table whose data is rolled out monthly would have one partition per month. The TP design must also account for rows that fall outside the ranges defined in the create table statement.

For a table whose data is rolled out monthly based on sale_date, a typical design uses sale_date as the table partitioning key and creates a separate partition for each month.
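A sketch of that design, with named monthly partitions and open-ended guard ranges for the out-of-range rows mentioned above; all names and dates are illustrative, and only two months are shown to keep the sketch short:

```sql
-- Hypothetical monthly design on sale_date. The MINVALUE/MAXVALUE
-- guard partitions accept rows outside the expected date range.
CREATE TABLE sales (
    sale_date DATE NOT NULL,
    amount    DECIMAL(11,2)
)
PARTITION BY RANGE (sale_date)
   (PARTITION p_early STARTING FROM (MINVALUE)    ENDING ('2005-12-31'),
    PARTITION p_jan06 STARTING FROM ('2006-01-01') ENDING ('2006-01-31'),
    PARTITION p_feb06 STARTING FROM ('2006-02-01') ENDING ('2006-02-28'),
    PARTITION p_late  STARTING FROM ('2006-03-01') ENDING (MAXVALUE));
```

Whether to use guard partitions or simply reject out-of-range rows is itself a design choice.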

Frequently, one MDC dimension is also a time-based column, so the same column can be used for both MDC and TP. The MDC granularity can be finer than that of the TP data partitions.

The following table summarizes the above points.

Table 5. Summary of design rules of thumb
Partitioning feature -- design decision -- rule of thumb
DPF -- which columns to use as the distribution key -- the first choice is columns with many distinct values
MDC -- which columns to use as MDC dimensions -- a typical design selects a column representing the date, plus zero to three other columns, such as region and product_type
TP -- which columns to use as the table partitioning key, and how many partitions -- select a time-based column, and define partitions that match the amount of data rolled out each time

Design Example

Table 6 shows some typical table design examples. The transactions history table represents a typical table in a relational data warehouse. The recent transactions table represents a table in an operational data store, which is in effect a data warehouse holding only recent data.

Table 6. Table design examples
Partitioning attribute -- transactions history table -- recent transactions table
DPF: column used as the distribution key -- transaction ID -- transaction ID
DPF: number of database partitions -- 20 -- 4
MDC: columns used as dimensions -- transaction date (year + month) = 36 values (see note 1); account type = 5 values; state = 51 values -- transaction date (day) = 90 values; account type = 5 values; state = 51 values
TP: table partitioning key and number of partitions -- transaction date (year + month), 1 partition per month -- transaction date (year + month), 1 partition per month
Other table attributes:
# of rows (1 million added per day) -- 1 billion -- 90 million
# of columns -- 30 -- 30
# of indexes -- 4 -- 15

Note 1: For the MDC dimensions of the transactions history table, an alternative design would define the transaction date at a finer granularity, such as weekly or daily. The trade-off between better query performance and increased storage requirements depends on the data and query characteristics.

Add MQT

MQT Overview

Having clarified how the three features above differ and complement one another, it is worth extending the discussion to MQTs (materialized query tables). MQTs and the partitioning features are intended for the same situations (that is, fact tables or history tables in a data warehouse). So a full consideration of how to use the partitioning features must address the special considerations that arise when MQTs are involved.

An MQT is a table defined by the results of a query. Viewed another way, an MQT is like a view whose result set is stored in a table. MQTs can improve the response time of complex queries that involve:

Aggregation or computation over data in base tables
Joins of base tables
A frequently accessed subset of the data in one or more large base tables
When the data in the base tables changes, the MQT needs to be updated accordingly. The MQT feature provides several update options to meet various operational needs.

Comparison with partition features

The following table repeats Table 3 with a row added for MQTs.

Table 7. Brief comparison of DB2 features, including MQT
Feature -- how it organizes data -- benefit
DPF -- distributes rows evenly across multiple database partitions -- scalability: computing resources (that is, database partitions) can be added as the database grows
MDC -- stores rows with similar values on multiple dimensions in the same physical location, called a block -- query performance: the organization of the data allows faster retrieval, particularly for queries with range predicates on multiple dimensions
TP -- stores all rows falling within a specified range of one dimension in the same data partition -- roll-in and roll-out: large amounts of data can be added and removed by attaching and detaching entire data partitions
MQT -- stores the results of a query in a table -- query performance: for expensive operations such as complex joins and table-scanning queries, the MQT computes and stores (materializes) the result set in advance

Design MQT together with partition features

When designing MQT together with the partition feature, the design considerations can be summarized as follows:

You can create an MQT on any table, whatever combination of partitioning features it uses. For example, you can create an MQT over one or more tables that use MDC and TP. The partitioning design of a base table need not anticipate whether an MQT will later be created on it. However, the MQT design may be affected by the partitioning features used in the base tables. For example, if a base table is partitioned with DPF, the MQT design should consider whether to replicate the MQT on each database partition.
MQTs can themselves use the partitioning features. For example, an MQT can be organized with MDC or partitioned with TP.
Typical Design

Continuing the earlier example of the transactions history table, here are some MQTs that might be defined on that table:

MQT 1-transaction totals per day per account type
MQT 2-year-to-date totals per State
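MQT 1 might be defined roughly as follows. The base table and column names are assumed for illustration, and the refresh policy shown (REFRESH DEFERRED) is only one of the available options:

```sql
-- Hypothetical definition of MQT 1: transaction totals per day per
-- account type. DATA INITIALLY DEFERRED means the table starts empty;
-- REFRESH TABLE populates it (and refreshes it later on demand).
CREATE TABLE mqt_daily_totals AS
   (SELECT txn_date, account_type,
           COUNT(*)    AS txn_count,
           SUM(amount) AS txn_total
    FROM   transactions
    GROUP  BY txn_date, account_type)
DATA INITIALLY DEFERRED REFRESH DEFERRED;

REFRESH TABLE mqt_daily_totals;
```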
Key considerations: query performance

Description of this consideration

Now consider a key factor in evaluating and using these features: performance, especially the performance of user queries in common data warehouse workloads. These queries have the following characteristics:

Select rows that meet criteria on several dimensions of the fact table. This means several dimension tables are joined with the fact table.
Use grouping or aggregation functions, such as count, group by, and order by.
Return result sets containing many rows, from thousands to millions.
These queries are generated by users or their BI tools. This means the queries are relatively ad hoc, so the performance testing and tuning approaches used for transaction processing systems are not suitable for them.
Although people tend to speak of faster performance, it is more accurate to speak of better performance, which includes the following aspects:

Peak query execution performance
Stability of query execution performance
Performance of various workloads with different characteristics in the data warehouse
Ease of designing the database to achieve performance goals
The cost required to achieve performance goals
Hardware trends also need to be considered when designing a database to meet performance goals. As CPU processing power grows and storage capacity expands, I/O bandwidth is a potential performance bottleneck. In this environment, I/O efficiency is a key design consideration.

This section describes the contribution of each partitioning feature to query execution performance. Later sections cover roll-in and roll-out performance.

DB2 partitioning features play a role

DPF benefits performance by providing more computing resources. When the DB2 optimizer builds an access plan for a query, it divides the work across the database partitions, which operate in parallel. The results from each database partition are then collected and returned to the submitter of the query.

MDC contributes to performance by making data retrieval more efficient. Data with similar values on multiple dimensions is stored together, which makes I/O more efficient, and I/O is a common bottleneck in data warehouses. In addition, MDC provides block indexes, which contain one entry per block of data (rather than per row), making index operations more efficient. To realize this performance potential, you must design an optimal (or at least good enough) set of dimensions for the MDC table. MDC benefits only queries that involve the dimension columns, but it is completely transparent to queries. Finally, MDC has two notable administrative advantages:

Block indexes mean fewer RID indexes are needed, which reduces the storage space used for indexes.
Because new rows are inserted near rows with similar values, the data remains clustered without running the reorg utility.
TP improves query performance through partition elimination. For example, suppose the transactions history table has 36 partitions, one per month. For a query that selects the last 12 months of data, the optimizer knows it need not scan the partitions outside that range. Partition elimination applies to both index scans and table scans. TP benefits only queries that involve the table partitioning key columns.
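A query of the following shape is a candidate for partition elimination; the table and column names are assumed for illustration:

```sql
-- With one data partition per month on txn_date, the optimizer can
-- skip the 24 partitions outside this 12-month window entirely.
SELECT account_type, SUM(amount) AS total
FROM   transactions
WHERE  txn_date BETWEEN '2005-09-01' AND '2006-08-31'
GROUP  BY account_type;
```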

Key considerations: Insert new data

Description of this consideration

New data moves from operational systems into the fact tables of the warehouse. This process is called ETL, ingest, populating the data warehouse, or roll-in. The following examples illustrate the range of possibilities.

Example 1: Daily ingest using load

After the close of the business day, files containing between 500,000 and 2 million records arrive from the operational systems.
A customer-written script loads each file into the fact table using the DB2 load utility. This happens in a nightly batch window while the table is offline.
At the start of the next business day, users querying the table see the previous day's data.
Example 2: Near-real-time ingest using insert

Every 30 minutes, a file arrives containing 10,000 to 100,000 records.
As each file arrives, a user-written program adds the records to a staging table and then adds them to the fact table using insert statements.
Example 3: MQT refresh

Where MQTs exist, refreshing them is considered part of the process of adding new data to the warehouse. This is where the MQT refresh options come into play. Typically, the ETL process explicitly determines when the MQTs are updated, rather than letting updates happen automatically.

In example 1, the MQTs would likely be updated immediately after the nightly update jobs complete.
In example 2, the MQTs would typically be updated once a day. Thus, even though the underlying data is updated periodically, all queries that access an MQT return the same results throughout the day.
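With deferred-refresh MQTs, the ETL process issues the refresh explicitly at the chosen point. A one-line sketch, with an invented MQT name:

```sql
-- Run after the nightly load completes (example 1) or once a day
-- (example 2); until then, queries against the MQT see the old totals.
REFRESH TABLE mqt_daily_totals;
```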
DB2 partitioning features play a role

The DB2 partitioning features can help with roll-in, but they sometimes introduce new considerations that customers should account for in the roll-in process.

DPF allows data to be added more quickly, because the database partitions work in parallel. On the other hand, DPF adds the consideration of routing each row to the appropriate database partition.

Compared with not using MDC, MDC can improve the roll-in process. Its advantages include:

Less physical I/O: MDC tables have fewer RID indexes, so less physical I/O is needed to update indexes during roll-in.
Faster inserts: MDC tables reduce page contention and locking, which helps when executing inserts in multiple parallel streams.
Better performance for concurrent business queries: the reduced page contention and locking also benefits queries running concurrently with the roll-in.
If MDC is used, it is recommended that the incoming data be pre-sorted by the MDC dimensions.

In some cases, TP facilitates roll-in. TP lets you add rows to a separate table and then attach that table to the partitioned table as a new data partition when it is ready. However, in our example that option does not apply. Remember that the example table (the transactions history table) has a separate partition for each month, and data is added one or more times per day. In this case, you instead add an empty partition to the table before each month begins, that is, before daily inserts for that month start.
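Adding the empty partition ahead of the new month might look like this sketch; the table name, partition name, and dates are illustrative:

```sql
-- Before March begins, create the empty partition that will receive
-- that month's daily inserts.
ALTER TABLE transactions
   ADD PARTITION p_mar2006
   STARTING FROM ('2006-03-01') ENDING ('2006-03-31');
```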

Finally, MQTs also add a consideration to the roll-in process: in particular, you must decide when to refresh them.

Key considerations: data deletion

Description of this consideration

After data has been in the warehouse for some time, it is no longer useful to business users and must be deleted to free space for new data. This process is called roll-out, purging, or archiving. The following examples illustrate different situations.

Common roll-out scenarios involve the following business rules, which affect how the DB2 partitioning features can be used:

Delete rows past a certain age: this is the simplest and most common business requirement. For traditional history tables, the period is typically 36 months. For recent history tables, it is typically 60 to 180 days.
Delete rows past a certain age, subject to business rules: in this case, some rows must be retained even though they have reached the normal retirement age. For example, a historical transaction may need to be kept as evidence for a dispute or an investigation.
Keep the data accessible, but free up premium storage: this situation is sometimes called archiving rather than roll-out. Here the rows must remain visible to user queries, but these rarely accessed rows are moved to cheaper, lower-performance storage. This requirement can be addressed with tools such as Tivoli Hierarchical Storage Manager (HSM).
MQTs generally need to be considered as well. Typically, the MQT must be updated to remove the corresponding summary data. For example, if you delete the March 2003 data from the fact table, you must also delete the summary data for that month from the MQT.

DB2 partitioning features play a role

To support roll-out, the DB2 partitioning features offer notable capabilities for the different business needs.

Consider DPF first. This feature has little effect on roll-out; the roll-out operation is much the same with DPF as without it.

Both MDC and TP benefit roll-out. In any given DB2 release, one feature may provide better roll-out performance than the other, but over time the two should be roughly equivalent. When comparing them, it is more useful to look at the unique advantages each offers in particular roll-out situations.

For the basic, common roll-out case (deleting only rows past a certain age), TP is clearly the first choice. With TP, the roll-out is accomplished with a simple detach operation. In addition, TP is the only feature suited to the following situations:

Moving the rolled-out data somewhere else (that is, into another table or database) rather than simply deleting it.
Using a tool such as Tivoli HSM to move older, rarely accessed rows to cheaper storage while keeping them visible to queries.
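The detach-based roll-out might look like this sketch; the table and partition names are illustrative:

```sql
-- Roll-out by detach: the oldest monthly partition becomes an
-- ordinary table that can then be archived, exported, or dropped.
ALTER TABLE transactions
   DETACH PARTITION p_jan2003 INTO TABLE transactions_jan2003;
```

Because the detached rows live on as a regular table, this supports moving the data elsewhere rather than merely deleting it.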
MDC is the only feature suited to the following situations, which are more demanding than the basic roll-out but less common:

The customer wants to perform the roll-out during concurrent query activity and cannot accept the effect of the Z (super-exclusive) lock that TP briefly requires during the detach operation.
The customer does not want to modify applications or scripts, that is, does not want to replace delete statements with TP detach statements.
The customer wants to delete large amounts of data along a dimension other than the time dimension on which the table is partitioned.
The customer wants to delete rows past a certain age subject to business rules. In this case, a select statement can identify the qualifying rows, which are then deleted.
For MQTs, when the corresponding summary data must be deleted from the MQT, it is recommended to use table partitioning on the MQT, with the same data partitions as defined on the base table.

Other options for data deletion

Although roll-out is the common way to remove outdated data, customers sometimes use other approaches that do not require the partitioning features. These include:

Refreshing the table: in some data warehouses, the entire table is deleted once a year and a replacement table is loaded that contains all of the data except the unwanted data.
Pruning: for some recent history tables, the entire table is deleted each day, after its data has been placed in another table with a longer history. The empty table is then re-created to ingest new data until the next pruning day arrives.
Conclusion

This article introduced the following DB2 table design features: table partitioning (TP), MDC, DPF, and MQT. These features work together to address customer concerns about query performance, insertion of new data, and deletion of old data. The following table summarizes how these DB2 features address those needs.

Table 8. How the DB2 features address customer needs
Customer need -- benefits
Query performance -- each feature contributes to query performance in its own way; using more of the features yields better performance
Roll-in -- in most customer scenarios, MDC brings the greatest benefit; TP brings benefits in some less common situations
Roll-out -- TP brings the greatest benefit in the simple, common roll-out scenarios; MDC is suited to the other roll-out situations where TP does not apply

References

Learning

For more information, see the original article on the developerworks global site.

The IBM Redbook Up and Running with DB2 UDB ESE: Partitioning for Performance in an e-Business Intelligence World discusses DPF, MDC, and MQTs in more depth, along with many other aspects of optimizing DB2 for business intelligence.

The IBM Redbook DB2 UDB's High-Function Business Intelligence in e-business gives a good discussion of MQTs in Chapter 2.

The MDC Research and Development Group Home Page provides links to various MDC publications and introductions.

Visit the business intelligence area on the developerWorks site, read the articles and tutorials there, and explore other references to expand your business intelligence skills.

Visit the developerWorks resource page for DB2 for Linux, UNIX, and Windows, read the articles and tutorials there, and explore other references to expand your DB2 skills.

Obtain products and technologies

Download the free trial version of DB2 Enterprise 9.

Build your next development project with IBM trial software that can be downloaded directly from developerworks.

About the author

Paul McInerney participated in the design of some of the features described in this article. He also conducted extensive interviews with data warehouse customers to learn about their use of these features and other features of DB2.
 
