The query optimization in the parallel version of DB2 reaches the peak!

Source: Internet
Author: User

This article describes how to optimize queries in the DB2 parallel version, at the same time, this article also describes the query optimization and query plan generation, the parallelization of all operations, and the issues that need to be paid attention to in subquery processing. The following is the main content of this article.

Query

1. Introduction

SN, GAMMA, BUBBA,

SM, XPRS

DB2 Parallel Edition (DB2 PE) runs on an AIX Parallel Machine, such as SP2. It is based on the SN architecture and the Function Shipping execution model, providing scalability and big processing capabilities. The cost-based optimization takes into account CPU, IO, and message transmission. To ensure ACID properties, parallel tools include:

Load, Import, Reorganize Data, Create Index. The Redistribute tool also provides a parallel data re-organization tool to effectively correct and handle load imbalance. Function shipping refers to performing data operations at the data storage location and filtering necessary data to reduce the amount of data transmitted between networks. The system sets the service coordinator to accept SQL statements entered by users, distribute query execution tasks, and return final results. In addition, you need to do the following:

Generate query execution plans, data and control flows, process management, parallel transactions and blocking management, and parallel tools in the parallel DB2 parallel version.

Supports HASH data sharding and extends DDL:

 
 
  1. CREATE NODEGROUP GROUP_1 ON ONODES(1 TO 32,40,45,48);   
  2. CREATE TABLE PARTS(Partkey integer, Partno integer) IN GROUP_1   
  3. PARTITIONING KEY(Partkey) USING HASHING;   
  4. CREATE TABLE PARTSUPP(Partkey integer, Suppkey integer) IN GROUP_1   
  5. PARTITIONING KEY(Partkey) USING HASHING;  

The HASH function is built in the system. Tuples with the same attribute values in different table partition columns are divided into the same node. Such a table is called collocated. Therefore, the equivalent join in these tables is called collocated joins. For example, PARTSx (Partkey = Partkey) PARTSUPP.

2. Query Optimization

(1) Cost Optimization-taking into account the inherent concurrency of operations and the cost of message communication;

(2) Comprehensive utilization of data distribution information-Use of data distribution and sharding information of base tables and intermediate result tables;

(3) Transparent concurrency-almost no need to rewrite the query.

2.1 extended Operators

The system can use the basic operations of the serial version, such as Scan, but also need to expand:

(1) Operations used to control the coordinated execution of multiple subtasks-Coordinator;

(2) send/receive for inter-process communication operations. send can be broadcast or unicast, and receive can be multiple sorted inputs of merge or FIFO.

2.2 sharding Method

The partitioning method of DB2 can be considered as an effective load balancing tool. The optimizer makes full use of shard and node group information to optimize queries, such as collocated joins.

2.3 query optimization and query plan generation

The query optimizer does the following work: 1) Select the optimized connection order, 2) determine the access method and connection method of the basic table, and 3) determine the execution node of the operation, for example, resharding of data. 4) system resource consumption and response time must be taken into account for query execution. DB2 PE Adopts the simplified rules:

1) when the query execution plan of the DB2 parallel version is generated from the bottom up, the system resources occupied by each node are accumulated, and the maximum value is used as the system response time. The execution cost of the Coordinator and the complexity of multi-process interference are ignored;

2) When determining the execution completion point in any node subset that may execute the connection, only some of the nodes are considered, such as the node where the table shards are located, and the node where the external table shards are located, and other nodes are not described in detail ). This simplifies the search space generated by the Plan and keeps it linear in proportion to the serial search space.

The connection operations include collocated, directed, broadcast, and repartitioned. Directed join refers to connecting to the node where an input link is located, and sending the tuples of another link to an appropriate node, for example, equivalent join on the shard attribute of an input link ). Repartitioned join refers to the re-distribution of input connections, for example, during equivalent join. Broadcat join refers to broadcast the tuples of one link to the node group of the other link before the connection, which can correspond to any type of connection.

Cost-based Optimization-two-stage optimization is no longer suitable. Data Partitioning and placement have a great impact on the selection of query plans.

2.4 parallel operation of all operations

1. aggregation operation-execute the aggregation function in each subtask and perform the final processing globally if necessary. For the Group By sub-bureau, resharding may be more parallel than directly using input data.

2. Set Operations-use the collocated and repartitioned policies. Note that UNION can be an N-Yuan operation.

3. Insert statements with subqueries, Update and Delete statements -- Insert statements and subqueries may be collocated. Or use the directing method. Update and Delete are always consistent with the corresponding query collocated Update. If a shard attribute occurs, you also need to consider the affected tuples of the shard ).

4. Outer Join-the main policy is the same as that for normal connections, but avoid generating multiple result tuples because there are no matching tuples.

5. subquery-Send the subquery results to the nodes that calculate the predicates containing the subquery using the collocated, directed, and broadcast methods.

3. subquery Processing

Example of nested query:

 
 
  1. select *  
  2. from t1  
  3. where t1.a in (select b   
  4. from t2   
  5. where t2.c = t1.a and t2.d in (select b   
  6. from t3   
  7. where t3.a = 10   
  8. ));  

The innermost subquery only needs to be executed once. However, the following query is not:

 
 
  1. select *  
  2. from t1   
  3. where t1.a in ( select b   
  4. from t2   
  5. where t2.c = t1.a and t2.d in ( select b   
  6. from t3   
  7. where t3.a = t1.f   
  8. ));   

Subquery processing requires attention to the following issues:

(1) Avoid starting a query in the sub-DB2 parallel version for the tuples generated for each external query. This is too costly;

(2) ensure that different nodes that generate the outer query tuples are not serialized due to the execution of the subquery;

(3) ensure that the correct external query node is executed;

(4) Move the predicates as much as possible to reduce data transmission.

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.