The example demonstrates using the IBM BCU design architecture to benchmark TPC as data source (300GB data volume) and test case, showing the pull effect of "troika" on query performance. Whether in the POC test or in the real production system, query performance is an important indicator of customer concern. Through this article, the reader can fully understand the "troika" of the mystery, the text of the example demo to the reader has reference and referential significance.
In the http://www.aliyun.com/zixun/aggregation/8302.html "> data warehousing field, whether in a production system or a POC (Proof of Concept) performance test, Query performance is a very important performance metric for customers. Good query performance lays the foundation for the efficient operation of various data warehouse applications. As for query performance, it is well known that the main performance bottleneck is from system I/O, so this paper, from the point of view of physical design of Data Warehouse, expounds three key technologies which affect query performance, and demonstrates the effect of performance improvement with benchmark test TPC.
On the partition database, table partitioning and multidimensional Clustering (MDC), DeveloperWorks has many excellent articles on its basic principles and characteristics respectively, this article will not repeat. This article focuses on the impact of the three technologies on query performance in physical design.
Theoretical
Partitioned databases (database partitioning Feature)
The share-nothing architecture in a partitioned database allocates heavy and time-consuming system I/O jobs to each node in the cluster, combining the SAN (Storage area receptacle) storage network to take full advantage of the I/O performance of the disk controller and the bandwidth of the storage network.
In order to balance the I/O busy level of each node, the balanced data distribution, it is particularly important. The distribution of data depends on the data itself and the selection of database partitioning keys, and the selection of database partitioning keys should follow the following guidelines:
1. A single, more decentralized column;
2. Columns often used in connection (join);
In the case of a balanced distribution of data, it is possible to avoid a node being overloaded with I/O due to processing too much data, thus becoming the bottleneck of the whole cluster.
In a stand-alone database environment, query processing can only take advantage of the system resources (CPU, MEMORY,I/O), when the data stored in a single Zhang, BI (Business FDI) query usually need access to most of the data in the table, such as for the query SQL1, In a stand-alone environment, the physical distribution of the data is shown in Figure 1. In the absence of an index, the database needs to scan the entire large table to query for eligible records, and it is not difficult to imagine the impact of busy I/O on the overall performance of the query as required for scanning large tables.
Listing 1. Query SQL1
SELECT C_name, C_total_spend, c_loyalty_tier from the CUSTOMER where c_region = ' North America ' and c_month= ' procures ' and c_type= ' VIP '
(Note: The blue triangle represents the data that matches the query criteria, i.e. c_region= ' North America ' and c_month= ' procures ' and c_type= ' VIPs ')
Figure 1. Physical distribution of data in stand-alone database environment
With a multiple-partition database, the data can be distributed evenly across the nodes in the cluster, although the BI query still needs to scan the entire large table and read most of the data, but the query can be parallel to all nodes, as shown in Figure 2.
Figure 2. Physical distribution of data in partitioned database environment